milen.me
milen.me

Distributed Caching & Compilation

Permalink | RSS

TL;DR

  • Determinism plays an important role in the ability to cache output of compilation.
  • Distributed caching and compilation enable quick iteration on large scale software.
  • Computation of distributed cache keys is non-trivial as not all inputs are explicit.
  • A suitable distributed cache fill strategy is required for consistently fast local builds.

Deterministic Builds

Deterministic builds can be defined as:

A build is called deterministic or reproducible if running it twice produces exactly the same build outputs.

The LLVM project outlines several levels of determinism:

  • Basic: Doing a full build of the same source code in the same directory on the same machine produces exactly the same output every time.
  • Incremental: Like basic determinism, but the output binaries also do not change in partial rebuilds.
  • Local: Like incremental basic determinism, but builds are also independent of the name of the build directory. Builds of the same source code on the same machine produce exactly the same output every time, independent of the location of the source checkout directory or the build directory.
  • Universal: Like local but builds are also independent of the machine the build runs on.

Large Scale Software

Deterministic builds provide multiple benefits, mostly relating to security and caching. In this post, I cover how determinism enables the usage of techniques to speed up development of large scale software.

As software gets increasingly more complex, building all of the source code of a single program locally becomes impractical, especially as we want to support a fast development cycle. It’s too inefficient to expect engineers to wait hours to build the latest commit every morning.

Large scale software can arise in different ways:

  • Self Contained Software: The software itself is a single, self-contained product but it's very complex and large in scope. For example, Google Chrome and Photoshop would qualify as such software.
  • Vertical Systems: Even though an individual program might be small, the whole vertical system, including frameworks and system libraries, can be very large. Compiling whole vertical systems from source has multiple benefits (e.g., quicker vertical signal, lower integration cost, faster system iteration cycle) but compile times can escalate due to the sheer size of code being compiled.

Distributed Caching

One of the major benefits of having universal deterministic builds is that our CI and local development machines will be producing exactly the same artifacts, which allows us to leverage a distributed cache. This changes compile times from being a function of the total code size to being a function of the local code changes (i.e., O(total code) to O(local code changes)).

If you checkout a commit which has been built and cached by CI, almost all local computation would be replaced by fetching the remote artifacts. So, rather than waiting hours to compile everything, you can have a build ready in a few minutes.

Note that distributed caching is also known as remote caching.

Distributed Compilation

A problem that arises when working on large codebases is making local changes that require a large computational power. For example, imagine there’s a header file included almost everywhere and you make a change to that file locally (e.g., Logging.h). A distributed cache will not help because CI machines have not built that revision: the file contains unique local changes. It would be extremely inefficient to have engineers stuck waiting for hours.

There’s another technique that can speedup development in such cases: distributed compilation. The technique is also known as remote execution as it's not just limited to compilation but applies more generally to execution of tools as part of the build process.

Distribution compilation is about leveraging the power of multiple machines to remotely execute compile commands. It’s like having 1000s of CPU cores instead of just a few dozen. Usually, build systems would schedule a certain amount of compilation jobs locally and distribute the rest over remote machines which will compile the input files and return the object files as a result.

Both Buck and Bazel support remote execution.

Distributed Caching vs Distributed Compilation

While both distributed caching and distributed compilation help speed up the development cycle of large scale software, they are fundamentally different techniques. Distributed caching is about storing the results of a computation and retrieving those same results later from other machines. Distributed compilation is about leveraging the power of multiple machines to perform a computation that's too expensive to perform locally.

The techniques are orthogonal and complementary. You could have a setup with just distributed caching or just distributed compilation (ideally both).

In a pure distributed compilation setting, the total amount of work stays roughly the same, it’s just spread across multiple machines. The main benefit is that we have the ability to perform expensive computations quickly but we are still bound by the total computational capacity across the fleet.

On the other hand, distributed caching is a classic example of the space-time tradeoff applied across a network of machines. In the end, as long cache hit rates are sufficiently high, we save a lot of CPU time in exchange for paying a storage cost.

Cache Keys

Determinism

Before we cover key computation, it’s important to note the requirements imposed by having a cache: cached command outputs must be deterministic. This property allows us to safely compute results on one machine and reuse them on others.

If cached commands are not deterministic, then this would lead to a very strange property: the output of a build would depend on the cache hits. Such a system would be very fragile and hard to reason about.

Key Computation

Caches are indexed by a key, so let’s explore how such keys could be computed. Assume we want to cache the results of a command:

compiler --some-option -L /some/directory --input-file /path/to/file @/path/to/argfile

If we were to hash the command itself, it would certainly not be enough to guarantee the output would be deterministic because the contents of the referenced directories and files can affect the output. At a minimum, we must include the following as part of the key:

  • Referenced Files: Need to include the contents of input files. For response files (argfiles), this needs to happen recursively.
  • Directory Contents: Need to include the state of referenced directories. For example, such directories could be used for header resolution.
  • Command Parameters: All parameters must be included, in the order they appear.
  • Compiler Identity: The exact compiler being used (e.g., hash of the binary).

Compiler Identity

The compiler identity needs to be included in the computation of the cache key because the output can vary between versions. For example, Clang 5 would most likely behave differently compared to Clang 11.

Unfortunately, determining the identity of the compiler is not as easy as just hashing the binary. That’s because the binary might just be a shim that redirects to another binary or even a set of binaries.

Furthermore, the shim might operate in a dynamic manner: e.g., redirect to different compiler versions depending on parameters, environmental variables or state of the filesystem. It’s practically impossible for the build system to guess how the compiler works internally.

Implicit Inputs

In addition to the visible inputs as part of the command, there are two other major sources of implicit inputs:

  • Environmental Variables: the compiler’s behaviour could be influenced by environment variables. Those variables might have been set at the shell level, build system level or somewhere else.
  • Filesystem: the compiler will inevitably end up using many more files than those explicitly specified. There might be implicit search paths, SDK paths and so on.

Non-Existent Files

One tricky aspect that can be easily missed are files which do not exist on the filesystem but which affect the output of the compiler. Imagine if the compiler had code like:

const char* lib_path = "/some/predefined/path.a";
if (static_lib_exists_as(lib_path)) {
  append_to(linked_libraries, lib_path);
}

This means the compiler will generate different outputs depending on the existence of /some/predefined/path.a. Consequently, the file must always be part of the cache key, even when it does not exist.

Explicit Inputs

Ultimately, the compiler itself truly knows all the inputs that were used to determine the output. Ideally, compilers would provide a way to output that information in a structured way, so that build systems can utilise it.

For example, Clang and GCC support dependency files which list all the user and system header files that were used. MSVC supports a similar /showIncludes option.

Parameters Not Affecting Determinism

Sometimes, tools might support options which do not affect the output but change the internal operation of the tool (e.g., algorithms used, concurrency, etc).

While it’s possible to add custom handling to exclude such parameters from cache key computation, it’s safer to assume that all parameters affect the input and have CI machines (or a subset) always build the exact same local development configuration.

Cache Availability

An important aspect of a distributed cache is its hit rate: it should be very high. Assuming no local changes, the hit rate would usually depend on the particular commit being checked out, the cache retention policy and cache fill strategy.

For example, if the cache gets filled once an hour by a CI job, checking out the latest commit on master might result in a very slow build if a previous commit invalidated most artifacts.

In terms of fill and retention strategies, several aspects require careful consideration. For example:

  • Does the cache get filled at every commit or only certain ones?
  • Does the cache get filled before a commit gets pushed or afterwards?
  • In a monorepo, for a specific commit, are caches for all targets filled or only a subset?
  • Is the cache per-target or a global one?
  • What eviction policy does the cache use? What’s the size of the cache?

There are no right or wrong answers: all of the above aspects represent different tradeoffs and will have associated benefits and costs. It will be down to the constraints and requirements within a company/project to make the appropriate choices and deliver the desired experience.

Metrics

Ultimately, the distributed cache should have a high hit rate for as many builds as possible. For example, that might be aiming for p95 miss rate of 1%: i.e., 95% of all builds will experience a miss rate of 1% or lower (i.e., hit rate of 99% or higher).

Adopting a data-driven approach in optimising the cache hit rate is an appropriate strategy.

← Back to Writings

  1. Building everything from source is a separate topic which will not be covered in detail here.

  2. Incremental build time is incredibly important for developer productivity. The difference between a 2s and 30s incremental build is very significant due to a simple fact: the wait time is not long enough to be utilised productively but it's long enough to add up to a large total. Doing 100 builds per day leads to 46 minutes of wasted time.

  3. Another reason for the importance of fast iteration time is that it directly affects the ability to write good software. When tackling a new problem space, the freedom to easily explore and quickly iterate on solutions becomes key to success.

  4. E.g., changing a widely included header file.

  5. There will be some additional overhead associated with cost of distributing the compilation across multiple machines.

Any opinions and viewpoints expressed, explicitly or implicitly, are not endorsed by and do not represent any of my previous, current or future employers.