Somewhere along the line between a one-off program and a project that lasts for decades, a transition happens: a project must start to react to changing externalities. For any project that didn’t plan for upgrades from the start, that transition is likely very painful for three reasons, each of which compounds the others:

  • You’re performing a task that hasn’t yet been done for this project; more hidden assumptions have been baked in
  • The engineers trying to do the upgrade are less likely to have experience with this sort of task
  • The size of the upgrade is often larger than usual, cramming several years’ worth of upgrades into one step instead of making them incrementally

And thus, after actually going through such an upgrade once (or giving up partway through), it’s pretty reasonable to overestimate the cost of doing a subsequent upgrade and decide “Never again.” Companies that come to this conclusion end up committing to throwing things out and rewriting their code, or deciding to never upgrade again. Rather than take the natural approach of avoiding a painful task, sometimes the more responsible answer is to invest in making it less painful. It all depends on the cost of your upgrade, the value it provides, and the expected lifespan of the project in question.


With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.


As a result, if you ask any expert “Can I assume a particular output sequence for my hash container?” they will presumably say “No.” By and large that is correct, but perhaps simplistic. A more nuanced answer is “If your code is short-lived, with no changes to your hardware, language runtime, or choice of data structure, such an assumption is fine. If you don’t know how long your code will live, or you cannot promise that nothing you depend upon will ever change, such an assumption is incorrect.”

This is a very basic example of the difference between “it works” and “it is correct.”
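
A minimal Java sketch of the point (the class name and values are hypothetical): the loop below “works” in the sense that it prints every element, but any code that depends on the specific order in which it prints them is not correct, because HashSet makes no promise about iteration order, and that order can shift with a different JDK, hash seed, or set contents.

    import java.util.HashSet;
    import java.util.Set;

    public class HashOrderExample {
        public static void main(String[] args) {
            Set<String> users = new HashSet<>();
            users.add("alice");
            users.add("bob");
            users.add("carol");

            // This prints the elements in whatever order HashSet happens to
            // use today. Relying on that specific order "works" right now,
            // but nothing in the contract of HashSet promises it, so the
            // assumption breaks as soon as anything underneath changes.
            for (String user : users) {
                System.out.println(user);
            }
        }
    }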


Code that depends on brittle and unpublished features of its dependencies is likely to be described as “hacky” or “clever”, while code that follows best practice and has planned for the future is more likely to be described as “clean” and “maintainable.” Both have their purposes.

It’s programming if ‘clever’ is a compliment, but it’s software engineering if ‘clever’ is an accusation.


C does a great job of providing stability — in many respects that is its primary purpose.


Every task your organization has to do repeatedly should be scalable (linear or better) in terms of human input. Policies are a wonderful tool for making process scalable.


Process inefficiencies and other software-development tasks tend to scale up slowly. Be careful about boiled frog problems.


Expertise pays off particularly well when combined with economies of scale.


Fundamentally, it might not seem like what’s happening here is that much different from what happened when using a task-based build system. Indeed, the end result is the same binary, and the process for producing it involved analyzing a bunch of steps to find dependencies among them and then running those steps in order. But there are critical differences. The first one appears in step 3: because Bazel knows that each target will only produce a Java library, it knows that all it has to do is run the Java compiler rather than an arbitrary user-defined script, and so it knows that it’s safe to run these steps in parallel. This can produce an order-of-magnitude performance improvement over building targets one at a time on a multicore machine, and it is possible only because the artifact-based approach leaves the build system in charge of its own execution strategy, so that it can make stronger guarantees about parallelism.

The benefits extend beyond parallelism, though. Since Bazel knows about the properties of the tools it runs at every step, it’s able to rebuild only the minimum set of artifacts each time while guaranteeing that it won’t produce stale builds.

Reframing the build process in terms of artifacts rather than tasks is subtle but powerful. By reducing the flexibility exposed to the programmer, the build system can know more about what is being done at every step of the build. It can use this knowledge to make the build far more efficient by parallelizing build processes and reusing their outputs. But this is really just the first step, and these building blocks of parallelism and reuse form the basis for a distributed and highly scalable build system that will be discussed later.
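
A rough sketch of what this looks like in practice, assuming a hypothetical Java project (the target and file names are made up): an artifact-based build is described declaratively in a Bazel BUILD file. Each target states what it is and what it depends on, so Bazel can derive the action graph itself, compile independent libraries in parallel, and rebuild only the targets a change actually affects.

    # BUILD file (Bazel); hypothetical targets for illustration only.

    java_library(
        name = "greeter",
        srcs = ["Greeter.java"],
    )

    java_library(
        name = "logger",
        srcs = ["Logger.java"],
    )

    # :app depends on both libraries. Because :greeter and :logger do not
    # depend on each other, Bazel is free to compile them in parallel, and
    # a change to Logger.java triggers a rebuild of only :logger and :app.
    java_binary(
        name = "app",
        srcs = ["App.java"],
        main_class = "App",
        deps = [
            ":greeter",
            ":logger",
        ],
    )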


Automatically managed dependencies can be convenient for small projects, but they’re usually a recipe for disaster on projects of non-trivial size or that are being worked on by more than one engineer. The problem with automatically managed dependencies is that you have no control over when the version is updated. There’s no way to guarantee that external parties won’t make breaking updates (even when they claim to use semantic versioning), so a build that worked one day might be broken the next, with no easy way to detect what changed or to roll it back to a working state. Even if the build doesn’t break, there may be subtle behavior or performance changes that are impossible to track down.

In contrast, since manually managed dependencies require a change in source control, they can easily be discovered and rolled back, and it’s possible to check out an older version of the repository to build with older dependencies. Bazel requires that the versions of all dependencies be specified manually. At even moderate scales, the overhead of manual version management is well worth it for the stability it provides.
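
A minimal sketch of what manual versioning looks like with Bazel (the dependency name, URL, version, and checksum below are placeholders, not real values): each external dependency is pinned in the WORKSPACE file to an explicit version and checksum, so upgrading it is an ordinary source-control change that can be reviewed, bisected, and rolled back.

    # WORKSPACE file (Bazel); placeholder name, version, URL, and sha256.
    load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

    http_archive(
        name = "some_library",  # hypothetical third-party dependency
        urls = ["https://example.com/some_library-1.4.2.tar.gz"],
        strip_prefix = "some_library-1.4.2",
        # Pinning an explicit checksum means the dependency cannot change out
        # from under the build; an upgrade requires editing this file, which
        # leaves a record in source control that can be rolled back.
        sha256 = "0000000000000000000000000000000000000000000000000000000000000000",
    )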


Different versions of a library are usually represented by different artifacts, so in theory there’s no reason that different versions of the same external dependency couldn’t both be declared in the build system under different names. That way, each target could choose which version of the dependency it wanted to use. Google has found this to cause a lot of problems in practice, and so we enforce a strict one-version rule for all third-party dependencies in our internal codebase.
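
To illustrate (the names and versions are hypothetical), the pattern that a one-version rule forbids looks like the following: the same third-party library declared under two different names so that each target can pick its own version, which quickly leads to confusion about which version a given target actually gets and to conflicts when both versions end up in the same build.

    # WORKSPACE file (Bazel); a hypothetical library declared under two names.
    # This is exactly the pattern a one-version rule disallows.
    load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

    http_archive(
        name = "some_library_v1",
        urls = ["https://example.com/some_library-1.0.tar.gz"],
    )

    http_archive(
        name = "some_library_v2",
        urls = ["https://example.com/some_library-2.0.tar.gz"],
    )

    # Under the one-version rule there is a single @some_library, and every
    # target in the repository that needs it depends on that one version.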


A build system is one of the most important parts of an engineering organization. Each developer will interact with it potentially dozens or hundreds of times per day, and in many situations it can be the rate-limiting step in determining their productivity. This means that it’s worth investing time and thought into getting things right.

  • Limiting engineers’ power and flexibility can improve their productivity
  • An artifact-based build system scales better than a task-based one
  • A distributed build system can leverage the resources of an entire compute cluster
  • Fine-grained modules scale better than coarse-grained modules when it comes to managing dependencies
  • A one-version rule keeps external dependencies consistent across the codebase
  • All dependencies should be versioned manually and explicitly

A haunted graveyard is a system that is so ancient, obtuse, or complex that no one dares enter it. Haunted graveyards are often business-critical systems that are frozen in time, because any attempt to change them could cause the system to fail in incomprehensible ways, costing the business real money. They pose a real existential risk and can consume an inordinate amount of resources.

At Google, we’ve found the counter to this to be good, old fashioned testing.