
Tuning and Operating your Bazel Deployment

Once you've deployed Bazel to a large monorepo with hundreds of developers, you'll have to keep things fast and support product engineers who do not care about build systems and want it to Just Work ™️.

By the end of this section, you might feel like an undervalued superhero municipal sewer engineer. 😁

Metaphor

You can imagine developers in a polyrepo setup. Each team maintains their own build and test setup. This is like villagers, taking their laptops to the well to draw water. At their scale, this is fine.

The monorepo is a big city. You can't have engineers spending time walking to a well! They should expect a faucet in their apartment, and water just comes out of it.

Except, there's actually a small team of municipal water supply engineers in the city. There's a lot of work required to keep the plumbing working for everyone.

[Images: Village, City]

These people are motivated by the value of the work they do, even though their effort is rarely recognized, except when things go wrong and the faucet has no water or the water is polluted.

This is a pretty good metaphor for a Developer Infrastructure team. Product engineers won't notice that work is required to maintain the services that keep them productive in the metropolis that is a monorepo.

Leverage

A key selling point of Bazel is that it provides a uniform interface for Build and Test that allows a small DevInfra team to support a large number of engineers. The business wants to minimize the overhead of infrastructure, and your job is to keep it that way.

This "economy of scale" justifies the difficulty in learning and operating Bazel.

Thus, one goal of a DevInfra team must be to protect your ability to operate at scale, by preventing teams from fragmenting the codebase and making "unique snowflakes" that force you to provide custom support.

takeaway

Resist efforts by product engineers to be "special" and diverge from the monorepo standards.

Ensuring high cache-hit rates

The DevInfra team must ensure that CI remains fast by avoiding any unneeded work. Monitor your cache hit rate and be vigilant about repairing regressions.
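As a starting point for monitoring, here is a minimal sketch that derives a hit rate from the process summary Bazel prints at the end of a build. The line format in the comment is an assumption based on typical Bazel output and may differ between versions; `cache_hit_rate` is a hypothetical helper, not part of Bazel.

```python
import re

# Assumed shape of Bazel's end-of-build summary line, for example:
#   INFO: 120 processes: 90 remote cache hit, 20 internal, 10 linux-sandbox.
SUMMARY_RE = re.compile(r"(\d+) processes?: (.*?)\.?\s*$")


def cache_hit_rate(summary_line):
    """Return the fraction of non-internal actions served from a cache, or None."""
    match = SUMMARY_RE.search(summary_line)
    if match is None:
        return None
    hits = 0
    total = 0
    for part in match.group(2).split(","):
        count_text, _, category = part.strip().partition(" ")
        if not count_text.isdigit():
            continue
        category = category.strip()
        if category == "internal":
            # Actions Bazel runs in-process; they are excluded from the rate.
            continue
        total += int(count_text)
        if category.endswith("cache hit"):
            hits += int(count_text)
    return hits / total if total else None
```

Feeding each CI build's summary line into a function like this and plotting the result over time makes a sudden drop easy to spot.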

A typical regression: someone introduces a protoc plugin that stamps a date into its output. The action is now non-deterministic, so the files it produces differ on every build, and anything that (transitively) depends on them will be a cache miss.

Non-determinism higher in the build graph causes more cascading cache misses and should be addressed first.
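One way to hunt for non-determinism is to build the same commit twice from a clean state and compare the outputs. The sketch below assumes you have copied the two output trees into separate directories; `digest_tree` and `nondeterministic_outputs` are hypothetical helper names, and this is not the detector mentioned below.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def nondeterministic_outputs(first_root, second_root):
    """Return sorted relative paths present in both trees whose contents differ."""
    first = digest_tree(first_root)
    second = digest_tree(second_root)
    return sorted(p for p in first.keys() & second.keys() if first[p] != second[p])
```

Any path this reports was produced by an action whose output changed between two builds of identical sources, which makes it a candidate for the kind of date-stamping problem described above.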

Note that Remote Build Execution is sometimes sold as a solution, because it spreads the work over many machines, but if the work is unneeded this is the wrong solution!

We plan to add a non-determinism detector in our CI/CD product, workflows.

Become expert at Bazel profiles

You should configure developer builds and CI builds to collect the Bazel profile. It is always written, even if you don't pass any flags to Bazel specifying the location. The profile can be opened in https://ui.perfetto.dev. You can also see the critical path with bazel analyze-profile.

You'll often spot some easy low-hanging fruit in the profile, such as an action taking much longer than it should, or running when it doesn't need to.
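To triage a profile without opening a UI, a short script can list the longest events. This sketch assumes the profile is in the Chrome trace event JSON format that Perfetto reads (optionally gzipped), where complete events have `ph` set to `"X"` and `dur` in microseconds; `slowest_events` is a hypothetical helper.

```python
import gzip
import json


def slowest_events(profile_path, top_n=10):
    """Return (duration_seconds, category, name) for the longest complete events."""
    opener = gzip.open if str(profile_path).endswith(".gz") else open
    with opener(profile_path, "rt", encoding="utf-8") as handle:
        trace = json.load(handle)
    # Trace files are either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    timed = [
        (event["dur"] / 1_000_000, event.get("cat", ""), event.get("name", ""))
        for event in events
        if event.get("ph") == "X" and "dur" in event
    ]
    return sorted(timed, reverse=True)[:top_n]
```

Running this over the profiles from a day of CI builds and looking for names that recur at the top of the list is a cheap way to find an action that takes much longer than it should.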

Tuning resources per-action

Bazel has a heuristic-based scheduler that tries to maximize how much work can happen on the computer without overloading the system.

Resources that might be overused:

  • RAM: Bazel schedules too many compilation actions and exhausts system memory, the OS swaps and the machine is unusable or hangs.
  • CPU: Bazel schedules too many intensive tests in parallel and they all fail to complete within their timeout because they run too slowly on a loaded system.
  • Network throughput: TODO

Prevent "weeds"

As a monorepo gardener, you'll constantly fight against the accumulation of undesirable things that interfere with proper operation. They are like weeds, which compete with the plants you meant to grow in a garden.

Weeds are easy to pull out when they are tiny sprouts, but require a lot of work once the roots take hold. The same is true for monorepo maintenance - as a bad pattern gets adoption, it will have dependencies and coupling that make it much harder to remove.

The ideal solution prevents these being checked in at all. The next best is to warn in a code review that a bad pattern is being introduced. Ultimately you'll also need practices for detecting them, such as scanning user feedback, and then a culture of good hygiene where engineers and their managers agree on pulling weeds early.

Prevent build and test actions from accumulating dependencies on the network

This is easy to prevent early in a repository, but very difficult once engineers have depended on lax rules!

Bazel's test sandbox can prevent tests from connecting to the network: --sandbox_default_allow_network=false.

Individual test targets can be allowed network access with the requires-network tag.

Preventing arbitrary fetches from the internet is harder. We suggest setting up iptables-based network blocking on CI workers.

Prevent accidental dependency edges

You can't build a highly reliable service on top of a dependency that has low availability.

The same is true for dependency edges in the graph. If engineers from an important, business-critical service have a dependency within the monorepo on poorly maintained library code, then the library developers are stuck trying to meet an SLA they cannot meet. They never meant to sign up to support the needs of this service using their library.

Bazel's visibility system is an excellent way to force users to first "sign up" before they depend on your library. If you have a service with a client library, restricting the visibility of the client library likewise prevents them from using the service without asking.

A good pattern is for the visibility to start out minimal, just for the current project. Then add a package_group when you need to expand the visibility. Applications that want to depend on the library have to first send a PR to add their package(s) to that group, and the code review process should require a review from a library maintainer. This is your chance to discuss whether the dependency should be allowed.