Migrating to Bazel

Now we'll talk about the human side of effectively migrating from a legacy build setup to Bazel. We find this is the hardest part of using Bazel.

By the end of this section, you'll have a high-level sense of how to sequence a migration so that it's less disruptive, less risky, and delivers benefits early.

There's no codelab here, but we can have some discussion about how this can apply within your org.

aspect.dev

Want help with your Bazel migration effort? We offer consulting and support services at https://aspect.dev.

Incubate net promoters

Software is written by humans, so human psychology matters. Humans are tribal, drawn to cosmetics, and jump to conclusions with their "system one" mind.

The initial reaction to Bazel will form a first impression that’s hard to correct later. People will feel their tribe is offended: "that might work in language X, but us Y developers would never do it that way". They'll mistake something cosmetic for something inherent: "the way the errors get presented in CI don't make any sense". They'll infer that one slow step means the whole thing is slower.

Even uninformed opinions can matter a lot. Early influential users will spread these around the engineering culture, causing people who had no strong opinion to become entrenched, making your later job either easier or harder.

For example, when rolling out a new invariant as a CI check, make the "how to fix" instruction obvious. If the person feels that the check caught a real issue that was trivial to get past, they'll feel it was an enjoyable experience. If they think you're throwing a roadblock in their way, they'll resent it.

Don't disrupt workflows

Keep the Makefile!

If the developer types make test or yarn build or npm run serve today, maybe they still can. Avoid changing the topmost user-facing part of the tooling where possible.

Most product developers aren't interested in build system details and don't care about whatever change you've made. They don't want it reflected in what they need to type. Retraining is expensive and burns goodwill.

Also remember that many workflows take place in an editor. Bazel disrupts the paths on disk where editor extensions look for libraries. Be aware of this problem, and proactively find and advertise workarounds that keep editors happy.

tip

Aspect's rules_py is a good example: it creates a standard Python virtualenv with a site-packages folder so that tools like editors find the structure they expect and keep working.

Change one thing at a time

Build system changes can cause subtle regressions, where cause and effect are at a distance, and the developer encountering the problem and the build system engineer who caused it are totally unfamiliar with the domain of the other. This makes it hard to diagnose problems.

Just like in a git bisect workflow, a sane, linear history of events makes it much easier to reason about what happened. Based on the delta of what changed, you can make assertions about the possible blast radius, and whether it can explain the problem.

Therefore, try to change only one thing at a time. Use pre-factoring changes to the old build system to do things like break up a cycle in the dependency graph (but avoid such code changes if they're not load-bearing for the Bazel migration; see "Leave the code alone" below). Use post-factorings to make related cleanups you noticed during the migration. Resist, at all costs, the urge to combine these!

One ideal outcome from this principle: you can use bazel build --subcommands to see the flags passed to some tool X, then compare with how that tool was called by the legacy build system, and any differences should be intentional and required by this migration step.
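As a rough illustration of that comparison, the sketch below diffs two command lines in Python. It is not part of Bazel: `diff_flags` is a hypothetical helper, `toolx` and its flags are made up, and the two strings stand in for what you would copy from the --subcommands output and from the legacy build log.

```python
import shlex


def diff_flags(legacy_cmd: str, bazel_cmd: str) -> dict:
    """Report the arguments that appear on only one of two command lines.

    Simplifying assumption: every argument after the program name is treated
    as an independent token, so a flag is not paired with its value and
    argument order is ignored.
    """
    legacy_args = shlex.split(legacy_cmd)[1:]
    bazel_args = shlex.split(bazel_cmd)[1:]
    return {
        "only_legacy": sorted(set(legacy_args) - set(bazel_args)),
        "only_bazel": sorted(set(bazel_args) - set(legacy_args)),
    }


# Made-up command lines, for illustration only.
legacy = "toolx --strict --level 2 src/main.in"
bazel = "toolx --strict --level 3 --emit-report src/main.in"

result = diff_flags(legacy, bazel)
print(result)
```

Every entry in the result should be something you can explain as an intentional part of the current migration step. Anything else is a candidate regression to investigate before moving on.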

Ratchet mechanism

The ratchet is a mechanical tool to ensure "no backsliding".

Whenever you tighten the semantics of the build by ensuring some new invariant holds, you should have a ratchet in place to make sure it stays that way. For example, if you fix some type-check errors, you should make sure the CI system will mark any subsequent changes red if they re-introduce those type-check errors.

Combined with "change one thing at a time", this can give you incredible power to work in a huge codebase. For example, if you enable just one type-check error code at a time, you can fix only those errors, and use your ratchet to make sure they don't come back. You can then rinse and repeat with low-risk changes, while ensuring that the system as a whole eventually converges on the correct behavior. It feels slow, but this is like a low gear ratio on a bike: you have incredible "mechanical advantage" to move heavy objects at that slow speed.
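A ratchet can be as simple as a checked-in baseline that CI compares against. The following is a minimal sketch in Python under stated assumptions: the function names, the error codes, and the idea of storing a per-code count are all hypothetical, and how you collect the current counts depends on your type-checker and CI system.

```python
def find_regressions(baseline: dict, current: dict) -> dict:
    """Return enforced error codes whose count exceeds the allowed baseline.

    `baseline` maps each enforced error code to the maximum allowed count.
    Codes missing from `baseline` are not enforced yet and are ignored.
    """
    return {
        code: (allowed, current.get(code, 0))
        for code, allowed in baseline.items()
        if current.get(code, 0) > allowed
    }


def tighten(baseline: dict, current: dict) -> dict:
    """Lower each allowed count to the current count; never raise it."""
    return {
        code: min(allowed, current.get(code, 0))
        for code, allowed in baseline.items()
    }


# Hypothetical counts: E100 is fully fixed, E200 is being burned down,
# and E300 has not been brought under the ratchet yet.
baseline = {"E100": 0, "E200": 12}
current = {"E100": 1, "E200": 9, "E300": 40}

regressions = find_regressions(baseline, current)
new_baseline = tighten(baseline, current)
print(regressions)   # a non-empty result would mark the CI check red
print(new_baseline)  # commit this after a green run to lock in progress
```

Bringing a new error code under the ratchet is then a one-line change to the baseline, which fits the "change one thing at a time" principle.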

Gradient ascent

A migration can't leave the developer experience in a worse state first, with a promise to improve it later.

Alex and Greg explain this in a BazelCon talk: https://youtu.be/UwuRGpVpmbo?t=398. As explained in that talk, we want to maximize the benefits of Bazel while deferring costs and known risks. We have to keep making improvements.

Don't say "we'll just leave a TODO here to come back and fix the performance regression". That TODO will be there longer than you think, maybe forever. In the meantime dissatisfaction might cause escalation to decision makers who de-fund the migration work.

This is like the videos you'll find by searching YouTube for "changing a tire while driving". Even though a big migration is underway, it's critical to the business that it be non-disruptive.

Close the loop

Before calling a task "done", think about "acceptance criteria". What does "done" mean, and did you fix the root cause, or only one proximate cause?

As an example, answering a technical question for one user helps that user today. They may ask the same question again later, along with a hundred of their colleagues. Answering their question doesn't close the loop. Instead, you should figure out what documentation they would have naturally consulted, or what error message they were presented with. Go fix those things, then just send them a pointer to that fix.

For Bazel this often means adding validation steps, constraining legal values for attributes, and fixing error messages, in addition to getting into a habit of always improving documentation.

A migration task often requires a three-step process where you introduce a "new way", migrate usages one at a time, then delete the "old way". You must apply extra effort to finish the third step! Leaving the old way in place doesn't just incur technical debt. It also allows the problem to get worse. As with the ratchet suggestion, you can sometimes prevent new usages of the "old way", which helps to avoid a moving target as you work to burn down the usages. Often, there's significant simplification possible only after the "old way" has been completely removed, and that simplification allows your DevInfra team to keep your overall "complexity budget" balanced, so it's important to advocate for prioritizing the "close the loop" work here.

Leave the code alone

Sometimes it’s Bazel that should change, not the code. Examples include code that writes to the source folder, the choice of working directory for a test, or code that expects a particular filesystem layout.

We want to avoid changes that break the legacy build, and don’t want developers to have an impression that Bazel requires code changes that are really just different idioms.

If the code does have to change, maybe it can be in a superficial way. For example you could add comments like Gazelle directives that inform the tooling without making any load-bearing changes that could break things.

Dependencies are an important case as well. We shouldn't change versions of any third-party library just because Bazel is managing them.

Leave few fingerprints

As an infrastructure team, you want to make sure product engineers own their own code during and after a migration. You also want to touch only the build system, while leaving the owners of the code to make modifications.

You can run a tool which changes the code in known-safe ways (like running a formatter), then attribute changes to that tool rather than yourself.

If you're editing a bunch of code, you probably didn't follow the "leave the code alone" principle.

Patch and PR

You'll often find that some upstream code needs some minor fixes to work for your use case. The naive thing is to fork that upstream repo, either explicitly by creating a GitHub repo in your org which copies all the files, or implicitly by vendoring files into your monorepo. Forking is very easy to perform, but it adds a constant maintenance burden to your team. You now own a copy of that project, along with the need to test changes to it. Your copy will diverge over time, so the ability to rebase or cherry-pick changes from the upstream will rot.

A better approach is to use Bazel's ubiquitous support for applying patches to dependencies as they are fetched. A nice workflow is:

  1. Clone the dependency locally. If it's a Bazel module, use --override_repository (or --override_module if it's fetched through Bzlmod) to point to your local copy without having to change any files in the monorepo. If it's a library, use whatever mechanism exists in the language's package manager. For example, pnpm has the pnpm patch command: https://pnpm.io/cli/patch
  2. Make edits like print-debugging in the dependency.
  3. When you get to a working state, commit your changes with a good commit message.
  4. Make a PR to the upstream with your commit (unless it's confidential internal-only code). Do this even if you don't really care whether they accept your PR! This way you'll share your solution with others, and get comments from the project maintainers that might help you improve your patch or even discover it's not needed. Furthermore, if the PR is eventually merged, you'll be able to remove the patch from your repo, which lowers your maintenance burden and "complexity budget".
  5. Run git show > /[path/to/monorepo]/[some/bazel]/rule.pr123.patch to make a copy of your commit in the monorepo. You should have some convention for where your Bazel setup belongs, and these patches can just go there. By including the PR number in the patch file name, you make it possible for someone browsing the repository to discover the comment thread from upstream maintainers about the patch.
  6. Add the patch_args=["-p1"] and patches=["//some/bazel:rule.pr123.patch"] attributes to whatever spot is fetching the dependency.