Version: 5.1.x

Service Level Indicators

Workflows promises best-case Bazel performance. If your pipeline is slow, Aspect's job is to respond and fix it.

info

Some of the following SLIs are still under development.

TTA: Time to first action

Goal: Detect analysis cache busts, slow invalidated or eagerly fetched repository rules, and similar problems that sit on the critical path of the loading phase, where neither remote cache nor remote execution can help.

Indicator: Time from

  • T0: Scheduler dispatch to warm runner
  • T1: git clone is up-to-date
  • T2: First action spawn according to Bazel profile

Objective: P50: 10sec, P95: 30sec
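As a rough illustration, the objective could be checked against collected timestamps as sketched below. This is a sketch under assumptions, not Aspect's implementation: the `BuildTimestamps` record and its field names are hypothetical, TTA is assumed to span T0 to T2 (with T1 only marking where the clone finished), and percentiles use the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class BuildTimestamps:
    """Hypothetical per-build timestamps, in seconds since an arbitrary epoch."""
    t0_dispatch: float      # T0: scheduler dispatch to warm runner
    t1_clone_ready: float   # T1: git clone is up-to-date
    t2_first_action: float  # T2: first action spawn according to the Bazel profile


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tta_seconds(build):
    # Assumption: TTA is measured from T0 to T2.
    return build.t2_first_action - build.t0_dispatch


def meets_tta_objective(builds, p50_limit=10.0, p95_limit=30.0):
    """True when both the P50 and the P95 of TTA are within their limits."""
    samples = [tta_seconds(b) for b in builds]
    return percentile(samples, 50) <= p50_limit and percentile(samples, 95) <= p95_limit
```

Comparing `t1_clone_ready - t0_dispatch` with `t2_first_action - t1_clone_ready` for the slow samples would show whether the time went to the clone or to Bazel's loading and analysis.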

QT: Developer-perceived Queue time

Goal: developers shouldn't be blocked by lack of CI resources.

Indicator: Cumulative time across all Pull Request runs where a request was queued, over some time period.

Objective: 60 seconds / developer / day
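One way to aggregate this indicator is sketched below. The input shape, a `(developer, day, queued_seconds)` tuple per Pull Request run, is a hypothetical format chosen for illustration, and the sketch assumes the 60-second budget applies to each developer on each day rather than to an average.

```python
from collections import defaultdict


def queue_seconds_per_developer_day(runs):
    """Sum queued seconds per (developer, day).

    runs: iterable of (developer, day, queued_seconds) tuples (hypothetical format).
    """
    totals = defaultdict(float)
    for developer, day, queued_seconds in runs:
        totals[(developer, day)] += queued_seconds
    return dict(totals)


def developers_over_budget(runs, budget_seconds=60.0):
    """Return the sorted (developer, day) pairs whose queue time exceeded the budget."""
    totals = queue_seconds_per_developer_day(runs)
    return sorted(key for key, total in totals.items() if total > budget_seconds)
```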

GR: Main branch greenness ratio

Goal: main should be green most of the time.

Indicator: Ratio of the time in a given period where main was green to the total time of that period.

Objective: 98% (red for less than about 3h22m per week; see https://en.wikipedia.org/wiki/High_availability)

note

Part of this indicator is outside Aspect's control: Aspect cannot compensate for a slow on-call response.
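The ratio and the red-time budget implied by the objective can be worked out as in the sketch below. It assumes red periods are available as non-overlapping `(start, end)` datetime pairs, which is a hypothetical input format.

```python
from datetime import datetime, timedelta


def greenness_ratio(red_intervals, period_start, period_end):
    """Fraction of the period during which main was green.

    red_intervals: (start, end) datetime pairs when main was red,
    assumed not to overlap each other.
    """
    total = (period_end - period_start).total_seconds()
    red = 0.0
    for start, end in red_intervals:
        # Clip each red interval to the reporting period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            red += (clipped_end - clipped_start).total_seconds()
    return (total - red) / total


def red_budget(period, objective=0.98):
    """Longest time main may be red within `period` while still meeting the objective."""
    return period * (1 - objective)
```

For a one-week period, `red_budget(timedelta(weeks=1))` comes to 2% of 168 hours, which is roughly 3.36 hours.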

TTF: Developer-perceived Time to Failure

Goal: when the developer needs to fix their changeset, they are notified before they change context or leave their desk.

Indicator: Time from

  • T0: Broken commit, Scheduler dispatch to runner
  • T1: Failure status reported back to developer

Objective: [Time of critical path for longest test] + 1 min

note

Might report directly to GitHub, if the CI platform doesn't allow a "failing but not yet finished" status.
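Because the objective is relative to the longest test's critical path, each failing run has its own threshold. A minimal sketch of that check, with hypothetical function and parameter names and all times in seconds:

```python
def ttf_objective_seconds(longest_test_critical_path_seconds, overhead_seconds=60.0):
    """Objective: critical-path time of the longest test plus one minute."""
    return longest_test_critical_path_seconds + overhead_seconds


def meets_ttf_objective(t0_dispatch, t1_failure_reported, longest_test_critical_path_seconds):
    """Compare the measured T0-to-T1 time against the per-run objective."""
    ttf = t1_failure_reported - t0_dispatch
    return ttf <= ttf_objective_seconds(longest_test_critical_path_seconds)
```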

LTA: Land-to-artifact

Goal: a commit that's needed in production quickly can go through the same process as less urgent ones, so that it is possible to ship to production more than once during a SEV.

Indicator: time from a commit merged to main until all release artifacts are delivered for deployment. The client controls the actions which must run, including long tests or big uploads, so we can only control the parts outside of the Critical Path reported by Bazel.

Objective: [Critical Path] + 5 min

IR: Invalidations rate

Goal: Bazel's expensive computations (analysis cache, external/ folders) do not occur frequently on users' critical path. (This is included in Time to first action.)

Indicator: number of times we saw each kind of invalidation per number of builds.

Objective: 5 per 100 builds (Aspirational)

note

PR builds can invalidate the caches on a runner, and if that runner then takes another request, the caches are invalidated back again. In the future we will quarantine that runner so that nobody else uses it: it will then be pinned to the branch and act like a remote dev environment for that developer.
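The indicator can be tallied per invalidation kind, as in the sketch below. The kind labels used in the usage note (`"analysis_cache"`, `"external_repo"`) and the input shape are hypothetical. How invalidations are detected is outside the scope of this sketch.

```python
from collections import Counter


def invalidations_per_100_builds(builds):
    """Rate of each invalidation kind per 100 builds.

    builds: list with one entry per build, each entry being the list of
    invalidation kinds observed in that build (empty if none).
    """
    if not builds:
        return {}
    counts = Counter()
    for kinds in builds:
        counts.update(kinds)
    return {kind: 100 * n / len(builds) for kind, n in counts.items()}


def kinds_over_objective(builds, objective_per_100=5.0):
    """Sorted invalidation kinds whose rate exceeds the objective."""
    rates = invalidations_per_100_builds(builds)
    return sorted(kind for kind, rate in rates.items() if rate > objective_per_100)
```

For example, 6 builds with an `"analysis_cache"` invalidation out of 200 builds gives a rate of 3 per 100, which is within the aspirational objective.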

FPR: False positive breakage rate

Goal: We shouldn't bother a human unless the CI system requires manual repair.

Indicator: Number of breakages that the buildcop reports as false positives through interaction with Aspect's system, typically via the Slack thread.

Objective: 5 out of 100 breakages