Workflows Metrics
Gathering for some of the following metrics is still under development.
CR: Cost Information
Data showing the overall cost over time for Workflows infrastructure.
TTA: Time to first action
This metric targets work on the critical loading-phase path: analysis cache discards, slow repository rules that are invalidated or eagerly fetched, and so on. Because Bazel does this work before computing Action Cache keys, no amount of remote caching or remote execution improves this indicator.
- T0: Scheduler dispatch to warm runner
- T1: git clone is up-to-date
- T2: First action spawn according to Bazel profile
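The three timestamps above can be turned into durations. The following is a minimal sketch, assuming TTA is measured from T0 to T2 and that each timestamp is already available as a `datetime`. The function name `tta_breakdown` and the dictionary keys are hypothetical, not part of Workflows.

```python
from datetime import datetime, timedelta


def tta_breakdown(t0_dispatch: datetime, t1_clone_ready: datetime,
                  t2_first_action: datetime) -> dict:
    """Split time-to-first-action into its two phases, in seconds.

    Assumption: TTA = T2 - T0. The key names are illustrative only.
    """
    if not (t0_dispatch <= t1_clone_ready <= t2_first_action):
        raise ValueError("timestamps must be ordered T0 <= T1 <= T2")
    return {
        "dispatch_to_clone_s": (t1_clone_ready - t0_dispatch).total_seconds(),
        "clone_to_first_action_s": (t2_first_action - t1_clone_ready).total_seconds(),
        "tta_s": (t2_first_action - t0_dispatch).total_seconds(),
    }


# Example with made-up timestamps.
t0 = datetime(2024, 1, 1, 12, 0, 0)
example = tta_breakdown(t0, t0 + timedelta(seconds=30), t0 + timedelta(seconds=75))
```

Splitting the total at T1 shows whether time is going to the git clone or to Bazel's loading and analysis work before the first action.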
QT: Developer-perceived Queue time
Developers shouldn't be blocked by lack of CI resources. This metric is the amount of time that each pull request build remains in the queue.
GR: Main branch greenness ratio
main should be green most of the time.
Ratio of the time in a given period where main was green to the total time of that period.
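As an illustration of this ratio, here is a minimal sketch that assumes the status of main is recorded as a sorted list of (timestamp, is_green) transitions. The function name `greenness_ratio`, its parameters, and the input shape are hypothetical.

```python
from datetime import datetime, timedelta


def greenness_ratio(transitions, period_start, period_end, initial_green=True):
    """Fraction of [period_start, period_end] during which main was green.

    transitions: sorted list of (timestamp, is_green) status changes.
    initial_green: assumed status of main at period_start.
    """
    green = timedelta(0)
    cursor, state = period_start, initial_green
    for ts, is_green in transitions:
        # Clamp each transition into the period so out-of-range events
        # only affect the state, not the accumulated time.
        ts = min(max(ts, period_start), period_end)
        if state:
            green += ts - cursor
        cursor, state = ts, is_green
    if state:
        green += period_end - cursor
    return green / (period_end - period_start)


# Example: over one day, main goes red at 06:00 and green again at 12:00.
start = datetime(2024, 1, 1)
end = start + timedelta(hours=24)
ratio = greenness_ratio(
    [(start + timedelta(hours=6), False), (start + timedelta(hours=12), True)],
    start,
    end,
)
```

In this example main is green for 18 of 24 hours, so the ratio is 0.75.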
While Aspect can alert the BuildCop, the customer is responsible for the response, such as reverting a commit.
TTF: Developer-perceived Time to Failure
When the developer needs to fix their pull request, they should be notified before they change context or leave their desk.
- T0: Scheduler dispatch to runner
- T1: Failure status reported back to developer
If the CI platform doesn't allow a "failing but not yet finished" status, Workflows reports the failure as a comment on the pull request.
LTA: Land-to-artifact
A commit that's needed in production quickly can go through the same process as less urgent ones. A developer might say "I need to ship to production more than once during an outage."
Time from a commit merged to main until all release artifacts are delivered for deployment.
The customer controls the actions which must run, including long tests or big uploads, so Aspect can only control the parts outside the Critical Path reported by Bazel.
IR: Invalidations rate
Bazel's expensive computations, such as the analysis cache and external/ folders, should not occur frequently on a user's critical path. (This is also included in Time to first action, above.)
Number of times we saw each kind of invalidation per number of builds.
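A minimal sketch of this calculation follows, assuming each build is represented by the collection of invalidation kinds observed during it, and that a kind is counted at most once per build so each rate falls between 0 and 1. The function name and the kind labels (`analysis_cache`, `repository_rule`) are hypothetical.

```python
from collections import Counter


def invalidation_rates(builds):
    """Per-kind invalidation rate: builds showing that kind / total builds.

    builds: list of iterables, one per build, holding the invalidation
    kinds seen in that build (empty if none). Assumption: a kind counts
    at most once per build.
    """
    total = len(builds)
    if total == 0:
        return {}
    counts = Counter(kind for build in builds for kind in set(build))
    return {kind: n / total for kind, n in counts.items()}


# Example with made-up data: four builds, two of which invalidated something.
rates = invalidation_rates([
    {"analysis_cache"},
    set(),
    {"analysis_cache", "repository_rule"},
    set(),
])
```

Here `analysis_cache` was seen in two of four builds (0.5) and `repository_rule` in one of four (0.25).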
PR builds can invalidate the caches on a runner, and if that runner then takes another request, that request invalidates them back again. In the future we plan to lame-duck a runner that has invalidated caches so that it is not used by anyone else.
We recommend enabling the rebase feature in Workflows so that PRs which cleanly rebase against the target branch will do so.
FPR: False positive breakage rate
We shouldn't bother a human unless the CI system requires manual repair. BuildCop reports a false positive through interaction with Aspect's system, typically via the Slack thread.