Monitoring
Aspect Workflows Grafana dashboards
Aspect Workflows uses Grafana, a popular dashboard tool, to display charts based on metrics collected from Workflows tasks.
If you are using version 5.13.0 or later of Aspect Workflows, visit the /grafana path of your Aspect Workflows instance to view Grafana. Log in using the OIDC provider defined in your configuration file.
Versions of Workflows before 5.13.0
Self and provider hosted Grafana
The resources deployed from the Aspect Workflows module write metrics to two places: CloudWatch and Amazon Managed Prometheus (AMP). The AMP metrics can be visualized using a Grafana workspace. This section describes how to add the Aspect Grafana dashboards to a Grafana workspace to visualize these metrics.
Prerequisites
- A Grafana instance URL
- A valid Grafana workspace API key
If you do not have a Grafana instance ready, follow the next section to create one using Amazon Managed Grafana.
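Before passing these two values to Terraform, it can help to confirm that they work together. The following is a minimal sketch, not part of Aspect Workflows: it assumes the key is accepted as a bearer token by Grafana's HTTP API and calls the `/api/org` endpoint. The function name `check_grafana_key` and the variable names `GRAFANA_URL` and `GRAFANA_API_KEY` are hypothetical.

```shell
# Sketch: verify that a Grafana URL and API key are usable together.
# check_grafana_key, GRAFANA_URL and GRAFANA_API_KEY are illustrative names.
check_grafana_key() {
  url="${1%/}"   # strip one trailing slash, if present
  key="$2"
  # GET /api/org returns the organization the key belongs to;
  # --fail makes curl exit non-zero on HTTP errors such as 401.
  curl --silent --show-error --fail \
    --header "Authorization: Bearer ${key}" \
    "${url}/api/org"
}

# Example (requires network access and real values):
#   check_grafana_key "$GRAFANA_URL" "$GRAFANA_API_KEY" && echo "key accepted"
```

A non-zero exit status suggests that the URL is wrong or that the key is invalid or expired.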
Optional: create an Amazon Managed Grafana workspace
Amazon Managed Grafana provides a fully managed Grafana workspace that supports most of the features of self-hosted Grafana. Managed Grafana also integrates directly with SSO providers for authentication and authorization, including AWS Identity Center (IdC). This provisions a Grafana workspace and the API keys necessary to access that workspace.
Note: not all sections of the configuration specified above are required, and the configuration is not limited to the above reference. The only required sections are the PROMETHEUS data source and at least one of the Editor or Admin API keys (for use in the next section). Automated rotation of the API keys is useful when dealing with API key expiry. For more on this point, see Deployment considerations below.
Deploy the Grafana dashboards
If using the preceding Managed Grafana example, install the Aspect Grafana dashboards with the vendored child module, as follows:
module "aspect_workflows_dashboards" {
  source                      = "<Aspect Workflows source ZIP>//monitoring/dashboards"
  grafana_auth                = module.managed_grafana.grafana_admin_api_key
  grafana_url                 = module.managed_grafana.grafana_endpoint
  managed_prometheus_endpoint = module.aspect_workflows.managed_prometheus_endpoint
}
If you are bringing your own Grafana workspace, you can similarly add the dashboards as follows.
module "aspect_workflows_dashboards" {
  source                      = "<Aspect Workflows source ZIP>//monitoring/dashboards"
  grafana_auth                = "<Grafana workspace API key with Editor permissions or above>"
  grafana_url                 = "<Grafana workspace URL>"
  managed_prometheus_endpoint = module.aspect_workflows.managed_prometheus_endpoint
}
Deployment considerations
The Terraform Grafana provider requires valid Grafana auth credentials to refresh existing resources. This means that at the plan stage, valid credentials need to be present in Terraform state. If using the managed Grafana instance, the workspace keys are created as part of the module. However, those credentials will expire at some point before a future plan/apply. Therefore, it may be necessary to do a targeted apply on just the Grafana workspace module in order to refresh the workspace API keys. The target string for the example above is:
terraform apply -target module.managed_grafana
Once the API keys are refreshed, a regular plan/apply should function as expected.
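The two-step sequence described here can be sketched as a small shell function. This is an illustration under assumptions rather than a prescribed workflow: `refresh_keys_then_apply` is a hypothetical name, the plan file name `tfplan` is arbitrary, and `module.managed_grafana` is the module name used in the example in this section.

```shell
# Sketch: refresh expired workspace API keys, then run the normal cycle.
# refresh_keys_then_apply and the tfplan file name are illustrative.
refresh_keys_then_apply() {
  # Step 1: targeted apply touches only the Grafana workspace module,
  # which refreshes the workspace API keys held in Terraform state.
  terraform apply -target=module.managed_grafana || return 1
  # Step 2: with fresh keys in state, a regular plan/apply can refresh
  # the existing dashboard resources.
  terraform plan -out=tfplan || return 1
  terraform apply tfplan
}
```

Stopping at the first failing step avoids running a full plan with credentials that are still expired.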
(GCP only) Debugging locally
Install kubectl: https://kubernetes.io/docs/tasks/tools/#kubectl
Install the gcloud CLI: https://cloud.google.com/sdk/docs/install-sdk
After installation, make sure that you have authorized the CLI to connect to the cloud platform. See https://cloud.google.com/sdk/docs/install-sdk#initializing_the
With the CLI properly configured, run the following command to add access to the cluster to your kubectl context:
gcloud container clusters get-credentials cluster --region "<REGION_HERE>" --project "<PROJECT_ID>"
WARNING: If you are seeing "CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke", run gcloud components install gke-gcloud-auth-plugin and then run the gcloud command above again.
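To avoid hitting this warning, the plugin check can be done up front. The sketch below assumes the plugin is discoverable on PATH under the name `gke-gcloud-auth-plugin`; the function name `ensure_gke_auth_plugin` is hypothetical.

```shell
# Sketch: install gke-gcloud-auth-plugin only when it is missing from PATH.
# ensure_gke_auth_plugin is an illustrative name.
ensure_gke_auth_plugin() {
  if command -v gke-gcloud-auth-plugin >/dev/null 2>&1; then
    echo "gke-gcloud-auth-plugin already installed"
  else
    # Same install command as given in the warning text.
    gcloud components install gke-gcloud-auth-plugin
  fi
}
```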
Finally, run:
grafana_pod=$(kubectl get pods -n observability -l "app.kubernetes.io/name=grafana" -o jsonpath="{.items[0].metadata.name}")
grafana_port=$(kubectl get pod "$grafana_pod" -n observability -o jsonpath='{.spec.containers[?(@.name=="grafana")].ports[?(@.name=="grafana")].containerPort}')
kubectl port-forward "pods/$grafana_pod" "3000:$grafana_port" -n observability
Navigate to http://localhost:3000 in your browser to see the dashboard.
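If the page does not load, a quick check from a second terminal can tell whether the port-forward is working. This sketch assumes the port-forward is listening on local port 3000 and polls Grafana's unauthenticated `/api/health` endpoint; `wait_for_grafana` is a hypothetical helper name.

```shell
# Sketch: poll Grafana's health endpoint through the local port-forward.
# wait_for_grafana is an illustrative name; port 3000 matches the
# kubectl port-forward command.
wait_for_grafana() {
  base_url="${1:-http://localhost:3000}"
  attempts="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # /api/health needs no login and reports status as JSON.
    if curl --silent --fail "${base_url}/api/health"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Grafana did not respond at ${base_url}" >&2
  return 1
}

# Example: wait_for_grafana "http://localhost:3000" 5
```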
Provided dashboards
Agents & Runners
- Remote Cache:
  - Cache Hits (Duration)
  - Cache Hit Rate
  - Cache Misses (Duration)
- Bazel Invocations:
  - Main Branch Test Invocation Duration
  - Main Branch Test Invocation Averages
  - Test Invocation Duration
  - Test Invocation Averages
  - Bazel Invocation Errors
  - Main Branch Bazel Errors
  - Bazel Workspace Status Duration
  - Bazel Server Minor GC Collections
  - Bazel Repository Rule Duration
  - Bazel Target Pattern Evaluation Duration
Aspect Workflows
- Cache Hit Rate
- Table of Contents for all other dashboards
BB Scheduler
- Task rate by stage transition
  - Nonexistent → Queued
  - Queued → Executing
  - Executing → Completed
  - Completed → Removed
- Task count by stage
  - Queued
  - Executing
  - Completed
- Task duration by stage
  - Queued
  - Executing
  - Completed
- Miscellaneous
  - Task execution retries
BlobAccess
- Operation rate by operation
  - Action Cache
  - Content Addressable Storage
- Operation rate by gRPC status code
  - Action Cache
  - Content Addressable Storage
- Get() duration
  - Action Cache
  - Content Addressable Storage
- Put() duration
  - Action Cache
  - Content Addressable Storage
- FindMissing() duration
  - Action Cache
  - Content Addressable Storage
- Get() object size
  - Action Cache
  - Content Addressable Storage
- Put() object size
  - Action Cache
  - Content Addressable Storage
- FindMissing() batch size
  - Action Cache
  - Content Addressable Storage
BuildExecutor
- Operations
  - Operation rate
  - FetchingInputs stage duration
  - Running stage duration
  - UploadingOutputs stage duration
  - Virtual execution duration
- POSIX resource usage
  - CPU user time
  - CPU system time
  - Maximum resident set size
  - Page reclaims
  - Page faults
  - Swaps
  - Block input operations
  - Block output operations
  - Messages sent
  - Messages received
  - Signals received
  - Voluntary context switches
  - Involuntary context switches
- File pool resource usage
  - Files created
  - Peak file count
  - Peak file size
  - Read operations
  - Write operations
  - Truncate operations
  - Total read size
  - Total write size
Centralized storage
- Per-replica worst shard retention: whether it is safe to restart the other replica
  - ac
  - cas
- Per-shard best replica retention: amount of data accessible right now
  - ac
  - cas
- Per-shard worst replica retention: amount of data to remain accessible if a replica were to crash
  - ac
  - cas
- Key-location map: Get()
  - ac operation rate
  - cas operation rate
  - ac operation attempts
  - cas operation attempts
- Key-location map: Put()
  - ac operation rate
  - cas operation rate
  - ac operation iterations
  - cas operation iterations
ECS BuildBarn Performance Dashboard
- Resource Usage
  - Frontend CPU Usage
  - Frontend Memory Usage
- Status History
  - Remote Cache Status History
- Memory Usage
  - Memory Usage Over Time
Eviction sets
- AuthorizationHeaderParser
  - Operation rate
  - Hit rate = Touch / (Insert + Touch)
- CachingDirectoryFetcher
  - Operation rate
  - Hit rate = Touch / (Insert + Touch)
- DataIntegrityValidationCache
  - Operation rate
  - Hit rate = Touch / (Insert + Touch)
- ExistenceCachingBlobAccess
  - Operation rate
  - Hit rate = Touch / (Insert + Touch)
- HardlinkingFileFetcher
  - Operation rate
  - Hit rate = Touch / (Insert + Touch)
- QueuedBlobReplicator
  - Operation rate
  - Hit rate = Touch / (Insert + Touch)
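As a worked example of the hit rate formula used on this dashboard, the sketch below computes Touch / (Insert + Touch) from two counts. It assumes that Touch counts accesses to entries already in the set and that Insert counts newly added entries; `hit_rate` is a hypothetical helper and the numbers are made up for illustration.

```shell
# Sketch: compute an eviction-set hit rate from two operation counts.
# hit_rate is an illustrative helper; arguments are Touch, then Insert.
hit_rate() {
  touch_count="$1"
  insert_count="$2"
  awk -v t="$touch_count" -v i="$insert_count" 'BEGIN {
    total = i + t
    if (total == 0) { print "n/a"; exit }   # no operations recorded yet
    printf "%.2f\n", t / total
  }'
}

hit_rate 900 100   # 900 / (100 + 900), prints 0.90
```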
gRPC clients
- Number of in-flight operations
  - By gRPC method
  - By ECS service
- Operation rate
  - By gRPC method
  - By gRPC status code
  - By ECS service
- Messages sent
  - By gRPC method
  - By ECS service
- Messages received
  - By gRPC method
  - By ECS service
- Timing
  - RPC duration
gRPC servers
- Number of in-flight operations
  - By gRPC method
  - By ECS service
- Operation rate
  - By gRPC method
  - By gRPC status code
  - By ECS service
- Messages sent
  - By gRPC method
  - By ECS service
- Messages received
  - By gRPC method
  - By ECS service
- Timing
  - RPC duration