Version: 5.13.x

Monitoring

Aspect Workflows Grafana dashboards

Aspect Workflows uses Grafana, a popular dashboarding tool, to chart the metrics collected from Workflows tasks.

If you are using version 5.13.0 or later of Aspect Workflows, visit the /grafana path of your Aspect Workflows instance to view the Grafana dashboards. Log in with the OIDC authentication provider defined in your configuration file.

Versions of Workflows before 5.13.0

Self-hosted and provider-hosted Grafana

The resources deployed by the Aspect Workflows module write metrics to two places: CloudWatch and Amazon Managed Service for Prometheus (AMP). The AMP metrics can be visualized in a Grafana workspace. This section describes how to add the Aspect Grafana dashboards to a Grafana workspace to visualize these metrics.

Prerequisites

  • A Grafana instance URL
  • A valid Grafana workspace API key

If you do not have a Grafana instance ready, follow the next section to create one with Amazon Managed Grafana.
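The two prerequisites can be checked before running Terraform. The following is a minimal sketch, not part of Aspect Workflows; the environment variable names GRAFANA_URL and GRAFANA_API_KEY are hypothetical and only illustrate where the two values might be kept.

```shell
# Sketch: confirm both prerequisites are available before running Terraform.
# GRAFANA_URL and GRAFANA_API_KEY are hypothetical names chosen for this example.
check_grafana_prereqs() {
  rc=0
  if [ -z "${GRAFANA_URL:-}" ]; then
    echo "GRAFANA_URL is not set" >&2
    rc=1
  fi
  if [ -z "${GRAFANA_API_KEY:-}" ]; then
    echo "GRAFANA_API_KEY is not set" >&2
    rc=1
  fi
  # A non-empty URL should carry an explicit scheme.
  case "${GRAFANA_URL:-}" in
    "" | http://* | https://*) ;;
    *)
      echo "GRAFANA_URL should start with http:// or https://" >&2
      rc=1
      ;;
  esac
  return "$rc"
}

# Usage: check_grafana_prereqs && echo "prerequisites look complete"
```

The function only checks that the values are present and well formed; it does not contact Grafana or validate the key.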

Optional: create an Amazon Managed Grafana workspace

Amazon Managed Grafana provides a fully managed Grafana workspace that supports most of the features of self-hosted Grafana. Managed Grafana also integrates directly with SSO providers for authentication and authorization, including AWS IAM Identity Center (IdC). The example configuration provisions a Grafana workspace and the API keys necessary to access that workspace.

Note: not all sections of the configuration specified above are required, and the configuration is not limited to the above reference. The only required sections are the PROMETHEUS data source and at least one of the Editor or Admin API keys (for use in the next section). Automated rotation of the API keys is useful for dealing with API key expiry. For more detail on this point, see Deployment considerations below.

Deploy the Grafana dashboards

If using the preceding Managed Grafana example, install the Aspect Grafana dashboards with the vendored child module, as follows:

module "aspect_workflows_dashboards" {
  source = "<Aspect Workflows source ZIP>//monitoring/dashboards"

  grafana_auth                = module.managed_grafana.grafana_admin_api_key
  grafana_url                 = module.managed_grafana.grafana_endpoint
  managed_prometheus_endpoint = module.aspect_workflows.managed_prometheus_endpoint
}

If you are bringing your own Grafana workspace, add the dashboards in the same way:

module "aspect_workflows_dashboards" {
  source = "<Aspect Workflows source ZIP>//monitoring/dashboards"

  grafana_auth                = "<Grafana workspace API key with Editor permissions or above>"
  grafana_url                 = "<Grafana workspace URL>"
  managed_prometheus_endpoint = module.aspect_workflows.managed_prometheus_endpoint
}

Deployment considerations

The Terraform Grafana provider requires valid Grafana credentials to refresh existing resources, so valid credentials must be present in Terraform state during the plan stage. If you use the managed Grafana instance, the workspace API keys are created as part of the module. However, those keys expire, possibly before a future plan/apply. When that happens, run a targeted apply on just the Grafana workspace module to refresh the workspace API keys. The command for the example above is:

terraform apply -target module.managed_grafana

Once the API keys are refreshed, a regular plan/apply should function as expected.
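The refresh sequence described above can be sketched as a small shell helper. This is an illustration under assumptions, not a script shipped with Aspect Workflows: the function names and the DRY_RUN variable are hypothetical, and the module address module.managed_grafana is the one from the example above.

```shell
# Sketch: refresh expired Grafana API keys, then run the regular plan/apply.
# DRY_RUN, run, and refresh_grafana_keys_then_apply are hypothetical names.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

refresh_grafana_keys_then_apply() {
  # Step 1: targeted apply recreates the workspace API keys.
  run terraform apply -target module.managed_grafana || return 1
  # Step 2: with fresh keys in state, the regular plan/apply can refresh the dashboards.
  run terraform plan || return 1
  run terraform apply
}

# Usage (preview only): DRY_RUN=1; refresh_grafana_keys_then_apply
```

Setting DRY_RUN=1 prints the three commands without running them, which is a cheap way to review the sequence before touching state.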

(GCP only) Debugging locally

Install kubectl: https://kubernetes.io/docs/tasks/tools/#kubectl

Install the gcloud CLI: https://cloud.google.com/sdk/docs/install-sdk

After installation, make sure that you have authorized the CLI to connect to the cloud platform.

See https://cloud.google.com/sdk/docs/install-sdk#initializing_the

With the CLI configured, run the following command to add the cluster credentials to your kubectl context.

gcloud container clusters get-credentials cluster --region "<REGION_HERE>" --project "<PROJECT_ID>"

WARNING: If you see "CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke", run gcloud components install gke-gcloud-auth-plugin, then run the gcloud command above again.

Finally, run:
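A quick way to catch a missing tool, including the auth plugin named in the warning, is to check the PATH up front. The helper below is a minimal sketch; the function name missing_tools is hypothetical.

```shell
# Sketch: print the name of every tool that is not found on the PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Usage: missing_tools kubectl gcloud gke-gcloud-auth-plugin
```

An empty result means all listed tools were found; any name printed still needs to be installed.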

grafana_pod=$(kubectl get pods -n observability -l "app.kubernetes.io/name=grafana" -o jsonpath="{.items[0].metadata.name}")
grafana_port=$(kubectl get pod "$grafana_pod" -n observability -o jsonpath='{.spec.containers[?(@.name=="grafana")].ports[?(@.name=="grafana")].containerPort}')
kubectl port-forward "pods/$grafana_pod" "3000:$grafana_port" -n observability

Navigate to http://localhost:3000 in your browser to see the dashboards.
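If the page does not load, it can help to confirm from a second terminal that the port-forward is serving requests. The sketch below polls Grafana's /api/health endpoint with curl; the function names are hypothetical, and it assumes curl is installed and that the forward uses local port 3000 as in the command above.

```shell
# Sketch: check that the port-forwarded Grafana responds.
# grafana_health_url and wait_for_grafana are hypothetical helper names.

# Build the health-check URL for a locally forwarded Grafana.
grafana_health_url() {
  port="${1:-3000}"
  printf 'http://localhost:%s/api/health\n' "$port"
}

# Poll the health endpoint once per second until it responds or attempts run out.
wait_for_grafana() {
  url="$(grafana_health_url "${1:-3000}")"
  attempts="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "Grafana is reachable at $url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Grafana did not respond at $url" >&2
  return 1
}

# Usage: wait_for_grafana 3000 10
```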

Provided dashboards

Agents & Runners
  • Remote Cache:
    • Cache Hits (Duration)
    • Cache Hit Rate
    • Cache Misses (Duration)
  • Bazel Invocations:
    • Main Branch Test Invocation Duration
    • Main Branch Test Invocation Averages
    • Test Invocation Duration
    • Test Invocation Averages
    • Bazel Invocation Errors
    • Main Branch Bazel Errors
    • Bazel Workspace Status Duration
    • Bazel Server Minor GC Collections
    • Bazel Repository Rule Duration
    • Bazel Target Pattern Evaluation Duration
Aspect Workflows
  • Cache Hit Rate
  • Table of Contents for all other dashboards
BB Scheduler
  • Task rate by stage transition
    • Nonexistent → Queued
    • Queued → Executing
    • Executing → Completed
    • Completed → Removed
  • Task count by stage
    • Queued
    • Executing
    • Completed
  • Task duration by stage
    • Queued
    • Executing
    • Completed
  • Miscellaneous
    • Task execution retries
BlobAccess
  • Operation rate by operation
    • Action Cache
    • Content Addressable Storage
  • Operation rate by gRPC status code
    • Action Cache
    • Content Addressable Storage
  • Get() duration
    • Action Cache
    • Content Addressable Storage
  • Put() duration
    • Action Cache
    • Content Addressable Storage
  • FindMissing() duration
    • Action Cache
    • Content Addressable Storage
  • Get() object size
    • Action Cache
    • Content Addressable Storage
  • Put() object size
    • Action Cache
    • Content Addressable Storage
  • FindMissing() batch size
    • Action Cache
    • Content Addressable Storage
BuildExecutor
  • Operations
    • Operation rate
    • FetchingInputs stage duration
    • Running stage duration
    • UploadingOutputs stage duration
    • Virtual execution duration
  • POSIX resource usage
    • CPU user time
    • CPU system time
    • Maximum resident set size
    • Page reclaims
    • Page faults
    • Swaps
    • Block input operations
    • Block output operations
    • Messages sent
    • Messages received
    • Signals received
    • Voluntary context switches
    • Involuntary context switches
  • File pool resource usage
    • Files created
    • Peak file count
    • Peak file size
    • Read operations
    • Write operations
    • Truncate operations
    • Total read size
    • Total write size
Centralized storage
  • Per-replica worst shard retention: whether it is safe to restart the other replica
    • ac
    • cas
  • Per-shard best replica retention: amount of data accessible right now
    • ac
    • cas
  • Per-shard worst replica retention: amount of data to remain accessible if a replica were to crash
    • ac
    • cas
  • Key-location map: Get()
    • ac operation rate
    • cas operation rate
    • ac operation attempts
    • cas operation attempts
  • Key-location map: Put()
    • ac operation rate
    • cas operation rate
    • ac operation iterations
    • cas operation iterations
ECS BuildBarn Performance Dashboard
  • Resource Usage
    • Frontend CPU Usage
    • Frontend Memory Usage
  • Status History
    • Remote Cache Status History
  • Memory Usage
    • Memory Usage Over Time
Eviction sets
  • AuthorizationHeaderParser
    • Operation rate
    • Hit rate = Touch / (Insert + Touch)
  • CachingDirectoryFetcher
    • Operation rate
    • Hit rate = Touch / (Insert + Touch)
  • DataIntegrityValidationCache
    • Operation rate
    • Hit rate = Touch / (Insert + Touch)
  • ExistenceCachingBlobAccess
    • Operation rate
    • Hit rate = Touch / (Insert + Touch)
  • HardlinkingFileFetcher
    • Operation rate
    • Hit rate = Touch / (Insert + Touch)
  • QueuedBlobReplicator
    • Operation rate
    • Hit rate = Touch / (Insert + Touch)
gRPC clients
  • Number of in-flight operations
    • By gRPC method
    • By ECS service
  • Operation rate
    • By gRPC method
    • By gRPC status code
    • By ECS service
  • Messages sent
    • By gRPC method
    • By ECS service
  • Messages received
    • By gRPC method
    • By ECS service
  • Timing
    • RPC duration
gRPC servers
  • Number of in-flight operations
    • By gRPC method
    • By ECS service
  • Operation rate
    • By gRPC method
    • By gRPC status code
    • By ECS service
  • Messages sent
    • By gRPC method
    • By ECS service
  • Messages received
    • By gRPC method
    • By ECS service
  • Timing
    • RPC duration
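
The hit rate shown on the Eviction sets panels is defined as Touch / (Insert + Touch). As a worked example, the sketch below computes that ratio for a pair of counts; the function name hit_rate and the sample numbers are illustrative and are not taken from any dashboard query.

```shell
# Sketch: hit rate = Touch / (Insert + Touch), as defined for the Eviction sets panels.
# hit_rate is a hypothetical helper; arguments are the Touch and Insert counts.
hit_rate() {
  awk -v touch="$1" -v insert="$2" 'BEGIN {
    total = insert + touch
    # Avoid dividing by zero when no operations were recorded.
    if (total == 0) { print "n/a"; exit }
    printf "%.2f\n", touch / total
  }'
}

# Usage: hit_rate 75 25   (75 touches, 25 inserts)
```

With 75 touches and 25 inserts, the ratio is 75 / 100, so the helper prints 0.75.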