CI/CD Pipelines

This document explains the structure and flow of the .gitlab-ci.yml pipeline of this project.

1. Core Concepts

Before looking at the job-by-job flow, you must understand these core concepts.

Parallel Environment Matrix

Almost every job in this pipeline runs in parallel for each environment defined in the .matrix-aws-accounts block:

.matrix-aws-accounts: &matrix-aws-accounts
  parallel:
    matrix:
      - ENV: [dev]
      - ENV: [ops]

This means that a single main branch pipeline will run jobs for dev and ops simultaneously (e.g., tf/infrastructure (dev) and tf/infrastructure (ops)).
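A job opts into the matrix by merging the anchor, and every generated instance then receives its own ENV value. A minimal sketch, assuming the anchor is defined earlier in the same file; the job name example/print-env is hypothetical and not part of this project:

```yaml
# Hypothetical job, shown only to illustrate the matrix anchor.
example/print-env:
  # Merge the shared parallel:matrix block (one job instance per ENV).
  <<: *matrix-aws-accounts
  script:
    # ENV is "dev" in one instance and "ops" in the other.
    - echo "Running for environment ${ENV}"
```

Adding a further environment to the .matrix-aws-accounts block would then fan out every job that merges the anchor, without touching the jobs themselves.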

Central CI/CD Library

This pipeline is built from standardized jobs provided by our central CI/CD library. Jobs like .tf/format and .docker/bake are templates imported from that central project.
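As a rough illustration of how such a template is typically consumed, the sketch below includes a file from the central project and extends one of its hidden jobs. The project path, ref, and file name are assumptions made up for this example; only the .tf/format template name and the tf/format job come from this pipeline:

```yaml
include:
  # Hypothetical project path, ref, and file; substitute the real values.
  - project: my-group/ci-cd-library
    ref: main
    file: /templates/terraform.gitlab-ci.yml

tf/format:
  stage: verify
  # Reuse the centrally maintained template instead of redefining the script.
  extends: .tf/format
```

Keeping the logic in the template means a fix in the central library reaches every consuming project without local changes.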

Bastion Host Connection

Jobs that need to communicate with the Grafana API (which sits in a private subnet) extend the .connect-bastion job. This template, defined in /.gitlab/ci/setup-bastion-connection.gitlab-ci.yml, establishes the secure connection automatically. For further information on the bastion setup, see bastion.md.

my-job:
  extends: .connect-bastion
  script:
    # Your commands here; the connection is already established
    - gcx dashboard list

Terraform Child Pipelines

Terraform is not run directly in this pipeline. Instead, jobs like tf/infrastructure use the trigger: keyword to launch a child pipeline from the central ci-cd-library.

This child pipeline handles the full terraform plan and terraform apply logic. We just pass it the variables it needs, like the working directory (CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH) and the environment name.
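A trigger job of this kind might look roughly like the sketch below. The project path and file name of the child pipeline are hypothetical placeholders, and the value passed as the working directory is an assumption based on the infrastructure/ directory mentioned later in this document:

```yaml
tf/infrastructure:
  stage: trigger
  # One child pipeline per environment (anchor defined earlier in the file).
  <<: *matrix-aws-accounts
  variables:
    # Working directory handed to the child pipeline (assumed value).
    CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH: infrastructure
  trigger:
    include:
      # Hypothetical location of the Terraform child pipeline definition.
      - project: my-group/ci-cd-library
        file: /templates/terraform-child.gitlab-ci.yml
    # Make this job wait for, and mirror the status of, the child pipeline.
    strategy: depend
```

With strategy: depend, a failed plan or apply in the child pipeline also fails the trigger job, so later stages do not run against half-applied infrastructure.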

Authentication to AWS

Any job that interacts with AWS expects one of the role_arn variables to be set (CCL_AWS_ASSUME_ROLE_ARN or CCL_AWS_ASSUME_ROLE_ARN_READONLY). Do not set these variables manually in the project. Instead, create the necessary role (with the correct project and environment scope) in the AWS-Account-Management project. The role_arn variables are then managed automatically from our centralized repository.

The following example shows a template job that can interact with AWS:

example/aws:
  id_tokens: &aws_assume_role_tokens
    # Set the correct ID token for OIDC
    CCL_AWS_ASSUME_ROLE_GITLAB_ID_TOKEN:
      aud: sts.amazonaws.com
  before_script:
    # Import the utils function that assumes a role with OIDC
    - !reference [.cicd_utils/aws-assume-role-with-web-identity, script]
    # Actually assume the necessary role with OIDC
    - cicd_utils__aws_assume_role_with_web_identity
  <<: *matrix-aws-accounts

ECS Deployments with `ecspresso`

We use ecspresso to deploy new container images to ECS. This gives us precise control over the deployment. The flow is:

1. The docker/bake job builds a new image and tags it with $CI_COMMIT_SHA.
2. The deploy_... job (e.g., deploy_grafana) runs.
3. It pulls the cluster's current task definition using ecspresso init ....
4. It uses sed to find the "image" key in the task definition and replace its tag with $CI_COMMIT_SHA.
5. It runs IMAGE_TAG=${CI_COMMIT_SHA} ecspresso deploy ... to register and deploy the new task definition.
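The deploy steps above can be sketched as a single job. This is an illustration under stated assumptions, not the project's actual job: the AWS_REGION and ECS_CLUSTER variables are hypothetical, the sed expression is one possible way to swap the tag, and the AWS authentication shown in the previous section is omitted for brevity. The file name ecs-task-def.json is the default that ecspresso init writes.

```yaml
deploy_grafana:
  stage: deploy
  <<: *matrix-aws-accounts
  script:
    - |
      # Pull the current service and task definition from the cluster.
      # AWS_REGION and ECS_CLUSTER are hypothetical variables.
      ecspresso init --region "${AWS_REGION}" --cluster "${ECS_CLUSTER}" \
        --service "sre-${ENV}-grafana" --config ecspresso.yml

      # Replace the tag of every "image" value with the commit SHA.
      # Images without an explicit tag are left untouched by this pattern.
      sed -i -E "s|(\"image\": *\"[^\"]+):[^\":/]+\"|\1:${CI_COMMIT_SHA}\"|" ecs-task-def.json

      # Register the edited task definition and roll the service onto it.
      IMAGE_TAG="${CI_COMMIT_SHA}" ecspresso deploy --config ecspresso.yml
```

Deploying by commit SHA rather than by :latest keeps each task definition revision traceable to the exact commit that produced its image.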

Keeping Terraform and CI/CD in Sync

You may wonder: "If CI/CD deploys a new task definition, what happens when Terraform runs next?" This is handled by a key setting in our Terraform ECS service modules: track_latest = true. This tells Terraform, "Do not manage the task_definition ARN. Always assume the latest active revision is the one you should be tracking."

This prevents a "drift" where Terraform would try to revert the service to an older task definition just because the CI/CD pipeline deployed a new one.

CI/CD (ecspresso): Actively pushes new task definition revisions.
Terraform (track_latest): Passively accepts the latest revision as its state.

2. Pipeline Flow by Stage

This is the end-to-end execution order of the pipeline.

`verify` Stage

Purpose: Linting and formatting checks. Runs on merge requests.

tf/format: Checks that all .tf files are correctly formatted.
alloy/format: Checks that all .alloy files in alloy/config/ are formatted.
validate/dashboards: Validates the generated dashboards to ensure that every dashboard JSON file is valid.

`build` Stage

Purpose: Build and push container images. Runs on main branch and MRs.

docker/bake: This job builds and pushes all container images defined in docker-bake.hcl (Alloy, Grafana, etc.).
On MRs: It builds but does not push to ECR.
On main: It builds and pushes, tagging the image with both :latest and the $CI_COMMIT_SHA.
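The real job comes from the central .docker/bake template, whose internals are not shown in this document. Purely to illustrate the "build on MRs, build and push on main" split, the sketch below uses rules to set a hypothetical BAKE_ARGS variable:

```yaml
# Illustration only; not the implementation of the .docker/bake template.
docker/bake:
  stage: build
  rules:
    # Merge requests: build the images but do not push them.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      variables:
        BAKE_ARGS: ""          # hypothetical variable
    # main: build and push to the registry.
    - if: $CI_COMMIT_BRANCH == "main"
      variables:
        BAKE_ARGS: "--push"
  script:
    # Tags (:latest and the commit SHA) are assumed to be set in docker-bake.hcl.
    - docker buildx bake --file docker-bake.hcl ${BAKE_ARGS}
```

Building without pushing on merge requests still catches broken Dockerfiles early while keeping unreviewed images out of ECR.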

`trigger` Stage

Purpose: Run Terraform child pipelines to update infrastructure.

tf/infrastructure: Triggers the Terraform pipeline for the main infrastructure/ directory. This runs first.
tf/grafana_datasources: Triggers the Terraform pipeline for Grafana provisioning. This job needs the tf/infrastructure job, ensuring it only runs after the core infra is applied.
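The ordering between the two trigger jobs could be expressed as in the sketch below, assuming the dependency is declared with the needs: keyword. The project path and file name of the child pipeline are hypothetical placeholders:

```yaml
tf/grafana_datasources:
  stage: trigger
  needs:
    # Wait for the core infrastructure trigger job to finish first.
    - tf/infrastructure
  trigger:
    include:
      # Hypothetical location of the Terraform child pipeline definition.
      - project: my-group/ci-cd-library
        file: /templates/terraform-child.gitlab-ci.yml
    strategy: depend
```

This ordering only guarantees that the infrastructure is applied if tf/infrastructure itself waits for its child pipeline, for example by using strategy: depend.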

`deploy` Stage

Purpose: Deploy applications and configurations. Runs on main branch only.

These jobs use the changes: keyword to ensure they only run when their specific code is modified.

deploy_cloudwatch_metrics_collector:
Trigger: Changes to alloy/
Action: Runs ecspresso deploy for the sre-${ENV}-cloudwatch-metrics-collector service.
deploy_metrics_proxy:
Trigger: Changes to alloy/
Action: Runs ecspresso deploy for the sre-${ENV}-metrics-proxy service.
deploy_grafana:
Trigger: Changes to grafana/
Action: Runs ecspresso deploy for the sre-${ENV}-grafana service.
deploy/dashboards:
Trigger: Changes to dashboards/
Action: Connects to the bastion and runs mise run dashboards:push to sync all dashboards to Grafana.
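The combination of "main branch only" and path-based triggering listed above can be sketched with rules. The glob pattern and the placeholder script are assumptions; the actual job runs ecspresso deploy:

```yaml
deploy_metrics_proxy:
  stage: deploy
  <<: *matrix-aws-accounts
  rules:
    # Run only on main, and only when files under alloy/ changed.
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - alloy/**/*
  script:
    # Placeholder for the real ecspresso deploy steps.
    - echo "Deploying sre-${ENV}-metrics-proxy"
```

Because the if and changes clauses sit in the same rule, both must match; a commit on main that touches only grafana/ would skip this job.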

`cleanup` Stage

Purpose: Remove stale or unmanaged resources. Runs on main branch only.

cleanup/grafana:
Trigger: Changes to dashboards/
Action: Connects to the bastion and runs mise run dashboards:prune. This is critical: it deletes any dashboards and empty folders in Grafana that are not present in the dashboards/ directory, ensuring this repo is the single source of truth.
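Put together, the cleanup job might look roughly like this sketch. The rules block and the use of the environment matrix are assumptions based on the descriptions in this document:

```yaml
cleanup/grafana:
  stage: cleanup
  # Establishes the bastion connection before the script runs.
  extends: .connect-bastion
  # Assumed: the job runs once per environment like most other jobs.
  <<: *matrix-aws-accounts
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - dashboards/**/*
  script:
    # Deletes dashboards and empty folders that are not in dashboards/.
    - mise run dashboards:prune
```

Because the prune is destructive, dashboards created by hand in the Grafana UI disappear on the next run unless they are also committed to dashboards/.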