CI/CD Pipelines

This document explains the structure and flow of the .gitlab-ci.yml pipeline of this project.

1. Core Concepts

Before looking at the job-by-job flow, you must understand these core concepts.

Parallel Environment Matrix

Almost every job in this pipeline runs in parallel for each environment defined in the .matrix-aws-accounts block:

.matrix-aws-accounts: &matrix-aws-accounts
  parallel:
    matrix:
      - ENV: [dev]
      - ENV: [ops]

This means that a single main branch pipeline will run jobs for dev and ops simultaneously (e.g., tf/infrastructure (dev) and tf/infrastructure (ops)).
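
As a minimal sketch of how a job picks up the matrix, the fragment below merges the anchor into a job and reads ENV in its script. The job name example/print-env is hypothetical and is not part of this pipeline.

```yaml
.matrix-aws-accounts: &matrix-aws-accounts
  parallel:
    matrix:
      - ENV: [dev]
      - ENV: [ops]

# Hypothetical job, for illustration only
example/print-env:
  # Merge the matrix definition into this job: one instance per ENV value
  <<: *matrix-aws-accounts
  script:
    - echo "Running for environment ${ENV}"
```

Each instance gets its own value of ENV, so the same script can select per-environment resources without duplicating the job.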

Central CI/CD Library

This pipeline is built from standardized jobs maintained in our central CI/CD library. Jobs like .tf/format and .docker/bake are templates imported from that central project.

Bastion Host Connection

Jobs that need to communicate with the Grafana API (which sits in a private subnet) extend the .connect-bastion job. This template, defined in /.gitlab/ci/setup-bastion-connection.gitlab-ci.yml, establishes the secure connection automatically. For further information on the bastion setup, see bastion.md.

my-job:
  extends: .connect-bastion
  script:
    # Your commands here; the connection is already established
    - gcx dashboard list

Terraform Child Pipelines

Terraform is not run directly in this pipeline. Instead, jobs like tf/infrastructure use the trigger: keyword to launch a child pipeline from the central ci-cd-library.

This child pipeline handles the full terraform plan and terraform apply logic. We just pass it the variables it needs, like the working directory (CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH) and the environment name.
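
A rough sketch of what such a trigger job could look like follows. The project path and file name under trigger:include are hypothetical placeholders, not the real location of the central library, and the folder value is assumed from the infrastructure/ directory mentioned later in this document.

```yaml
.matrix-aws-accounts: &matrix-aws-accounts
  parallel:
    matrix:
      - ENV: [dev]
      - ENV: [ops]

tf/infrastructure:
  stage: trigger
  <<: *matrix-aws-accounts
  variables:
    # Working directory handed to the child pipeline (assumed value)
    CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH: infrastructure/
  trigger:
    include:
      # Hypothetical project path and file; replace with the real library values
      - project: "example-group/ci-cd-library"
        file: "/templates/terraform-child-pipeline.gitlab-ci.yml"
```

Variables defined on the trigger job are passed down to the child pipeline. This sketch assumes the matrix variable ENV is forwarded the same way.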

Authentication to AWS

Note that any job that interacts with AWS expects one of the role_arn variables to be set (CCL_AWS_ASSUME_ROLE_ARN or CCL_AWS_ASSUME_ROLE_ARN_READONLY). Do not set these variables manually in the project. Instead, create the necessary role (with the correct project and environment scope) in the AWS-Account-Management project. The role_arn variables are then managed automatically from our centralized repository.

The following example shows a template job that can interact with AWS:

example/aws:
  id_tokens: &aws_assume_role_tokens
    # Set the correct ID token for OIDC
    CCL_AWS_ASSUME_ROLE_GITLAB_ID_TOKEN:
      aud: sts.amazonaws.com
  before_script:
    # Import the utils function that assumes a role with OIDC
    - !reference [.cicd_utils/aws-assume-role-with-web-identity, script]
    # Actually assume the necessary role with OIDC
    - cicd_utils__aws_assume_role_with_web_identity
  <<: *matrix-aws-accounts

ECS Deployments with ecspresso

We use ecspresso to deploy new container images to ECS. This gives us precise control over the deployment. The flow is:

  1. The docker/bake job builds a new image and tags it with the $CI_COMMIT_SHA.
  2. The deploy_... job (e.g., deploy_grafana) runs.
  3. It pulls the cluster's current task definition using ecspresso init ....
  4. It uses sed to find the "image" key in the task definition and replace its tag with the $CI_COMMIT_SHA.
  5. It runs IMAGE_TAG=${CI_COMMIT_SHA} ecspresso deploy ... to register and deploy the new task definition.
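
Steps 3 to 5 might look roughly like the sketch below. It is an illustration under assumptions, not the project's actual job: AWS_REGION and ECS_CLUSTER are hypothetical variables, the file names are the ones ecspresso init writes by default, and the sed expression assumes GNU sed and an image URI without a registry port.

```yaml
deploy_grafana:
  stage: deploy
  script:
    # Step 3: export the live service and task definition to local files
    - ecspresso init --region "${AWS_REGION}" --cluster "${ECS_CLUSTER}" --service "sre-${ENV}-grafana" --config ecspresso.yml
    # Step 4: replace the tag of every "image" value with the commit SHA
    - |
      sed -i -E "s|(\"image\": *\"[^\":]+):[^\"]*\"|\1:${CI_COMMIT_SHA}\"|" ecs-task-def.json
    # Step 5: register the edited task definition and roll the service
    - IMAGE_TAG=${CI_COMMIT_SHA} ecspresso deploy --config ecspresso.yml
```

Because the task definition is pulled fresh on every run, the job only changes the image tag and leaves all other settings as they currently are in ECS.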

Keeping Terraform and CI/CD in Sync

You may wonder: "If CI/CD deploys a new task definition, what happens when Terraform runs next?" This is handled by a key setting in our Terraform ECS service modules: track_latest = true. This tells Terraform, "Do not manage the task_definition ARN. Always assume the latest active revision is the one you should be tracking."

This prevents drift, where Terraform would try to revert the service to an older task definition just because the CI/CD pipeline deployed a new one.

  • CI/CD (ecspresso): Actively pushes new task definition revisions.
  • Terraform (track_latest): Passively accepts the latest revision as its state.

2. Pipeline Flow by Stage

This is the end-to-end execution order of the pipeline.

verify Stage

Purpose: Linting and formatting checks. Runs on merge requests.

  • tf/format: Checks that all .tf files are correctly formatted.
  • alloy/format: Checks that all .alloy files in alloy/config/ are formatted.
  • validate/dashboards: Validates all dashboard JSON files after they are generated.

build Stage

Purpose: Build and push container images. Runs on main branch and MRs.

  • docker/bake: This job builds and pushes all container images defined in docker-bake.hcl (Alloy, Grafana, etc.).
      • On MRs: It builds but does not push to ECR.
      • On main: It builds and pushes, tagging the image with both :latest and the $CI_COMMIT_SHA.

trigger Stage

Purpose: Run Terraform child pipelines to update infrastructure.

  • tf/infrastructure: Triggers the Terraform pipeline for the main infrastructure/ directory. This runs first.
  • tf/grafana_datasources: Triggers the Terraform pipeline for Grafana provisioning. This job needs the tf/infrastructure job, ensuring it only runs after the core infra is applied.
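
The ordering between the two trigger jobs can be sketched as follows. The child pipeline location is a hypothetical placeholder, the environment matrix is omitted for brevity, and the use of strategy: depend is an assumption: without it, a trigger job is marked successful as soon as the child pipeline is created, not when it finishes.

```yaml
# Hypothetical child pipeline location, shared by both jobs in this sketch
.tf-child-pipeline: &tf-child-pipeline
  include:
    - project: "example-group/ci-cd-library"
      file: "/templates/terraform-child-pipeline.gitlab-ci.yml"
  # Wait for the child pipeline to finish before reporting a status
  strategy: depend

tf/infrastructure:
  stage: trigger
  trigger: *tf-child-pipeline

tf/grafana_datasources:
  stage: trigger
  # Start only after tf/infrastructure has completed successfully
  needs:
    - job: tf/infrastructure
  trigger: *tf-child-pipeline
```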

deploy Stage

Purpose: Deploy applications and configurations. Runs on main branch only.

These jobs use the changes: keyword to ensure they only run when their specific code is modified.

  • deploy_cloudwatch_metrics_collector:
      • Trigger: Changes to alloy/
      • Action: Runs ecspresso deploy for the sre-${ENV}-cloudwatch-metrics-collector service.

  • deploy_metrics_proxy:
      • Trigger: Changes to alloy/
      • Action: Runs ecspresso deploy for the sre-${ENV}-metrics-proxy service.

  • deploy_grafana:
      • Trigger: Changes to grafana/
      • Action: Runs ecspresso deploy for the sre-${ENV}-grafana service.

  • deploy/dashboards:
      • Trigger: Changes to dashboards/
      • Action: Connects to the bastion and runs mise run dashboards:push to sync all dashboards to Grafana.
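
A minimal sketch of the "main branch only, and only on relevant changes" condition, using deploy_grafana as the example. The exact rules in the real pipeline may differ, and the script line is a placeholder.

```yaml
deploy_grafana:
  stage: deploy
  rules:
    # Run only on main, and only when files under grafana/ changed
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - grafana/**/*
  script:
    - echo "deploy sre-${ENV}-grafana"  # placeholder for the ecspresso steps
```

When no rule matches (for example, a commit on main that only touches alloy/), the job is not added to the pipeline at all.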

cleanup Stage

Purpose: Remove stale or unmanaged resources. Runs on main branch only.

  • cleanup/grafana:
      • Trigger: Changes to dashboards/
      • Action: Connects to the bastion and runs mise run dashboards:prune. This is critical: it deletes any dashboards and empty folders in Grafana that are not present in the dashboards/ directory, ensuring this repo is the single source of truth.
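
Putting the pieces from this document together, the cleanup job could be sketched as below. The rules block is an assumption based on the trigger described above, not a copy of the real job.

```yaml
include:
  # Provides the .connect-bastion template
  - local: /.gitlab/ci/setup-bastion-connection.gitlab-ci.yml

cleanup/grafana:
  extends: .connect-bastion
  stage: cleanup
  rules:
    # Assumed condition: main branch only, and only when dashboards changed
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - dashboards/**/*
  script:
    # Deletes dashboards and empty folders that are not present in dashboards/
    - mise run dashboards:prune
```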