CI/CD Pipelines
This document explains the structure and flow of the .gitlab-ci.yml pipeline of this project.
1. Core Concepts
Before looking at the job-by-job flow, you must understand these core concepts.
Parallel Environment Matrix
Almost every job in this pipeline runs in parallel, once for each environment defined in the .matrix-aws-accounts block.
This means that a single main-branch pipeline runs jobs for dev and ops simultaneously (e.g.,
tf/infrastructure (dev) and tf/infrastructure (ops)).
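As an illustration, a matrix block of this kind could be declared as follows. This is a sketch, not the project's actual definition: the variable name ENV is inferred from the service names used later in this document, the way jobs attach to the block (extends) is an assumption, and example/job is a hypothetical job.

```yaml
# Hypothetical sketch of the environment matrix; the real block may differ.
.matrix-aws-accounts:
  parallel:
    matrix:
      - ENV: [dev, ops]   # one job instance per listed environment

# Assumed usage: a job inherits the matrix and is expanded once per ENV value.
example/job:
  extends: .matrix-aws-accounts
  script:
    - echo "Running for environment ${ENV}"
```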
Central CI/CD Library
This pipeline is built from standardized jobs provided by our central CI/CD
library. Jobs like .tf/format and .docker/bake are templates
imported from that central project.
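A minimal sketch of how a template from the central library might be imported and reused. The project path, ref, and file name are placeholders rather than the real values; only the template name .docker/bake comes from this document.

```yaml
include:
  - project: "example-group/ci-cd-library"   # placeholder project path
    ref: main                                # placeholder ref
    file: "/templates/docker.gitlab-ci.yml"  # placeholder file name

# The local job reuses the imported template and only sets
# what is specific to this project.
docker/bake:
  extends: .docker/bake
  stage: build
```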
Bastion Host Connection
Jobs that need to communicate with the Grafana API (which is in a private subnet) extend the .connect-bastion
job. This template, defined in
/.gitlab/ci/setup-bastion-connection.gitlab-ci.yml,
establishes the secure connection automatically. For further information on the bastion setup, see
bastion.md.
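A job that talks to the Grafana API would then only need to extend the template. The sketch below assumes that the template file is pulled in with a local include and that the template opens the connection before the job's own script runs; the job body is illustrative.

```yaml
include:
  - local: "/.gitlab/ci/setup-bastion-connection.gitlab-ci.yml"

deploy/dashboards:
  extends: .connect-bastion   # assumed to establish the bastion connection first
  stage: deploy
  script:
    - mise run dashboards:push
```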
Terraform Child Pipelines
Terraform is not run directly in this pipeline. Instead, jobs like tf/infrastructure use the trigger: keyword to
launch a child pipeline from the central ci-cd-library.
This child pipeline handles the full terraform plan and terraform apply logic. We just pass it the variables it
needs, like the working directory (CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH) and the environment name.
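A trigger job of this shape might look like the sketch below. The project path, ref, and file name of the child pipeline are placeholders, and passing the environment through the matrix variable ENV is an assumption; only the job name, the trigger: keyword, and CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH come from this document.

```yaml
tf/infrastructure:
  stage: trigger
  extends: .matrix-aws-accounts   # assumed source of the environment name (ENV)
  variables:
    # Working directory handed to the child pipeline.
    CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH: "infrastructure/"
  trigger:
    include:
      - project: "example-group/ci-cd-library"      # placeholder project path
        ref: main                                   # placeholder ref
        file: "/pipelines/terraform.gitlab-ci.yml"  # placeholder file name
    # Make this job mirror the child pipeline's final status
    # instead of succeeding as soon as the child is created.
    strategy: depend
```

Variables defined on a trigger job are forwarded to the downstream pipeline, which is why no explicit hand-off step is needed.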
Authentication to AWS
Note: any job that interacts with AWS expects one of the role_arn variables to be set
(CCL_AWS_ASSUME_ROLE_ARN or CCL_AWS_ASSUME_ROLE_ARN_READONLY). Do not set these variables manually in the project.
Instead, create the necessary role (with the correct project and environment scope) in the
AWS-Account-Management project. The role_arn
variables are then managed automatically from our centralized repository.
The following example shows a template job that can interact with AWS:
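A minimal sketch of such a job, with the assumptions stated: the job name is hypothetical, the runner image is assumed to provide the AWS CLI, and the step that actually assumes the role is taken to live in the central library and is not shown here.

```yaml
# Hypothetical template job; not the project's actual definition.
.aws/example-job:
  before_script:
    # Fail early with a clear message if the centrally managed variable is missing.
    - |
      if [ -z "${CCL_AWS_ASSUME_ROLE_ARN:-}" ]; then
        echo "CCL_AWS_ASSUME_ROLE_ARN is not set; create the role in AWS-Account-Management first."
        exit 1
      fi
  script:
    # Print the identity the job is currently running as.
    - aws sts get-caller-identity
```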
ECS Deployments with ecspresso
We use ecspresso to deploy new container images to ECS. This gives us precise control over the deployment. The flow
is:
- The docker/bake job builds a new image and tags it with the $CI_COMMIT_SHA.
- The deploy_... job (e.g., deploy_grafana) runs.
- It pulls the cluster's current task definition using ecspresso init ....
- It uses sed to find the "image" key in the task definition and replace its tag with the $CI_COMMIT_SHA.
- It runs IMAGE_TAG=${CI_COMMIT_SHA} ecspresso deploy ... to register and deploy the new task definition.
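The steps above can be sketched as a single job. This is an approximation under stated assumptions, not the project's actual script: the cluster name, the AWS_REGION variable, the ecspresso.yml config name, and the exact sed expression are illustrative, and ecs-task-def.json is assumed to be the file that ecspresso init writes by default.

```yaml
deploy_grafana:
  stage: deploy
  variables:
    ECS_CLUSTER: "sre-${ENV}"           # hypothetical cluster name
    ECS_SERVICE: "sre-${ENV}-grafana"
  script:
    # 1. Pull the service's current definitions into local files.
    - ecspresso init --region "${AWS_REGION}" --cluster "${ECS_CLUSTER}" --service "${ECS_SERVICE}" --config ecspresso.yml
    # 2. Rewrite the tag of every "image" entry to the commit SHA.
    - |
      sed -i -E "s|(\"image\": *\"[^\":]+):[^\"]*\"|\1:${CI_COMMIT_SHA}\"|" ecs-task-def.json
    # 3. Register the new task definition revision and update the service.
    - IMAGE_TAG=${CI_COMMIT_SHA} ecspresso deploy --config ecspresso.yml
```

The sed expression only matches images that already carry a tag; an image pinned by digest would need different handling.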
Keeping Terraform and CI/CD in Sync
You may wonder: "If CI/CD deploys a new task definition, what happens when Terraform runs next?" This is handled by a
key setting in our Terraform ECS service modules: track_latest = true. This tells Terraform, "Do not manage the
task_definition ARN. Always assume the latest active revision is the one you should be tracking."
This prevents drift, where Terraform would try to revert the service to an older task definition just because the CI/CD
pipeline deployed a new one.
- CI/CD (ecspresso): Actively pushes new task definition revisions.
- Terraform (track_latest): Passively accepts the latest revision as its state.
2. Pipeline Flow by Stage
This is the end-to-end execution order of the pipeline.
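Assuming the stage names match the section headings that follow, the corresponding declaration in .gitlab-ci.yml would be:

```yaml
# Stages run top to bottom; jobs within a stage run in parallel
# unless ordered explicitly with needs.
stages:
  - verify
  - build
  - trigger
  - deploy
  - cleanup
```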
verify Stage
Purpose: Linting and formatting checks. Runs on merge requests.
- tf/format: Checks that all .tf files are correctly formatted.
- alloy/format: Checks that all .alloy files in alloy/config/ are correctly formatted.
- validate/dashboards: Validates the generated dashboards to ensure that every dashboard JSON file is valid.
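Restricting a check to merge requests is typically done with a rules clause. A sketch for tf/format, assuming the central .tf/format template does not already define its own rules:

```yaml
tf/format:
  extends: .tf/format
  stage: verify
  rules:
    # Run only in merge request pipelines.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
```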
build Stage
Purpose: Build and push container images. Runs on the main branch and on merge requests (MRs).
- docker/bake: Builds all container images defined in docker-bake.hcl (Alloy, Grafana, etc.). Whether it also pushes them depends on the pipeline:
  - On MRs: It builds but does not push to ECR.
  - On main: It builds and pushes, tagging each image with both :latest and the $CI_COMMIT_SHA.
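The build-only versus build-and-push split could be expressed with rules-scoped variables, as in the standalone sketch below. The real job extends the central .docker/bake template, whose internals are not shown in this document; BAKE_ARGS is a hypothetical variable, and the :latest and $CI_COMMIT_SHA tags are assumed to be defined in docker-bake.hcl.

```yaml
docker/bake:
  stage: build
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      variables:
        BAKE_ARGS: ""          # hypothetical variable: build only
    - if: $CI_COMMIT_BRANCH == "main"
      variables:
        BAKE_ARGS: "--push"    # build and push to the registry
  script:
    - docker buildx bake --file docker-bake.hcl ${BAKE_ARGS}
```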
trigger Stage
Purpose: Run Terraform child pipelines to update infrastructure.
- tf/infrastructure: Triggers the Terraform pipeline for the main infrastructure/ directory. This runs first.
- tf/grafana_datasources: Triggers the Terraform pipeline for Grafana provisioning. This job uses the needs: keyword to depend on tf/infrastructure, ensuring it only runs after the core infrastructure is applied.
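The ordering between the two trigger jobs could be sketched as follows. The folder path and the child pipeline location are placeholders. The sketch also assumes that tf/infrastructure is defined with strategy: depend, since otherwise a trigger job is marked successful as soon as its child pipeline is created rather than when it finishes.

```yaml
tf/grafana_datasources:
  stage: trigger
  needs:
    - job: tf/infrastructure   # wait for the core infrastructure job
  variables:
    CCL_TEMPLATES_TERRAFORM_BASE_JOBS_FOLDER_PATH: "grafana-datasources/"  # placeholder path
  trigger:
    include:
      - project: "example-group/ci-cd-library"      # placeholder project path
        ref: main                                   # placeholder ref
        file: "/pipelines/terraform.gitlab-ci.yml"  # placeholder file name
    strategy: depend
```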
deploy Stage
Purpose: Deploy applications and configurations. Runs on main branch only.
These jobs use the changes: keyword to ensure they only run when their specific code is modified.
- deploy_cloudwatch_metrics_collector:
  - Trigger: Changes to alloy/
  - Action: Runs ecspresso deploy for the sre-${ENV}-cloudwatch-metrics-collector service.
- deploy_metrics_proxy:
  - Trigger: Changes to alloy/
  - Action: Runs ecspresso deploy for the sre-${ENV}-metrics-proxy service.
- deploy_grafana:
  - Trigger: Changes to grafana/
  - Action: Runs ecspresso deploy for the sre-${ENV}-grafana service.
- deploy/dashboards:
  - Trigger: Changes to dashboards/
  - Action: Connects to the bastion and runs mise run dashboards:push to sync all dashboards to Grafana.
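A sketch of how the main-only and changes: conditions might be combined for one of these jobs. The glob pattern and the ecspresso.yml config name are assumptions; the script is reduced to the final deploy command.

```yaml
deploy_grafana:
  stage: deploy
  rules:
    # Run only on main, and only when files under grafana/ changed.
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - grafana/**/*
  script:
    - IMAGE_TAG=${CI_COMMIT_SHA} ecspresso deploy --config ecspresso.yml
```

When no rule matches, the job is not added to the pipeline at all, so a commit that only touches dashboards/ would skip this job.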
cleanup Stage
Purpose: Remove stale or unmanaged resources. Runs on main branch only.
- cleanup/grafana:
  - Trigger: Changes to dashboards/
  - Action: Connects to the bastion and runs mise run dashboards:prune. This is critical: it deletes any dashboards and empty folders in Grafana that are not present in the dashboards/ directory, ensuring this repo is the single source of truth.