Bastion Host & Networking

This document explains the architecture and usage of the AWS bastion host.

1. Architecture Overview

Why do we need a bastion?

We use OIDC with unprivileged runners in the central KNMI build account. In our view, the GitLab runners should not be the entity on which we control permissions or access; they should be 'dumb'. Instead, we control permissions on the short-lived identities used in OIDC. This creates a difficulty when our pipelines need to access resources inside our VPCs: our core services (Grafana, Prometheus, etc.) are deployed in private subnets with no public internet access. To interact with their APIs (e.g., from CI/CD runners or local developer machines), we must go through a secure entry point. For this we chose a simple bastion host as the single entry point.

How does it work?

We use a "zero-trust" bastion model. Instead of relying on SSH keys, we use AWS Systems Manager (SSM) Session Manager to provide secure port forwarding.

  • Infrastructure: A single EC2 instance (module.bastion_ec2_instance) is deployed by Terraform into the private subnets.
  • No Public IP: The bastion has no public IP address and no open inbound ports.
  • Access: All access goes through the AWS SSM API. Compared with SSH key-based access, this is more secure and fully auditable via CloudTrail.

A typical connection flow looks like this:

  1. A CI runner or local user (with valid AWS credentials) authenticates with the AWS SSM API.
  2. The user calls StartSession to request a port-forwarding session.
  3. SSM securely connects to the bastion instance (via the SSM agent).
  4. The bastion opens a connection to the final destination (e.g., the Grafana ALB) on the private network.
  5. The user's local port is now securely tunneled to the private service.
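
The flow above can be approximated with a single AWS CLI call. The following is a minimal sketch, not the actual mise task: it assumes the tunnel is built on the AWS-managed SSM document AWS-StartPortForwardingSessionToRemoteHost, and the function names and instance ID are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the tunnel uses the AWS-managed SSM document
# AWS-StartPortForwardingSessionToRemoteHost. Function names are hypothetical.

# Build the JSON parameters for the port-forwarding document.
ssm_forward_params() {
  local remote_host="$1" remote_port="$2" local_port="$3"
  printf '{"host":["%s"],"portNumber":["%s"],"localPortNumber":["%s"]}' \
    "$remote_host" "$remote_port" "$local_port"
}

# Open the tunnel: local port -> bastion -> remote host:port.
start_tunnel() {
  local instance_id="$1" remote_host="$2" remote_port="$3" local_port="$4"
  aws ssm start-session \
    --target "$instance_id" \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters "$(ssm_forward_params "$remote_host" "$remote_port" "$local_port")"
}
```

With a placeholder instance ID, a call might look like: start_tunnel i-0123456789abcdef0 sre.dev.knmi.cloud 443 4380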

2. CI/CD Integration (.connect-bastion)

In GitLab CI, any job that needs to talk to a private service (like Grafana) simply declares extends: .connect-bastion. This job template, defined in /.gitlab/ci/setup-bastion-connection.gitlab-ci.yml, performs a very specific setup.

The /etc/hosts Trick

The most critical piece of this job is the before_script. It performs two key actions:

  1. It modifies /etc/hosts:

    HOST_IP="127.0.0.1"
    HOST_ENTRY="$GRAFANA_DOMAIN"
    echo "${HOST_IP} ${HOST_ENTRY}" >> /etc/hosts
    

    Why? This tricks the runner's OS. When a tool like gcx tries to connect to https://sre.dev.knmi.cloud, the OS resolves this domain name to 127.0.0.1.

  2. It starts the tunnel on port 443:

    mise run connect-bastion --local-port 443 --service "$GRAFANA_DOMAIN"
    

    Why? This command (which calls the mise task) starts the SSM port forwarding. It maps the runner's local port 443 to the bastion, which then forwards the traffic to the real Grafana ALB.

The result: When gcx sends a request to https://sre.dev.knmi.cloud, the OS sends it to 127.0.0.1:443. The SSM tunnel picks it up and forwards it. This also correctly preserves the Host: sre.dev.knmi.cloud header, which our AWS ALB needs to route the request to the correct service.
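
Because the before_script appends to /etc/hosts unconditionally, running it repeatedly on a long-lived machine would accumulate duplicate entries. The sketch below shows one possible idempotent variant; add_hosts_entry is a hypothetical helper and is not part of the job template.

```shell
#!/usr/bin/env bash
# Sketch only: an idempotent variant of the /etc/hosts step.
# add_hosts_entry is a hypothetical helper, not part of the job template.
add_hosts_entry() {
  local hosts_file="$1" ip="$2" host="$3"
  # Append only when an identical line is not already present.
  if ! grep -qxF "${ip} ${host}" "$hosts_file" 2>/dev/null; then
    echo "${ip} ${host}" >> "$hosts_file"
  fi
}
```

In the job this could be called as: add_hosts_entry /etc/hosts 127.0.0.1 "$GRAFANA_DOMAIN"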

3. Local Development Setup

Maintainers can use the same mise task to connect from their local machines.

Prerequisites

  • mise installed and configured
  • AWS CLI configured with appropriate permissions
  • Session Manager plugin installed (automatically handled by the script)

How to Connect

  1. Run the mise task:

    # This will use the default ports
    # Connects local 4380 -> bastion -> sre.dev.knmi.cloud:443
    mise run connect-bastion
    

    The task will run in the background. You are now connected.

  2. Access Grafana: You can now access Grafana by pointing your tools to https://localhost:4380.

    • Browser: Open https://localhost:4380 (you will need to bypass the SSL certificate warning, as the cert is for sre.dev.knmi.cloud, not localhost).
    • CLI:

      # Set the server URL and token
      export GRAFANA_SERVER="https://localhost:4380"
      export GRAFANA_TOKEN="your-api-token"
      
      # Run your commands
      gcx dashboard list
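
The certificate warning appears because the URL says localhost while the certificate is issued for sre.dev.knmi.cloud. For quick checks with curl, one way around this is to keep the real hostname in the URL and redirect only the TCP connection with --connect-to. This is a minimal sketch: grafana_health is a hypothetical helper, and it assumes the tunnel is listening on the default local port 4380 and that Grafana's /api/health endpoint is reachable through the ALB.

```shell
#!/usr/bin/env bash
# Sketch only: reach Grafana through the local tunnel while keeping the real
# hostname, so TLS validation and the Host header both match.
# grafana_health is a hypothetical helper.
grafana_health() {
  local domain="${1:-sre.dev.knmi.cloud}" local_port="${2:-4380}"
  # --connect-to redirects only the TCP connection; the URL, SNI and Host
  # header still use the real domain.
  curl -fsS --connect-to "${domain}:443:localhost:${local_port}" \
    "https://${domain}/api/health"
}
```

Calling grafana_health with no arguments uses the defaults above; pass a domain and port to override them.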
      

4. Troubleshooting

Port already in use

wait-for-port: timed out waiting for port 4380 to be available

This means another process is already using the local port.

Solution: Check what is using the port, or specify a different local port.

# Check what's using the port
lsof -i :4380

# Kill the process, or just use a different port
mise run connect-bastion --local-port 4381
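
Instead of picking a replacement port by hand, the first free port can be found with a small loop. This is a minimal sketch: both helpers are hypothetical, rely on bash's /dev/tcp pseudo-device, and only detect listeners on 127.0.0.1.

```shell
#!/usr/bin/env bash
# Sketch only: pick the first local port that nothing is listening on.
# Both helpers are hypothetical and rely on bash's /dev/tcp pseudo-device.
port_in_use() {
  # The connect succeeds only if something is listening on 127.0.0.1:$1.
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

first_free_port() {
  local port="$1"
  while port_in_use "$port"; do
    port=$((port + 1))
  done
  echo "$port"
}
```

This could then be combined with the task as: mise run connect-bastion --local-port "$(first_free_port 4380)"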

Session Manager plugin missing

The script attempts to automatically download and install the plugin, but if you encounter issues, you can install it manually.

# For macOS (arm64)
curl -fsSL "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/mac_arm64/session-manager-plugin.pkg" -o "session-manager-plugin.pkg"
sudo installer -pkg session-manager-plugin.pkg -target /

# For Debian/Ubuntu Linux (x86_64)
curl -fsSL "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb" -o "session-manager-plugin.deb"
sudo dpkg -i session-manager-plugin.deb

Access Denied / Instance Not Found

An error occurred (TargetNotConnected) when calling the StartSession operation: InstanceID *bastion* does not exist or is not connected to Systems Manager

This is almost always an AWS authentication error.

Solution: Check that you are authenticated to the correct AWS account (dev or ops).

  • Ensure your assumed role has ssm:StartSession permissions on the bastion instance.
  • Ensure the gotoaws tool is picking up the correct instance ID. You may need to be more specific if wildcards fail: mise run connect-bastion --instance <instance-id>
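
The checks above can be run from the command line before retrying the tunnel. This is a minimal sketch using two read-only AWS CLI calls; diagnose_bastion is a hypothetical helper. If the bastion is missing from the second listing, the credentials likely point at the wrong account or region; a PingStatus other than Online would suggest the SSM agent on the instance is not reachable.

```shell
#!/usr/bin/env bash
# Sketch only: two read-only checks before retrying the tunnel.
# diagnose_bastion is a hypothetical helper.
diagnose_bastion() {
  # Which account and role do the current credentials belong to?
  aws sts get-caller-identity --query '[Account, Arn]' --output text
  # Which instances are registered with SSM, and is their agent online?
  aws ssm describe-instance-information \
    --query 'InstanceInformationList[].[InstanceId, PingStatus]' \
    --output text
}
```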