Terraform State at Scale: Isolation Patterns, Drift, and the Monorepo Trap

June 20, 2025


State Is Your Infrastructure's Source of Truth

Terraform state is not a cache. It is the authoritative record of what Terraform believes exists in your infrastructure. When state diverges from reality (someone clicked in the console, a resource was deleted out-of-band, or an apply was interrupted mid-run), Terraform's next plan will be wrong. It will try to create things that already exist, or ignore things that have drifted. At scale, unmanaged drift is a leading cause of infrastructure incidents that take hours to diagnose because no one is sure what the actual state of production is.

Designing state correctly from the start is the highest-leverage decision you make in a Terraform-based platform. It determines your blast radius, your team autonomy, your deploy velocity, and how badly an interrupted apply can hurt you.

The Blast Radius Principle

Every state file defines a blast radius: the set of resources that a single botched terraform apply can destroy or misconfigure. A single state file containing your entire infrastructure (VPC, EKS cluster, RDS, IAM roles, Route53 records) means a single apply error can take down everything simultaneously.

The primary design goal of state isolation is blast radius containment. The rule: no single state file should contain resources whose simultaneous failure would breach your SLO.

In practice, this means splitting state by three dimensions:

  • Environment: dev, staging, and prod are never in the same state file. Applying to staging should never touch prod state.
  • Layer: foundation (VPC, subnets, peering) is separate from platform (EKS, RDS), which is separate from workload (application deployments, DNS records). Layers have an explicit dependency direction: workload depends on platform, which depends on foundation, never the reverse.
  • Domain: within a layer, separate by team or service boundary. The payments team should not need to coordinate with the auth team to apply changes to their own infrastructure.

A typical layout for a medium-sized platform:

infra/
  foundation/
    prod/     # VPC, subnets, TGW, DNS zones
    staging/
  platform/
    prod/     # EKS, RDS, ElastiCache, IAM roles
    staging/
  workloads/
    payments/
      prod/   # payments-specific resources
      staging/
    auth/
      prod/
      staging/
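
One way to map this layout onto state files is to give each leaf directory its own backend key. The block below is a minimal sketch, assuming an S3 backend; the file path is illustrative, and the bucket name and region are placeholders matching the remote state example later in this post.

```hcl
# infra/platform/prod/backend.tf (illustrative path)
terraform {
  backend "s3" {
    bucket  = "my-tfstate"                      # placeholder bucket name
    key     = "platform/prod/terraform.tfstate" # one key per leaf directory
    region  = "eu-west-1"
    encrypt = true
  }
}
```

Under the same scheme, workloads/payments/prod would use the key "workloads/payments/prod/terraform.tfstate", so an apply in one directory can only ever write one state object.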

Remote State References: The Right Way to Cross Boundaries

Once you have multiple state files, you need a way for higher-layer modules to consume outputs from lower-layer modules. For example, a workload module needs the VPC ID and subnet IDs from the foundation state.

The wrong approach is hardcoding IDs. They change, and hardcoded values create invisible dependencies that break on the next foundation apply.

The right approach is Terraform's terraform_remote_state data source:

data "terraform_remote_state" "foundation" {
  backend = "s3"
  config = {
    bucket = "my-tfstate"
    key    = "foundation/prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

locals {
  vpc_id     = data.terraform_remote_state.foundation.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.foundation.outputs.private_subnet_ids
}

This creates a soft dependency: the workload plan reads the current outputs of the foundation state at plan time. If the foundation hasn't been applied yet, the plan fails with a clear error. If the foundation outputs change (for example, the VPC is replaced), the next workload plan picks up the new values and shows the resulting changes, which then need their own apply.
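
For the lookup above to resolve, the foundation configuration has to declare matching root-level outputs. A sketch of what that might look like; the resource names aws_vpc.main and aws_subnet.private are assumptions about how the foundation layer is written, not something the consumer can see.

```hcl
# foundation/prod/outputs.tf (illustrative)
# Assumes the foundation layer defines aws_vpc.main and a counted or
# for_each'd aws_subnet.private; both names are hypothetical.
output "vpc_id" {
  description = "ID of the shared VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = [for s in aws_subnet.private : s.id]
}
```

Only root module outputs are visible through terraform_remote_state, so anything a workload needs from a nested module has to be re-exported at the root like this.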

The trade-off: although terraform_remote_state only surfaces the root module's outputs, the consumer needs read access to the entire state snapshot. This is a security concern if state contains secrets (database passwords, API keys). Keep outputs limited to what downstream modules need, and consider using SSM Parameter Store or Secrets Manager for secrets rather than Terraform outputs, so that consumers never need access to the state bucket for them.
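
As a sketch of the Parameter Store route: the producer publishes a single value under a well-known name, and the consumer reads only that parameter, so IAM can be scoped to a parameter path instead of the state bucket. The parameter naming scheme and aws_vpc.main are assumptions for illustration.

```hcl
# Producer side (foundation state): publish one value under a well-known name.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infra/prod/foundation/vpc_id" # hypothetical naming scheme
  type  = "String"
  value = aws_vpc.main.id                 # assumes the foundation defines aws_vpc.main
}

# Consumer side (workload state): read only that parameter.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/prod/foundation/vpc_id"
}

locals {
  vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

The AWS provider marks the data source's value as sensitive, so plans will redact it; for non-secret values such as a VPC ID, wrapping it in nonsensitive() is one way to keep plan output readable.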

Module Versioning: Git Tags Over Registry for Most Teams

Terraform modules should be versioned. The question is how. The Terraform Registry (public or private) is the official answer, but it has overhead: publishing, version bumps, registry authentication. For most platform teams, Git tag references are the pragmatic choice:

module "eks_cluster" {
  source  = "git::https://github.com/my-org/terraform-modules.git//eks?ref=v2.4.1"
  # ...
}

Git tags give you version pinning with familiar tooling and no registry infrastructure to maintain. Tags can be moved, so protect them in your Git host, or pin ref to a full commit SHA when you need strict immutability. The convention: v{major}.{minor}.{patch} with semantic versioning, where breaking interface changes bump major, new optional variables bump minor, and fixes bump patch.

The monorepo trap: putting all modules and all environment configurations in one repository creates pressure to use local path references (source = "../../modules/eks") rather than versioned references. This feels convenient until you have multiple environments that should be on different module versions. The staging environment should test a new module version before it rolls to prod. That workflow is impossible with local paths because both environments always use whatever is currently on disk.

The solution: separate module repos (or a modules monorepo with its own tag-based versioning) from environment configuration repos. Environment configs pin module versions explicitly. Rolling a new module version out to an environment is a deliberate PR, not an accidental side effect of a different change.
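
With versioned sources, the staged rollout looks roughly like this: staging moves to the new tag first, and prod follows in a separate PR. The file paths and the v2.5.0 tag are hypothetical; the two blocks live in different root configurations.

```hcl
# --- environments repo: platform/staging/main.tf (trying the new release) ---
module "eks_cluster" {
  source = "git::https://github.com/my-org/terraform-modules.git//eks?ref=v2.5.0"
  # ...
}

# --- environments repo: platform/prod/main.tf (still on the proven release) ---
module "eks_cluster" {
  source = "git::https://github.com/my-org/terraform-modules.git//eks?ref=v2.4.1"
  # ...
}
```

Promoting to prod is then a one-line diff on the ref, which is easy to review and easy to revert.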

Drift Detection: Plan in CI, Always

Drift is inevitable. Engineers will make console changes to unblock an incident. Cloud providers will migrate resources. Auto-scaling will create resources outside Terraform's control. The question is whether you know about drift before it becomes an incident.

The pattern: run terraform plan against every production state file on a schedule (daily is common, hourly for critical infrastructure). Treat a non-empty plan as an alert. Pipe the plan output to a Slack channel or PagerDuty. Review drifted resources and either reconcile them (terraform import, terraform apply) or formally exclude them (lifecycle { ignore_changes = [...] } if the drift is intentional).

In GitHub Actions:

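- name: Drift detection plan
  id: plan
  working-directory: infra/platform/prod
  run: |
    terraform init -input=false -backend-config=prod.backend.hcl
    # Record the plan's exit code as a step output instead of failing the step.
    set +e
    terraform plan -detailed-exitcode -input=false -out=plan.tfplan
    echo "exitcode=$?" >> "$GITHUB_OUTPUT"

- name: Notify on drift
  if: steps.plan.outputs.exitcode == '2'
  uses: slackapi/slack-github-action@v1
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}  # assumed secret name
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
  with:
    payload: |
      {"text": "⚠️ Drift detected in platform/prod, review required"}

- name: Fail on plan error
  if: steps.plan.outputs.exitcode == '1'
  run: exit 1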

Exit code 2 from terraform plan -detailed-exitcode means changes are pending. Exit code 0 means clean. Exit code 1 means an error (backend unavailable, authentication failure); treat this as a critical alert too.
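
When a scheduled plan keeps flagging a change that is actually intentional, the formal exclusion mentioned above is a lifecycle block on the resource. A minimal sketch, assuming an Auto Scaling group whose desired capacity is adjusted at runtime by scaling policies; the resource name, sizes, and variables are hypothetical.

```hcl
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "workers" {
  name                = "workers" # hypothetical name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies own this value at runtime; do not report it as drift.
    ignore_changes = [desired_capacity]
  }
}
```

Keep the ignore list as narrow as possible. Every attribute listed there is one the drift job can no longer warn you about.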

The State Surgery Toolkit

Even with good design, you will eventually need to perform state surgery: moving resources between state files after a refactor, recovering from a corrupted apply, or importing manually-created resources. The essential commands:

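terraform state mv: moves a resource to a new address within a state file. Use this when refactoring module structure without recreating resources. Moving between state files relies on the legacy -state and -state-out flags, which operate on local state files only; with a remote backend, pull both states to disk first, perform the move, and push them back.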

terraform state rm followed by terraform import: the escape hatch. Remove a resource from state and re-import it under a corrected address. Necessary when a state entry is corrupted or points at the wrong object. For a resource that was created outside Terraform and was never in state, terraform import alone is enough.

terraform state pull and terraform state push: directly read and write the raw state JSON. Use with extreme caution, and only after saving a backup copy of the pulled state. Never push state manually in production without a tested rollback.
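
For changes that stay within a single state file, newer Terraform versions offer declarative counterparts to some of these commands, which have the advantage of going through code review and plan like any other change. This is an alternative to the CLI commands above, not a replacement for them. The resource addresses and bucket name below are hypothetical.

```hcl
# Refactor: the security group moved into a module. Terraform updates the
# state address on the next apply instead of destroying and recreating it.
# Requires Terraform 1.1 or later.
moved {
  from = aws_security_group.app
  to   = module.network.aws_security_group.app
}

# Adopt a bucket that was created by hand. Requires Terraform 1.5 or later.
import {
  to = aws_s3_bucket.logs
  id = "my-org-prod-logs" # hypothetical bucket name
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-org-prod-logs"
}

# Stop managing a resource without destroying it. Requires Terraform 1.7 or later.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```

None of these blocks can move a resource between two state files; that case still needs the CLI workflow.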

One invariant to never break: always enable state locking. With the S3 + DynamoDB backend, Terraform acquires a DynamoDB lock before any operation that could write state. Without locking, two concurrent applies can produce a split-brain state file that is painful to recover from. If a lock is stuck (an interrupted apply left a lock entry), verify the apply is truly not running before forcibly releasing it with terraform force-unlock <LOCK_ID>, using the lock ID printed in the error message.
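
A sketch of the lock table the S3 backend expects, assuming it is managed in a small bootstrap configuration of its own; the table name is hypothetical. The backend requires a string partition key named LockID.

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # hypothetical table name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"          # the S3 backend looks for exactly this key

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Each backend block then references it with dynamodb_table = "terraform-locks". Recent Terraform releases also support S3-native locking through the backend's use_lockfile argument, so check the backend documentation for the version you run before adding a new table.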

What Good Looks Like

A mature Terraform platform has state files sized so that a single botched apply affects at most one service in one environment. Remote state references cross layer boundaries with explicit, versioned outputs. Modules are tagged and pinned; upgrading to a new version is a deliberate, reviewable action. Drift detection runs on a schedule and alerts before anyone notices in production. State files live in versioned, encrypted, access-logged backends. And when something goes wrong, because it will, the team has practised state surgery in a lower environment and knows exactly which commands to run.

Infrastructure as Code is only as reliable as the discipline around it. State design is where that discipline starts.

