Terraform Implementation Guide
Technical implementation reference for helping customers build production-grade Terraform environments. Use this after the Discovery Guide — this is the how once the what has been decided.
Local state is a singleton — only one person can own it. No locking, no collaboration, no recovery if the machine dies. Never advise this for team or production use.
State files contain a map of your infrastructure and sensitive data in plain text. Never commit to version control.
Four Decisions to Address Before Using Local State
- If your laptop dies, Terraform loses its memory entirely
  - Without the state file, you must manually `import` every resource back into a new state file
  - Decision: manual backup strategy, or migrate to remote immediately
- State files are plain-text JSON — RDS passwords and API keys stored in clear text on disk
  - Does your local machine have full-disk encryption?
  - If dealing with RDS, API keys, or sensitive metadata: use `sops` or migrate to a remote backend with encryption
- Will anyone else ever need to run this code?
  - If yes: you are building a bottleneck — state files emailed back and forth is a failure mode
  - Decision: migrate to a remote backend before the team grows
- Local state has no robust state locking
  - Two simultaneous `terraform apply` runs can corrupt the state file
  - You must enforce: never run concurrent local operations
Migrating Local to Remote
Add a `backend` block to your configuration and run `terraform init`; Terraform detects the existing local `.tfstate` file and offers to copy it into the new backend.

| Option | Best For | Setup Speed | Cost | Visibility |
|---|---|---|---|---|
| Cloud-Native (S3, Azure Blob, GCS) | Small-to-medium teams already embedded in a cloud provider | Medium | Minimal — pennies for storage | CLI or Cloud Console only |
| Managed Platform (HCP Terraform, Spacelift, Scalr) | Teams needing audit logs, UI-based management, sophisticated access controls | Fast | Free tier → per-resource or per-seat | Rich Web UI with history & diffs |
| Self-Managed Platform (Terraform Enterprise) | Air-gapped or highly regulated; no-SaaS policy | Slow | License-based (per workspace) | Rich Web UI with history & diffs |
| HTTP / CI-Integrated (GitLab Managed TF State) | Orgs wanting to centralize everything inside their Git provider | Medium | Included in GitLab tier | GitLab UI |
Cloud-Native Backend — Per Provider
AWS (S3)
- State stored in S3 bucket
- Locking: historically required a DynamoDB table — as of late 2024/2025, Terraform introduced native S3 locking, making DynamoDB optional for newer versions
- Enable bucket versioning — required for state recovery
- Enable SSE-KMS encryption at rest
- Block all public access on the bucket

Azure (Blob Storage)
- State stored in an Azure Storage Account blob container
- Locking handled natively via blob leases — no extra service required
- Enable blob versioning for state recovery
- Use a private endpoint so traffic never hits the public internet
- Enable Customer-Managed Key (CMK) encryption at rest

Google Cloud (GCS)
- State stored in a GCS bucket
- Locking handled natively by the bucket itself — no extra service required
- Enable object versioning for state recovery
- Enable "Uniform" bucket-level access control
- Enable CMEK encryption at rest
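Minimal backend sketches for the other two providers (resource group, storage account, and bucket names are hypothetical):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"  # hypothetical
    storage_account_name = "stacmetfstate"       # hypothetical
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
    use_azuread_auth     = true                  # Entra ID auth instead of access keys
  }
}

# A configuration can declare only one backend, so the GCS
# equivalent is shown commented out:
# terraform {
#   backend "gcs" {
#     bucket = "acme-tfstate-prod"               # hypothetical
#     prefix = "networking"                      # state path within the bucket
#   }
# }
```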
HCP Terraform: Handles state, locking, and remote runs. Plan/Apply happens on HashiCorp servers. Changed to resource-based pricing — if you have thousands of small resources, do the math vs an S3 bucket at $0.50/month.
Spacelift / Scalr: More advanced policy-as-code (OPA) and workflow orchestration than HCP Terraform, and stronger OpenTofu support for enterprise needs.
Ask these four questions to narrow the backend decision:
TFE requires three main external services to remain stateless and resilient. These must be provisioned and managed by the customer's platform team before TFE installation.
| Layer | AWS Service | Configuration Notes |
|---|---|---|
| Compute | EC2 Instance or EKS (Kubernetes) | Min 4 vCPU / 8GB RAM. Production: 8–16 vCPU / 32GB+ RAM. Runtime: Docker Engine or Kubernetes. |
| Database | RDS Multi-AZ (PostgreSQL v12–v16) | Use Multi-AZ for HA. Stores user accounts, workspace settings, run history. Does NOT store state files. |
| Object Storage | S3 Bucket | Versioned objects + SSE-KMS encryption. Stores all .tfstate files, plan files, run logs, and config code. |
| Identity | IAM Roles (Instance Profile) | TFE app writes to S3 and talks with RDS without hardcoded Access Keys. |
| Network | VPC + NLB or ALB | VPC across at least two Private Subnets. TFE requires HTTPS — terminate SSL at the Load Balancer (recommended). |
| Secrets | Vault (recommended) or Secrets Manager + KMS | Service credentials, TFE license, encryption key (enc-password), TLS certs. |
| Redis (Active-Active) | ElastiCache for Redis | Required for multi-node Active/Active. Coordinates the Run Queue between nodes. |
| Layer | Azure Service | Configuration Notes |
|---|---|---|
| Compute | Azure VM or AKS (Kubernetes) | Min 4 vCPU / 8GB RAM. Create a VNet with a dedicated subnet for AKS or VMSS. |
| Database | Azure Database for PostgreSQL Flexible Server | Flexible Server preferred for TFE's performance needs. PostgreSQL v12–v16. |
| Object Storage | Azure Blob Storage | Use a Storage Account with a private endpoint so traffic never hits the public internet. |
| Identity | User-Assigned Managed Identity | Assign to VM/Pod to handle authentication to Storage Account — no hardcoded keys. |
| Network | VNet + AKS subnet or VMSS | HTTPS required. SSL termination at Load Balancer recommended. |
| Secrets | Vault (recommended) or Azure Key Vault | Service credentials, TFE license, encryption key, TLS certs. |
| Redis (Active-Active) | Azure Cache for Redis | Required for multi-node Active/Active. Coordinates the Run Queue between nodes. |
| Layer | GCP Service | Configuration Notes |
|---|---|---|
| Compute | Compute Engine or GKE (Kubernetes) | Min 4 vCPU / 8GB RAM. Create a VPC with Private Service Connect to reach Cloud SQL. |
| Database | Cloud SQL for PostgreSQL | Use a Private IP address. PostgreSQL v12–v16. |
| Object Storage | Cloud Storage (GCS) Bucket | Enable "Uniform" bucket-level access. Stores all state files, plan files, run logs. |
| Identity | Google Service Account | Assign roles/storage.objectAdmin and roles/cloudsql.client permissions. |
| Network | VPC + Private Service Connect | HTTPS required. SSL termination at Load Balancer recommended. |
| Secrets | Vault (recommended) or Secret Manager + KMS | Service credentials, TFE license, encryption key, TLS certs. |
| Redis (Active-Active) | Memorystore for Redis | Required for multi-node Active/Active. Coordinates the Run Queue between nodes. |
Directory-Based Isolation
- Physical separation of code per environment
- Easiest to use different module versions per environment — controlled updates
- Transparent visibility — browse the repo to see all environments
- Distributed across different backend paths — strong isolation
- Higher code redundancy — shared files copied across folders
- High environment variability if dev genuinely differs from prod (often a good thing)
With directory isolation, you can assign different IAM roles or service accounts per environment directory. The CI/CD pipeline for the prod/ folder can be restricted to only allow the "Production" service account. This is a hard security boundary — workspaces cannot do this.
Workspaces are NOT a security boundary. All workspaces in a directory share the same backend configuration — anyone with access to run Terraform in that folder can access the state of any workspace. If you need strict IAM permissions separating Dev and Prod, directory-based isolation is required.
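A directory-isolated layout might look like this (repo and folder names are illustrative):

```
infra-payments/
├── modules/            # shared code, or consume versioned modules from a registry
├── dev/
│   ├── backend.tf      # points at the dev state path
│   └── main.tf
├── staging/
└── prod/
    ├── backend.tf      # separate state path; CI restricted to the prod service account
    └── main.tf
```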
Workspace-Based Isolation
- Low code redundancy — same code for all environments
- No environment variability by design
- Centralized on one backend path
- High risk of accidental "prod" changes if you forget to switch workspaces
- Hidden visibility — can't see which environments exist without running a command
- Shared credentials across all environments
Valid Workspace Use Cases
Workspace-based environments are better when you need to deploy identical logic multiple times:
- Preview environment for each Pull Request
  - Deploy: `terraform workspace new pr-123`
  - Delete: `terraform workspace delete pr-123` when the PR is merged
- SaaS provider deploying identical app infra per region
  - Set region in the provider using `var.region_map[terraform.workspace]`
  - Same logic, different regional execution context
- Multi-tenant isolation
  - Customer is the unit of isolation
  - Prevents drift between customer environments
  - Each workspace = one tenant
- Blue/green environment cutover
  - New version of the entire environment alongside the existing one for cutover
  - Swap traffic, validate, then destroy the old workspace
  - Requires careful state management during transition
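The per-region pattern above can be sketched like this (workspace names and the region map are illustrative):

```hcl
variable "region_map" {
  type = map(string)
  default = {
    us   = "us-east-1"
    eu   = "eu-west-1"
    apac = "ap-southeast-1"
  }
}

# The active workspace (e.g., created with `terraform workspace new eu`)
# selects the deployment region; the logic stays identical.
provider "aws" {
  region = var.region_map[terraform.workspace]
}
```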
Goal: Enforce company standards and compliance. Create thin wrappers around single resources or closely related resources that encode the "Golden Resource" — e.g., "Every S3 bucket must have encryption."
Rule: Never create one-to-one wrappers that expose every single provider argument. If you can't find a reason to change a default, don't expose it as a variable.
Owned by: Security / Platform Team — very low change frequency, global blast radius.
Tier 1 Examples
- Hardened storage bucket:
  - AES-256 encryption enforced
  - Block Public Access enforced
  - 90-day versioning lifecycle
- Hardened storage account:
  - Enforces TLS 1.2
  - Disables shared key access
  - Requires Private Endpoint connectivity
- Hardened workload identity:
  - Automatically attaches a Boundary policy
  - Adds standard owner tags
  - Configures a Federated Identity Credential for OIDC by default
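A Tier 1 "golden bucket" wrapper might look like the sketch below: a thin module that exposes only a name and hard-codes the compliance settings (module layout is illustrative):

```hcl
variable "bucket_name" {
  type        = string
  description = "The only decision the consumer gets to make"
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# Encryption enforced; deliberately not exposed as a variable
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block Public Access enforced
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}
```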
Goal: Provide a "best practice" implementation of a common architecture pattern. Collections of Tier 1 modules that form a complete service.
Rule: These should be opinionated — they define how your company builds a web server or a database cluster. Not every argument should be exposed.
Owned by: Platform / SRE Team — low change frequency, medium (service-level) blast radius.
Tier 2 Examples
- Standard network:
  - VPC with public/private subnets
  - NAT Gateways + Route Tables with Network ACLs
  - Connected to the Corporate Hub with Transit Gateway / VPC Peering
  - IPAM integration
- Standard database:
  - RDS Instance + Subnet Group
  - Security Group rules
  - Randomized credentials stored in Secrets Manager
Goal: These are the "root modules" that developers call. They combine Infrastructure Modules to deploy a full environment. Abstract away all complex logic — a developer should only need to provide `app_name` and `environment`.
Owned by: App Developers — high change frequency (daily/weekly), low (app-level) blast radius.
Tier 3 Example: E-Commerce Checkout Service
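A sketch of what such a root module might look like (registry paths, module names, and inputs are all hypothetical):

```hcl
variable "app_name" {
  type    = string
  default = "checkout"
}

variable "environment" {
  type = string
}

# Tier 2 modules do the heavy lifting; the developer supplies two inputs
module "network" {
  source  = "app.terraform.io/acme/standard-vpc/aws" # hypothetical Tier 2 module
  version = "~> 2.1"

  name        = var.app_name
  environment = var.environment
}

module "database" {
  source  = "app.terraform.io/acme/standard-rds/aws" # hypothetical Tier 2 module
  version = "~> 1.4"

  name       = var.app_name
  subnet_ids = module.network.private_subnet_ids
}
```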
Tier Comparison
| Feature | Tier 1: Bricks | Tier 2: Walls | Tier 3: Buildings |
|---|---|---|---|
| Owned by | Security / Platform | Platform / SRE | App Developers |
| Change Frequency | Very Low | Low | High (daily/weekly) |
| Blast Radius | Huge (Global Impact) | Medium (Service Impact) | Low (App Impact) |
| Versioning Goal | Strict SemVer | Feature-based releases | Environment-based tags |
Using Git tags (e.g., v1.2.0) is essential so a change to a module doesn't instantly break every project consuming it. If a user upgrades their module, they should know exactly what the risk is by looking at the version number.
Conventional Commits Workflow
Require developers to use a standard Git commit message format. Tools: commitlint (enforce format), Commitizen (generate messages), Semantic-Release (automate version bumps), Husky (git hooks to block non-compliant commits).
Required Module Guardrail — versions.tf
Every module must have a versions.tf file that restricts the Terraform and Provider versions it supports:
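A typical versions.tf guardrail (version numbers are illustrative):

```hcl
terraform {
  # The Terraform CLI versions this module is tested against
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # any 5.x release, but never 6.0
    }
  }
}
```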
Automation Tools for Consistency
- Use terraform-docs via pre-commit hook or GitHub Action
- Every time a variable is added or changed, the README.md updates automatically
- Prevents "documentation drift" where README and code disagree
- Use Renovate Bot to scan Terraform code for newer module versions
- Automatically opens a PR to upgrade — combine with `terraform test` or Terratest
- "Patch" and "minor" upgrades can be auto-merged with high confidence when tests pass
| Source Mechanism | Recommendation | Why |
|---|---|---|
| Local Path (`./modules/x`) | Avoid for shared code | Hard to version — changes affect everyone instantly. Only appropriate for tightly-coupled, single-repo code. |
| Git Tag (`ref=v1.2.0`) | Good for Startups | Simple to set up; immutable once tagged. No registry infrastructure required. |
| Private Registry (HCP / TFE) | Best for Enterprise | Native versioning UI, security scanning, "official" badges. Single source of truth for all teams. |
HCL identifiers are the names of resources inside your code — not what appears in the AWS/Azure console.
- Don't repeat the resource type in the identifier: `resource "aws_s3_bucket" "s3_bucket_logs" {}` → Better: `resource "aws_s3_bucket" "logs" {}`
- When a module manages a single instance of a resource, name it `this` or `main`. This makes it predictable for anyone reading your code.

Physical resource names are what appears in the AWS/Azure/GCP console. Use kebab-case for physical names (opposite of the snake_case used for HCL identifiers).
[Org]-[Project]-[Env]-[Region]-[Resource]-[Suffix]
This ensures that even if someone sees a resource in the cloud console without any context, they know exactly what it is and who owns it.
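One way to implement the convention is a single local that every physical name builds on (variable names and the resulting prefix are illustrative):

```hcl
locals {
  # e.g., "acme-shop-prod-euw1"
  name_prefix = lower("${var.org}-${var.project}-${var.environment}-${var.region_short}")
}

resource "aws_s3_bucket" "logs" {
  bucket = "${local.name_prefix}-logs"   # e.g., "acme-shop-prod-euw1-logs"
}
```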
Module repositories:
- Convention: `terraform-<provider>-<name>`
- Examples: `terraform-aws-vpc`, `terraform-azure-aks`
- Follows the Standard Module Structure so Terraform registries can parse automatically

Environment repositories:
- Convention: `infra-<project>-<business-unit>`
- Polyrepo for shared modules (independent versioning)
- Monorepo or single repo for application environment configs (Dev/Staging/Prod side-by-side)
Standard Module Directory Structure
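The layout that Terraform registries expect:

```
terraform-aws-vpc/
├── README.md        # generated by terraform-docs
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf      # required_version + required_providers
├── examples/
│   └── complete/
│       └── main.tf  # a working invocation of the module
└── modules/         # optional nested submodules
```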
Provider Version Pinning — Lock at the Edge
~> 1.2.0 — allows 1.2.1, 1.2.9, but blocks 1.3.0. Use this for providers.
~> 1.2 — allows 1.3.0, 1.9.0, but blocks 2.0.0. Use this for reusable modules.
If you pin a library too tightly, you create dependency hell — Module A requires v1.1 and Module B requires v1.2, you can't have both in the same project. Ranges allow negotiation.
| Location | Pinning Strategy | Why |
|---|---|---|
| Root Module (code you actually run `apply` on) | Pin exactly: `1.2.5` or `~> 1.2.5` | 100% reproducibility — rebuild exactly as it was |
| Reusable Module (shared library / "brick") | Use ranges: `>= 1.2` or `~> 1.0` | Avoids dependency conflicts between consumers |
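The two strategies side by side (module paths and versions are illustrative):

```hcl
# In a root module: pin exactly for full reproducibility
module "network" {
  source  = "app.terraform.io/acme/network/aws"   # hypothetical
  version = "1.2.5"
}

# Inside a reusable module: declare a range so consumers can negotiate
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}
```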
Decide on a Global Tag policy early. Every resource must carry these tags. Create a "Global Tags" variable that is merged to every resource automatically.
- CostCenter: Internal budget code or Department ID (e.g., DEPT-402)
- BusinessUnit: High-level org unit (Marketing, Engineering)
- Project: Billing code or specific initiative name
- Owner: Team, email alias, or Slack channel — never personal names (people leave)
- TechnicalContact: Primary engineer responsible for the service
- Service/Application: Logical name of the application the resource belongs to
- Environment: Standardized values: prod, staging, dev, sandbox
- ManagedBy: Always set to "Terraform"
- ProvisionedBy: Specific Git repository or CI/CD pipeline that built the resource
- DataClassification: public, internal, confidential, pii
- Criticality: low, medium, high, mission-critical (for incident response prioritization)
- Compliance: PCI, HIPAA, SOC2 if resource is subject to regulatory scope
Implementation — AWS and Azure/GCP
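On AWS, `default_tags` on the provider applies the global set automatically; on Azure/GCP, a `merge()` over a shared local achieves the same (tag values and variable names are illustrative):

```hcl
# AWS: provider-level default_tags cover every taggable resource
provider "aws" {
  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      CostCenter  = var.cost_center
      Owner       = var.owner_alias
    }
  }
}

# Azure/GCP: merge a global map into each resource's tags/labels
locals {
  global_tags = {
    ManagedBy   = "Terraform"
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

resource "azurerm_resource_group" "this" {
  name     = "rg-${var.environment}"
  location = var.location
  tags     = merge(local.global_tags, var.extra_tags)
}
```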
Case Consistency: Pick your case requirements (lowercase, PascalCase) and enforce them with TFLint rules.
Normalization: Use a restricted list of valid values for tags like Environment — never allow freeform input.
No Sensitive Data in Tags: Never put IP addresses, passwords, or phone numbers in tags — they appear in cloud console search and billing exports.
Tagging policy requires enforcement at multiple layers. A policy that can be bypassed is not a policy.
Catch violations before the code is even pushed.
- tflint — fails the build if mandatory tags are missing from code
- checkov — security scanning that catches missing required tags
- pre-commit hooks — run `terraform fmt` and tflint before a developer can push
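A tflint configuration enforcing mandatory tags might look like this, using the `aws_resource_missing_tags` rule from the AWS ruleset (plugin version and tag list are illustrative):

```hcl
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.31.0"   # illustrative; pin to the current release
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "aws_resource_missing_tags" {
  enabled = true
  tags    = ["CostCenter", "Owner", "Environment", "ManagedBy"]
}
```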
Block non-compliant applies at the execution layer.
- HCP Terraform Sentinel — block any `terraform apply` that doesn't meet tagging criteria
- Open Policy Agent (OPA) — alternative policy-as-code for non-HCP environments
Last resort — prevents resource creation at the cloud provider level.
- AWS Tag Policies — SCP-level enforcement at the Org level
- Azure Policy — physically prevents a resource from being created without required tags
TFLint: Configurable with custom rules to fail a build if a resource name uses dashes instead of underscores, or if a variable is missing a description.
terraform-docs: Automatically generates README.md from variables and outputs. If a developer changes a variable name, documentation updates itself — eliminates documentation drift.
If you use a data source to fetch a password from Vault, that password gets written to terraform.tfstate and persists there permanently. Ephemeral Resources (TF 1.10+) solve this — the value is fetched at apply time and never written to state.
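A hedged sketch of the pattern, assuming a provider release that ships the ephemeral resource and write-only arguments shown (resource names, attributes, and the secret ID may differ in your provider version):

```hcl
# Fetched at apply time, never persisted to state (Terraform 1.10+)
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/master-password"   # hypothetical secret ID
}

resource "aws_db_instance" "this" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = "app"

  # Write-only argument: sent to AWS, never written to the state file
  password_wo         = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  password_wo_version = 1
}
```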
Only your CI/CD runner should have "Owner" / apply permissions. Developers should have "Read-Only" or "Plan-only" access. Humans applying directly from laptops in production is a control failure, not a workflow.
| Role | Access Level | Rationale |
|---|---|---|
| CI/CD Runner (Service Account) | Owner / Apply | The only entity that should run terraform apply in prod |
| Lead Engineer / SRE | Plan + Approve | Can review plans and approve runs, cannot directly apply |
| Developer | Plan-only | Can see what will change, cannot trigger applies in production |
| Security Auditor | Read-only | Can view state and run history for compliance purposes |
State files are plain-text JSON. They contain passwords, IP addresses, access keys, and the complete map of your infrastructure. They must never be committed to version control — ever.
- `*.tfstate`, `*.tfstate.backup`, `.terraform/*`, and `*.tfvars` all excluded from Git

Private Agent Setup
Managed vs Self-Managed Operational Burden
| Task | TFE (Self-Managed) | HCP Terraform |
|---|---|---|
| App Updates | Manual (Monthly/Quarterly) — track release notes, plan upgrades, monitor tfe-migrations logs | Automatic — zero effort |
| DB Backups | High — manage PostgreSQL backups, performance tuning, version upgrades | None |
| Scaling | Manual — K8s node scaling or Auto-scaling Group management | None |
| Security | Full — OS patching, network perimeter, encryption key management, TLS cert rotation | Minimal — identity/RBAC only |
| Redis Management | Required for Active/Active — customer manages Redis cluster | None |
| Custom Worker Images | Required if devs need special tools (jq, Python, AWS CLI) — must build, maintain, and secure images | Not needed |
Multi-Availability Zone setup in a single region.
- Provides 99.9% availability with 10% of the complexity of multi-region
- RDS Multi-AZ, storage replication within region
- This should be the baseline for all production TFE deployments
Multi-Region Pilot Light
- Primary Region: TFE running and handling all traffic
- Secondary Region: Infra defined but "scaled to zero"
- Data Sync: PostgreSQL DB and storage buckets continuously replicating to secondary
- Failover: Platform team scales up secondary nodes and updates DNS
When Is Multi-Region Worth Implementing?
As of 2026, the standard deployment method is Flexible Deployment Options (FDO) using containers — Docker, Kubernetes, OpenShift, Nomad, or Podman. The legacy Replicated installer is being phased out. TFE supports AMD64 and ARM architectures as of v1.0.0, and IPv4, IPv6, and mixed IP environments.
Terraform Enterprise on the Replicated platform will no longer be supported after April 1, 2026. Any customer still on Replicated must migrate to FDO (container-native deployment) immediately.
The Replicated platform was a containerized installation that used Replicated to manage TFE's lifecycle — Replicated Daemon, Replicated UI, and containerized TFE components (ptfe_atlas, ptfe_vault, ptfe_postgres, ptfe_nginx). This architecture is replaced by FDO.
Operational Modes (Legacy Reference)
| Mode | PostgreSQL | Object Storage | Redis |
|---|---|---|---|
| external | External — customer-managed | External — customer-managed | Docker volume on instance |
| active-active | External — customer-managed | External — customer-managed | External — customer-managed |
| disk | Internal directory on instance | Internal directory on instance | Docker volume on instance |
Organization Level
Reserve for the Platform Team — manages global settings, policies, providers, and org-level variables.
- Keep this team small (2–3 people)
- Most admins should only have Manage Workspaces — not full org controls
Project Level
Group related workspaces by Business Unit or environment. Permissions at the project level cascade to all workspaces.
- Lead Engineers manage their project domain
- Teams create their own workspaces without central admin approval every time
Workspace Level
Use only for exceptions or highly sensitive standalone resources that don't fit a project grouping.
- Avoid managing the majority of permissions here — it doesn't scale
- Individual contributors get Read / Plan / Write as appropriate
Map your existing team structure into these standardized personas. Then assign TFE permission levels to match — not to individuals, but to IdP Groups (Okta/AD) mapped to TFE Teams.
| Persona | TFE Permission Level | Capabilities |
|---|---|---|
| Platform Admin | Org Admin | Manage teams, SSO, and global module registries. Keep this group very small. |
| Lead Engineer | Project Admin | Create/delete workspaces within a specific project; manage team access for their domain. |
| Developer | Write | Trigger runs, update variables, see plans and applies. Cannot manage workspace settings. |
| Security Auditor | Read-Only | View state files and run history for compliance auditing. Cannot change anything. |
Always map Identity Provider (IdP) Groups (Okta/Active Directory) to HCP Terraform Teams — never assign permissions to individual users. When an employee leaves, their TFE access terminates automatically with their IdP account. No manual deprovisioning required.
State files and variables are the most sensitive assets in TFE. By default, "write" access allows a user to see the state file. For high-security environments, use Custom Workspace Permissions.
- Grant a team the ability to Apply changes without being able to read the state file or sensitive variable values
- Set `state-versions` to `none` or `read-outputs` — prevents downloading raw state JSON (which may contain passwords)
- Set `variables` to `none` — users trigger runs that use variables, but cannot see the sensitive values in the UI
- By default, workspaces are isolated
- If Workspace B needs an output from Workspace A, you must explicitly enable Remote State Sharing
- Never use "Share with all workspaces" — explicitly list the workspaces allowed to read outputs
- Consider using HCP Terraform Outputs or targeted data sources instead of sharing full state
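Reading a single output from another workspace, rather than sharing full state, looks like this (organization and workspace names are hypothetical):

```hcl
data "terraform_remote_state" "network" {
  backend = "remote"
  config = {
    organization = "acme"
    workspaces = {
      name = "network-prod"   # Workspace A must allow this workspace to read its outputs
    }
  }
}

resource "aws_instance" "app" {
  ami       = var.ami_id
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```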
In air-gapped TFE deployments, the TFE encryption password will likely be wrapped in a Hardware Security Module (HSM) or a cloud KMS. If TFE loses its unseal key, all data becomes unreadable. Document the key recovery procedure before you need it.