Terraform Implementation Guide
📐 Implementation vs Discovery
This guide covers the how-to of building Terraform. Use the Discovery Guide first to establish platform choice, team structure, and governance model.
💻 Code Examples
HCL examples are embedded throughout, with cloud-specific guidance (AWS/Azure/GCP) broken out per provider within each section.
SECTION 01 · STATE & BACKEND
State Management
Where is the source of truth stored? Terraform tracks infrastructure state in a terraform.tfstate file. If you lose this, Terraform forgets what it built.
Hard Limit

Local state is a singleton — only one person can own it. No locking, no collaboration, no recovery if the machine dies. Never advise this for team or production use.

.gitignore — Required Setup

State files contain a map of your infrastructure and sensitive data in plain text. Never commit to version control.

.gitignore

```gitignore
# Local .terraform directory (plugins/modules)
.terraform/*

# State files and backups — NEVER commit these
*.tfstate
*.tfstate.backup

# Variables that might contain secrets
*.tfvars
*.tfvars.json
```

Four Decisions to Address Before Using Local State

1. How will you back up state?
  • If your laptop dies, Terraform loses its memory entirely
  • Without the state file, you must manually import every resource back into a new state file
  • Decision: manual backup strategy or migrate to remote immediately
2. Is there sensitive data in your code?
  • State files are plain-text JSON — RDS passwords, API keys stored in clear text on disk
  • Does your local machine have full-disk encryption?
  • If dealing with RDS, API Keys, or sensitive metadata: use sops or migrate to remote backend with encryption
3. What is your team's future?
  • Will anyone else ever need to run this code?
  • If yes: you are building a bottleneck — state files emailed back and forth is a failure mode
  • Decision: migrate to remote backend before the team grows
4. How will you handle concurrency?
  • Local state has no robust state locking
  • Two simultaneous terraform apply runs can corrupt the state file
  • You must enforce: never run concurrent local operations

Migrating Local to Remote

1. Change your backend configuration to point to the new remote location.
2. Run terraform init (terraform init -migrate-state skips the interactive prompt).
3. When prompted "Do you want to copy your existing state to the new backend?", confirm yes.
4. Verify the state is now uploaded and managed remotely, then delete the local .tfstate file.
Explicit local backend (optional):

```hcl
terraform {
  backend "local" {
    path = "relative/path/to/terraform.tfstate"
  }
}
```
Backend Options Compared

Cloud-Native (S3, Azure Blob, GCS)
  • Best for: Small-to-medium teams already embedded in a cloud provider
  • Setup speed: Medium · Cost: Minimal — pennies for storage
  • Visibility: CLI or Cloud Console only
Managed Platform (HCP Terraform, Spacelift, Scalr)
  • Best for: Teams needing audit logs, UI-based management, sophisticated access controls
  • Setup speed: Fast · Cost: Free tier → per-resource or per-seat
  • Visibility: Rich Web UI with history & diffs
Self-Managed Platform (Terraform Enterprise)
  • Best for: Air-gapped or highly regulated; no-SaaS policy
  • Setup speed: Slow · Cost: License-based (per workspace)
  • Visibility: Rich Web UI with history & diffs
HTTP / CI-Integrated (GitLab Managed TF State)
  • Best for: Orgs wanting to centralize everything inside their Git provider
  • Setup speed: Medium · Cost: Included in GitLab tier
  • Visibility: GitLab UI

Cloud-Native Backend — Per Provider

S3 + DynamoDB (or Native S3 Locking)
  • State stored in S3 bucket
  • Locking: historically required DynamoDB table — as of late 2024/2025, Terraform introduced native S3 locking, making DynamoDB optional for newer versions
  • Enable bucket versioning — required for state recovery
  • Enable SSE-KMS encryption at rest
  • Block all public access on the bucket
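The S3 checklist above can be sketched as a backend block. This is a minimal example, assuming Terraform 1.10+ for native S3 locking (use_lockfile); the bucket name and KMS alias are hypothetical:

```hcl
terraform {
  backend "s3" {
    bucket       = "acme-terraform-state"        # hypothetical bucket: versioned, public access blocked
    key          = "apollo/prod/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    kms_key_id   = "alias/terraform-state"       # hypothetical KMS alias for SSE-KMS
    use_lockfile = true                          # native S3 locking (TF 1.10+); DynamoDB table no longer required
  }
}
```

On older Terraform versions, replace use_lockfile with a dynamodb_table argument pointing at a locking table.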
Azure Blob Storage
  • State stored in Azure Storage Account Blob container
  • Locking handled natively via "Blob leases" — no extra service required
  • Enable blob versioning for state recovery
  • Use a private endpoint so traffic never hits the public internet
  • Enable Customer Managed Key (CMK) encryption at rest
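The Azure equivalent, as a sketch with hypothetical resource names; use_azuread_auth avoids shipping storage account keys to every engineer:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"  # hypothetical names throughout
    storage_account_name = "acmetfstate"
    container_name       = "tfstate"
    key                  = "apollo/prod/terraform.tfstate"
    use_azuread_auth     = true                  # authenticate via Entra ID, not account keys
  }
}
```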
Google Cloud Storage (GCS)
  • State stored in GCS bucket
  • Locking handled natively by the bucket itself — no extra service required
  • Enable object versioning for state recovery
  • Enable "Uniform" bucket-level access control
  • Enable CMEK encryption at rest
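And the GCS variant, a sketch with a hypothetical bucket (versioning and CMEK are configured on the bucket itself, not in this block):

```hcl
terraform {
  backend "gcs" {
    bucket = "acme-terraform-state"  # hypothetical bucket; enable versioning + CMEK on it
    prefix = "apollo/prod"           # state object path within the bucket
  }
}
```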
Managed Platform Notes

HCP Terraform: Handles state, locking, and remote runs. Plan/Apply happens on HashiCorp servers. Changed to resource-based pricing — if you have thousands of small resources, do the math vs an S3 bucket at $0.50/month.

Spacelift / Scalr: More advanced policy-as-code (OPA) and workflow orchestration than HCP Terraform. Both also support OpenTofu, which HCP Terraform does not, a consideration for enterprises hedging on licensing.

Ask these four questions to narrow the backend decision:

Four Deciding Questions
1. "Where is my infrastructure located?" — If 95% of resources are in AWS, S3 backend is the path of least resistance. Mixing clouds for state management adds unnecessary complexity.
2. "Who is running the commands?" — Just you? Cloud-native is fine. A team of 10+? You want audit logs and UI visibility from a Managed Platform.
3. "How sensitive is my data?" — State files contain every detail of your network. If your security team has a "No-SaaS" policy for sensitive metadata, stick to Cloud-Native storage inside your own VPC.
4. "What is my budget for non-revenue tools?" — HCP Terraform's resource-based pricing can become expensive for high-volume, low-value resources. Know your resource count before committing.
Does the security policy allow infrastructure state in the public cloud?
✗ NO — No-SaaS policy → Terraform Enterprise (TFE): Air-gapped or internal VPC deployment only. State never leaves the customer's environment.
✓ YES — SaaS allowed → HCP Terraform or Cloud-Native: Evaluate team size, budget, and feature needs. Proceed to the four questions above.

TFE remains stateless and resilient by pushing all persistent data into externally managed services: primarily PostgreSQL, object storage, and (for Active/Active) Redis. These must be provisioned and managed by the customer's platform team before TFE installation.

AWS
  • Compute (EC2 Instance or EKS/Kubernetes): Min 4 vCPU / 8GB RAM. Production: 8–16 vCPU / 32GB+ RAM. Runtime: Docker Engine or Kubernetes.
  • Database (RDS Multi-AZ, PostgreSQL v12–v16): Use Multi-AZ for HA. Stores user accounts, workspace settings, run history. Does NOT store state files.
  • Object Storage (S3 Bucket): Versioned objects + SSE-KMS encryption. Stores all .tfstate files, plan files, run logs, and config code.
  • Identity (IAM Roles / Instance Profile): TFE app writes to S3 and talks with RDS without hardcoded Access Keys.
  • Network (VPC + NLB or ALB): VPC across at least two Private Subnets. TFE requires HTTPS — terminate SSL at the Load Balancer (recommended).
  • Secrets (Vault recommended, or Secrets Manager + KMS): Service credentials, TFE license, encryption key (enc-password), TLS certs.
  • Redis for Active/Active (ElastiCache for Redis): Required for multi-node Active/Active. Coordinates the Run Queue between nodes.
Azure
  • Compute (Azure VM or AKS/Kubernetes): Min 4 vCPU / 8GB RAM. Create a VNet with a dedicated subnet for AKS or VMSS.
  • Database (Azure Database for PostgreSQL Flexible Server): Flexible Server preferred for TFE's performance needs. PostgreSQL v12–v16.
  • Object Storage (Azure Blob Storage): Use a Storage Account with a private endpoint so traffic never hits the public internet.
  • Identity (User-Assigned Managed Identity): Assign to VM/Pod to handle authentication to the Storage Account — no hardcoded keys.
  • Network (VNet + AKS subnet or VMSS): HTTPS required. SSL termination at Load Balancer recommended.
  • Secrets (Vault recommended, or Azure Key Vault): Service credentials, TFE license, encryption key, TLS certs.
  • Redis for Active/Active (Azure Cache for Redis): Required for multi-node Active/Active. Coordinates the Run Queue between nodes.
GCP
  • Compute (Compute Engine or GKE/Kubernetes): Min 4 vCPU / 8GB RAM. Create a VPC with Private Service Connect to reach Cloud SQL.
  • Database (Cloud SQL for PostgreSQL): Use a Private IP address. PostgreSQL v12–v16.
  • Object Storage (Cloud Storage/GCS Bucket): Enable "Uniform" bucket-level access. Stores all state files, plan files, run logs.
  • Identity (Google Service Account): Assign roles/storage.objectAdmin and roles/cloudsql.client permissions.
  • Network (VPC + Private Service Connect): HTTPS required. SSL termination at Load Balancer recommended.
  • Secrets (Vault recommended, or Secret Manager + KMS): Service credentials, TFE license, encryption key, TLS certs.
  • Redis for Active/Active (Memorystore for Redis): Required for multi-node Active/Active. Coordinates the Run Queue between nodes.
SECTION 02 · CODE ARCHITECTURE
Environment Isolation: Folders vs. Workspaces
How do you plan to separate Dev, Staging, and Prod? This decision affects security, visibility, and operational risk.
Directory (Folder) Isolation

✅ Pros
  • Physical separation of code per environment
  • Easiest to use different module versions per environment — controlled updates
  • Transparent visibility — browse the repo to see all environments
  • Distributed across different backend paths — strong isolation
⚠️ Cons
  • Higher code redundancy — shared files copied across folders
  • Environments can drift apart if dev genuinely differs from prod (though deliberate differences are sometimes desirable)
Security Advantage

With directory isolation, you can assign different IAM roles or service accounts per environment directory. The CI/CD pipeline for the prod/ folder can be restricted to only allow the "Production" service account. This is a hard security boundary — workspaces cannot do this.

⚠️ Security Boundary Warning

Workspaces are NOT a security boundary. All workspaces in a directory share the same backend configuration — anyone with access to run Terraform in that folder can access the state of any workspace. If you need strict IAM permissions separating Dev and Prod, directory-based isolation is required.

CLI Workspaces

✅ Pros
  • Low code redundancy — same code for all environments
  • No environment variability by design
  • Centralized on one backend path
⚠️ Cons
  • High risk of accidental "prod" changes if you forget to switch workspaces
  • Poor visibility — you can't see which environments exist without running terraform workspace list
  • Shared credentials across all environments

Valid Workspace Use Cases

Workspace-based environments are better when you need to deploy identical logic multiple times:

🔁 Ephemeral Preview Environments
  • Preview environment for each Pull Request
  • Deploy: terraform workspace new pr-123
  • Delete: terraform workspace delete pr-123 when PR is merged
🌎 Multi-Region Cookie-Cutter
  • SaaS provider deploying identical app infra per region
  • Set region in provider using var.region_map[terraform.workspace]
  • Same logic, different regional execution context
🏢 Multi-Tenant Managed Services
  • Customer is the unit of isolation
  • Prevents drift between customer environments
  • Each workspace = one tenant
🔵🟢 Blue/Green Infra Swapping
  • New version of entire environment alongside existing for cutover
  • Swap traffic, validate, then destroy old workspace
  • Requires careful state management during transition
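The multi-region cookie-cutter pattern above can be sketched as follows; the workspace names and region map values are hypothetical:

```hcl
variable "region_map" {
  description = "Maps workspace name to deployment region"
  type        = map(string)
  default = {
    us-prod = "us-east-1"
    eu-prod = "eu-west-1"
  }
}

provider "aws" {
  # The active workspace selects the region:
  #   terraform workspace select eu-prod && terraform apply
  region = var.region_map[terraform.workspace]
}
```

The same configuration then deploys identical logic into whichever region the selected workspace maps to.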
SECTION 03 · CODE ARCHITECTURE
Modular Strategy: How Much Abstraction?
Don't write flat code. Establish a three-tier modular hierarchy: Bricks → Walls → Buildings. Each tier has a specific purpose, ownership model, and change cadence.
Tier 1
Resource Modules — "The Bricks"

Goal: Enforce company standards and compliance. Create thin wrappers around single resources or closely related resources that encode the "Golden Resource" — e.g., "Every S3 bucket must have encryption."

Rule: Never create one-to-one wrappers that expose every single provider argument. If you can't find a reason to change a default, don't expose it as a variable.

Owned by: Security / Platform Team — very low change frequency, global blast radius.

Tier 1 Examples

☁️ Golden Storage — AWS S3 Bucket
  • AES-256 Encryption enforced
  • Block Public Access enforced
  • 90-day versioning lifecycle
☁️ Golden Storage — Azure Storage Account
  • Enforces TLS 1.2
  • Disabled shared key access
  • Requires Private Endpoint connectivity
🔐 Standard Identity — AWS IAM Role
  • Automatically attaches a Boundary policy
  • Adds standard owner tags
🔐 Standard Identity — Azure User-Assigned Identity
  • Configures Federated Identity Credential for OIDC by default
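A sketch of a Tier 1 "Golden Storage" brick for AWS, assuming a recent AWS provider (~> 5.x); note that only the input with a genuine reason to vary (the bucket name) is exposed:

```hcl
variable "bucket_name" {
  description = "Physical name of the bucket"
  type        = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# AES-256 encryption enforced; deliberately not exposed as a variable
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block Public Access enforced
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```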
Tier 2
Infrastructure Modules — "The Walls"

Goal: Provide a "best practice" implementation of a common architecture pattern. Collections of Tier 1 modules that form a complete service.

Rule: These should be opinionated — they define how your company builds a web server or a database cluster. Not every argument should be exposed.

Owned by: Platform / SRE Team — low change frequency, medium (service-level) blast radius.

Tier 2 Examples

🌐 AWS Enterprise Connected VPC
  • VPC with public/private subnets
  • NAT Gateways + Route Tables with Network ACLs
  • Connected to Corporate Hub with Transit Gateway / VPC Peering
  • IPAM integration
🗄️ AWS Secure Database
  • RDS Instance + Subnet Group
  • Security Group Rule
  • Randomized credentials stored in Secrets Manager
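The "randomized credentials in Secrets Manager" piece of that Tier 2 wall might look like this sketch (all names hypothetical; uses the hashicorp/random and hashicorp/aws providers):

```hcl
resource "random_password" "db" {
  length  = 32
  special = false
}

resource "aws_secretsmanager_secret" "db" {
  name = "acme-apollo-prod-rds-credentials" # hypothetical secret name
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}

resource "aws_db_instance" "this" {
  identifier          = "acme-apollo-prod-rds-primary" # hypothetical
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  username            = "app"
  password            = random_password.db.result
  skip_final_snapshot = true
}
```

Caveat: the generated password still lands in terraform.tfstate in plain text; the ephemeral-resources pattern in Section 05 avoids this on Terraform 1.10+.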
Tier 3
Application Modules — "The Buildings"

Goal: These are the "root modules" that developers call. They combine Infrastructure Modules to deploy a full environment. Abstract away all complex logic — a developer should only need to provide app_name and environment.

Owned by: App Developers — high change frequency (daily/weekly), low (app-level) blast radius.

Tier 3 Example: E-Commerce Checkout Service

```hcl
# Application Module: e-commerce checkout
# Developer only provides app_name and environment

module "network" {
  source  = "registry.terraform.io/acme/enterprise-vpc/aws"
  version = "~> 2.0"
}

module "web_stack" {
  source   = "registry.terraform.io/acme/web-stack/aws"
  version  = "~> 1.4"
  vpc_id   = module.network.vpc_id
  app_name = var.app_name
}

module "database" {
  source     = "registry.terraform.io/acme/secure-database/aws"
  version    = "~> 1.2"
  subnet_ids = module.network.private_subnet_ids
}
```

Tier Comparison

Tier 1: Bricks
  • Owned by: Security / Platform · Change frequency: Very Low
  • Blast radius: Huge (Global Impact) · Versioning goal: Strict SemVer
Tier 2: Walls
  • Owned by: Platform / SRE · Change frequency: Low
  • Blast radius: Medium (Service Impact) · Versioning goal: Feature-based releases
Tier 3: Buildings
  • Owned by: App Developers · Change frequency: High (daily/weekly)
  • Blast radius: Low (App Impact) · Versioning goal: Environment-based tags

Using Git tags (e.g., v1.2.0) is essential so a change to a module doesn't instantly break every project consuming it. If a user upgrades their module, they should know exactly what the risk is by looking at the version number.

  • Major (v1.2.3 → v2.0.0): Breaking Changes — removing a variable, renaming an output, changing a resource in a way that forces delete-and-recreate
  • Minor (v1.2.3 → v1.3.0): New Features — adding an optional variable, a new output, or an additional non-disruptive resource
  • Patch (v1.2.3 → v1.2.4): Bug Fixes — documentation updates, fixing a typo in a tag, updating a provider version constraint

Conventional Commits Workflow

Require developers to use a standard Git commit message format. Tools: commitlint (enforce format), Commitizen (generate messages), Semantic-Release (automate version bumps), Husky (git hooks to block non-compliant commits).

Commitizen syntax:

```text
<type>(<optional scope>): <description>

# Examples that automate version bumps:
feat: add secondary disk to VM       → minor bump (0.1.0)
fix: correct dns record typo         → patch bump (0.0.1)
feat!: remove deprecated legacy LB   → major bump (1.0.0)

# Valid types: feat, fix, docs, style, refactor, test, chore, ci
```

Required Module Guardrail — versions.tf

Every module must have a versions.tf file that restricts the Terraform and Provider versions it supports:

versions.tf

```hcl
terraform {
  required_version = ">= 1.5.0" # Prevents users on old, buggy TF versions

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Prevents accidental breaking provider upgrades
    }
  }
}
```

Automation Tools for Consistency

📚 Automated Documentation
  • Use terraform-docs via pre-commit hook or GitHub Action
  • Every time a variable is added or changed, the README.md updates automatically
  • Prevents "documentation drift" where README and code disagree
🤖 Dependency Management
  • Use Renovate Bot to scan Terraform code for newer module versions
  • Automatically opens a PR to upgrade — combine with terraform test or Terratest
  • "Patch" and "minor" upgrades can be auto-merged with high confidence when tests pass
  • Local Path (./modules/x): Avoid for shared code. Hard to version — changes affect everyone instantly. Only appropriate for tightly-coupled, single-repo code.
  • Git Tag (ref=v1.2.0): Good for startups. Simple to set up; immutable once tagged. No registry infrastructure required.
  • Private Registry (HCP / TFE): Best for enterprise. Native versioning UI, security scanning, "official" badges. Single source of truth for all teams.
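As a sketch, consuming a module pinned to an immutable Git tag (the repository URL is hypothetical):

```hcl
module "vpc" {
  # Generic Git source pinned to the immutable v1.2.0 tag
  source = "git::https://github.com/acme/terraform-aws-vpc.git?ref=v1.2.0"
}
```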
SECTION 04 · STANDARDIZATION
Standardization & Naming
Decide on naming conventions and style early. Inconsistency in naming is a governance failure — it makes resources unsearchable and ownership ambiguous.

HCL identifiers are the names of resources inside your code — not what appears in the AWS/Azure console.

Use snake_case (underscores), never kebab-case (dashes) — In HCL, underscores allow you to double-click a name to select the whole string. Dashes do not.
Use singular nouns and avoid repeating the resource type — Anti-pattern: resource "aws_s3_bucket" "s3_bucket_logs" {} → Better: resource "aws_s3_bucket" "logs" {}
this/main pattern — If a module only creates one primary resource, name it this or main. This makes it predictable for anyone reading your code.
```hcl
# ❌ Anti-pattern: redundant type in name, kebab-case
resource "aws_s3_bucket" "s3-bucket-logs" {}

# ✅ Correct: singular noun, snake_case, no type redundancy
resource "aws_s3_bucket" "logs" {}

# ✅ this/main pattern for single-resource modules
resource "aws_iam_role" "this" {}
```

Physical resource names are what appears in the AWS/Azure/GCP console. Use kebab-case for physical names (opposite of HCL identifiers).

Recommended Segmented Pattern

[Org]-[Project]-[Env]-[Region]-[Resource]-[Suffix]

This ensures that even if someone sees a resource in the cloud console without any context, they know exactly what it is and who owns it.

```hcl
# Example: acme-apollo-prod-us-east-1-rds-primary
locals {
  resource_prefix = "${var.org}-${var.project}-${var.env}-${var.region}"
}

resource "aws_db_instance" "primary" {
  identifier = "${local.resource_prefix}-rds-primary"
}
```
Shared Modules (Bricks & Walls)
  • Convention: terraform-<provider>-<name>
  • Examples: terraform-aws-vpc, terraform-azure-aks
  • Follows Standard Module Structure so Terraform registries can parse automatically
Application Code (Buildings)
  • Convention: infra-<project>-<business-unit>
  • Polyrepo for shared modules (independent versioning)
  • Monorepo or single repo for application environment configs (Dev/Staging/Prod side-by-side)

Standard Module Directory Structure

```text
terraform-aws-s3
├── main.tf       # Core logic
├── variables.tf  # Inputs
├── outputs.tf    # Outputs
├── versions.tf   # Provider / TF version constraints
├── README.md     # Auto-generated by terraform-docs
└── examples/     # Sub-folders with usage examples
```

Provider Version Pinning — Lock at the Edge

Pessimistic Operator (~>)

~> 1.2.0 — allows 1.2.1, 1.2.9, but blocks 1.3.0. Use this for providers.

~> 1.2 — allows 1.3.0, 1.9.0, but blocks 2.0.0. Use this for reusable modules.

If you pin a library too tightly, you create dependency hell: if Module A requires exactly v1.1 and Module B requires exactly v1.2, you can't have both in the same project. Ranges allow Terraform to negotiate a version that satisfies all consumers.

  • Root Module (code you actually run apply on): pin exactly ("1.2.5" or "~> 1.2.5") for 100% reproducibility — rebuild exactly as it was
  • Reusable Module (shared library / "brick"): use ranges (">= 1.2" or "~> 1.0") to avoid dependency conflicts between consumers

Decide on a Global Tag policy early. Every resource must carry these tags. Create a "Global Tags" variable that is merged to every resource automatically.

💰 Financial Pillar
  • CostCenter: Internal budget code or Department ID (e.g., DEPT-402)
  • BusinessUnit: High-level org unit (Marketing, Engineering)
  • Project: Billing code or specific initiative name
👤 Ownership Pillar
  • Owner: Team, email alias, or Slack channel — never personal names (people leave)
  • TechnicalContact: Primary engineer responsible for the service
  • Service/Application: Logical name of the application the resource belongs to
⚙️ Technical Pillar (DevOps/SRE)
  • Environment: Standardized values: prod, staging, dev, sandbox
  • ManagedBy: Always set to "Terraform"
  • ProvisionedBy: Specific Git repository or CI/CD pipeline that built the resource
🔒 Security Pillar
  • DataClassification: public, internal, confidential, pii
  • Criticality: low, medium, high, mission-critical (for incident response prioritization)
  • Compliance: PCI, HIPAA, SOC2 if resource is subject to regulatory scope

Implementation — AWS and Azure/GCP

AWS — provider default_tags

```hcl
provider "aws" {
  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      Owner       = "platform-eng@acme.com"
      Project     = "Apollo-Scale"
    }
  }
}
```
Azure / GCP — locals merge

```hcl
locals {
  mandatory_tags = {
    CostCenter  = "FIN-99"
    Environment = "prod"
    ManagedBy   = "Terraform"
  }
}

resource "azurerm_resource_group" "app" {
  name     = "app-resources"
  location = "West US"

  # Merges common tags with resource-specific ones
  tags = merge(local.mandatory_tags, { Name = "AppGateway" })
}
```
Best Practices

Case Consistency: Pick your case requirements (lowercase, PascalCase) and enforce them with TFLint rules.

Normalization: Use a restricted list of valid values for tags like Environment — never allow freeform input.

No Sensitive Data in Tags: Never put IP addresses, passwords, or phone numbers in tags — they appear in cloud console search and billing exports.

Tagging policy requires enforcement at multiple layers. A policy that can be bypassed is not a policy.

Layer 1 — Pre-commit (Shift Left)

Catch violations before the code is even pushed.

  • tflint — fails the build if mandatory tags are missing from code
  • checkov — security scanning that catches missing required tags
  • pre-commit hooks — run terraform fmt and tflint before a developer can push
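A sketch of Layer 1 enforcement via the tflint AWS ruleset's aws_resource_missing_tags rule; the plugin version is a hypothetical pin and the tag list reuses this guide's examples:

```hcl
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.31.0" # hypothetical pinned version
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Fail the build if any taggable AWS resource is missing these tags
rule "aws_resource_missing_tags" {
  enabled = true
  tags    = ["CostCenter", "Owner", "Environment", "ManagedBy"]
}
```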
Layer 2 — Plan / Apply (Policy Gate)

Block non-compliant applies at the execution layer.

  • HCP Terraform Sentinel — block any terraform apply that doesn't meet tagging criteria
  • Open Policy Agent (OPA) — alternative policy-as-code for non-HCP environments
Layer 3 — Cloud Native (Hard Stop)

Last resort — prevents resource creation at the cloud provider level.

  • AWS Tag Policies — SCP-level enforcement at the Org level
  • Azure Policy — physically prevents a resource from being created without required tags
Enforcement Tools for Consistency

TFLint: Configurable with custom rules to fail a build if a resource name uses dashes instead of underscores, or if a variable is missing a description.

terraform-docs: Automatically generates README.md from variables and outputs. If a developer changes a variable name, documentation updates itself — eliminates documentation drift.

SECTION 05 · SECURITY & GOVERNANCE
Security & Secrets Management
Terraform state files contain secrets in plain text. Never hardcode passwords. The question is which secrets management approach matches the type and scope of each secret.
Is the secret a Cloud Provider Credential (AWS Access Key, Azure Client Secret, GCP Service Account Key)?
✓ YES
→ Dynamic Provider Credentials (OIDC): Never store cloud provider credentials in a secrets manager or variable. Configure OIDC Workload Identity — Terraform generates a short-lived token per run. No static keys to steal.
✗ NO — Continue
→ Next question ↓
Does the secret need to be shared across multiple non-cloud platforms (e.g., Datadog, Snowflake, AWS simultaneously)?
✓ YES — Multi-platform
→ HashiCorp Vault: Single source of truth for multi-cloud and hybrid environments. Supports dynamic secrets, lease management, and cross-platform access policies.
✗ NO — Continue
→ Next question ↓
Is the secret only needed during the terraform apply phase (not stored long-term)?
✓ YES — Ephemeral
→ Terraform Ephemeral Resources (TF 1.10+): Temporary secrets (e.g., temporary tokens) that are used only during apply and never written to state. The 2026 gold standard — requires TF 1.10+.
✗ NO — Persistent, single-cloud
→ Cloud Native Secret Manager: Use when the secret is related to a resource tied to that specific provider or account — AWS Secrets Manager, Azure Key Vault, GCP Secret Manager.
Historical Problem — Secrets in State

If you use a data source to fetch a password from Vault, that password gets written to terraform.tfstate and persists there permanently. Ephemeral Resources (TF 1.10+) solve this — the value is fetched at apply time and never written to state.
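The ephemeral pattern can be sketched as follows, assuming Terraform 1.11+ (ephemeral resources plus write-only arguments) and a recent AWS provider that ships the ephemeral aws_secretsmanager_secret_version resource; the secret name and instance settings are hypothetical:

```hcl
# Fetched at apply time; never persisted to state or plan files
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "acme/apollo/db-password" # hypothetical secret name
}

resource "aws_db_instance" "primary" {
  identifier          = "apollo-prod-rds-primary"
  engine              = "postgres"
  instance_class      = "db.t3.medium"
  allocated_storage   = 20
  username            = "app"

  # Ephemeral values may only flow into write-only (_wo) arguments;
  # bump password_wo_version to trigger a password rotation
  password_wo         = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  password_wo_version = 1
  skip_final_snapshot = true
}
```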

Core Principle

Only your CI/CD runner should have "Owner" / apply permissions. Developers should have "Read-Only" or "Plan-only" access. Humans applying directly from laptops in production is a control failure, not a workflow.

  • CI/CD Runner (Service Account): Owner / Apply. The only entity that should run terraform apply in prod.
  • Lead Engineer / SRE: Plan + Approve. Can review plans and approve runs, cannot directly apply.
  • Developer: Plan-only. Can see what will change, cannot trigger applies in production.
  • Security Auditor: Read-only. Can view state and run history for compliance purposes.
Non-Negotiable

State files are plain-text JSON. They contain passwords, IP addresses, access keys, and the complete map of your infrastructure. They must never be committed to version control — ever.

  • .gitignore is configured: *.tfstate, *.tfstate.backup, .terraform/*, *.tfvars all excluded from Git
  • Encryption at rest: KMS/CMK encryption enabled on the state backend (S3, Azure Blob, GCS). HCP Terraform encrypts by default.
  • Bucket/blob versioning enabled: required for state recovery from corruption or accidental deletion
  • Access logging enabled on the state backend: who accessed the state file and when is an audit requirement
  • RBAC on state access: only the apply service account can read raw state; developers get plan output only
SECTION 06 · PLATFORM OPERATIONS
TFE Deployment
Terraform Enterprise is a complex, customer-managed application. As of 2026, the standard deployment method is FDO (Flexible Deployment Options) using containers — not the legacy Replicated installer.
Does your security policy allow infrastructure state in the public cloud (SaaS)?
✗ NO — No-SaaS / Air-gapped
→ Terraform Enterprise (Self-Hosted): TFE can be installed in a fully disconnected environment. Proceed to External Services architecture.
✓ YES — SaaS allowed
→ HCP Terraform: Continue evaluating private runner needs below.
Do you need private networking for your runners (execute Terraform inside your private network)?
✓ YES — Private runners
→ HCP Terraform + Private Agents: Pull-based architecture — no inbound access needed. Agent polls HCP for work. Requires HCP Terraform Business Tier.
✗ NO — Cloud runners acceptable
→ Standard HCP Terraform: Runs execute on HashiCorp's infrastructure. No agent setup required.

Private Agent Setup

Docker — pull-based, no inbound needed

```shell
docker run \
  -e TFC_AGENT_TOKEN="your_token_here" \
  -e TFC_AGENT_NAME="private-network-agent-01" \
  hashicorp/tfc-agent:latest

# Agent polls HCP Terraform for work — no inbound firewall rules required
# Requires HCP Terraform Business Tier for self-hosted agents
```
Do you have the staff to be Platform Operators for TFE?
✓ YES — Dedicated platform team
→ TFE is viable: Team must be capable of managing K8s/container runtime, PostgreSQL, object storage, Redis, and OS patching. See burden table below.
✗ NO — No platform admins
→ HCP Terraform strongly preferred: TFE is a complex application. Without dedicated operators, it becomes a liability.

Managed vs Self-Managed Operational Burden

Task · TFE (Self-Managed) · HCP Terraform
  • App Updates · TFE: Manual (Monthly/Quarterly); track release notes, plan upgrades, monitor tfe-migrations logs · HCP: Automatic, zero effort
  • DB Backups · TFE: High; manage PostgreSQL backups, performance tuning, version upgrades · HCP: None
  • Scaling · TFE: Manual; K8s node scaling or Auto-scaling Group management · HCP: None
  • Security · TFE: Full; OS patching, network perimeter, encryption key management, TLS cert rotation · HCP: Minimal (identity/RBAC only)
  • Redis Management · TFE: Required for Active/Active; customer manages the Redis cluster · HCP: None
  • Custom Worker Images · TFE: Required if devs need special tools (jq, Python, AWS CLI); must build, maintain, and secure images · HCP: Not needed
✅ Multi-AZ — Standard HA (Recommended Start)

Multi-Availability Zone setup in a single region.

  • Provides 99.9% availability with 10% of the complexity of multi-region
  • RDS Multi-AZ, storage replication within region
  • This should be the baseline for all production TFE deployments
⚡ Multi-Region — Active/Passive DR
  • Primary Region: TFE running and handling all traffic
  • Secondary Region: Infra defined but "scaled to zero"
  • Data Sync: PostgreSQL DB and storage buckets continuously replicating to secondary
  • Failover: Platform team scales up secondary nodes and updates DNS

When Is Multi-Region Worth Implementing?

Extreme RTO/RPO Requirements: Your company loses millions of dollars if Terraform is down for more than 15 minutes
Regulatory Compliance: Banking or Life Sciences sector where "Regional Resilience" is a legal requirement
Massive Scale: 5000+ developers globally needing TFE closer to them — though HCP Agents in those regions is a much simpler solution for latency
2026 FDO Deployment Notes

As of 2026, the standard deployment method is Flexible Deployment Options (FDO) using containers — Docker, Kubernetes, OpenShift, Nomad, or Podman. The legacy Replicated installer is being phased out. TFE supports AMD and ARM architectures as of v1.0.0, and IPv4, IPv6, and mixed IP environments.

End of Life — Action Required

Terraform Enterprise on the Replicated platform will no longer be supported after April 1, 2026. Any customer still on Replicated must migrate to FDO (container-native deployment) immediately.

The Replicated platform was a containerized installation that used Replicated to manage TFE's lifecycle — Replicated Daemon, Replicated UI, and containerized TFE components (ptfe_atlas, ptfe_vault, ptfe_postgres, ptfe_nginx). This architecture is replaced by FDO.

Operational Modes (Legacy Reference)

Mode · PostgreSQL · Object Storage · Redis
  • external: External (customer-managed) · External (customer-managed) · Docker volume on instance
  • active-active: External (customer-managed) · External (customer-managed) · External (customer-managed)
  • disk: Internal directory on instance · Internal directory on instance · Docker volume on instance
SECTION 07 · SECURITY & GOVERNANCE
Governance & RBAC
RBAC is about aligning your security model with your organizational structure. Don't manage permissions at the individual workspace level — it becomes an administrative nightmare at scale.
🏢 Organization Level

Reserve for Platform Team — manages global settings, policies, providers, and org-level variables.

  • Keep this team small (2–3 people)
  • Most admins should only have Manage Workspaces — not full org controls
📁 Project Level

Group related workspaces by Business Unit or environment. Permissions at the project level cascade to all workspaces.

  • Lead Engineers manage their project domain
  • Teams create their own workspaces without central admin approval every time
⚙️ Workspace Level

Use only for exceptions or highly sensitive standalone resources that don't fit a project grouping.

  • Avoid managing the majority of permissions here — it doesn't scale
  • Individual contributors get Read / Plan / Write as appropriate

Map your existing team structure into these standardized personas. Then assign TFE permission levels to match — not to individuals, but to IdP Groups (Okta/AD) mapped to TFE Teams.

  • Platform Admin → Org Admin: Manage teams, SSO, and global module registries. Keep this group very small.
  • Lead Engineer → Project Admin: Create/delete workspaces within a specific project; manage team access for their domain.
  • Developer → Write: Trigger runs, update variables, see plans and applies. Cannot manage workspace settings.
  • Security Auditor → Read-Only: View state files and run history for compliance auditing. Cannot change anything.
Team Mapping Best Practice

Always map Identity Provider (IdP) Groups (Okta/Active Directory) to HCP Terraform Teams — never assign permissions to individual users. When an employee leaves, their TFE access terminates automatically with their IdP account. No manual deprovisioning required.

State files and variables are the most sensitive assets in TFE. By default, "write" access allows a user to see the state file. For high-security environments, use Custom Workspace Permissions.

👁️‍🗨️ The Blind Apply Strategy
  • Grant a team the ability to Apply changes without being able to read the state file or sensitive variable values
  • Set state-versions to none or read-outputs — prevents downloading raw state JSON (which may contain passwords)
  • Set variables to none — users trigger runs that use variables, but cannot see the sensitive values in the UI
🔗 Remote State Sharing
  • By default, workspaces are isolated
  • If Workspace B needs an output from Workspace A, you must explicitly enable Remote State Sharing
  • Never use "Share with all workspaces" — explicitly list the workspaces allowed to read outputs
  • Consider using HCP Terraform Outputs or targeted data sources instead of sharing full state
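If Workspace B genuinely needs Workspace A's outputs, the consuming side can be sketched with terraform_remote_state (organization, workspace, and output names are hypothetical); Workspace A must have explicitly shared its state with Workspace B first:

```hcl
# Workspace B reads only the published outputs of Workspace A
data "terraform_remote_state" "network" {
  backend = "remote"

  config = {
    organization = "acme"          # hypothetical org
    workspaces = {
      name = "apollo-network-prod" # hypothetical upstream workspace
    }
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # hypothetical AMI
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_id
}
```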
Mandatory Sentinel/OPA Policies: Implement a policy that prevents any workspace from being created without being assigned to a Project. Workspaces in the Default Project should be blocked by policy — every workspace must have explicit ownership.
Audit Log Streaming: Configure TFE to stream all audit logs to a centralized SIEM (Splunk, Datadog). Local logs are a compliance risk — they can be rotated or deleted. What to track: sensitive variable access, policy overrides, workspace creation, team membership changes.
SSO Enforcement with MFA: Ensure TFE is backed by the corporate Identity Provider with MFA enabled. When an employee leaves, their TFE access should terminate automatically — not through a manual deprovisioning ticket.
Workspace Naming Convention Policy: Use Sentinel to enforce workspace naming standards — prevent ad-hoc names that obscure ownership and environment scope.
Air-Gapped Encryption Note

In air-gapped TFE deployments, the TFE encryption password will likely be wrapped in a Hardware Security Module (HSM) or a cloud KMS. If TFE loses its unseal key, all data becomes unreadable. Document the key recovery procedure before you need it.