Terraform Implementation Guide
Technical implementation reference for helping customers build production-grade Terraform environments. Use this after the Discovery Guide — this is the how once the what has been decided.
Local state is a singleton — only one person can own it. No locking, no collaboration, no recovery if the machine dies. Never advise this for team or production use.
State files contain a map of your infrastructure and sensitive data in plain text. Never commit to version control.
Four Decisions to Address Before Using Local State
- If your laptop dies, Terraform loses its memory entirely
  - Without the state file, you must manually `import` every resource back into a new state file
  - Decision: manual backup strategy, or migrate to remote immediately
- State files are plain-text JSON — RDS passwords and API keys stored in clear text on disk
  - Does your local machine have full-disk encryption?
  - If dealing with RDS, API keys, or sensitive metadata: use `sops` or migrate to a remote backend with encryption
- Will anyone else ever need to run this code?
  - If yes: you are building a bottleneck — state files emailed back and forth is a failure mode
  - Decision: migrate to a remote backend before the team grows
- Local state has no robust state locking
  - Two simultaneous `terraform apply` runs can corrupt the state file
  - You must enforce: never run concurrent local operations
Migrating Local to Remote
Add a `backend` block to your configuration and run `terraform init`; Terraform detects the existing local `.tfstate` file and offers to copy it into the new backend.

| Option | Best For | Setup Speed | Cost | Visibility |
|---|---|---|---|---|
| Cloud-Native (S3, Azure Blob, GCS) | Small-to-medium teams already embedded in a cloud provider | Medium | Minimal — pennies for storage | CLI or Cloud Console only |
| Managed Platform (HCP Terraform, Spacelift, Scalr) | Teams needing audit logs, UI-based management, sophisticated access controls | Fast | Free tier → per-resource or per-seat | Rich Web UI with history & diffs |
| Self-Managed Platform (Terraform Enterprise) | Air-gapped or highly regulated; no-SaaS policy | Slow | License-based (per workspace) | Rich Web UI with history & diffs |
| HTTP / CI-Integrated (GitLab Managed TF State) | Orgs wanting to centralize everything inside their Git provider | Medium | Included in GitLab tier | GitLab UI |
Cloud-Native Backend — Per Provider
AWS (S3)
- State stored in S3 bucket
- Locking: historically required a DynamoDB table — as of late 2024/2025, Terraform introduced native S3 locking, making DynamoDB optional for newer versions
- Enable bucket versioning — required for state recovery
- Enable SSE-KMS encryption at rest
- Block all public access on the bucket

Azure (Blob Storage)
- State stored in an Azure Storage Account blob container
- Locking handled natively via blob leases — no extra service required
- Enable blob versioning for state recovery
- Use a private endpoint so traffic never hits the public internet
- Enable Customer-Managed Key (CMK) encryption at rest

Google Cloud (GCS)
- State stored in a GCS bucket
- Locking handled natively by the bucket itself — no extra service required
- Enable object versioning for state recovery
- Enable "Uniform" bucket-level access control
- Enable CMEK encryption at rest
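Minimal backend sketches for the other two providers (resource group, storage account, and bucket names are hypothetical):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"  # hypothetical
    storage_account_name = "stacmetfstate"       # hypothetical
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
    use_azuread_auth     = true                  # Entra ID auth instead of access keys
  }
}

# A configuration can declare only one backend, so the GCS
# equivalent is shown commented out:
# terraform {
#   backend "gcs" {
#     bucket = "acme-tfstate-prod"               # hypothetical
#     prefix = "networking"                      # state path within the bucket
#   }
# }
```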
HCP Terraform: Handles state, locking, and remote runs. Plan/Apply happens on HashiCorp servers. Changed to resource-based pricing — if you have thousands of small resources, do the math vs an S3 bucket at $0.50/month.
Spacelift / Scalr: More advanced policy-as-code (OPA) and workflow orchestration than HCP Terraform, and stronger OpenTofu support for enterprise needs.
Ask these four questions to narrow the backend decision:
TFE requires three main external services to remain stateless and resilient. These must be provisioned and managed by the customer's platform team before TFE installation.
| Layer | AWS Service | Configuration Notes |
|---|---|---|
| Compute | EC2 Instance or EKS (Kubernetes) | Min 4 vCPU / 8GB RAM. Production: 8–16 vCPU / 32GB+ RAM. Runtime: Docker Engine or Kubernetes. |
| Database | RDS Multi-AZ (PostgreSQL v12–v16) | Use Multi-AZ for HA. Stores user accounts, workspace settings, run history. Does NOT store state files. |
| Object Storage | S3 Bucket | Versioned objects + SSE-KMS encryption. Stores all .tfstate files, plan files, run logs, and config code. |
| Identity | IAM Roles (Instance Profile) | TFE app writes to S3 and talks with RDS without hardcoded Access Keys. |
| Network | VPC + NLB or ALB | VPC across at least two Private Subnets. TFE requires HTTPS — terminate SSL at the Load Balancer (recommended). |
| Secrets | Vault (recommended) or Secrets Manager + KMS | Service credentials, TFE license, encryption key (enc-password), TLS certs. |
| Redis (Active-Active) | ElastiCache for Redis | Required for multi-node Active/Active. Coordinates the Run Queue between nodes. |
| Layer | Azure Service | Configuration Notes |
|---|---|---|
| Compute | Azure VM or AKS (Kubernetes) | Min 4 vCPU / 8GB RAM. Create a VNet with a dedicated subnet for AKS or VMSS. |
| Database | Azure Database for PostgreSQL Flexible Server | Flexible Server preferred for TFE's performance needs. PostgreSQL v12–v16. |
| Object Storage | Azure Blob Storage | Use a Storage Account with a private endpoint so traffic never hits the public internet. |
| Identity | User-Assigned Managed Identity | Assign to VM/Pod to handle authentication to Storage Account — no hardcoded keys. |
| Network | VNet + AKS subnet or VMSS | HTTPS required. SSL termination at Load Balancer recommended. |
| Secrets | Vault (recommended) or Azure Key Vault | Service credentials, TFE license, encryption key, TLS certs. |
| Redis (Active-Active) | Azure Cache for Redis | Required for multi-node Active/Active. Coordinates the Run Queue between nodes. |
| Layer | GCP Service | Configuration Notes |
|---|---|---|
| Compute | Compute Engine or GKE (Kubernetes) | Min 4 vCPU / 8GB RAM. Create a VPC with Private Service Connect to reach Cloud SQL. |
| Database | Cloud SQL for PostgreSQL | Use a Private IP address. PostgreSQL v12–v16. |
| Object Storage | Cloud Storage (GCS) Bucket | Enable "Uniform" bucket-level access. Stores all state files, plan files, run logs. |
| Identity | Google Service Account | Assign roles/storage.objectAdmin and roles/cloudsql.client permissions. |
| Network | VPC + Private Service Connect | HTTPS required. SSL termination at Load Balancer recommended. |
| Secrets | Vault (recommended) or Secret Manager + KMS | Service credentials, TFE license, encryption key, TLS certs. |
| Redis (Active-Active) | Memorystore for Redis | Required for multi-node Active/Active. Coordinates the Run Queue between nodes. |
Directory-Based Isolation
- Physical separation of code per environment
- Easiest to use different module versions per environment — controlled updates
- Transparent visibility — browse the repo to see all environments
- Distributed across different backend paths — strong isolation
- Higher code redundancy — shared files copied across folders
- High environment variability if dev genuinely differs from prod (often a good thing)
With directory isolation, you can assign different IAM roles or service accounts per environment directory. The CI/CD pipeline for the prod/ folder can be restricted to only allow the "Production" service account. This is a hard security boundary — workspaces cannot do this.
Workspaces are NOT a security boundary. All workspaces in a directory share the same backend configuration — anyone with access to run Terraform in that folder can access the state of any workspace. If you need strict IAM permissions separating Dev and Prod, directory-based isolation is required.
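A directory-isolated layout might look like this (repo and folder names are illustrative):

```
infra-payments/
├── modules/            # shared code, or consume versioned modules from a registry
├── dev/
│   ├── backend.tf      # points at the dev state path
│   └── main.tf
├── staging/
└── prod/
    ├── backend.tf      # separate state path; CI restricted to the prod service account
    └── main.tf
```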
Workspace-Based Isolation
- Low code redundancy — same code for all environments
- No environment variability by design
- Centralized on one backend path
- High risk of accidental "prod" changes if you forget to switch workspaces
- Hidden visibility — can't see which environments exist without running a command
- Shared credentials across all environments
Valid Workspace Use Cases
Workspace-based environments are better when you need to deploy identical logic multiple times:
- Preview environment for each Pull Request
  - Deploy: `terraform workspace new pr-123`
  - Delete: `terraform workspace delete pr-123` when the PR is merged
- SaaS provider deploying identical app infra per region
  - Set region in the provider using `var.region_map[terraform.workspace]`
  - Same logic, different regional execution context
- Multi-tenant isolation
  - Customer is the unit of isolation
  - Prevents drift between customer environments
  - Each workspace = one tenant
- Blue/green environment cutover
  - New version of the entire environment alongside the existing one for cutover
  - Swap traffic, validate, then destroy the old workspace
  - Requires careful state management during transition
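The per-region pattern above can be sketched like this (workspace names and the region map are illustrative):

```hcl
variable "region_map" {
  type = map(string)
  default = {
    us   = "us-east-1"
    eu   = "eu-west-1"
    apac = "ap-southeast-1"
  }
}

# The active workspace (e.g., created with `terraform workspace new eu`)
# selects the deployment region; the logic stays identical.
provider "aws" {
  region = var.region_map[terraform.workspace]
}
```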
Goal: Enforce company standards and compliance. Create thin wrappers around single resources or closely related resources that encode the "Golden Resource" — e.g., "Every S3 bucket must have encryption."
Rule: Never create one-to-one wrappers that expose every single provider argument. If you can't find a reason to change a default, don't expose it as a variable.
Owned by: Security / Platform Team — very low change frequency, global blast radius.
Tier 1 Examples
- Hardened storage bucket:
  - AES-256 encryption enforced
  - Block Public Access enforced
  - 90-day versioning lifecycle
- Hardened storage account:
  - Enforces TLS 1.2
  - Disables shared key access
  - Requires Private Endpoint connectivity
- Hardened workload identity:
  - Automatically attaches a Boundary policy
  - Adds standard owner tags
  - Configures a Federated Identity Credential for OIDC by default
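A Tier 1 "golden bucket" wrapper might look like the sketch below: a thin module that exposes only a name and hard-codes the compliance settings (module layout is illustrative):

```hcl
variable "bucket_name" {
  type        = string
  description = "The only decision the consumer gets to make"
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# Encryption enforced; deliberately not exposed as a variable
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block Public Access enforced
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}
```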
Goal: Provide a "best practice" implementation of a common architecture pattern. Collections of Tier 1 modules that form a complete service.
Rule: These should be opinionated — they define how your company builds a web server or a database cluster. Not every argument should be exposed.
Owned by: Platform / SRE Team — low change frequency, medium (service-level) blast radius.
Tier 2 Examples
- Standard network:
  - VPC with public/private subnets
  - NAT Gateways + Route Tables with Network ACLs
  - Connected to the Corporate Hub with Transit Gateway / VPC Peering
  - IPAM integration
- Standard database:
  - RDS Instance + Subnet Group
  - Security Group rules
  - Randomized credentials stored in Secrets Manager
Goal: These are the "root modules" that developers call. They combine Infrastructure Modules to deploy a full environment. Abstract away all complex logic — a developer should only need to provide `app_name` and `environment`.
Owned by: App Developers — high change frequency (daily/weekly), low (app-level) blast radius.
Tier 3 Example: E-Commerce Checkout Service
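A sketch of what such a root module might look like (registry paths, module names, and inputs are all hypothetical):

```hcl
variable "app_name" {
  type    = string
  default = "checkout"
}

variable "environment" {
  type = string
}

# Tier 2 modules do the heavy lifting; the developer supplies two inputs
module "network" {
  source  = "app.terraform.io/acme/standard-vpc/aws" # hypothetical Tier 2 module
  version = "~> 2.1"

  name        = var.app_name
  environment = var.environment
}

module "database" {
  source  = "app.terraform.io/acme/standard-rds/aws" # hypothetical Tier 2 module
  version = "~> 1.4"

  name       = var.app_name
  subnet_ids = module.network.private_subnet_ids
}
```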
Tier Comparison
| Feature | Tier 1: Bricks | Tier 2: Walls | Tier 3: Buildings |
|---|---|---|---|
| Owned by | Security / Platform | Platform / SRE | App Developers |
| Change Frequency | Very Low | Low | High (daily/weekly) |
| Blast Radius | Huge (Global Impact) | Medium (Service Impact) | Low (App Impact) |
| Versioning Goal | Strict SemVer | Feature-based releases | Environment-based tags |
Using Git tags (e.g., v1.2.0) is essential so a change to a module doesn't instantly break every project consuming it. If a user upgrades their module, they should know exactly what the risk is by looking at the version number.
Conventional Commits Workflow
Require developers to use a standard Git commit message format. Tools: commitlint (enforce format), Commitizen (generate messages), Semantic-Release (automate version bumps), Husky (git hooks to block non-compliant commits).
Required Module Guardrail — versions.tf
Every module must have a versions.tf file that restricts the Terraform and Provider versions it supports:
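A typical versions.tf guardrail (version numbers are illustrative):

```hcl
terraform {
  # The Terraform CLI versions this module is tested against
  required_version = ">= 1.5.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # any 5.x release, but never 6.0
    }
  }
}
```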
Automation Tools for Consistency
- Use terraform-docs via pre-commit hook or GitHub Action
- Every time a variable is added or changed, the README.md updates automatically
- Prevents "documentation drift" where README and code disagree
- Use Renovate Bot to scan Terraform code for newer module versions
- Automatically opens a PR to upgrade — combine with `terraform test` or Terratest
- "Patch" and "minor" upgrades can be auto-merged with high confidence when tests pass
| Source Mechanism | Recommendation | Why |
|---|---|---|
| Local Path (`./modules/x`) | Avoid for shared code | Hard to version — changes affect everyone instantly. Only appropriate for tightly-coupled, single-repo code. |
| Git Tag (`ref=v1.2.0`) | Good for Startups | Simple to set up; immutable once tagged. No registry infrastructure required. |
| Private Registry (HCP / TFE) | Best for Enterprise | Native versioning UI, security scanning, "official" badges. Single source of truth for all teams. |
HCL identifiers are the names of resources inside your code — not what appears in the AWS/Azure console.
- Don't repeat the resource type in the identifier: `resource "aws_s3_bucket" "s3_bucket_logs" {}` → Better: `resource "aws_s3_bucket" "logs" {}`
- When a module manages a single instance of a resource, name it `this` or `main`. This makes it predictable for anyone reading your code.

Physical resource names are what appears in the AWS/Azure/GCP console. Use kebab-case for physical names (opposite of the snake_case used for HCL identifiers).
[Org]-[Project]-[Env]-[Region]-[Resource]-[Suffix]
This ensures that even if someone sees a resource in the cloud console without any context, they know exactly what it is and who owns it.
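One way to implement the convention is a single local that every physical name builds on (variable names and the resulting prefix are illustrative):

```hcl
locals {
  # e.g., "acme-shop-prod-euw1"
  name_prefix = lower("${var.org}-${var.project}-${var.environment}-${var.region_short}")
}

resource "aws_s3_bucket" "logs" {
  bucket = "${local.name_prefix}-logs"   # e.g., "acme-shop-prod-euw1-logs"
}
```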
Module repositories:
- Convention: `terraform-<provider>-<name>`
- Examples: `terraform-aws-vpc`, `terraform-azure-aks`
- Follows the Standard Module Structure so Terraform registries can parse automatically

Environment repositories:
- Convention: `infra-<project>-<business-unit>`
- Polyrepo for shared modules (independent versioning)
- Monorepo or single repo for application environment configs (Dev/Staging/Prod side-by-side)
Standard Module Directory Structure
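The layout that Terraform registries expect:

```
terraform-aws-vpc/
├── README.md        # generated by terraform-docs
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf      # required_version + required_providers
├── examples/
│   └── complete/
│       └── main.tf  # a working invocation of the module
└── modules/         # optional nested submodules
```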
Provider Version Pinning — Lock at the Edge
~> 1.2.0 — allows 1.2.1, 1.2.9, but blocks 1.3.0. Use this for providers.
~> 1.2 — allows 1.3.0, 1.9.0, but blocks 2.0.0. Use this for reusable modules.
If you pin a library too tightly, you create dependency hell — Module A requires v1.1 and Module B requires v1.2, you can't have both in the same project. Ranges allow negotiation.
| Location | Pinning Strategy | Why |
|---|---|---|
| Root Module (code you actually run `apply` on) | Pin exactly: `1.2.5` or `~> 1.2.5` | 100% reproducibility — rebuild exactly as it was |
| Reusable Module (shared library / "brick") | Use ranges: `>= 1.2` or `~> 1.0` | Avoids dependency conflicts between consumers |
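The two strategies side by side (module paths and versions are illustrative):

```hcl
# In a root module: pin exactly for full reproducibility
module "network" {
  source  = "app.terraform.io/acme/network/aws"   # hypothetical
  version = "1.2.5"
}

# Inside a reusable module: declare a range so consumers can negotiate
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}
```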
Decide on a Global Tag policy early. Every resource must carry these tags. Create a "Global Tags" variable that is merged to every resource automatically.
- CostCenter: Internal budget code or Department ID (e.g., DEPT-402)
- BusinessUnit: High-level org unit (Marketing, Engineering)
- Project: Billing code or specific initiative name
- Owner: Team, email alias, or Slack channel — never personal names (people leave)
- TechnicalContact: Primary engineer responsible for the service
- Service/Application: Logical name of the application the resource belongs to
- Environment: Standardized values: prod, staging, dev, sandbox
- ManagedBy: Always set to "Terraform"
- ProvisionedBy: Specific Git repository or CI/CD pipeline that built the resource
- DataClassification: public, internal, confidential, pii
- Criticality: low, medium, high, mission-critical (for incident response prioritization)
- Compliance: PCI, HIPAA, SOC2 if resource is subject to regulatory scope
Implementation — AWS and Azure/GCP
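On AWS, `default_tags` on the provider applies the global set automatically; on Azure/GCP, a `merge()` over a shared local achieves the same (tag values and variable names are illustrative):

```hcl
# AWS: provider-level default_tags cover every taggable resource
provider "aws" {
  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      CostCenter  = var.cost_center
      Owner       = var.owner_alias
    }
  }
}

# Azure/GCP: merge a global map into each resource's tags/labels
locals {
  global_tags = {
    ManagedBy   = "Terraform"
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

resource "azurerm_resource_group" "this" {
  name     = "rg-${var.environment}"
  location = var.location
  tags     = merge(local.global_tags, var.extra_tags)
}
```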
Case Consistency: Pick your case requirements (lowercase, PascalCase) and enforce them with TFLint rules.
Normalization: Use a restricted list of valid values for tags like Environment — never allow freeform input.
No Sensitive Data in Tags: Never put IP addresses, passwords, or phone numbers in tags — they appear in cloud console search and billing exports.
Tagging policy requires enforcement at multiple layers. A policy that can be bypassed is not a policy.
Catch violations before the code is even pushed.
- tflint — fails the build if mandatory tags are missing from code
- checkov — security scanning that catches missing required tags
- pre-commit hooks — run `terraform fmt` and tflint before a developer can push
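A tflint configuration enforcing mandatory tags might look like this, using the `aws_resource_missing_tags` rule from the AWS ruleset (plugin version and tag list are illustrative):

```hcl
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.31.0"   # illustrative; pin to the current release
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "aws_resource_missing_tags" {
  enabled = true
  tags    = ["CostCenter", "Owner", "Environment", "ManagedBy"]
}
```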
Block non-compliant applies at the execution layer.
- HCP Terraform Sentinel — block any `terraform apply` that doesn't meet tagging criteria
- Open Policy Agent (OPA) — alternative policy-as-code for non-HCP environments
Last resort — prevents resource creation at the cloud provider level.
- AWS Tag Policies — SCP-level enforcement at the Org level
- Azure Policy — physically prevents a resource from being created without required tags
TFLint: Configurable with custom rules to fail a build if a resource name uses dashes instead of underscores, or if a variable is missing a description.
terraform-docs: Automatically generates README.md from variables and outputs. If a developer changes a variable name, documentation updates itself — eliminates documentation drift.
If you use a data source to fetch a password from Vault, that password gets written to terraform.tfstate and persists there permanently. Ephemeral Resources (TF 1.10+) solve this — the value is fetched at apply time and never written to state.
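A hedged sketch of the pattern, assuming a provider release that ships the ephemeral resource and write-only arguments shown (resource names, attributes, and the secret ID may differ in your provider version):

```hcl
# Fetched at apply time, never persisted to state (Terraform 1.10+)
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/master-password"   # hypothetical secret ID
}

resource "aws_db_instance" "this" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 20
  username          = "app"

  # Write-only argument: sent to AWS, never written to the state file
  password_wo         = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  password_wo_version = 1
}
```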
Only your CI/CD runner should have "Owner" / apply permissions. Developers should have "Read-Only" or "Plan-only" access. Humans applying directly from laptops in production is a control failure, not a workflow.
| Role | Access Level | Rationale |
|---|---|---|
| CI/CD Runner (Service Account) | Owner / Apply | The only entity that should run terraform apply in prod |
| Lead Engineer / SRE | Plan + Approve | Can review plans and approve runs, cannot directly apply |
| Developer | Plan-only | Can see what will change, cannot trigger applies in production |
| Security Auditor | Read-only | Can view state and run history for compliance purposes |
State files are plain-text JSON. They contain passwords, IP addresses, access keys, and the complete map of your infrastructure. They must never be committed to version control — ever.
- `*.tfstate`, `*.tfstate.backup`, `.terraform/*`, and `*.tfvars` all excluded from Git

Private Agent Setup
Managed vs Self-Managed Operational Burden
| Task | TFE (Self-Managed) | HCP Terraform |
|---|---|---|
| App Updates | Manual (Monthly/Quarterly) — track release notes, plan upgrades, monitor tfe-migrations logs | Automatic — zero effort |
| DB Backups | High — manage PostgreSQL backups, performance tuning, version upgrades | None |
| Scaling | Manual — K8s node scaling or Auto-scaling Group management | None |
| Security | Full — OS patching, network perimeter, encryption key management, TLS cert rotation | Minimal — identity/RBAC only |
| Redis Management | Required for Active/Active — customer manages Redis cluster | None |
| Custom Worker Images | Required if devs need special tools (jq, Python, AWS CLI) — must build, maintain, and secure images | Not needed |
Multi-Availability Zone setup in a single region.
- Provides 99.9% availability with 10% of the complexity of multi-region
- RDS Multi-AZ, storage replication within region
- This should be the baseline for all production TFE deployments
Multi-Region Pilot Light
- Primary Region: TFE running and handling all traffic
- Secondary Region: Infra defined but "scaled to zero"
- Data Sync: PostgreSQL DB and storage buckets continuously replicating to secondary
- Failover: Platform team scales up secondary nodes and updates DNS
When Is Multi-Region Worth Implementing?
As of 2026, the standard deployment method is Flexible Deployment Options (FDO) using containers — Docker, Kubernetes, OpenShift, Nomad, or Podman. The legacy Replicated installer is being phased out. TFE supports AMD64 and ARM architectures as of v1.0.0, and IPv4, IPv6, and mixed IP environments.
Terraform Enterprise on the Replicated platform will no longer be supported after April 1, 2026. Any customer still on Replicated must migrate to FDO (container-native deployment) immediately.
The Replicated platform was a containerized installation that used Replicated to manage TFE's lifecycle — Replicated Daemon, Replicated UI, and containerized TFE components (ptfe_atlas, ptfe_vault, ptfe_postgres, ptfe_nginx). This architecture is replaced by FDO.
Operational Modes (Legacy Reference)
| Mode | PostgreSQL | Object Storage | Redis |
|---|---|---|---|
| external | External — customer-managed | External — customer-managed | Docker volume on instance |
| active-active | External — customer-managed | External — customer-managed | External — customer-managed |
| disk | Internal directory on instance | Internal directory on instance | Docker volume on instance |
Organization Level
Reserve for the Platform Team — manages global settings, policies, providers, and org-level variables.
- Keep this team small (2–3 people)
- Most admins should only have Manage Workspaces — not full org controls
Project Level
Group related workspaces by Business Unit or environment. Permissions at the project level cascade to all workspaces.
- Lead Engineers manage their project domain
- Teams create their own workspaces without central admin approval every time
Workspace Level
Use only for exceptions or highly sensitive standalone resources that don't fit a project grouping.
- Avoid managing the majority of permissions here — it doesn't scale
- Individual contributors get Read / Plan / Write as appropriate
Map your existing team structure into these standardized personas. Then assign TFE permission levels to match — not to individuals, but to IdP Groups (Okta/AD) mapped to TFE Teams.
| Persona | TFE Permission Level | Capabilities |
|---|---|---|
| Platform Admin | Org Admin | Manage teams, SSO, and global module registries. Keep this group very small. |
| Lead Engineer | Project Admin | Create/delete workspaces within a specific project; manage team access for their domain. |
| Developer | Write | Trigger runs, update variables, see plans and applies. Cannot manage workspace settings. |
| Security Auditor | Read-Only | View state files and run history for compliance auditing. Cannot change anything. |
Always map Identity Provider (IdP) Groups (Okta/Active Directory) to HCP Terraform Teams — never assign permissions to individual users. When an employee leaves, their TFE access terminates automatically with their IdP account. No manual deprovisioning required.
State files and variables are the most sensitive assets in TFE. By default, "write" access allows a user to see the state file. For high-security environments, use Custom Workspace Permissions.
- Grant a team the ability to Apply changes without being able to read the state file or sensitive variable values
- Set `state-versions` to `none` or `read-outputs` — prevents downloading raw state JSON (which may contain passwords)
- Set `variables` to `none` — users trigger runs that use variables, but cannot see the sensitive values in the UI
- By default, workspaces are isolated
- If Workspace B needs an output from Workspace A, you must explicitly enable Remote State Sharing
- Never use "Share with all workspaces" — explicitly list the workspaces allowed to read outputs
- Consider using HCP Terraform Outputs or targeted data sources instead of sharing full state
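Reading a single output from another workspace, rather than sharing full state, looks like this (organization and workspace names are hypothetical):

```hcl
data "terraform_remote_state" "network" {
  backend = "remote"
  config = {
    organization = "acme"
    workspaces = {
      name = "network-prod"   # Workspace A must allow this workspace to read its outputs
    }
  }
}

resource "aws_instance" "app" {
  ami       = var.ami_id
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
```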
In air-gapped TFE deployments, the TFE encryption password will likely be wrapped in a Hardware Security Module (HSM) or a cloud KMS. If TFE loses its unseal key, all data becomes unreadable. Document the key recovery procedure before you need it.