Home
Terraform Discovery & Implementation
🧭 Two Navigation Modes
Toggle the sidebar between Phase view (linear discovery flow) and Domain view (jump directly to any technical area). Same content — two entry paths.
⚡ Decision Points
Amber boxes are branching decisions — follow YES/NO paths to the right recommendation. Deal Breakers must be checked before Phase 1 begins.
🎙️ Discovery Questions
Teal-labeled prompts are conversation starters. Adapt to the customer's language — don't read them verbatim. The goal is to understand, not interrogate.
PRE-DISCOVERY
⛔ Deal Breaker Categories
Check these before any technical discovery begins. Any one of these can determine the platform decision before Phase 1 starts.
Does the customer have environments with zero internet access (air-gapped)?
✓ YES — Air-gapped
→ Terraform Enterprise (TFE) Non-negotiable. TFE is self-managed, on-prem/VPC, fully offline. See air-gap definition below.
✗ NO — But strict networking
→ HCP Terraform + Agents HCP Terraform Agents (self-hosted runners) keep execution inside the customer network while the control plane remains SaaS.

Air-Gap — Formal Definition

An air-gapped environment is one where the entire installation and its environment are physically and logically isolated from external networks, including the public internet.

Key Characteristics
  • No internet access — inbound or outbound
  • Manual installation and updates (no auto-update)
  • On-premises Version Control System (GitLab, Gitea)
  • Private Module Registry (self-hosted)
  • Private Container Registry
  • Internal license server
Operational Connectivity Requirements
  • All dependencies must be pre-staged internally
  • Provider binaries must be mirrored to internal registry
  • Terraform binary updates require manual pull and internal distribution
  • No external SaaS integrations (Slack, PagerDuty, etc.)
Architect Tip — Data Sovereignty

Ask separately: "Does your legal team mandate that state files or sensitive variables must never leave a specific cloud region or VPC?" Data sovereignty is distinct from air-gap and can push toward TFE even in cloud-connected environments.

Does the customer have a dedicated Platform Engineering team with cycles to manage Linux, PostgreSQL, and S3-compatible storage backends?
✓ YES — Wants control
→ Consider TFE They can manage platform lifecycle and pin versions to their own upgrade schedule.
✗ NO — Wants to ship code
→ HCP Terraform Fully managed, always latest. Focus entirely on writing Terraform — not running it.
Upgrade Tolerance

HCP Terraform = always on latest.
TFE = customer chooses upgrade cadence.
If they need to pin platform versions for change management or stability, TFE wins this axis regardless of staffing.

Does Terraform need to reach internal APIs behind a heavy firewall or on-premises (e.g., private GitLab, legacy IPAM, on-prem DNS)?
✓ YES — Heavy internal access
→ TFE preferred TFE lives inside that network. Internal service access is seamless — no tunnels, no agents.
✗ NO — Or bridgeable
→ HCP Terraform + Agent or VPN/Direct Connect Adds complexity but remains viable. HCP Agent runs inside the customer network and proxies execution.
⚠️ Urgent Action Required

The legacy "free plan" ended March 31, 2026. Any customer still on that plan needs a migration conversation immediately — this is not optional.

Does the customer have millions of low-value, high-count resources (e.g., individual DNS records, specific tag resources)?
✓ YES — High resource count
→ Evaluate TFE licensing RUM pricing on HCP Terraform scales with resource count. TFE license-based pricing is often more predictable for massive-scale organizations.
✗ NO — Standard footprint
→ HCP Terraform RUM likely acceptable Standard resource composition won't trigger cost surprises on the RUM model.
PHASE 01 · GOVERNANCE & MATURITY
Vision & Governance
Goal: Understand who owns the infrastructure and why before any technical specs. Determine the high-level deployment model (HCP vs Enterprise).
Discovery Prompt
"Walk me through a typical 'Day 1' today. If a developer needs a new environment for deploying an application, what happens from the moment they ask until resources are live?"
Method | Pros | Cons | Action
Click-Ops (manual console) | No barrier to entry; good for PoCs and discovery | Non-repeatable, drift-prone, high human error, no audit trail | Educate & Migrate
Imperative Scripts (Bash / Python / CLI) | Familiar, flexible, no extra tools required | State-blind, spaghetti maintenance, no dependency management | Migrate — Day 2 glue only
Terraform / OpenTofu (declarative IaC) | State management, plan-before-apply, 3000+ providers | HCL learning curve, state complexity, abstracted errors | Mature & Expand
Cloud-Native IaC (CloudFormation / Bicep) | Day-zero provider support, no state management burden | Cloud lock-in, cannot manage SaaS tools or cross-cloud | Supplement Only
Discovery Prompt
"What is the team's current proficiency with HCL and Git workflows? Are developers writing Terraform today, or are we starting from scratch?"
Why This Matters

Skill level directly determines the implementation path, the starting point for automation maturity, and whether education and training need to be scoped as a prerequisite. Pitching API-driven workflows to a team that's never written HCL is a failure mode.

What is the team's HCL and Git workflow proficiency?
Beginner
→ Education first Start with managed HCP Terraform, simple remote state, and module consumption from a central registry. Do not introduce CI/CD automation or policy enforcement until HCL basics are solid.
Intermediate
→ VCS-driven workflows Ready for VCS-integrated CI/CD, speculative plans on PR, module versioning, and basic Sentinel policies. Registry consumption is standard.
Advanced
→ API-driven + full ownership Custom pipelines, API-driven workflows, module authorship, registry publishing, advanced policy frameworks. Focus shifts to governance and scale.
Discovery Prompts
"Who are the primary 'producers' of Terraform code versus the 'consumers'? Is this a centralized platform team model, or are individual app teams owning their own infrastructure?"
"Do your network team and your application team have the same manager?" — If they are separate orgs, they should have separate repositories.
Is infrastructure ownership centralized (platform team) or decentralized (app teams)?
Centralized
→ Monorepo + Gatekeeper model Central team manages code, CI/CD, and modules. Golden module registry with gated publishing. High consistency, risk of bottleneck at scale.
Decentralized
→ Multi-repo + Federated model Teams own their infrastructure domains. Federated module ownership. Requires strong versioning governance to prevent fragmentation.
Discovery Prompts
"Are there strict regulatory requirements — SOC2, HIPAA, PCI-DSS, FedRAMP — that mandate Policy as Code enforcement?"
"Are there any hard requirements regarding where Terraform state and execution data live? Data residency requirements or air-gap policies that prevent using a SaaS platform?"
Are there regulatory mandates that require automated policy enforcement?
✓ YES — Regulated
→ Sentinel / OPA required Policy as Code is not optional. Compliance evidence must be system-generated (JSON evaluation records), not manual. See Phase 4 — Policy Enforcement.
✗ NO — Standard compliance
→ Native validations as starting point Variable rules, syntax checking, and module-specific validation cover most standard compliance needs. Migrate to Sentinel/OPA as maturity grows.
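Native validations live directly in plain HCL. A minimal sketch of a variable guardrail — the variable name and approved-size list are hypothetical:

```hcl
variable "db_instance_class" {
  type        = string
  description = "RDS instance class for this environment"

  validation {
    # Reject anything outside the approved size list at plan time —
    # no policy framework required.
    condition     = contains(["db.t3.medium", "db.r6g.large"], var.db_instance_class)
    error_message = "Instance class must be one of the approved sizes."
  }
}
```

This fails the run during `terraform plan`, which makes it a reasonable stepping stone before introducing Sentinel or OPA.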

This is the same product. The decision is operational convenience vs. total environmental control. Refer to Deal Breakers A–D before running this comparison — they may decide it before you get here.

☁️ HCP Terraform (SaaS)
  • Management: Fully managed by HashiCorp
  • Updates: Automatic — always latest
  • Infrastructure: Runs on HashiCorp Cloud Platform
  • Pricing: Resources Under Management (RUM)
  • Best for: Speed to value, low ops overhead
🏢 Terraform Enterprise (TFE)
  • Management: Customer-managed (VM or Kubernetes)
  • Updates: Manual — customer controls upgrade cycle
  • Infrastructure: Customer VPC, data center, or air-gap
  • Pricing: License-based (per workspace/user)
  • Best for: Strict compliance, air-gapped, high-security
Does the customer require an air-gapped environment?
✓ YES
→ Terraform Enterprise. Non-negotiable. Full stop.
✗ NO
→ HCP Terraform. Then evaluate run capacity requirements — see next section.
Architect Tip — Magic Words

"Global, Consistent, Standardized" → lean toward Registry-driven workflows.

"Agile, Autonomous, Fast" → lean toward VCS-driven with OPA guardrails.

CLI-only insistence → dig deeper. They often don't trust automation yet. This is an opportunity to demo Speculative Plans and build confidence in the pipeline.

Discovery Prompts
"Do you have high run-capacity requirements — 50 or more simultaneous Terraform jobs?"
"Is your team small (under 10 engineers) or do you lack a dedicated platform admin who can manage the Terraform platform itself?"
Does the customer require 50+ simultaneous runs, or do they have dedicated platform admins?
✓ YES — High capacity / dedicated team
→ Plus Tier or scale-out compute runners Scale your own compute runners to handle concurrent job load. Requires platform admin capacity.
✗ NO — Small team / low volume
→ Standard Tier Standard Tier handles the majority of enterprise workloads. Upgrade path is straightforward when scale demands it.
Discovery Prompt
"Is this a single cloud (AWS/Azure/GCP), multi-cloud, or hybrid cloud environment? And how do you expect that to evolve over the next 2–3 years?"
What is the customer's cloud footprint?
Single Cloud
→ Cloud-native backends viable S3 + DynamoDB (AWS), Azure Blob, or GCS are all strong options. HCP Terraform also well-suited. See State Management section.
Multi-Cloud
→ HCP Terraform strongly preferred Unified state management across providers. Avoid juggling multiple cloud-native backends and state backends simultaneously.
Hybrid (Cloud + On-Prem)
→ TFE or HCP + Agent On-prem access requirements likely push toward TFE. Revisit Deal Breaker C for network complexity evaluation.
Why This Matters Upstream

Cloud footprint directly informs state backend selection. A multi-cloud customer who picks cloud-native backends will end up managing three separate state systems with different locking mechanisms and IAM configurations. That operational burden is avoidable with HCP Terraform.

PHASE 02 · WORKFLOW & AUTOMATION
Workflow & Golden Path
Goal: Determine the automation strategy — how code moves from a developer's laptop to production. VCS vs API-driven is the core routing decision here.
Discovery Prompts
"Where do your developers spend most of their time — interacting with a UI, a CLI, or within a VCS like GitHub or GitLab?"
"Do you have an existing CI/CD standard — GitHub Actions, Jenkins, GitLab CI? Do you want Terraform to manage execution natively on a PR event, or stay within your existing pipeline?"
Mode | Trigger | Best For | Governance
VCS-Driven (recommended default) | Git events (PR/Merge) | Most teams; standard apps | Highest — built-in audit trail
CLI-Driven | terraform apply from terminal; remote execution | Iterative dev; break-fix | Medium — consistent but manual
API-Driven / Custom CI/CD | External script or orchestrator | Complex pipelines; multi-stage | Customizable — high setup effort
Generic CI/CD (DIY) | Jenkins / GitHub Actions (non-HCP) | Non-HCP Terraform users | Lowest — manual state and lock mgmt
Does the customer want to use their own CI tool but retain HCP Terraform for state and policy?
✓ YES — Custom pipeline
→ API-Driven Workflow Customer owns the plumbing. They write scripts to handle API tokens, upload config, and poll for run status. More flexible — more operational ownership.
✗ NO — Native preferred
→ VCS-Driven Workflow Connect directly to source control. Speculative Plans run automatically on PR open. State is locked during the run. Lowest friction path.
Maturity Model Framing

Present this as a maturity progression: CLI → VCS-driven → API-driven. Most orgs start at CLI, move to VCS for standard workflows, and eventually adopt API-driven for multi-stage orchestration. Meet them where they are, show them where they're going — don't force the jump.

Discovery Prompt
"How do your developers and pipelines currently authenticate to the cloud? Are you using static IAM keys, service accounts with long-lived secrets, or have you moved to OIDC-based dynamic credentials?"
What is the current cloud authentication method for CI/CD pipelines?
Static IAM Keys / Long-lived Secrets
→ Migrate to OIDC — priority action Long-lived keys are a security liability. Even a compromised CI/CD pipeline yields permanent admin-level access. OIDC eliminates this exposure.
OIDC / Dynamic Credentials
→ Configure HCP Dynamic Credentials Already on the right path. Configure HCP Terraform Workload Identity to generate temporary OIDC tokens per run. Validate TTL and scope.
Service Accounts + Rotation Policy
→ Evaluate rotation cadence + TTL Better than static keys if rotation is enforced. Prioritize migration to OIDC. Ensure no service account keys are committed to Git.
How OIDC Works

HCP Terraform generates a temporary OIDC token per run → cloud provider trusts this token via Workload Identity Federation → grants a short-lived session (typically 1 hour). Even if the CI/CD pipeline is compromised, there are no static admin keys to steal — the token expires.
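On the workspace side, this is mostly configuration rather than code. A sketch, assuming the standard HCP Terraform dynamic-credentials variable names for AWS (the account ID and role name below are hypothetical):

```hcl
# Workspace environment variables (set in the HCP Terraform UI or via a
# Variable Set) — these enable dynamic credentials for AWS:
#   TFC_AWS_PROVIDER_AUTH = "true"
#   TFC_AWS_RUN_ROLE_ARN  = "arn:aws:iam::123456789012:role/hcp-terraform-run"

provider "aws" {
  region = "us-east-1"
  # Note: no access_key / secret_key. Each run receives a short-lived
  # session assumed via the per-run OIDC token, so there is nothing
  # static to rotate or steal.
}
```

The cloud-side prerequisite is an IAM role whose trust policy federates with HCP Terraform's OIDC issuer and is scoped to the organization/workspace.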

Golden Rule — Non-Negotiable

Regardless of tooling or team maturity: never allow an apply in Production without a documented and approved plan. This applies at every maturity level.

🔀 GitOps Model
  • PR is the gate — approval happens in VCS
  • Speculative Plan posts as a comment on PR open
  • Code Owner must approve before merge to main
  • Enforce branch protection rules — no direct push to main
  • Best for: High-velocity dev and staging
🔑 Platform Model (HCP Manual Apply)
  • Workspace set to "Manual Apply"
  • Plan runs but pauses in "Pending" state
  • Lead logs into HCP UI, reviews final plan, confirms apply
  • Even code-approved changes get a final production gate
  • Best for: Production environments
🏛️ Enterprise Model (ServiceNow / Jira)
  • Run Tasks trigger a Change Request in ServiceNow
  • Apply blocked until ticket moves to Approved state
  • High friction — reserve for Layer 1 and 2 core infra only
  • Best for: High-compliance workloads, CAB-required changes
🤖 Automated Approvals (Shift-Left)
  • Sentinel/OPA auto-approves changes that pass all policies
  • Hard Mandatory violations block and require senior override
  • Override events must be captured in the audit stream
  • Goal: Reduce toil without removing governance
Environment | Approval Method | Who Approves?
Sandbox / Dev | Auto-Apply (no manual gate) | System — policy check only
Staging / QA | VCS Pull Request Approval | Peer Developer
Production | HCP Manual Review + Policy Check | Cloud Lead / SRE
Core Network / Identity | External CR (ServiceNow) + Two-Key Policy | Change Advisory Board (CAB)
Discovery Prompt
"Where do you currently store sensitive credentials like API keys and database passwords — Vault, AWS Secrets Manager, environment variables? How are they injected into Terraform runs?"
Goal

Move customers away from static secret management toward identity-based dynamic injection. Even if they can't reach the gold standard today, set the direction and build a roadmap toward it.

Option | Pros | Cons | Verdict
Input Variable (sensitive = true) | Easy to use; redacted from CLI output and logs | Stored in plain text in the state file — anyone with state access can read it | OK for non-critical values
Secrets Manager (data source) | Centralized; supports rotation | Secret value persists in the state file after the first fetch | Legacy by 2026
HCP Vault / Dynamic Secrets | Just-in-time credentials; automatic TTL expiry | Requires Vault infrastructure and management overhead | Best for DB/App credentials
Ephemeral Resources (TF 1.10+) | Never stored in state — only used during the apply run | Requires TF 1.10+ — verify the customer's version | ⭐ 2026 Gold Standard

Ephemeral Resources — Example

```hcl
# The secret is fetched at apply time only.
# It is never written to terraform.tfstate.
ephemeral "vault_kv_secret_v2" "db_password" {
  mount = "secret"
  name  = "database"
}

resource "aws_db_instance" "main" {
  password = ephemeral.vault_kv_secret_v2.db_password.data["password"]
  # ↑ Used during apply, gone after. Never in state.
}
```

State File — Defense in Depth

No matter how secrets are injected, the state file is always a potential vulnerability. Apply defense in depth:

State encryption at rest — HCP TF encrypted by default; cloud backends use KMS/CMK
RBAC on state access — only the "Apply" service account has read access to the raw state file
Developers = Read-Only to plans, not state — they should never need to see raw state
HCP Variable Sets marked Sensitive — workspace admins cannot see values once entered
PHASE 03 · CODE ARCHITECTURE & MODULARITY · PLATFORM & BACKEND STRATEGY
Code & State Architecture
Goal: Define blast radius and module strategy. State file decisions are the most critical in the engagement — they are difficult to change later and can become single points of failure.
Most Critical Decision in This Engagement

Wrong state strategy causes state drift, corrupted environments, and security vulnerabilities — state files can contain sensitive data in plain text. Get this right before any other architectural decision.

The Three Truths — Frame This for Customers

Desired State: Configuration files in Version Control

Known State: What Terraform remembers in the .tfstate file

Actual State: The reality in the cloud provider — drift occurs when this deviates from Known or Desired

Backend | When to Use | Key Trade-offs
Local State | Solo dev experimentation only. Never for org use. | Zero config, zero collaboration, zero locking — data on a laptop is a liability
Cloud-Native Backends (S3, Azure Blob, GCS) | Small-to-medium team with strong single-cloud presence | Cost-effective (pennies/month), state locking — but DIY security, versioning, and IAM
HCP Terraform / TFE (recommended) | Enterprise governance, private module registries, low-ops approach | Native state mgmt, built-in RBAC, Sentinel/OPA — cost is per-RUM; potential lock-in
TaCOS (Spacelift, Scalr, env0) | Multi-IaC environments (TF + Pulumi + CloudFormation + OpenTofu) | Unified single pane of glass, drift detection, TTL environments — third-party dependency

Cloud-Native Backend Selection by Provider

AWS — S3 + DynamoDB
  • State stored in S3 bucket
  • Locking via DynamoDB table
  • Enable bucket versioning (recovery from corruption)
  • Enable KMS encryption at rest
  • Restrict bucket policy — no public access
Azure — Blob Storage
  • State stored in Azure Blob container
  • Native state locking — no extra service needed
  • Enable blob versioning for recovery
  • Enable Customer Managed Key (CMK) encryption
  • Use private endpoints to restrict access
GCP — Cloud Storage
  • State stored in GCS bucket
  • Native state locking built in
  • Enable object versioning for recovery
  • Enable CMEK encryption at rest
  • Restrict with uniform bucket-level access
Hardening — Required for All Cloud Backends

Enable bucket/blob versioning on all state backends — this is your recovery mechanism if state gets corrupted. Enable KMS/CMK encryption to protect sensitive data at rest. These are non-negotiable regardless of which cloud provider is used.
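As a concrete sketch, an AWS backend that satisfies these hardening requirements might look like this — the bucket and lock-table names are hypothetical, and versioning plus KMS are assumed to be enabled on the bucket itself:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"           # hypothetical; versioning + KMS enabled on the bucket
    key            = "states/prod/terraform.tfstate"  # one key prefix per environment
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                # state locking via DynamoDB
    encrypt        = true                             # server-side encryption at rest
  }
}
```

The Azure and GCP equivalents follow the same shape with the `azurerm` and `gcs` backends, where locking is native and no separate lock service is needed.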

State Isolation — Blast Radius

Discovery Prompt
"How large of a 'failure domain' can you tolerate? If a state file were corrupted, how much infrastructure would be affected? Are you comfortable with that?"

Certain groupings of resources should have their own isolated state file. The smaller the state file scope, the smaller the blast radius. Always separate state backends (different S3 prefixes or HCP TF Workspaces) for each environment to prevent cross-environment resource deployment.

Monorepo
Pros:
  • Full org visibility — search across all infra
  • Shared CI/CD pipelines and linting rules
  • Simplified VCS-level RBAC
Cons:
  • One bad CI trigger can break company-wide
  • VCS performance degrades at scale
  • High merge conflict rate across teams

Best for: global infra layers. Does not allow easy module versioning.

Multi-repo
Pros:
  • Hard blast radius isolation between services
  • Clear team ownership per repo
  • Granular per-service RBAC
Cons:
  • Configuration drift harder to enforce across repos
  • Dependency tracking across 50+ repos is painful
  • Heavy CI/CD pipeline and secrets management overhead

Best for: app infra isolation. Enables easy module versioning.

Hybrid InnerSource ✓ Recommended
Golden Modules → Multi-repo:
  • Each reusable module in its own repo
  • Version independently with SemVer tags
  • Publish to Private Module Registry
Live Environments → Monorepo:
  • Small/mid-size: single "infra-live" monorepo
  • Enterprise: split by Business Unit or foundation layer

Three Litmus Tests

Test 1 — Team Structure

"Do your network team and your application team have the same manager?" If they are separate organizations, they should have separate repositories. Forcing different teams into one repo creates a bottleneck on the peer review process.

Test 2 — Deployment Velocity

"How often do you deploy — once a week or 50 times a day?" High-velocity teams need multi-repo. A global lock on a monorepo can prevent a critical hotfix while someone else runs a long-lived terraform apply on a different section.

Test 3 — Blast Radius Tolerance

"If a developer makes a mistake in a GitHub Action and runs a 'destroy' at the root of the repo, what is lost?" If the answer is "the whole company" — move to multi-repo or multi-project structure immediately to enforce hard boundaries.

Discovery Prompt
"How do you currently separate Development, Staging, and Production — different cloud accounts, different VPCs, or just naming conventions in the same account?"
Workspaces

Same codebase, switched context. Good for identical, short-lived environments.


⚠️ Risks:
  • Shared backend credentials across environments → potential cross-env apply
  • Workspaces are invisible on disk — can't browse environments in the repo
  • Best used for ephemeral or short-lived environments only
Directory Structure ✓ Preferred for Production

Each environment gets its own folder and its own root module with its own backend.


✓ Advantages:
  • Credential isolation: different IAM roles per directory
  • Version pinning: test new module versions in dev while prod stays pinned
  • Clear visibility: browsing the repo shows exactly what environments exist

Solving the Backend Copy/Paste Problem

The common complaint about directory isolation is copying backend.tf into every environment folder. Three solutions:

Terragrunt
  • Popular wrapper that handles backend config dynamically — eliminates the copy/paste entirely. Standard recommendation for mature teams.
Symlinks
  • Link a shared providers.tf into every environment folder. Simple but brittle — symlinks can break in some CI/CD environments.
Partial Configuration
  • terraform init -backend-config=path/to/backend.dev.hcl injects environment-specific backend settings at runtime. No file duplication — slightly more complex init process.
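The partial configuration pattern can be sketched like this — the bucket name and key are hypothetical:

```hcl
# environments/dev/backend.tf — backend block intentionally left empty
# ("partial configuration"); values arrive at init time.
terraform {
  backend "s3" {}
}

# backend.dev.hcl — environment-specific values, kept alongside the env:
#   bucket = "acme-terraform-state"
#   key    = "states/dev/terraform.tfstate"
#   region = "us-east-1"
#
# Initialize with:
#   terraform init -backend-config=backend.dev.hcl
```

Each environment gets its own `backend.*.hcl` file, so the root module code is identical across folders while the state location stays isolated.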
Example Directory Structure

```
.
├── modules/                  # Reusable code (VPC, RDS, EKS)
│   ├── vpc/
│   └── database/
└── environments/
    ├── dev/
    │   ├── main.tf           # Calls modules with 'dev' variables
    │   └── backend.tf        # Points to /states/dev/terraform.tfstate
    ├── staging/
    │   ├── main.tf           # Calls modules with 'staging' variables
    │   └── backend.tf        # Points to /states/staging/terraform.tfstate
    └── prod/
        ├── main.tf           # Calls modules with 'prod' variables
        └── backend.tf        # Points to /states/prod/terraform.tfstate
```
Layer 1 Identity & Global Governance
Changed infrequently. Foundational to everything else. Owned exclusively by Security Team.
Isolate: IAM Roles/Policies, SCPs, Guardrails, SSO Config. Reason: App devs must never touch security boundaries.
Layer 2 Core Networking
High-stakes, stable. Network changes are highest-risk. Owned by Cloud NetEng.
Isolate: VPCs, Subnets, Transit GW, DirectConnect, DNS Hubs. Reason: Day 2 app changes must not affect the network fabric.
Layer 3 Platform & Shared Services
Bridges network and application code. Different lifecycle than the apps running on it. Owned by Platform/DevOps.
Isolate: Kubernetes clusters, shared RDS, Logging/Monitoring, Secrets Manager. Reason: K8s version upgrade should not redeploy 50 apps.
Layer 4 Application Services
Changes most frequently. Each app or microservice should have its own state file. Owned by App Dev Teams.
Isolate: App-specific S3, Lambda, Microservices, App DBs. Reason: Teams need to move at their own velocity without locking others out.
Layer | Change Freq | Blast Radius | Ownership
Identity | Very Low | Global / Total | Security Team
Networking | Low | Regional / Critical | Cloud NetEng
Platform | Medium | Cluster-wide | Platform / DevOps
Application | High | Service-specific | App Dev Teams
Cross-Layer Data Access — Critical Pattern

Layer 4 (Application) needs data from Layer 2 (Network) — like vpc_id or subnet_ids. Do not use terraform_remote_state data sources — they require access to the entire state file, which violates blast radius isolation. Instead: use HCP Terraform Outputs or a targeted data source to read only the specific values needed across layers.
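One way to implement the targeted read is the `tfe` provider's `tfe_outputs` data source, which exposes only the outputs a workspace has published, not its raw state. A sketch — the organization, workspace, and output names are hypothetical:

```hcl
# Layer 4 (application) workspace reads published outputs from the
# Layer 2 (network) workspace without any access to its state file.
data "tfe_outputs" "network" {
  organization = "acme-corp"          # hypothetical org
  workspace    = "core-network-prod"  # hypothetical workspace
}

resource "aws_instance" "app" {
  ami           = "ami-0abc1234"      # hypothetical AMI
  instance_type = "t3.micro"
  subnet_id     = data.tfe_outputs.network.values.private_subnet_ids[0]
}
```

The network team controls exactly which values cross the layer boundary by choosing what to declare as workspace outputs.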

Discovery Prompts
"Do you have a set of 'Golden Images' or standard infrastructure patterns that every team must follow? How do you plan to distribute those — shared folders, or a formal private module repository?"
"Will teams consume modules provided by a central team, or will each team build their own?"
Centralized (Gatekeeper)
Platform team owns, writes, and maintains all modules. High consistency, strict compliance. Becomes a bottleneck at scale — feature requests pile up.

Best for: Regulated industries; orgs starting their TF journey.
Federated (Community)
App teams own modules for their stack. Domain experts write their own code; faster iteration. Risk of fragmentation — different teams solving the same problem in incompatible ways.

Best for: Large, mature orgs with high DevOps autonomy.
Hybrid InnerSource ✓ Recommended
Platform Team owns base modules. Any team can contribute improvements via PR. "You build it, you help maintain it."

Best for: Most enterprise clients targeting scale without sacrificing governance.

Ownership Matrix

Task | Platform Team | App/Stream Team
Foundational Modules (VPC, IAM, DNS) | Owner | Consumer
Service Modules (RDS, S3, EKS) | Co-Owner (Standards) | Co-Owner (Features)
Application Logic Infrastructure | Consultant | Owner
Security Scanning (Checkov/Snyk) | Sets Policy | Fixes Findings

Key Ownership Pillars

Private Registry as Source of Truth: If it's not in the HCP Terraform Private Module Registry, it's not a supported module. Clear boundary between experimental and production-ready code.
Code Owners in VCS: GitHub CODEOWNERS or GitLab Protected Branches — any change to a module requires approval from the designated owner group.
Semantic Versioning (SemVer): Owners must commit to vX.Y.Z tagging. Deprecation requires a migration path or sunset period — not a silent removal.
Testing as Ownership Prerequisite: No PR merged unless terraform test integration tests pass. Ownership means knowing when your module is broken before your consumers do.
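A minimal `terraform test` case looks like this — the file name, input variable, and resource address are hypothetical and assume a VPC module:

```hcl
# tests/vpc.tftest.hcl — executed by `terraform test` (TF 1.6+)
run "plan_respects_requested_cidr" {
  command = plan

  variables {
    vpc_cidr = "10.0.0.0/16"  # hypothetical module input
  }

  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR does not match the requested range."
  }
}
```

Wiring this into the module's CI pipeline before the registry publish step gives owners the early-warning signal this pillar calls for.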
Discovery Prompt
"Who is responsible for fixing module code when a provider update breaks something? Who owns the SLA for getting consumers unblocked when that happens?"
Why This Is Its Own Decision

Module ownership (who writes it) and registry ownership (who is accountable for it being available and functional) are not always the same team. Blurring this line creates gaps — modules get published but nobody owns the support burden when a provider version upgrade breaks downstream consumers.

Who is responsible for fixing code when a provider update breaks a published module?
Central Platform Team
→ Gated registry with strict SLA Platform Team owns publishing rights and patch SLA. Version bumps require their review and sign-off. Consumers can always depend on the registry being reliable — but the team becomes a bottleneck.
Domain App Teams
→ Federated registry with tagged ownership Each module has a tagged owning team in the registry. That team owns the patch SLA for their modules. Platform Team still controls the registry infrastructure and publishing standards.

Registry Best Practices

Use HCP Terraform Private Module Registry as the single distribution point — eliminates "which version is in the shared folder?" questions entirely.
Pin provider version constraints in every module — required_providers blocks must specify minimum and maximum acceptable versions to prevent surprise breakage on provider updates.
Deprecation policy: Retiring a module requires a migration path document and a minimum sunset window communicated to all consumers before removal from the registry.
No module published without passing tests: terraform test must be a gate in the module's CI pipeline before the registry publish step runs.
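Provider pinning and registry version consumption can be sketched as follows — the registry path and version numbers are hypothetical:

```hcl
# Inside the module: bound provider range prevents surprise breakage
# when a new major provider version ships.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0.0, < 6.0.0"  # explicit minimum and maximum
    }
  }
}

# Consumer side: pin to a SemVer range from the private registry
# rather than tracking latest.
module "vpc" {
  source  = "app.terraform.io/acme-corp/vpc/aws"  # hypothetical registry path
  version = "~> 2.1"                              # accepts 2.1.x patches only
}
```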
PHASE 04 · SECURITY & POLICY ENFORCEMENT
Policy & Day 2 Operations
Goal: Establish guardrails and long-term maintenance. This is where compliance becomes proactive — automated and audit-ready — rather than reactive and manual.
Discovery Prompts
"If a developer tries to deploy an expensive instance type or an unencrypted database, how is that caught today — manual review, or automated enforcement?"
"Are there specific rules that must always be enforced? For example: 'No public S3 buckets', 'All instances must have a Cost Center tag', 'No unencrypted databases'?"
"Is there a requirement to see the estimated cost of a change before it is applied to the environment?"
How complex and cross-cutting are the policy requirements?
High Compliance
→ Sentinel / OPA required Cross-resource logic, security guardrails, auditing, dynamic enforcement levels, centralized control. Produces JSON audit records. Enables soft-mandatory override capture.
Standard Compliance
→ Native validations as starting point Variable constraints, syntax checking, and module-specific validation handle most standard compliance needs. Migrate to Sentinel/OPA as requirements grow.

Policy as Compliance Evidence

Key Selling Point for Regulated Customers

When Sentinel or OPA evaluates a run, it produces a JSON record of evaluation — every rule that was checked, and whether it passed or failed. This is auditor-ready evidence that compliance rules are being enforced by the system, not by humans manually reviewing configurations after the fact. It converts audits from manual retrospective reviews into automated, continuous proof.

If a Hard Mandatory policy is violated and a senior architect performs a manual override, that override event — including justification — must be captured in the audit stream. Overrides are not silent. This is what gives auditors confidence that exceptions are controlled and documented.

Shift-Left Approach

Use Sentinel or OPA to automatically block non-compliant infrastructure before it is provisioned. The execution mode is the last line of defense. Policy should catch violations during the Plan phase — never rely on post-apply remediation as the primary control.
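As an illustrative sketch of a plan-phase guardrail, a Sentinel policy enforcing a mandatory tag might look like this (the resource type and tag name are examples, not a prescribed standard):

```sentinel
import "tfplan/v2" as tfplan

# Collect EC2 instances being created or updated in this plan.
ec2_instances = filter tfplan.resource_changes as _, rc {
	rc.type is "aws_instance" and
	rc.mode is "managed" and
	(rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Every instance must carry a CostCenter tag before apply is allowed.
main = rule {
	all ec2_instances as _, rc {
		rc.change.after.tags is not null and
		"CostCenter" in keys(rc.change.after.tags)
	}
}
```

Attached at hard-mandatory enforcement, a failing rule blocks the run before provisioning; at soft-mandatory, it pauses for an override that lands in the audit stream.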

RBAC aligns the security model with organizational structure to prevent blast-radius accidents. Use a funnel approach: broad at the Organization level, granular at the Workspace level.

Level | Who | Permissions | Notes
Organization | CCoE / Lead Architects | Manage Policies, Registry, Teams | Keep the Owners team to 2–3 max. Most admins should be limited to Manage Workspaces — not full org control.
Project | Lead Engineers per domain | Admin or Write at the Project level | Teams create their own workspaces without needing central admin approval every time.
Workspace | Individual Contributors | Read / Plan / Write | Most developers should be Plan-only in production workspaces.

Who Should Apply in Production?

Service Account First Rule

Humans should almost never have direct apply permissions in production. Grant apply permissions to a Service Principal only (OIDC/Dynamic Credentials). Workflow: Dev triggers PR → CI/CD runs Plan → Human reviews/approves → System executes Apply.

Two-Key Policy — Core Infrastructure

For Layer 1 (Identity & Governance) and Layer 2 (Core Networking) resources: implement a required two-approval gate before any apply can proceed. This ensures a single rogue user or compromised account cannot destroy foundational infrastructure. Implement via HCP Terraform Team approvals requiring two distinct reviewers, or VCS branch protection requiring two separate approvals.
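One way to enforce the two-approval gate at the VCS layer is with the Terraform GitHub provider — a hedged sketch, assuming a repository named `core-network` and the provider's `github_branch_protection` resource:

```hcl
# Hypothetical two-key gate on the core networking repo:
# no merge to main (and therefore no apply-on-merge) without
# two distinct approving reviews.
resource "github_branch_protection" "core_network" {
  repository_id = "core-network" # example repo name
  pattern       = "main"

  required_pull_request_reviews {
    required_approving_review_count = 2
    dismiss_stale_reviews           = true # re-approve after new pushes
  }
}
```

The same intent can be expressed with HCP Terraform team approvals; the VCS route has the advantage of guarding the code itself, not just the run.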

Team | Layer / Workspace | Permission | Rationale
Networking Team | Core Network / VPC | Admin | They own the fabric and manage its lifecycle end to end
App Dev Team | Application Services | Write (non-prod) / Plan (prod) | High velocity in dev; protected gates in production
Security Team | All Workspaces | Read-Only | Auditing and compliance without operational risk
App Dev Team | Core Network | Read-Only | They consume vpc_id and subnet_ids — they must not change them
RBAC Best Practices

Eliminate static secrets: HCP Terraform Dynamic Credentials — workspace authenticates via OIDC token per run. No static secrets to steal.
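On the AWS side, dynamic credentials work by trusting HCP Terraform's OIDC issuer. A hedged sketch of the trust policy (organization, project, and workspace names are examples; the workspace itself just sets `TFC_AWS_PROVIDER_AUTH=true` and `TFC_AWS_RUN_ROLE_ARN` pointing at this role):

```hcl
# Hypothetical IAM trust policy: each HCP Terraform run exchanges a
# short-lived OIDC token for temporary AWS credentials. Nothing static
# is stored anywhere.
data "aws_iam_policy_document" "tfc_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.tfc.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "app.terraform.io:aud"
      values   = ["aws.workload.identity"]
    }

    condition {
      test     = "StringLike"
      # Scope the role to one org/project/workspace (names are examples)
      variable = "app.terraform.io:sub"
      values   = ["organization:example-org:project:*:workspace:example-ws:run_phase:*"]
    }
  }
}
```

The `sub` claim is the control surface: tighten the wildcards to pin a role to a single workspace, or even a single run phase.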

Map IdP Groups to Teams: Always map Okta/AD Groups to HCP Terraform Teams. User access is tied to the org's auth service, not directly to HCP users — offboarding is automatic.

Custom Permissions (Business Tier): Separate "Queue Run" (ability to trigger a plan) from "Approve Run" (ability to apply). This is the gold standard — not all Run writers should be Run approvers.

Discovery Prompts
"How do you currently detect 'Shadow IT' — manual changes made directly in the Cloud Console that aren't reflected in code?"
"Is drift detection a compliance requirement for you, or a nice-to-have?"
How to Frame This for Customers

Drift detection is not a monthly audit — it is a rhythmic heartbeat. Treat a "Drift Warning" with the same urgency as a "Build Failure." If your team wouldn't ignore a failed CI/CD pipeline, they shouldn't ignore drift either.

Workload Type | Recommended Frequency | Implementation Method
Critical Production | Continuous / Hourly | HCP Terraform Health Assessments (Standard/Plus Editions)
Standard Production | Daily (every 24h) | Scheduled CI/CD pipeline running terraform plan
Development / Sandbox | Weekly / On-Demand | Manual trigger or weekly cron job
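For the scheduled-pipeline rows, the core of the job is a single flag: `terraform plan -detailed-exitcode` exits 2 when the plan is non-empty, which is exactly the drift signal. A minimal CI step sketch (the alerting action is a placeholder for whatever channel integration the team uses):

```shell
# Scheduled drift check (e.g. a daily cron job in CI).
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift detected.
terraform plan -detailed-exitcode -input=false -lock=false
case $? in
  0) echo "No drift detected" ;;
  1) echo "Plan failed - investigate pipeline" >&2; exit 1 ;;
  2) echo "DRIFT DETECTED - alert the on-call channel" >&2; exit 2 ;;
esac
```

Treating exit code 2 as a pipeline failure is what makes drift carry the same weight as a broken build.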
Version Note

As of early 2026, HCP Terraform Health Assessments are part of Standard and Plus Editions only. Customers on older free tiers must implement drift detection via scheduled CI/CD pipelines.

Responding to Drift — Three Paths

🔄 Revert (Reconciliation)
  • Run terraform apply to overwrite the manual change
  • Use when: unauthorized changes, accidental deletions, compliance violations (opened security groups)
  • This is the standard response for drift that violates policy
✏️ Adopt (Alignment)
  • Update Terraform code to match the new cloud reality
  • Use when: an emergency manual fix was actually correct and must be codified permanently
  • Use the import block (TF 1.5+) to bring unmanaged resources under control without CLI surgery
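The Adopt path in HCL, using the declarative `import` block (Terraform 1.5+). The resource address and bucket name here are hypothetical examples:

```hcl
# Adopt an unmanaged resource into state on the next plan/apply,
# with no `terraform import` CLI surgery.
import {
  to = aws_s3_bucket.audit_logs      # hypothetical resource address
  id = "example-audit-logs-bucket"   # hypothetical existing bucket name
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "example-audit-logs-bucket"
}
```

Because the import is part of the plan, the adoption itself goes through the same PR review and policy gates as any other change.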

Prevention — Best Practices

Read-Only Cloud Console — Transition all human users to Read-Only roles. Changes must go through a Pull Request. Reserve "Break-Glass" admin access for true emergencies only.
Drift Alerts to Slack/Teams/PagerDuty — Integrate detection with the same channels used for build failures. Drift that nobody sees is drift that festers.
HCP Terraform Health Assessments — Centralized dashboard showing exactly which resources drifted, with a remediate button to generate the fix run.
Use ignore_changes for Expected Drift — Auto-Scaling Group desired_capacity changes by design. Use lifecycle { ignore_changes = [...] } for these attributes to suppress false-positive alerts.
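The `ignore_changes` practice above, sketched on a hypothetical auto-scaling group (resource and attribute names are the common AWS example, not taken from this document):

```hcl
resource "aws_autoscaling_group" "app" {
  min_size         = 1
  max_size         = 10
  desired_capacity = 2 # initial value only

  lifecycle {
    # desired_capacity changes at runtime via scaling policies by design;
    # suppress false-positive drift alerts for it.
    ignore_changes = [desired_capacity]
  }
}
```

Scope the list narrowly: every attribute added here is an attribute drift detection can no longer see.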
Discovery Prompts
"Do you need a full history of 'Who changed what and when' for every resource? Is this for internal ops visibility, or are external auditors requiring it?"
"How long is your compliance retention requirement for infrastructure change logs?"
Architect Rule

A terraform apply without a linked Git commit or Change Request ID should be treated as a Security Incident — not a routine deployment. Build this expectation into the customer's culture from day one.

🔗 Chain of Custody
  • What & When: Every HCP Run is automatically linked to a VCS commit — provides immutable record of what changed and when
  • Why: Integrate Jira or ServiceNow via Run Tasks to attach a Change Request ID to every Terraform plan — closes the "why was this done" gap
📤 Immutable Audit Logs
  • Local TF platform logs are a compliance risk — they can be rotated or deleted
  • Enable HCP Audit Log Streaming → external SIEM immediately
  • Track: sensitive variable access, policy overrides, service account applies, admin permission changes
  • SIEM retention must match industry compliance requirements
Audit Level | Component | Best Practice
System | HCP Terraform | Enable Audit Log Streaming to SIEM — do not rely on local logs
Process | VCS / PRs | Enforce signed commits and mandatory PR approvals before merge
Data | State File | CMK Encryption + Versioning + Access Logging on the state backend
Compliance | Sentinel / OPA | Log all Policy Pass/Fail events; capture all override events with justification
PHASE 05
Next Steps
Goal: Confirm success criteria, validate the implementation readiness checklist, and create a shared definition of "done" with the customer.

These are the non-negotiables for any mature Terraform implementation. Use this as a readiness check at the close of discovery. Check items off live during the conversation.

Standardize on Declarative: Move away from procedural scripts. Use Terraform to define the end state. Scripts are Day 2 glue — not Day 1 provisioning.
Version Everything in Git: If it's not in Git, it doesn't exist. All infrastructure configuration must live in source control.
Implement Remote State: Never store .tfstate files on a laptop. Use HCP Terraform, S3/GCS/Azure Blob, or TFE with state locking enabled.
Separate Environments: Use separate workspace directories or HCP TF workspaces for Dev, Staging, and Prod. Never share a state backend across environments.
Automate the Lifecycle: Shift from local CLI executions to remote executions via CI/CD or HCP Terraform. Local apply in production is not acceptable.
Plan on PR, Apply on Merge: No apply in Production without a documented and approved plan. This is the minimum governance bar.
Shift Left with Policy as Code: Sentinel or OPA blocks non-compliant changes before infrastructure is provisioned. Don't rely on post-apply remediation.
OIDC for All Credentials: No long-lived static IAM keys or service account secrets in CI/CD runners. Use OIDC Workload Identity for all cloud authentication.
State Encryption + Versioning: KMS or CMK encryption at rest. Bucket/blob versioning enabled for recovery from corruption.
Drift Alerts Integrated: Drift detection connected to Slack, Teams, or PagerDuty. Drift treated with same urgency as a build failure.
Audit Log Streaming Enabled: HCP Audit Logs streaming to external SIEM. Local logs alone do not meet compliance retention requirements.
Private Module Registry in Use: A central, versioned registry is the single source of truth for reusable modules. No informal shared folders.
Closing Discovery Question

"If we look back in six months, what is the one thing that must be true for you to consider this Terraform implementation a success?"

Use their answer to anchor the implementation plan. Common success signals and how they map to technical focus areas:

Customer Says | Primary Focus Area
"Developers can self-serve environments without waiting on the platform team" | Module registry, workspace templates, RBAC self-service at Project level
"No more manual changes in the console — everything is in code" | Read-only console policy, drift detection, break-glass process documentation
"We can prove compliance to external auditors without manual evidence collection" | Sentinel/OPA + audit log streaming to SIEM + state file encryption + policy-as-evidence
"Production deployments feel safe and are never a surprise" | Approval strategy, branch protection, policy-as-code gates, speculative plans on PR
"We can onboard new teams to Terraform quickly without reinventing the wheel" | Golden module library, private registry, InnerSource ownership model, documentation
"We've eliminated long-lived secrets from our CI/CD pipelines" | OIDC/Dynamic Credentials, ephemeral resources (TF 1.10+), variable set governance