Home
Terraform Discovery & Implementation
🧭 Two Navigation Modes
Toggle the sidebar between Phase view (linear discovery flow) and Domain view (jump directly to any technical area). Same content — two entry paths.
⚡ Decision Points
Amber boxes are branching decisions — follow YES/NO paths to the right recommendation. Deal Breakers must be checked before Phase 1 begins.
🎙️ Discovery Questions
Teal-labeled prompts are conversation starters. Adapt to the customer's language — don't read them verbatim. The goal is to understand, not interrogate.
PRE-DISCOVERY
⛔ Deal Breaker Categories
Check these before any technical discovery begins. Any one of these can determine the platform decision before Phase 1 starts.
Does the customer have environments with zero internet access (air-gapped)?
✓ YES — Air-gapped
→ Terraform Enterprise (TFE) Non-negotiable. TFE is self-managed, on-prem/VPC, fully offline. See air-gap definition below.
✗ NO — But strict networking
→ HCP Terraform + Agents HCP Terraform Agents (self-hosted runners) keep execution inside the customer network while the control plane remains SaaS.

Air-Gap — Formal Definition

An air-gapped environment is one where the entire installation and its environment are physically and logically isolated from external networks, including the public internet.

Key Characteristics
  • No internet access — inbound or outbound
  • Manual installation and updates (no auto-update)
  • On-premises Version Control System (GitLab, Gitea)
  • Private Module Registry (self-hosted)
  • Private Container Registry
  • Internal license server
Operational Connectivity Requirements
  • All dependencies must be pre-staged internally
  • Provider binaries must be mirrored to internal registry
  • Terraform binary updates require manual pull and internal distribution
  • No external SaaS integrations (Slack, PagerDuty, etc.)
Architect Tip — Data Sovereignty

Ask separately: "Does your legal team mandate that state files or sensitive variables must never leave a specific cloud region or VPC?" Data sovereignty is distinct from air-gap and can push toward TFE even in cloud-connected environments.

Does the customer have a dedicated Platform Engineering team with cycles to manage Linux, PostgreSQL, and S3-compatible storage backends?
✓ YES — Wants control
→ Consider TFE They can manage platform lifecycle and pin versions to their own upgrade schedule.
✗ NO — Wants to ship code
→ HCP Terraform Fully managed, always latest. Focus entirely on writing Terraform — not running it.
Upgrade Tolerance

HCP Terraform = always on latest.
TFE = customer chooses upgrade cadence.
If they need to pin platform versions for change management or stability, TFE wins this axis regardless of staffing.

Does Terraform need to reach internal APIs behind a heavy firewall or on-premises (e.g., private GitLab, legacy IPAM, on-prem DNS)?
✓ YES — Heavy internal access
→ TFE preferred TFE lives inside that network. Internal service access is seamless — no tunnels, no agents.
✗ NO — Or bridgeable
→ HCP Terraform + Agent or VPN/Direct Connect Adds complexity but remains viable. HCP Agent runs inside the customer network and proxies execution.
⚠️ Urgent Action Required

The legacy "free plan" ended March 31, 2026. Any customer still on that plan needs a migration conversation immediately — this is not optional.

Does the customer have millions of low-value, high-count resources (e.g., individual DNS records, specific tag resources)?
✓ YES — High resource count
→ Evaluate TFE licensing RUM pricing on HCP Terraform scales with resource count. TFE license-based pricing is often more predictable for massive-scale organizations.
✗ NO — Standard footprint
→ HCP Terraform RUM likely acceptable Standard resource composition won't trigger cost surprises on the RUM model.
PHASE 01 · GOVERNANCE & MATURITY
Vision & Governance
Goal: Understand who owns the infrastructure and why before any technical specs. Determine the high-level deployment model (HCP vs Enterprise).
Discovery Prompt
"Walk me through a typical 'Day 1' today. If a developer needs a new environment for deploying an application, what happens from the moment they ask until resources are live?"
Method | Pros | Cons | Action
Click-Ops (manual console) | No barrier to entry; good for PoCs and discovery | Non-repeatable, drift-prone, high human error, no audit trail | Educate & Migrate
Imperative Scripts (Bash / Python / CLI) | Familiar, flexible, no extra tools required | State-blind, spaghetti maintenance, no dependency management | Migrate — Day 2 glue only
Terraform / OpenTofu (declarative IaC) | State management, plan-before-apply, 3000+ providers | HCL learning curve, state complexity, abstracted errors | Mature & Expand
Cloud-Native IaC (CloudFormation / Bicep) | Day-zero provider support, no state management burden | Cloud lock-in, cannot manage SaaS tools or cross-cloud | Supplement Only
Discovery Prompt
"What is the team's current proficiency with HCL and Git workflows? Are developers writing Terraform today, or are we starting from scratch?"
Why This Matters

Skill level directly determines the implementation path, the starting point for automation maturity, and whether education and training need to be scoped as a prerequisite. Pitching API-driven workflows to a team that's never written HCL is a failure mode.

What is the team's HCL and Git workflow proficiency?
Beginner
→ Education first Start with managed HCP Terraform, simple remote state, and module consumption from a central registry. Do not introduce CI/CD automation or policy enforcement until HCL basics are solid.
Intermediate
→ VCS-driven workflows Ready for VCS-integrated CI/CD, speculative plans on PR, module versioning, and basic Sentinel policies. Registry consumption is standard.
Advanced
→ API-driven + full ownership Custom pipelines, API-driven workflows, module authorship, registry publishing, advanced policy frameworks. Focus shifts to governance and scale.
Discovery Prompts
"Who are the primary 'producers' of Terraform code versus the 'consumers'? Is this a centralized platform team model, or are individual app teams owning their own infrastructure?"
"Do your network team and your application team have the same manager?" — If they are separate orgs, they should have separate repositories.
Is infrastructure ownership centralized (platform team) or decentralized (app teams)?
Centralized
→ Monorepo + Gatekeeper model Central team manages code, CI/CD, and modules. Golden module registry with gated publishing. High consistency, risk of bottleneck at scale.
Decentralized
→ Multi-repo + Federated model Teams own their infrastructure domains. Federated module ownership. Requires strong versioning governance to prevent fragmentation.
Discovery Prompts
"Are there strict regulatory requirements — SOC2, HIPAA, PCI-DSS, FedRAMP — that mandate Policy as Code enforcement?"
"Are there any hard requirements regarding where Terraform state and execution data live? Data residency requirements or air-gap policies that prevent using a SaaS platform?"
Are there regulatory mandates that require automated policy enforcement?
✓ YES — Regulated
→ Sentinel / OPA required Policy as Code is not optional. Compliance evidence must be system-generated (JSON evaluation records), not manual. See Phase 4 — Policy Enforcement.
✗ NO — Standard compliance
→ Native validations as starting point Variable rules, syntax checking, and module-specific validation cover most standard compliance needs. Migrate to Sentinel/OPA as maturity grows.
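Native validations live directly in plain HCL. A minimal sketch of a variable guardrail — the variable name and approved-size list are hypothetical:

```hcl
variable "db_instance_class" {
  type        = string
  description = "RDS instance class for this environment"

  validation {
    # Reject anything outside the approved size list at plan time —
    # no policy framework required.
    condition     = contains(["db.t3.medium", "db.r6g.large"], var.db_instance_class)
    error_message = "Instance class must be one of the approved sizes."
  }
}
```

This fails the run during `terraform plan`, which makes it a reasonable stepping stone before introducing Sentinel or OPA.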

This is the same product. The decision is operational convenience vs. total environmental control. Refer to Deal Breakers A–D before running this comparison — they may decide it before you get here.

☁️ HCP Terraform (SaaS)
  • Management: Fully managed by HashiCorp
  • Updates: Automatic — always latest
  • Infrastructure: Runs on HashiCorp Cloud Platform
  • Pricing: Resources Under Management (RUM)
  • Best for: Speed to value, low ops overhead
🏢 Terraform Enterprise (TFE)
  • Management: Customer-managed (VM or Kubernetes)
  • Updates: Manual — customer controls upgrade cycle
  • Infrastructure: Customer VPC, data center, or air-gap
  • Pricing: License-based (per workspace/user)
  • Best for: Strict compliance, air-gapped, high-security
Does the customer require an air-gapped environment?
✓ YES
→ Terraform Enterprise. Non-negotiable. Full stop.
✗ NO
→ HCP Terraform. Then evaluate run capacity requirements — see next section.
Architect Tip — Magic Words

"Global, Consistent, Standardized" → lean toward Registry-driven workflows.

"Agile, Autonomous, Fast" → lean toward VCS-driven with OPA guardrails.

CLI-only insistence → dig deeper. They often don't trust automation yet. This is an opportunity to demo Speculative Plans and build confidence in the pipeline.

Discovery Prompts
"Do you have high run-capacity requirements — 50 or more simultaneous Terraform jobs?"
"Is your team small (under 10 engineers) or do you lack a dedicated platform admin who can manage the Terraform platform itself?"
Does the customer require 50+ simultaneous runs, or do they have dedicated platform admins?
✓ YES — High capacity / dedicated team
→ Plus Tier or scale-out compute runners Scale your own compute runners to handle concurrent job load. Requires platform admin capacity.
✗ NO — Small team / low volume
→ Standard Tier Standard Tier handles the majority of enterprise workloads. Upgrade path is straightforward when scale demands it.
Discovery Prompt
"Is this a single cloud (AWS/Azure/GCP), multi-cloud, or hybrid cloud environment? And how do you expect that to evolve over the next 2–3 years?"
What is the customer's cloud footprint?
Single Cloud
→ Cloud-native backends viable S3 + DynamoDB (AWS), Azure Blob, or GCS are all strong options. HCP Terraform also well-suited. See State Management section.
Multi-Cloud
→ HCP Terraform strongly preferred Unified state management across providers. Avoid juggling multiple cloud-native backends and state backends simultaneously.
Hybrid (Cloud + On-Prem)
→ TFE or HCP + Agent On-prem access requirements likely push toward TFE. Revisit Deal Breaker C for network complexity evaluation.
Why This Matters Upstream

Cloud footprint directly informs state backend selection. A multi-cloud customer who picks cloud-native backends will end up managing three separate state systems with different locking mechanisms and IAM configurations. That operational burden is avoidable with HCP Terraform.

PHASE 02 · WORKFLOW & AUTOMATION
Workflow & Golden Path
Goal: Determine the automation strategy — how code moves from a developer's laptop to production. VCS vs API-driven is the core routing decision here.
Discovery Prompts
"Where do your developers spend most of their time — interacting with a UI, a CLI, or within a VCS like GitHub or GitLab?"
"Do you have an existing CI/CD standard — GitHub Actions, Jenkins, GitLab CI? Do you want Terraform to manage execution natively on a PR event, or stay within your existing pipeline?"
Mode | Trigger | Best For | Governance
VCS-Driven (recommended default) | Git events (PR/Merge) | Most teams; standard apps | Highest — built-in audit trail
CLI-Driven | terraform apply from terminal; remote execution | Iterative dev; break-fix | Medium — consistent but manual
API-Driven / Custom CI/CD | External script or orchestrator | Complex pipelines; multi-stage | Customizable — high setup effort
Generic CI/CD (DIY) | Jenkins / GitHub Actions (non-HCP) | Non-HCP Terraform users | Lowest — manual state and lock mgmt
Does the customer want to use their own CI tool but retain HCP Terraform for state and policy?
✓ YES — Custom pipeline
→ API-Driven Workflow Customer owns the plumbing. They write scripts to handle API tokens, upload config, and poll for run status. More flexible — more operational ownership.
✗ NO — Native preferred
→ VCS-Driven Workflow Connect directly to source control. Speculative Plans run automatically on PR open. State is locked during the run. Lowest friction path.
Maturity Model Framing

Present this as a maturity progression: CLI → VCS-driven → API-driven. Most orgs start at CLI, move to VCS for standard workflows, and eventually adopt API-driven for multi-stage orchestration. Meet them where they are, show them where they're going — don't force the jump.

Discovery Prompt
"How do your developers and pipelines currently authenticate to the cloud? Are you using static IAM keys, service accounts with long-lived secrets, or have you moved to OIDC-based dynamic credentials?"
What is the current cloud authentication method for CI/CD pipelines?
Static IAM Keys / Long-lived Secrets
→ Migrate to OIDC — priority action Long-lived keys are a security liability. Even a compromised CI/CD pipeline yields permanent admin-level access. OIDC eliminates this exposure.
OIDC / Dynamic Credentials
→ Configure HCP Dynamic Credentials Already on the right path. Configure HCP Terraform Workload Identity to generate temporary OIDC tokens per run. Validate TTL and scope.
Service Accounts + Rotation Policy
→ Evaluate rotation cadence + TTL Better than static keys if rotation is enforced. Prioritize migration to OIDC. Ensure no service account keys are committed to Git.
How OIDC Works

HCP Terraform generates a temporary OIDC token per run → cloud provider trusts this token via Workload Identity Federation → grants a short-lived session (typically 1 hour). Even if the CI/CD pipeline is compromised, there are no static admin keys to steal — the token expires.
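On the workspace side, this is mostly configuration rather than code. A sketch, assuming the standard HCP Terraform dynamic-credentials variable names for AWS (the account ID and role name below are hypothetical):

```hcl
# Workspace environment variables (set in the HCP Terraform UI or via a
# Variable Set) — these enable dynamic credentials for AWS:
#   TFC_AWS_PROVIDER_AUTH = "true"
#   TFC_AWS_RUN_ROLE_ARN  = "arn:aws:iam::123456789012:role/hcp-terraform-run"

provider "aws" {
  region = "us-east-1"
  # Note: no access_key / secret_key. Each run receives a short-lived
  # session assumed via the per-run OIDC token, so there is nothing
  # static to rotate or steal.
}
```

The cloud-side prerequisite is an IAM role whose trust policy federates with HCP Terraform's OIDC issuer and is scoped to the organization/workspace.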

Golden Rule — Non-Negotiable

Regardless of tooling or team maturity: never allow an apply in Production without a documented and approved plan. This applies at every maturity level.

🔀 GitOps Model
  • PR is the gate — approval happens in VCS
  • Speculative Plan posts as a comment on PR open
  • Code Owner must approve before merge to main
  • Enforce branch protection rules — no direct push to main
  • Best for: High-velocity dev and staging
🔑 Platform Model (HCP Manual Apply)
  • Workspace set to "Manual Apply"
  • Plan runs but pauses in "Pending" state
  • Lead logs into HCP UI, reviews final plan, confirms apply
  • Even code-approved changes get a final production gate
  • Best for: Production environments
🏛️ Enterprise Model (ServiceNow / Jira)
  • Run Tasks trigger a Change Request in ServiceNow
  • Apply blocked until ticket moves to Approved state
  • High friction — reserve for Layer 1 and 2 core infra only
  • Best for: High-compliance workloads, CAB-required changes
🤖 Automated Approvals (Shift-Left)
  • Sentinel/OPA auto-approves changes that pass all policies
  • Hard Mandatory violations block and require senior override
  • Override events must be captured in the audit stream
  • Goal: Reduce toil without removing governance
Environment | Approval Method | Who Approves?
Sandbox / Dev | Auto-Apply (no manual gate) | System — policy check only
Staging / QA | VCS Pull Request Approval | Peer Developer
Production | HCP Manual Review + Policy Check | Cloud Lead / SRE
Core Network / Identity | External CR (ServiceNow) + Two-Key Policy | Change Advisory Board (CAB)
Discovery Prompt
"Where do you currently store sensitive credentials like API keys and database passwords — Vault, AWS Secrets Manager, environment variables? How are they injected into Terraform runs?"
Goal

Move customers away from static secret management toward identity-based dynamic injection. Even if they can't reach the gold standard today, set the direction and build a roadmap toward it.

Option | Pros | Cons | Verdict
Input Variable (sensitive = true) | Easy to use; redacted from CLI output and logs | Stored in plain text in the state file — anyone with state access can read it | OK for non-critical values
Secrets Manager (data source) | Centralized; supports rotation | Secret value persists in the state file after the first fetch | Legacy by 2026
HCP Vault / Dynamic Secrets | Just-in-time credentials; automatic TTL expiry | Requires Vault infrastructure and management overhead | Best for DB/App credentials
Ephemeral Resources (TF 1.10+) | Never stored in state — only used during the apply run | Requires TF 1.10+ — verify the customer's version | ⭐ 2026 Gold Standard

Ephemeral Resources — Example

```hcl
# The secret is fetched at apply time only.
# It is never written to terraform.tfstate.
ephemeral "vault_kv_secret_v2" "db_password" {
  mount = "secret"
  name  = "database"
}

resource "aws_db_instance" "main" {
  password = ephemeral.vault_kv_secret_v2.db_password.data["password"]
  # ↑ Used during apply, gone after. Never in state.
}
```

State File — Defense in Depth

No matter how secrets are injected, the state file is always a potential vulnerability. Apply defense in depth:

State encryption at rest — HCP TF encrypted by default; cloud backends use KMS/CMK
RBAC on state access — only the "Apply" service account has read access to the raw state file
Developers = Read-Only to plans, not state — they should never need to see raw state
HCP Variable Sets marked Sensitive — workspace admins cannot see values once entered
PHASE 03 · CODE ARCHITECTURE & MODULARITY · PLATFORM & BACKEND STRATEGY
Code & State Architecture
Goal: Define blast radius and module strategy. State file decisions are the most critical in the engagement — they are difficult to change later and can become single points of failure.
Most Critical Decision in This Engagement

Wrong state strategy causes state drift, corrupted environments, and security vulnerabilities — state files can contain sensitive data in plain text. Get this right before any other architectural decision.

The Three Truths — Frame This for Customers

Desired State: Configuration files in Version Control

Known State: What Terraform remembers in the .tfstate file

Actual State: The reality in the cloud provider — drift occurs when this deviates from Known or Desired

Backend | When to Use | Key Trade-offs
Local State | Solo dev experimentation only. Never for org use. | Zero config, zero collaboration, zero locking — data on a laptop is a liability
Cloud-Native Backends (S3, Azure Blob, GCS) | Small-to-medium team with strong single-cloud presence | Cost-effective (pennies/month), state locking — but DIY security, versioning, and IAM
HCP Terraform / TFE (recommended) | Enterprise governance, private module registries, low-ops approach | Native state mgmt, built-in RBAC, Sentinel/OPA — cost is per-RUM; potential lock-in
TaCOS (Spacelift, Scalr, env0) | Multi-IaC environments (TF + Pulumi + CloudFormation + OpenTofu) | Unified single pane of glass, drift detection, TTL environments — third-party dependency

Cloud-Native Backend Selection by Provider

AWS — S3 + DynamoDB
  • State stored in S3 bucket
  • Locking via DynamoDB table
  • Enable bucket versioning (recovery from corruption)
  • Enable KMS encryption at rest
  • Restrict bucket policy — no public access
Azure — Blob Storage
  • State stored in Azure Blob container
  • Native state locking — no extra service needed
  • Enable blob versioning for recovery
  • Enable Customer Managed Key (CMK) encryption
  • Use private endpoints to restrict access
GCP — Cloud Storage
  • State stored in GCS bucket
  • Native state locking built in
  • Enable object versioning for recovery
  • Enable CMEK encryption at rest
  • Restrict with uniform bucket-level access
Hardening — Required for All Cloud Backends

Enable bucket/blob versioning on all state backends — this is your recovery mechanism if state gets corrupted. Enable KMS/CMK encryption to protect sensitive data at rest. These are non-negotiable regardless of which cloud provider is used.
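As a concrete sketch, an AWS backend that satisfies these hardening requirements might look like this — the bucket and lock-table names are hypothetical, and versioning plus KMS are assumed to be enabled on the bucket itself:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"           # hypothetical; versioning + KMS enabled on the bucket
    key            = "states/prod/terraform.tfstate"  # one key prefix per environment
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                # state locking via DynamoDB
    encrypt        = true                             # server-side encryption at rest
  }
}
```

The Azure and GCP equivalents follow the same shape with the `azurerm` and `gcs` backends, where locking is native and no separate lock service is needed.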

State Isolation — Blast Radius

Discovery Prompt
"How large of a 'failure domain' can you tolerate? If a state file were corrupted, how much infrastructure would be affected? Are you comfortable with that?"

Certain groupings of resources should have their own isolated state file. The smaller the state file scope, the smaller the blast radius. Always separate state backends (different S3 prefixes or HCP TF Workspaces) for each environment to prevent cross-environment resource deployment.

Monorepo
Pros:
  • Full org visibility — search across all infra
  • Shared CI/CD pipelines and linting rules
  • Simplified VCS-level RBAC
Cons:
  • One bad CI trigger can break company-wide
  • VCS performance degrades at scale
  • High merge conflict rate across teams

Best for: global infra layers. Does not allow easy module versioning.

Multi-repo
Pros:
  • Hard blast radius isolation between services
  • Clear team ownership per repo
  • Granular per-service RBAC
Cons:
  • Configuration drift harder to enforce across repos
  • Dependency tracking across 50+ repos is painful
  • Heavy CI/CD pipeline and secrets management overhead

Best for: app infra isolation. Enables easy module versioning.

Hybrid InnerSource ✓ Recommended
Golden Modules → Multi-repo:
  • Each reusable module in its own repo
  • Version independently with SemVer tags
  • Publish to Private Module Registry
Live Environments → Monorepo:
  • Small/mid-size: single "infra-live" monorepo
  • Enterprise: split by Business Unit or foundation layer

Three Litmus Tests

Test 1 — Team Structure

"Do your network team and your application team have the same manager?" If they are separate organizations, they should have separate repositories. Forcing different teams into one repo creates a bottleneck on the peer review process.

Test 2 — Deployment Velocity

"How often do you deploy — once a week or 50 times a day?" High-velocity teams need multi-repo. A global lock on a monorepo can prevent a critical hotfix while someone else runs a long-lived terraform apply on a different section.

Test 3 — Blast Radius Tolerance

"If a developer makes a mistake in a GitHub Action and runs a 'destroy' at the root of the repo, what is lost?" If the answer is "the whole company" — move to multi-repo or multi-project structure immediately to enforce hard boundaries.

Discovery Prompt
"How do you currently separate Development, Staging, and Production — different cloud accounts, different VPCs, or just naming conventions in the same account?"
Workspaces

Same codebase, switched context. Good for identical, short-lived environments.


⚠️ Risks:
  • Shared backend credentials across environments → potential cross-env apply
  • Workspaces are invisible on disk — can't browse environments in the repo
  • Best used for ephemeral or short-lived environments only
Directory Structure ✓ Preferred for Production

Each environment gets its own folder and its own root module with its own backend.


✓ Advantages:
  • Credential isolation: different IAM roles per directory
  • Version pinning: test new module versions in dev while prod stays pinned
  • Clear visibility: browsing the repo shows exactly what environments exist

Solving the Backend Copy/Paste Problem

The common complaint about directory isolation is copying backend.tf into every environment folder. Three solutions:

Terragrunt
  • Popular wrapper that handles backend config dynamically — eliminates the copy/paste entirely. Standard recommendation for mature teams.
Symlinks
  • Link a shared providers.tf into every environment folder. Simple but brittle — symlinks can break in some CI/CD environments.
Partial Configuration
  • terraform init -backend-config=path/to/backend.dev.hcl injects environment-specific backend settings at runtime. No file duplication — slightly more complex init process.
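The partial configuration pattern can be sketched like this — the bucket name and key are hypothetical:

```hcl
# environments/dev/backend.tf — backend block intentionally left empty
# ("partial configuration"); values arrive at init time.
terraform {
  backend "s3" {}
}

# backend.dev.hcl — environment-specific values, kept alongside the env:
#   bucket = "acme-terraform-state"
#   key    = "states/dev/terraform.tfstate"
#   region = "us-east-1"
#
# Initialize with:
#   terraform init -backend-config=backend.dev.hcl
```

Each environment gets its own `backend.*.hcl` file, so the root module code is identical across folders while the state location stays isolated.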
Example Directory Structure

```
.
├── modules/                  # Reusable code (VPC, RDS, EKS)
│   ├── vpc/
│   └── database/
└── environments/
    ├── dev/
    │   ├── main.tf           # Calls modules with 'dev' variables
    │   └── backend.tf        # Points to /states/dev/terraform.tfstate
    ├── staging/
    │   ├── main.tf           # Calls modules with 'staging' variables
    │   └── backend.tf        # Points to /states/staging/terraform.tfstate
    └── prod/
        ├── main.tf           # Calls modules with 'prod' variables
        └── backend.tf        # Points to /states/prod/terraform.tfstate
```
Layer 1 Identity & Global Governance
Changed infrequently. Foundational to everything else. Owned exclusively by Security Team.
Isolate: IAM Roles/Policies, SCPs, Guardrails, SSO Config. Reason: App devs must never touch security boundaries.
Layer 2 Core Networking
High-stakes, stable. Network changes are highest-risk. Owned by Cloud NetEng.
Isolate: VPCs, Subnets, Transit GW, DirectConnect, DNS Hubs. Reason: Day 2 app changes must not affect the network fabric.
Layer 3 Platform & Shared Services
Bridges network and application code. Different lifecycle than the apps running on it. Owned by Platform/DevOps.
Isolate: Kubernetes clusters, shared RDS, Logging/Monitoring, Secrets Manager. Reason: K8s version upgrade should not redeploy 50 apps.
Layer 4 Application Services
Changes most frequently. Each app or microservice should have its own state file. Owned by App Dev Teams.
Isolate: App-specific S3, Lambda, Microservices, App DBs. Reason: Teams need to move at their own velocity without locking others out.
Layer | Change Freq | Blast Radius | Ownership
Identity | Very Low | Global / Total | Security Team
Networking | Low | Regional / Critical | Cloud NetEng
Platform | Medium | Cluster-wide | Platform / DevOps
Application | High | Service-specific | App Dev Teams
Cross-Layer Data Access — Critical Pattern

Layer 4 (Application) needs data from Layer 2 (Network) — like vpc_id or subnet_ids. Do not use terraform_remote_state data sources — they require access to the entire state file, which violates blast radius isolation. Instead: use HCP Terraform Outputs or a targeted data source to read only the specific values needed across layers.
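One way to implement the targeted read is the `tfe` provider's `tfe_outputs` data source, which exposes only the outputs a workspace has published, not its raw state. A sketch — the organization, workspace, and output names are hypothetical:

```hcl
# Layer 4 (application) workspace reads published outputs from the
# Layer 2 (network) workspace without any access to its state file.
data "tfe_outputs" "network" {
  organization = "acme-corp"          # hypothetical org
  workspace    = "core-network-prod"  # hypothetical workspace
}

resource "aws_instance" "app" {
  ami           = "ami-0abc1234"      # hypothetical AMI
  instance_type = "t3.micro"
  subnet_id     = data.tfe_outputs.network.values.private_subnet_ids[0]
}
```

The network team controls exactly which values cross the layer boundary by choosing what to declare as workspace outputs.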

Discovery Prompts
"Do you have a set of 'Golden Images' or standard infrastructure patterns that every team must follow? How do you plan to distribute those — shared folders, or a formal private module repository?"
"Will teams consume modules provided by a central team, or will each team build their own?"
Centralized (Gatekeeper)
Platform team owns, writes, and maintains all modules. High consistency, strict compliance. Becomes a bottleneck at scale — feature requests pile up.

Best for: Regulated industries; orgs starting their TF journey.
Federated (Community)
App teams own modules for their stack. Domain experts write their own code; faster iteration. Risk of fragmentation — different teams solving the same problem in incompatible ways.

Best for: Large, mature orgs with high DevOps autonomy.
Hybrid InnerSource ✓ Recommended
Platform Team owns base modules. Any team can contribute improvements via PR. "You build it, you help maintain it."

Best for: Most enterprise clients targeting scale without sacrificing governance.

Ownership Matrix

Task | Platform Team | App/Stream Team
Foundational Modules (VPC, IAM, DNS) | Owner | Consumer
Service Modules (RDS, S3, EKS) | Co-Owner (Standards) | Co-Owner (Features)
Application Logic Infrastructure | Consultant | Owner
Security Scanning (Checkov/Snyk) | Sets Policy | Fixes Findings

Key Ownership Pillars

Private Registry as Source of Truth: If it's not in the HCP Terraform Private Module Registry, it's not a supported module. Clear boundary between experimental and production-ready code.
Code Owners in VCS: GitHub CODEOWNERS or GitLab Protected Branches — any change to a module requires approval from the designated owner group.
Semantic Versioning (SemVer): Owners must commit to vX.Y.Z tagging. Deprecation requires a migration path or sunset period — not a silent removal.
Testing as Ownership Prerequisite: No PR merged unless terraform test integration tests pass. Ownership means knowing when your module is broken before your consumers do.
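A minimal `terraform test` case looks like this — the file name, input variable, and resource address are hypothetical and assume a VPC module:

```hcl
# tests/vpc.tftest.hcl — executed by `terraform test` (TF 1.6+)
run "plan_respects_requested_cidr" {
  command = plan

  variables {
    vpc_cidr = "10.0.0.0/16"  # hypothetical module input
  }

  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR does not match the requested range."
  }
}
```

Wiring this into the module's CI pipeline before the registry publish step gives owners the early-warning signal this pillar calls for.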
Discovery Prompt
"Who is responsible for fixing module code when a provider update breaks something? Who owns the SLA for getting consumers unblocked when that happens?"
Why This Is Its Own Decision

Module ownership (who writes it) and registry ownership (who is accountable for it being available and functional) are not always the same team. Blurring this line creates gaps — modules get published but nobody owns the support burden when a provider version upgrade breaks downstream consumers.

Who is responsible for fixing code when a provider update breaks a published module?
Central Platform Team
→ Gated registry with strict SLA Platform Team owns publishing rights and patch SLA. Version bumps require their review and sign-off. Consumers can always depend on the registry being reliable — but the team becomes a bottleneck.
Domain App Teams
→ Federated registry with tagged ownership Each module has a tagged owning team in the registry. That team owns the patch SLA for their modules. Platform Team still controls the registry infrastructure and publishing standards.

Registry Best Practices

Use HCP Terraform Private Module Registry as the single distribution point — eliminates "which version is in the shared folder?" questions entirely.
Pin provider version constraints in every module — required_providers blocks must specify minimum and maximum acceptable versions to prevent surprise breakage on provider updates.
Deprecation policy: Retiring a module requires a migration path document and a minimum sunset window communicated to all consumers before removal from the registry.
No module published without passing tests: terraform test must be a gate in the module's CI pipeline before the registry publish step runs.
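Provider pinning and registry version consumption can be sketched as follows — the registry path and version numbers are hypothetical:

```hcl
# Inside the module: bound provider range prevents surprise breakage
# when a new major provider version ships.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0.0, < 6.0.0"  # explicit minimum and maximum
    }
  }
}

# Consumer side: pin to a SemVer range from the private registry
# rather than tracking latest.
module "vpc" {
  source  = "app.terraform.io/acme-corp/vpc/aws"  # hypothetical registry path
  version = "~> 2.1"                              # accepts 2.1.x patches only
}
```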
PHASE 04 · SECURITY & POLICY ENFORCEMENT
Policy & Day 2 Operations
Goal: Establish guardrails and long-term maintenance. This is where compliance becomes proactive — automated and audit-ready — rather than reactive and manual.
Discovery Prompts
"If a developer tries to deploy an expensive instance type or an unencrypted database, how is that caught today — manual review, or automated enforcement?"
"Are there specific rules that must always be enforced? For example: 'No public S3 buckets', 'All instances must have a Cost Center tag', 'No unencrypted databases'?"
"Is there a requirement to see the estimated cost of a change before it is applied to the environment?"
How complex and cross-cutting are the policy requirements?
High Compliance
→ Sentinel / OPA required Cross-resource logic, security guardrails, auditing, dynamic enforcement levels, centralized control. Produces JSON audit records. Enables soft-mandatory override capture.
Standard Compliance
→ Native validations as starting point Variable constraints, syntax checking, and module-specific validation handle most standard compliance needs. Migrate to Sentinel/OPA as requirements grow.

Policy as Compliance Evidence

Key Selling Point for Regulated Customers

When Sentinel or OPA evaluates a run, it produces a JSON record of evaluation — every rule that was checked, and whether it passed or failed. This is auditor-ready evidence that compliance rules are being enforced by the system, not by humans manually reviewing configurations after the fact. It converts audits from manual retrospective reviews into automated, continuous proof.

If a Hard Mandatory policy is violated and a senior architect performs a manual override, that override event — including justification — must be captured in the audit stream. Overrides are not silent. This is what gives auditors confidence that exceptions are controlled and documented.

Shift-Left Approach

Use Sentinel or OPA to automatically block non-compliant infrastructure before it is provisioned. The execution mode is the last line of defense. Policy should catch violations during the Plan phase — never rely on post-apply remediation as the primary control.
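As an illustrative sketch of a plan-phase guardrail, a Sentinel policy enforcing a mandatory tag might look like this (the resource type and tag name are examples, not a prescribed standard):

```sentinel
import "tfplan/v2" as tfplan

# Collect EC2 instances being created or updated in this plan.
ec2_instances = filter tfplan.resource_changes as _, rc {
	rc.type is "aws_instance" and
	rc.mode is "managed" and
	(rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Every instance must carry a CostCenter tag before apply is allowed.
main = rule {
	all ec2_instances as _, rc {
		rc.change.after.tags is not null and
		"CostCenter" in keys(rc.change.after.tags)
	}
}
```

Attached at hard-mandatory enforcement, a failing rule blocks the run before provisioning; at soft-mandatory, it pauses for an override that lands in the audit stream.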

RBAC aligns the security model with organizational structure to prevent blast-radius accidents. Use a funnel approach: broad at the Organization level, granular at the Workspace level.

Level | Who | Permissions | Notes
Organization | CCoE / Lead Architects | Manage Policies, Registry, Teams | Keep the Owners team to 2–3 max. Most admins should be limited to Manage Workspaces — not full org control.
Project | Lead Engineers per domain | Admin or Write at the Project level | Teams create their own workspaces without needing central admin approval every time.
Workspace | Individual Contributors | Read / Plan / Write | Most developers should be Plan-only in production workspaces.

Who Should Apply in Production?

Service Account First Rule

Humans should almost never have direct apply permissions in production. Grant apply permissions to a Service Principal only (OIDC/Dynamic Credentials). Workflow: Dev triggers PR → CI/CD runs Plan → Human reviews/approves → System executes Apply.

Two-Key Policy — Core Infrastructure

For Layer 1 (Identity & Governance) and Layer 2 (Core Networking) resources: implement a required two-approval gate before any apply can proceed. This ensures a single rogue user or compromised account cannot destroy foundational infrastructure. Implement via HCP Terraform Team approvals requiring two distinct reviewers, or VCS branch protection requiring two separate approvals.
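One way to enforce the two-approval gate at the VCS layer is with the Terraform GitHub provider — a hedged sketch, assuming a repository named `core-network` and the provider's `github_branch_protection` resource:

```hcl
# Hypothetical two-key gate on the core networking repo:
# no merge to main (and therefore no apply-on-merge) without
# two distinct approving reviews.
resource "github_branch_protection" "core_network" {
  repository_id = "core-network" # example repo name
  pattern       = "main"

  required_pull_request_reviews {
    required_approving_review_count = 2
    dismiss_stale_reviews           = true # re-approve after new pushes
  }
}
```

The same intent can be expressed with HCP Terraform team approvals; the VCS route has the advantage of guarding the code itself, not just the run.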

Team | Layer / Workspace | Permission | Rationale
Networking Team | Core Network / VPC | Admin | They own the fabric and manage its lifecycle end to end
App Dev Team | Application Services | Write (non-prod) / Plan (prod) | High velocity in dev; protected gates in production
Security Team | All Workspaces | Read-Only | Auditing and compliance without operational risk
App Dev Team | Core Network | Read-Only | They consume vpc_id and subnet_ids — they must not change them
RBAC Best Practices

Eliminate static secrets: HCP Terraform Dynamic Credentials — workspace authenticates via OIDC token per run. No static secrets to steal.
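On the AWS side, dynamic credentials work by trusting HCP Terraform's OIDC issuer. A hedged sketch of the trust policy (organization, project, and workspace names are examples; the workspace itself just sets `TFC_AWS_PROVIDER_AUTH=true` and `TFC_AWS_RUN_ROLE_ARN` pointing at this role):

```hcl
# Hypothetical IAM trust policy: each HCP Terraform run exchanges a
# short-lived OIDC token for temporary AWS credentials. Nothing static
# is stored anywhere.
data "aws_iam_policy_document" "tfc_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.tfc.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "app.terraform.io:aud"
      values   = ["aws.workload.identity"]
    }

    condition {
      test     = "StringLike"
      # Scope the role to one org/project/workspace (names are examples)
      variable = "app.terraform.io:sub"
      values   = ["organization:example-org:project:*:workspace:example-ws:run_phase:*"]
    }
  }
}
```

The `sub` claim is the control surface: tighten the wildcards to pin a role to a single workspace, or even a single run phase.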

Map IdP Groups to Teams: Always map Okta/AD Groups to HCP Terraform Teams. User access is tied to the org's auth service, not directly to HCP users — offboarding is automatic.

Custom Permissions (Business Tier): Separate "Queue Run" (ability to trigger a plan) from "Approve Run" (ability to apply). This is the gold standard — not all Run writers should be Run approvers.

Discovery Prompts
"How do you currently detect 'Shadow IT' — manual changes made directly in the Cloud Console that aren't reflected in code?"
"Is drift detection a compliance requirement for you, or a nice-to-have?"
How to Frame This for Customers

Drift detection is not a monthly audit — it is a rhythmic heartbeat. Treat a "Drift Warning" with the same urgency as a "Build Failure." If your team wouldn't ignore a failed CI/CD pipeline, they shouldn't ignore drift either.

Workload Type | Recommended Frequency | Implementation Method
Critical Production | Continuous / Hourly | HCP Terraform Health Assessments (Standard/Plus Editions)
Standard Production | Daily (every 24h) | Scheduled CI/CD pipeline running terraform plan
Development / Sandbox | Weekly / On-Demand | Manual trigger or weekly cron job
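For the scheduled-pipeline rows, the core of the job is a single flag: `terraform plan -detailed-exitcode` exits 2 when the plan is non-empty, which is exactly the drift signal. A minimal CI step sketch (the alerting action is a placeholder for whatever channel integration the team uses):

```shell
# Scheduled drift check (e.g. a daily cron job in CI).
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift detected.
terraform plan -detailed-exitcode -input=false -lock=false
case $? in
  0) echo "No drift detected" ;;
  1) echo "Plan failed - investigate pipeline" >&2; exit 1 ;;
  2) echo "DRIFT DETECTED - alert the on-call channel" >&2; exit 2 ;;
esac
```

Treating exit code 2 as a pipeline failure is what makes drift carry the same weight as a broken build.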
Version Note

As of early 2026, HCP Terraform Health Assessments are part of Standard and Plus Editions only. Customers on older free tiers must implement drift detection via scheduled CI/CD pipelines.

Responding to Drift — Three Paths

🔄 Revert (Reconciliation)
  • Run terraform apply to overwrite the manual change
  • Use when: unauthorized changes, accidental deletions, compliance violations (opened security groups)
  • This is the standard response for drift that violates policy
✏️ Adopt (Alignment)
  • Update Terraform code to match the new cloud reality
  • Use when: an emergency manual fix was actually correct and must be codified permanently
  • Use the import block (TF 1.5+) to bring unmanaged resources under control without CLI surgery
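The Adopt path in HCL, using the declarative `import` block (Terraform 1.5+). The resource address and bucket name here are hypothetical examples:

```hcl
# Adopt an unmanaged resource into state on the next plan/apply,
# with no `terraform import` CLI surgery.
import {
  to = aws_s3_bucket.audit_logs      # hypothetical resource address
  id = "example-audit-logs-bucket"   # hypothetical existing bucket name
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "example-audit-logs-bucket"
}
```

Because the import is part of the plan, the adoption itself goes through the same PR review and policy gates as any other change.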

Prevention — Best Practices

Read-Only Cloud Console — Transition all human users to Read-Only roles. Changes must go through a Pull Request. Reserve "Break-Glass" admin access for true emergencies only.
Drift Alerts to Slack/Teams/PagerDuty — Integrate detection with the same channels used for build failures. Drift that nobody sees is drift that festers.
HCP Terraform Health Assessments — Centralized dashboard showing exactly which resources drifted, with a remediate button to generate the fix run.
Use ignore_changes for Expected Drift — Auto-Scaling Group desired_capacity changes by design. Use lifecycle { ignore_changes = [...] } for these attributes to suppress false-positive alerts.
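The `ignore_changes` practice above, sketched on a hypothetical auto-scaling group (resource and attribute names are the common AWS example, not taken from this document):

```hcl
resource "aws_autoscaling_group" "app" {
  min_size         = 1
  max_size         = 10
  desired_capacity = 2 # initial value only

  lifecycle {
    # desired_capacity changes at runtime via scaling policies by design;
    # suppress false-positive drift alerts for it.
    ignore_changes = [desired_capacity]
  }
}
```

Scope the list narrowly: every attribute added here is an attribute drift detection can no longer see.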
Discovery Prompts
"Do you need a full history of 'Who changed what and when' for every resource? Is this for internal ops visibility, or are external auditors requiring it?"
"How long is your compliance retention requirement for infrastructure change logs?"
Architect Rule

A terraform apply without a linked Git commit or Change Request ID should be treated as a Security Incident — not a routine deployment. Build this expectation into the customer's culture from day one.

🔗 Chain of Custody
  • What & When: Every HCP Run is automatically linked to a VCS commit — provides immutable record of what changed and when
  • Why: Integrate Jira or ServiceNow via Run Tasks to attach a Change Request ID to every Terraform plan — closes the "why was this done" gap
📤 Immutable Audit Logs
  • Local TF platform logs are a compliance risk — they can be rotated or deleted
  • Enable HCP Audit Log Streaming → external SIEM immediately
  • Track: sensitive variable access, policy overrides, service account applies, admin permission changes
  • SIEM retention must match industry compliance requirements
Audit Level | Component | Best Practice
System | HCP Terraform | Enable Audit Log Streaming to SIEM — do not rely on local logs
Process | VCS / PRs | Enforce signed commits and mandatory PR approvals before merge
Data | State File | CMK Encryption + Versioning + Access Logging on the state backend
Compliance | Sentinel / OPA | Log all Policy Pass/Fail events; capture all override events with justification
PHASE 05
Next Steps
Goal: Confirm success criteria, validate the implementation readiness checklist, and create a shared definition of "done" with the customer.

These are the non-negotiables for any mature Terraform implementation. Use this as a readiness check at the close of discovery. Check items off live during the conversation.

Standardize on Declarative: Move away from procedural scripts. Use Terraform to define the end state. Scripts are Day 2 glue — not Day 1 provisioning.
Version Everything in Git: If it's not in Git, it doesn't exist. All infrastructure configuration must live in source control.
Implement Remote State: Never store .tfstate files on a laptop. Use HCP Terraform, S3/GCS/Azure Blob, or TFE with state locking enabled.
Separate Environments: Use separate workspace directories or HCP TF workspaces for Dev, Staging, and Prod. Never share a state backend across environments.
Automate the Lifecycle: Shift from local CLI executions to remote executions via CI/CD or HCP Terraform. Local apply in production is not acceptable.
Plan on PR, Apply on Merge: No apply in Production without a documented and approved plan. This is the minimum governance bar.
Shift Left with Policy as Code: Sentinel or OPA blocks non-compliant changes before infrastructure is provisioned. Don't rely on post-apply remediation.
OIDC for All Credentials: No long-lived static IAM keys or service account secrets in CI/CD runners. Use OIDC Workload Identity for all cloud authentication.
State Encryption + Versioning: KMS or CMK encryption at rest. Bucket/blob versioning enabled for recovery from corruption.
Drift Alerts Integrated: Drift detection connected to Slack, Teams, or PagerDuty. Drift treated with same urgency as a build failure.
Audit Log Streaming Enabled: HCP Audit Logs streaming to external SIEM. Local logs alone do not meet compliance retention requirements.
Private Module Registry in Use: A central, versioned registry is the single source of truth for reusable modules. No informal shared folders.
Closing Discovery Question

"If we look back in six months, what is the one thing that must be true for you to consider this Terraform implementation a success?"

Use their answer to anchor the implementation plan. Common success signals and how they map to technical focus areas:

Customer Says | Primary Focus Area
"Developers can self-serve environments without waiting on the platform team" | Module registry, workspace templates, RBAC self-service at Project level
"No more manual changes in the console — everything is in code" | Read-only console policy, drift detection, break-glass process documentation
"We can prove compliance to external auditors without manual evidence collection" | Sentinel/OPA + audit log streaming to SIEM + state file encryption + policy-as-evidence
"Production deployments feel safe and are never a surprise" | Approval strategy, branch protection, policy-as-code gates, speculative plans on PR
"We can onboard new teams to Terraform quickly without reinventing the wheel" | Golden module library, private registry, InnerSource ownership model, documentation
"We've eliminated long-lived secrets from our CI/CD pipelines" | OIDC/Dynamic Credentials, ephemeral resources (TF 1.10+), variable set governance