Terraform Discovery & Implementation Guide
A structured decision guide for running Terraform discovery engagements. Navigate By Phase when running a full discovery from scratch. Switch to By Domain when referencing a specific technical area mid-engagement.
Air-Gap — Formal Definition
An air-gapped environment is one where the entire installation and its environment are physically and logically isolated from external networks, including the public internet.
- No internet access — inbound or outbound
- Manual installation and updates (no auto-update)
- On-premises Version Control System (GitLab, Gitea)
- Private Module Registry (self-hosted)
- Private Container Registry
- Internal license server
- All dependencies must be pre-staged internally
- Provider binaries must be mirrored to internal registry
- Terraform binary updates require manual pull and internal distribution
- No external SaaS integrations (Slack, PagerDuty, etc.)
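The provider-mirroring requirement above can be enforced through Terraform's CLI configuration, which supports forcing installation from an internal filesystem mirror. A minimal sketch (the mirror path is an assumption about your internal layout):

```hcl
# ~/.terraformrc (terraform.rc on Windows) — CLI configuration, not module code.
# Force provider installation from the internal mirror; block direct internet fetches.
provider_installation {
  filesystem_mirror {
    path    = "/opt/terraform/provider-mirror"   # assumed internal staging path
    include = ["registry.terraform.io/*/*"]
  }
  direct {
    exclude = ["registry.terraform.io/*/*"]      # never reach the public registry
  }
}
```

Populate the mirror on a connected machine with `terraform providers mirror /opt/terraform/provider-mirror`, then transfer the directory into the air gap as part of the pre-staging process.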
Ask separately: "Does your legal team mandate that state files or sensitive variables must never leave a specific cloud region or VPC?" Data sovereignty is distinct from air-gap and can push toward TFE even in cloud-connected environments.
HCP Terraform = always on latest.
TFE = customer chooses upgrade cadence.
If they need to pin platform versions for change management or stability, TFE wins this axis regardless of staffing.
The legacy "free plan" ended March 31, 2026. Any customer still on that plan needs a migration conversation immediately — this is not optional.
| Method | Pros | Cons | Action |
|---|---|---|---|
| Click-Ops (manual console) | No barrier to entry; good for PoCs and discovery | Non-repeatable, drift-prone, high human error, no audit trail | Educate & Migrate |
| Imperative scripts (Bash / Python / CLI) | Familiar, flexible, no extra tools required | State-blind, spaghetti maintenance, no dependency management | Migrate — Day 2 glue only |
| Terraform / OpenTofu (declarative IaC) | State management, plan-before-apply, 3,000+ providers | HCL learning curve, state complexity, abstracted errors | Mature & Expand |
| Cloud-native IaC (CloudFormation / Bicep) | Day-zero provider support, no state management burden | Cloud lock-in, cannot manage SaaS tools or cross-cloud | Supplement Only |
Skill level directly determines the implementation path, the starting point for automation maturity, and whether education and training need to be scoped as a pre-requisite. Pitching API-driven workflows to a team that's never written HCL is a failure mode.
This is the same product. The decision is operational convenience vs. total environmental control. Refer to Deal Breakers A–D before running this comparison — they may decide it before you get here.
HCP Terraform
- Management: Fully managed by HashiCorp
- Updates: Automatic — always latest
- Infrastructure: Runs on HashiCorp Cloud Platform
- Pricing: Resources Under Management (RUM)
- Best for: Speed to value, low ops overhead
Terraform Enterprise (TFE)
- Management: Customer-managed (VM or Kubernetes)
- Updates: Manual — customer controls upgrade cycle
- Infrastructure: Customer VPC, data center, or air-gap
- Pricing: License-based (per workspace/user)
- Best for: Strict compliance, air-gapped, high-security
"Global, Consistent, Standardized" → lean toward Registry-driven workflows.
"Agile, Autonomous, Fast" → lean toward VCS-driven with OPA guardrails.
CLI-only insistence → dig deeper. They often don't trust automation yet. This is an opportunity to demo Speculative Plans and build confidence in the pipeline.
Cloud footprint directly informs state backend selection. A multi-cloud customer who picks cloud-native backends will end up managing three separate state systems with different locking mechanisms and IAM configurations. That operational burden is avoidable with HCP Terraform.
| Mode | Trigger | Best For | Governance |
|---|---|---|---|
| VCS-Driven (recommended default) | Git events (PR/merge) | Most teams; standard apps | Highest — built-in audit trail |
| CLI-Driven | terraform apply from terminal; remote execution | Iterative dev; break-fix | Medium — consistent but manual |
| API-Driven / Custom CI/CD | External script or orchestrator | Complex pipelines; multi-stage | Customizable — high setup effort |
| Generic CI/CD (DIY) | Jenkins / GitHub Actions (non-HCP) | Non-HCP Terraform users | Lowest — manual state and lock mgmt |
Present this as a maturity progression: CLI → VCS-driven → API-driven. Most orgs start at CLI, move to VCS for standard workflows, and eventually adopt API-driven for multi-stage orchestration. Meet them where they are, show them where they're going — don't force the jump.
HCP Terraform generates a temporary OIDC token per run → cloud provider trusts this token via Workload Identity Federation → grants a short-lived session (typically 1 hour). Even if the CI/CD pipeline is compromised, there are no static admin keys to steal — the token expires.
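On the cloud side, this trust is expressed as an OIDC identity provider plus a scoped role trust policy. A sketch for AWS (the organization, workspace, and role names are assumptions; `thumbprint_list` may still be required on older AWS provider versions):

```hcl
# Register HCP Terraform's OIDC issuer as an identity provider in AWS.
resource "aws_iam_openid_connect_provider" "hcp_terraform" {
  url            = "https://app.terraform.io"
  client_id_list = ["aws.workload.identity"]
}

# Role that HCP Terraform runs can assume — no static keys anywhere.
resource "aws_iam_role" "tf_plan_apply" {
  name = "hcp-terraform-networking-prod" # assumed role name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.hcp_terraform.arn }
      Condition = {
        StringEquals = { "app.terraform.io:aud" = "aws.workload.identity" }
        StringLike = {
          # Scope the trust to a single org/workspace, any project and run phase.
          "app.terraform.io:sub" = "organization:acme:project:*:workspace:networking-prod:run_phase:*"
        }
      }
    }]
  })
}
```

The `sub` condition is the blast-radius control: narrow it per workspace so a compromised token from one workspace cannot assume a role intended for another.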
Regardless of tooling or team maturity: never allow an apply in Production without a documented and approved plan. This applies at every maturity level.
- PR is the gate — approval happens in VCS
- Speculative Plan posts as a comment on PR open
- Code Owner must approve before merge to main
- Enforce branch protection rules — no direct push to main
- Best for: High-velocity dev and staging
- Workspace set to "Manual Apply"
- Plan runs but pauses in "Pending" state
- Lead logs into HCP UI, reviews final plan, confirms apply
- Even code-approved changes get a final production gate
- Best for: Production environments
- Run Tasks trigger a Change Request in ServiceNow
- Apply blocked until ticket moves to Approved state
- High friction — reserve for Layer 1 and 2 core infra only
- Best for: High-compliance workloads, CAB-required changes
- Sentinel/OPA auto-approves changes that pass all policies
- Hard Mandatory violations block and require senior override
- Override events must be captured in the audit stream
- Goal: Reduce toil without removing governance
| Environment | Approval Method | Who Approves? |
|---|---|---|
| Sandbox / Dev | Auto-Apply (no manual gate) | System — policy check only |
| Staging / QA | VCS Pull Request Approval | Peer Developer |
| Production | HCP Manual Review + Policy Check | Cloud Lead / SRE |
| Core Network / Identity | External CR (ServiceNow) + Two-Key Policy | Change Advisory Board (CAB) |
Move customers away from static secret management toward identity-based dynamic injection. Even if they can't reach the gold standard today, set the direction and build a roadmap toward it.
Mark secret inputs with sensitive = true to redact them from CLI output (note: values still land in state). Prefer fetching secrets at run time via a data source rather than hardcoding them, and on TF 1.10+ use Ephemeral Resources so secret values never persist in plan or state files.
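The TF 1.10+ ephemeral resource pattern can be sketched as follows. This is an illustrative example, not the document's own: the secret name is an assumption, and the `password_wo` write-only argument requires TF 1.11+ and a recent AWS provider release.

```hcl
# Ephemeral resource (Terraform 1.10+): the secret is fetched at run time
# and is never written to the plan or state file.
ephemeral "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/app/db-password" # assumed secret name
}

resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app"

  # Write-only argument (TF 1.11+): accepts the ephemeral value for the
  # API call but stores nothing in state. Bump the version to rotate.
  password_wo         = ephemeral.aws_secretsmanager_secret_version.db_password.secret_string
  password_wo_version = 1
}
```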
State File — Defense in Depth
No matter how secrets are injected, the state file is always a potential vulnerability. Apply defense in depth:
Wrong state strategy causes state drift, corrupted environments, and security vulnerabilities — state files can contain sensitive data in plain text. Get this right before any other architectural decision.
Desired State: Configuration files in Version Control
Known State: What Terraform remembers in the .tfstate file
Actual State: The reality in the cloud provider — drift occurs when this deviates from Known or Desired
| Backend | When to Use | Key Trade-offs |
|---|---|---|
| Local State | Solo dev experimentation only. Never for org use. | Zero config, zero collaboration, zero locking — data on a laptop is a liability |
| Cloud-Native Backends (S3, Azure Blob, GCS) | Small-to-medium team with strong single-cloud presence | Cost-effective (pennies/month), state locking — but DIY security, versioning, and IAM |
| HCP Terraform / TFE (recommended) | Enterprise governance, private module registries, low-ops approach | Native state mgmt, built-in RBAC, Sentinel/OPA — cost is per-RUM; potential lock-in |
| TACOS (Spacelift, Scalr, env0) | Multi-IaC environments (TF + Pulumi + CloudFormation + OpenTofu) | Unified single pane of glass, drift detection, TTL environments — third-party dependency |
Cloud-Native Backend Selection by Provider
AWS (S3 + DynamoDB)
- State stored in S3 bucket
- Locking via DynamoDB table
- Enable bucket versioning (recovery from corruption)
- Enable KMS encryption at rest
- Restrict bucket policy — no public access
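The AWS checklist above maps directly onto the `s3` backend block. A sketch, with assumed bucket, table, and key names:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"             # assumed bucket name
    key            = "networking/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"            # lock table with a LockID string key
    encrypt        = true
    kms_key_id     = "alias/terraform-state"            # KMS CMK for encryption at rest
  }
}
```

Bucket versioning and the deny-public-access policy are configured on the bucket itself, not in this block.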
Azure (Blob Storage)
- State stored in Azure Blob container
- Native state locking — no extra service needed
- Enable blob versioning for recovery
- Enable Customer Managed Key (CMK) encryption
- Use private endpoints to restrict access
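The equivalent `azurerm` backend block, with assumed resource names. Note there is no lock-table setting: locking is native via blob leases.

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"   # assumed names throughout
    storage_account_name = "acmetfstate"
    container_name       = "tfstate"
    key                  = "networking/prod.tfstate"
    use_azuread_auth     = true                   # Entra ID auth instead of storage access keys
  }
}
```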
Google Cloud (GCS)
- State stored in GCS bucket
- Native state locking built in
- Enable object versioning for recovery
- Enable CMEK encryption at rest
- Restrict with uniform bucket-level access
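And the `gcs` backend block, again with an assumed bucket name. Locking is native (a lock object in the bucket); CMEK and uniform bucket-level access are configured on the bucket itself.

```hcl
terraform {
  backend "gcs" {
    bucket = "acme-terraform-state"   # assumed bucket name
    prefix = "networking/prod"        # path prefix for the state object
  }
}
```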
Enable bucket/blob versioning on all state backends — this is your recovery mechanism if state gets corrupted. Enable KMS/CMK encryption to protect sensitive data at rest. These are non-negotiable regardless of which cloud provider is used.
State Isolation — Blast Radius
Certain groupings of resources should have their own isolated state file. The smaller the state file scope, the smaller the blast radius. Always separate state backends (different S3 prefixes or HCP TF Workspaces) for each environment to prevent cross-environment resource deployment.
Monorepo
✓ Advantages:
- Full org visibility — search across all infra
- Shared CI/CD pipelines and linting rules
- Simplified VCS-level RBAC
⚠️ Risks:
- One bad CI trigger can break company-wide
- VCS performance degrades at scale
- High merge conflict rate across teams
Best for: global infra layers. Does not allow easy module versioning.
Multi-Repo
✓ Advantages:
- Hard blast radius isolation between services
- Clear team ownership per repo
- Granular per-service RBAC
⚠️ Risks:
- Configuration drift harder to enforce across repos
- Dependency tracking across 50+ repos is painful
- Heavy CI/CD pipeline and secrets management overhead
Best for: app infra isolation. Enables easy module versioning.
- Each reusable module in its own repo
- Version independently with SemVer tags
- Publish to Private Module Registry
- Small/mid-size: single "infra-live" monorepo
- Enterprise: split by Business Unit or foundation layer
Three Litmus Tests
"Do your network team and your application team have the same manager?" If they are separate organizations, they should have separate repositories. Forcing different teams into one repo creates a bottleneck in the peer-review process.
"How often do you deploy — once a week or 50 times a day?" High-velocity teams need multi-repo. A global lock on a monorepo can prevent a critical hotfix while someone else runs a long-lived terraform apply on a different section.
"If a developer makes a mistake in a GitHub Action and runs a 'destroy' at the root of the repo, what is lost?" If the answer is "the whole company" — move to multi-repo or multi-project structure immediately to enforce hard boundaries.
Terraform Workspaces
Same codebase, switched context. Good for identical, short-lived environments.
⚠️ Risks:
- Shared backend credentials across environments → potential cross-env apply
- Workspaces are invisible on disk — can't browse environments in the repo
- Best used for ephemeral or short-lived environments only
Directory Isolation
Each environment gets its own folder and its own root module with its own backend.
✓ Advantages:
- Credential isolation: different IAM roles per directory
- Version pinning: test new module versions in dev while prod stays pinned
- Clear visibility: browsing the repo shows exactly what environments exist
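The version-pinning advantage looks like this in practice. The registry address and version numbers below are assumptions; the point is that each environment directory pins independently.

```hcl
# environments/dev/main.tf — dev tries the new module release first
module "network" {
  source  = "app.terraform.io/acme/network/aws" # assumed private-registry address
  version = "2.1.0"
}

# environments/prod/main.tf — prod stays pinned until dev validates 2.1.0
module "network" {
  source  = "app.terraform.io/acme/network/aws"
  version = "2.0.4"
}
```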
Solving the Backend Copy/Paste Problem
The common complaint about directory isolation is copying backend.tf into every environment folder. Three solutions:
- Wrapper tooling (e.g., Terragrunt): a popular wrapper that handles backend config dynamically — eliminates the copy/paste entirely. Standard recommendation for mature teams.
- Symlinks: link a shared providers.tf into every environment folder. Simple but brittle — symlinks can break in some CI/CD environments.
- Partial backend configuration: terraform init -backend-config=path/to/backend.dev.hcl injects environment-specific backend settings at runtime. No file duplication — slightly more complex init process.
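Partial backend configuration in sketch form, with assumed file and resource names:

```hcl
# backend.tf — shared by every environment folder; values left blank on purpose
terraform {
  backend "s3" {}
}

# backend.dev.hcl — per-environment settings, injected at init time:
#   bucket         = "acme-terraform-state-dev"
#   key            = "app/terraform.tfstate"
#   region         = "us-east-1"
#   dynamodb_table = "terraform-locks-dev"
#
# Then, from the dev environment folder:
#   terraform init -backend-config=backend.dev.hcl
```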
| Layer | Change Freq | Blast Radius | Ownership |
|---|---|---|---|
| Identity | Very Low | Global / Total | Security Team |
| Networking | Low | Regional / Critical | Cloud NetEng |
| Platform | Medium | Cluster-wide | Platform / DevOps |
| Application | High | Service-specific | App Dev Teams |
Layer 4 (Application) needs data from Layer 2 (Network) — like vpc_id or subnet_ids. Do not use terraform_remote_state data sources — they require access to the entire state file, which violates blast radius isolation. Instead: use HCP Terraform Outputs or a targeted data source to read only the specific values needed across layers.
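The HCP Terraform Outputs approach can be sketched with the `tfe` provider's `tfe_outputs` data source (organization, workspace, and resource names are assumptions):

```hcl
# Read only the published outputs of the network workspace — not its state file.
data "tfe_outputs" "network" {
  organization = "acme"
  workspace    = "core-network-prod"
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # assumed AMI
  instance_type = "t3.micro"
  # nonsensitive_values avoids marking the whole plan output as sensitive
  subnet_id     = data.tfe_outputs.network.nonsensitive_values.subnet_ids[0]
}
```

The app team needs only read access to the network workspace's outputs, not its state, which preserves the blast-radius boundary between layers.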
Best for: Regulated industries; orgs starting their TF journey.
Best for: Large, mature orgs with high DevOps autonomy.
Best for: Most enterprise clients targeting scale without sacrificing governance.
Ownership Matrix
| Task | Platform Team | App/Stream Team |
|---|---|---|
| Foundational Modules (VPC, IAM, DNS) | Owner | Consumer |
| Service Modules (RDS, S3, EKS) | Co-Owner (Standards) | Co-Owner (Features) |
| Application Logic Infrastructure | Consultant | Owner |
| Security Scanning (Checkov/Snyk) | Sets Policy | Fixes Findings |
Key Ownership Pillars
- Versioning: modules follow SemVer vX.Y.Z tagging. Deprecation requires a migration path or sunset period — not a silent removal.
- Testing: releases ship only when terraform test integration tests pass. Ownership means knowing when your module is broken before your consumers do.

Module ownership (who writes it) and registry ownership (who is accountable for it being available and functional) are not always the same team. Blurring this line creates gaps — modules get published but nobody owns the support burden when a provider version upgrade breaks downstream consumers.
Registry Best Practices
- required_providers blocks must specify minimum and maximum acceptable versions to prevent surprise breakage on provider updates.
- terraform test must be a gate in the module's CI pipeline before the registry publish step runs.
Policy as Compliance Evidence
When Sentinel or OPA evaluates a run, it produces a JSON record of evaluation — every rule that was checked, and whether it passed or failed. This is auditor-ready evidence that compliance rules are being enforced by the system, not by humans manually reviewing configurations after the fact. It converts audits from manual retrospective reviews into automated, continuous proof.
If a Hard Mandatory policy is violated and a senior architect performs a manual override, that override event — including justification — must be captured in the audit stream. Overrides are not silent. This is what gives auditors confidence that exceptions are controlled and documented.
Use Sentinel or OPA to automatically block non-compliant infrastructure before it is provisioned. The execution mode is the last line of defense. Policy should catch violations during the Plan phase — never rely on post-apply remediation as the primary control.
RBAC aligns the security model with organizational structure to prevent blast-radius accidents. Use a funnel approach: broad at the Organization level, granular at the Workspace level.
| Level | Who | Permissions | Notes |
|---|---|---|---|
| Organization | CCoE / Lead Architects | Manage Policies, Registry, Teams | Keep Owners team to 2–3 max. Most admins should be limited to Manage Workspaces — not full org control. |
| Project | Lead Engineers per domain | Admin or Write at Project Level | Teams create their own workspaces without needing central admin approval every time. |
| Workspace | Individual Contributors | Read / Plan / Write | Most developers should be Plan-only in production workspaces. |
Who Should Apply in Production?
Humans should almost never have direct apply permissions in production. Grant apply permissions to a Service Principal only (OIDC/Dynamic Credentials). Workflow: Dev triggers PR → CI/CD runs Plan → Human reviews/approves → System executes Apply.
For Layer 1 (Identity & Governance) and Layer 2 (Core Networking) resources: implement a required two-approval gate before any apply can proceed. This ensures a single rogue user or compromised account cannot destroy foundational infrastructure. Implement via HCP Terraform Team approvals requiring two distinct reviewers, or VCS branch protection requiring two separate approvals.
| Team | Layer / Workspace | Permission | Rationale |
|---|---|---|---|
| Networking Team | Core Network / VPC | Admin | They own the fabric and manage its lifecycle end to end |
| App Dev Team | Application Services | Write (non-prod) / Plan (prod) | High velocity in dev; protected gates in production |
| Security Team | All Workspaces | Read-Only | Auditing and compliance without operational risk |
| App Dev Team | Core Network | Read-Only | They consume vpc_id and subnet_ids — they must not change them |
Eliminate static secrets: HCP Terraform Dynamic Credentials — workspace authenticates via OIDC token per run. No static secrets to steal.
Map IdP Groups to Teams: Always map Okta/AD Groups to HCP Terraform Teams. User access is tied to the org's auth service, not directly to HCP users — offboarding is automatic.
Custom Permissions (Business Tier): Separate "Queue Run" (ability to trigger a plan) from "Approve Run" (ability to apply). This is the gold standard — not all Run writers should be Run approvers.
Drift detection is not a monthly audit — it is a rhythmic heartbeat. Treat a "Drift Warning" with the same urgency as a "Build Failure." If your team wouldn't ignore a failed CI/CD pipeline, they shouldn't ignore drift either.
| Workload Type | Recommended Frequency | Implementation Method |
|---|---|---|
| Critical Production | Continuous / Hourly | HCP Terraform Health Assessments (Standard/Plus Editions) |
| Standard Production | Daily (every 24h) | Scheduled CI/CD pipeline running terraform plan |
| Development / Sandbox | Weekly / On-Demand | Manual trigger or weekly cron job |
As of early 2026, HCP Terraform Health Assessments are part of Standard and Plus Editions only. Customers on older free tiers must implement drift detection via scheduled CI/CD pipelines.
Responding to Drift — Three Paths
- Revert: run terraform apply to overwrite the manual change
- Use when: unauthorized changes, accidental deletions, compliance violations (opened security groups)
- This is the standard response for drift that violates policy
- Codify: update Terraform code to match the new cloud reality
- Use when: an emergency manual fix was actually correct and must be codified permanently
- Import: use the import block (TF 1.5+) to bring unmanaged resources under control without CLI surgery
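The import-block path, sketched with assumed resource and bucket names:

```hcl
# Bring an existing, hand-created bucket under Terraform management
# declaratively — no `terraform import` CLI surgery required.
import {
  to = aws_s3_bucket.audit_logs
  id = "acme-audit-logs"          # the real bucket's name in AWS (assumed)
}

resource "aws_s3_bucket" "audit_logs" {
  bucket = "acme-audit-logs"
}
```

On TF 1.5+, `terraform plan -generate-config-out=generated.tf` can even draft the matching resource block for you before the first apply.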
Prevention — Best Practices
- ignore_changes for expected drift: an Auto-Scaling Group's desired_capacity changes by design. Use lifecycle { ignore_changes = [...] } for these attributes to suppress false-positive alerts.

A terraform apply without a linked Git commit or Change Request ID should be treated as a Security Incident — not a routine deployment. Build this expectation into the customer's culture from day one.
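The ignore_changes suppression above, sketched for an Auto Scaling Group (all names are assumptions; the launch template is defined elsewhere):

```hcl
resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2                          # initial value only
  vpc_zone_identifier = [var.app_subnet_id]        # assumed variable

  launch_template {
    id      = aws_launch_template.app.id           # defined elsewhere
    version = "$Latest"
  }

  lifecycle {
    # The autoscaler changes desired_capacity by design — don't flag it as drift.
    ignore_changes = [desired_capacity]
  }
}
```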
- What & When: Every HCP Run is automatically linked to a VCS commit — provides immutable record of what changed and when
- Why: Integrate Jira or ServiceNow via Run Tasks to attach a Change Request ID to every Terraform plan — closes the "why was this done" gap
- Local TF platform logs are a compliance risk — they can be rotated or deleted
- Enable HCP Audit Log Streaming → external SIEM immediately
- Track: sensitive variable access, policy overrides, service account applies, admin permission changes
- SIEM retention must match industry compliance requirements
| Audit Level | Component | Best Practice |
|---|---|---|
| System | HCP Terraform | Enable Audit Log Streaming to SIEM — do not rely on local logs |
| Process | VCS / PRs | Enforce signed commits and mandatory PR approvals before merge |
| Data | State File | CMK Encryption + Versioning + Access Logging on the state backend |
| Compliance | Sentinel / OPA | Log all Policy Pass/Fail events; capture all override events with justification |
These are the non-negotiables for any mature Terraform implementation. Use this as a readiness check at the close of discovery. Check items off live during the conversation.
- No .tfstate files on a laptop. Use HCP Terraform, S3/GCS/Azure Blob, or TFE with state locking enabled.

"If we look back in six months, what is the one thing that must be true for you to consider this Terraform implementation a success?"
Use their answer to anchor the implementation plan. Common success signals and how they map to technical focus areas:
| Customer Says | Primary Focus Area |
|---|---|
| "Developers can self-serve environments without waiting on the platform team" | Module registry, workspace templates, RBAC self-service at Project level |
| "No more manual changes in the console — everything is in code" | Read-only console policy, drift detection, break-glass process documentation |
| "We can prove compliance to external auditors without manual evidence collection" | Sentinel/OPA + audit log streaming to SIEM + state file encryption + policy-as-evidence |
| "Production deployments feel safe and are never a surprise" | Approval strategy, branch protection, policy-as-code gates, speculative plans on PR |
| "We can onboard new teams to Terraform quickly without reinventing the wheel" | Golden module library, private registry, InnerSource ownership model, documentation |
| "We've eliminated long-lived secrets from our CI/CD pipelines" | OIDC/Dynamic Credentials, ephemeral resources (TF 1.10+), variable set governance |