The Maturity Model
Four Guides at a Glance
| Guide | Primary Audience | Core Question | Prerequisite |
|---|---|---|---|
| SDG | Platform engineers, infra architects, DevOps admins | How do we deploy TFE correctly and keep it running? | None |
| Adopt | Platform team standing up TFE as a shared service | How do we configure TFE and onboard teams onto it? | SDG completed |
| Standardize | Platform team maturing toward org-wide governance | How do we enforce guardrails and scale to the whole org? | Adopt + maturity review |
| Scale (Beta) | Platform team at scale, enabling self-service workflows | How do we extend TFE capabilities once the platform is mature? | Adopt + Standardize |
TFE Components
| Component | Role | Key Notes |
|---|---|---|
| TFE Application | Core application container provided by HashiCorp | Consult HVD Module code for default machine size per cloud/K8s env |
| HCP Terraform Agent | Optional isolated execution environments for runs | Highly recommended; TFE includes built-in agents if not deployed separately |
| PostgreSQL | Primary store for workspace settings, user data, application state | Review supported versions; required for all modes except disk |
| Redis Cache | Caching & coordination between core web/background workers | Required for active-active. Native Redis services from AWS/Azure/GCP validated. Redis Cluster not supported. Redis Sentinel not supported. |
| Object Storage | State files, plan files, config, output logs | All objects symmetrically encrypted (AES-128 CTR). S3-compatible, GCS, or Azure Blob |
Operational Modes
Disk Mode
All TFE services — including PostgreSQL and object storage — deploy onto a single node using local Docker disk volumes. No failover, no active/active; fully self-contained.
Characteristics
- Minimal resource requirements; no specialized expertise to deploy
- Rapid to stand up
- No failover capability — single AZ, single node
- Cannot scale without downtime
External Services Mode
PostgreSQL and object storage move to dedicated infrastructure. The cache remains internal (transient data only). The application runs on a single compute node. Each public cloud provides native PostgreSQL and object storage services that support this pattern.
Characteristics
- Stateless application front-end; distributed core components
- Improved resilience — eliminates single points of vulnerability for data
- Single compute node is still a single point of failure for the application layer
- Does not provide performance scalability
Active/Active Mode
Multiple stateless TFE instances run across at least three AZs, connected to external PostgreSQL, shared object storage, and external Redis. The SDG explicitly recommends this mode for production.
Why This Mode
- n-2 failure profile — survives failure of two AZs
- Eliminates potential service failure points
- Safeguards against revenue loss from unscheduled interruptions
- Ensures data integrity and data residency compliance
- Improves workload distribution and overall performance
Operational Complexity Trade-offs
- Automated TFE deployment process is mandatory
- Monitoring must account for multiple instances
- Custom automation required to manage application node lifecycle
- Note: Redis does not need to be external when running a single TFE node — HVD modules provision Redis automatically when active-active parameter is true
Operational Mode Decision
Design Attributes
The SDG evaluates decisions against four non-functional attributes. Each recommendation in the guide explicitly notes its impact:
| Attribute | Description |
|---|---|
| Availability | Minimizes the impact of subsystem failures on uptime (e.g., multiple load-balanced app instances) |
| Operational Complexity | Occasionally introduces upfront complexity to reduce ongoing operational burden (e.g., Packer-based immutable images) |
| Scalability | Avoids choices that introduce overhead at scale (e.g., automated onboarding vs. manual UI processes) |
| Security | Notes how decisions change security posture (e.g., workload identity/OIDC vs. long-lived credentials) |
Personnel Roles
Coordinates events, facilitates resources, and assigns duties to the Cloud Administration Team. Responsible for project-level planning including timeline and access acquisition.
Assumed knowledge
- Cloud architecture and administration
- Administration-level experience with Linux
- Practical knowledge of Docker
- Practical knowledge of Terraform
Focuses on integrating formal security controls required for services hosted in the chosen cloud environment. Critical for regulated industries.
Designated to own the TFE service post-deployment. Handover planning and documentation should occur before go-live.
Access Requirements
The installation team requires direct, administrator-level access to the following before starting:
| Resource Type | Examples |
|---|---|
| Compute & Storage Instances | VMs, storage volumes, EBS/managed disks |
| Network Objects | Firewall rules, load balancers, security groups |
| TLS Certificate Material | Certificate and private key matching the TFE hostname (in the SAN, not CN only). PEM encoded. Signed by a public or private CA. Self-signed certificates are not recommended. |
| Identity & IAM | AWS IAM, GCP Cloud Identity, Azure Active Directory |
| Secrets Management | AWS Secrets Manager / KMS, GCP Secret Manager / Cloud KMS, Azure Key Vault, VMware vSphere Native Key Provider |
| TFE License File | Must be obtained from HashiCorp account team. Save as terraform.hclic. Single line, no newline character. Treat as a company asset. |
| DNS Record | DNS record must exist matching the SAN in the TLS certificate |
Network Egress Requirements
TFE should not be exposed to the public internet for ingress; users must be on the company network. TFE does, however, need outbound access to:
- `registry.terraform.io` — public module registry (official providers are indexed here; restrict community content via Sentinel/OPA)
- `releases.hashicorp.com` — Terraform binary releases (stay within two minor releases of the latest)
- `reporting.hashicorp.services` — license usage aggregation (allow-listing strongly recommended)
- Algolia — used by the Terraform Registry for search indexing
- VCS/SAML endpoints and public cloud cost estimation APIs, as applicable
Resource Sizing
| Component | AWS | Azure | GCP |
|---|---|---|---|
| Disk | EBS gp3 (3000 IOPS) | Premium SSD (5000 IOPS) | Balanced Persistent SSD (10000 IOPS) |
| Machine (default) | m7i.2xlarge (8 vCPU, 30 GB) | Standard_D8s_v4 (8 vCPU, 30 GB) | n2-standard-8 (8 vCPU, 30 GB) |
| Machine (scaled) | m7i.4xlarge (16 vCPU, 61 GB) | Standard_D16s_v4 (16 vCPU, 61 GB) | n2-standard-16 (16 vCPU, 61 GB) |
| Database | db.r6i.xlarge | GP_Standard_D4ds_v4 | db-custom-4-16384 |
| Cache (Redis) | cache.m5.large | Premium P1 | STANDARD_HA |
CPU Sizing Rules (All Providers)
- Avoid burstable CPU instances (AWS T-type, Azure B-type, GCP e2-/f1-/g1-series)
- Choose latest generation general-purpose x86-64 instances
- Use a CPU-to-RAM ratio of at least 1:4
- Do not use memory-optimized instances
Concurrency & RAM Calculations
The `TFE_CAPACITY_CONCURRENCY` variable controls concurrent workspace runs. Default RAM per agent is 2048 MiB. Formula: total RAM ≈ (concurrency × 2 GB) plus ~10% overhead for the OS and the TFE application (at least ~4 GB). For 30 concurrent agents: 30 × 2 GB = 60 GB + overhead ≈ 66 GB total.
Default machine (8 vCPU, 30 GB): maximum concurrency of 11. Scaled machine (16 vCPU, 61 GB): maximum concurrency of 26.
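The sizing arithmetic can be sketched as Terraform locals; the input numbers are illustrative, not prescriptive:

```hcl
# Illustrative sizing arithmetic only; tune the inputs to your environment.
locals {
  concurrency      = 30   # target TFE_CAPACITY_CONCURRENCY
  ram_per_agent_gb = 2    # default TFE_CAPACITY_MEMORY (2048 MiB)
  overhead_factor  = 1.10 # ~10% for the OS and the TFE application

  # 30 x 2 GB x 1.10 = 66 GB
  required_ram_gb = ceil(local.concurrency * local.ram_per_agent_gb * local.overhead_factor)
}
```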
HVD Module Deployment Process (High-Level)
- Import the TFE modules into your VCS repository
- Configure remote state storage (S3/Blob/GCS or the HCP Terraform free tier)
- Select a machine with the Terraform CLI available and cloud credentials instantiated
- Read the module's GitHub README in its entirety before starting
- Prepare the TLS certificate and private key (SAN must match the FQDN; no self-signed certificates)
- Run `terraform init`, `plan`, and `apply`
- Tail installation logs post-deployment; watch for errors
- Retrieve the Initial Admin Creation Token (IACT) within 60 minutes of deployment
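As a sketch of the remote state step, assuming an S3 backend (bucket name, key, and region are placeholders):

```hcl
terraform {
  # Remote state for the TFE deployment itself; do not keep this state on a workstation.
  backend "s3" {
    bucket = "example-tfe-deploy-state" # placeholder
    key    = "tfe/terraform.tfstate"
    region = "us-east-1"
  }
}
```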
General Guidance
- Separate TFE pods and HCP Terraform agent worker pods — agent workload is inconsistent under load
- Use HCP Terraform Operator instead of the internal Kubernetes driver run pipeline for customers going beyond default concurrency per TFE pod
- Three TFE pods are sufficient for HA — HCP Terraform agent cluster capacity has the greatest impact on run success at scale
- TFE supports x86-64 on all versions; ARM requires v1.0.0 or later
- Do not use burstable instance types (AWS T-type, Azure B-type, GCP e2-/f1-/g1-series)
K8s Resource Sizing
| Component | EKS | AKS | GKE |
|---|---|---|---|
| Disk | EBS gp3 | Premium SSD Managed Disks | Persistent SSD Disks |
| Machine (3-node cluster) | m7i.2xlarge (8 vCPU, 32 GB) | Standard_D8s_v5 (8 vCPU, 32 GB) | n2-standard-8 (8 vCPU, 32 GB) |
| Machine (5-node cluster) | m7i.xlarge (4 vCPU, 16 GB) | Standard_D4s_v5 (4 vCPU, 16 GB) | n2-standard-4 (4 vCPU, 16 GB) |
Approximate minimum cluster sizing for HCP Terraform agents with 3 TFE pods at system defaults:
- 3-node cluster: 96 GB total memory, 64 GB (n-1)
- 5-node cluster: 80 GB total memory, 64 GB (n-1)
Agent Count Formula
Default: `TFE_CAPACITY_CONCURRENCY = 10`. With 3 TFE pods, expect capacity of 30 concurrent agents. RAM per agent is 2 GB by default, configured via `agentWorkerPodTemplate` in the Helm overrides.
Internal Run Pipeline vs. HCP Terraform Operator
Network Considerations
- Specify a version tag for the HCP Terraform agent image (e.g., `tfc-agent:<tag>`) — using `:latest` pulls the image on every run, adding unnecessary network load
- HVD Modules deploy layer 4 load balancers (highest throughput available)
- Load HCP Terraform agent Docker image from a region-local source (ECR) rather than public internet when possible
- Do not use instances with burstable network characteristics
Common Troubleshooting
| Error | Likely Cause | Resolution |
|---|---|---|
| ImagePullBackOff | Cluster cannot pull TFE container from HashiCorp registry | Check permissions, the image version in locals_helm_overrides.tf, and that the license file has no trailing newline (inspect with `base64 terraform.hclic`) |
| CrashLoopBackOff | TFE container failing to start | Open two terminals — one to tail terraform-enterprise.log, one to run helm install. Capture startup error for support ticket. |
Private Cloud Component Considerations
Multiple concurrent TFE nodes require an external Redis instance; this is a hard requirement for active-active. If Redis is not feasible on-premises, HashiCorp recommends considering a public cloud deployment or HCP Terraform (SaaS). Redis Sentinel is not supported.
The only alternative is external operational mode with a single TFE container — acceptable if your RTO allows, but plan ahead for Redis, as business HA requirements typically increase over time.
TFE depends specifically on PostgreSQL. Private cloud requires an organizational pattern for deploying and production-managing PostgreSQL at the supported versions. Recommended: version 15.x or later (14.x has some unsupported versions). Version 17.x supported through end of 2029.
Liaise with your DBA team early — there are specific schema requirements.
Sizing
- CPU: 4 core
- Memory: 32 GB RAM
- Disk: 2 TB
TFE requires S3-compatible storage. In a private datacenter, this requires a third-party technology. HashiCorp sees significant success with Dell ECS and MinIO. If the organization already has an S3-compatible pattern, use that.
Compute Sizing (Private Cloud)
Recommended operating systems: Red Hat Enterprise Linux or Ubuntu LTS.
| Component | Recommended Spec |
|---|---|
| TFE Compute | 4 vCPU, 32 GB RAM, 1 TB disk (many scaled customers use 8 vCPU/32 GB as initial production spec) |
| Disk (Docker) | Min 40 GB available to /var/lib/docker; recommend 3000 IOPS minimum |
| PostgreSQL | 4 core, 32 GB RAM, 2 TB disk |
| Redis Cache | 4 core, 16 GB RAM, 500 GB disk. Redis 6.2+ or 7. Recommend 7. |
Monitoring PostgreSQL and Redis
Monitor CPU, memory, available disk space, and disk IO using organizational telemetry. Create alerts at 50% and 70% utilization thresholds. If any parameter consistently exceeds 70%, increase the resource.
Network Requirements
TFE has specific ingress and egress requirements — refer to the TFE network requirements page for the latest. If a corporate proxy is filtering outbound traffic, add required destinations to the allow-list. Use a layer 4 load balancer. Air-gapped mode is available if external access is not possible.
Operating System
- Use OS configurations compliant with the CIS benchmark for the chosen operating system
- Limit CLI access to machines to a shortlist of well-known staff
- Ensure the organization's SIEM/audit log reflects all access
Application
- Use single sign-on (SSO) with multi-factor authentication (MFA) for all users
- TCP port restrictions for ingress/egress are configured by the deployment. Do not alter unless advised by HashiCorp support, a solutions engineer, or certified partner.
- Enable the `Strict-Transport-Security` response header
- For manual installs: set `restrict_worker_metadata_access` as a Docker environment variable to prevent Terraform operations from accessing the cloud instance metadata service
- HVD Module automated deployments restrict access to the AWS metadata service — do not re-enable it
- After deployment, do not create the initial administrator immediately — coordinate a handoff to the operations team
Primary Planning Considerations
The benefit of duplicating system tiers in a secondary region outweighs cost given TFE's mission-critical role. Calculate TCO of two TFE instances (one per region) including geo-redundant data layer costs. Also calculate the cost to the business if developers cannot deploy applications — this is often the more compelling number. Include any committed spend under enterprise discount programs.
Use automated means to deploy all infrastructure in both regions. Use HVD Modules to mirror resources across regions — deploy and maintain state for each region separately.
- Test region failover capability on a regular cadence — at least twice annually
- Document both failover and failback processes step-by-step in run books
- Have team members who did not write the document use it — validates clarity and trains staff
- Deploy an engineering pair of TFE instances (one per region) mirroring production; use for meaningful failover tests
- Maintain independent instances: each environment must have its own DNS, storage, and supporting services
- Perform fault injection testing using cloud provider features or third-party tooling
Component-Specific Guidance
- Keep VM/container images version-controlled and available in the failover region. Use Packer as the standard for machine image creation.
- Do not run TFE containers in the secondary region while primary is online — risk of premature DB read replica promotion causing corruption
- Keep compute cluster infrastructure deployed but scaled down until failover
- Keep TFE containers at the same version in both regions; upgrade during the same change window
- Co-locate primary and secondary compute layers in the same regions as their respective storage and database components
| Cloud | Recommendation | RPO Consideration |
|---|---|---|
| AWS | S3 Cross-Region Replication (CRR), live replication, bidirectional during failover, S3 Replication Time Control for monitoring | 99.99% of objects replicate within 15 min. Check for missing objects in run book before failing over. |
| Azure | Geo-zone-redundant storage (GZRS) — 16 nines durability. Use Standard general-purpose v2 storage accounts. Deploy only in paired regions with AZ support. | Azure Storage Geo Priority Replication guarantees 99% of blobs replicate within 15 min. |
| GCP | Dual-region GCS buckets with Turbo Replication | Turbo Replication guarantees 100% of objects replicate within 15 min. Premium feature — recommended for mission-critical TFE. |
| VMware | Deploy identical object store in each region; use strategic inter-DC connections for migration. Most customers solve with vSAN. | Work with VMware team to understand replication SLA between data centers. |
| Cloud | Recommendation |
|---|---|
| AWS | Use Aurora with cross-region read replicas via the `aws_rds_global_cluster` resource. Aurora global databases replicate in ~1 second. Monitor the `AuroraReplicaLag` and `AuroraGlobalDBReplicationLag` metrics. |
| Azure | Use Azure Database for PostgreSQL read replicas. Set `geo_redundant_backup_enabled = true`. Monitor replication lag — Azure auto-sets the DB to read-only below 5% free storage, which adversely affects TFE. |
| GCP | Use Cloud SQL with cross-region read replicas. Enable PITR. |
Redis does not require cross-region replication. Deploy Redis in both regions and ensure it is ready in the failover region before starting TFE. HVD Modules handle this when used iteratively for both regions.
The Platform Team
The platform team serves as the central hub, orchestrating functions and ensuring streamlined operations across teams. It may consist of one or more teams with separate areas of responsibility.
Drives cloud adoption and aligns cloud strategies with business objectives. Establishes governance frameworks, fosters knowledge sharing, and optimizes cloud resources to maximize value and ensure compliance.
Manages essential tools and automation to support efficient system operations. Implements golden workflows and reusable modules. Collaborates with stakeholders to ensure services meet consumer needs. May or may not be a separate sub-team.
Collaborates with security to ensure adherence to standards and best practices. Promotes standardization, scalability, and compliance across projects.
Streamlines development cycles and reduces organizational complexity. Uses a product management approach, allowing development teams to consume golden paths in a self-service mode, enhancing productivity and efficiency.
Security Team Collaboration
The security team collaborates with the platform team to establish governance policies and deploy monitoring tools. Key cloud security functions include:
- Subscribing to security updates and vulnerability alerts; collaborating with SRE on patches
- Managing TLS certificate validity and renewal for TFE
- Scanning TFE instances for vulnerabilities; integrating SIEM tools for audit logging
- Providing governance inputs to CCoE (PaC guardrails, CIS benchmarks)
Producers and Consumers
| Role | Responsibilities |
|---|---|
| Producers (Platform Team) | Provide seamless onboarding to the HCP Terraform platform. Manage the private registry. Oversee Policy-as-Code implementation. Offer enablement to consumers. |
| Consumers (Application Teams) | Initiate requests for platform access. Write Terraform code based on available private registry modules and platform team recommended practices. |
Golden IaC Workflows
A golden workflow is a standardized, repeatable process for completing a specific task. The platform team shifts these workflows left using pre-approved Terraform configurations and curated modules, empowering teams to independently provision infrastructure while maintaining centralized management.
| Workflow Type | Description |
|---|---|
| Producer — Module Development | Create Terraform modules and register them with the private registry |
| Producer — PaC Development | Develop Sentinel/OPA code and register policy sets in HCP Terraform |
| Consumer — Landing Zone | Provision cloud accounts/credentials and core TFE configuration elements from reusable modules; deploy VCS repos, projects, workspaces, and Stacks |
| Consumer — Developer | Create IaC relevant to support application components under remit |
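The producer module-development workflow typically ends with registering the module in the private registry. A minimal sketch using the `tfe` provider; the organization and repository identifiers are hypothetical:

```hcl
resource "tfe_registry_module" "vpc" {
  organization = "examplecorp-prod" # placeholder org name

  vcs_repo {
    display_identifier = "example-org/terraform-aws-vpc" # placeholder repo
    identifier         = "example-org/terraform-aws-vpc"
    oauth_token_id     = var.vcs_oauth_token_id
  }
}
```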
Key Features
- Vending portal: Central platform where users view, request, and build standardized components. Platform team maintains; offers self-service experience with audit and chargeback policy support.
- Pre-configured components: Range from full application templates to specific infrastructure aspects — guarantees deployment uniformity.
- Accelerated consumption path: Business units choose from validated patterns for faster delivery, or opt for custom architecture (slower process).
Supporting TFE Features
| Feature | Use |
|---|---|
| Private Registry | Host and share internal modules/providers; versioned and searchable |
| Private Providers & Modules | Restrict access to org members; cross-org sharing available in TFE |
| Run Tasks | Direct integration with third-party tools at specific run lifecycle stages |
Key Components
- Central control team (franchiser): Provides workflow and resources, keeps system running, sets rules, adds capabilities.
- Consumption workflow: End users have a path to access provisioning resources and manage their own infrastructure.
- Upfront governance: Governance at the outset ensures compliance while enabling provisioning — avoids separate compliance sign-off at go-live.
- Controlled vending: Proactive controls ensure legal, regulatory, and enterprise standards are met.
Supporting TFE Features
| Feature | Use |
|---|---|
| Terraform Workspaces | Persistent working directory per collection of infrastructure resources |
| Workspace Projects | Enables self-managed portions of TFE with same policies as root org |
| Sentinel Policies | Policy-as-code for fine-grained, logic-based policy decisions using external source data |
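Attaching a Sentinel policy set via the `tfe` provider might look like the following sketch; names and the VCS path are placeholders:

```hcl
resource "tfe_policy_set" "baseline" {
  name         = "baseline-guardrails" # placeholder
  organization = "examplecorp-prod"    # placeholder
  kind         = "sentinel"
  global       = true                  # enforce across all workspaces in the org

  vcs_repo {
    identifier     = "example-org/sentinel-policies" # placeholder repo
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```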
After automated installation, the initial admin user may be created via the Initial Admin Creation Token (IACT) as an optional final step. Two viable options exist for post-provision configuration:
- API scripts
- Terraform Enterprise provider (best practice — use this to derive state for the configuration)
Automate team creation alongside projects, Stacks, and workspaces using the TFE provider. Automate user addition within your IdP. Configure a team attribute name (default: MemberOf) in your IdP to automatically assign users to groups in SAML.
- For service accounts used by pipelines: use `IsServiceAccount: true` in SAML
- Create teams in TFE with the exact name of the group in the IdP
- Do not create users manually — TFE creates them automatically at SAML assertion login
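Team creation against the IdP group name can be sketched with the `tfe` provider; names are hypothetical, and `sso_team_id` carries the SAML team attribute value:

```hcl
resource "tfe_team" "app_devs" {
  name         = "app-devs"          # must exactly match the IdP group name
  organization = "examplecorp-prod"  # placeholder org
  sso_team_id  = "app-devs"          # value sent in the SAML team attribute (MemberOf)
}
```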
Connect HCP Terraform/TFE to your VCS provider to enable workflows for managing modules, policy sets, and connecting VCS-backed Stacks and workspaces.
Audit: For TFE, forward logs for monitoring and auditing. For HCP Terraform, use the Audit Trails API. Include audit, logging, and monitoring in the target architecture — do not wait until after go-live to implement observability.
HCP Terraform Agents: Define agent pools and assign Stacks/workspaces using the TFE provider. Multiple agents can run concurrently on a single instance (license limits apply to HCP Terraform but not TFE). For containerized agents, use single-execution mode for a clean working environment per run.
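Defining an agent pool and assigning a workspace to it can be sketched with the `tfe` provider (names are placeholders; newer provider versions may prefer `tfe_workspace_settings` for execution mode):

```hcl
resource "tfe_agent_pool" "k8s" {
  name         = "k8s-agent-pool"   # placeholder
  organization = "examplecorp-prod" # placeholder
}

resource "tfe_workspace" "app" {
  name           = "app-network"       # placeholder
  organization   = "examplecorp-prod"
  execution_mode = "agent"             # route runs to the agent pool
  agent_pool_id  = tfe_agent_pool.k8s.id
}
```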
Hierarchy Overview
| Level | Scope | Key Notes |
|---|---|---|
| Organization | Encompasses all components — teams, projects, workspaces, Stacks, policies, registry, VCS, variable sets, SSH keys | Centralize core provisioning in a single org. Naming: `<customer-name>-prod` and `<customer-name>-test`. |
| Project | Container for workspaces and Stacks. Inherits team permissions, variable sets, and policy sets. | Primary tool for delegating config/management in multi-tenant setups. Typically allocated to an application team. |
| Workspace / Stack | Manages a Terraform configuration and its associated state file | Workspace-level permission applies to that workspace only |
Access Management
HCP Terraform access is built on three components: User accounts, Teams, and Permissions. Implement SAML/SSO for user management combined with RBAC.
Comprehensive access to all org aspects. Certain tasks are reserved to owners only: creating/deleting teams, managing org-level permissions, viewing the full team list (including secret teams). Limit membership to a small group of trusted platform team members.
| Permission | Purpose |
|---|---|
| Manage Policies | Create, edit, read, list, delete Sentinel policies; access to read runs on all workspaces for enforcement |
| Manage Run Tasks | Create, edit, delete run tasks within the org |
| Manage Policy Overrides | Override soft-mandatory policy checks |
| Manage VCS Settings | Manage VCS providers and SSH keys |
| Manage Private Registry | Publish and delete providers/modules — owners only |
| Manage Membership | Invite/remove users; add/remove from teams. Cannot create teams or view secret teams. |
Projects are containers for workspaces and Stacks. Configuration elements attached at project level are inherited by all workspaces/Stacks within: team permissions, variable sets, policy sets.
Standard Project-Level Permissions
- Admin: Full control over the project (including deleting it)
- Maintain: Full control of everything in the project, except the project itself
Benefits of Project Delegation
- Agility: Teams can create and manage infrastructure in their designated project without requesting org-admin access
- Reduced Risk: Project permissions give admin access to a subset of workspaces/Stacks without cross-team interference
- Self-Service: Projects integrate no-code provisioning — project admins can deploy no-code modules without org-wide workspace management privileges
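Project-level delegation can be sketched with the `tfe` provider; all names are hypothetical:

```hcl
resource "tfe_team" "alpha_devs" {
  name         = "alpha-devs"       # placeholder team
  organization = "examplecorp-prod" # placeholder org
}

resource "tfe_project" "alpha" {
  name         = "team-alpha"       # placeholder project
  organization = "examplecorp-prod"
}

resource "tfe_team_project_access" "alpha_maintain" {
  team_id    = tfe_team.alpha_devs.id
  project_id = tfe_project.alpha.id
  access     = "maintain" # full control inside the project, but not of the project itself
}
```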
Create and maintain a naming convention document covering teams, projects, workspaces, Stacks, and other entities. Pass this to all new internal clients to standardize operations.
Recommended org naming: `<customer-name>-prod` (production workloads) and `<customer-name>-test` (integration testing and PoCs).
Cloud Provisioning Key Steps
- Plan the project: installation/config of TFE, self-service capability design. Consider onboarding early adopters before general availability — use stepwise refinement.
- Consider platform team size and bandwidth — if any onboarding step is not automated, it compounds with scale.
- Plan a landing zone provisioning workflow.
- Ensure the platform team is fully trained on IaC with Terraform and the cloud providers in use. All contributors must adhere to the HashiCorp Terraform language style guide.
- Set up a TFE Stack or workspace with cloud credentials.
- Store Terraform code in your strategic VCS.
- Provision cloud resources from the VCS-backed Stack or workspace.
- Consider configuration required to enable end-to-end deployment within org security and compliance requirements. Identify manual steps and what would be needed to automate each.
Landing Zones
A cloud landing zone is a foundational, standardized environment for secure, scalable cloud operations. The platform team deploys a control workspace during onboarding of each internal customer.
- Networking: VPCs, subnets, connectivity settings
- IAM: Policies, RBAC, permissions enforcement
- Security & Compliance: Encryption, security groups, logging
- Operations: Monitoring, logging, automation tools
- Cost Management: Tagging policies, budget alerts, cost reporting
Major cloud providers: AWS Control Tower, Azure Landing Zone Accelerator, Cloud Foundation Toolkit (GCP).
Core Requirements
- Use the Terraform Enterprise provider for state representation
- Define a VCS template with boilerplate Terraform code and a directory structure managed by the platform team
- Create a Terraform module for the private registry that the platform team calls during automated onboarding. This module creates:
- A control workspace for the application team
- A VCS repository for the application team
- Variables/sets as needed
- Public and private cloud resources for the team
- Hook the onboarding pipeline into other org platforms (credential generation, observability, etc.)
- Dedicate a TFE project to house landing zone control workspaces separately from other platform team workspaces
- Ticket raised: Audit trail created, approval acquired
- Landing zone child module call code added: Top-level workspace collects and manages onboarded teams. Automate the addition of each child module call — do not do this manually at scale.
- Run the top-level workspace: Creates the control workspace for the application team, VCS repo, variables, cloud resources
- Application team onboards and begins using their workspace
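Each onboarded team then becomes one child module call in the top-level landing zone workspace. A sketch with a hypothetical module source and inputs:

```hcl
# One call per onboarded application team; additions should be automated, not hand-edited.
module "landing_zone_team_alpha" {
  source  = "app.terraform.io/examplecorp/landing-zone/tfe" # hypothetical registry path
  version = "~> 1.0"

  team_name      = "team-alpha"  # placeholder inputs
  vcs_repo_owner = "example-org"
  # ...credentials, variable sets, and cloud resources as required
}
```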
Workflow Personas
| Persona | Role | Example Titles |
|---|---|---|
| Developer | Develops infrastructure and application code | Software engineer, application developer |
| Lead Developer | Helps efforts of product developer teams | Development lead, technical team lead |
| Release Engineer | Coordinates deployment to production using automation | Release engineer, release manager |
| Platform Engineer | Writes pipeline definitions; enables developers to use pipelines | DevOps engineer, operations engineer |
| Infrastructure Operator | Maintenance, configuration, administration | SRE, site reliability engineer, systems admin |
VCS-Driven Workflow
A specific VCS repository backs each workspace or Stack. HCP Terraform uses webhooks to monitor commits, pull requests, and tags. Changes trigger plan runs; PRs trigger speculative plans.
Prerequisites
- VCS repository containing source code for the deployment
- VCS authentication enabling secure access for HCP Terraform
- VCS permissions configured (read-only, merge, etc.)
- Workspace or Stack naming conventions and permissions defined
High-Level Steps
- Configure VCS integration for your organization
- Connect workspace or Stack to the desired branch in your VCS repository
- Adopt a branching strategy (e.g., standard feature branching)
- Enable speculative plan runs for each branch
- Define PR process per organizational standard
- (Optional) Configure automatic run triggers based on git tag pushes
- Branch change: Trigger when a specific branch changes — long-running or merged feature branches
- Tag match (pattern): Trigger only for changes with a specific tag format. Supports semantic versioning, prefix, suffix, or custom regex.
| Tag Format | Regex Pattern |
|---|---|
| Semantic Versioning | `^\d+\.\d+\.\d+$` |
| Version with prefix | `\d+\.\d+\.\d+$` |
| Version with suffix | `^\d+\.\d+\.\d+` |
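A tag-triggered workspace can be sketched with the `tfe` provider's `vcs_repo.tags_regex` argument; identifiers are hypothetical, and the dots must be escaped (doubled backslashes inside the HCL string):

```hcl
resource "tfe_workspace" "release" {
  name         = "app-release"      # placeholder
  organization = "examplecorp-prod" # placeholder

  vcs_repo {
    identifier     = "example-org/app-infra" # placeholder repo
    oauth_token_id = var.vcs_oauth_token_id
    tags_regex     = "^\\d+\\.\\d+\\.\\d+$"  # trigger runs on semantic version tags only
  }
}
```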
Auto-apply is configurable per workspace. It is useful in non-interactive, non-production environments to run `terraform apply` automatically after a successful plan, and works regardless of how the plan was triggered (VCS, API, etc.).
API-Driven Workflow
For customers with strategic CI/CD pipeline orchestrators, HCP Terraform and TFE form the infrastructure management component of those pipelines. In this model, the CI/CD tool drives the Terraform run via the TFE API rather than through VCS webhooks. This supports more complex orchestration patterns where infrastructure changes are part of a larger pipeline.
Terraform 1.5 introduced the configuration-driven `import` block. It allows declaration of resources to import in configuration, bulk import via `for_each`, a planning phase before import, and generation of configuration for imported resources (`terraform plan -generate-config-out`). This is the recommended approach.
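A minimal `import` block, assuming an existing S3 bucket (resource and bucket names are placeholders):

```hcl
# Bring an existing bucket under Terraform management without recreating it.
import {
  to = aws_s3_bucket.logs
  id = "examplecorp-app-logs" # the existing bucket's name
}

resource "aws_s3_bucket" "logs" {
  bucket = "examplecorp-app-logs"
}
```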
What to Import vs. What Not to Import
Do not import: purely logical resources (e.g., `null_resource`), resources managed by teams not using Terraform, and resources that can simply be rebuilt with acceptable downtime.
| Team | Responsibilities |
|---|---|
| Platform Team | Set up/maintain workspaces, projects, orgs, policy sets, cloud auth. Develop import best practices and module usage guidance. Provide training to application teams. Convert repeated configs into modules. |
| Application Team | Identify which resources to import. Work with platform team on accurate transition. Provide resource attribute information. Run day-to-day plans/applies through standardized module usage. |
Phased Approach
- Application team gets guidance from platform team on which resources to manage
- Start with a small pilot set; complete to documentation and review
- Gradually expand resources from the same team; then move to other teams
- Continuously review and refine the process; ensure application teams maintain Day 2 responsibilities
- Platform team enables drift detection features to catch out-of-band changes
- After Phase 1, identify common configuration patterns
- Place common resources into modules to scale Terraform maturity and consumption
- Add granularity between modules accounting for permissions and security (you cannot partially instantiate a module)
- Use projects, variable sets, and Sentinel policy sets to organize the new structure
Feature Availability by Product
| Feature | HCP Terraform | Terraform Enterprise |
|---|---|---|
| Operational Logs | No (HashiCorp SRE manages) | Yes (your SRE team manages) |
| Audit Trail | Yes | Yes |
| Metrics | No (HashiCorp SRE manages) | Yes |
| HCP Terraform Agent Logs | Yes | Yes |
Observability Feature Definitions
- Operational logs: Track performance and behavior — error messages, warnings, events. Used by SRE to monitor and maintain service.
- Audit trail: Security-focused — login attempts, access control changes, suspicious activities. Used by security analysts and incident response teams.
- Metrics: Application component performance and usage data — detect service quality issues or inform capacity planning.
TFE Monitoring Focus Areas
Configure Prometheus to gather metrics from TFE and its underlying components. TFE generates operational metrics Prometheus can collect: CPU, memory usage, request latency, and more.
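An illustrative scrape job is sketched below; the hostname is a placeholder, and the port and format parameter should be verified against your TFE release (recent releases expose metrics on the port set by `TFE_METRICS_HTTP_PORT`, 9090 by default, when metrics collection is enabled):

```yaml
scrape_configs:
  - job_name: "terraform-enterprise"
    metrics_path: "/metrics"
    params:
      format: ["prometheus"]   # TFE returns JSON unless Prometheus format is requested
    static_configs:
      - targets: ["tfe.example.com:9090"]   # placeholder host; TFE_METRICS_HTTP_PORT
```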
Use Grafana to visualize metrics. The official Terraform Enterprise Grafana dashboard (ID: 15630) provides real-time insights into resource utilization trends and request rates.
If using HCP Terraform agents, include agent logs in your log collection and analysis. This ensures a complete picture of all activities. Analyze agent logs alongside TFE logs to better optimize and improve deployment.
- Each team member should have an account created through the HashiCorp support portal
- Configure each team member permitted to open support tickets as an Authorized Technical Contact. Provide the list to your assigned HashiCorp Solutions Engineer for configuration.
- Familiarize the team with documentation on how to open a support ticket and generate/upload a support bundle when applicable
- Be aware of your support plan level to manage response time expectations
- Understand the severity level definitions when opening a ticket — see the Customer Success Enterprise Support page
Health Assessment Components
Drift detection identifies configuration drift: when changes are made directly to infrastructure outside Terraform's managed processes, creating discrepancies between live state and the code-defined desired state.
Limitations
- Unmanaged attributes: Drift detection only covers attributes explicitly managed by Terraform. Manually modified settings outside Terraform's control won't be flagged.
- External additions: Resources added entirely outside Terraform (e.g., manually created IAM users) are not detected.
Prescriptive Guidance
- Enable health assessments for all workspaces — set globally in Settings → Health
- Enable workspace notifications so admins are alerted via Slack or email when drift is detected
- TFE admins can modify assessment frequency and maximum concurrent assessments from admin settings console
Continuous validation enforces standards over time — set rules for security, cost, or other requirements, and Continuous Validation checks they're always being met. Covered in the Scaling Operating Guide.
Drift Resolution Workflow
Once drift is detected, the workspace notifies the application team. They decide the best resolution:
Refresh State vs. Update Configuration
| Operation | What it Does | When to Use |
|---|---|---|
| Refresh State | Updates Terraform's internal state file to match actual infrastructure — does NOT modify infrastructure | To ensure Terraform's understanding is accurate for identifying drift |
| Update Terraform Configuration | Submits updated config files and executes plan/apply — DOES modify infrastructure | When incorporating intentional drift changes or correcting configuration to match desired state |
Registry Roles
| Role | Responsibilities | Personas |
|---|---|---|
| Registry Administrator | Publish and delete modules/providers from the private registry (public or private sources). Organization-level permission — assign to a specific team. | Platform team members, CI/CD pipelines automating the publish process |
| Producer | Create and maintain modules/providers. Publish new releases. Needs commit access to the VCS repo + Registry Administrator permissions. | Platform team responsible for custom modules, service owner teams |
| Consumer | Find and use providers/modules necessary to provision infrastructure. Needs commit access to VCS repo and write access to TFE workspaces. | Application team members, platform team using the TFE provider |
Module Requirements
- VCS Repository: Module code must be hosted in a supported VCS repository
- Semantic Versioning: Must use semantic versioning scheme for version constraints to work
- Naming Convention: Must follow `terraform-<PROVIDER>-<name>` format (e.g., `terraform-aws-ec2-instance`)
- Standard Module Structure: Enables the registry to generate documentation, track resource usage, parse examples, run tests
- VCS Provider Configured: Must have a VCS provider configured with administrative access to the module repositories
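To illustrate how consumers reference a registry module once these requirements are met (hostname, organization, module name, and inputs below are placeholders):

```hcl
module "ec2_instance" {
  # Private registry source format: <TFE_HOSTNAME>/<ORGANIZATION>/<NAME>/<PROVIDER>
  source  = "tfe.example.com/platform-org/ec2-instance/aws"
  version = "~> 1.2"   # semantic version constraint resolved against release tags

  instance_type = "t3.micro"
}
```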
| Method | Best For | Notes |
|---|---|---|
| Tag-based | Modules associated with release tags in VCS | Registry auto-detects and publishes new versions based on tags. Consider implementing tag protection rules in VCS. |
| Branch-based | Enhanced flexibility; required for integrated testing | Allows selection of a specific branch with an assigned version number |
| API / CI/CD Pipeline | Automation and scale | Recommended for delegating publishing to CI/CD |
Additional Benefits of Custom Modules
Beyond reusability, custom modules enable organizations to encode:
- Naming conventions
- Security controls
- FinOps standards (tagging, cost allocation)
Key Benefits of Sentinel
- Risk mitigation: Actively lowers chances of errors and vulnerabilities by enforcing rules during planning and execution
- Regulatory governance: Ensures every action aligns with org policies, regulatory guidelines, and security standards at scale; simplifies auditing
- Separation of concerns: Policies managed by platform/compliance/security teams, separate from the application teams deploying infrastructure; workspace owners cannot opt out without explicit permissions
- Sandboxing: Policies act as core guardrails — reduces need for manual verification
- Codification: Makes governance clearer, more efficient, consistent, and operationally reproducible. Eliminates reliance on undocumented, word-of-mouth knowledge. Fosters transparency.
- Version control: History tracking, diffs, PRs — demonstrable and auditable policy evolution
- Testing: Sentinel's built-in testing framework enables automated CI testing — reduces TCO for governance
- Automation: Policy deployment is far faster than manual work and requires fewer humans; ensures consistency at scale
Policy Enforcement Workflow
- Define governance/compliance policies and translate them into Sentinel policy requirements
- Code policies using the Sentinel language; arrange in policy sets
- Scope policy sets to the entire organization, or to one or more projects/workspaces
- Policy enforcement levels: advisory (warns but doesn't block), soft-mandatory (can be overridden by authorized users), hard-mandatory (cannot be overridden)
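A minimal sketch of such a policy, using the `tfplan/v2` import to enforce mandatory tags on new or changed `aws_instance` resources (the tag names are illustrative, not a prescribed standard):

```sentinel
import "tfplan/v2" as tfplan

required_tags = ["owner", "cost-center"]   # example organizational standard

# aws_instance resources being created or updated in this plan
instances = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_instance" and
  rc.mode is "managed" and
  (rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Every matched instance must carry all required tags
main = rule {
  all instances as _, rc {
    all required_tags as t {
      rc.change.after.tags is not null and
      keys(rc.change.after.tags) contains t
    }
  }
}
```

Attached at soft-mandatory, this warns and allows authorized overrides; at hard-mandatory, it blocks non-compliant applies outright.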
People and Responsibilities
- Sentinel policies owned by the Security team with input from regulatory compliance areas
- CCoE/Platform Team understands Sentinel code and how to employ it in production pipelines
- CCoE/Platform Team partners with Security Team to manage ownership and RBAC over policy code
Repository Organization Best Practices
- `policies/`: Main Sentinel policies, organized by environment. Each environment subdirectory contains policies and a test subdirectory (test files, mock data, Terraform config used to generate mock data).
- `modules/`: Reusable policy modules. Common Terraform import functions stored here — flat structure with illustrative filenames.
- `docs/`: Always fully document policies and their tests.
- Acquire IT security policies relevant to infrastructure deployment from your security team — translate into Sentinel policy-as-code
- If no policy list exists, agree on general controls internally and design them to be extended over time
- Staff responsible for policy development must understand the Sentinel language — read official language documentation
- Attend HashiCorp Sentinel Academy training (hands-on labs and real-world examples) — contact your HashiCorp Solutions Engineer or Customer Success Manager
- Use Sentinel modules (0.15.0+) to specify reusable functions that reduce codebase length
- Clone the `terraform-sentinel-policies` GitHub repo — provides prewritten policies for public/private cloud providers and reusable function modules
Common Starting Policies
- Governance of maintenance windows (protecting from adverse change at wrong times)
- Enforcement of metadata tagging of cloud resources
- IaC style enforcement (e.g., Terraform module versions pinned, only from private registry)
How Run Tasks Work
Run tasks send an API payload to an external service at a specific run stage. The service processes the data, evaluates whether the run passes or fails, and sends a response back to HCP Terraform. Based on the enforcement level, HCP Terraform determines if the run can proceed.
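For illustration, the external service reports its verdict by PATCHing a payload like the following to the callback URL included in the original request (the message and report URL are placeholders):

```json
{
  "data": {
    "type": "task-results",
    "attributes": {
      "status": "passed",
      "message": "4 resources scanned, no findings",
      "url": "https://scanner.example.com/report/123"
    }
  }
}
```

A `status` of `failed` combined with a mandatory enforcement level stops the run; under advisory enforcement, the result is recorded but the run proceeds.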
Run Stages
| Stage | Available Data | Use Case |
|---|---|---|
| Pre-plan | Code and other attributes | Examine code to determine if entering the plan stage should be allowed |
| Post-plan / Pre-apply | Plan results | Examine the plan and determine whether an apply should be allowed (most common stage) |
| Post-apply | Provisioned infrastructure data | Testing and gathering/storing information about provisioned infrastructure |
Common Use Cases
| Category | Example Tools |
|---|---|
| Security & Compliance | Palo Alto Networks Prisma Cloud, Zscaler, Snyk, Tenable, Sophos, Aqua Security, Firefly |
| Cost Control | Infracost, Vantage, Kion |
| Visibility | Pluralith (resource visualization) |
| Image Compliance | HCP Packer run task (verify approved golden images) |
Implementation Flow
- Select the desired run task from the public Terraform Registry; review requirements and documentation
- Establish and verify two-way connectivity between HCP Terraform platform and the run task endpoint (network/security modifications may be required)
- Create the run task in the Terraform Organization (connect endpoints, test communication path)
- Associate the run task with a workspace; configure the stage and enforcement level (Advisory or Mandatory)
- Run task executes as part of normal run cycles; review results in run completion output
When to Use Self-hosted Agents
TFE Scaling Strategies (Priority Order)
- Migrate to active-active operational mode and increase TFE node count
- Add more resources to VMs hosting TFE nodes; increase capacity config params (`TFE_CAPACITY_CONCURRENCY`, `TFE_CAPACITY_CPU`, `TFE_CAPACITY_MEMORY`)
- Deploy HCP Terraform agents — the next logical step once the above limits are reached
Agent Pool Design
Design agent pools based on:
- Product limits (HCP Terraform enforces concurrency/agent limits per tier; TFE has no agent concurrency limit)
- Cloud provider — dedicated pools per provider simplify credential management
- Environment — dedicated pools per environment (dev/staging/prod) prevent accidental cross-environment changes
- Operations — separate pools for highly privileged operations; granular permissions and cleaner audit trails
Recommended pattern: `{environment}-{cloud-provider}-agentpool`
Examples: `dev-aws-agentpool`, `staging-azure-agentpool`, `prod-gcp-agentpool`
- Use standardized abbreviations; use hyphens or underscores as delimiters; keep lowercase; avoid special characters and spaces
- Document the naming convention in org wiki; communicate to all relevant stakeholders
Two primary deployment modes:
- Virtual machines: Run the agent on VMs (e.g., EC2 instances). Use Packer + HCP Packer integration to build agent images. Use autoscaling groups/managed instance groups/scale sets. Use rolling upgrade features.
- Kubernetes (recommended for K8s-skilled customers): Use the K8s Operator for autoscaling capabilities. See the K8s Operator section for details.
Use the following metrics to drive VM-based agent scaling decisions:
- `tfc-agent.core.status.busy` — number of agents in busy status at a point in time
- `tfc-agent.core.status.idle` — number of agents in idle status at a point in time
API endpoints also provide information for automating scaling decisions — review the agent documentation for current endpoints.
Custom Resources Introduced
| CRD | Purpose |
|---|---|
| AgentPool | Manages HCP Terraform Agent Pools and Agent Tokens. Supports on-demand scaling operations for HCP Terraform agents. |
| Module | Facilitates API-driven Run Workflows; streamlines execution of Terraform configurations. |
| Project | Manages HCP Terraform Projects — organized and efficient project handling. |
| Workspace | Manages HCP Terraform Workspaces — structured environment for resource provisioning and state management. |
Use Case 1: Auto-scaling Agent Pools
The Operator manages agent pool lifecycle and deployment via the AgentPool CRD. It can monitor workspace queues to trigger autoscaling based on defined min and max replicas.
- Increase agents up to `autoscaling.maxReplicas` or the licensed limit (whichever is reached first)
- Reduce agents to `autoscaling.minReplicas` within `autoscaling.cooldownPeriodSeconds` when no pending runs exist
- Set `minReplicas` based on baseline run concurrency for health checks (drift detection and continuous validation)
Sizing AgentPool Autoscaling
- maxReplicas: Determined by peak-run concurrency demand and HCP Terraform tier constraints. Scale-test your cluster to ensure peak load is handled.
- minReplicas: Consider baseline run concurrency from health checks (drift detection, continuous validation).
- Set memory limits and resource requests on agents — helps efficient node placement; critical if using cluster scaling technologies like Karpenter.
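A sketch of an `AgentPool` manifest wiring these settings together — the organization, secret, and pool names are placeholders, and field names should be verified against the CRD reference for your Operator version:

```yaml
apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
metadata:
  name: dev-aws-agentpool
spec:
  organization: my-org              # placeholder organization
  token:
    secretKeyRef:
      name: tfc-credentials         # K8s secret holding the API token
      key: token
  name: dev-aws-agentpool           # pool name as it appears in the organization
  agentTokens:
    - name: pool-token
  agentDeployment:
    replicas: 1
  autoscaling:
    minReplicas: 1                  # cover baseline health-assessment runs
    maxReplicas: 5                  # bounded by peak demand and tier limits
    cooldownPeriodSeconds: 120      # wait before scaling back down
```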
Use Case 2: Self-Service Infrastructure via Kubernetes Native Consumption
The Operator lets application developers define infrastructure configuration using Kubernetes configuration files. It delegates the reconciliation phase to HCP Terraform. This frees developers from needing to learn HCL for infrastructure management tasks — useful when your application teams are Kubernetes-native and prefer K8s manifest-based workflows.
Security Considerations
Agent tokens stored in the Kubernetes cluster must be secured using your organization's K8s secrets management approach. Review the Operator documentation for specific security guidance. Egress requirements for the HCP Terraform agent apply when agents are deployed via the Operator — includes provider endpoint connectivity, Terraform registry access, and Terraform releases access.
Why Continuous Validation?
Failed infrastructure changes can introduce project delays and expose the organization to operational or security risks. Continuous validation gives advance notice of issues preventing successful changes, so they can be addressed before a Terraform apply fails in production.
Best Practice Recommendation
- When a new workspace is created, enable continuous validation (explicitly at workspace level or implicitly at org level)
- Include necessary logic in Terraform configuration to validate important components whose health may change over time
- If infrastructure changes fail in the future due to an unchecked condition, update the Terraform configuration to incorporate the new validation — and apply this pattern to existing infrastructure code
Rule of Thumb — Which Resources to Validate
- Check the status of any critical resource that can fail (e.g., VMs)
- Check validity of resources with user-defined time frames whose failure impacts the application stack (e.g., TLS certificates)
- Not necessary for inherently durable resources (e.g., S3 buckets — native to cloud provider)
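For example, a `check` block (Terraform 1.5+) can validate a TLS certificate's expiry on every health assessment — the endpoint here is hypothetical, and the `hashicorp/tls` provider is assumed:

```hcl
check "app_certificate" {
  data "tls_certificate" "app" {
    url = "https://app.example.com"   # placeholder endpoint
  }

  assert {
    # Fails the check (without blocking runs) once the leaf certificate expires
    condition     = timecmp(plantimestamp(), data.tls_certificate.app.certificates[0].not_after) < 0
    error_message = "TLS certificate for app.example.com has expired."
  }
}
```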
Implementation Requirements
| Language Feature | Minimum Terraform Version |
|---|---|
| Preconditions and postconditions | 1.2 and later |
| Check block | 1.5 and later |
Permissions required: Organization health settings require Owners team membership. Individual workspace settings require Workspace Admin access.
Notification Event Categories
| Event | Trigger | Priority |
|---|---|---|
| Check Failed | Continuous validation check returns unknown or failed | Critical |
| Drift Detected | Every time drift is detected on this workspace | Critical |
| Health Assessment Errored | Health assessment cannot complete successfully | Critical |
| Auto-destroy Reminder | Sends reminder 12 and 24 hours before auto-destroy run | Critical |
| Auto-destroy Results | Results of an auto-destroy run | Critical |
Run Events
| Event | Trigger | Priority |
|---|---|---|
| Created | Run created, enters Pending state | Low |
| Planning | Run acquires lock and starts executing | Low |
| Needs Attention | Human decision required — plan changed, not auto-applied, or policy override required | Critical |
| Applying | Plan confirmed or auto-applied | Low |
| Completed | Run completed successfully | Low |
| Errored | Run terminated early due to error or cancellation | Critical |
Implementation Guidance
Configure notifications via the web UI, the API, or the Terraform TFE provider (`tfe_notification_configuration` resource). HashiCorp recommends using the TFE provider to configure notifications as part of the project/workspace creation process.
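A sketch using the TFE provider's `tfe_notification_configuration` resource (the webhook variable and workspace reference are placeholders):

```hcl
resource "tfe_notification_configuration" "slack_alerts" {
  name             = "critical-run-events"
  enabled          = true
  destination_type = "slack"
  url              = var.slack_webhook_url   # placeholder webhook URL

  # Focus on critical events to avoid alert fatigue
  triggers = ["run:needs_attention", "run:errored", "assessment:drifted"]

  workspace_id = tfe_workspace.dev.id        # hypothetical workspace reference
}
```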
Notification Strategy
- Choose appropriate destination: Slack is popular, but use whatever fits team workflow (email, Teams, etc.)
- Granular notifications: Avoid broad notifications that cause alert fatigue — focus on critical events
- Integration with incident management: Integrate with incident management tools so alerts lead to actionable items
Maintenance
- Periodically review notification settings and adjust based on changing infrastructure needs and team feedback
- Test when making changes — trigger events manually to verify notifications are received
- Continuously monitor and solicit feedback to reduce noise and improve relevance
No-code provisioning deploys each module into a new TFE workspace. This is a consideration for platform teams managing license consumption: each no-code deployment creates a new workspace.
Roles and Responsibilities
| Role | Responsibilities |
|---|---|
| Registry Administrator (Platform Team) | Design, build, and publish no-code modules to the private registry. Ensure modules are configured to allow no-code provisioning. Define and document required variable values. |
| Project Admin (Application Team) | Configure and deploy no-code modules within their project. Manage the resulting workspace lifecycle. |
Permissions
- Marking a module as no-code enabled requires the Manage Private Registry permission or Owners team membership
- Deploying a no-code module requires Project Admin permission or higher
- HCP Terraform/TFE uses the module's configured variable set or workspace variables for cloud credentials
Configuring at Scale
- Use the TFE provider or API to automate no-code module configuration when setting up new projects
- Define variable sets at the project level to provide the necessary cloud credentials — these are inherited by no-code workspaces
- Document the no-code provisioning process for consumers so they understand what is available and how to use it
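A sketch of automating this with the TFE provider's `tfe_no_code_module` resource (the organization, module reference, and option values are placeholders):

```hcl
resource "tfe_no_code_module" "web_app" {
  organization    = "my-org"                        # placeholder organization
  registry_module = tfe_registry_module.web_app.id  # module already in the registry

  # Constrain the choices presented to no-code consumers
  variable_options {
    name    = "instance_type"
    type    = "string"
    options = ["t3.micro", "t3.small"]
  }
}
```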
Use Cases
- Spin up and tear down feature-branch infrastructure automatically
- CI/CD environments that need fresh infrastructure per pipeline run
- Time-boxed customer or internal demos and proof-of-concept deployments
- Any scenario where infrastructure should not persist beyond a defined lifecycle
Roles and Responsibilities
| Role | Responsibilities |
|---|---|
| Platform Team | Define standards for ephemeral workspace usage. Configure auto-destroy schedules and notification settings. Provide automation patterns for teams to create and destroy ephemeral workspaces. |
| Application Team | Create ephemeral workspaces following platform team standards. Manage the lifecycle within defined parameters. Monitor notifications for auto-destroy events. |
Permissions
- Creating ephemeral workspaces and configuring auto-destroy requires Workspace Admin permission or higher at the project level
Configuring at Scale
- Use the TFE provider or API to automate ephemeral workspace creation as part of CI/CD pipelines
- Set auto-destroy notifications (12h and 24h reminders are available) so teams are aware of impending destruction
- Define standard auto-destroy schedules in project-level documentation — prevents unintended persistence of temporary resources
- Integrate auto-destroy results notifications with your incident management or team communication channels
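A sketch of creating an ephemeral workspace with the TFE provider — names are placeholders, and `auto_destroy_activity_duration` destroys the workspace's resources after the stated period of inactivity:

```hcl
resource "tfe_workspace" "feature_env" {
  name         = "feature-x-preview"      # placeholder workspace name
  organization = "my-org"                 # placeholder organization
  project_id   = tfe_project.sandbox.id   # hypothetical sandbox project

  # Queue a destroy run after 3 days without activity
  auto_destroy_activity_duration = "3d"
}
```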