Terraform HVD Navigator
🏗️
SDG
Solution Design Guide
Architecture, deployment paths, sizing, security hardening, and multi-region DR for Terraform Enterprise. Applies across all maturity stages — the platform layer never expires.
🌱
Adopt
Operating Guide: Adoption
People & process, consumption models, org structure, VCS workflows, observability, and Day 2 operations. Foundation for standing up TFE as a shared service.
📋
Standardize
Operating Guide: Standardization
Health assessments, private registry, Sentinel policy-as-code, and run tasks. Focus shifts to governance and making TFE available org-wide.
📈
Scale
Operating Guide: Scaling (Beta)
Self-hosted agents, Kubernetes operator, continuous validation, no-code provisioning, ephemeral workspaces. For customers with a mature platform expanding capabilities.

The Maturity Model

Adopt
IaC & cloud provisioning, secure variables, VCS integration, RBAC, observability. The platform is new and onboarding early teams.
Standardize
Private registry, policy-as-code, run tasks, drift detection. The platform becomes a shared org-wide service with guardrails.
Scale
Cost management, self-service workflows, agent-based execution, no-code provisioning, ephemeral workspaces. Mature platform, operational efficiency focus.

Four Guides at a Glance

Guide | Primary Audience | Core Question | Prerequisite
SDG | Platform engineers, infra architects, DevOps admins | How do we deploy TFE correctly and keep it running? | None
Adopt | Platform team standing up TFE as a shared service | How do we configure TFE and onboard teams onto it? | SDG completed
Standardize | Platform team maturing toward org-wide governance | How do we enforce guardrails and scale to the whole org? | Adopt + maturity review
Scale (Beta) | Platform team at scale, enabling self-service workflows | How do we extend TFE capabilities once the platform is mature? | Adopt + Standardize
Source Constraint
All content in this navigator is sourced strictly from the four Terraform HVD PDFs: Solution Design Guide, and Operating Guides for Adoption, Standardization, and Scaling. Nothing is invented outside those sources.

TFE Components

Component | Role | Key Notes
TFE Application | Core application container provided by HashiCorp | Consult HVD Module code for default machine size per cloud/K8s env
HCP Terraform Agent | Optional isolated execution environments for runs | Highly recommended; TFE includes built-in agents if not deployed separately
PostgreSQL | Primary store for workspace settings, user data, application state | Review supported versions; required for all modes except disk
Redis Cache | Caching & coordination between core web/background workers | Required for active-active. Native Redis services from AWS/Azure/GCP validated; Redis Cluster and Redis Sentinel are not supported.
Object Storage | State files, plan files, config, output logs | All objects symmetrically encrypted (AES-128 CTR). S3-compatible, GCS, or Azure Blob

Operational Modes

💿
disk mode
Dev, testing, familiarization only — never production

All TFE services — including PostgreSQL and object storage — deploy onto a single node using localized Docker disk volumes. No failover. No active/active. Fully self-contained.

Hard Limit
Do not use disk mode for any production environment. It lacks enterprise-grade resiliency and cannot scale performance without service interruption. Acceptable only for functional testing, training, and gaining familiarity.

Characteristics

  • Minimal resource requirements; no specialized expertise to deploy
  • Rapid to stand up
  • No failover capability — single AZ, single node
  • Cannot scale without downtime
🔌
external mode
External DB + object storage, internal cache, single compute node

PostgreSQL and object storage move to dedicated infrastructure. Cache remains internal (transient only). Application runs on a single compute node. Each public cloud provides native services for PostgreSQL and object storage that support this pattern.

Characteristics

  • Stateless application front-end; distributed core components
  • Improved resilience — removes single points of failure in the data layer
  • Single compute node is still a single point of failure for the application layer
  • Does not provide performance scalability
Positioning Note
External mode is a stepping stone, not a destination. If the customer's RTO requires zero downtime from AZ failures, push toward active-active. External is suitable when resilience matters but full HA is not yet required.
active-active mode
HashiCorp's recommendation for all production deployments

Multiple stateless TFE instances across at least three AZs, connected to external PostgreSQL, shared object storage, and external Redis. The SDG explicitly recommends this for production.

Why This Mode

  • n-2 failure profile — survives failure of two AZs
  • Eliminates potential service failure points
  • Safeguards against revenue loss from unscheduled interruptions
  • Ensures data integrity and data residency compliance
  • Improves workload distribution and overall performance
Hard Requirements
Active-active mode requires an automated deployment process — this is non-negotiable. Redis is also a hard requirement. Redis Cluster is not supported. Use native Redis services from AWS, Azure, or GCP. Redis Sentinel is not supported.

Operational Complexity Trade-offs

  • Automated TFE deployment process is mandatory
  • Monitoring must account for multiple instances
  • Custom automation required to manage application node lifecycle
  • Note: Redis does not need to be external when running a single TFE node — HVD modules provision Redis automatically when active-active parameter is true

Operational Mode Decision

Does the customer require zero downtime from expected failures (instance or AZ failure)?
✓ Yes
→ active-active mode
Non-negotiable for production services that cannot tolerate downtime. Requires automated deployment and external Redis (not Cluster or Sentinel).
Stepping stone
→ external mode
Acceptable if resilience matters but full HA is not yet required. Still production-viable; plan migration to active-active over time.
Dev / Test only
→ disk mode
Acceptable only for functional testing, training, and familiarization. Never production.

Design Attributes

The SDG evaluates decisions against four non-functional attributes. Each recommendation in the guide explicitly notes its impact:

Attribute | Description
Availability | Minimizes the impact of subsystem failures on uptime (e.g., multiple load-balanced app instances)
Operational Complexity | Occasionally introduces upfront complexity to reduce ongoing operational burden (e.g., Packer-based immutable images)
Scalability | Avoids choices that introduce overhead at scale (e.g., automated onboarding vs. manual UI processes)
Security | Notes how decisions change security posture (e.g., workload identity/OIDC vs. long-lived credentials)

Personnel Roles

🧑‍💼
Project Leader
Coordination, resource facilitation, duty assignment

Coordinates events, facilitates resources, and assigns duties to the Cloud Administration Team. Responsible for project-level planning including timeline and access acquisition.

☁️
Cloud Administration Team
Carries out functional installation tasks

Assumed knowledge

  • Cloud architecture and administration
  • Administration-level experience with Linux
  • Practical knowledge of Docker
  • Practical knowledge of Terraform
🔒
Security Operations Team
Strongly recommended — integrates formal security controls

Focuses on integrating formal security controls required for services hosted in the chosen cloud environment. Critical for regulated industries.

🖥️
Production Services Team
Takes over once the deployment goes live

Designated to own the TFE service post-deployment. Handover planning and documentation should occur before go-live.

Access Requirements

The installation team requires direct (including administrator) access to the following before starting:

Resource Type | Examples
Compute & Storage Instances | VMs, storage volumes, EBS/managed disks
Network Objects | Firewall rules, load balancers, security groups
TLS Certificate Material | Certificate + private key matching the TFE hostname (SAN, not CN only). PEM encoded. Signed by a public or private CA; self-signed certificates are not recommended.
Identity & IAM | AWS IAM, GCP Cloud Identity, Azure Active Directory
Secrets Management | AWS Secrets Manager / KMS, GCP Secret Manager / Cloud KMS, Azure Key Vault, VMware vSphere Native Key Provider
TFE License File | Must be obtained from the HashiCorp account team. Save as terraform.hclic. Single line, no trailing newline character. Treat as a company asset.
DNS Record | A DNS record must exist matching the SAN in the TLS certificate
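The license-file constraint above (single line, no trailing newline) is a common installation tripwire. A minimal pre-flight check, sketched under the assumption that the license was saved as terraform.hclic (the function names are illustrative, not from the guide):

```python
from pathlib import Path


def license_content_ok(data: bytes) -> bool:
    """True when the license payload is a non-empty single line.

    TFE expects the .hclic contents on one line; a trailing newline
    (often added silently by editors) can break the installation.
    """
    return len(data) > 0 and b"\n" not in data and b"\r" not in data


def check_license_file(path: str = "terraform.hclic") -> bool:
    """Read the license file as raw bytes and validate it."""
    return license_content_ok(Path(path).read_bytes())
```

Running a check like this before the deployment avoids a late failure; if it reports a problem, strip the trailing newline and retry.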
Architectural Summary
TFE deploys on a VM (AWS/Azure/GCP) regardless of topology. For active-active, scale additional VMs across different AZs in the same region. Use managed object storage, PostgreSQL, and (for active-active) managed Redis — specifically not Redis Cluster. Use a layer 4 load balancer for ingress. TFE is a single-region application architecture.

Network Egress Requirements

TFE should not be exposed to the public internet for ingress. Users must be on the company network. TFE does need outbound access to:

  • registry.terraform.io — public module registry (official providers index here; restrict via Sentinel/OPA for community content)
  • releases.hashicorp.com — Terraform binary releases (stay within two minor releases of latest)
  • reporting.hashicorp.services — license usage aggregation (strongly recommend allow-listing)
  • Algolia — used by the Terraform Registry for indexing
  • VCS/SAML endpoints, public cloud cost estimation APIs as applicable
Air-Gapped Environments
If outbound access is not available, TFE can run fully air-gapped. Manually download and host provider and Terraform binary versions in the TFE registry as releases occur.

Resource Sizing

Component | AWS | Azure | GCP
Disk | EBS gp3 (3000 IOPS) | Premium SSD (5000 IOPS) | Balanced Persistent SSD (10000 IOPS)
Machine (default) | m7i.2xlarge (8 vCPU, 30 GB) | Standard_D8s_v4 (8 vCPU, 30 GB) | n2-standard-8 (8 vCPU, 30 GB)
Machine (scaled) | m7i.4xlarge (16 vCPU, 61 GB) | Standard_D16s_v4 (16 vCPU, 61 GB) | n2-standard-16 (16 vCPU, 61 GB)
Database | db.r6i.xlarge | GP_Standard_D4ds_v4 | db-custom-4-16384
Cache (Redis) | cache.m5.large | Premium P1 | STANDARD_HA

CPU Sizing Rules (All Providers)

  • Avoid burstable CPU instances (AWS T-type, Azure B-type, GCP e2-/f1-/g1-series)
  • Choose latest generation general-purpose x86-64 instances
  • Use CPU/RAM ratio of 1:4 or higher
  • Do not use memory-optimized instances

Concurrency & RAM Calculations

The TFE_CAPACITY_CONCURRENCY variable controls concurrent workspace runs. Default RAM per agent is 2048 MiB. Formula:

RAM Sizing Formula
Max agent RAM = TFE_CAPACITY_CONCURRENCY × 2 GB
Add 10% overhead for OS and TFE application (~4 GB). For 30 concurrent agents: 30 × 2 GB = 60 GB + overhead = ~66 GB total.

Default machine (8 vCPU, 30 GB): maximum concurrency of 11. Scaled machine (16 vCPU, 61 GB): maximum concurrency of 26.
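The sizing rule above reduces to a one-line calculation. A sketch under the stated assumptions (2 GB per agent, ~10% overhead; the function name is illustrative):

```python
def required_ram_gb(concurrency: int, gb_per_agent: float = 2.0,
                    overhead_factor: float = 0.10) -> float:
    """Host RAM needed for a given TFE_CAPACITY_CONCURRENCY setting.

    Each concurrent run reserves gb_per_agent (2 GB by default), plus
    a ~10% allowance for the OS and the TFE application itself.
    """
    return concurrency * gb_per_agent * (1 + overhead_factor)
```

For the worked example above, `required_ram_gb(30)` comes out to roughly 66 GB.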

HVD Module Deployment Process (High-Level)

  1. Import TFE modules into your VCS repository
  2. Configure remote state storage (S3/Blob/GCS or HCP Terraform free tier)
  3. Select a machine with Terraform CLI available and cloud credentials instantiated
  4. Read the module GitHub README in its entirety before starting
  5. Prepare TLS certificate and private key (SAN must match FQDN; no self-signed certificates)
  6. Run terraform init, plan, apply
  7. Tail installation logs post-deployment; watch for errors
  8. Retrieve the Initial Admin Creation Token (IACT) within 60 minutes of deployment
State File Security
TFE deployment state contains sensitive information. Do not store in VCS or any unprotected location. This is the only state file that requires separate protection — TFE protects all other state your org generates.
K8s Architectural Summary
Kubernetes deployments require active-active mode — a Redis instance is always separate. Use managed PostgreSQL, object storage, and Redis (not Redis Cluster). Layer 4 load balancer for ingress. Three AZs (one pod per AZ) provides n-2 failure profile. Single-region architecture.

General Guidance

  • Separate TFE pods and HCP Terraform agent worker pods — agent workload is inconsistent under load
  • Use HCP Terraform Operator instead of the internal Kubernetes driver run pipeline for customers going beyond default concurrency per TFE pod
  • Three TFE pods is sufficient for HA — HCP Terraform agent cluster capacity has the greatest impact on run success at scale
  • TFE supports x86-64 on all versions; ARM requires v1.0.0 or later
  • Do not use burstable instance types (AWS T-type, Azure B-type, GCP e2-/f1-/g1-series)

K8s Resource Sizing

Component | EKS | AKS | GKE
Disk | EBS gp3 | Premium SSD Managed Disks | Persistent SSD Disks
Machine (3-node cluster) | m7i.2xlarge (8 vCPU, 32 GB) | Standard_D8s_v5 (8 vCPU, 32 GB) | n2-standard-8 (8 vCPU, 32 GB)
Machine (5-node cluster) | m7i.xlarge (4 vCPU, 16 GB) | Standard_D4s_v5 (4 vCPU, 16 GB) | n2-standard-4 (4 vCPU, 16 GB)

Approximate minimum cluster sizing for HCP Terraform agents with 3 TFE pods at system defaults:

  • 3-node cluster: 96 GB total memory, 64 GB (n-1)
  • 5-node cluster: 80 GB total memory, 64 GB (n-1)
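The n-1 figures above are simply the memory remaining after losing one worker node. Sketched with the node counts and per-node memory from the sizing table (the helper name is illustrative):

```python
def cluster_memory_gb(nodes: int, gb_per_node: int) -> tuple[int, int]:
    """Return (total cluster memory, memory surviving one node failure)."""
    return nodes * gb_per_node, (nodes - 1) * gb_per_node
```

Both recommended shapes leave the same 64 GB floor after a single node loss: `cluster_memory_gb(3, 32)` gives (96, 64) and `cluster_memory_gb(5, 16)` gives (80, 64).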

Agent Count Formula

Agent Capacity Calculation
Number of agents = TFE_CAPACITY_CONCURRENCY × number of TFE pods
Default: TFE_CAPACITY_CONCURRENCY = 10. With 3 TFE pods: expect capacity of 30 concurrent agents. RAM per agent = 2 GB by default, configured via agentWorkerPodTemplate in Helm overrides.
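The capacity formula above in code form (defaults as stated in the text; the function name is illustrative):

```python
def agent_capacity(tfe_pods: int, capacity_concurrency: int = 10) -> int:
    """Expected concurrent agent capacity across all TFE pods.

    TFE_CAPACITY_CONCURRENCY defaults to 10 per pod, so three pods
    yield a capacity of 30 concurrent agents.
    """
    return tfe_pods * capacity_concurrency
```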

Internal Run Pipeline vs. HCP Terraform Operator

Is the customer going beyond default concurrency per TFE pod?
✓ Yes / At Scale
→ HCP Terraform Operator
Prevents thundering herd issues. Establishes minimum agent replicas and spreads demand more efficiently.
Smaller Deployment
→ Internal run pipeline may suffice
Simpler to start. All agents come online at once under load — acceptable at small scale.

Network Considerations

  • Specify a version tag for HCP Terraform agent image (e.g., tfc-agent:<tag>) — using :latest pulls the image on every run, adding unnecessary network load
  • HVD Modules deploy layer 4 load balancers (highest throughput available)
  • Load HCP Terraform agent Docker image from a region-local source (ECR) rather than public internet when possible
  • Do not use instances with burstable network characteristics

Common Troubleshooting

Error | Likely Cause | Resolution
ImagePullBackOff | Cluster cannot pull TFE container from HashiCorp registry | Check permissions, image version in locals_helm_overrides.tf, and that license file has no newline (run cat tfe.hclic | base64)
CrashLoopBackOff | TFE container failing to start | Open two terminals — one to tail terraform-enterprise.log, one to run helm install. Capture startup error for support ticket.
Active-Active is Still the Target
Even for private cloud, HashiCorp recommends the active-active deployment topology for HA, resilience, and scale. Automated deployment is a hard requirement for active-active — engineering automation must be a headline project planning item.

Private Cloud Component Considerations

🗄️
Redis
Hard requirement for active-active — must be deployed on-prem

Multiple concurrent TFE nodes require an external Redis instance; this is a hard requirement for active-active. If Redis is not feasible on premises, HashiCorp recommends considering a public cloud deployment or HCP Terraform (SaaS). Redis Sentinel is not supported.

The only alternative is external operational mode with a single TFE container. This is acceptable if the RTO allows, but plan for Redis adoption in due course, as business HA requirements typically increase over time.

🐘
PostgreSQL
Specific version and schema requirements

TFE depends specifically on PostgreSQL. Private cloud deployments need an organizational pattern for deploying and operating PostgreSQL in production at the supported versions. Recommended: version 15.x or later (some 14.x releases are unsupported). Version 17.x is supported through the end of 2029.

Liaise with your DBA team early — there are specific schema requirements.

Sizing

  • CPU: 4 core
  • Memory: 32 GB RAM
  • Disk: 2 TB
📦
S3-Compatible Object Storage
Required — third-party tech to present S3-compatible API

TFE requires S3-compatible storage. In a private datacenter, this requires a third-party technology. HashiCorp sees significant success with Dell ECS and MinIO. If the organization already has an S3-compatible pattern, use that.

Compute Sizing (Private Cloud)

Recommended operating systems: Red Hat Enterprise Linux or Ubuntu LTS.

Component | Recommended Spec
TFE Compute | 4 vCPU, 32 GB RAM, 1 TB disk (many scaled customers use 8 vCPU / 32 GB as initial production spec)
Disk (Docker) | Min 40 GB available to /var/lib/docker; recommend 3000 IOPS minimum
PostgreSQL | 4 core, 32 GB RAM, 2 TB disk
Redis Cache | 4 core, 16 GB RAM, 500 GB disk. Redis 6.2+ or 7; recommend 7.

Monitoring PostgreSQL and Redis

Monitor CPU, memory, available disk space, and disk IO using organizational telemetry. Create alerts at 50% and 70% utilization thresholds. If any parameter consistently exceeds 70%, increase the resource.

Network Requirements

TFE has specific ingress and egress requirements — refer to the TFE network requirements page for the latest. If a corporate proxy is filtering outbound traffic, add required destinations to the allow-list. Use a layer 4 load balancer. Air-gapped mode is available if external access is not possible.

Operating System

  • Use OS configurations compliant with the CIS benchmark for the chosen operating system
  • Limit CLI access to machines to a shortlist of well-known staff
  • Ensure the organization's SIEM/audit log reflects all access

Application

  • Use single sign-on (SSO) with multi-factor authentication (MFA) for all users
  • TCP port restrictions for ingress/egress are configured by the deployment. Do not alter unless advised by HashiCorp support, a solutions engineer, or certified partner.
  • Enable the Strict-Transport-Security response header
  • For manual installs: set restrict_worker_metadata_access as a Docker environment variable to prevent Terraform operations from accessing the cloud instance metadata service
  • HVD Module automated deployments restrict access to the AWS metadata service — do not re-enable this
  • After deployment, do not create the initial administrator immediately — coordinate a handoff to the operations team
Critical Architectural Constraint
TFE is a single-region application even in active-active mode. PostgreSQL does not operate active-active cross-region. Multi-region support is for DR/business continuity only — run two separate TFE instances in two regions. Do not promote the secondary DB replica to read-write while the primary is still online — this causes data divergence that cannot be automatically reconciled.

Primary Planning Considerations

💰
Cost vs. Risk

The benefit of duplicating system tiers in a secondary region outweighs cost given TFE's mission-critical role. Calculate TCO of two TFE instances (one per region) including geo-redundant data layer costs. Also calculate the cost to the business if developers cannot deploy applications — this is often the more compelling number. Include any committed spend under enterprise discount programs.

🤖
Automation

Use automated means to deploy all infrastructure in both regions. Use HVD Modules to mirror resources across regions — deploy and maintain state for each region separately.

Do Not Automate Failover Triggering
Automated health checking is recommended. Pairing automated failover detection with automated failover execution is not: a transient network outage can trigger an unnecessary failover, with data replication lag consequences. Senior management or dedicated business continuity staff should declare an outage.
🧪
Testing
  • Test region failover capability on a regular cadence — at least twice annually
  • Document both failover and failback processes step-by-step in run books
  • Have team members who did not write the document use it — validates clarity and trains staff
  • Deploy an engineering pair of TFE instances (one per region) mirroring production; use for meaningful failover tests
  • Maintain independent instances: each environment must have its own DNS, storage, and supporting services
  • Perform fault injection testing using cloud provider features or third-party tooling

Component-Specific Guidance

🖥️
Compute
  • Keep VM/container images version-controlled and available in the failover region. Use Packer as the standard for machine image creation.
  • Do not run TFE containers in the secondary region while primary is online — risk of premature DB read replica promotion causing corruption
  • Keep compute cluster infrastructure deployed but scaled down until failover
  • Keep TFE containers at the same version in both regions; upgrade during the same change window
  • Co-locate primary and secondary compute layers in the same regions as their respective storage and database components
🪣
Object Storage — by Cloud
Cloud | Recommendation | RPO Consideration
AWS | S3 Cross-Region Replication (CRR), live replication, bidirectional during failover, S3 Replication Time Control for monitoring | 99.99% of objects replicate within 15 min. Check for missing objects in the run book before failing over.
Azure | Geo-zone-redundant storage (GZRS) — 16 nines durability. Use Standard general-purpose v2 storage accounts. Deploy only in paired regions with AZ support. | Azure Storage Geo Priority Replication guarantees 99% of blobs replicate within 15 min.
GCP | Dual-region GCS buckets with Turbo Replication | Turbo Replication guarantees 100% of objects replicate within 15 min. Premium feature — recommended for mission-critical TFE.
VMware | Deploy identical object store in each region; use strategic inter-DC connections for migration. Most customers solve this with vSAN. | Work with the VMware team to understand the replication SLA between data centers.
🐘
PostgreSQL — by Cloud
Cloud | Recommendation
AWS | Use Aurora with cross-region read replicas via the aws_rds_global_cluster resource. Aurora global databases replicate in ~1 second. Monitor the AuroraReplicaLag and AuroraGlobalDBReplicationLag metrics.
Azure | Use Azure Database for PostgreSQL read replicas. Set geo_redundant_backup_enabled = true. Monitor replication lag; Azure automatically sets the database to read-only when free storage falls below 5%, which adversely affects TFE.
GCP | Use Cloud SQL with cross-region read replicas. Enable point-in-time recovery (PITR).
Split-Brain Prevention
Be irrefutably certain the primary database replica is offline and stays offline before promoting the secondary read replica to read-write. Automated failover execution is not recommended for this reason.
🔴
Redis

Redis does not require cross-region replication. Deploy Redis in both regions and ensure it is ready in the failover region before starting TFE. HVD Modules handle this when used iteratively for both regions.

The Platform Team

The platform team serves as the central hub, orchestrating functions and ensuring streamlined operations across teams. It may consist of one or more teams with separate areas of responsibility.

🏛️
Cloud Center of Excellence (CCoE)
Strategic component of the platform team

Drives cloud adoption and aligns cloud strategies with business objectives. Establishes governance frameworks, fosters knowledge sharing, and optimizes cloud resources to maximize value and ensure compliance.

🔧
Automation / Tools
Tactical component — manages tooling, implements golden workflows

Manages essential tools and automation to support efficient system operations. Implements golden workflows and reusable modules. Collaborates with stakeholders to ensure services meet consumer needs. May or may not be a separate sub-team.

📐
IaC / PaC
Sets intent for infrastructure-as-code and policy-as-code

Collaborates with security to ensure adherence to standards and best practices. Promotes standardization, scalability, and compliance across projects.

🚀
Internal Developer Platform (IDP)
Self-service golden paths for development teams

Streamlines development cycles and reduces organizational complexity. Uses a product management approach, allowing development teams to consume golden paths in a self-service mode, enhancing productivity and efficiency.

Security Team Collaboration

The security team collaborates with the platform team to establish governance policies and deploy monitoring tools. Key cloud security functions include:

  • Subscribing to security updates and vulnerability alerts; collaborating with SRE on patches
  • Managing TLS certificate validity and renewal for TFE
  • Scanning TFE instances for vulnerabilities; integrating SIEM tools for audit logging
  • Providing governance inputs to CCoE (PaC guardrails, CIS benchmarks)

Producers and Consumers

Role | Responsibilities
Producers (Platform Team) | Provide seamless onboarding to the HCP Terraform platform. Manage the private registry. Oversee Policy-as-Code implementation. Offer enablement to consumers.
Consumers (Application Teams) | Initiate requests for platform access. Write Terraform code based on available private registry modules and platform team recommended practices.

Golden IaC Workflows

A golden workflow is a standardized, repeatable process for completing a specific task. The platform team shifts these workflows left using pre-approved Terraform configurations and curated modules, empowering teams to independently provision infrastructure while maintaining centralized management.

Workflow Type | Description
Producer — Module Development | Create Terraform modules and register them with the private registry
Producer — PaC Development | Develop Sentinel/OPA code and register policy sets in HCP Terraform
Consumer — Landing Zone | Provision cloud accounts/credentials and core TFE configuration elements from reusable modules; deploy VCS repos, projects, workspaces, and Stacks
Consumer — Developer | Create IaC to support the application components under the team's remit
What is the target user group's technical knowledge level and need for customization?
Low Tech / Limited Customization
→ Service Catalog Model
Standardized infrastructure patterns. Business users, project managers, any persona. Vending portal (UI, ServiceNow, etc.).
High Tech / High Customization
→ Infrastructure Franchise Model
Build anything within guardrails. Development teams, SRE. Git, API, CI/CD as primary UX.
🛍️
Service Catalog Model
Centralized vending portal of pre-certified infrastructure components

Key Features

  • Vending portal: Central platform where users view, request, and build standardized components. Platform team maintains; offers self-service experience with audit and chargeback policy support.
  • Pre-configured components: Range from full application templates to specific infrastructure aspects — guarantees deployment uniformity.
  • Accelerated consumption path: Business units choose from validated patterns for faster delivery, or opt for custom architecture (slower process).

Supporting TFE Features

Feature | Use
Private Registry | Host and share internal modules/providers; versioned and searchable
Private Providers & Modules | Restrict access to org members; cross-org sharing available in TFE
Run Tasks | Direct integration with third-party tools at specific run lifecycle stages
🏪
Infrastructure Franchise Model
Platform team sets rules; consumers build independently within guardrails

Key Components

  • Central control team (franchiser): Provides workflow and resources, keeps system running, sets rules, adds capabilities.
  • Consumption workflow: End users have a path to access provisioning resources and manage their own infrastructure.
  • Upfront governance: Governance at the outset ensures compliance while enabling provisioning — avoids separate compliance sign-off at go-live.
  • Controlled vending: Proactive controls ensure legal, regulatory, and enterprise standards are met.

Supporting TFE Features

Feature | Use
Terraform Workspaces | Persistent working directory per collection of infrastructure resources
Workspace Projects | Enables self-managed portions of TFE with the same policies as the root org
Sentinel Policies | Policy-as-code for fine-grained, logic-based policy decisions using external source data
⚙️
Post-Installation Tasks

After automated installation, the Initial Admin Creation Token (IACT) may be created as an optional final step. Two viable options for post-provision configuration:

  • API scripts
  • Terraform Enterprise provider (best practice — use this to derive state for the configuration)
🔑
SSO and Teams

Automate team creation alongside projects, Stacks, and workspaces using the TFE provider. Automate user addition within your IdP. Configure a team attribute name (default: MemberOf) in your IdP to automatically assign users to groups in SAML.

  • For service accounts used by pipelines: use IsServiceAccount: true in SAML
  • Create teams in TFE with the exact name of the group in the IdP
  • Do not create users manually — TFE creates them automatically at SAML assertion login
🔗
VCS Connection

Connect HCP Terraform/TFE to your VCS provider to enable workflows for managing modules, policy sets, and connecting VCS-backed Stacks and workspaces.

Which VCS provider is the customer using?
GitHub
→ GitHub App (preferred)
Not tied to a specific user. No personal access token. Safe as a static VCS connection — no token rotation needed. TFE supports a single site-wide GitHub App across all organizations.
Other VCS Providers
→ OAuth
Authorize using a service account with access to all IaC and PaC repositories. Rotate tokens periodically. Recommend one OAuth connection per VCS provider per TFE organization.
📊
Audit & Agents

Audit: For TFE, forward logs for monitoring and auditing. For HCP Terraform, use the Audit Trails API. Include audit, logging, and monitoring in the target architecture — do not wait until after go-live to implement observability.

HCP Terraform Agents: Define agent pools and assign Stacks/workspaces using the TFE provider. Multiple agents can run concurrently on a single instance (license limits apply to HCP Terraform but not TFE). For containerized agents, use single-execution mode for a clean working environment per run.

Agent Upgrade Strategy
Test new agent versions in a separate pool before rolling out to main pools. Use the platform team's dedicated org for this. Patches require little testing; major changes require more extensive tests. Use the rolling deployment approach illustrated in the HVD.

Hierarchy Overview

Level | Scope | Key Notes
Organization | Encompasses all components — teams, projects, workspaces, Stacks, policies, registry, VCS, variable sets, SSH keys | Centralize core provisioning in a single org. Naming: <customer-name>-prod and <customer-name>-test.
Project | Container for workspaces and Stacks. Inherits team permissions, variable sets, and policy sets. | Primary tool for delegating config/management in multi-tenant setups. Typically allocated to an application team.
Workspace / Stack | Manages a Terraform configuration and its associated state file | Workspace-level permission applies to that workspace only
Concurrency Defaults
Default concurrency is 10 (TFE spawns up to 10 worker containers simultaneously). Default RAM per agent container is 2048 MiB. For active-active, concurrency is per active node (max 5 nodes). If concurrency needs to scale without vertical VM scaling, use HCP Terraform agents with a dedicated agent pool.
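Because concurrency is per active node, active-active capacity scales linearly up to the five-node ceiling. A sketch under the defaults above (the function name is illustrative):

```python
def total_concurrency(active_nodes: int, per_node: int = 10) -> int:
    """Org-wide concurrent run capacity for an active-active deployment.

    Concurrency defaults to 10 per active node, and TFE supports at
    most five active nodes.
    """
    if not 1 <= active_nodes <= 5:
        raise ValueError("active-active supports 1 to 5 nodes")
    return active_nodes * per_node
```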

Access Management

HCP Terraform access is built on three components: User accounts, Teams, and Permissions. Implement SAML/SSO for user management combined with RBAC.

👑
Owners Team
Created automatically; cannot be removed

Comprehensive access to all org aspects. Certain tasks are reserved to owners only: creating/deleting teams, managing org-level permissions, viewing the full team list (including secret teams). Limit membership to a small group of trusted platform team members.

🔐
Key Organization-Level Permissions
Permission | Purpose
Manage Policies | Create, edit, read, list, delete Sentinel policies; access to read runs on all workspaces for enforcement
Manage Run Tasks | Create, edit, delete run tasks within the org
Manage Policy Overrides | Override soft-mandatory policy checks
Manage VCS Settings | Manage VCS providers and SSH keys
Manage Private Registry | Publish and delete providers/modules — owners only
Manage Membership | Invite/remove users; add/remove from teams. Cannot create teams or view secret teams.
Principle of Least Privilege
Grant org-wide permissions only to the most senior platform team members. Restrict development team access. Treat the organization token as a secret.
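Org-level permissions such as those above are typically granted to a dedicated team through the TFE provider rather than to individuals. A hedged sketch, with the team name and organization as placeholders:

```hcl
resource "tfe_team" "governance" {
  name         = "governance-team"
  organization = "example-org"

  # Grant only the organization-level permissions this team needs;
  # everything else stays false, per least privilege.
  organization_access {
    manage_policies         = true
    manage_run_tasks        = true
    manage_policy_overrides = false
    manage_vcs_settings     = false
    manage_membership       = false
  }
}
```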
📁
Project Design Principles

Projects are containers for workspaces and Stacks. Configuration elements attached at project level are inherited by all workspaces/Stacks within: team permissions, variable sets, policy sets.

Standard Project-Level Permissions

  • Admin: Full control over the project (including deleting it)
  • Maintain: Full control of everything in the project, except the project itself

Benefits of Project Delegation

  • Agility: Teams can create and manage infrastructure in their designated project without requesting org-admin access
  • Reduced Risk: Project permissions give admin access to a subset of workspaces/Stacks without cross-team interference
  • Self-Service: Projects integrate no-code provisioning — project admins can deploy no-code modules without org-wide workspace management privileges
Automated Onboarding
Automated onboarding is critical for project success at scale. Use the HCP Terraform / TFE provider or the TFE API to create and configure projects. Avoid manual UI workflows — they do not scale.
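A minimal sketch of automated project onboarding with the TFE provider, delegating project administration to an application team without org-wide access (all names are illustrative):

```hcl
# Project per application team; team permissions, variable sets, and
# policy sets attached here are inherited by every workspace within.
resource "tfe_project" "payments" {
  organization = "example-org"
  name         = "payments"
}

resource "tfe_team" "payments" {
  name         = "payments-team"
  organization = "example-org"
}

# Delegate full control of the project, but nothing beyond it.
resource "tfe_team_project_access" "payments_admin" {
  access     = "admin"
  team_id    = tfe_team.payments.id
  project_id = tfe_project.payments.id
}
```

Wrapping this pattern in a module lets the onboarding pipeline stamp out one project per team consistently.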
📝
Naming Conventions

Create and maintain a naming convention document covering teams, projects, workspaces, Stacks, and other entities. Pass this to all new internal clients to standardize operations.

Recommended org naming: <customer-name>-prod (production workloads) and <customer-name>-test (integration testing and PoCs).

Cloud Provisioning Key Steps

  1. Plan the project: installation/config of TFE, self-service capability design. Consider onboarding early adopters before general availability — use stepwise refinement.
  2. Consider platform team size and bandwidth — if any onboarding step is not automated, it compounds with scale.
  3. Plan a landing zone provisioning workflow.
  4. Ensure the platform team is fully trained on IaC with Terraform and the cloud providers in use. All contributors must adhere to the HashiCorp Terraform language style guide.
  5. Set up a TFE Stack or workspace with cloud credentials.
  6. Store Terraform code in your strategic VCS.
  7. Provision cloud resources from the VCS-backed Stack or workspace.
  8. Consider configuration required to enable end-to-end deployment within org security and compliance requirements. Identify manual steps and what would be needed to automate each.

Landing Zones

A cloud landing zone is a foundational, standardized environment for secure, scalable cloud operations. The platform team deploys a control workspace during onboarding of each internal customer.

🏗️
What Landing Zones Address
  • Networking: VPCs, subnets, connectivity settings
  • IAM: Policies, RBAC, permissions enforcement
  • Security & Compliance: Encryption, security groups, logging
  • Operations: Monitoring, logging, automation tools
  • Cost Management: Tagging policies, budget alerts, cost reporting

Major cloud providers: AWS Control Tower, Azure Landing Zone Accelerator, Cloud Foundation Toolkit (GCP).

📋
HCP Terraform Landing Zone Design

Core Requirements

  • Use the Terraform Enterprise provider for state representation
  • Define a VCS template with boilerplate Terraform code and a directory structure managed by the platform team
  • Create a Terraform module for the private registry that the platform team calls during automated onboarding. This module creates:
    • A control workspace for the application team
    • A VCS repository for the application team
    • Variables/sets as needed
    • Public and private cloud resources for the team
  • Hook the onboarding pipeline into other org platforms (credential generation, observability, etc.)
  • Dedicate a TFE project to house landing zone control workspaces separately from other platform team workspaces
Iteration is Expected
Expect to update the module and VCS template multiple times as you scale. If you onboard ten teams and then need to change the process, subsequent teams get the update but earlier teams may need retrospective updates. This is why stepwise onboarding with early adopters is critical.
ServiceNow Integration
If using ServiceNow as the front door to onboarding, liaise with the ServiceNow team as early as possible — there is always a lead time for remediation.
🔄
Landing Zone Workflow Summary
  1. Ticket raised: Audit trail created, approval acquired
  2. Landing zone child module call code added: Top-level workspace collects and manages onboarded teams. Automate the addition of each child module call — do not do this manually at scale.
  3. Run the top-level workspace: Creates the control workspace for the application team, VCS repo, variables, cloud resources
  4. Application team onboards and begins using their workspace
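In step 2 above, the top-level workspace accumulates one child module call per onboarded team. A sketch, assuming a platform-authored `landing-zone` module in the private registry (the source address, version, and inputs are hypothetical):

```hcl
# One call per onboarded application team, appended by the automated
# onboarding pipeline -- never added by hand at scale.
module "landing_zone_payments" {
  source  = "app.terraform.io/example-org/landing-zone/tfe"
  version = "~> 1.0"

  team_name     = "payments"
  vcs_repo_name = "payments-infrastructure"
  cost_center   = "cc-1234"
}
```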

Workflow Personas

| Persona | Role | Example Titles |
|---|---|---|
| Developer | Develops infrastructure and application code | Software engineer, application developer |
| Lead Developer | Guides the efforts of product developer teams | Development lead, technical team lead |
| Release Engineer | Coordinates deployment to production using automation | Release engineer, release manager |
| Platform Engineer | Writes pipeline definitions; enables developers to use pipelines | DevOps engineer, operations engineer |
| Infrastructure Operator | Maintenance, configuration, administration | Site reliability engineer (SRE), systems administrator |

VCS-Driven Workflow

📦
VCS-Driven Workflow Overview
Recommended for most teams — webhooks trigger runs automatically

A specific VCS repository backs each workspace or Stack. HCP Terraform uses webhooks to monitor commits, pull requests, and tags. Changes trigger plan runs; PRs trigger speculative plans.

Prerequisites

  • VCS repository containing source code for the deployment
  • VCS authentication enabling secure access for HCP Terraform
  • VCS permissions configured (read-only, merge, etc.)
  • Workspace or Stack naming conventions and permissions defined

High-Level Steps

  1. Configure VCS integration for your organization
  2. Connect workspace or Stack to the desired branch in your VCS repository
  3. Adopt a branching strategy (e.g., standard feature branching)
  4. Enable speculative plan runs for each branch
  5. Define PR process per organizational standard
  6. (Optional) Configure automatic run triggers based on git tag pushes
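Steps 1–2 above can be sketched with the TFE provider. Assuming a VCS OAuth connection already configured for the organization (the repository, branch, and names are placeholders):

```hcl
variable "vcs_oauth_token_id" {
  type      = string
  sensitive = true
}

# VCS-backed workspace: webhooks on this repository trigger plan runs,
# and pull requests trigger speculative plans.
resource "tfe_workspace" "payments_prod" {
  name         = "payments-prod"
  organization = "example-org"
  auto_apply   = false # production: require manual approval

  vcs_repo {
    identifier     = "example-org/payments-infrastructure"
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```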
🏷️
Run Trigger Options
Methods to specify which changes trigger runs
  • Branch change: Trigger when a specific branch changes — long-running or merged feature branches
  • Tag match (pattern): Trigger only for changes with a specific tag format. Supports semantic versioning, prefix, suffix, or custom regex.
| Tag Format | Regex Pattern |
|---|---|
| Semantic versioning | `^\d+\.\d+\.\d+$` |
| Version with prefix | `\d+\.\d+\.\d+$` |
| Version with suffix | `^\d+\.\d+\.\d+` |
Auto Apply
Applies changes from successful plans without prompting for approval

Auto apply is configurable per workspace. It is useful in non-interactive, non-production environments to run `terraform apply` automatically after a successful plan. It works regardless of how the plan was triggered (VCS, API, etc.).

Use Case
Enable auto apply in dev/staging environments for CI speed. Require manual approval in production for change control and auditability.

API-Driven Workflow

For customers with strategic CI/CD pipeline orchestrators, HCP Terraform and TFE form the infrastructure management component of those pipelines. In this model, the CI/CD tool drives the Terraform run via the TFE API rather than through VCS webhooks. This supports more complex orchestration patterns where infrastructure changes are part of a larger pipeline.

Use the Import Block
Terraform 1.5.0 introduced the import block. It allows declaration of resources to import in configuration, bulk import via for_each, a planning phase before import, and a sub-command to generate configuration for imported resources. This is the recommended approach.
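A minimal import block, with the resource address and instance ID as placeholders. The paired `resource` block can be written by hand or generated during planning:

```hcl
# Declare the existing object to bring under management.
import {
  to = aws_instance.web
  id = "i-0123456789abcdef0"
}

resource "aws_instance" "web" {
  # These attributes can be generated from the real object with:
  #   terraform plan -generate-config-out=generated.tf
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.micro"
}
```

Running `terraform plan` first shows the import as a planned action, so it can be reviewed before any state is written.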

What to Import vs. What Not to Import

✓ Import These
Resources created by other methods
IaC tools, other tooling, or manual click-ops. Resources that cannot be rebuilt with downtime using snapshot, imaging, or backup methods.
✗ Do Not Import
Dynamic, temporary, or unsupported resources
Dynamic infrastructure with changes outside Terraform. Temporary or experimental resources. Resources unsupported by providers (would require null_resource). Resources managed by teams not using Terraform. Resources that can be rebuilt with downtime.

Team Responsibilities

| Team | Responsibilities |
|---|---|
| Platform Team | Set up/maintain workspaces, projects, orgs, policy sets, cloud auth. Develop import best practices and module usage guidance. Provide training to application teams. Convert repeated configs into modules. |
| Application Team | Identify which resources to import. Work with the platform team on an accurate transition. Provide resource attribute information. Run day-to-day plans/applies through standardized module usage. |

Phased Approach

1️⃣
Phase 1 — Pilot and Expand
  1. Application team gets guidance from platform team on which resources to manage
  2. Start with a small pilot set; complete it with documentation and review
  3. Gradually expand resources from the same team; then move to other teams
  4. Continuously review and refine the process; ensure application teams maintain Day 2 responsibilities
  5. Platform team enables drift detection features to catch out-of-band changes
2️⃣
Phase 2 — Modularize
  1. After Phase 1, identify common configuration patterns
  2. Place common resources into modules to scale Terraform maturity and consumption
  3. Add granularity between modules accounting for permissions and security (you cannot partially instantiate a module)
  4. Use projects, variable sets, and Sentinel policy sets to organize the new structure

Feature Availability by Product

| Feature | HCP Terraform | Terraform Enterprise |
|---|---|---|
| Operational Logs | No (HashiCorp SRE manages) | Yes (your SRE team manages) |
| Audit Trail | Yes | Yes |
| Metrics | No (HashiCorp SRE manages) | Yes |
| HCP Terraform Agent Logs | Yes | Yes |

Observability Feature Definitions

  • Operational logs: Track performance and behavior — error messages, warnings, events. Used by SRE to monitor and maintain service.
  • Audit trail: Security-focused — login attempts, access control changes, suspicious activities. Used by security analysts and incident response teams.
  • Metrics: Application component performance and usage data — detect service quality issues or inform capacity planning.

TFE Monitoring Focus Areas

📊
Prometheus + Grafana

Configure Prometheus to gather metrics from TFE and its underlying components. TFE generates operational metrics Prometheus can collect: CPU, memory usage, request latency, and more.

Use Grafana to visualize metrics. The official Terraform Enterprise Grafana dashboard (ID: 15630) provides real-time insights into resource utilization trends and request rates.

📝
Agent Logs

If using HCP Terraform agents, include agent logs in your log collection and analysis. This ensures a complete picture of all activities. Analyze agent logs alongside TFE logs to better optimize and improve deployment.

Working with HashiCorp Support

  • Each team member should have an account created through the HashiCorp support portal
  • Configure each team member permitted to open support tickets as an Authorized Technical Contact. Provide the list to your assigned HashiCorp Solutions Engineer for configuration.
  • Familiarize the team with documentation on how to open a support ticket and generate/upload a support bundle when applicable
  • Be aware of your support plan level to manage response time expectations
  • Understand the severity level definitions when opening a ticket — see the Customer Success Enterprise Support page

Health Assessment Components

🔍
Drift Detection
Identifies when actual state deviates from Terraform configuration

Drift detection identifies configuration drift: when changes are made directly to infrastructure outside Terraform's managed processes, creating discrepancies between live state and the code-defined desired state.

Drift vs. State Drift
Configuration drift is a change in the configuration of an app or infrastructure. State drift (external changes that don't invalidate your config) is different — drift detection does not detect state drift.

Limitations

  • Unmanaged attributes: Drift detection only covers attributes explicitly managed by Terraform. Manually modified settings outside Terraform's control won't be flagged.
  • External additions: Resources added entirely outside Terraform (e.g., manually created IAM users) are not detected.

Prescriptive Guidance

  • Enable health assessments for all workspaces — set globally in Settings → Health
  • Enable workspace notifications so admins are alerted via Slack or email when drift is detected
  • TFE admins can modify assessment frequency and maximum concurrent assessments from admin settings console
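The workspace-level setting above can also be managed as code. A sketch with the TFE provider, assuming the `assessments_enabled` argument (names are illustrative):

```hcl
# Opt the workspace into health assessments (drift detection and
# continuous validation) at creation time.
resource "tfe_workspace" "payments_prod" {
  name                = "payments-prod"
  organization        = "example-org"
  assessments_enabled = true
}
```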
Continuous Validation
Covered in detail in the Scaling Guide — see that section

Continuous validation enforces standards over time — set rules for security, cost, or other requirements, and Continuous Validation checks they're always being met. Covered in the Scaling Operating Guide.

Drift Resolution Workflow

Once drift is detected, the workspace notifies the application team. They decide the best resolution:

Is the drift an undesired change that should be reverted?
✓ Undesired Drift
→ Overwrite Drift
Initiate a new Terraform plan and apply to revert resources to their configuration-defined state.
Intentional Change
→ Update Terraform Config
Modify Terraform configuration to include the changes and push a new configuration version. Prevents TFE from reverting the change during the next apply.

Refresh State vs. Update Configuration

| Operation | What it Does | When to Use |
|---|---|---|
| Refresh State | Updates Terraform's internal state file to match actual infrastructure — does NOT modify infrastructure | To ensure Terraform's understanding is accurate for identifying drift |
| Update Terraform Configuration | Submits updated config files and executes plan/apply — DOES modify infrastructure | When incorporating intentional drift changes or correcting configuration to match desired state |

Registry Roles

| Role | Responsibilities | Personas |
|---|---|---|
| Registry Administrator | Publish and delete modules/providers from the private registry (public or private sources). Organization-level permission — assign to a specific team. | Platform team members, CI/CD pipelines automating the publish process |
| Producer | Create and maintain modules/providers. Publish new releases. Needs commit access to the VCS repo plus Registry Administrator permissions. | Platform team responsible for custom modules, service owner teams |
| Consumer | Find and use providers/modules necessary to provision infrastructure. Needs commit access to the VCS repo and write access to TFE workspaces. | Application team members, platform team using the TFE provider |
Delegate Publishing via CI/CD
HashiCorp recommends delegating module release publishing to a CI/CD pipeline with a team API token that has Registry Administration permission. Rotate or manage this token with HashiCorp Vault and the Terraform Cloud secrets engine.

Module Requirements

📝
Requirements Before Publishing
  • VCS Repository: Module code must be hosted in a supported VCS repository
  • Semantic Versioning: Must use semantic versioning scheme for version constraints to work
  • Naming Convention: Must follow terraform-<PROVIDER>-<name> format (e.g., terraform-aws-ec2-instance)
  • Standard Module Structure: Enables the registry to generate documentation, track resource usage, parse examples, run tests
  • VCS Provider Configured: Must have a VCS provider configured with administrative access to the module repositories
🚀
Publishing Options
| Method | Best For | Notes |
|---|---|---|
| Tag-based | Modules associated with release tags in VCS | Registry auto-detects and publishes new versions based on tags. Consider implementing tag protection rules in VCS. |
| Branch-based | Enhanced flexibility; required for integrated testing | Allows selection of a specific branch with an assigned version number |
| API / CI/CD Pipeline | Automation and scale | Recommended for delegating publishing to CI/CD |
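Tag-based publishing can be sketched with the TFE provider: once the module repository is registered, each semantic-version tag becomes a new registry version. Repository names and the OAuth token reference are placeholders:

```hcl
variable "vcs_oauth_token_id" {
  type      = string
  sensitive = true
}

# Register a module repository with the private registry; the registry
# watches tags and publishes each new version automatically.
resource "tfe_registry_module" "ec2_instance" {
  organization = "example-org"

  vcs_repo {
    display_identifier = "example-org/terraform-aws-ec2-instance"
    identifier         = "example-org/terraform-aws-ec2-instance"
    oauth_token_id     = var.vcs_oauth_token_id
  }
}
```

Note the repository name follows the required `terraform-<PROVIDER>-<name>` convention.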

Additional Benefits of Custom Modules

Beyond reusability, custom modules enable organizations to encode:

  • Naming conventions
  • Security controls
  • FinOps standards (tagging, cost allocation)
Sentinel vs. OPA
HashiCorp supports both Sentinel and Open Policy Agent (OPA). Recommend Sentinel for its maturity and performance characteristics. Recommend OPA when the organization has standardized on OPA and has existing policies. Hybrid use of both is also encouraged.

Key Benefits of Sentinel

🛡️
Risk Mitigation & Governance
  • Risk mitigation: Actively lowers chances of errors and vulnerabilities by enforcing rules during planning and execution
  • Regulatory governance: Ensures every action aligns with org policies, regulatory guidelines, and security standards at scale; simplifies auditing
  • Separation of concerns: Policies managed by platform/compliance/security teams, separate from the application teams deploying infrastructure; workspace owners cannot opt out without explicit permissions
  • Sandboxing: Policies act as core guardrails — reduces need for manual verification
💡
Codification & Automation Benefits
  • Codification: Makes governance clearer, more efficient, consistent, and operationally reproducible. Eliminates reliance on oral traditions. Fosters transparency.
  • Version control: History tracking, diffs, PRs — demonstrable and auditable policy evolution
  • Testing: Sentinel's built-in testing framework enables automated CI testing — reduces TCO for governance
  • Automation: Policy deployment is far faster than manual work and requires fewer humans; ensures consistency at scale

Policy Enforcement Workflow

  1. Define governance/compliance policies and translate them into Sentinel policy requirements
  2. Code policies using the Sentinel language; arrange in policy sets
  3. Scope policy sets to the entire organization, or to one or more projects/workspaces
  4. Policy enforcement levels: advisory (warns but doesn't block), soft-mandatory (can be overridden by authorized users), hard-mandatory (cannot be overridden)
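Steps 2–3 above can be sketched with the TFE provider: a policy set sourced from VCS and scoped to the organization or to specific projects/workspaces. Names and paths are illustrative; per-policy enforcement levels (advisory, soft-mandatory, hard-mandatory) are declared in the policy set's `sentinel.hcl`:

```hcl
variable "vcs_oauth_token_id" {
  type      = string
  sensitive = true
}

# Sentinel policy set sourced from a VCS repository.
resource "tfe_policy_set" "baseline" {
  name          = "baseline-guardrails"
  organization  = "example-org"
  kind          = "sentinel"
  policies_path = "policies/prod"

  vcs_repo {
    identifier     = "example-org/sentinel-policies"
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```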

People and Responsibilities

  • Sentinel policies owned by the Security team with input from regulatory compliance areas
  • CCoE/Platform Team understands Sentinel code and how to employ it in production pipelines
  • CCoE/Platform Team partners with Security Team to manage ownership and RBAC over policy code

Repository Organization Best Practices

📁
Repository Structure
  • policies/: Main Sentinel policies, organized by environment. Each environment subdirectory contains policies and a test subdirectory (test files, mock data, Terraform config used to generate mock data).
  • modules/: Reusable policy modules. Common TF import functions stored here — flat structure with illustrative filenames.
  • docs/: Always fully document policies and their tests.
✍️
Writing Policies — Getting Started
  • Acquire IT security policies relevant to infrastructure deployment from your security team — translate into Sentinel policy-as-code
  • If no policy list exists, agree on general controls internally and design them to be extended over time
  • Staff responsible for policy development must understand the Sentinel language — read official language documentation
  • Attend HashiCorp Sentinel Academy training (hands-on labs and real-world examples) — contact your HashiCorp Solutions Engineer or Customer Success Manager
  • Use Sentinel modules (0.15.0+) to specify reusable functions that reduce codebase length
  • Clone the terraform-sentinel-policies GitHub repo — provides prewritten policies for public/private cloud providers and reusable function modules

Common Starting Policies

  • Governance of maintenance windows (protecting from adverse change at wrong times)
  • Enforcement of metadata tagging of cloud resources
  • IaC style enforcement (e.g., Terraform module versions pinned, only from private registry)

How Run Tasks Work

Run tasks send an API payload to an external service at a specific run stage. The service processes the data, evaluates whether the run passes or fails, and sends a response back to HCP Terraform. Based on the enforcement level, HCP Terraform determines if the run can proceed.

Run Stages

| Stage | Available Data | Use Case |
|---|---|---|
| Pre-plan | Code and other attributes | Examine code to determine whether entering the plan stage should be allowed |
| Post-plan / Pre-apply | Plan results | Examine the plan and determine whether an apply should be allowed (most common stage) |
| Post-apply | Provisioned infrastructure data | Testing and gathering/storing information about provisioned infrastructure |
Multiple Run Tasks at a Stage
Multiple run tasks can be defined at a single stage. They execute in sequence. The continuation of the run is determined by the most restrictive enforcement level: if a mandatory task fails and an advisory task succeeds, the run fails. If advisory fails but mandatory succeeds, the run succeeds.

Common Use Cases

| Category | Example Tools |
|---|---|
| Security & Compliance | Palo Alto Networks Prisma Cloud, Zscaler, Snyk, Tenable, Sophos, Aqua Security, Firefly |
| Cost Control | Infracost, Vantage, Kion |
| Visibility | Pluralith (resource visualization) |
| Image Compliance | HCP Packer run task (verify approved golden images) |

Implementation Flow

  1. Select the desired run task from the public Terraform Registry; review requirements and documentation
  2. Establish and verify two-way connectivity between HCP Terraform platform and the run task endpoint (network/security modifications may be required)
  3. Create the run task in the Terraform Organization (connect endpoints, test communication path)
  4. Associate the run task with a workspace; configure the stage and enforcement level (Advisory or Mandatory)
  5. Run task executes as part of normal run cycles; review results in run completion output
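Steps 3–4 of the flow above can be sketched with the TFE provider: register the endpoint once at the organization level, then attach it to a workspace at a stage with an enforcement level. The endpoint URL, HMAC key, and workspace reference are placeholders:

```hcl
variable "run_task_hmac_key" {
  type      = string
  sensitive = true
}

variable "workspace_id" {
  type = string
}

# Register the external service once for the organization.
resource "tfe_organization_run_task" "scanner" {
  organization = "example-org"
  name         = "security-scan"
  url          = "https://scanner.example.com/hooks/tfe"
  enabled      = true
  hmac_key     = var.run_task_hmac_key
}

# Attach it to a workspace at the post-plan stage as mandatory.
resource "tfe_workspace_run_task" "scan_post_plan" {
  workspace_id      = var.workspace_id
  task_id           = tfe_organization_run_task.scanner.id
  stage             = "post_plan"
  enforcement_level = "mandatory"
}
```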
Run Tasks vs. Sentinel
Run tasks can replace some Sentinel policies where a third-party product is better suited to test specific intricacies. This can reduce development and testing time while providing more stringent and accurate security/compliance testing. Requires Terraform Core version 1.1.9 or later.

When to Use Self-hosted Agents

  1. Increase concurrent runs: for TFE, scale capacity beyond vertical VM scaling limits
  2. Access restricted networks: reach isolated APIs or private data center resources not reachable from HCP Terraform
  3. Custom tooling via hooks: add additional tools to the Terraform runtime via pre/post hooks
HCP Terraform Customers
For HCP Terraform (not TFE), only use self-hosted agents for Reasons 2 and 3 above. If neither applies, use the native built-in workers available in HCP Terraform.

TFE Scaling Strategies (Priority Order)

  1. Migrate to active-active operational mode and increase TFE node count
  2. Add more resources to VMs hosting TFE nodes; increase capacity config params (TFE_CAPACITY_CONCURRENCY, TFE_CAPACITY_CPU, TFE_CAPACITY_MEMORY)
  3. Deploy HCP Terraform agents — the next logical step once the above limits are reached

Agent Pool Design

🏊
Agent Pool Strategy

Design agent pools based on:

  • Product limits (HCP Terraform enforces concurrency/agent limits per tier; TFE has no agent concurrency limit)
  • Cloud provider — dedicated pools per provider simplify credential management
  • Environment — dedicated pools per environment (dev/staging/prod) prevent accidental cross-environment changes
  • Operations — separate pools for highly privileged operations; granular permissions and cleaner audit trails
Dedicated Pools per Secured Environment
HashiCorp recommends dedicating an agent pool for every secured environment (on-premises with strict controls, GCP VPC Service Controls, AWS/Azure equivalents). Reduces blast radius in a security incident.
🏷️
Naming Convention

Recommended pattern: {environment}-{cloud-provider}-agentpool

Examples: dev-aws-agentpool, staging-azure-agentpool, prod-gcp-agentpool

  • Use standardized abbreviations; use hyphens or underscores as delimiters; keep lowercase; avoid special characters and spaces
  • Document the naming convention in org wiki; communicate to all relevant stakeholders
📦
Deployment: VM vs. Kubernetes

Two primary deployment modes:

  • Virtual machines: Run the agent on VMs (e.g., EC2 instances). Use Packer + HCP Packer integration to build agent images. Use autoscaling groups/managed instance groups/scale sets. Use rolling upgrade features.
  • Kubernetes (recommended for K8s-skilled customers): Use the K8s Operator for autoscaling capabilities. See the K8s Operator section for details.
Custom Agent Image
Since adding custom pre/post hooks is a key benefit of self-hosted agents, automate image building via CI/CD. Monitor the HCP Terraform Agent Changelog and HashiCorp Releases API for new versions. Test new versions in a separate pool before rolling out.
📊
Scaling Metrics

Use the following metrics to drive VM-based agent scaling decisions:

  • tfc-agent.core.status.busy — number of agents in busy status at a point in time
  • tfc-agent.core.status.idle — number of agents in idle status at a point in time

API endpoints also provide information for automating scaling decisions — review the agent documentation for current endpoints.

Custom Resources Introduced

| CRD | Purpose |
|---|---|
| AgentPool | Manages HCP Terraform agent pools and agent tokens. Supports on-demand scaling operations for HCP Terraform agents. |
| Module | Facilitates API-driven run workflows; streamlines execution of Terraform configurations. |
| Project | Manages HCP Terraform projects — organized and efficient project handling. |
| Workspace | Manages HCP Terraform workspaces — structured environment for resource provisioning and state management. |

Use Case 1: Auto-scaling Agent Pools

The Operator manages agent pool lifecycle and deployment via the AgentPool CRD. It can monitor workspace queues to trigger autoscaling based on defined min and max replicas.

  • Increase agents up to autoscaling.maxReplicas or licensed limit (whichever is reached first)
  • Reduce agents to autoscaling.minReplicas within autoscaling.cooldownPeriodSeconds when no pending runs exist
HashiCorp Recommendations
Use a dedicated Kubernetes cluster or logical node separation for HCP Terraform agents. Use cluster autoscaling where available — particularly important with high variance between peak and off-peak concurrency. Set minReplicas based on baseline run concurrency for health checks (drift detection and continuous validation).

Sizing AgentPool Autoscaling

  • maxReplicas: Determined by peak-run concurrency demand and HCP Terraform tier constraints. Scale-test your cluster to ensure peak load is handled.
  • minReplicas: Consider baseline run concurrency from health checks (drift detection, continuous validation).
  • Set memory limits and resource requests on agents — helps efficient node placement; critical if using cluster scaling technologies like Karpenter.

Use Case 2: Self-Service Infrastructure via Kubernetes Native Consumption

The Operator lets application developers define infrastructure configuration using Kubernetes configuration files. It delegates the reconciliation phase to HCP Terraform. This frees developers from needing to learn HCL for infrastructure management tasks — useful when your application teams are Kubernetes-native and prefer K8s manifest-based workflows.

Security Considerations

Agent tokens stored in the Kubernetes cluster must be secured using your organization's K8s secrets management approach. Review the Operator documentation for specific security guidance. Egress requirements for the HCP Terraform agent apply when agents are deployed via the Operator — includes provider endpoint connectivity, Terraform registry access, and Terraform releases access.

Drift Detection vs. Continuous Validation
Drift detection flags out-of-band changes to managed infrastructure that can affect a Terraform apply. Continuous validation addresses use cases where more customizable detection rules are necessary — particularly for infrastructure managed outside the workspace or third-party service health checks.

Why Continuous Validation?

Failed infrastructure changes can introduce project delays and expose the organization to operational or security risks. Continuous validation gives advance notice of issues preventing successful changes, so they can be addressed before a Terraform apply fails in production.

Best Practice Recommendation

  • When a new workspace is created, enable continuous validation (explicitly at workspace level or implicitly at org level)
  • Include necessary logic in Terraform configuration to validate important components whose health may change over time
  • If infrastructure changes fail in the future due to an unchecked condition, update the Terraform configuration to incorporate the new validation — and apply this pattern to existing infrastructure code

Rule of Thumb — Which Resources to Validate

  • Check the status of any critical resource that can fail (e.g., VMs)
  • Check validity of resources with user-defined time frames whose failure impacts the application stack (e.g., TLS certificates)
  • Not necessary for inherently durable resources (e.g., S3 buckets — native to cloud provider)
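The TLS-certificate example above fits naturally in a `check` block (Terraform 1.5+), which continuous validation re-evaluates over time. A sketch assuming the `hashicorp/http` provider and a hypothetical health endpoint:

```hcl
terraform {
  required_providers {
    http = {
      source = "hashicorp/http"
    }
  }
}

# Re-evaluated by health assessments; a failed assertion raises a
# warning without blocking applies.
check "app_health" {
  data "http" "health" {
    url = "https://payments.example.com/health"
  }

  assert {
    condition     = data.http.health.status_code == 200
    error_message = "Health endpoint returned ${data.http.health.status_code}."
  }
}
```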

Implementation Requirements

| Language Feature | Minimum Terraform Version |
|---|---|
| Preconditions and postconditions | 1.2 and later |
| Check block | 1.5 and later |

Permissions required: Organization health settings require Owners team membership. Individual workspace settings require Workspace Admin access.

Notification Event Categories

🏥
Workspace Events (Health-Related)
All considered critical
| Event | Trigger | Priority |
|---|---|---|
| Check Failed | Continuous validation check returns unknown or failed | Critical |
| Drift Detected | Every time drift is detected on this workspace | Critical |
| Health Assessment Errored | Health assessment cannot complete successfully | Critical |
| Auto-destroy Reminder | Reminder sent 12 and 24 hours before an auto-destroy run | Critical |
| Auto-destroy Results | Results of an auto-destroy run | Critical |
Minimum Required
If notification volume is too high, at minimum enable: Check Failed, Health Assessment Errored, and Auto-destroy Reminder.
🏃
Run Events
| Event | Trigger | Priority |
|---|---|---|
| Created | Run created, enters Pending state | Low |
| Planning | Run acquires lock and starts executing | Low |
| Needs Attention | Human decision required — plan changed, not auto-applied, or policy override required | Critical |
| Applying | Plan confirmed or auto-applied | Low |
| Completed | Run completed successfully | Low |
| Errored | Run terminated early due to error or cancellation | Critical |

Implementation Guidance

Configure notifications via WebUI, the API (tfe_notification_configuration), or using the Terraform TFE provider. HashiCorp recommends using the TFE provider to configure notifications as part of the project/workspace creation process.
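A sketch of the recommended provider-driven approach, subscribing a workspace to critical events only (the webhook URL and workspace reference are placeholders):

```hcl
variable "workspace_id" {
  type = string
}

variable "slack_webhook_url" {
  type      = string
  sensitive = true
}

# Critical events only, to avoid alert fatigue.
resource "tfe_notification_configuration" "critical" {
  name             = "critical-events"
  workspace_id     = var.workspace_id
  destination_type = "slack"
  url              = var.slack_webhook_url
  enabled          = true

  triggers = [
    "run:needs_attention",
    "run:errored",
    "assessment:check_failure",
    "assessment:drifted",
  ]
}
```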

Notification Strategy

  • Choose appropriate destination: Slack is popular, but use whatever fits team workflow (email, Teams, etc.)
  • Granular notifications: Avoid broad notifications that cause alert fatigue — focus on critical events
  • Integration with incident management: Integrate with incident management tools so alerts lead to actionable items

Maintenance

  • Periodically review notification settings and adjust based on changing infrastructure needs and team feedback
  • Test when making changes — trigger events manually to verify notifications are received
  • Continuously monitor and solicit feedback to reduce noise and improve relevance

No-Code Provisioning

No-code modules are deployed into new TFE workspaces. This is a consideration for platform teams managing license consumption, since each no-code deployment creates a new workspace.

Roles and Responsibilities

Role | Responsibilities
Registry Administrator (Platform Team) | Design, build, and publish no-code modules to the private registry. Ensure modules are configured to allow no-code provisioning. Define and document required variable values.
Project Admin (Application Team) | Configure and deploy no-code modules within their project. Manage the resulting workspace lifecycle.

Permissions

  • Marking a module as no-code enabled requires the Manage Private Registry permission or Owners team membership
  • Deploying a no-code module requires Project Admin permission or higher
  • HCP Terraform/TFE uses the module's configured variable set or workspace variables for cloud credentials
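The credential flow in the last point can be sketched with a project-level variable set. All resource and variable names below are illustrative assumptions, as is the choice of a static environment credential:

```hcl
# Hypothetical sketch: a project-level variable set supplying cloud
# credentials. Workspaces in the project, including no-code ones,
# inherit these values. All names are illustrative.
resource "tfe_variable_set" "cloud_creds" {
  name         = "cloud-credentials"
  organization = var.organization  # assumed input variable
}

resource "tfe_variable" "access_key" {
  key             = "AWS_ACCESS_KEY_ID"  # example env credential
  value           = var.aws_access_key   # assumed input variable
  category        = "env"
  sensitive       = true
  variable_set_id = tfe_variable_set.cloud_creds.id
}

# Attach the set to the project so every workspace inherits it.
resource "tfe_project_variable_set" "attach" {
  project_id      = tfe_project.app.id  # assumed project resource
  variable_set_id = tfe_variable_set.cloud_creds.id
}
```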

Configuring at Scale

  • Use the TFE provider or API to automate no-code module configuration when setting up new projects
  • Define variable sets at the project level to provide the necessary cloud credentials — these are inherited by no-code workspaces
  • Document the no-code provisioning process for consumers so they understand what is available and how to use it
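Automating no-code enablement with the TFE provider might look like the sketch below; the registry module reference and variable options are illustrative, and the tfe_no_code_module arguments should be checked against your provider version:

```hcl
# Hypothetical sketch: mark an existing registry module as no-code
# enabled, constraining a required variable to approved values.
resource "tfe_no_code_module" "webapp" {
  organization    = var.organization               # assumed input variable
  registry_module = tfe_registry_module.webapp.id  # assumed registry module

  # Limit consumers to approved values for a required variable.
  variable_options {
    name    = "environment"
    type    = "string"
    options = ["dev", "staging"]
  }
}
```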

Ephemeral Workspaces

Use Cases

  • Spin up and tear down feature-branch infrastructure automatically
  • CI/CD environments that need fresh infrastructure per pipeline run
  • Time-boxed customer or internal demos and proof-of-concept deployments
  • Any scenario where infrastructure should not persist beyond a defined lifecycle

Roles and Responsibilities

Role | Responsibilities
Platform Team | Define standards for ephemeral workspace usage. Configure auto-destroy schedules and notification settings. Provide automation patterns for teams to create and destroy ephemeral workspaces.
Application Team | Create ephemeral workspaces following platform team standards. Manage the lifecycle within defined parameters. Monitor notifications for auto-destroy events.

Permissions

  • Creating ephemeral workspaces and configuring auto-destroy requires Workspace Admin permission or higher at the project level

Configuring at Scale

  • Use the TFE provider or API to automate ephemeral workspace creation as part of CI/CD pipelines
  • Set auto-destroy notifications (12h and 24h reminders are available) so teams are aware of impending destruction
  • Define standard auto-destroy schedules in project-level documentation — prevents unintended persistence of temporary resources
  • Integrate auto-destroy results notifications with your incident management or team communication channels
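A pipeline-created ephemeral workspace might be sketched as follows. Names are illustrative, and the auto_destroy_activity_duration argument assumes a recent TFE provider version; a fixed auto_destroy_at timestamp is an alternative:

```hcl
# Hypothetical sketch: a feature-branch workspace that TFE destroys
# automatically after 3 days without activity. Names are illustrative.
resource "tfe_workspace" "review" {
  name       = "review-${var.branch}"  # assumed input variable
  project_id = tfe_project.sandbox.id  # assumed project resource

  # Auto-destroy after inactivity; alternatively set auto_destroy_at
  # to a fixed RFC3339 timestamp for a hard deadline.
  auto_destroy_activity_duration = "3d"
}
```

Creating this resource from the CI/CD pipeline itself keeps the workspace lifecycle bound to the pipeline run, with the 12h and 24h reminders warning teams before destruction.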