The Maturity Model
Four Guides at a Glance
| Guide | Primary Audience | Core Question | Prerequisite |
|---|---|---|---|
| SDG | Platform engineers, infra architects, DevOps admins | How do we deploy TFE correctly and keep it running? | None |
| Adopt | Platform team standing up TFE as a shared service | How do we configure TFE and onboard teams onto it? | SDG completed |
| Standardize | Platform team maturing toward org-wide governance | How do we enforce guardrails and scale to the whole org? | Adopt + maturity review |
| Scale (Beta) | Platform team at scale, enabling self-service workflows | How do we extend TFE capabilities once the platform is mature? | Adopt + Standardize |
TFE Components
| Component | Role | Key Notes |
|---|---|---|
| TFE Application | Core application container provided by HashiCorp | Consult HVD Module code for default machine size per cloud/K8s env |
| HCP Terraform Agent | Optional isolated execution environments for runs | Highly recommended; TFE includes built-in agents if not deployed separately |
| PostgreSQL | Primary store for workspace settings, user data, application state | Review supported versions; required for all modes except disk |
| Redis Cache | Caching & coordination between core web/background workers | Required for active-active. Native Redis services from AWS/Azure/GCP validated. Redis Cluster not supported. Redis Sentinel not supported. |
| Object Storage | State files, plan files, config, output logs | All objects symmetrically encrypted (AES-128 CTR). S3-compatible, GCS, or Azure Blob |
Operational Modes
Disk Mode
All TFE services — including PostgreSQL and object storage — deploy onto a single node using local Docker disk volumes. No failover, no active/active; fully self-contained.
Characteristics
- Minimal resource requirements; no specialized expertise to deploy
- Rapid to stand up
- No failover capability — single AZ, single node
- Cannot scale without downtime
External Services Mode
PostgreSQL and object storage move to dedicated infrastructure. The cache remains internal (transient data only). The application runs on a single compute node. Each public cloud provides native PostgreSQL and object storage services that support this pattern.
Characteristics
- Stateless application front-end; distributed core components
- Improved resilience — eliminates single points of vulnerability for data
- Single compute node is still a single point of failure for the application layer
- Does not provide performance scalability
Active/Active Mode
Multiple stateless TFE instances run across at least three AZs, connected to external PostgreSQL, shared object storage, and external Redis. The SDG explicitly recommends this mode for production.
Why This Mode
- n-2 failure profile — survives failure of two AZs
- Eliminates potential service failure points
- Safeguards against revenue loss from unscheduled interruptions
- Ensures data integrity and data residency compliance
- Improves workload distribution and overall performance
Operational Complexity Trade-offs
- Automated TFE deployment process is mandatory
- Monitoring must account for multiple instances
- Custom automation required to manage application node lifecycle
- Note: Redis does not need to be external when running a single TFE node — HVD modules provision Redis automatically when active-active parameter is true
Operational Mode Decision
Design Attributes
The SDG evaluates decisions against four non-functional attributes. Each recommendation in the guide explicitly notes its impact:
| Attribute | Description |
|---|---|
| Availability | Minimizes the impact of subsystem failures on uptime (e.g., multiple load-balanced app instances) |
| Operational Complexity | Occasionally introduces upfront complexity to reduce ongoing operational burden (e.g., Packer-based immutable images) |
| Scalability | Avoids choices that introduce overhead at scale (e.g., automated onboarding vs. manual UI processes) |
| Security | Notes how decisions change security posture (e.g., workload identity/OIDC vs. long-lived credentials) |
Personnel Roles
Coordinates events, facilitates resources, and assigns duties to the Cloud Administration Team. Responsible for project-level planning including timeline and access acquisition.
Assumed knowledge
- Cloud architecture and administration
- Administration-level experience with Linux
- Practical knowledge of Docker
- Practical knowledge of Terraform
Focuses on integrating formal security controls required for services hosted in the chosen cloud environment. Critical for regulated industries.
Designated to own the TFE service post-deployment. Handover planning and documentation should occur before go-live.
Access Requirements
The installation team requires direct, administrator-level access to the following before starting:
| Resource Type | Examples |
|---|---|
| Compute & Storage Instances | VMs, storage volumes, EBS/managed disks |
| Network Objects | Firewall rules, load balancers, security groups |
| TLS Certificate Material | Certificate and private key matching the TFE hostname (in the SAN, not CN only). PEM encoded. Signed by a public or private CA. Self-signed certificates are not recommended. |
| Identity & IAM | AWS IAM, GCP Cloud Identity, Azure Active Directory |
| Secrets Management | AWS Secrets Manager / KMS, GCP Secret Manager / Cloud KMS, Azure Key Vault, VMware vSphere Native Key Provider |
| TFE License File | Must be obtained from HashiCorp account team. Save as terraform.hclic. Single line, no newline character. Treat as a company asset. |
| DNS Record | DNS record must exist matching the SAN in the TLS certificate |
Network Egress Requirements
TFE should not be exposed to the public internet for ingress; users must be on the company network. TFE does, however, need outbound access to:
- `registry.terraform.io` — public module registry (official providers are indexed here; restrict community content via Sentinel/OPA)
- `releases.hashicorp.com` — Terraform binary releases (stay within two minor releases of the latest)
- `reporting.hashicorp.services` — license usage aggregation (allow-listing strongly recommended)
- Algolia — used by the Terraform Registry for search indexing
- VCS/SAML endpoints and public cloud cost estimation APIs, as applicable
Resource Sizing
| Component | AWS | Azure | GCP |
|---|---|---|---|
| Disk | EBS gp3 (3000 IOPS) | Premium SSD (5000 IOPS) | Balanced Persistent SSD (10000 IOPS) |
| Machine (default) | m7i.2xlarge (8 vCPU, 30 GB) | Standard_D8s_v4 (8 vCPU, 30 GB) | n2-standard-8 (8 vCPU, 30 GB) |
| Machine (scaled) | m7i.4xlarge (16 vCPU, 61 GB) | Standard_D16s_v4 (16 vCPU, 61 GB) | n2-standard-16 (16 vCPU, 61 GB) |
| Database | db.r6i.xlarge | GP_Standard_D4ds_v4 | db-custom-4-16384 |
| Cache (Redis) | cache.m5.large | Premium P1 | STANDARD_HA |
CPU Sizing Rules (All Providers)
- Avoid burstable CPU instances (AWS T-type, Azure B-type, GCP e2-/f1-/g1-series)
- Choose latest generation general-purpose x86-64 instances
- Use a CPU-to-RAM ratio of at least 1:4
- Do not use memory-optimized instances
Concurrency & RAM Calculations
The `TFE_CAPACITY_CONCURRENCY` variable controls concurrent workspace runs. Default RAM per agent is 2048 MiB. Formula: total RAM ≈ (concurrency × 2 GB) plus ~10% overhead for the OS and the TFE application (at least ~4 GB). For 30 concurrent agents: 30 × 2 GB = 60 GB + overhead ≈ 66 GB total.
Default machine (8 vCPU, 30 GB): maximum concurrency of 11. Scaled machine (16 vCPU, 61 GB): maximum concurrency of 26.
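The sizing arithmetic can be sketched as Terraform locals; the input numbers are illustrative, not prescriptive:

```hcl
# Illustrative sizing arithmetic only; tune the inputs to your environment.
locals {
  concurrency      = 30   # target TFE_CAPACITY_CONCURRENCY
  ram_per_agent_gb = 2    # default TFE_CAPACITY_MEMORY (2048 MiB)
  overhead_factor  = 1.10 # ~10% for the OS and the TFE application

  # 30 x 2 GB x 1.10 = 66 GB
  required_ram_gb = ceil(local.concurrency * local.ram_per_agent_gb * local.overhead_factor)
}
```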
HVD Module Deployment Process (High-Level)
- Import the TFE modules into your VCS repository
- Configure remote state storage (S3/Blob/GCS or the HCP Terraform free tier)
- Select a machine with the Terraform CLI available and cloud credentials instantiated
- Read the module's GitHub README in its entirety before starting
- Prepare the TLS certificate and private key (SAN must match the FQDN; no self-signed certificates)
- Run `terraform init`, `plan`, and `apply`
- Tail installation logs post-deployment; watch for errors
- Retrieve the Initial Admin Creation Token (IACT) within 60 minutes of deployment
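As a sketch of the remote state step, assuming an S3 backend (bucket name, key, and region are placeholders):

```hcl
terraform {
  # Remote state for the TFE deployment itself; do not keep this state on a workstation.
  backend "s3" {
    bucket = "example-tfe-deploy-state" # placeholder
    key    = "tfe/terraform.tfstate"
    region = "us-east-1"
  }
}
```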
General Guidance
- Separate TFE pods and HCP Terraform agent worker pods — agent workload is inconsistent under load
- Use HCP Terraform Operator instead of the internal Kubernetes driver run pipeline for customers going beyond default concurrency per TFE pod
- Three TFE pods are sufficient for HA — HCP Terraform agent cluster capacity has the greatest impact on run success at scale
- TFE supports x86-64 on all versions; ARM requires v1.0.0 or later
- Do not use burstable instance types (AWS T-type, Azure B-type, GCP e2-/f1-/g1-series)
K8s Resource Sizing
| Component | EKS | AKS | GKE |
|---|---|---|---|
| Disk | EBS gp3 | Premium SSD Managed Disks | Persistent SSD Disks |
| Machine (3-node cluster) | m7i.2xlarge (8 vCPU, 32 GB) | Standard_D8s_v5 (8 vCPU, 32 GB) | n2-standard-8 (8 vCPU, 32 GB) |
| Machine (5-node cluster) | m7i.xlarge (4 vCPU, 16 GB) | Standard_D4s_v5 (4 vCPU, 16 GB) | n2-standard-4 (4 vCPU, 16 GB) |
Approximate minimum cluster sizing for HCP Terraform agents with 3 TFE pods at system defaults:
- 3-node cluster: 96 GB total memory, 64 GB (n-1)
- 5-node cluster: 80 GB total memory, 64 GB (n-1)
Agent Count Formula
Default: `TFE_CAPACITY_CONCURRENCY = 10`. With 3 TFE pods, expect capacity of 30 concurrent agents. RAM per agent is 2 GB by default, configured via `agentWorkerPodTemplate` in the Helm overrides.
Internal Run Pipeline vs. HCP Terraform Operator
Network Considerations
- Specify a version tag for the HCP Terraform agent image (e.g., `tfc-agent:<tag>`) — using `:latest` pulls the image on every run, adding unnecessary network load
- HVD Modules deploy layer 4 load balancers (highest throughput available)
- Load HCP Terraform agent Docker image from a region-local source (ECR) rather than public internet when possible
- Do not use instances with burstable network characteristics
Common Troubleshooting
| Error | Likely Cause | Resolution |
|---|---|---|
| ImagePullBackOff | Cluster cannot pull TFE container from HashiCorp registry | Check permissions, the image version in locals_helm_overrides.tf, and that the license file has no trailing newline (inspect with `base64 terraform.hclic`) |
| CrashLoopBackOff | TFE container failing to start | Open two terminals — one to tail terraform-enterprise.log, one to run helm install. Capture startup error for support ticket. |
Private Cloud Component Considerations
Multiple concurrent TFE nodes require an external Redis instance; this is a hard requirement for active-active. If Redis is not feasible on-premises, HashiCorp recommends considering a public cloud deployment or HCP Terraform (SaaS). Redis Sentinel is not supported.
The only alternative is external operational mode with a single TFE container — acceptable if your RTO allows, but plan ahead for Redis, as business HA requirements typically increase over time.
TFE depends specifically on PostgreSQL. Private cloud requires an organizational pattern for deploying and production-managing PostgreSQL at the supported versions. Recommended: version 15.x or later (14.x has some unsupported versions). Version 17.x supported through end of 2029.
Liaise with your DBA team early — there are specific schema requirements.
Sizing
- CPU: 4 core
- Memory: 32 GB RAM
- Disk: 2 TB
TFE requires S3-compatible storage. In a private datacenter, this requires a third-party technology. HashiCorp sees significant success with Dell ECS and MinIO. If the organization already has an S3-compatible pattern, use that.
Compute Sizing (Private Cloud)
Recommended operating systems: Red Hat Enterprise Linux or Ubuntu LTS.
| Component | Recommended Spec |
|---|---|
| TFE Compute | 4 vCPU, 32 GB RAM, 1 TB disk (many scaled customers use 8 vCPU/32 GB as initial production spec) |
| Disk (Docker) | Min 40 GB available to /var/lib/docker; recommend 3000 IOPS minimum |
| PostgreSQL | 4 core, 32 GB RAM, 2 TB disk |
| Redis Cache | 4 core, 16 GB RAM, 500 GB disk. Redis 6.2+ or 7. Recommend 7. |
Monitoring PostgreSQL and Redis
Monitor CPU, memory, available disk space, and disk IO using organizational telemetry. Create alerts at 50% and 70% utilization thresholds. If any parameter consistently exceeds 70%, increase the resource.
Network Requirements
TFE has specific ingress and egress requirements — refer to the TFE network requirements page for the latest. If a corporate proxy is filtering outbound traffic, add required destinations to the allow-list. Use a layer 4 load balancer. Air-gapped mode is available if external access is not possible.
Operating System
- Use OS configurations compliant with the CIS benchmark for the chosen operating system
- Limit CLI access to machines to a shortlist of well-known staff
- Ensure the organization's SIEM/audit log reflects all access
Application
- Use single sign-on (SSO) with multi-factor authentication (MFA) for all users
- TCP port restrictions for ingress/egress are configured by the deployment. Do not alter unless advised by HashiCorp support, a solutions engineer, or certified partner.
- Enable the `Strict-Transport-Security` response header
- For manual installs: set `restrict_worker_metadata_access` as a Docker environment variable to prevent Terraform operations from accessing the cloud instance metadata service
- HVD Module automated deployments restrict access to the AWS metadata service — do not re-enable it
- After deployment, do not create the initial administrator immediately — coordinate a handoff to the operations team
Primary Planning Considerations
The benefit of duplicating system tiers in a secondary region outweighs cost given TFE's mission-critical role. Calculate TCO of two TFE instances (one per region) including geo-redundant data layer costs. Also calculate the cost to the business if developers cannot deploy applications — this is often the more compelling number. Include any committed spend under enterprise discount programs.
Use automated means to deploy all infrastructure in both regions. Use HVD Modules to mirror resources across regions — deploy and maintain state for each region separately.
- Test region failover capability on a regular cadence — at least twice annually
- Document both failover and failback processes step-by-step in run books
- Have team members who did not write the document use it — validates clarity and trains staff
- Deploy an engineering pair of TFE instances (one per region) mirroring production; use for meaningful failover tests
- Maintain independent instances: each environment must have its own DNS, storage, and supporting services
- Perform fault injection testing using cloud provider features or third-party tooling
Component-Specific Guidance
- Keep VM/container images version-controlled and available in the failover region. Use Packer as the standard for machine image creation.
- Do not run TFE containers in the secondary region while primary is online — risk of premature DB read replica promotion causing corruption
- Keep compute cluster infrastructure deployed but scaled down until failover
- Keep TFE containers at the same version in both regions; upgrade during the same change window
- Co-locate primary and secondary compute layers in the same regions as their respective storage and database components
| Cloud | Recommendation | RPO Consideration |
|---|---|---|
| AWS | S3 Cross-Region Replication (CRR), live replication, bidirectional during failover, S3 Replication Time Control for monitoring | 99.99% of objects replicate within 15 min. Check for missing objects in run book before failing over. |
| Azure | Geo-zone-redundant storage (GZRS) — 16 nines durability. Use Standard general-purpose v2 storage accounts. Deploy only in paired regions with AZ support. | Azure Storage Geo Priority Replication guarantees 99% of blobs replicate within 15 min. |
| GCP | Dual-region GCS buckets with Turbo Replication | Turbo Replication guarantees 100% of objects replicate within 15 min. Premium feature — recommended for mission-critical TFE. |
| VMware | Deploy identical object store in each region; use strategic inter-DC connections for migration. Most customers solve with vSAN. | Work with VMware team to understand replication SLA between data centers. |
| Cloud | Recommendation |
|---|---|
| AWS | Use Aurora with cross-region read replicas via the `aws_rds_global_cluster` resource. Aurora global databases replicate in ~1 second. Monitor the `AuroraReplicaLag` and `AuroraGlobalDBReplicationLag` metrics. |
| Azure | Use Azure Database for PostgreSQL read replicas. Set `geo_redundant_backup_enabled = true`. Monitor replication lag — Azure auto-sets the DB to read-only below 5% free storage, which adversely affects TFE. |
| GCP | Use Cloud SQL with cross-region read replicas. Enable PITR. |
Redis does not require cross-region replication. Deploy Redis in both regions and ensure it is ready in the failover region before starting TFE. HVD Modules handle this when used iteratively for both regions.
The Platform Team
The platform team serves as the central hub, orchestrating functions and ensuring streamlined operations across teams. It may consist of one or more teams with separate areas of responsibility.
Drives cloud adoption and aligns cloud strategies with business objectives. Establishes governance frameworks, fosters knowledge sharing, and optimizes cloud resources to maximize value and ensure compliance.
Manages essential tools and automation to support efficient system operations. Implements golden workflows and reusable modules. Collaborates with stakeholders to ensure services meet consumer needs. May or may not be a separate sub-team.
Collaborates with security to ensure adherence to standards and best practices. Promotes standardization, scalability, and compliance across projects.
Streamlines development cycles and reduces organizational complexity. Uses a product management approach, allowing development teams to consume golden paths in a self-service mode, enhancing productivity and efficiency.
Security Team Collaboration
The security team collaborates with the platform team to establish governance policies and deploy monitoring tools. Key cloud security functions include:
- Subscribing to security updates and vulnerability alerts; collaborating with SRE on patches
- Managing TLS certificate validity and renewal for TFE
- Scanning TFE instances for vulnerabilities; integrating SIEM tools for audit logging
- Providing governance inputs to CCoE (PaC guardrails, CIS benchmarks)
Producers and Consumers
| Role | Responsibilities |
|---|---|
| Producers (Platform Team) | Provide seamless onboarding to the HCP Terraform platform. Manage the private registry. Oversee Policy-as-Code implementation. Offer enablement to consumers. |
| Consumers (Application Teams) | Initiate requests for platform access. Write Terraform code based on available private registry modules and platform team recommended practices. |
Golden IaC Workflows
A golden workflow is a standardized, repeatable process for completing a specific task. The platform team shifts these workflows left using pre-approved Terraform configurations and curated modules, empowering teams to independently provision infrastructure while maintaining centralized management.
| Workflow Type | Description |
|---|---|
| Producer — Module Development | Create Terraform modules and register them with the private registry |
| Producer — PaC Development | Develop Sentinel/OPA code and register policy sets in HCP Terraform |
| Consumer — Landing Zone | Provision cloud accounts/credentials and core TFE configuration elements from reusable modules; deploy VCS repos, projects, workspaces, and Stacks |
| Consumer — Developer | Create IaC relevant to support application components under remit |
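The producer module-development workflow typically ends with registering the module in the private registry. A minimal sketch using the `tfe` provider; the organization and repository identifiers are hypothetical:

```hcl
resource "tfe_registry_module" "vpc" {
  organization = "examplecorp-prod" # placeholder org name

  vcs_repo {
    display_identifier = "example-org/terraform-aws-vpc" # placeholder repo
    identifier         = "example-org/terraform-aws-vpc"
    oauth_token_id     = var.vcs_oauth_token_id
  }
}
```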
Key Features
- Vending portal: Central platform where users view, request, and build standardized components. Platform team maintains; offers self-service experience with audit and chargeback policy support.
- Pre-configured components: Range from full application templates to specific infrastructure aspects — guarantees deployment uniformity.
- Accelerated consumption path: Business units choose from validated patterns for faster delivery, or opt for custom architecture (slower process).
Supporting TFE Features
| Feature | Use |
|---|---|
| Private Registry | Host and share internal modules/providers; versioned and searchable |
| Private Providers & Modules | Restrict access to org members; cross-org sharing available in TFE |
| Run Tasks | Direct integration with third-party tools at specific run lifecycle stages |
Key Components
- Central control team (franchiser): Provides workflow and resources, keeps system running, sets rules, adds capabilities.
- Consumption workflow: End users have a path to access provisioning resources and manage their own infrastructure.
- Upfront governance: Governance at the outset ensures compliance while enabling provisioning — avoids separate compliance sign-off at go-live.
- Controlled vending: Proactive controls ensure legal, regulatory, and enterprise standards are met.
Supporting TFE Features
| Feature | Use |
|---|---|
| Terraform Workspaces | Persistent working directory per collection of infrastructure resources |
| Workspace Projects | Enables self-managed portions of TFE with same policies as root org |
| Sentinel Policies | Policy-as-code for fine-grained, logic-based policy decisions using external source data |
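Attaching a Sentinel policy set via the `tfe` provider might look like the following sketch; names and the VCS path are placeholders:

```hcl
resource "tfe_policy_set" "baseline" {
  name         = "baseline-guardrails" # placeholder
  organization = "examplecorp-prod"    # placeholder
  kind         = "sentinel"
  global       = true                  # enforce across all workspaces in the org

  vcs_repo {
    identifier     = "example-org/sentinel-policies" # placeholder repo
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```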
After automated installation, the initial admin user may be created via the Initial Admin Creation Token (IACT) as an optional final step. Two viable options exist for post-provision configuration:
- API scripts
- Terraform Enterprise provider (best practice — use this to derive state for the configuration)
Automate team creation alongside projects, Stacks, and workspaces using the TFE provider. Automate user addition within your IdP. Configure a team attribute name (default: MemberOf) in your IdP to automatically assign users to groups in SAML.
- For service accounts used by pipelines: use `IsServiceAccount: true` in SAML
- Create teams in TFE with the exact name of the group in the IdP
- Do not create users manually — TFE creates them automatically at SAML assertion login
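Team creation against the IdP group name can be sketched with the `tfe` provider; names are hypothetical, and `sso_team_id` carries the SAML team attribute value:

```hcl
resource "tfe_team" "app_devs" {
  name         = "app-devs"          # must exactly match the IdP group name
  organization = "examplecorp-prod"  # placeholder org
  sso_team_id  = "app-devs"          # value sent in the SAML team attribute (MemberOf)
}
```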
Connect HCP Terraform/TFE to your VCS provider to enable workflows for managing modules, policy sets, and connecting VCS-backed Stacks and workspaces.
Audit: For TFE, forward logs for monitoring and auditing. For HCP Terraform, use the Audit Trails API. Include audit, logging, and monitoring in the target architecture — do not wait until after go-live to implement observability.
HCP Terraform Agents: Define agent pools and assign Stacks/workspaces using the TFE provider. Multiple agents can run concurrently on a single instance (license limits apply to HCP Terraform but not TFE). For containerized agents, use single-execution mode for a clean working environment per run.
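Defining an agent pool and assigning a workspace to it can be sketched with the `tfe` provider (names are placeholders; newer provider versions may prefer `tfe_workspace_settings` for execution mode):

```hcl
resource "tfe_agent_pool" "k8s" {
  name         = "k8s-agent-pool"   # placeholder
  organization = "examplecorp-prod" # placeholder
}

resource "tfe_workspace" "app" {
  name           = "app-network"       # placeholder
  organization   = "examplecorp-prod"
  execution_mode = "agent"             # route runs to the agent pool
  agent_pool_id  = tfe_agent_pool.k8s.id
}
```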
Hierarchy Overview
| Level | Scope | Key Notes |
|---|---|---|
| Organization | Encompasses all components — teams, projects, workspaces, Stacks, policies, registry, VCS, variable sets, SSH keys | Centralize core provisioning in a single org. Naming: `<customer-name>-prod` and `<customer-name>-test`. |
| Project | Container for workspaces and Stacks. Inherits team permissions, variable sets, and policy sets. | Primary tool for delegating config/management in multi-tenant setups. Typically allocated to an application team. |
| Workspace / Stack | Manages a Terraform configuration and its associated state file | Workspace-level permission applies to that workspace only |
Access Management
HCP Terraform access is built on three components: User accounts, Teams, and Permissions. Implement SAML/SSO for user management combined with RBAC.
Comprehensive access to all org aspects. Certain tasks are reserved to owners only: creating/deleting teams, managing org-level permissions, viewing the full team list (including secret teams). Limit membership to a small group of trusted platform team members.
| Permission | Purpose |
|---|---|
| Manage Policies | Create, edit, read, list, delete Sentinel policies; access to read runs on all workspaces for enforcement |
| Manage Run Tasks | Create, edit, delete run tasks within the org |
| Manage Policy Overrides | Override soft-mandatory policy checks |
| Manage VCS Settings | Manage VCS providers and SSH keys |
| Manage Private Registry | Publish and delete providers/modules — owners only |
| Manage Membership | Invite/remove users; add/remove from teams. Cannot create teams or view secret teams. |
Projects are containers for workspaces and Stacks. Configuration elements attached at project level are inherited by all workspaces/Stacks within: team permissions, variable sets, policy sets.
Standard Project-Level Permissions
- Admin: Full control over the project (including deleting it)
- Maintain: Full control of everything in the project, except the project itself
Benefits of Project Delegation
- Agility: Teams can create and manage infrastructure in their designated project without requesting org-admin access
- Reduced Risk: Project permissions give admin access to a subset of workspaces/Stacks without cross-team interference
- Self-Service: Projects integrate no-code provisioning — project admins can deploy no-code modules without org-wide workspace management privileges
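Project-level delegation can be sketched with the `tfe` provider; all names are hypothetical:

```hcl
resource "tfe_team" "alpha_devs" {
  name         = "alpha-devs"       # placeholder team
  organization = "examplecorp-prod" # placeholder org
}

resource "tfe_project" "alpha" {
  name         = "team-alpha"       # placeholder project
  organization = "examplecorp-prod"
}

resource "tfe_team_project_access" "alpha_maintain" {
  team_id    = tfe_team.alpha_devs.id
  project_id = tfe_project.alpha.id
  access     = "maintain" # full control inside the project, but not of the project itself
}
```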
Create and maintain a naming convention document covering teams, projects, workspaces, Stacks, and other entities. Pass this to all new internal clients to standardize operations.
Recommended org naming: `<customer-name>-prod` (production workloads) and `<customer-name>-test` (integration testing and PoCs).
Cloud Provisioning Key Steps
- Plan the project: installation/config of TFE, self-service capability design. Consider onboarding early adopters before general availability — use stepwise refinement.
- Consider platform team size and bandwidth — if any onboarding step is not automated, it compounds with scale.
- Plan a landing zone provisioning workflow.
- Ensure the platform team is fully trained on IaC with Terraform and the cloud providers in use. All contributors must adhere to the HashiCorp Terraform language style guide.
- Set up a TFE Stack or workspace with cloud credentials.
- Store Terraform code in your strategic VCS.
- Provision cloud resources from the VCS-backed Stack or workspace.
- Consider configuration required to enable end-to-end deployment within org security and compliance requirements. Identify manual steps and what would be needed to automate each.
Landing Zones
A cloud landing zone is a foundational, standardized environment for secure, scalable cloud operations. The platform team deploys a control workspace during onboarding of each internal customer.
- Networking: VPCs, subnets, connectivity settings
- IAM: Policies, RBAC, permissions enforcement
- Security & Compliance: Encryption, security groups, logging
- Operations: Monitoring, logging, automation tools
- Cost Management: Tagging policies, budget alerts, cost reporting
Major cloud providers: AWS Control Tower, Azure Landing Zone Accelerator, Cloud Foundation Toolkit (GCP).
Core Requirements
- Use the Terraform Enterprise provider for state representation
- Define a VCS template with boilerplate Terraform code and a directory structure managed by the platform team
- Create a Terraform module for the private registry that the platform team calls during automated onboarding. This module creates:
- A control workspace for the application team
- A VCS repository for the application team
- Variables/sets as needed
- Public and private cloud resources for the team
- Hook the onboarding pipeline into other org platforms (credential generation, observability, etc.)
- Dedicate a TFE project to house landing zone control workspaces separately from other platform team workspaces
- Ticket raised: Audit trail created, approval acquired
- Landing zone child module call code added: Top-level workspace collects and manages onboarded teams. Automate the addition of each child module call — do not do this manually at scale.
- Run the top-level workspace: Creates the control workspace for the application team, VCS repo, variables, cloud resources
- Application team onboards and begins using their workspace
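Each onboarded team then becomes one child module call in the top-level landing zone workspace. A sketch with a hypothetical module source and inputs:

```hcl
# One call per onboarded application team; additions should be automated, not hand-edited.
module "landing_zone_team_alpha" {
  source  = "app.terraform.io/examplecorp/landing-zone/tfe" # hypothetical registry path
  version = "~> 1.0"

  team_name      = "team-alpha"  # placeholder inputs
  vcs_repo_owner = "example-org"
  # ...credentials, variable sets, and cloud resources as required
}
```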
Workflow Personas
| Persona | Role | Example Titles |
|---|---|---|
| Developer | Develops infrastructure and application code | Software engineer, application developer |
| Lead Developer | Helps efforts of product developer teams | Development lead, technical team lead |
| Release Engineer | Coordinates deployment to production using automation | Release engineer, release manager |
| Platform Engineer | Writes pipeline definitions; enables developers to use pipelines | DevOps engineer, operations engineer |
| Infrastructure Operator | Maintenance, configuration, administration | SRE, site reliability engineer, systems admin |
VCS-Driven Workflow
A specific VCS repository backs each workspace or Stack. HCP Terraform uses webhooks to monitor commits, pull requests, and tags. Changes trigger plan runs; PRs trigger speculative plans.
Prerequisites
- VCS repository containing source code for the deployment
- VCS authentication enabling secure access for HCP Terraform
- VCS permissions configured (read-only, merge, etc.)
- Workspace or Stack naming conventions and permissions defined
High-Level Steps
- Configure VCS integration for your organization
- Connect workspace or Stack to the desired branch in your VCS repository
- Adopt a branching strategy (e.g., standard feature branching)
- Enable speculative plan runs for each branch
- Define PR process per organizational standard
- (Optional) Configure automatic run triggers based on git tag pushes
- Branch change: Trigger when a specific branch changes — long-running or merged feature branches
- Tag match (pattern): Trigger only for changes with a specific tag format. Supports semantic versioning, prefix, suffix, or custom regex.
| Tag Format | Regex Pattern |
|---|---|
| Semantic Versioning | `^\d+\.\d+\.\d+$` |
| Version with prefix | `\d+\.\d+\.\d+$` |
| Version with suffix | `^\d+\.\d+\.\d+` |
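A tag-triggered workspace can be sketched with the `tfe` provider's `vcs_repo.tags_regex` argument; identifiers are hypothetical, and the dots must be escaped (doubled backslashes inside the HCL string):

```hcl
resource "tfe_workspace" "release" {
  name         = "app-release"      # placeholder
  organization = "examplecorp-prod" # placeholder

  vcs_repo {
    identifier     = "example-org/app-infra" # placeholder repo
    oauth_token_id = var.vcs_oauth_token_id
    tags_regex     = "^\\d+\\.\\d+\\.\\d+$"  # trigger runs on semantic version tags only
  }
}
```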
Auto-apply is configurable per workspace. It is useful in non-interactive, non-production environments to run `terraform apply` automatically after a successful plan, and works regardless of how the plan was triggered (VCS, API, etc.).
API-Driven Workflow
For customers with strategic CI/CD pipeline orchestrators, HCP Terraform and TFE form the infrastructure management component of those pipelines. In this model, the CI/CD tool drives the Terraform run via the TFE API rather than through VCS webhooks. This supports more complex orchestration patterns where infrastructure changes are part of a larger pipeline.
Terraform 1.5 introduced the configuration-driven `import` block. It allows declaration of resources to import in configuration, bulk import via `for_each`, a planning phase before import, and generation of configuration for imported resources (`terraform plan -generate-config-out`). This is the recommended approach.
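A minimal `import` block, assuming an existing S3 bucket (resource and bucket names are placeholders):

```hcl
# Bring an existing bucket under Terraform management without recreating it.
import {
  to = aws_s3_bucket.logs
  id = "examplecorp-app-logs" # the existing bucket's name
}

resource "aws_s3_bucket" "logs" {
  bucket = "examplecorp-app-logs"
}
```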
What to Import vs. What Not to Import
Do not import: purely logical resources (e.g., `null_resource`), resources managed by teams not using Terraform, and resources that can simply be rebuilt with acceptable downtime.
| Team | Responsibilities |
|---|---|
| Platform Team | Set up/maintain workspaces, projects, orgs, policy sets, cloud auth. Develop import best practices and module usage guidance. Provide training to application teams. Convert repeated configs into modules. |
| Application Team | Identify which resources to import. Work with platform team on accurate transition. Provide resource attribute information. Run day-to-day plans/applies through standardized module usage. |
Phased Approach
- Application team gets guidance from platform team on which resources to manage
- Start with a small pilot set; complete to documentation and review
- Gradually expand resources from the same team; then move to other teams
- Continuously review and refine the process; ensure application teams maintain Day 2 responsibilities
- Platform team enables drift detection features to catch out-of-band changes
- After Phase 1, identify common configuration patterns
- Place common resources into modules to scale Terraform maturity and consumption
- Add granularity between modules accounting for permissions and security (you cannot partially instantiate a module)
- Use projects, variable sets, and Sentinel policy sets to organize the new structure
Feature Availability by Product
| Feature | HCP Terraform | Terraform Enterprise |
|---|---|---|
| Operational Logs | No (HashiCorp SRE manages) | Yes (your SRE team manages) |
| Audit Trail | Yes | Yes |
| Metrics | No (HashiCorp SRE manages) | Yes |
| HCP Terraform Agent Logs | Yes | Yes |
Observability Feature Definitions
- Operational logs: Track performance and behavior — error messages, warnings, events. Used by SRE to monitor and maintain service.
- Audit trail: Security-focused — login attempts, access control changes, suspicious activities. Used by security analysts and incident response teams.
- Metrics: Application component performance and usage data — detect service quality issues or inform capacity planning.
TFE Monitoring Focus Areas
Configure Prometheus to gather metrics from TFE and its underlying components. TFE generates operational metrics Prometheus can collect: CPU, memory usage, request latency, and more.
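An illustrative scrape job is sketched below; the hostname is a placeholder, and the port and format parameter should be verified against your TFE release (recent releases expose metrics on the port set by `TFE_METRICS_HTTP_PORT`, 9090 by default, when metrics collection is enabled):

```yaml
scrape_configs:
  - job_name: "terraform-enterprise"
    metrics_path: "/metrics"
    params:
      format: ["prometheus"]   # TFE returns JSON unless Prometheus format is requested
    static_configs:
      - targets: ["tfe.example.com:9090"]   # placeholder host; TFE_METRICS_HTTP_PORT
```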
Use Grafana to visualize metrics. The official Terraform Enterprise Grafana dashboard (ID: 15630) provides real-time insights into resource utilization trends and request rates.
If using HCP Terraform agents, include agent logs in your log collection and analysis. This ensures a complete picture of all activities. Analyze agent logs alongside TFE logs to better optimize and improve deployment.
- Each team member should have an account created through the HashiCorp support portal
- Configure each team member permitted to open support tickets as an Authorized Technical Contact. Provide the list to your assigned HashiCorp Solutions Engineer for configuration.
- Familiarize the team with documentation on how to open a support ticket and generate/upload a support bundle when applicable
- Be aware of your support plan level to manage response time expectations
- Understand the severity level definitions when opening a ticket — see the Customer Success Enterprise Support page
Health Assessment Components
Drift detection identifies configuration drift: when changes are made directly to infrastructure outside Terraform's managed processes, creating discrepancies between live state and the code-defined desired state.
Limitations
- Unmanaged attributes: Drift detection only covers attributes explicitly managed by Terraform. Manually modified settings outside Terraform's control won't be flagged.
- External additions: Resources added entirely outside Terraform (e.g., manually created IAM users) are not detected.
Prescriptive Guidance
- Enable health assessments for all workspaces — set globally in Settings → Health
- Enable workspace notifications so admins are alerted via Slack or email when drift is detected
- TFE admins can modify assessment frequency and maximum concurrent assessments from admin settings console
Continuous validation enforces standards over time — set rules for security, cost, or other requirements, and Continuous Validation checks they're always being met. Covered in the Scaling Operating Guide.
Drift Resolution Workflow
Once drift is detected, the workspace notifies the application team. They decide the best resolution:
Refresh State vs. Update Configuration
| Operation | What it Does | When to Use |
|---|---|---|
| Refresh State | Updates Terraform's internal state file to match actual infrastructure — does NOT modify infrastructure | To ensure Terraform's understanding is accurate for identifying drift |
| Update Terraform Configuration | Submits updated config files and executes plan/apply — DOES modify infrastructure | When incorporating intentional drift changes or correcting configuration to match desired state |
Registry Roles
| Role | Responsibilities | Personas |
|---|---|---|
| Registry Administrator | Publish and delete modules/providers from the private registry (public or private sources). Organization-level permission — assign to a specific team. | Platform team members, CI/CD pipelines automating the publish process |
| Producer | Create and maintain modules/providers. Publish new releases. Needs commit access to the VCS repo + Registry Administrator permissions. | Platform team responsible for custom modules, service owner teams |
| Consumer | Find and use providers/modules necessary to provision infrastructure. Needs commit access to VCS repo and write access to TFE workspaces. | Application team members, platform team using the TFE provider |
Module Requirements
- VCS Repository: Module code must be hosted in a supported VCS repository
- Semantic Versioning: Must use semantic versioning scheme for version constraints to work
- Naming Convention: Must follow `terraform-<PROVIDER>-<name>` format (e.g., `terraform-aws-ec2-instance`)
- Standard Module Structure: Enables the registry to generate documentation, track resource usage, parse examples, run tests
- VCS Provider Configured: Must have a VCS provider configured with administrative access to the module repositories
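To illustrate how consumers reference a registry module once these requirements are met (hostname, organization, module name, and inputs below are placeholders):

```hcl
module "ec2_instance" {
  # Private registry source format: <TFE_HOSTNAME>/<ORGANIZATION>/<NAME>/<PROVIDER>
  source  = "tfe.example.com/platform-org/ec2-instance/aws"
  version = "~> 1.2"   # semantic version constraint resolved against release tags

  instance_type = "t3.micro"
}
```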
| Method | Best For | Notes |
|---|---|---|
| Tag-based | Modules associated with release tags in VCS | Registry auto-detects and publishes new versions based on tags. Consider implementing tag protection rules in VCS. |
| Branch-based | Enhanced flexibility; required for integrated testing | Allows selection of a specific branch with an assigned version number |
| API / CI/CD Pipeline | Automation and scale | Recommended for delegating publishing to CI/CD |
Additional Benefits of Custom Modules
Beyond reusability, custom modules enable organizations to encode:
- Naming conventions
- Security controls
- FinOps standards (tagging, cost allocation)
Key Benefits of Sentinel
- Risk mitigation: Actively lowers chances of errors and vulnerabilities by enforcing rules during planning and execution
- Regulatory governance: Ensures every action aligns with org policies, regulatory guidelines, and security standards at scale; simplifies auditing
- Separation of concerns: Policies managed by platform/compliance/security teams, separate from the application teams deploying infrastructure; workspace owners cannot opt out without explicit permissions
- Sandboxing: Policies act as core guardrails — reduces need for manual verification
- Codification: Makes governance clearer, more efficient, consistent, and operationally reproducible. Eliminates reliance on undocumented, word-of-mouth knowledge. Fosters transparency.
- Version control: History tracking, diffs, PRs — demonstrable and auditable policy evolution
- Testing: Sentinel's built-in testing framework enables automated CI testing — reduces TCO for governance
- Automation: Policy deployment is far faster than manual work and requires fewer humans; ensures consistency at scale
Policy Enforcement Workflow
- Define governance/compliance policies and translate them into Sentinel policy requirements
- Code policies using the Sentinel language; arrange in policy sets
- Scope policy sets to the entire organization, or to one or more projects/workspaces
- Policy enforcement levels: advisory (warns but doesn't block), soft-mandatory (can be overridden by authorized users), hard-mandatory (cannot be overridden)
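A minimal sketch of such a policy, using the `tfplan/v2` import to enforce mandatory tags on new or changed `aws_instance` resources (the tag names are illustrative, not a prescribed standard):

```sentinel
import "tfplan/v2" as tfplan

required_tags = ["owner", "cost-center"]   # example organizational standard

# aws_instance resources being created or updated in this plan
instances = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_instance" and
  rc.mode is "managed" and
  (rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Every matched instance must carry all required tags
main = rule {
  all instances as _, rc {
    all required_tags as t {
      rc.change.after.tags is not null and
      keys(rc.change.after.tags) contains t
    }
  }
}
```

Attached at soft-mandatory, this warns and allows authorized overrides; at hard-mandatory, it blocks non-compliant applies outright.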
People and Responsibilities
- Sentinel policies owned by the Security team with input from regulatory compliance areas
- CCoE/Platform Team understands Sentinel code and how to employ it in production pipelines
- CCoE/Platform Team partners with Security Team to manage ownership and RBAC over policy code
Repository Organization Best Practices
- `policies/`: Main Sentinel policies, organized by environment. Each environment subdirectory contains policies and a test subdirectory (test files, mock data, Terraform config used to generate mock data).
- `modules/`: Reusable policy modules. Common Terraform import functions stored here — flat structure with illustrative filenames.
- `docs/`: Always fully document policies and their tests.
- Acquire IT security policies relevant to infrastructure deployment from your security team — translate into Sentinel policy-as-code
- If no policy list exists, agree on general controls internally and design them to be extended over time
- Staff responsible for policy development must understand the Sentinel language — read official language documentation
- Attend HashiCorp Sentinel Academy training (hands-on labs and real-world examples) — contact your HashiCorp Solutions Engineer or Customer Success Manager
- Use Sentinel modules (0.15.0+) to specify reusable functions that reduce codebase length
- Clone the `terraform-sentinel-policies` GitHub repo — provides prewritten policies for public/private cloud providers and reusable function modules
Common Starting Policies
- Governance of maintenance windows (protecting from adverse change at wrong times)
- Enforcement of metadata tagging of cloud resources
- IaC style enforcement (e.g., Terraform module versions pinned, only from private registry)
How Run Tasks Work
Run tasks send an API payload to an external service at a specific run stage. The service processes the data, evaluates whether the run passes or fails, and sends a response back to HCP Terraform. Based on the enforcement level, HCP Terraform determines if the run can proceed.
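For illustration, the external service reports its verdict by PATCHing a payload like the following to the callback URL included in the original request (the message and report URL are placeholders):

```json
{
  "data": {
    "type": "task-results",
    "attributes": {
      "status": "passed",
      "message": "4 resources scanned, no findings",
      "url": "https://scanner.example.com/report/123"
    }
  }
}
```

A `status` of `failed` combined with a mandatory enforcement level stops the run; under advisory enforcement, the result is recorded but the run proceeds.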
Run Stages
| Stage | Available Data | Use Case |
|---|---|---|
| Pre-plan | Code and other attributes | Examine code to determine if entering the plan stage should be allowed |
| Post-plan / Pre-apply | Plan results | Examine the plan and determine whether an apply should be allowed (most common stage) |
| Post-apply | Provisioned infrastructure data | Testing and gathering/storing information about provisioned infrastructure |
Common Use Cases
| Category | Example Tools |
|---|---|
| Security & Compliance | Palo Alto Networks Prisma Cloud, Zscaler, Snyk, Tenable, Sophos, Aqua Security, Firefly |
| Cost Control | Infracost, Vantage, Kion |
| Visibility | Pluralith (resource visualization) |
| Image Compliance | HCP Packer run task (verify approved golden images) |
Implementation Flow
- Select the desired run task from the public Terraform Registry; review requirements and documentation
- Establish and verify two-way connectivity between HCP Terraform platform and the run task endpoint (network/security modifications may be required)
- Create the run task in the Terraform Organization (connect endpoints, test communication path)
- Associate the run task with a workspace; configure the stage and enforcement level (Advisory or Mandatory)
- Run task executes as part of normal run cycles; review results in run completion output
When to Use Self-hosted Agents
TFE Scaling Strategies (Priority Order)
- Migrate to active-active operational mode and increase TFE node count
- Add more resources to VMs hosting TFE nodes; increase capacity config params (`TFE_CAPACITY_CONCURRENCY`, `TFE_CAPACITY_CPU`, `TFE_CAPACITY_MEMORY`)
- Deploy HCP Terraform agents — the next logical step once the above limits are reached
Agent Pool Design
Design agent pools based on:
- Product limits (HCP Terraform enforces concurrency/agent limits per tier; TFE has no agent concurrency limit)
- Cloud provider — dedicated pools per provider simplify credential management
- Environment — dedicated pools per environment (dev/staging/prod) prevent accidental cross-environment changes
- Operations — separate pools for highly privileged operations; granular permissions and cleaner audit trails
Recommended pattern: `{environment}-{cloud-provider}-agentpool`
Examples: `dev-aws-agentpool`, `staging-azure-agentpool`, `prod-gcp-agentpool`
- Use standardized abbreviations; use hyphens or underscores as delimiters; keep lowercase; avoid special characters and spaces
- Document the naming convention in org wiki; communicate to all relevant stakeholders
Two primary deployment modes:
- Virtual machines: Run the agent on VMs (e.g., EC2 instances). Use Packer + HCP Packer integration to build agent images. Use autoscaling groups/managed instance groups/scale sets. Use rolling upgrade features.
- Kubernetes (recommended for K8s-skilled customers): Use the K8s Operator for autoscaling capabilities. See the K8s Operator section for details.
Use the following metrics to drive VM-based agent scaling decisions:
- `tfc-agent.core.status.busy` — number of agents in busy status at a point in time
- `tfc-agent.core.status.idle` — number of agents in idle status at a point in time
API endpoints also provide information for automating scaling decisions — review the agent documentation for current endpoints.
Custom Resources Introduced
| CRD | Purpose |
|---|---|
| AgentPool | Manages HCP Terraform Agent Pools and Agent Tokens. Supports on-demand scaling operations for HCP Terraform agents. |
| Module | Facilitates API-driven Run Workflows; streamlines execution of Terraform configurations. |
| Project | Manages HCP Terraform Projects — organized and efficient project handling. |
| Workspace | Manages HCP Terraform Workspaces — structured environment for resource provisioning and state management. |
Use Case 1: Auto-scaling Agent Pools
The Operator manages agent pool lifecycle and deployment via the AgentPool CRD. It can monitor workspace queues to trigger autoscaling based on defined min and max replicas.
- Increase agents up to `autoscaling.maxReplicas` or the licensed limit (whichever is reached first)
- Reduce agents to `autoscaling.minReplicas` within `autoscaling.cooldownPeriodSeconds` when no pending runs exist
- Set `minReplicas` based on baseline run concurrency for health checks (drift detection and continuous validation)
Sizing AgentPool Autoscaling
- maxReplicas: Determined by peak-run concurrency demand and HCP Terraform tier constraints. Scale-test your cluster to ensure peak load is handled.
- minReplicas: Consider baseline run concurrency from health checks (drift detection, continuous validation).
- Set memory limits and resource requests on agents — helps efficient node placement; critical if using cluster scaling technologies like Karpenter.
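A sketch of an `AgentPool` manifest wiring these settings together — the organization, secret, and pool names are placeholders, and field names should be verified against the CRD reference for your Operator version:

```yaml
apiVersion: app.terraform.io/v1alpha2
kind: AgentPool
metadata:
  name: dev-aws-agentpool
spec:
  organization: my-org              # placeholder organization
  token:
    secretKeyRef:
      name: tfc-credentials         # K8s secret holding the API token
      key: token
  name: dev-aws-agentpool           # pool name as it appears in the organization
  agentTokens:
    - name: pool-token
  agentDeployment:
    replicas: 1
  autoscaling:
    minReplicas: 1                  # cover baseline health-assessment runs
    maxReplicas: 5                  # bounded by peak demand and tier limits
    cooldownPeriodSeconds: 120      # wait before scaling back down
```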
Use Case 2: Self-Service Infrastructure via Kubernetes Native Consumption
The Operator lets application developers define infrastructure configuration using Kubernetes configuration files. It delegates the reconciliation phase to HCP Terraform. This frees developers from needing to learn HCL for infrastructure management tasks — useful when your application teams are Kubernetes-native and prefer K8s manifest-based workflows.
Security Considerations
Agent tokens stored in the Kubernetes cluster must be secured using your organization's K8s secrets management approach. Review the Operator documentation for specific security guidance. Egress requirements for the HCP Terraform agent apply when agents are deployed via the Operator — includes provider endpoint connectivity, Terraform registry access, and Terraform releases access.
Why Continuous Validation?
Failed infrastructure changes can introduce project delays and expose the organization to operational or security risks. Continuous validation gives advance notice of issues preventing successful changes, so they can be addressed before a Terraform apply fails in production.
Best Practice Recommendation
- When a new workspace is created, enable continuous validation (explicitly at workspace level or implicitly at org level)
- Include necessary logic in Terraform configuration to validate important components whose health may change over time
- If infrastructure changes fail in the future due to an unchecked condition, update the Terraform configuration to incorporate the new validation — and apply this pattern to existing infrastructure code
Rule of Thumb — Which Resources to Validate
- Check the status of any critical resource that can fail (e.g., VMs)
- Check validity of resources with user-defined time frames whose failure impacts the application stack (e.g., TLS certificates)
- Not necessary for inherently durable resources (e.g., S3 buckets — native to cloud provider)
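For example, a `check` block (Terraform 1.5+) can validate a TLS certificate's expiry on every health assessment — the endpoint here is hypothetical, and the `hashicorp/tls` provider is assumed:

```hcl
check "app_certificate" {
  data "tls_certificate" "app" {
    url = "https://app.example.com"   # placeholder endpoint
  }

  assert {
    # Fails the check (without blocking runs) once the leaf certificate expires
    condition     = timecmp(plantimestamp(), data.tls_certificate.app.certificates[0].not_after) < 0
    error_message = "TLS certificate for app.example.com has expired."
  }
}
```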
Implementation Requirements
| Language Feature | Minimum Terraform Version |
|---|---|
| Preconditions and postconditions | 1.2 and later |
| Check block | 1.5 and later |
Permissions required: Organization health settings require Owners team membership. Individual workspace settings require Workspace Admin access.
Notification Event Categories
| Event | Trigger | Priority |
|---|---|---|
| Check Failed | Continuous validation check returns unknown or failed | Critical |
| Drift Detected | Every time drift is detected on this workspace | Critical |
| Health Assessment Errored | Health assessment cannot complete successfully | Critical |
| Auto-destroy Reminder | Sends reminder 12 and 24 hours before auto-destroy run | Critical |
| Auto-destroy Results | Results of an auto-destroy run | Critical |
Run Events
| Event | Trigger | Priority |
|---|---|---|
| Created | Run created, enters Pending state | Low |
| Planning | Run acquires lock and starts executing | Low |
| Needs Attention | Human decision required — plan changed, not auto-applied, or policy override required | Critical |
| Applying | Plan confirmed or auto-applied | Low |
| Completed | Run completed successfully | Low |
| Errored | Run terminated early due to error or cancellation | Critical |
Implementation Guidance
Configure notifications via the web UI, the API, or the Terraform TFE provider (`tfe_notification_configuration` resource). HashiCorp recommends using the TFE provider to configure notifications as part of the project/workspace creation process.
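A sketch using the TFE provider's `tfe_notification_configuration` resource (the webhook variable and workspace reference are placeholders):

```hcl
resource "tfe_notification_configuration" "slack_alerts" {
  name             = "critical-run-events"
  enabled          = true
  destination_type = "slack"
  url              = var.slack_webhook_url   # placeholder webhook URL

  # Focus on critical events to avoid alert fatigue
  triggers = ["run:needs_attention", "run:errored", "assessment:drifted"]

  workspace_id = tfe_workspace.dev.id        # hypothetical workspace reference
}
```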
Notification Strategy
- Choose appropriate destination: Slack is popular, but use whatever fits team workflow (email, Teams, etc.)
- Granular notifications: Avoid broad notifications that cause alert fatigue — focus on critical events
- Integration with incident management: Integrate with incident management tools so alerts lead to actionable items
Maintenance
- Periodically review notification settings and adjust based on changing infrastructure needs and team feedback
- Test when making changes — trigger events manually to verify notifications are received
- Continuously monitor and solicit feedback to reduce noise and improve relevance
No-code provisioning deploys each module into a new TFE workspace. This is a consideration for platform teams managing license consumption: each no-code deployment creates a new workspace.
Roles and Responsibilities
| Role | Responsibilities |
|---|---|
| Registry Administrator (Platform Team) | Design, build, and publish no-code modules to the private registry. Ensure modules are configured to allow no-code provisioning. Define and document required variable values. |
| Project Admin (Application Team) | Configure and deploy no-code modules within their project. Manage the resulting workspace lifecycle. |
Permissions
- Marking a module as no-code enabled requires the Manage Private Registry permission or Owners team membership
- Deploying a no-code module requires Project Admin permission or higher
- HCP Terraform/TFE uses the module's configured variable set or workspace variables for cloud credentials
Configuring at Scale
- Use the TFE provider or API to automate no-code module configuration when setting up new projects
- Define variable sets at the project level to provide the necessary cloud credentials — these are inherited by no-code workspaces
- Document the no-code provisioning process for consumers so they understand what is available and how to use it
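A sketch of automating this with the TFE provider's `tfe_no_code_module` resource (the organization, module reference, and option values are placeholders):

```hcl
resource "tfe_no_code_module" "web_app" {
  organization    = "my-org"                        # placeholder organization
  registry_module = tfe_registry_module.web_app.id  # module already in the registry

  # Constrain the choices presented to no-code consumers
  variable_options {
    name    = "instance_type"
    type    = "string"
    options = ["t3.micro", "t3.small"]
  }
}
```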
Use Cases
- Spin up and tear down feature-branch infrastructure automatically
- CI/CD environments that need fresh infrastructure per pipeline run
- Time-boxed customer or internal demos and proof-of-concept deployments
- Any scenario where infrastructure should not persist beyond a defined lifecycle
Roles and Responsibilities
| Role | Responsibilities |
|---|---|
| Platform Team | Define standards for ephemeral workspace usage. Configure auto-destroy schedules and notification settings. Provide automation patterns for teams to create and destroy ephemeral workspaces. |
| Application Team | Create ephemeral workspaces following platform team standards. Manage the lifecycle within defined parameters. Monitor notifications for auto-destroy events. |
Permissions
- Creating ephemeral workspaces and configuring auto-destroy requires Workspace Admin permission or higher at the project level
Configuring at Scale
- Use the TFE provider or API to automate ephemeral workspace creation as part of CI/CD pipelines
- Set auto-destroy notifications (12h and 24h reminders are available) so teams are aware of impending destruction
- Define standard auto-destroy schedules in project-level documentation — prevents unintended persistence of temporary resources
- Integrate auto-destroy results notifications with your incident management or team communication channels
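A sketch of creating an ephemeral workspace with the TFE provider — names are placeholders, and `auto_destroy_activity_duration` destroys the workspace's resources after the stated period of inactivity:

```hcl
resource "tfe_workspace" "feature_env" {
  name         = "feature-x-preview"      # placeholder workspace name
  organization = "my-org"                 # placeholder organization
  project_id   = tfe_project.sandbox.id   # hypothetical sandbox project

  # Queue a destroy run after 3 days without activity
  auto_destroy_activity_duration = "3d"
}
```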