ADR-0024: Automated External Provisioning
Accepted Cross-Project Universal
Date: 2026-05-01
Compliance: ISO 27001 (A.8.15) — applicable in territories under that regime.
Context
Several territories provision external resources (DNS records, TLS certificates, cloud buckets, third-party API resources) when a domain entity is created or modified. The pattern recurs:
- kendo: tenant signup creates Domain rows that need DNS records (Cloudflare) and TLS certificates (Fly) before the tenant can serve HTTPS traffic. Active campaign: PR #1028 / KD-0580.
- ublgenie: tenant onboarding writes per-tenant resources to Fly and equivalent infrastructure.
- emmie: per-tenant Cloudflare R2 / AWS S3 buckets created at tenant onboarding (CreateAwsBucket.php already exists; not currently behind a provisioning state machine).
- daymate-api: speculative future adopter.
Without a shared architecture pattern, each territory invents its own version. The kendo PR #1028 critique surfaced six recurring risks in one campaign:
- Inconsistent failure semantics. Does provisioning failure roll back the parent record? Cancel the user's onboarding? Leave a half-state?
- No retry contract. Manual retry, automatic retry, both? Bounded by what?
- Authorization at the retry seam. Tier-1 vs Tier-2 (per ADR-0006) decided ad-hoc per territory.
- Audit trail divergence. Provisioning is exactly the surface ISO 27001 A.8.15 demands history for; each territory may add or omit audit logging differently.
- Timeout discipline drift. Doctrine #8 (war-room) requires an explicit per-call ->timeout() on external HTTP. Easy to forget when the provider is hidden behind a generic SDK.
- Provider lock-in. Direct SDK calls in Actions make territory migrations (e.g., Cloudflare → Route53) painful.
This ADR captures the architecture before kendo locks the implementation, so the next territory adopts the pattern instead of re-inventing it.
Decision
External provisioning operations are async, provider-abstracted, retain-on-failure, single-dispatch, audit-mandatory, Tier-1 retry, flag-gated. The shape:
1. State machine on the provisioned entity
The entity that owns the external resource carries provisioning lifecycle state in its own row. Minimum schema (territory adapts column names; semantics are universal):
    provisioning_status          enum / string            // pending → dns_pending → dns_active → cert_pending → cert_active → health_pending → active; failed terminal
    provisioning_failed_step     enum / string nullable   // null when not failed; identifies which step terminated
    provisioning_attempts        unsigned int default 0
    provisioning_last_error      text nullable
    provisioned_at               timestamp nullable
    {provider}_record_id         string nullable
    {provider}_certificate_id    string nullable

A single failed state with a string provisioning_last_error is rejected: recovery branches differently for DNS-fail vs cert-fail vs health-fail. provisioning_failed_step makes the recovery path type-safe.
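The lifecycle above could be modelled as a pair of backed enums. This is a minimal sketch, not the kendo implementation: the enum names, the case values for the failed step (dns, cert, health), and the resume points for a retry are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical enum mirroring the provisioning_status values listed above.
enum ProvisioningStatus: string
{
    case Pending = 'pending';
    case DnsPending = 'dns_pending';
    case DnsActive = 'dns_active';
    case CertPending = 'cert_pending';
    case CertActive = 'cert_active';
    case HealthPending = 'health_pending';
    case Active = 'active';
    case Failed = 'failed';

    /** Next state on the happy path, or null when no forward step exists. */
    public function next(): ?self
    {
        return match ($this) {
            self::Pending => self::DnsPending,
            self::DnsPending => self::DnsActive,
            self::DnsActive => self::CertPending,
            self::CertPending => self::CertActive,
            self::CertActive => self::HealthPending,
            self::HealthPending => self::Active,
            self::Active, self::Failed => null,
        };
    }

    /** Any in-flight state may fail; Active and Failed may not (an assumption). */
    public function canTransitionTo(self $target): bool
    {
        if ($target === self::Failed) {
            return $this !== self::Active && $this !== self::Failed;
        }
        return $this->next() === $target;
    }
}

// Hypothetical enum for provisioning_failed_step: one case per recovery branch.
enum FailedStep: string
{
    case Dns = 'dns';
    case Cert = 'cert';
    case Health = 'health';

    /** Assumed resume point for a retry; the ADR does not prescribe it. */
    public function resumeFrom(): ProvisioningStatus
    {
        return match ($this) {
            self::Dns => ProvisioningStatus::DnsPending,
            self::Cert => ProvisioningStatus::CertPending,
            self::Health => ProvisioningStatus::HealthPending,
        };
    }
}
```

Because the failed step is an enum rather than free text, the match in resumeFrom() must cover every case, so a new failure branch cannot be added without also deciding where its retry resumes.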
2. Provider abstraction
Each external system is fronted by an interface in app/Contracts/Provisioning/. Concrete implementations live in app/Services/Provisioning/. Provider methods are idempotent: ensureRecord() creates-or-updates and never errors on already-exists.
Provider interfaces declare an explicit timeout contract per Doctrine #8 + 2026-04-22 library-author extension. Either:
- Constructor-required int $timeoutSeconds, OR
- Per-method ->withTimeout(int $seconds) returning a configured client.
Implementations must not inherit framework defaults. An arch test fails the build if a class under app/Services/Provisioning/* references Http:: or an HTTP client without a visible timeout configuration.
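A minimal sketch of the constructor-required variant, using an in-memory stand-in instead of a real provider SDK. The interface name DnsProvider, the ensureRecord() signature, and the generated record ids are hypothetical; the ADR names the method but not its parameters.

```php
<?php
declare(strict_types=1);

// Hypothetical contract; the ADR names ensureRecord() but not its signature.
interface DnsProvider
{
    /** Creates or updates the record and returns the provider-side record id. */
    public function ensureRecord(string $hostname, string $target): string;
}

// In-memory stand-in for a real implementation under app/Services/Provisioning/.
final class InMemoryDnsProvider implements DnsProvider
{
    /** @var array<string, array{id: string, target: string}> */
    private array $records = [];

    // Constructor-required timeout: the class cannot be built without one.
    public function __construct(public readonly int $timeoutSeconds)
    {
        if ($timeoutSeconds <= 0) {
            throw new InvalidArgumentException('timeoutSeconds must be positive');
        }
    }

    public function ensureRecord(string $hostname, string $target): string
    {
        // Already exists: update in place and keep the same id (never an error).
        if (isset($this->records[$hostname])) {
            $this->records[$hostname]['target'] = $target;
            return $this->records[$hostname]['id'];
        }

        $id = 'rec_' . (count($this->records) + 1);
        $this->records[$hostname] = ['id' => $id, 'target' => $target];

        return $id;
    }
}
```

Calling ensureRecord() twice for the same hostname returns the same id, which is the property a retried job relies on. A real implementation would pass timeoutSeconds to its HTTP client on every call.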
3. Single dispatch path
The Action that creates the provisioned entity is the only code path that writes the entity row. Sibling code paths (signup actions, tenant-creation actions, batch-import actions) delegate to it rather than writing directly. This is the canonical path the dispatch hook attaches to.
Where sibling code paths exist today writing the entity directly, refactor them to delegate as part of the territory's adoption campaign — not as a deferred follow-up. An arch test enforces the single-writer constraint per territory.
After the entity-creation transaction commits, the creation Action dispatches the queued provisioning job. Failure of the job does not roll back the entity row — provisioning failure is a state, not a deletion.
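The single-writer shape can be sketched with in-memory stand-ins for the table and the queue. The class names, the public arrays, and the example.test hostname suffix are hypothetical; a real territory would wrap step 1 in a database transaction and dispatch a queued job in step 2.

```php
<?php
declare(strict_types=1);

// Hypothetical single-writer Action; storage and queue are in-memory stand-ins.
final class CreateEntityAction
{
    /** @var array<int, array{hostname: string, provisioning_status: string}> */
    public array $rows = [];

    /** @var list<int> ids handed to the queue, in dispatch order */
    public array $dispatched = [];

    public function execute(string $hostname): int
    {
        // Step 1: write the row. In a real territory this runs inside a DB transaction.
        $id = count($this->rows) + 1;
        $this->rows[$id] = ['hostname' => $hostname, 'provisioning_status' => 'pending'];

        // Step 2: only after the write has committed, hand the id to the queue.
        $this->dispatched[] = $id;

        return $id;
    }
}

// Sibling paths delegate instead of writing rows themselves.
final class SignupAction
{
    public function __construct(private CreateEntityAction $createEntity)
    {
    }

    public function execute(string $tenantSlug): int
    {
        return $this->createEntity->execute($tenantSlug . '.example.test');
    }
}
```

With this shape the dispatch hook lives in exactly one place, so adding a new sibling path cannot silently skip provisioning.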
4. Retain-on-failure
Failed provisioning leaves the entity row with provisioning_status = 'failed' and with provisioning_failed_step and provisioning_last_error populated. The row is not deleted; the parent (e.g., the tenant) is not deleted. Retry recovers from failed to active without re-creating the row. Manual retry via an operator endpoint is always present from phase 1; automatic retry with bounded backoff is deferrable per phase.
Provisioning failures are external-system failures (DNS propagation slow, LE rate limits, provider API outage). Deleting the entity on failure conflates "transient external problem" with "operator decision to cancel". The user's signup or admin action commits independently of provisioning success.
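A sketch of what retain-on-failure and bounded backoff might look like together. The ProvisionedEntity class is an in-memory stand-in for the row, and the base delay, cap, and attempt budget are illustrative assumptions, not values from this ADR.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in for the entity row; columns follow section 1.
final class ProvisionedEntity
{
    public string $provisioning_status = 'pending';
    public ?string $provisioning_failed_step = null;
    public int $provisioning_attempts = 0;
    public ?string $provisioning_last_error = null;
}

/** Records the failure on the row instead of deleting it. */
function markFailed(ProvisionedEntity $entity, string $step, string $error): void
{
    $entity->provisioning_status = 'failed';
    $entity->provisioning_failed_step = $step;
    $entity->provisioning_last_error = $error;
    $entity->provisioning_attempts++;
}

/**
 * Exponential backoff with a cap. Returns null once the attempt budget is spent,
 * which leaves recovery to the manual operator endpoint.
 * The base, cap, and budget defaults are illustrative assumptions.
 */
function nextRetryDelaySeconds(int $attempts, int $base = 30, int $cap = 900, int $maxAttempts = 5): ?int
{
    if ($attempts >= $maxAttempts) {
        return null;
    }

    // Clamp at zero so the exponent is never negative.
    return min($cap, $base * (2 ** max(0, $attempts)));
}
```

With these defaults the delays run 30, 60, 120, 240, and 480 seconds before automatic retry stops; the row stays in place throughout, so the operator endpoint can still recover it.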
5. Audit logging mandatory (ADR-0001)
Every provisioning state transition emits a row to the entity's audit log. The audit log records the actor (system actor for queued-job-driven transitions), previous state, new state, failed step (if applicable), and a RequestContext (system context for queued jobs).
Audit emission is not optional — provisioning is the surface A.8.15 demands forensic visibility for. For territories adopting this ADR, the audit logger is a prerequisite, not a phase-2 polish.
Tenant-cascade deletes (where the parent's deletion implies the entity's deletion) must iterate-and-log per entity, not bulk-delete. Bulk-delete bypasses the entity audit logger and creates a forensic gap.
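The iterate-and-log rule can be illustrated with an in-memory audit sink. The class, the field names, and the actor strings are hypothetical, and the RequestContext is omitted for brevity; a real territory would write through its entity audit logger.

```php
<?php
declare(strict_types=1);

// Hypothetical audit sink; a real territory would write to its entity audit log.
final class InMemoryAuditLog
{
    /** @var list<array<string, ?string>> */
    public array $rows = [];

    public function record(string $entityId, string $actor, ?string $from, ?string $to, ?string $failedStep = null): void
    {
        $this->rows[] = [
            'entity_id' => $entityId,
            'actor' => $actor,
            'previous_state' => $from,
            'new_state' => $to,
            'failed_step' => $failedStep,
        ];
    }
}

/**
 * Cascade delete that logs one audit row per entity before removing it,
 * rather than issuing a single bulk delete.
 *
 * @param array<string, string> $entities entity id => current provisioning_status
 * @return array<string, string> the remaining entities (empty after a full cascade)
 */
function cascadeDelete(array $entities, InMemoryAuditLog $audit, string $actor): array
{
    foreach ($entities as $id => $status) {
        $audit->record((string) $id, $actor, $status, null);
        unset($entities[$id]);
    }

    return $entities;
}
```

Each deleted entity leaves its own audit row carrying the state it was in at deletion time, which is the history a bulk delete would lose.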
6. Authorization at the retry seam (ADR-0006)
The retry endpoint (POST /.../{entity}/{id}/provisioning/retry) is a Tier-1 ability — User + Entity, no extra runtime context. Implemented as a Policy method (retryProvisioning) on the entity's Policy class. The route declares ->can('retryProvisioning', '{entity}').
For territories without an existing Policy on the entity (kendo central Domain is the case in point), the Policy must be created as a prerequisite to the provisioning campaign. Provisioning does not introduce the first Policy for an entity in the same PR.
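A framework-free sketch of a Tier-1 Policy method. The User and Domain classes and the rule itself (central operators may retry failed domains) are assumptions for illustration; only the method name retryProvisioning and the User + Entity shape come from this ADR.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal models; real territories use their own User and entity classes.
final class User
{
    public function __construct(public readonly bool $isCentralOperator)
    {
    }
}

final class Domain
{
    public function __construct(public readonly string $provisioningStatus)
    {
    }
}

final class DomainPolicy
{
    /**
     * Tier-1: the decision uses only the User and the Entity, no request context.
     * The rule shown here is an illustrative assumption.
     */
    public function retryProvisioning(User $user, Domain $domain): bool
    {
        return $user->isCentralOperator && $domain->provisioningStatus === 'failed';
    }
}
```

If the decision ever needed data beyond these two arguments, that would be a sign the ability has drifted out of Tier-1.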
7. Feature flag rollout
A territory-level config key gates the dispatch:
    {TERRITORY}_PROVISIONING_ENABLED=true|false

When false, schema / API / UI ship but no external mutation occurs. This enables staging-first validation and production-ready-but-not-enabled deployments. The flag is removable once provisioning has soaked in production for the territory's defined burn-in period (territory chooses; default 30 days).
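One way the gate might be read and applied, sketched with a plain env array and an in-memory queue. The function names are hypothetical, and treating a missing or unparseable value as disabled is an assumption (fail closed), as is mapping hyphens in territory names to underscores.

```php
<?php
declare(strict_types=1);

/**
 * Reads a {TERRITORY}_PROVISIONING_ENABLED style flag from an env map.
 * Missing or unparseable values count as disabled (an assumption: fail closed).
 *
 * @param array<string, string> $env
 */
function provisioningEnabled(array $env, string $territory): bool
{
    // Assumed key normalisation: upper-case, hyphens become underscores.
    $key = strtoupper(str_replace('-', '_', $territory)) . '_PROVISIONING_ENABLED';
    $value = filter_var($env[$key] ?? 'false', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);

    return $value === true;
}

/**
 * Dispatches only when the flag is on; returns whether a job was queued.
 *
 * @param array<string, string> $env
 * @param list<int> $queue in-memory stand-in for the job queue
 */
function dispatchIfEnabled(array $env, string $territory, int $entityId, array &$queue): bool
{
    if (!provisioningEnabled($env, $territory)) {
        return false; // Row and API still exist; no external mutation happens.
    }

    $queue[] = $entityId;

    return true;
}
```

Keeping the check at the dispatch point means removing the flag later is a change to one function, not to every caller.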
Options Considered
| Option | Verdict | Reason |
|---|---|---|
| Synchronous provisioning in the create endpoint | Rejected | DNS propagation and cert issuance are non-deterministic and slow. Blocks the user's request, causes timeouts. Couples external-system failure to user-visible request failure. |
| Wildcard DNS + wildcard cert | Rejected (per kendo) | Loses per-tenant cert visibility. Removes the operator seam for "is this tenant routable?" Doesn't extend to BYOD. Some territories may revisit; default architecture is per-host. |
| Roll back entity on provisioning failure | Rejected | Conflates external-system transient failure with user cancel. Loses operator visibility. Loses retry capability. |
| Direct SDK calls in Actions, no provider interface | Rejected | Creates territory-by-territory drift. Hides the timeout-discipline surface. Provider migrations become refactors. |
| Single failed state with text error message | Rejected | Collapses recovery branches. DNS-fail and cert-fail need different recovery; a string field is operator-grep where a state-machine field is type-safe. |
| Audit opt-in per territory | Rejected | Two of four likely-adopters are ISO 27001 certified; one carries ISO 27001 + AVG + NEN 7510. Encoding doctrinal asymmetry across the alliance is unsafe. |
| Async + provider-abstracted + retain-on-failure + single-dispatch + audit-mandatory + Tier-1 retry + flag-gated | Accepted | Resolves all six observed risks. kendo first adopter; pattern transfers. |
Consequences
Positive
- Predictable failure semantics across territories.
- Provider migrations are interface-swaps, not refactors.
- Audit trail satisfies A.8.15 by construction (mandatory, enforced).
- Authorization is doctrinally consistent (Tier-1 per ADR-0006).
- Timeout discipline is interface-enforced, not convention-enforced.
- Operators see provisioning state explicitly; retries are first-class.
Negative
- Schema cost: 5–7 new columns on every provisioned-entity table.
- Single-dispatch refactor cost: territories with multiple direct-write paths (kendo signup + central-create) must refactor to delegate.
- Provider abstraction adds one layer of indirection above each external SDK.
- Feature-flag plumbing is additional config surface that must eventually be removed.
- Hot-path read cost: a provisioning_status predicate in tenant-resolution middleware (where applicable) makes a previously-covering index non-covering. Worth knowing, marginal at typical cardinality.
Risks
- External rate limits. Let's Encrypt (50 certs / registered domain / week), Cloudflare API (1200 / 5min / token), Fly API. Mitigation: observability counters on cert_pending→cert_active latency; alert on sustained queue depth; territory documents rate-limit caps in its CLAUDE.md.
- Certificate Transparency log enumeration. Per-host certs publish to CT logs, exposing tenant subdomain enumeration. Mitigation: documented trade-off; territories with strict tenant-existence privacy needs may need wildcard or explicit DECISIONS.md acknowledgement.
- Concurrency on retry. Two jobs running for one entity (auto-retry tick + operator manual retry). Mitigation: advisory lock or processing_started_at timestamp + skip-if-recent guard in the job's first step.
- Reserved-name claims pre-provisioning. Subdomain blocklists / reserved-prefix lists enforced asymmetrically across territory write surfaces become security holes when provisioning auto-issues real resources. Mitigation: enforcement of reserved-name lists is a prerequisite, not a phase-2 polish. Territories must verify all write surfaces consult the same list before enabling the provisioning flag.
- DNS/cert resource drift. Manual edits to provider-side records out of band. Mitigation: providers are idempotent (ensureRecord() reconciles); manual retry triggers reconciliation; periodic audit reconciliation deferred until rate of drift is observed.
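The skip-if-recent guard from the concurrency mitigation could look roughly like this. The ProvisioningRow class, the tryClaim() name, and the 300-second window are hypothetical. The sketch is in-memory only: a real implementation needs the check and the write to be atomic, for example through a conditional UPDATE or an advisory lock.

```php
<?php
declare(strict_types=1);

// Hypothetical row holding the processing_started_at column named in the mitigation.
final class ProvisioningRow
{
    public ?int $processing_started_at = null; // Unix timestamp, null when idle
}

/**
 * Skip-if-recent guard for the job's first step.
 * Returns true when this run may proceed, false when another run claimed the row
 * within the window. The 300-second default is an illustrative assumption.
 */
function tryClaim(ProvisioningRow $row, int $now, int $windowSeconds = 300): bool
{
    $startedAt = $row->processing_started_at;
    if ($startedAt !== null && ($now - $startedAt) < $windowSeconds) {
        return false; // A recent run holds the claim; this run exits without side effects.
    }

    $row->processing_started_at = $now;

    return true;
}
```

A stale claim expires on its own once the window passes, so a crashed job does not block retries indefinitely.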
Enforcement
| What | Mechanism | Scope |
|---|---|---|
| Provider classes declare a timeout contract | Pest arch test, territory-local (candidate for phpstan-warroom-rules Phase 2) | app/Contracts/Provisioning/* and app/Services/Provisioning/* |
| Single-writer constraint on provisioned entities | Pest arch test, territory-local | Models for provisioned entities |
| Audit logger exists for provisioned entities | Pest arch test, extends the tests/Arch/AuditTest.php pattern | app/Audit/{Entity}AuditLogger |
| Tier-1 Policy method exists for retry endpoint | tests/Arch/RoutesAuthorizationTest.php (extended for central / equivalent surfaces per kendo distress signal) | Routes matching */provisioning/retry |
| Provisioning state column present on provisioned entity | Pest arch test, territory-local | Migration history |
Resolved Questions
Why retain-on-failure rather than transactional rollback?
Resolved 2026-05-01. Provisioning failure is an external-system condition, not a user-input condition. Rolling back the entity (and parent) on provisioning failure conflates "Cloudflare returned 503" with "user cancelled signup". The user's request committed; the external system's state is what's incomplete. Keeping the entity row preserves operator visibility (failed entities are visible and retryable) and decouples user-visible flow from provider availability.
Why mandatory audit, not opt-in per territory?
Resolved 2026-05-01. Provisioning is exactly A.8.15 surface (security-relevant configuration changes) and at least three of the four likely-adopter territories are under ISO 27001. Making audit opt-in encodes a doctrinal asymmetry across the alliance. Mandatory audit at the architectural level avoids that.
Why Tier-1 retry, not Tier-2?
Resolved 2026-05-01. Retry is a User + Entity decision: "can this central operator retry this domain's provisioning?" No runtime data beyond the route bindings is required. Per ADR-0006's rule of thumb, that is Tier-1 (Policy, route-level ->can()). Tier-2 would be appropriate if retry depended on, say, the originating request's tenant context — it doesn't.
Why single-dispatch (refactor signup/create-tenant to delegate), rather than event-driven?
Resolved 2026-05-01. Two of three production writers in kendo bypass the dedicated CreateDomainAction today. An event-driven hook would preserve the bypass shape (three emitters, one listener) and accept a permanent three-call-site dispatch contract policed by tests. The Commander chose the refactor: paying the cost now to land the canonical seam beats deferring and accepting drift risk forever. Trade-off: the provisioning campaign's blast radius widens to include two well-tested production write paths. Trade accepted.
Implementation
| Territory | State | Notes |
|---|---|---|
| kendo | Not Started | First adopter. PR #1028 (KD-0580) being re-planned against this ADR. Prerequisite work (DomainAuditLogger, DomainPolicy + RoutesAuthorizationTest central-route extension, KD-0596 reserved-subdomain) sequenced ahead. |
| ublgenie | Not Started | Tenant onboarding will adopt when surfaced by reconnaissance. Cartographer pass needed before scoping. |
| emmie | Not Started | CreateAwsBucket.php exists pre-ADR; revisit for adoption when tenant onboarding flow is reformed. |
| daymate-api | Not Started | Speculative. Will revisit if tenant-resource provisioning becomes a campaign type. |
| Other territories | Not in scope | No tenant-resource provisioning surface today. |