ADR-0024: Automated External Provisioning

Accepted Cross-Project Universal

Date: 2026-05-01 (amended 2026-05-20)

Compliance: ISO 27001 (A.8.15) — applicable in territories under that regime.

Amendment 2026-05-20

The original 2026-05-01 ADR specified retain-on-failure semantics, requiring a manual operator retry endpoint and a Tier-1 retryProvisioning policy ability. On 2026-05-20 the rule was flipped to rollback-on-failure to align canonical doctrine with the production behavior already implemented in kendo's SignupAction (which pre-dated the ADR), and to favor prospect-facing self-service retry over operator-facing failure-row visibility. The §4 Failure handling, §6 Authorization, Options Considered table, Enforcement table, and Resolved Questions sections were rewritten. See campaigns/cross-territory/2026-05-20-adr-0024-rollback-semantics-amendment.md for the deliberation record.

Context

Several territories provision external resources (DNS records, TLS certificates, cloud buckets, third-party API resources) when a domain entity is created or modified. The pattern recurs:

kendo — tenant signup creates Domain rows that need DNS records (Cloudflare) and TLS certificates (Fly) before the tenant can serve HTTPS traffic. Active campaign: PR #1028 / KD-0580.
ublgenie — tenant onboarding writes per-tenant resources to Fly and equivalent infrastructure.
emmie — per-tenant Cloudflare R2 / AWS S3 buckets created at tenant onboarding (CreateAwsBucket.php already exists; not currently behind a provisioning state machine).
daymate-api — speculative future adopter.

Without a shared architecture pattern, each territory invents its own version. The kendo PR #1028 critique surfaced six recurring risks in one campaign:

Inconsistent failure semantics. Does provisioning failure roll back the parent record? Cancel the user's onboarding? Leave a half-state?
No retry contract. Manual retry, automatic retry, both? Bounded by what?
Authorization at the retry seam. Tier-1 vs Tier-2 (per ADR-0006) decided ad-hoc per territory.
Audit trail divergence. Provisioning is exactly the surface ISO 27001 A.8.15 demands history for; each territory may add or omit audit logging differently.
Timeout discipline drift. Doctrine #8 (war-room) requires explicit per-call ->timeout() on external HTTP. Easy to forget when the provider is hidden behind a generic SDK.
Provider lock-in. Direct SDK calls in Actions make territory migrations (e.g., Cloudflare → Route53) painful.

This ADR captures the architecture before kendo locks the implementation, so the next territory adopts the pattern instead of re-inventing it.

Decision

External provisioning operations are async, provider-abstracted, rollback-on-failure, single-dispatch, audit-mandatory, and flag-gated. The shape:

1. State machine on the provisioned entity

The entity that owns the external resource carries provisioning lifecycle state in its own row. Minimum schema (territory adapts column names; semantics are universal):

provisioning_status         enum / string  // pending → dns_pending → dns_active → cert_pending → cert_active → health_pending → active; failed terminal
provisioning_failed_step    enum / string nullable  // null when not failed; identifies which step terminated
provisioning_attempts       unsigned int default 0
provisioning_last_error     text nullable
provisioned_at              timestamp nullable
{provider}_record_id        string nullable
{provider}_certificate_id   string nullable

A single failed state with a string last_error is rejected: recovery branches differently for DNS-fail vs cert-fail vs health-fail. The provisioning_failed_step column makes the recovery path type-safe.
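The lifecycle above can be sketched as a pair of PHP 8.1 backed enums. This is a minimal illustration, not the territory's implementation: the type names (ProvisioningStatus, FailedStep), the step values (dns, cert, health), and the rule that failed is reachable from any in-flight state are assumptions. Only the state sequence itself comes from the schema.

```php
<?php

declare(strict_types=1);

// Hypothetical type names; the ADR fixes the state sequence, not the PHP types.
enum ProvisioningStatus: string
{
    case Pending = 'pending';
    case DnsPending = 'dns_pending';
    case DnsActive = 'dns_active';
    case CertPending = 'cert_pending';
    case CertActive = 'cert_active';
    case HealthPending = 'health_pending';
    case Active = 'active';
    case Failed = 'failed';

    /** Next state on the success path, or null when there is none. */
    public function next(): ?self
    {
        return match ($this) {
            self::Pending => self::DnsPending,
            self::DnsPending => self::DnsActive,
            self::DnsActive => self::CertPending,
            self::CertPending => self::CertActive,
            self::CertActive => self::HealthPending,
            self::HealthPending => self::Active,
            self::Active, self::Failed => null,
        };
    }

    /** Legal moves: one step along the success path, or into failed from an in-flight state (assumed). */
    public function canTransitionTo(self $target): bool
    {
        if ($target === self::Failed) {
            return $this !== self::Active && $this !== self::Failed;
        }

        return $this->next() === $target;
    }
}

// Hypothetical values for provisioning_failed_step; null in the column means "not failed".
enum FailedStep: string
{
    case Dns = 'dns';
    case Cert = 'cert';
    case Health = 'health';
}
```

A typed step lets recovery or audit code branch with a match expression instead of parsing provisioning_last_error.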

2. Provider abstraction

Each external system is fronted by an interface in app/Contracts/Provisioning/. Concrete implementations live in app/Services/Provisioning/. Provider methods are idempotent: ensureRecord() creates-or-updates and never errors on already-exists.

Provider interfaces declare an explicit timeout contract per Doctrine #8 and the 2026-04-22 library-author extension. Either:

Constructor-required int $timeoutSeconds, OR
Per-method ->withTimeout(int $seconds) returning a configured client.

Implementations must not inherit framework defaults. An arch test fails the build if a class under app/Services/Provisioning/* references Http:: or an HTTP client without a visible timeout configuration.
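A minimal sketch of the constructor-required variant, using an in-memory stand-in rather than a real provider SDK. The interface name DnsProvider, the ensureRecord() signature, and the InMemoryDnsProvider class are assumptions for illustration; the ADR specifies only that ensureRecord() creates-or-updates and that the timeout must be explicit.

```php
<?php

declare(strict_types=1);

// Hypothetical contract; the ADR names ensureRecord() but not its signature.
interface DnsProvider
{
    /** Creates or updates the record and returns the provider-side record id. */
    public function ensureRecord(string $hostname, string $target): string;
}

// In-memory stand-in that illustrates idempotency and the explicit timeout only.
final class InMemoryDnsProvider implements DnsProvider
{
    /** @var array<string, array{id: string, target: string}> */
    private array $records = [];

    public function __construct(public readonly int $timeoutSeconds)
    {
        // No default value: callers cannot fall back to a framework timeout.
        if ($timeoutSeconds <= 0) {
            throw new InvalidArgumentException('An explicit positive timeout is required.');
        }
    }

    public function ensureRecord(string $hostname, string $target): string
    {
        if (isset($this->records[$hostname])) {
            // Already exists: reconcile the target instead of raising an error.
            $this->records[$hostname]['target'] = $target;

            return $this->records[$hostname]['id'];
        }

        $id = 'rec_' . (count($this->records) + 1);
        $this->records[$hostname] = ['id' => $id, 'target' => $target];

        return $id;
    }

    public function recordCount(): int
    {
        return count($this->records);
    }
}

$provider = new InMemoryDnsProvider(timeoutSeconds: 10);
$first = $provider->ensureRecord('acme.example.test', '203.0.113.10');
$second = $provider->ensureRecord('acme.example.test', '203.0.113.10');
// Both calls return the same record id, and only one record exists.
```

A real implementation would pass timeoutSeconds to every outbound HTTP call; the stand-in only shows that the value cannot be omitted.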

3. Single dispatch path

The Action that creates the provisioned entity is the only code path that writes the entity row. Sibling code paths (signup actions, tenant-creation actions, batch-import actions) delegate to it rather than writing directly. This is the canonical path the dispatch hook attaches to.

Where sibling code paths exist today writing the entity directly, refactor them to delegate as part of the territory's adoption campaign — not as a deferred follow-up. An arch test enforces the single-writer constraint per territory.

After the entity-creation transaction commits, the creation Action dispatches the queued provisioning job. The job advances the state machine for successful and in-flight transitions; terminal failure triggers a rollback of the entity row (and parent, where coupled).
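The ordering described above (write inside the transaction, dispatch only after commit) can be sketched in plain PHP. Everything here is hypothetical scaffolding: CreateEntityAction, the two injected closures, and the in-memory $rows array stand in for the territory's real Action, database transaction, and queue.

```php
<?php

declare(strict_types=1);

// Hypothetical single writer: the only place that creates the entity row.
final class CreateEntityAction
{
    /** @var array<int, array{hostname: string, provisioning_status: string}> */
    public array $rows = [];

    public function __construct(
        private Closure $transaction, // runs the given closure inside a transaction
        private Closure $dispatch,    // queues the provisioning job for an entity id
    ) {
    }

    public function execute(string $hostname): int
    {
        // The row is written inside the transaction.
        $id = ($this->transaction)(function () use ($hostname): int {
            $id = count($this->rows) + 1;
            $this->rows[$id] = ['hostname' => $hostname, 'provisioning_status' => 'pending'];

            return $id;
        });

        // The job is dispatched only after the transaction closure has returned (committed).
        ($this->dispatch)($id);

        return $id;
    }
}

$events = [];

$transaction = function (Closure $work) use (&$events): mixed {
    $events[] = 'begin';
    $result = $work();
    $events[] = 'commit';

    return $result;
};

$dispatch = function (int $entityId) use (&$events): void {
    $events[] = "dispatch:$entityId";
};

$action = new CreateEntityAction($transaction, $dispatch);
$id = $action->execute('acme.example.test');
// $events is now ['begin', 'commit', 'dispatch:1'].
```

Sibling paths such as signup would call execute() rather than writing the row themselves, which is what keeps the dispatch hook on a single seam.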

4. Rollback-on-failure

Terminal provisioning failure deletes the entity row inside a transaction. Where the entity's lifecycle is coupled to a parent (e.g., the tenant whose first signup created the domain), the rollback also deletes the parent rows and drops any provisioned tenant-scoped resources (databases, buckets). The resource teardown runs in a finally block so that it completes even if the central deletion itself fails. The caller sees the original exception; the rolled-back signup or admin action leaves no central rows behind. Both rollback paths log a structured error with the entity identifier (e.g., subdomain) and the exception before deleting.

Provisioning failures from the user-facing perspective become fully reversible: the same subdomain and email become immediately available again, and the prospect simply retries the signup form. The central tables never accumulate orphaned tenant/domain rows; there is no operator triage queue of stuck failed records.

The state machine still drives the lifecycle for successful and in-flight transitions (pending → dns_pending → dns_active → cert_pending → cert_active → health_pending → active). A failed terminal state exists transiently before the rollback runs — long enough for the rollback transaction's audit-log row to record what step terminated, but not long enough to require operator triage. Provider idempotency (ensureRecord()) means a fresh signup attempting the same subdomain re-runs the steps cleanly without colliding with leftover provider-side state from the rolled-back attempt.
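The rollback ordering in this section (log first, delete central rows, tear down tenant resources in a finally block, rethrow the original exception) can be sketched as follows. The function name and the three callables are hypothetical; in a territory they would be the structured logger, the audit-emitting delete Action wrapped in a transaction, and the tenant-resource teardown.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical rollback helper. The callables stand in for the territory's
 * logger, its audit-emitting delete Action, and its tenant-resource teardown.
 */
function rollBackProvisioning(
    string $subdomain,
    string $failedStep,
    Throwable $cause,
    callable $logError,
    callable $deleteCentralRows,
    callable $dropTenantResources,
): never {
    // Log before deleting so the failed step survives the rollback.
    $logError('provisioning rolled back', [
        'subdomain' => $subdomain,
        'failed_step' => $failedStep,
        'exception' => $cause->getMessage(),
    ]);

    try {
        $deleteCentralRows($subdomain);
    } finally {
        // Runs even when the central deletion throws.
        $dropTenantResources($subdomain);
    }

    // On a clean rollback the caller sees the original exception.
    throw $cause;
}

$events = [];

try {
    rollBackProvisioning(
        'acme',
        'cert',
        new RuntimeException('certificate issuance timed out'),
        function (string $message, array $context) use (&$events): void {
            $events[] = 'log:' . $context['failed_step'];
        },
        function (string $subdomain) use (&$events): void {
            $events[] = 'delete-central';
        },
        function (string $subdomain) use (&$events): void {
            $events[] = 'drop-tenant';
        },
    );
} catch (RuntimeException $e) {
    $events[] = 'caught:' . $e->getMessage();
}
```

In this sketch, if the central deletion itself throws, the teardown still runs and the deletion's exception propagates in place of the original one.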

5. Audit logging mandatory (ADR-0001)

Every provisioning state transition emits a row to the entity's audit log, including the deletion row written during the rollback. The audit log records the actor (system actor for queued-job-driven transitions), previous state, new state, failed step (if applicable), and a RequestContext (system context for queued jobs). Under rollback semantics, the audit row that captures the failure is the row emitted by the rollback's DeleteEntityAction invocation — the audit logger sees a final transition out of failed (or the last in-flight state) into deletion, with the failed step preserved in the snapshot. Audit emission is not optional — provisioning is the surface A.8.15 demands forensic visibility for. For territories adopting this ADR, the audit logger is a prerequisite, not a phase-2 polish.

Operator forensic workflow under rollback semantics: rolled-back signups leave no central row to inspect, so the audit log + structured LogManager::error from the rollback handler are the only forensic signals. The rollback handler's log payload must name the failed provisioning step (the column populated transiently in the failed state before rollback) so that operators can correlate audit rows with log lines. This is a phase-3 implementation detail; the audit-log row alone preserves the A.8.15 forensic property.

Tenant-cascade deletes (where the parent's deletion implies the entity's deletion) must iterate-and-log per entity, not bulk-delete. Bulk-delete bypasses the entity audit logger and creates a forensic gap. The rollback path inherits this constraint — it must invoke the entity's Delete*Action (or equivalent audit-emitting path) per row, not call $model->delete() directly.
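The iterate-and-log constraint can be illustrated with a small in-memory sketch. The AuditLog class, the field names, the 'deleted' target state, and the 'system' / 'queued-job' values are assumptions for illustration; the ADR prescribes which facts the audit row carries, not how they are stored.

```php
<?php

declare(strict_types=1);

// Hypothetical in-memory audit log; a real one would persist rows per ADR-0001.
final class AuditLog
{
    /** @var list<array<string, ?string>> */
    public array $rows = [];

    public function record(
        string $entityId,
        string $actor,
        string $previousState,
        string $newState,
        ?string $failedStep,
        string $context,
    ): void {
        $this->rows[] = [
            'entity_id' => $entityId,
            'actor' => $actor,
            'previous_state' => $previousState,
            'new_state' => $newState,
            'failed_step' => $failedStep,
            'context' => $context,
        ];
    }
}

/**
 * Stand-in for an audit-emitting Delete*Action: one audit row per entity.
 *
 * @param array{id: string, provisioning_status: string, provisioning_failed_step: ?string} $entity
 */
function deleteEntityWithAudit(array $entity, AuditLog $audit): void
{
    $audit->record(
        $entity['id'],
        'system',
        $entity['provisioning_status'],
        'deleted',
        $entity['provisioning_failed_step'],
        'queued-job',
    );
    // The actual row deletion would happen here, after the audit row is written.
}

/** Cascade by iterating, never by bulk delete, so every entity gets its own audit row. */
function cascadeDelete(array $entities, AuditLog $audit): void
{
    foreach ($entities as $entity) {
        deleteEntityWithAudit($entity, $audit);
    }
}

$audit = new AuditLog();
cascadeDelete([
    ['id' => 'dom_1', 'provisioning_status' => 'failed', 'provisioning_failed_step' => 'cert'],
    ['id' => 'dom_2', 'provisioning_status' => 'active', 'provisioning_failed_step' => null],
], $audit);
// Two entities deleted, two audit rows written; the failed step is preserved in the first.
```

A bulk delete would remove both rows with one statement and write no audit rows at all, which is the forensic gap this section rules out.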

6. Authorization

Under rollback semantics there is no operator retry endpoint — failed provisioning self-cleans, and the prospect retries via the public signup form. Tier-1 retry abilities introduced for the previous (retain-on-failure) version of this ADR (e.g., kendo's DomainPolicy::retryProvisioning) are dead code under the amended rule and should be removed alongside the doctrine flip in the adopting territory.

Standard ADR-0006 authorization applies to the entity's CRUD surface (create/read/update/delete on /central/domains or equivalent) — that is unchanged from the entity's pre-provisioning state and is not specific to this ADR.

7. Feature flag rollout

A territory-level config key gates the dispatch:

{TERRITORY}_PROVISIONING_ENABLED=true|false

When false, schema / API / UI ship but no external mutation occurs. Enables staging-first validation and production-ready-but-not-enabled deployments. The flag is removable once provisioning has soaked in production for the territory's defined burn-in period (territory chooses; default 30 days).
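A minimal sketch of reading the flag, assuming the configuration arrives as a plain key/value array. The helper names and the normalization of a territory name such as daymate-api into an upper-case, underscore-separated key are assumptions; the ADR defines only the {TERRITORY}_PROVISIONING_ENABLED pattern. A missing key is treated as false here so that provisioning stays off unless explicitly enabled.

```php
<?php

declare(strict_types=1);

/** Builds the config key for a territory (normalization rule is an assumption). */
function provisioningFlagKey(string $territory): string
{
    return strtoupper(str_replace('-', '_', $territory)) . '_PROVISIONING_ENABLED';
}

/**
 * Returns true only when the flag is present and truthy.
 *
 * @param array<string, string> $env
 */
function provisioningEnabled(string $territory, array $env): bool
{
    $raw = $env[provisioningFlagKey($territory)] ?? 'false';

    // FILTER_VALIDATE_BOOLEAN maps "true", "1", "on", "yes" to true and anything else to false.
    return filter_var($raw, FILTER_VALIDATE_BOOLEAN);
}

$env = [
    'KENDO_PROVISIONING_ENABLED' => 'true',
    'EMMIE_PROVISIONING_ENABLED' => 'false',
];

$dispatched = [];
foreach (['kendo', 'emmie', 'daymate-api'] as $territory) {
    // Schema, API and UI ship regardless; only the external mutation is gated.
    if (provisioningEnabled($territory, $env)) {
        $dispatched[] = $territory;
    }
}
// Only kendo dispatches: emmie is explicitly off and daymate-api has no key.
```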

Options Considered

| Option | Verdict | Reason |
| --- | --- | --- |
| Synchronous provisioning in the create endpoint | Rejected | DNS propagation and cert issuance are non-deterministic and slow. Blocks the user's request, causes timeouts. Couples external-system failure to user-visible request failure. |
| Wildcard DNS + wildcard cert | Rejected (per kendo) | Loses per-tenant cert visibility. Removes the operator seam for "is this tenant routable?" Doesn't extend to BYOD. Some territories may revisit; default architecture is per-host. |
| Retain entity on provisioning failure (`failed` state + operator retry endpoint) | Rejected (2026-05-20 amendment) | Preserves operator visibility on failed signups but creates an operator triage queue (stuck rows) and forces prospects through an operator-mediated recovery path instead of self-service re-signup. ISO 27001 A.8.15 forensic visibility is achievable via audit log + structured-log trail without the operator-facing `failed` row. Trade favored prospect self-service over operator visibility. |
| Roll back entity on provisioning failure | Accepted (2026-05-20 amendment) | Provisioning failure becomes fully reversible: the same subdomain and email become immediately available again, prospect retries via the public form, central tables accumulate no orphaned rows. State machine still drives successful and in-flight transitions; audit log captures the deletion row to satisfy A.8.15. Aligns with the production behavior kendo's `SignupAction` already implemented (pre-ADR). |
| Direct SDK calls in Actions, no provider interface | Rejected | Creates territory-by-territory drift. Hides the timeout-discipline surface. Provider migrations become refactors. |
| Single `failed` state with text error message | Rejected | Collapses recovery branches. DNS-fail and cert-fail need different recovery; a string field is operator-grep where a state-machine field is type-safe (still relevant in the transient pre-rollback `failed` state and in audit-log snapshots). |
| Audit opt-in per territory | Rejected | Two of four likely-adopters are ISO 27001 certified; one carries ISO 27001 + AVG + NEN 7510. Encoding doctrinal asymmetry across the alliance is unsafe. |
| Async + provider-abstracted + rollback-on-failure + single-dispatch + audit-mandatory + flag-gated | Accepted (amended 2026-05-20) | Resolves all six observed risks while favoring prospect self-service recovery. kendo first adopter; pattern transfers. |

Consequences

Positive

Predictable failure semantics across territories.
Provider migrations are interface-swaps, not refactors.
Audit trail satisfies A.8.15 by construction (mandatory, enforced).
Timeout discipline is interface-enforced, not convention-enforced.
Failed signups are self-cleaning: no orphaned central rows, no operator triage queue, prospect self-service retry via the public form.

Negative

Schema cost: 5–7 new columns on every provisioned-entity table. Some columns (provisioning_failed_step, provisioning_last_error) carry data transiently before rollback and become forensic-snapshot fodder via the audit log, rather than load-bearing operator-query columns.
Single-dispatch refactor cost: territories with multiple direct-write paths (kendo signup + central-create) must refactor to delegate.
Provider abstraction adds one layer of indirection above each external SDK.
Feature-flag plumbing is additional config surface that must eventually be removed.
Hot-path read cost: a provisioning_status predicate in tenant-resolution middleware (where applicable) makes a previously-covering index non-covering. The cost is worth knowing but marginal at typical cardinality.
Operator forensic visibility on failed signups is log-trail-only. No persistent central row to inspect after rollback — operators correlate the audit-log deletion row with the structured LogManager::error from the rollback handler. Workable for typical ISO 27001 A.8.15 forensic queries; harder for "show me all the failed signups in the last week" dashboards.

Risks

External rate limits. Let's Encrypt (50 certs / registered domain / week), Cloudflare API (1200 / 5min / token), Fly API. Under rollback semantics, rapid prospect re-signup attempts on the same subdomain can compound LE rate-limit pressure (every failed-then-retried attempt issues a fresh cert request). Mitigation: observability counters on cert_pending → cert_active latency; alert on sustained queue depth and on per-subdomain retry rate; territory documents rate-limit caps in its CLAUDE.md.
Certificate Transparency log enumeration. Per-host certs publish to CT logs, exposing tenant subdomain enumeration. Mitigation: documented trade-off; territories with strict tenant-existence privacy needs may need wildcard or explicit DECISIONS.md acknowledgement.
Reserved-name claims pre-provisioning. Subdomain blocklists / reserved-prefix lists enforced asymmetrically across territory write surfaces become security holes when provisioning auto-issues real resources. Mitigation: enforcement of reserved-name lists is a prerequisite, not a phase-2 polish — territories must verify all write surfaces consult the same list before enabling the provisioning flag.
DNS/cert resource drift. Manual edits to provider-side records out of band. Mitigation: providers are idempotent (ensureRecord() reconciles); a rolled-back-then-re-signed-up subdomain re-runs the steps cleanly via provider idempotency. Periodic audit reconciliation of orphaned provider-side records (where rollback succeeded centrally but provider-side cleanup was skipped) is deferred until rate of drift is observed.
Rollback transaction failure. The rollback path itself can fail (central DB unavailable, tenant-DB DROP failing because of an active connection). Mitigation: the rollback handler runs the central deletion in a transaction and the tenant-resource teardown (e.g., DROP DATABASE) in a finally block so the resource teardown completes even if the central deletion fails — the failure mode degrades to "central row remains, tenant resource gone" rather than "tenant resource remains, central row gone", which is the safer asymmetry for ISO 27001 A.8.15 (audit row exists; resource does not).

Enforcement

| What | Mechanism | Scope |
| --- | --- | --- |
| Provider classes declare a timeout contract | Pest arch test, territory-local (candidate for `phpstan-warroom-rules` Phase 2) | `app/Contracts/Provisioning/` and `app/Services/Provisioning/` |
| Single-writer constraint on provisioned entities | Pest arch test, territory-local | Models for provisioned entities |
| Audit logger exists for provisioned entities | Pest arch test, extends the `tests/Arch/AuditTest.php` pattern | `app/Audit/{Entity}AuditLogger` |
| Provisioning state column present on provisioned entity | Pest arch test, territory-local | Migration history |
| Rollback path delegates to entity `Delete*Action` (no direct `$model->delete()` in rollback handlers) | Pest arch test, territory-local | Signup / tenant-create / equivalent creation Actions that may roll back |
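The last rule in the table can be approximated with a plain source scan. This is a rough sketch in framework-free PHP, not the Pest arch test the table refers to: the function name and the regular expression are assumptions, and a textual match will miss deletes routed through helpers or query builders.

```php
<?php

declare(strict_types=1);

/**
 * Returns the 1-based line numbers where a variable's delete() method is called directly.
 *
 * @return list<int>
 */
function findDirectModelDeletes(string $source): array
{
    $hits = [];

    foreach (explode("\n", $source) as $index => $line) {
        // Matches "$anything->delete(" but not "$this->deleteDomainAction->execute(".
        if (preg_match('/\$\w+\s*->\s*delete\s*\(/', $line) === 1) {
            $hits[] = $index + 1;
        }
    }

    return $hits;
}

// Hypothetical rollback handler bodies, held as strings for the scan.
$compliant = '$this->deleteDomainAction->execute($domain);';
$violating = implode("\n", [
    '$tenant = $domain->tenant;',
    '$domain->delete();',
]);

$compliantHits = findDirectModelDeletes($compliant);
$violatingHits = findDirectModelDeletes($violating);
// The compliant handler yields no hits; the violating one is flagged on line 2.
```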

Resolved Questions

Why rollback-on-failure rather than retain?

Resolved 2026-05-20 (amendment). The original 2026-05-01 ADR specified retain-on-failure to preserve operator visibility on stuck rows. In practice the kendo SignupAction had always rolled back (pre-dating the ADR), and the trade-off favored prospect self-service: under rollback semantics, a failed signup is fully reversible — the same subdomain and email become immediately available again, the prospect retries via the public form, and the central tables accumulate no orphaned rows. ISO 27001 A.8.15 forensic visibility is satisfied via the audit-log deletion row + structured LogManager::error from the rollback handler. Operator-facing dashboards lose the "list all failed signups" query but gain a zero-triage queue.

The trade-off the Commander explicitly accepted: lose persistent operator visibility on rolled-back failures (forensic visibility moves to log+audit trail) in exchange for prospect self-service recovery and zero orphan accumulation.

Why mandatory audit, not opt-in per territory?

Resolved 2026-05-01. Provisioning is exactly the A.8.15 surface (security-relevant configuration changes), and at least three of the four likely-adopter territories are under ISO 27001. Making audit opt-in encodes a doctrinal asymmetry across the alliance. Mandatory audit at the architectural level avoids that.

Why single-dispatch (refactor signup/create-tenant to delegate), rather than event-driven?

Resolved 2026-05-01. Two of three production writers in kendo bypass the dedicated CreateDomainAction today. An event-driven hook would preserve the bypass shape (three emitters, one listener) and accept a permanent three-call-site dispatch contract policed by tests. The Commander chose the refactor: paying the cost now to land the canonical seam beats deferring and accepting drift risk forever. Trade-off: the provisioning campaign's blast radius widens to include two well-tested production write paths. Trade accepted. Note (2026-05-20 amendment): the refactor target still stands under rollback semantics; the single-writer arch test is the canonical seam regardless of failure-handling rule. What changes is that the rollback path in the refactored signup must delegate the deletion to DeleteDomainAction (or equivalent audit-emitting path), not call $model->delete() directly.

Implementation

| Territory | State | Notes |
| --- | --- | --- |
| kendo | Not Started | First adopter. PR #1028 (KD-0580) being re-planned against this ADR. Prerequisite work (DomainAuditLogger, DomainPolicy + RoutesAuthorizationTest central-route extension, KD-0596 reserved-subdomain) sequenced ahead. |
| ublgenie | Not Started | Tenant onboarding will adopt when surfaced by reconnaissance. Cartographer pass needed before scoping. |
| emmie | Not Started | `CreateAwsBucket.php` exists pre-ADR; revisit for adoption when tenant onboarding flow is reformed. |
| daymate-api | Not Started | Speculative. Will revisit if tenant-resource provisioning becomes a campaign type. |
| Other territories | Not in scope | No tenant-resource provisioning surface today. |
