ADR-0024: Automated External Provisioning

Accepted Cross-Project Universal

Date: 2026-05-01

Compliance: ISO 27001 (A.8.15) — applicable in territories under that regime.

Context

Several territories provision external resources (DNS records, TLS certificates, cloud buckets, third-party API resources) when a domain entity is created or modified. The pattern recurs:

  • kendo — tenant signup creates Domain rows that need DNS records (Cloudflare) and TLS certificates (Fly) before the tenant can serve HTTPS traffic. Active campaign: PR #1028 / KD-0580.
  • ublgenie — tenant onboarding writes per-tenant resources to Fly and equivalent infrastructure.
  • emmie — per-tenant Cloudflare R2 / AWS S3 buckets created at tenant onboarding (CreateAwsBucket.php already exists; not currently behind a provisioning state machine).
  • daymate-api — speculative future adopter.

Without a shared architecture pattern, each territory invents its own version. The kendo PR #1028 critique surfaced six recurring risks in one campaign:

  1. Inconsistent failure semantics. Does provisioning failure roll back the parent record? Cancel the user's onboarding? Leave a half-state?
  2. No retry contract. Manual retry, automatic retry, both? Bounded by what?
  3. Authorization at the retry seam. Tier-1 vs Tier-2 (per ADR-0006) decided ad-hoc per territory.
  4. Audit trail divergence. Provisioning is exactly the surface ISO 27001 A.8.15 demands history for; each territory may add or omit audit logging differently.
  5. Timeout discipline drift. Doctrine #8 (war-room) requires explicit per-call ->timeout() on external HTTP. Easy to forget when the provider is hidden behind a generic SDK.
  6. Provider lock-in. Direct SDK calls in Actions make territory migrations (e.g., Cloudflare → Route53) painful.

This ADR captures the architecture before kendo locks the implementation, so the next territory adopts the pattern instead of re-inventing it.

Decision

External provisioning operations are async, provider-abstracted, retain-on-failure, single-dispatch, audit-mandatory, Tier-1 retry, flag-gated. The shape:

1. State machine on the provisioned entity

The entity that owns the external resource carries provisioning lifecycle state in its own row. Minimum schema (territory adapts column names; semantics are universal):

provisioning_status         enum / string  // pending → dns_pending → dns_active → cert_pending → cert_active → health_pending → active; failed terminal
provisioning_failed_step    enum / string nullable  // null when not failed; identifies which step terminated
provisioning_attempts       unsigned int default 0
provisioning_last_error     text nullable
provisioned_at              timestamp nullable
{provider}_record_id        string nullable
{provider}_certificate_id   string nullable

A single failed state with a free-text provisioning_last_error is rejected: recovery branches differently for DNS-fail vs cert-fail vs health-fail. provisioning_failed_step makes the recovery path type-safe.
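The lifecycle above can be sketched as a pair of backed enums. This is a minimal illustration, not the kendo implementation: the names ProvisioningStatus, FailedStep, next() and resumeFrom() are hypothetical, and the three failed-step values (dns, cert, health) are an assumption read off the DNS-fail / cert-fail / health-fail branches named in the text.

```php
<?php
declare(strict_types=1);

// Hypothetical names; the ADR fixes the state values, not the PHP types.
enum ProvisioningStatus: string
{
    case Pending = 'pending';
    case DnsPending = 'dns_pending';
    case DnsActive = 'dns_active';
    case CertPending = 'cert_pending';
    case CertActive = 'cert_active';
    case HealthPending = 'health_pending';
    case Active = 'active';
    case Failed = 'failed';

    /** Next state on the happy path, or null when there is none. */
    public function next(): ?self
    {
        return match ($this) {
            self::Pending => self::DnsPending,
            self::DnsPending => self::DnsActive,
            self::DnsActive => self::CertPending,
            self::CertPending => self::CertActive,
            self::CertActive => self::HealthPending,
            self::HealthPending => self::Active,
            self::Active, self::Failed => null,
        };
    }
}

// Assumed step values; the ADR only says the column identifies the failing step.
enum FailedStep: string
{
    case Dns = 'dns';
    case Cert = 'cert';
    case Health = 'health';

    /** State a retry resumes from; the branch differs per failed step. */
    public function resumeFrom(): ProvisioningStatus
    {
        return match ($this) {
            self::Dns => ProvisioningStatus::DnsPending,
            self::Cert => ProvisioningStatus::CertPending,
            self::Health => ProvisioningStatus::HealthPending,
        };
    }
}
```

Because both match expressions are exhaustive over the enum cases, adding a new step without deciding its recovery branch fails loudly instead of falling through to a string comparison.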

2. Provider abstraction

Each external system is fronted by an interface in app/Contracts/Provisioning/. Concrete implementations live in app/Services/Provisioning/. Provider methods are idempotent: ensureRecord() creates-or-updates and never errors on already-exists.

Provider interfaces declare an explicit timeout contract per Doctrine #8 and the 2026-04-22 library-author extension. Either:

  • Constructor-required int $timeoutSeconds, OR
  • Per-method ->withTimeout(int $seconds) returning a configured client.

Implementations must not inherit framework defaults. An arch test fails the build if a class under app/Services/Provisioning/* references Http:: or an HTTP client without a visible timeout configuration.
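As a rough sketch of the two contract points (idempotent ensureRecord() and a constructor-required timeout), the following uses an in-memory stand-in rather than a real provider SDK. The interface name DnsProvider, the class InMemoryDnsProvider, the method signature and the generated record ids are all hypothetical; the ADR fixes only the directories and the idempotency rule.

```php
<?php
declare(strict_types=1);

// Hypothetical contract; would live under app/Contracts/Provisioning/ in a territory.
interface DnsProvider
{
    /** Create-or-update; returns the provider-side record id. Must not fail on already-exists. */
    public function ensureRecord(string $hostname, string $target): string;
}

// In-memory stand-in for a concrete provider, used only to show the contract.
final class InMemoryDnsProvider implements DnsProvider
{
    /** @var array<string, array{id: string, target: string}> */
    private array $records = [];

    // The timeout is required at construction, so no framework default can be inherited.
    public function __construct(public readonly int $timeoutSeconds)
    {
        if ($timeoutSeconds <= 0) {
            throw new InvalidArgumentException('timeoutSeconds must be positive');
        }
    }

    public function ensureRecord(string $hostname, string $target): string
    {
        // Reuse the existing id when the record is already there; otherwise mint one.
        $id = $this->records[$hostname]['id'] ?? 'rec_' . (count($this->records) + 1);
        $this->records[$hostname] = ['id' => $id, 'target' => $target];

        return $id;
    }
}

$dns = new InMemoryDnsProvider(timeoutSeconds: 10);
$first = $dns->ensureRecord('acme.example.test', '203.0.113.10');
$second = $dns->ensureRecord('acme.example.test', '203.0.113.20'); // update, same id
```

Calling ensureRecord() twice for the same hostname returns the same id, which is the property that lets a retry re-run a step without first checking whether the earlier attempt got as far as the provider.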

3. Single dispatch path

The Action that creates the provisioned entity is the only code path that writes the entity row. Sibling code paths (signup actions, tenant-creation actions, batch-import actions) delegate to it rather than writing directly. This is the canonical path the dispatch hook attaches to.

Where sibling code paths exist today writing the entity directly, refactor them to delegate as part of the territory's adoption campaign — not as a deferred follow-up. An arch test enforces the single-writer constraint per territory.

After the entity-creation transaction commits, the creation Action dispatches the queued provisioning job. Failure of the job does not roll back the entity row — provisioning failure is a state, not a deletion.
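The ordering described here (write inside the transaction, dispatch only after commit) can be illustrated without any framework. In this sketch CreateEntityAction and its three closures are hypothetical stand-ins for a territory's real Action, database transaction helper, row writer and queue; a Laravel territory would use its own equivalents.

```php
<?php
declare(strict_types=1);

// Hypothetical Action; the closures stand in for the DB transaction, the row write and the queue.
final class CreateEntityAction
{
    public function __construct(
        private Closure $transaction, // fn(Closure $work): mixed, commits when $work returns
        private Closure $insertRow,   // fn(string $hostname): int, the single writer of the row
        private Closure $dispatch,    // fn(int $entityId): void, queues the provisioning job
    ) {}

    public function execute(string $hostname): int
    {
        // The only place the entity row is written.
        $id = ($this->transaction)(fn (): int => ($this->insertRow)($hostname));

        // Reached only after the transaction closure returned, i.e. after commit.
        // A job that fails later does not undo the row written above.
        ($this->dispatch)($id);

        return $id;
    }
}

$log = [];
$action = new CreateEntityAction(
    transaction: function (Closure $work) use (&$log) {
        $log[] = 'begin';
        $result = $work();
        $log[] = 'commit';
        return $result;
    },
    insertRow: function (string $hostname) use (&$log): int {
        $log[] = "insert {$hostname}";
        return 1;
    },
    dispatch: function (int $id) use (&$log): void {
        $log[] = "dispatch {$id}";
    },
);
$entityId = $action->execute('acme.example.test');
// $log is now: begin, insert acme.example.test, commit, dispatch 1
```

Sibling paths such as signup or batch import would call execute() rather than insert the row themselves, which is what gives the dispatch hook a single seam to attach to.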

4. Retain-on-failure

Failed provisioning leaves the entity row with provisioning_status = 'failed' and with provisioning_failed_step and provisioning_last_error populated. The row is not deleted; the parent (e.g., the tenant) is not deleted. Retry recovers from failed to active without re-creating the row. Manual retry via an operator endpoint is always present from phase 1; automatic retry with bounded backoff is deferrable per phase.

Provisioning failures are external-system failures (DNS propagation slow, LE rate limits, provider API outage). Deleting the entity on failure conflates "transient external problem" with "operator decision to cancel". The user's signup or admin action commits independently of provisioning success.
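A small sketch of what "retain" and "bounded backoff" could look like in practice. The function names are hypothetical, the base delay, cap and attempt limit are illustrative values the ADR does not set, and incrementing provisioning_attempts at failure time (rather than at attempt start) is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Delay before automatic retry number $attempt, or null once attempts are exhausted.
 * Defaults are illustrative only.
 */
function retryDelaySeconds(int $attempt, int $base = 30, int $cap = 900, int $maxAttempts = 5): ?int
{
    if ($attempt < 1 || $attempt > $maxAttempts) {
        return null; // out of automatic attempts; only manual retry remains
    }

    // Exponential growth, clamped to the cap.
    return min($cap, $base * (2 ** ($attempt - 1)));
}

/** Record a failure on the row. The row is updated, never deleted. */
function markFailed(array $row, string $step, string $error): array
{
    return array_merge($row, [
        'provisioning_status' => 'failed',
        'provisioning_failed_step' => $step,
        'provisioning_last_error' => $error,
        'provisioning_attempts' => $row['provisioning_attempts'] + 1,
    ]);
}

$row = ['id' => 7, 'provisioning_status' => 'cert_pending', 'provisioning_attempts' => 0];
$row = markFailed($row, 'cert', 'rate limited by certificate authority');
$delay = retryDelaySeconds($row['provisioning_attempts']);
```

With these defaults the delays run 30, 60, 120, 240 and 480 seconds, after which the function returns null and the entity stays in failed until an operator retries it.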

5. Audit logging mandatory (ADR-0001)

Every provisioning state transition emits a row to the entity's audit log. The audit log records the actor (system actor for queued-job-driven transitions), previous state, new state, failed step (if applicable), and a RequestContext (system context for queued jobs).

Audit emission is not optional — provisioning is the surface A.8.15 demands forensic visibility for. For territories adopting this ADR, the audit logger is a prerequisite, not a phase-2 polish.

Tenant-cascade deletes (where the parent's deletion implies the entity's deletion) must iterate-and-log per entity, not bulk-delete. Bulk-delete bypasses the entity audit logger and creates a forensic gap.
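To make the iterate-and-log requirement concrete, here is a framework-free sketch. The AuditLog class, its array row format, the 'deleted' pseudo-state and the actor strings are hypothetical; the ADR specifies which fields an audit row records, not how they are stored.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory audit log carrying the fields the ADR lists.
final class AuditLog
{
    /** @var list<array<string, string|null>> */
    public array $rows = [];

    public function transition(
        string $entityId,
        string $actor,
        string $from,
        string $to,
        ?string $failedStep,
        string $context,
    ): void {
        $this->rows[] = [
            'entity_id' => $entityId,
            'actor' => $actor,
            'previous_state' => $from,
            'new_state' => $to,
            'failed_step' => $failedStep,
            'request_context' => $context,
        ];
    }
}

/** Tenant-cascade delete: one audit row per entity instead of a single bulk delete. */
function deleteTenantEntities(array $entities, AuditLog $audit, string $actor): int
{
    $deleted = 0;
    foreach ($entities as $entity) {
        $audit->transition($entity['id'], $actor, $entity['provisioning_status'], 'deleted', null, 'system');
        $deleted++; // the per-row delete would happen here
    }

    return $deleted;
}

$audit = new AuditLog();
// A queued job records a failed DNS step under the system actor.
$audit->transition('dom_1', 'system', 'dns_pending', 'failed', 'dns', 'system');
$count = deleteTenantEntities([
    ['id' => 'dom_1', 'provisioning_status' => 'failed'],
    ['id' => 'dom_2', 'provisioning_status' => 'active'],
], $audit, 'operator:42');
```

A bulk delete would have removed both rows while adding nothing to the log; the loop leaves one audit row per entity, so the history of dom_1 ends with an explicit deletion entry rather than stopping at failed.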

6. Authorization at the retry seam (ADR-0006)

The retry endpoint (POST /.../{entity}/{id}/provisioning/retry) is a Tier-1 ability — User + Entity, no extra runtime context. Implemented as a Policy method (retryProvisioning) on the entity's Policy class. The route declares ->can('retryProvisioning', '{entity}').

For territories without an existing Policy on the entity (kendo central Domain is the case in point), the Policy must be created as a prerequisite to the provisioning campaign. Provisioning does not introduce the first Policy for an entity in the same PR.
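A Tier-1 check of this kind might look like the sketch below. User, Domain and the isCentralOperator flag are hypothetical stand-ins for a territory's own models and role model, and restricting retry to entities currently in failed is an assumption, not something the ADR mandates.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for a territory's User and Domain models.
final class User
{
    public function __construct(public readonly bool $isCentralOperator) {}
}

final class Domain
{
    public function __construct(public readonly string $provisioningStatus) {}
}

final class DomainPolicy
{
    /** Tier-1: decided from User + Entity alone, with no extra runtime context. */
    public function retryProvisioning(User $user, Domain $domain): bool
    {
        return $user->isCentralOperator && $domain->provisioningStatus === 'failed';
    }
}

$policy = new DomainPolicy();
$allowed = $policy->retryProvisioning(new User(true), new Domain('failed'));
$denied = $policy->retryProvisioning(new User(false), new Domain('failed'));
```

The method takes only the two route-bound objects, which is what keeps it expressible as a route-level ->can() rather than a check inside the controller.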

7. Feature flag rollout

A territory-level config key gates the dispatch:

{TERRITORY}_PROVISIONING_ENABLED=true|false

When false, schema / API / UI ship but no external mutation occurs. Enables staging-first validation and production-ready-but-not-enabled deployments. The flag is removable once provisioning has soaked in production for the territory's defined burn-in period (territory chooses; default 30 days).
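One way to read the flag, sketched in plain PHP. The helper names are hypothetical, KENDO_PROVISIONING_ENABLED is just the template above filled in for illustration, and treating a missing or unrecognised value as off is an assumption (a fail-closed default).

```php
<?php
declare(strict_types=1);

/** Parse a {TERRITORY}_PROVISIONING_ENABLED style value; anything unrecognised counts as off. */
function provisioningEnabled(?string $raw): bool
{
    return filter_var($raw ?? '', FILTER_VALIDATE_BOOLEAN);
}

/** Dispatch the job only when the flag is on; report whether it was dispatched. */
function dispatchIfEnabled(bool $enabled, Closure $dispatch, int $entityId): bool
{
    if (! $enabled) {
        return false; // schema, API and UI still exist; no external mutation happens
    }
    $dispatch($entityId);

    return true;
}

$queued = [];
$enqueue = function (int $id) use (&$queued): void {
    $queued[] = $id;
};

// getenv() returns false when the variable is unset; map that to null.
$flag = provisioningEnabled(getenv('KENDO_PROVISIONING_ENABLED') ?: null);

$skipped = dispatchIfEnabled(false, $enqueue, 1);
$sent = dispatchIfEnabled(true, $enqueue, 2);
```

Gating at the dispatch call rather than inside the job means a disabled territory never enqueues work at all, so turning the flag on later does not release a backlog of stale jobs.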

Options Considered

| Option | Verdict | Reason |
|---|---|---|
| Synchronous provisioning in the create endpoint | Rejected | DNS propagation and cert issuance are non-deterministic and slow. Blocks the user's request, causes timeouts. Couples external-system failure to user-visible request failure. |
| Wildcard DNS + wildcard cert | Rejected (per kendo) | Loses per-tenant cert visibility. Removes the operator seam for "is this tenant routable?" Doesn't extend to BYOD. Some territories may revisit; default architecture is per-host. |
| Roll back entity on provisioning failure | Rejected | Conflates external-system transient failure with user cancel. Loses operator visibility. Loses retry capability. |
| Direct SDK calls in Actions, no provider interface | Rejected | Creates territory-by-territory drift. Hides the timeout-discipline surface. Provider migrations become refactors. |
| Single failed state with text error message | Rejected | Collapses recovery branches. DNS-fail and cert-fail need different recovery; a string field is operator-grep where a state-machine field is type-safe. |
| Audit opt-in per territory | Rejected | Two of four likely-adopters are ISO 27001 certified; one carries ISO 27001 + AVG + NEN 7510. Encoding doctrinal asymmetry across the alliance is unsafe. |
| Async + provider-abstracted + retain-on-failure + single-dispatch + audit-mandatory + Tier-1 retry + flag-gated | Accepted | Resolves all six observed risks. kendo first adopter; pattern transfers. |

Consequences

Positive

  • Predictable failure semantics across territories.
  • Provider migrations are interface-swaps, not refactors.
  • Audit trail satisfies A.8.15 by construction (mandatory, enforced).
  • Authorization is doctrinally consistent (Tier-1 per ADR-0006).
  • Timeout discipline is interface-enforced, not convention-enforced.
  • Operators see provisioning state explicitly; retries are first-class.

Negative

  • Schema cost: 5–7 new columns on every provisioned-entity table.
  • Single-dispatch refactor cost: territories with multiple direct-write paths (kendo signup + central-create) must refactor to delegate.
  • Provider abstraction adds one layer of indirection above each external SDK.
  • Feature-flag plumbing is additional config surface that must eventually be removed.
  • Hot-path read cost: a provisioning_status predicate in tenant-resolution middleware (where applicable) makes a previously-covering index non-covering. Worth knowing, marginal at typical cardinality.

Risks

  • External rate limits. Let's Encrypt (50 certs / registered domain / week), Cloudflare API (1200 / 5min / token), Fly API. Mitigation: observability counters on cert_pending → cert_active latency; alert on sustained queue depth; territory documents rate-limit caps in its CLAUDE.md.
  • Certificate Transparency log enumeration. Per-host certs publish to CT logs, exposing tenant subdomain enumeration. Mitigation: documented trade-off; territories with strict tenant-existence privacy needs may need wildcard or explicit DECISIONS.md acknowledgement.
  • Concurrency on retry. Two jobs running for one entity (auto-retry tick + operator manual retry). Mitigation: advisory lock or processing_started_at timestamp + skip-if-recent guard in the job's first step.
  • Reserved-name claims pre-provisioning. Subdomain blocklists / reserved-prefix lists enforced asymmetrically across territory write surfaces become security holes when provisioning auto-issues real resources. Mitigation: enforcement of reserved-name lists is a prerequisite, not a phase-2 polish — territories must verify all write surfaces consult the same list before enabling the provisioning flag.
  • DNS/cert resource drift. Manual edits to provider-side records out of band. Mitigation: providers are idempotent (ensureRecord() reconciles); manual retry triggers reconciliation; periodic audit reconciliation deferred until rate of drift is observed.
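The skip-if-recent guard mentioned under "Concurrency on retry" can be sketched as a pure function. processing_started_at comes from the mitigation text; the function name and the 120-second window are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * True when another run claimed this entity recently enough that the current job should back off.
 * The window length is an illustrative value.
 */
function shouldSkipRun(
    ?DateTimeImmutable $processingStartedAt,
    DateTimeImmutable $now,
    int $windowSeconds = 120,
): bool {
    if ($processingStartedAt === null) {
        return false; // nobody has claimed the entity yet
    }

    return ($now->getTimestamp() - $processingStartedAt->getTimestamp()) < $windowSeconds;
}

$utc = new DateTimeZone('UTC');
$now = new DateTimeImmutable('2026-05-01 12:00:00', $utc);

$recent = shouldSkipRun(new DateTimeImmutable('2026-05-01 11:59:30', $utc), $now); // 30s ago
$stale = shouldSkipRun(new DateTimeImmutable('2026-05-01 11:50:00', $utc), $now);  // 10min ago
```

On its own this check is racy: two jobs can both read a stale timestamp before either writes a new one. In practice the read and the claim would need to be one atomic step, for example a conditional update or the advisory lock the mitigation also mentions.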

Enforcement

| What | Mechanism | Scope |
|---|---|---|
| Provider classes declare a timeout contract | Pest arch test, territory-local (candidate for phpstan-warroom-rules Phase 2) | app/Contracts/Provisioning/* and app/Services/Provisioning/* |
| Single-writer constraint on provisioned entities | Pest arch test, territory-local | Models for provisioned entities |
| Audit logger exists for provisioned entities | Pest arch test, extends the tests/Arch/AuditTest.php pattern | app/Audit/{Entity}AuditLogger |
| Tier-1 Policy method exists for retry endpoint | tests/Arch/RoutesAuthorizationTest.php (extended for central / equivalent surfaces per kendo distress signal) | Routes matching */provisioning/retry |
| Provisioning state column present on provisioned entity | Pest arch test, territory-local | Migration history |
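The timeout check in the first row could start from a textual heuristic like the one below. This is not the Pest arch test itself: missingTimeout() is a hypothetical helper, and matching on ->timeout(, withTimeout( or timeoutSeconds is a crude stand-in for "a visible timeout configuration" that a real rule would refine.

```php
<?php
declare(strict_types=1);

/** True when source text references Http:: without any visible timeout configuration. */
function missingTimeout(string $source): bool
{
    $usesHttp = str_contains($source, 'Http::');
    $hasTimeout = preg_match('/->timeout\s*\(|withTimeout\s*\(|timeoutSeconds/', $source) === 1;

    return $usesHttp && ! $hasTimeout;
}

// Sample class bodies as they might appear under app/Services/Provisioning/.
$bad = 'return Http::get($url);';
$good = 'return Http::timeout($this->timeoutSeconds)->get($url);';
$unrelated = 'return strtolower($hostname);';

$violations = array_keys(array_filter(
    ['Bad.php' => $bad, 'Good.php' => $good, 'Unrelated.php' => $unrelated],
    fn (string $source): bool => missingTimeout($source),
));
// $violations is ['Bad.php']
```

A territory-local test could run this over every file in the two scoped directories and fail the build on a non-empty result, then tighten the pattern as false negatives turn up.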

Resolved Questions

Why retain-on-failure rather than transactional rollback?

Resolved 2026-05-01. Provisioning failure is an external-system condition, not a user-input condition. Rolling back the entity (and parent) on provisioning failure conflates "Cloudflare returned 503" with "user cancelled signup". The user's request committed; the external system's state is what's incomplete. Keeping the entity row preserves operator visibility (failed entities are visible and retryable) and decouples user-visible flow from provider availability.

Why mandatory audit, not opt-in per territory?

Resolved 2026-05-01. Provisioning is exactly A.8.15 surface (security-relevant configuration changes) and at least three of the four likely-adopter territories are under ISO 27001. Making audit opt-in encodes a doctrinal asymmetry across the alliance. Mandatory audit at the architectural level avoids that.

Why Tier-1 retry, not Tier-2?

Resolved 2026-05-01. Retry is a User + Entity decision: "can this central operator retry this domain's provisioning?" No runtime data beyond the route bindings is required. Per ADR-0006's rule of thumb, that is Tier-1 (Policy, route-level ->can()). Tier-2 would be appropriate if retry depended on, say, the originating request's tenant context — it doesn't.

Why single-dispatch (refactor signup/create-tenant to delegate), rather than event-driven?

Resolved 2026-05-01. Two of three production writers in kendo bypass the dedicated CreateDomainAction today. An event-driven hook would preserve the bypass shape (three emitters, one listener) and accept a permanent three-call-site dispatch contract policed by tests. The Commander chose the refactor: paying the cost now to land the canonical seam beats deferring and accepting drift risk forever. Trade-off: the provisioning campaign's blast radius widens to include two well-tested production write paths. Trade accepted.

Implementation

| Territory | State | Notes |
|---|---|---|
| kendo | Not Started | First adopter. PR #1028 (KD-0580) being re-planned against this ADR. Prerequisite work (DomainAuditLogger, DomainPolicy + RoutesAuthorizationTest central-route extension, KD-0596 reserved-subdomain) sequenced ahead. |
| ublgenie | Not Started | Tenant onboarding will adopt when surfaced by reconnaissance. Cartographer pass needed before scoping. |
| emmie | Not Started | CreateAwsBucket.php exists pre-ADR; revisit for adoption when tenant onboarding flow is reformed. |
| daymate-api | Not Started | Speculative. Will revisit if tenant-resource provisioning becomes a campaign type. |
| Other territories | Not in scope | No tenant-resource provisioning surface today. |