Provider reliability
Provider Health
A monitoring room for payment rails, identity, messaging, webhooks, APIs, database health, and the recovery actions that keep operators from guessing.
Operational
5
Provider surfaces that are currently healthy enough to rely on.
Degraded
3
Providers with visible warnings or incomplete readiness.
Blocked
1
Provider events or systems that should block automation.
Signal
63%
Average provider confidence across signals.
payments
2 signals
0
Blocked
2
Watch
58%
Inspect provider logs, queues, deployment context, and recent user-facing failures
webhook
3 signals
1
Blocked
0
Watch
47%
Verify idempotency, payload hash, retry policy, and owner alerting
api
4 signals
0
Blocked
1
Watch
71%
Keep monitoring cadence and evidence freshness
identity
1 signal
0
Blocked
0
Watch
92%
Deepen role-aware session usage across more surfaces
messaging
1 signal
0
Blocked
0
Watch
32%
Use for recovery, lead follow-up, and event comms
reporting
0 signals
0
Blocked
0
Watch
0%
No provider in this lane yet
infrastructure
1 signal
0
Blocked
0
Watch
95%
Keep monitoring cadence and evidence freshness
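The summary cards and lane cards above appear to be rollups of the individual entries in the provider signal queue on this page. The sketch below shows one way those numbers could be derived. The per-provider values are copied from the queue; the status-to-bucket mapping (`BUCKETS`) and the half-up rounding are assumptions inferred from the counts shown here, not a documented rule, and all function names are hypothetical.

```python
# (name, lane, signal %, status) copied from the provider signal queue on this page.
SIGNALS = [
    ("Postgres database", "infrastructure", 95, "healthy"),
    ("AI decision log", "api", 95, "healthy"),
    ("Better Auth", "identity", 92, "live"),
    ("Health endpoint", "api", 90, "live"),
    ("booking.changed", "webhook", 86, "active"),
    ("Members API", "api", 63, "documented"),
    ("Payment webhooks", "payments", 61, "watch"),
    ("Payment rails", "payments", 54, "needs attention"),
    ("payment.failed", "webhook", 38, "planned"),
    ("Bookings API", "api", 35, "planned"),
    ("Resend", "messaging", 32, "planned"),
    ("ai.decision.logged", "webhook", 18, "failing"),
]

# ASSUMPTION: mapping inferred from the page's counts, not a documented rule.
BUCKETS = {
    "healthy": "operational", "live": "operational", "active": "operational",
    "watch": "degraded", "needs attention": "degraded", "documented": "degraded",
    "failing": "blocked",
    "planned": "planned",  # appears in no summary card
}


def pct(values):
    """Mean rounded half-up to a whole percent; 0 when there are no values."""
    return int(sum(values) / len(values) + 0.5) if values else 0


def summarize(signals):
    """Return (count per bucket, average signal across all entries)."""
    counts = {}
    for _, _, _, status in signals:
        bucket = BUCKETS[status]
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts, pct([score for _, _, score, _ in signals])


def lanes(signals):
    """Return per-lane signal count, blocked count, watch count, and average."""
    grouped = {}
    for _, lane, score, status in signals:
        row = grouped.setdefault(lane, {"scores": [], "blocked": 0, "watch": 0})
        row["scores"].append(score)
        if BUCKETS[status] == "blocked":
            row["blocked"] += 1
        elif BUCKETS[status] == "degraded":
            row["watch"] += 1
    return {
        lane: {
            "signals": len(row["scores"]),
            "blocked": row["blocked"],
            "watch": row["watch"],
            "signal": pct(row["scores"]),
        }
        for lane, row in grouped.items()
    }
```

Under these assumptions, `summarize(SIGNALS)` yields 5 operational, 3 degraded, 1 blocked, and a 63% average, and `lanes(SIGNALS)` yields the lane figures shown above. Entries with status `planned` fall outside every summary card.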
Provider signal queue
Ranked by signal strength, strongest first, so weak integrations, failed webhooks, and incomplete contracts become visible before members feel them.
Postgres database
infrastructure · Owner: Platform
95%
Signal
Just now
Last signal
Core product read/write availability
Recovery action
Keep monitoring cadence and evidence freshness
Escalation
Check health endpoint
Compare recent deploy and provider status
Open incident timeline
Communicate owner-safe status
Evidence
System: Postgres database
Status: healthy
Core read/write path is available, which keeps the operational app usable.
AI decision log
api · Owner: Operations
95%
Signal
Today
Last signal
Automation trust, review safety, and owner visibility
Recovery action
Keep monitoring cadence and evidence freshness
Escalation
Check health endpoint
Compare recent deploy and provider status
Open incident timeline
Communicate owner-safe status
Evidence
System: AI decision log
Status: healthy
AI surfaces need logging and reasoning visibility to stay trustworthy as automation grows.
Better Auth
identity · Owner: Platform
92%
Signal
Connection expected to be usable
Last signal
Login, staff access, family switching, and audit attribution
Recovery action
Deepen role-aware session usage across more surfaces
Escalation
Confirm provider owner
Check secrets/config
Review recent webhook/API failures
Create owner-visible incident if user-facing
Evidence
Category: identity
Status: live
The identity foundation is there; the next lift is richer role and household behavior on top of it.
Health endpoint
api · Owner: Platform
90%
Signal
Live API surface
Last signal
Consumer: Coolify and operations
Recovery action
Run health smoke and inspect contract drift
Escalation
Identify consumer
Validate auth scope
Smoke route/contract
Document breaking-change risk
Evidence
Consumer: Coolify and operations
Status: live
The existing health surface is the start of a broader platform operations API.
booking.changed
webhook · Owner: Operations
86%
Signal
Delivery active
Last signal
Destination: Waitlist and member timeline
Recovery action
Verify idempotency, payload hash, retry policy, and owner alerting
Escalation
Inspect latest delivery
Check idempotency key
Replay safe payload
Escalate if timeline or money state can drift
Evidence
Destination: Waitlist and member timeline
Status: active
Booking changes are the glue between schedule pressure, member experience, and attendance operations.
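The webhook recovery action above asks operators to verify idempotency and payload hash. A minimal sketch of what that check could look like follows. It assumes each delivery carries an idempotency key and a JSON payload; the class name, return values, and in-memory store are hypothetical and stand in for whatever storage the real system uses.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DeliveryLog:
    """Hypothetical in-memory record of idempotency keys and payload hashes."""

    def __init__(self):
        self._seen = {}

    def accept(self, idempotency_key: str, payload: dict) -> str:
        digest = payload_hash(payload)
        previous = self._seen.get(idempotency_key)
        if previous is None:
            self._seen[idempotency_key] = digest
            return "process"    # first delivery for this key
        if previous == digest:
            return "duplicate"  # safe replay: same key, same payload
        return "conflict"       # same key, different payload: escalate
```

A `conflict` result is the case where timeline or money state can drift, so it is the one worth escalating rather than silently dropping.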
Members API
api · Owner: Platform
63%
Signal
Contract documented
Last signal
Consumer: Migration and integrations
Recovery action
Finish auth scope, examples, audit event, and rate-limit definition
Escalation
Identify consumer
Validate auth scope
Smoke route/contract
Document breaking-change risk
Evidence
Consumer: Migration and integrations
Status: documented
External access should follow the same tenant and audit expectations as internal actions.
Payment webhooks
payments · Owner: Operations
61%
Signal
5 minutes ago
Last signal
Money state, recovery workflows, and member trust
Recovery action
Inspect provider logs, queues, deployment context, and recent user-facing failures
Escalation
Check health endpoint
Compare recent deploy and provider status
Open incident timeline
Communicate owner-safe status
Evidence
System: Payment webhooks
Status: watch
Payment rails should be visible to owners before failed webhooks become mystery billing states.
Payment rails
payments · Owner: Billing
54%
Signal
Operational decision required
Last signal
Billing, dunning, refunds, receipts, and owner trust
Recovery action
Decide real retry/recovery ownership across providers
Escalation
Confirm provider owner
Check secrets/config
Review recent webhook/API failures
Create owner-visible incident if user-facing
Evidence
Category: payments
Status: needs attention
Billing depth is where the gap between a convincing shell and a truly operational system shows most clearly.
payment.failed
webhook · Owner: Billing
38%
Signal
Delivery not yet wired
Last signal
Destination: Recovery workflow
Recovery action
Verify idempotency, payload hash, retry policy, and owner alerting
Escalation
Inspect latest delivery
Check idempotency key
Replay safe payload
Escalate if timeline or money state can drift
Evidence
Destination: Recovery workflow
Status: planned
Failed payments should trigger recovery, audit, and communication context without becoming noisy.
Bookings API
api · Owner: Platform
35%
Signal
Contract planned
Last signal
Consumer: Member app and partner widgets
Recovery action
Finish auth scope, examples, audit event, and rate-limit definition
Escalation
Identify consumer
Validate auth scope
Smoke route/contract
Document breaking-change risk
Evidence
Consumer: Member app and partner widgets
Status: planned
Booking APIs need capacity, waitlist, family, and cancellation rules baked in from the beginning.
Resend
messaging · Owner: Growth
32%
Signal
Not connected yet
Last signal
Lead nurture, recovery, reminders, and incident comms
Recovery action
Use for recovery, lead follow-up, and event comms
Escalation
Confirm provider owner
Check secrets/config
Review recent webhook/API failures
Create owner-visible incident if user-facing
Evidence
Category: messaging
Status: planned
Messaging is already becoming central to the product, so a clearer integration surface helps the roadmap feel coherent.
ai.decision.logged
webhook · Owner: AI review
18%
Signal
Delivery failing
Last signal
Destination: Audit and owner review
Recovery action
Pause automation, inspect failed payloads, and define retry/dead-letter behavior
Escalation
Inspect latest delivery
Check idempotency key
Replay safe payload
Escalate if timeline or money state can drift
Evidence
Destination: Audit and owner review
Status: failing
AI events need extra visibility because trust breaks quickly when automation feels untraceable.
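The recovery action for this failing webhook calls for defining retry and dead-letter behavior. The sketch below shows one common shape for that: capped exponential backoff, then parking the payload for review once attempts run out. The attempt count, delays, function names, and result fields are all assumptions chosen for illustration.

```python
import time


def backoff_delays(max_attempts, base_seconds=1.0, cap_seconds=60.0):
    """Delay before each retry: base * 2^n, capped. The first attempt has no delay."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(max_attempts - 1)]


def deliver_with_retry(send, payload, max_attempts=4, sleep=time.sleep):
    """Call send(payload) until it succeeds or attempts run out.

    On exhaustion the payload is returned with status "dead_letter" so a
    caller can store it for owner review instead of losing it.
    """
    delays = backoff_delays(max_attempts)
    errors = []
    for attempt in range(max_attempts):
        try:
            send(payload)
            return {"status": "delivered", "attempts": attempt + 1, "errors": errors}
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(str(exc))
            if attempt < max_attempts - 1:
                sleep(delays[attempt])
    return {
        "status": "dead_letter",
        "attempts": max_attempts,
        "errors": errors,
        "payload": payload,
    }
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Retrying is only safe when the receiving side treats deliveries as idempotent.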
Priority recovery
The providers that should drive the next operator action.
Members API
Finish auth scope, examples, audit event, and rate-limit definition
Payment webhooks
Inspect provider logs, queues, deployment context, and recent user-facing failures
Payment rails
Decide real retry/recovery ownership across providers
ai.decision.logged
Pause automation, inspect failed payloads, and define retry/dead-letter behavior
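The four entries above match the queue items whose status is `documented`, `watch`, `needs attention`, or `failing`, listed strongest signal first. Entries that are already operational or still `planned` do not appear. The sketch below encodes that reading as a filter; the status set is an inference from this page, not a documented rule, and the names are hypothetical.

```python
# (name, signal %, status) copied from the provider signal queue on this page.
QUEUE = [
    ("Postgres database", 95, "healthy"),
    ("AI decision log", 95, "healthy"),
    ("Better Auth", 92, "live"),
    ("Health endpoint", 90, "live"),
    ("booking.changed", 86, "active"),
    ("Members API", 63, "documented"),
    ("Payment webhooks", 61, "watch"),
    ("Payment rails", 54, "needs attention"),
    ("payment.failed", 38, "planned"),
    ("Bookings API", 35, "planned"),
    ("Resend", 32, "planned"),
    ("ai.decision.logged", 18, "failing"),
]

# ASSUMPTION: statuses that call for operator action, inferred from the list above.
NEEDS_ACTION = {"documented", "watch", "needs attention", "failing"}


def priority_recovery(queue):
    """Names of entries needing action, ordered by signal, strongest first."""
    selected = [(name, score) for name, score, status in queue if status in NEEDS_ACTION]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in selected]
```

A team that wants the weakest signal at the top would drop `reverse=True`; the order here simply mirrors what the page shows.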
Monitoring runbook
The minimum loop for provider health once real providers are wired.
Observe
Capture status, latency, last success, last failure, payload count, queue depth, and provider incident notes.
Alert
Warn owners only when a signal is actionable: billing drift, blocked login, failed delivery, or user-facing outage.
Recover
Replay idempotent events, pause unsafe automations, hold billing sends, and document rollback options.
Close
Attach evidence, timeline, prevention task, and owner-safe explanation before marking healthy.
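The Alert and Close steps above are both gates, so they can be sketched as two small checks. This is an illustration under assumptions: the `Incident` shape, the kind labels, and the function names are hypothetical, chosen to mirror the wording of the runbook.

```python
from dataclasses import dataclass, field
from typing import List

# Kinds the runbook names as actionable; the identifier spellings are assumed.
ACTIONABLE_KINDS = {
    "billing_drift",
    "blocked_login",
    "failed_delivery",
    "user_facing_outage",
}


@dataclass
class Incident:
    provider: str
    kind: str
    evidence: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)
    prevention_task: str = ""
    owner_explanation: str = ""


def should_alert(incident: Incident) -> bool:
    """Alert step: warn owners only for actionable kinds."""
    return incident.kind in ACTIONABLE_KINDS


def missing_for_close(incident: Incident) -> List[str]:
    """Close step: list what is still missing before marking healthy."""
    missing = []
    if not incident.evidence:
        missing.append("evidence")
    if not incident.timeline:
        missing.append("timeline")
    if not incident.prevention_task:
        missing.append("prevention task")
    if not incident.owner_explanation:
        missing.append("owner-safe explanation")
    return missing
```

An incident can be marked healthy only when `missing_for_close` returns an empty list.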
Durable tables next
provider_checks
Provider, check type, status, latency, last error, last success, and cadence.
provider_incidents
Impact, owner, linked provider, timeline, comms, recovery action, and prevention task.
provider_alert_rules
Threshold, severity, notification route, quiet hours, and suppress-until behavior.
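The three tables above could be drafted as follows. This is a sketch only: it uses SQLite in memory so it runs anywhere, while the page indicates the product database is Postgres, so types would need adapting. Column names and types are assumptions derived from the field lists above, and the `id` columns are additions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE provider_checks (
    id              INTEGER PRIMARY KEY,
    provider        TEXT NOT NULL,
    check_type      TEXT NOT NULL,
    status          TEXT NOT NULL,
    latency_ms      INTEGER,
    last_error      TEXT,
    last_success_at TEXT,
    cadence_seconds INTEGER NOT NULL
);
CREATE TABLE provider_incidents (
    id              INTEGER PRIMARY KEY,
    impact          TEXT NOT NULL,
    owner           TEXT NOT NULL,
    provider        TEXT NOT NULL,
    timeline        TEXT,
    comms           TEXT,
    recovery_action TEXT,
    prevention_task TEXT
);
CREATE TABLE provider_alert_rules (
    id                 INTEGER PRIMARY KEY,
    threshold          REAL NOT NULL,
    severity           TEXT NOT NULL,
    notification_route TEXT NOT NULL,
    quiet_hours_start  TEXT,
    quiet_hours_end    TEXT,
    suppress_until     TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the three draft tables on the given connection."""
    conn.executescript(SCHEMA)


def table_names(conn: sqlite3.Connection):
    """Return the user table names, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `create_schema(sqlite3.connect(":memory:"))` builds the tables in a throwaway database, which is enough to try queries before committing to a migration.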