# Backend Health & Observability
Version: 1.1.1
Last Updated: January 23, 2026
Status: Production
Component: rc-backend
## Overview

The BranchPy backend exposes health and observability endpoints to ensure system reliability and enable fast diagnosis of runtime issues. This document covers:

- Health check endpoints
- Database connectivity diagnostics
- Structured logging integration
- Monitoring integration

For comprehensive logging configuration and usage, see the Logging System Documentation.
## Health Check Endpoints

### `/health` (Lightweight)

**Purpose:** Service uptime verification; no I/O operations.

**Request:**

```
GET /health
```

**Response (always 200):**

```json
{
  "status": "ok",
  "timestamp": "2026-01-23T14:32:11.382Z"
}
```

**Semantics:** “Is the Node.js process up and routable?”

**Use Cases:**

- Load balancer health checks
- Simple “is backend alive?” monitoring
- Kubernetes liveness probes

**Response Time:** <10ms
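Because this endpoint performs no I/O, its body can be produced by a pure function. A minimal sketch in TypeScript — the `buildHealthBody` name is illustrative, not from the codebase:

```typescript
// Hypothetical sketch: build the /health response body.
// No I/O is performed, so this always succeeds and returns quickly.
export function buildHealthBody(now: Date = new Date()): { status: "ok"; timestamp: string } {
  return {
    status: "ok",
    timestamp: now.toISOString(), // e.g. "2026-01-23T14:32:11.382Z"
  };
}
```

Keeping the handler free of database or filesystem calls is what makes it safe for aggressive liveness probing.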
### `/health/db` (Diagnostic)

**Purpose:** Database connectivity diagnostic with explicit pool checks.

**Request:**

```
GET /health/db
X-Railway-Request-Id: <optional, for log correlation>
```

**Response Contract:**

**All Pools Healthy → 200 OK**

```json
{
  "status": "ok",
  "core_db": "connected",
  "telemetry_db": "connected",
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}
```
**Semantics:** Both pools answered `SELECT 1` within 500ms. Ready for requests.

**Partial Failure → 503 Service Unavailable**

```json
{
  "status": "degraded",
  "core_db": "connected",
  "telemetry_db": "error",
  "errors": {
    "telemetry_db": "password authentication failed"
  },
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}
```

**Semantics:**

- Core DB healthy → auth can work
- Telemetry DB failed → analytics/logging offline, but users unblocked
- HTTP 503 signals: “Something is broken, handle gracefully”
**Total Failure → 503 Service Unavailable**

```json
{
  "status": "down",
  "core_db": "error",
  "telemetry_db": "error",
  "errors": {
    "core_db": "connection timeout",
    "telemetry_db": "password authentication failed"
  },
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}
```

**Semantics:** Backend cannot authenticate users. Requests will fail. Escalate to incident response.
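Putting the contract together: the body and HTTP code can be derived mechanically from the two pool results. A sketch under the assumption of a simple discriminated-union result type (the `poolResultsToResponse` name is illustrative):

```typescript
type PoolResult = { state: "connected" } | { state: "error"; message: string };

interface HealthDbResponse {
  httpCode: 200 | 503;
  body: {
    status: "ok" | "degraded" | "down";
    core_db: "connected" | "error";
    telemetry_db: "connected" | "error";
    errors?: Record<string, string>;
    version: string;
    timestamp: string;
  };
}

// Hypothetical sketch of the /health/db contract described above:
// errors.<pool> appears only for pools that failed, and the endpoint
// never returns 200 unless both pools are healthy.
export function poolResultsToResponse(
  core: PoolResult,
  telemetry: PoolResult,
  version: string,
  now: Date = new Date(),
): HealthDbResponse {
  const failed = [core, telemetry].filter((r) => r.state === "error").length;
  const status = failed === 0 ? "ok" : failed === 1 ? "degraded" : "down";
  const errors: Record<string, string> = {};
  if (core.state === "error") errors.core_db = core.message;
  if (telemetry.state === "error") errors.telemetry_db = telemetry.message;
  return {
    httpCode: status === "ok" ? 200 : 503,
    body: {
      status,
      core_db: core.state,
      telemetry_db: telemetry.state,
      ...(failed > 0 ? { errors } : {}),
      version,
      timestamp: now.toISOString(),
    },
  };
}
```

Deriving the status purely from pool results (rather than setting it imperatively in branches) makes the "never 200 on failure" guarantee easy to verify.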
### Status Field Semantics

| Status | Meaning | HTTP Code | Action |
|---|---|---|---|
| `ok` | Both pools healthy | 200 | Proceed normally |
| `degraded` | One pool failed | 503 | Log issue; telemetry offline but auth works |
| `down` | Both pools failed | 503 | Incident: scale down, page on-call |
## Pool Check Implementation

### Query

```sql
SELECT 1;
```

**Properties:**

- Minimal overhead — no joins, no writes, no locks
- Proves connectivity — actually uses a connection from the pool
- Non-destructive — never modifies data

### Timeout

- Per pool: 500ms max
- Rationale: a failed DB should fail fast; a slow response indicates a dead connection
- Concurrency: both pool checks run in parallel
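The timeout and concurrency rules can be sketched with `Promise.race` and `Promise.all`. The `Pool` interface below is a minimal stand-in for a real client (e.g. `pg.Pool`), and `checkPool`/`checkPools` are illustrative names, not the actual implementation:

```typescript
interface Pool {
  query(sql: string): Promise<unknown>;
}

type CheckOutcome = { state: "connected" } | { state: "error"; message: string };

// Hypothetical sketch: run SELECT 1 against one pool, capped at 500ms.
async function checkPool(pool: Pool, timeoutMs = 500): Promise<CheckOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("connection timeout")), timeoutMs);
  });
  try {
    await Promise.race([pool.query("SELECT 1;"), timeout]);
    return { state: "connected" };
  } catch (err) {
    return { state: "error", message: err instanceof Error ? err.message : String(err) };
  } finally {
    if (timer !== undefined) clearTimeout(timer); // don't hold the event loop open
  }
}

// Both checks start immediately and run in parallel, so the worst-case
// latency is one timeout (~500ms), not the sum of both.
export async function checkPools(core: Pool, telemetry: Pool) {
  const [core_db, telemetry_db] = await Promise.all([checkPool(core), checkPool(telemetry)]);
  return { core_db, telemetry_db };
}
```

Errors are converted into values rather than thrown, so one failed pool never prevents the other pool's result from being reported.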
### Error Handling

All errors are captured and reported:

| Error Type | Example | Handling |
|---|---|---|
| Auth failure | “password authentication failed” | Reported in `errors.<pool>` |
| Connection refused | “connect ECONNREFUSED” | Reported in `errors.<pool>` |
| Timeout | “connection timeout” | Reported in `errors.<pool>` |
| Network unreachable | “ENETUNREACH” | Reported in `errors.<pool>` |
| Pool exhausted | (rare with `SELECT 1`) | Treated as pool unavailable |

**Key guarantee:** Never returns 200 OK if a pool failed.
## Version Identification

```json
{
  "version": "rc-backend@abc1234"
}
```

Identifies the running build:

- Source: `RAILWAY_GIT_COMMIT_SHA` (preferred)
- Fallback: `GITHUB_SHA`, then `BUILD_ID`
- Format: `rc-backend@<commit_hash>`

**Purpose:** Correlate a health check with the deploy that is unhealthy.
**Example in an incident:**

```
[INCIDENT] /health/db returns 503
Check: version = "rc-backend@abc1234"
Compare to: deployed commit = "abc1234" ✓ (current)
→ DB is actually down, not a stale version
```
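The fallback chain above can be sketched as a small resolver. Both the `resolveVersion` name and the 7-character short-hash truncation are assumptions for illustration, not confirmed behavior of the codebase:

```typescript
// Hypothetical sketch: resolve the build identifier from environment
// variables, preferring RAILWAY_GIT_COMMIT_SHA, then GITHUB_SHA, then BUILD_ID.
export function resolveVersion(env: Record<string, string | undefined>): string {
  const sha = env.RAILWAY_GIT_COMMIT_SHA ?? env.GITHUB_SHA ?? env.BUILD_ID ?? "unknown";
  // Shorten full 40-char hashes to the familiar 7-char form (assumption).
  return `rc-backend@${sha.slice(0, 7)}`;
}
```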
## Structured Logging

Every `/health/db` call emits exactly one log entry:

```
[HEALTH_DB] { request_id: "vgvTerVLTFyDlzLuCx5-qw", core_db: "connected", telemetry_db: "error", result: "degraded", version: "rc-backend@abc1234", timestamp: "2026-01-23T14:32:11.382Z" }
```

**Properties:**

- Correlates with the Railway request ID (paste into logs search)
- Machine-parseable (grep/awk safe)
- Single line (no log spam)
- Includes the result plus both pool statuses

**Parsing example:**

```shell
# Show all degraded health checks today
grep "\[HEALTH_DB\]" logs.txt | grep "degraded" | tail -20
```
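A formatter producing this single-line entry could look like the following sketch; the `formatHealthDbLog` name and the exact field set mirror the example above but are not taken from the codebase:

```typescript
interface HealthDbLogFields {
  request_id: string;
  core_db: "connected" | "error";
  telemetry_db: "connected" | "error";
  result: "ok" | "degraded" | "down";
  version: string;
  timestamp: string;
}

// Hypothetical sketch: emit exactly one grep-friendly line per /health/db call.
export function formatHealthDbLog(f: HealthDbLogFields): string {
  const pairs = Object.entries(f)
    .map(([key, value]) => `${key}: "${value}"`)
    .join(", ");
  return `[HEALTH_DB] { ${pairs} }`;
}
```

Emitting one flat line per call (instead of a multi-line object dump) is what keeps the `grep` pipelines above reliable.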
See also: `Technical/logging/README.md`
## Comparison: `/health` vs `/health/db`

| Aspect | `/health` | `/health/db` |
|---|---|---|
| Purpose | Service uptime | DB connectivity |
| DB Access | No | Yes (`SELECT 1` on both pools) |
| HTTP Code (OK) | 200 | 200 |
| HTTP Code (Fail) | Never fails | 503 |
| Response Time | <10ms | <1000ms (includes 500ms timeout) |
| Use Case | Load balancers | Monitoring, diagnostics, CI |
| Example Users | Kubernetes probe | Human operator, monitoring system |
## Monitoring Integration

### Current (v1.1.1)

- Structured logging to Railway logs
- Manual correlation via Railway dashboard
- Request ID tracing

### Future Enhancements

**Prometheus metrics:**

```
branchpy_backend_health_db_core{status="connected"} 1.0
branchpy_backend_health_db_telemetry{status="error"} 0.0
branchpy_backend_health_db_latency_ms 234.5
```

**Alerting:**

```yaml
alert: BackendDBDegraded
if: /health/db status == "degraded" for 5m
```

**Dashboard:**

```
BranchPy Infrastructure Status
├─ Backend Service: 🟢 UP
├─ Core DB: 🟢 CONNECTED
└─ Telemetry DB: 🔴 ERROR (password auth failed)
   └─ Action: Check Railway dashboard, verify creds
```
## Troubleshooting

### `/health/db` returns `degraded` or `down`

- Check the response JSON for the `errors.<pool>` message
- Correlate with `version` to ensure you are testing the right deploy
- Use the Railway dashboard to verify:
  - DB is running
  - Credentials are correct
  - Network connectivity exists
- Check firewall/proxy rules (Railway internal network)

### `/health/db` returns `down` but auth still works

- Unlikely (the core DB is required for auth)
- If it happens: likely a race condition in the pool check; retry immediately
- If persistent: the core DB is flaky and needs triage

### No response from `/health/db` (timeout)

- Backend might be hanging on pool acquisition
- Check Railway logs for hung connections
- Restart the backend if needed
## Design Rationale

### Why two endpoints?

During Phase-1 closure, database connectivity issues were diagnosed by:

- ❌ Observing absent API responses (not a crash, not an error, just silence)
- ❌ Guessing whether the core or telemetry DB was the problem
- ❌ Waiting for Railway logs to load (verbose, unstructured)
- ❌ Having no structured way to check “can this backend actually talk to the DB right now?”

**Solution:** Explicit health endpoints that:

- Never mask failures
- Provide fast diagnosis (which pool failed)
- Are safe to call frequently
- Return machine-parseable output
## Related Documentation

- Logging System — comprehensive logging documentation
- Logging Configuration — log levels, sinks, rotation
- Server Architecture
- API Reference — API contracts
- Backend Deployment — production deployment

## Source References

This document consolidates information from:

- `docs/v1.1.0/backend/health-and-observability/HEALTH_DB_ENDPOINT.md`