
# Backend Health & Observability

Version: 1.1.1
Last Updated: January 23, 2026
Status: Production
Component: rc-backend


Overview

The BranchPy backend exposes health and observability endpoints to ensure system reliability and enable fast diagnosis of runtime issues. This document covers:

  1. Health check endpoints
  2. Database connectivity diagnostics
  3. Structured logging integration
  4. Monitoring integration

For comprehensive logging configuration and usage, see the Logging System Documentation.


Health Check Endpoints

/health (Lightweight)

Purpose: Service uptime verification - no I/O operations

Request:

GET /health

Response (Always 200):

{
  "status": "ok",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics: “Is the Node.js process up and routable?”

Use Cases:

  • Load balancer health checks
  • Simple “is backend alive?” monitoring
  • Kubernetes liveness probes

Response Time: <10ms
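Because this endpoint performs no I/O, its handler reduces to building a static payload. A minimal sketch (framework wiring omitted; the function name is hypothetical):

```typescript
// /health payload: proves the Node.js process is up and routable, nothing more.
// No database or network access, so it always succeeds with HTTP 200.
function healthResponse(): { status: "ok"; timestamp: string } {
  return { status: "ok", timestamp: new Date().toISOString() };
}
```

In an Express-style router this object would be returned directly with a 200 status from the `GET /health` handler.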


/health/db (Diagnostic)

Purpose: Database connectivity diagnostic with explicit pool checks

Request:

GET /health/db
X-Railway-Request-Id: <optional, for log correlation>

Response Contract:

All Pools Healthy → 200 OK

{
  "status": "ok",
  "core_db": "connected",
  "telemetry_db": "connected",
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics: Both pools answered SELECT 1 within 500ms. Ready for requests.


Partial Failure → 503 Service Unavailable

{
  "status": "degraded",
  "core_db": "connected",
  "telemetry_db": "error",
  "errors": {
    "telemetry_db": "password authentication failed"
  },
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics:

  • Core DB healthy → auth can work
  • Telemetry DB failed → analytics/logging offline, but users unblocked
  • HTTP 503 signals: “Something is broken, handle gracefully”

Total Failure → 503 Service Unavailable

{
  "status": "down",
  "core_db": "error",
  "telemetry_db": "error",
  "errors": {
    "core_db": "connection timeout",
    "telemetry_db": "password authentication failed"
  },
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics: Backend cannot authenticate users. Requests will fail. Escalate to incident response.


Status Field Semantics

| Status | Meaning | HTTP Code | Action |
|--------|---------|-----------|--------|
| ok | Both pools healthy | 200 | Proceed normally |
| degraded | One pool failed | 503 | Log issue, telemetry offline but auth works |
| down | Both pools failed | 503 | Incident: scale down, page on-call |
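The mapping from the two pool statuses to the overall status field and HTTP code can be sketched as a small pure function (a sketch, not the production implementation):

```typescript
type PoolStatus = "connected" | "error";
type Overall = { status: "ok" | "degraded" | "down"; httpCode: 200 | 503 };

// Derive the overall status and HTTP code from the two pool checks.
// Invariant: never return 200 unless BOTH pools are healthy.
function overallStatus(core_db: PoolStatus, telemetry_db: PoolStatus): Overall {
  if (core_db === "connected" && telemetry_db === "connected") {
    return { status: "ok", httpCode: 200 };
  }
  if (core_db === "error" && telemetry_db === "error") {
    return { status: "down", httpCode: 503 };
  }
  return { status: "degraded", httpCode: 503 };
}
```

Note that `degraded` and `down` share HTTP 503; callers that only see the status code treat both as "handle gracefully", while the JSON body distinguishes which pool failed.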

Pool Check Implementation

Query

SELECT 1;

Properties:

  • Minimal overhead — No joins, no writes, no locks
  • Proves connectivity — Actually uses connection from pool
  • Non-destructive — Never modifies data

Timeout

  • Per pool: 500ms max
  • Rationale: Failed DB should fail fast; slow response = dead connection
  • Concurrency: Both pool checks run in parallel
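The query, timeout, and parallelism described above could be combined like this. This is a sketch: the `Pool` interface stands in for whatever pool client the backend uses (e.g. `pg.Pool`, an assumption), and the function names are hypothetical:

```typescript
interface Pool {
  query(sql: string): Promise<unknown>; // stand-in for e.g. pg.Pool (assumption)
}
type PoolResult = { status: "connected" } | { status: "error"; message: string };

// Race SELECT 1 against a timeout (500 ms in production).
// A slow response is treated the same as a dead connection.
async function checkPool(pool: Pool, timeoutMs = 500): Promise<PoolResult> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("connection timeout")), timeoutMs),
  );
  try {
    await Promise.race([pool.query("SELECT 1"), timeout]);
    return { status: "connected" };
  } catch (err) {
    return { status: "error", message: err instanceof Error ? err.message : String(err) };
  }
}

// Both pool checks run in parallel, so worst-case latency is one timeout, not two.
async function checkBothPools(core: Pool, telemetry: Pool, timeoutMs = 500) {
  const [core_db, telemetry_db] = await Promise.all([
    checkPool(core, timeoutMs),
    checkPool(telemetry, timeoutMs),
  ]);
  return { core_db, telemetry_db };
}
```

Because every failure path resolves to an `error` result rather than throwing, the endpoint can always build a complete response body, which is what makes the per-pool `errors.<pool>` reporting below possible.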

Error Handling

All errors are captured and reported:

| Error Type | Example | Handling |
|------------|---------|----------|
| Auth failure | “password authentication failed” | Reported in errors.<pool> |
| Connection refused | “connect ECONNREFUSED” | Reported in errors.<pool> |
| Timeout | “connection timeout” | Reported in errors.<pool> |
| Network unreachable | “ENETUNREACH” | Reported in errors.<pool> |
| Pool exhausted | (rare in SELECT 1) | Treated as pool unavailable |

Key guarantee: Never returns 200 OK if a pool failed.


Version Identification

{
  "version": "rc-backend@abc1234"
}

Identifies the running build:

  • Source: RAILWAY_GIT_COMMIT_SHA (preferred)
  • Fallback: GITHUB_SHA, then BUILD_ID
  • Format: rc-backend@<commit_hash>

Purpose: Correlate health check with which deploy is unhealthy.

Example in incident:

[INCIDENT] /health/db returns 503
           Check: version = "rc-backend@abc1234"
           Compare to: deployed commit = "abc1234" ✓ (current)
           → DB is actually down, not stale version
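The fallback order above can be sketched as a small resolver. The 7-character truncation and the `"unknown"` fallback are assumptions inferred from the example values in this document, not confirmed behavior:

```typescript
// Resolve the build identifier from the environment, in preference order:
// RAILWAY_GIT_COMMIT_SHA, then GITHUB_SHA, then BUILD_ID.
// Truncating to a 7-character short SHA and falling back to "unknown"
// are assumptions based on the example values shown above.
function buildVersion(env: Record<string, string | undefined>): string {
  const sha = env.RAILWAY_GIT_COMMIT_SHA ?? env.GITHUB_SHA ?? env.BUILD_ID;
  return `rc-backend@${sha ? sha.slice(0, 7) : "unknown"}`;
}
```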

Structured Logging

Every /health/db call emits exactly one log entry:

[HEALTH_DB] { request_id: "vgvTerVLTFyDlzLuCx5-qw", core_db: "connected", telemetry_db: "error", result: "degraded", version: "rc-backend@abc1234", timestamp: "2026-01-23T14:32:11.382Z" }

Properties:

  • Correlate with Railway request ID (paste into logs search)
  • Machine-parseable (grep/awk safe)
  • Single line (no log spam)
  • Includes result + both pool statuses

Parsing example:

# Show all degraded health checks today
grep "\[HEALTH_DB\]" logs.txt | grep "degraded" | tail -20

See also: Technical/logging/README.md


Comparison: /health vs /health/db

| Aspect | /health | /health/db |
|--------|---------|------------|
| Purpose | Service uptime | DB connectivity |
| DB Access | No | Yes (SELECT 1 on both pools) |
| HTTP Code (OK) | 200 | 200 |
| HTTP Code (Fail) | Never fails | 503 |
| Response Time | <10ms | <1000ms (includes 500ms timeout) |
| Use Case | Load balancers | Monitoring, diagnostics, CI |
| Example Users | Kubernetes probe | Human operator, monitoring system |

Monitoring Integration

Current (v1.1.1)

  • Structured logging to Railway logs
  • Manual correlation via Railway dashboard
  • Request ID tracing

Future Enhancements

Prometheus Metrics:

branchpy_backend_health_db_core{status="connected"}  1.0
branchpy_backend_health_db_telemetry{status="error"}  0.0
branchpy_backend_health_db_latency_ms  234.5

Alerting:

alert: BackendDBDegraded
  if: /health/db status == "degraded" for 5m

Dashboard:

BranchPy Infrastructure Status
├─ Backend Service: 🟢 UP
├─ Core DB: 🟢 CONNECTED
└─ Telemetry DB: 🔴 ERROR (password auth failed)
   └─ Action: Check Railway dashboard, verify creds

Troubleshooting

/health/db returns degraded or down

  1. Check response JSON for errors.<pool> message
  2. Correlate with version to ensure you’re testing the right deploy
  3. Use Railway dashboard to verify:
    • DB is running
    • Credentials are correct
    • Network connectivity exists
  4. Check firewall/proxy rules (Railway internal network)

/health/db returns down but auth still works

  • Unlikely (core DB is required for auth)
  • If it happens: race condition in pool check, retry immediately
  • If persistent: core DB is flaky, needs triage

No response from /health/db (timeout)

  • Backend might be hanging on pool acquisition
  • Check Railway logs for hung connections
  • Restart backend if needed

Design Rationale

Why two endpoints?

During Phase-1 closure, database connectivity issues were diagnosed by:

  • ❌ Observing absent API responses (not crash, not error, just silence)
  • ❌ Guessing whether core or telemetry DB was the problem
  • ❌ Waiting for Railway logs to load (verbose, unstructured)
  • ❌ No structured way to check “can this backend actually talk to DB right now?”

Solution: Explicit health endpoints that:

  • Never mask failures
  • Provide fast diagnosis (which pool failed)
  • Are safe to call frequently
  • Return machine-parseable output


Source References

This document consolidates information from:

  • docs/v1.1.0/backend/health-and-observability/HEALTH_DB_ENDPOINT.md