
# Backend Health & Observability

Version: 1.1.1
Last Updated: January 23, 2026
Status: Production
Component: rc-backend


Overview

The BranchPy backend exposes health and observability endpoints to ensure system reliability and enable fast diagnosis of runtime issues. This document covers:

  1. Health check endpoints
  2. Database connectivity diagnostics
  3. Structured logging integration
  4. Monitoring integration

For comprehensive logging configuration and usage, see the Logging System Documentation.


Health Check Endpoints

/health (Lightweight)

Purpose: Service uptime verification - no I/O operations

Request:

GET /health

Response (Always 200):

{
  "status": "ok",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics: “Is the Node.js process up and routable?”

Use Cases:

  • Load balancer health checks
  • Simple “is backend alive?” monitoring
  • Kubernetes liveness probes

Response Time: <10ms
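Because this endpoint performs no I/O, its handler reduces to building a static payload. A minimal sketch (framework wiring omitted; the function name is hypothetical):

```typescript
// /health payload: proves the Node.js process is up and routable, nothing more.
// No database or network access, so it always succeeds with HTTP 200.
function healthResponse(): { status: "ok"; timestamp: string } {
  return { status: "ok", timestamp: new Date().toISOString() };
}
```

In an Express-style router this object would be returned directly with a 200 status from the `GET /health` handler.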


/health/db (Diagnostic)

Purpose: Database connectivity diagnostic with explicit pool checks

Request:

GET /health/db
X-Railway-Request-Id: <optional, for log correlation>

Response Contract:

All Pools Healthy → 200 OK

{
  "status": "ok",
  "core_db": "connected",
  "telemetry_db": "connected",
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics: Both pools answered SELECT 1 within 500ms. Ready for requests.


Partial Failure → 503 Service Unavailable

{
  "status": "degraded",
  "core_db": "connected",
  "telemetry_db": "error",
  "errors": {
    "telemetry_db": "password authentication failed"
  },
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics:

  • Core DB healthy → auth can work
  • Telemetry DB failed → analytics/logging offline, but users unblocked
  • HTTP 503 signals: “Something is broken, handle gracefully”

Total Failure → 503 Service Unavailable

{
  "status": "down",
  "core_db": "error",
  "telemetry_db": "error",
  "errors": {
    "core_db": "connection timeout",
    "telemetry_db": "password authentication failed"
  },
  "version": "rc-backend@abc1234",
  "timestamp": "2026-01-23T14:32:11.382Z"
}

Semantics: Backend cannot authenticate users. Requests will fail. Escalate to incident response.


Status Field Semantics

| Status | Meaning | HTTP Code | Action |
|--------|---------|-----------|--------|
| ok | Both pools healthy | 200 | Proceed normally |
| degraded | One pool failed | 503 | Log issue, telemetry offline but auth works |
| down | Both pools failed | 503 | Incident: scale down, page on-call |
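The mapping from the two pool statuses to the overall status field and HTTP code can be sketched as a small pure function (a sketch, not the production implementation):

```typescript
type PoolStatus = "connected" | "error";
type Overall = { status: "ok" | "degraded" | "down"; httpCode: 200 | 503 };

// Derive the overall status and HTTP code from the two pool checks.
// Invariant: never return 200 unless BOTH pools are healthy.
function overallStatus(core_db: PoolStatus, telemetry_db: PoolStatus): Overall {
  if (core_db === "connected" && telemetry_db === "connected") {
    return { status: "ok", httpCode: 200 };
  }
  if (core_db === "error" && telemetry_db === "error") {
    return { status: "down", httpCode: 503 };
  }
  return { status: "degraded", httpCode: 503 };
}
```

Note that `degraded` and `down` share HTTP 503; callers that only see the status code treat both as "handle gracefully", while the JSON body distinguishes which pool failed.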

Pool Check Implementation

Query

SELECT 1;

Properties:

  • Minimal overhead — No joins, no writes, no locks
  • Proves connectivity — Actually uses connection from pool
  • Non-destructive — Never modifies data

Timeout

  • Per pool: 500ms max
  • Rationale: Failed DB should fail fast; slow response = dead connection
  • Concurrency: Both pool checks run in parallel
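The query, timeout, and parallelism described above could be combined like this. This is a sketch: the `Pool` interface stands in for whatever pool client the backend uses (e.g. `pg.Pool`, an assumption), and the function names are hypothetical:

```typescript
interface Pool {
  query(sql: string): Promise<unknown>; // stand-in for e.g. pg.Pool (assumption)
}
type PoolResult = { status: "connected" } | { status: "error"; message: string };

// Race SELECT 1 against a timeout (500 ms in production).
// A slow response is treated the same as a dead connection.
async function checkPool(pool: Pool, timeoutMs = 500): Promise<PoolResult> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("connection timeout")), timeoutMs),
  );
  try {
    await Promise.race([pool.query("SELECT 1"), timeout]);
    return { status: "connected" };
  } catch (err) {
    return { status: "error", message: err instanceof Error ? err.message : String(err) };
  }
}

// Both pool checks run in parallel, so worst-case latency is one timeout, not two.
async function checkBothPools(core: Pool, telemetry: Pool, timeoutMs = 500) {
  const [core_db, telemetry_db] = await Promise.all([
    checkPool(core, timeoutMs),
    checkPool(telemetry, timeoutMs),
  ]);
  return { core_db, telemetry_db };
}
```

Because every failure path resolves to an `error` result rather than throwing, the endpoint can always build a complete response body, which is what makes the per-pool `errors.<pool>` reporting below possible.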

Error Handling

All errors are captured and reported:

| Error Type | Example | Handling |
|------------|---------|----------|
| Auth failure | “password authentication failed” | Reported in errors.<pool> |
| Connection refused | “connect ECONNREFUSED” | Reported in errors.<pool> |
| Timeout | “connection timeout” | Reported in errors.<pool> |
| Network unreachable | “ENETUNREACH” | Reported in errors.<pool> |
| Pool exhausted | (rare in SELECT 1) | Treated as pool unavailable |

Key guarantee: Never returns 200 OK if a pool failed.


Version Identification

{
  "version": "rc-backend@abc1234"
}

Identifies the running build:

  • Source: RAILWAY_GIT_COMMIT_SHA (preferred)
  • Fallback: GITHUB_SHA, then BUILD_ID
  • Format: rc-backend@<commit_hash>

Purpose: Correlate health check with which deploy is unhealthy.

Example in incident:

[INCIDENT] /health/db returns 503
           Check: version = "rc-backend@abc1234"
           Compare to: deployed commit = "abc1234" ✓ (current)
           → DB is actually down, not stale version
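The fallback order above can be sketched as a small resolver. The 7-character truncation and the `"unknown"` fallback are assumptions inferred from the example values in this document, not confirmed behavior:

```typescript
// Resolve the build identifier from the environment, in preference order:
// RAILWAY_GIT_COMMIT_SHA, then GITHUB_SHA, then BUILD_ID.
// Truncating to a 7-character short SHA and falling back to "unknown"
// are assumptions based on the example values shown above.
function buildVersion(env: Record<string, string | undefined>): string {
  const sha = env.RAILWAY_GIT_COMMIT_SHA ?? env.GITHUB_SHA ?? env.BUILD_ID;
  return `rc-backend@${sha ? sha.slice(0, 7) : "unknown"}`;
}
```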

Structured Logging

Every /health/db call emits exactly one log entry:

[HEALTH_DB] { request_id: "vgvTerVLTFyDlzLuCx5-qw", core_db: "connected", telemetry_db: "error", result: "degraded", version: "rc-backend@abc1234", timestamp: "2026-01-23T14:32:11.382Z" }

Properties:

  • Correlate with Railway request ID (paste into logs search)
  • Machine-parseable (grep/awk safe)
  • Single line (no log spam)
  • Includes result + both pool statuses

Parsing example:

# Show all degraded health checks today
grep "\[HEALTH_DB\]" logs.txt | grep "degraded" | tail -20

See also: Technical/logging/README.md


Comparison: /health vs /health/db

| Aspect | /health | /health/db |
|--------|---------|------------|
| Purpose | Service uptime | DB connectivity |
| DB Access | No | Yes (SELECT 1 on both pools) |
| HTTP Code (OK) | 200 | 200 |
| HTTP Code (Fail) | Never fails | 503 |
| Response Time | <10ms | <1000ms (includes 500ms timeout) |
| Use Case | Load balancers | Monitoring, diagnostics, CI |
| Example Users | Kubernetes probe | Human operator, monitoring system |

Monitoring Integration

Current (v1.1.1)

  • Structured logging to Railway logs
  • Manual correlation via Railway dashboard
  • Request ID tracing

Future Enhancements

Prometheus Metrics:

branchpy_backend_health_db_core{status="connected"}  1.0
branchpy_backend_health_db_telemetry{status="error"}  0.0
branchpy_backend_health_db_latency_ms  234.5

Alerting:

alert: BackendDBDegraded
  if: /health/db status == "degraded" for 5m

Dashboard:

BranchPy Infrastructure Status
├─ Backend Service: 🟢 UP
├─ Core DB: 🟢 CONNECTED
└─ Telemetry DB: 🔴 ERROR (password auth failed)
   └─ Action: Check Railway dashboard, verify creds

Troubleshooting

/health/db returns degraded or down

  1. Check response JSON for errors.<pool> message
  2. Correlate with version to ensure you’re testing the right deploy
  3. Use Railway dashboard to verify:
    • DB is running
    • Credentials are correct
    • Network connectivity exists
  4. Check firewall/proxy rules (Railway internal network)

/health/db returns down but auth still works

  • Unlikely (core DB is required for auth)
  • If it happens: race condition in pool check, retry immediately
  • If persistent: core DB is flaky, needs triage

No response from /health/db (timeout)

  • Backend might be hanging on pool acquisition
  • Check Railway logs for hung connections
  • Restart backend if needed

Design Rationale

Why two endpoints?

During Phase-1 closure, database connectivity issues were diagnosed by:

  • ❌ Observing absent API responses (not crash, not error, just silence)
  • ❌ Guessing whether core or telemetry DB was the problem
  • ❌ Waiting for Railway logs to load (verbose, unstructured)
  • ❌ No structured way to check “can this backend actually talk to DB right now?”

Solution: Explicit health endpoints that:

  • Never mask failures
  • Provide fast diagnosis (which pool failed)
  • Are safe to call frequently
  • Return machine-parseable output


Source References

This document consolidates information from:

  • docs/v1.1.0/backend/health-and-observability/HEALTH_DB_ENDPOINT.md