BranchPy Server Lifecycle Management

Version: 1.1.1
Last Updated: January 23, 2026
Status: Production
Audience: Developers, Extension Maintainers

Overview

The BranchPy daemon manages its own lifecycle automatically. In normal operation it needs no manual intervention:

✅ Auto-Start - Daemon starts automatically when needed
✅ Auto-Diagnose - Startup failures provide actionable diagnostics
✅ Auto-Recovery - Crashes trigger automatic restart and reload
✅ Auto-Stop - Clean shutdown when VS Code closes

Core Philosophy: Users should never think about server management. Everything “just works.”

For comprehensive observability and monitoring details, see Technical/backend/observability.md. For logging configuration and troubleshooting, see Technical/logging/README.md.

Auto-Start

When It Happens

User opens any BranchPy feature (PILOT, Compare Lanes, Semantics)
Extension detects daemon is not running
Daemon spawns automatically in background
Health check confirms server ready
Feature loads seamlessly

Implementation Flow

User Action (Click PILOT)
    ↓
Extension: ensureRunning()
    ↓
Check metadata file: ~/.branchpy/default/wsd.meta.json
    ↓ (if not found or PID not alive)
Start daemon: python branchpy/ws/run.py
    ↓
Wait for health check: GET /pilot/health
    ↓ (max 10 seconds)
Server ready → proceed with UI

Code Example (TypeScript)

async ensureRunning(): Promise<boolean> {
  // 1. Quick check: already running?
  if (this.isRunning()) {
    return true;
  }

  // 2. Start daemon if not running
  const started = await this.start();
  if (!started) {
    return false;
  }

  // 3. Wait for health check (max 10 seconds)
  const healthy = await this.waitForHealth(10000);
  return healthy;
}
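The `isRunning()` check in step 1 is not shown above. The following is a minimal sketch of how the metadata check from the implementation flow might look. The `DaemonMeta` fields (`pid`, `port`) are assumptions, since the real schema of `wsd.meta.json` is not documented here, and the liveness probe is injected so the logic can be exercised without a real process.

```typescript
// Hypothetical shape of wsd.meta.json; the real schema may differ.
interface DaemonMeta {
  pid: number;
  port: number;
}

// Parses the metadata file contents; returns null for missing or malformed data.
function parseMeta(raw: string | null): DaemonMeta | null {
  if (raw === null) return null;
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return null;
    const { pid, port } = data as Record<string, unknown>;
    if (typeof pid !== "number" || !Number.isInteger(pid) || pid <= 0) return null;
    if (typeof port !== "number" || !Number.isInteger(port)) return null;
    return { pid, port };
  } catch {
    return null; // Corrupted metadata is treated the same as no metadata.
  }
}

// True only when metadata exists and its PID is still alive.
// `isPidAlive` is injected; in Node it could wrap process.kill(pid, 0) in try/catch.
function isRunning(raw: string | null, isPidAlive: (pid: number) => boolean): boolean {
  const meta = parseMeta(raw);
  return meta !== null && isPidAlive(meta.pid);
}
```

Treating unreadable metadata as "not running" keeps the auto-start path simple: the worst case is one extra start attempt, which the port check would then catch.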

Configuration

{
  "branchpy.daemon.autoStart": true,  // default
  "branchpy.daemon.host": "127.0.0.1",
  "branchpy.daemon.port": 8765
}
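As an illustration of how these settings could feed the health check, the sketch below merges user overrides with the defaults shown above and derives the `/pilot/health` URL. The names `DaemonConfig`, `resolveConfig`, and `healthUrl` are hypothetical and not part of the extension's API.

```typescript
// Mirrors the three settings above (without the "branchpy.daemon." prefix).
interface DaemonConfig {
  autoStart: boolean;
  host: string;
  port: number;
}

const DEFAULTS: DaemonConfig = { autoStart: true, host: "127.0.0.1", port: 8765 };

// Applies user overrides on top of the defaults and rejects invalid ports.
function resolveConfig(overrides: Partial<DaemonConfig>): DaemonConfig {
  const cfg: DaemonConfig = { ...DEFAULTS, ...overrides };
  if (!Number.isInteger(cfg.port) || cfg.port < 1 || cfg.port > 65535) {
    throw new RangeError(`Invalid daemon port: ${cfg.port}`);
  }
  return cfg;
}

// Builds the URL polled by the health check.
function healthUrl(cfg: DaemonConfig): string {
  return `http://${cfg.host}:${cfg.port}/pilot/health`;
}
```

Deriving the URL from configuration in one place avoids hard-coded ports drifting apart between the extension and the webview.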

Auto-Diagnose

When daemon startup fails, auto-diagnose runs a series of checks to identify the root cause and provide actionable guidance. Diagnostic results are automatically logged for analysis. For detailed logging configuration, see Technical/logging/README.md.

Diagnostic Checks

1. Python Availability

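async checkPython(): Promise<DiagnosticResult> {
  let output: string;
  try {
    // `exec` is assumed to be the promisified child_process.exec, which
    // rejects when the command is missing or exits with a non-zero code.
    const result = await exec('python --version');
    // Python 2 prints its version to stderr, so inspect both streams.
    output = `${result.stdout}${result.stderr}`;
  } catch {
    return { status: 'error', message: 'Python not found in PATH' };
  }

  const version = output.match(/Python (\d+\.\d+\.\d+)/)?.[1];
  if (!version) {
    return { status: 'error', message: 'Could not determine Python version' };
  }

  const [major, minor] = version.split('.').map(Number);
  if (major < 3 || (major === 3 && minor < 8)) {
    return {
      status: 'error',
      message: `Python ${version} too old (need ≥ 3.8)`
    };
  }

  return { status: 'ok', message: `Python ${version}` };
}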

2. BranchPy Installation

async checkBranchPy(): Promise<DiagnosticResult> {
  try {
    const result = await exec('python -m branchpy --version');
    return { status: 'ok', message: `BranchPy ${result.stdout.trim()}` };
  } catch {
    return { 
      status: 'error', 
      message: 'BranchPy not installed (run: pip install branchpy)' 
    };
  }
}

3. Port Availability

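async checkPort(port: number, host = '127.0.0.1'): Promise<DiagnosticResult> {
  const server = net.createServer();
  try {
    // Bind to the daemon's host and attach the error handler before listening.
    await new Promise<void>((resolve, reject) => {
      server.once('error', reject);
      server.listen(port, host, () => resolve());
    });
    // Release the port before reporting it as available.
    await new Promise<void>(resolve => server.close(() => resolve()));
    return { status: 'ok', message: `Port ${port} available` };
  } catch {
    const pid = await findProcessOnPort(port);
    // taskkill exists only on Windows; suggest kill on Linux/macOS.
    const fix = process.platform === 'win32'
      ? `Run: taskkill /F /PID ${pid}`
      : `Run: kill ${pid}`;
    return { status: 'error', message: `Port ${port} in use by PID ${pid}`, fix };
  }
}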

4. Startup Logs Analysis

async checkStartupLogs(): Promise<DiagnosticResult> {
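  // A Readable stream has no slice(). This assumes the extension buffers the
  // daemon's stdout and stderr lines into a string[] (here `startupLogLines`,
  // a hypothetical field); Python tracebacks are written to stderr.
  const logs = this.startupLogLines.slice(-50);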
  
  if (logs.some(line => line.includes('ModuleNotFoundError'))) {
    return { 
      status: 'error', 
      message: 'Missing Python dependencies',
      fix: 'Run: pip install -r requirements.txt'
    };
  }
  
  if (logs.some(line => line.includes('SyntaxError'))) {
    return { 
      status: 'error', 
      message: 'Syntax error in BranchPy code',
      fix: 'Check recent code changes'
    };
  }
  
  return { status: 'ok', message: 'No startup errors' };
}

User Experience

╔══════════════════════════════════════════════════════╗
║        🩺 BranchPy Daemon Diagnostics                ║
╠══════════════════════════════════════════════════════╣
║ ✅ Python: 3.11.5                                    ║
║ ✅ BranchPy: 0.9.0                                   ║
║ ❌ Port: 8766 in use by PID 12345                    ║
║    Fix: taskkill /F /PID 12345                       ║
║ ✅ Startup Logs: No errors                           ║
╠══════════════════════════════════════════════════════╣
║ Recommendation: Kill conflicting process or use      ║
║                 a different port (e.g., 8767)        ║
╚══════════════════════════════════════════════════════╝

[Kill Process]  [Use Port 8767]  [Show Full Logs]
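The report lines in the panel above could be produced from the individual check results along these lines. This is only a sketch: the `DiagnosticResult` shape is inferred from the check functions in this section, and `formatReport` is a hypothetical helper.

```typescript
// Shape inferred from the diagnostic checks in this section.
type DiagnosticResult = { status: "ok" | "error"; message: string; fix?: string };

// Turns named check results into report lines, adding a "Fix:" line for failures.
function formatReport(results: Array<[string, DiagnosticResult]>): string[] {
  const lines: string[] = [];
  for (const [name, result] of results) {
    const icon = result.status === "ok" ? "✅" : "❌";
    lines.push(`${icon} ${name}: ${result.message}`);
    if (result.status === "error" && result.fix) {
      lines.push(`   Fix: ${result.fix}`);
    }
  }
  return lines;
}

// True when at least one check failed, i.e. the panel should offer actions.
function hasErrors(results: Array<[string, DiagnosticResult]>): boolean {
  return results.some(([, result]) => result.status === "error");
}
```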

Auto-Recovery

Problem Solved

Before: User sees “Can’t reach this page” → must manually restart daemon → refresh webview

After: Webview detects failure → auto-restarts daemon → auto-reloads → user sees brief notification

Architecture

Webview Health Monitor (JavaScript)
  • Polls GET /pilot/health every 10 seconds
  • Tracks consecutive failures (threshold: 2)
    ↓ (2 failures detected)
vscode.postMessage({ type: 'restartDaemon', reason: 'HTTP 500' })
    ↓
Extension: _restartDaemonAndNotify()
  • Show notification: "🔄 Auto-restarting server..."
  • Execute: daemonService.restart()
  • Wait 2 seconds for startup
  • Send 'daemonRestarted' message to webview
    ↓
Webview: window.location.reload()
  • Fresh page load with new daemon

For detailed health check implementation and monitoring, see Technical/backend/observability.md.

Health Monitor Implementation (JavaScript)

function setupHealthMonitor() {
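  let failureCount = 0;
  const MAX_FAILURES = 2;
  const CHECK_INTERVAL = 10000; // 10 seconds
  const CHECK_TIMEOUT = 5000;   // 5 seconds, the runtime health check timeout
  // Keep in sync with branchpy.daemon.host and branchpy.daemon.port.
  const HEALTH_URL = 'http://127.0.0.1:8766/pilot/health';
  let isRecovering = false;

  async function checkHealth() {
    if (isRecovering) return;
    
    try {
      // Abort hung requests so an unresponsive daemon counts as a failure.
      const response = await fetch(HEALTH_URL, {
        signal: AbortSignal.timeout(CHECK_TIMEOUT)
      });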
      if (response.ok) {
        failureCount = 0; // Reset on success
      } else {
        handleFailure(`HTTP ${response.status}`);
      }
    } catch (error) {
      handleFailure(error.message);
    }
  }

  function handleFailure(reason) {
    failureCount++;
    
    if (failureCount >= MAX_FAILURES) {
      attemptAutoRecovery(reason);
    }
  }

  function attemptAutoRecovery(reason) {
    isRecovering = true;
    
    // Notify extension to restart daemon
    vscode.postMessage({ type: 'restartDaemon', reason });
    
    // Lock recovery for 15 seconds (prevents loops)
    setTimeout(() => {
      isRecovering = false;
      failureCount = 0;
    }, 15000);
  }

  setInterval(checkHealth, CHECK_INTERVAL);
}

Restart Handler (TypeScript)

private async _restartDaemonAndNotify(reason: string): Promise<void> {
  try {
    // 1. User notification
    void vscode.window.showInformationMessage(
      `🔄 Auto-restarting BranchPy server (${reason})...`
    );

    // 2. Execute restart
    await vscode.commands.executeCommand('bpy.ws.restart');
    
    // 3. Wait for startup
    await new Promise(resolve => setTimeout(resolve, 2000));
    
    // 4. Notify webview to reload
    this._panel.webview.postMessage({ type: 'daemonRestarted' });
    
    // 5. Success notification
    void vscode.window.showInformationMessage(
      '✅ BranchPy server restarted successfully'
    );
  } catch (error) {
    void vscode.window.showErrorMessage(
      `❌ Failed to restart: ${error}`
    );
  }
}

Recovery Timeline

[00:00] User working normally
[00:05] Daemon crashes
[00:10] Health check #1 fails
[00:12] Status: "Server issue (1/2)"
[00:20] Health check #2 fails
[00:22] Auto-recovery triggered
[00:22] Notification: "🔄 Auto-restarting..."
[00:23] Daemon stops (old process killed)
[00:24] Daemon starts (new process spawned)
[00:25] Webview reloads
[00:25] Notification: "✅ Restarted successfully"
[00:26] User continues working ✨

Restart Policies

The daemon implements the following restart policies to ensure system stability:

Restart Threshold Policy

Maximum restarts: 3 attempts
Time window: 5 minutes
Action on threshold: Disable auto-restart, notify user

Backoff Policy

First restart: Immediate (0s delay)
Second restart: 5s delay
Third restart: 15s delay
After threshold: Manual intervention required

Recovery Lock

Duration: 15 seconds
Purpose: Prevent restart loops during ongoing failures
Behavior: Ignore health check failures while locked
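To make the interaction between the threshold and backoff policies concrete, here is a minimal sketch of a policy object. It assumes the 5-minute window is a sliding window over restart timestamps; the class and method names are hypothetical, and the real implementation may count restarts differently.

```typescript
// Combines the restart threshold (3 restarts per 5 minutes) with the
// backoff schedule (0s, 5s, 15s) described above.
class RestartPolicy {
  private history: number[] = []; // timestamps (ms) of restarts inside the window

  constructor(
    private readonly maxRestarts: number = 3,
    private readonly windowMs: number = 5 * 60 * 1000,
    private readonly backoffMs: readonly number[] = [0, 5000, 15000],
  ) {}

  // Returns the delay (ms) before the next restart and records it, or null
  // when the threshold is reached and auto-restart should be disabled.
  nextDelay(nowMs: number): number | null {
    // Forget restarts that have slid out of the window.
    this.history = this.history.filter((t) => nowMs - t < this.windowMs);
    if (this.history.length >= this.maxRestarts) {
      return null;
    }
    const index = Math.min(this.history.length, this.backoffMs.length - 1);
    const delay = this.backoffMs[index] ?? 0;
    this.history.push(nowMs);
    return delay;
  }
}
```

A caller would treat `null` as the signal to show the "auto-restart disabled" notification and wait for a manual restart.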

Configuration

{
  "branchpy.autoRecovery.enabled": true,
  "branchpy.autoRecovery.maxFailures": 2,
  "branchpy.autoRecovery.checkInterval": 10000,
  "branchpy.autoRecovery.recoveryLockTimeout": 15000,
  "branchpy.autoRecovery.maxRestarts": 3,
  "branchpy.autoRecovery.restartWindow": 300000
}

Auto-Stop

Purpose

Gracefully stop daemon when VS Code closes to prevent orphaned processes.

Implementation Flow

User closes VS Code
    ↓
VS Code: extension.deactivate()
    ↓
daemonService.stop()
  • Send SIGTERM to daemon process
  • Wait 5 seconds for graceful exit
  • Send SIGKILL if still running (force)
  • Clean metadata files
    ↓
Daemon exits cleanly

Graceful Stop (TypeScript)

async stop(): Promise<void> {
  if (!this.daemonProcess) {
    return; // Already stopped
  }

  const pid = this.daemonProcess.pid;
  
  try {
    if (pid !== undefined) {
      // 1. Send SIGTERM (graceful shutdown). On Windows, Node terminates
      //    the process unconditionally instead of delivering a signal.
      process.kill(pid, 'SIGTERM');
      
      // 2. Wait up to 5 seconds for exit
      const exited = await this.waitForExit(pid, 5000);
      
      if (!exited) {
        // 3. Force kill if still running
        console.warn(`PID ${pid} didn't exit gracefully, forcing...`);
        process.kill(pid, 'SIGKILL');
      }
    }
  } catch (error) {
    // process.kill throws if the process has already exited
    console.error(`Error stopping PID ${pid}:`, error);
  } finally {
    this.daemonProcess = null;
    
    // 4. Clean metadata and 5. update UI, even when signalling failed
    await this.cleanMetadata();
    this.updateStatusBar('stopped');
  }
}

Daemon SIGTERM Handler (Python)

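import logging
import signal
import sys

logger = logging.getLogger(__name__)
# `app` is assumed to be the daemon's server instance, created elsewhere in this module.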

def sigterm_handler(signum, frame):
    """Graceful shutdown on SIGTERM"""
    logger.info("[Daemon] Received SIGTERM, shutting down...")
    
    # 1. Close active connections
    if app:
        app.shutdown()
    
    # 2. Flush logs
    logging.shutdown()
    
    # 3. Exit cleanly
    sys.exit(0)

# Register handler
signal.signal(signal.SIGTERM, sigterm_handler)

State Machine

┌──────────┐
│ STOPPED  │ ← Initial state
└──────────┘
     │ [User opens feature]
     ▼
┌──────────┐
│ STARTING │ ← Auto-start triggered
└──────────┘
     │ [Health check passes]
     ▼
┌──────────┐
│ RUNNING  │ ← Normal operation
└──────────┘
     │         │
     │         │ [Crash detected]
     │         ▼
     │    ┌──────────┐
     │    │RESTARTING│ ← Auto-recovery
     │    └──────────┘
     │         │
     │         ▼
     │    ┌──────────┐
     │    │ RUNNING  │
     │    └──────────┘
     │
     │ [VS Code closes]
     ▼
┌──────────┐
│ STOPPING │ ← Auto-stop triggered
└──────────┘
     │
     ▼
┌──────────┐
│ STOPPED  │ ← Clean exit
└──────────┘
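The diagram can also be expressed as a transition table, which makes invalid transitions explicit. This is a sketch only: the state names come from the diagram, while the event names are hypothetical labels for the bracketed triggers. The diagram does not show what happens when startup fails, so that case is left out here as well.

```typescript
type DaemonState = "STOPPED" | "STARTING" | "RUNNING" | "RESTARTING" | "STOPPING";

// Hypothetical event names for the triggers shown in brackets in the diagram.
type DaemonEvent =
  | "featureOpened"
  | "healthCheckPassed"
  | "crashDetected"
  | "restartCompleted"
  | "editorClosed"
  | "processExited";

// Each state lists only the events it accepts.
const TRANSITIONS: Record<DaemonState, Partial<Record<DaemonEvent, DaemonState>>> = {
  STOPPED: { featureOpened: "STARTING" },
  STARTING: { healthCheckPassed: "RUNNING" },
  RUNNING: { crashDetected: "RESTARTING", editorClosed: "STOPPING" },
  RESTARTING: { restartCompleted: "RUNNING" },
  STOPPING: { processExited: "STOPPED" },
};

// Returns the next state, or throws when the event is not valid in this state.
function transition(state: DaemonState, event: DaemonEvent): DaemonState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`Invalid transition: ${state} on ${event}`);
  }
  return next;
}
```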

Diagnostic Checks and Health Monitoring

Health Check Flow

The daemon implements comprehensive health checks at multiple levels:

Startup Health Checks

Python environment validation
- Python version ≥ 3.8
- Required packages installed
- Virtual environment activated (if configured)
Network availability
- Port binding successful
- No conflicting processes
- Firewall rules allow local connections
Resource availability
- Sufficient memory (minimum 100MB free)
- Disk space for logs and metadata
- File descriptor limits not exceeded

Runtime Health Checks

Endpoint: GET /pilot/health
Frequency: Every 10 seconds (configurable)
Timeout: 5 seconds

Expected response:

{
  "status": "healthy",
  "uptime": 3600,
  "version": "1.1.1",
  "pid": 12345
}
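A client may want to validate this payload rather than rely on the HTTP status alone. The sketch below checks the four fields shown in the expected response; `parseHealth` and `isHealthy` are hypothetical helpers, and it assumes any `status` other than "healthy" should count as a failed check.

```typescript
// Fields taken from the expected response above.
interface HealthResponse {
  status: string;
  uptime: number;
  version: string;
  pid: number;
}

// Parses a response body; returns null if it is not valid JSON or a field
// is missing or has the wrong type.
function parseHealth(body: string): HealthResponse | null {
  let data: unknown;
  try {
    data = JSON.parse(body);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { status, uptime, version, pid } = data as Record<string, unknown>;
  if (typeof status !== "string" || typeof version !== "string") return null;
  if (typeof uptime !== "number" || typeof pid !== "number") return null;
  return { status, uptime, version, pid };
}

// Assumption: only the literal status "healthy" counts as a passing check.
function isHealthy(body: string): boolean {
  const health = parseHealth(body);
  return health !== null && health.status === "healthy";
}
```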

Deep Health Checks (Periodic)

Frequency: Every 60 seconds
Checks:
- Memory usage within limits
- No deadlocked threads
- Database connections healthy
- Cache hit rates acceptable
- Request queue not backing up

For comprehensive observability metrics and monitoring dashboards, see Technical/backend/observability.md.

Troubleshooting

Issue: Daemon Won’t Start

Symptoms: Webview shows “Can’t reach this page”, status bar shows red

Auto-Diagnose Checks:

✅ Python installed? → python --version
✅ BranchPy installed? → python -m branchpy --version
✅ Port available? → Check for conflicts
✅ Startup logs? → .branchpy/logs/ws_daemon.log

Manual Fix:

Command Palette → "BranchPy: Run Diagnostics"

For detailed logging configuration and log analysis, see Technical/logging/README.md.

Issue: Daemon Keeps Crashing

Symptoms: Auto-restart notifications every 30 seconds

Auto-Recovery Behavior:

After 3 failed restarts in 5 minutes → Disables auto-restart
Shows: “❌ Auto-restart disabled (too many failures)”

Manual Fix:

Check Output panel: “BranchPy Server” logs
Identify error (syntax error, missing dependency, etc.)
Fix error in code
Command Palette → "BranchPy: Restart Daemon"

Common Causes:

Python syntax errors in recent changes
Missing or incompatible dependencies
Port conflicts with other applications
Insufficient system resources (memory, disk space)
Corrupted metadata files

For comprehensive error analysis and debugging, see Technical/logging/README.md.

Issue: Orphaned Daemon After Crash

Symptoms: VS Code crashed, daemon still running, new window can’t start

Auto-Detect: On extension activation, checks for orphaned daemons:

const orphanedPid = await findOrphanedDaemon();

if (orphanedPid) {
  const action = await vscode.window.showWarningMessage(
    `Found orphaned daemon (PID ${orphanedPid})`,
    'Kill Process', 'Adopt Process', 'Ignore'
  );
  
  if (action === 'Kill Process') {
    process.kill(orphanedPid, 'SIGKILL');
  }
}

Manual Fix:

# Windows (PowerShell)
Get-CimInstance Win32_Process -Filter "Name = 'python.exe'" | Where-Object { $_.CommandLine -like '*branchpy*' } | ForEach-Object { Stop-Process -Id $_.ProcessId -Force }

# Linux/macOS
pkill -f "python.*branchpy"

Performance Targets

Latency Budget

Operation	Target	Actual	P99
Health check	<10ms	5-8ms	12ms
Auto-start	<3s	1.5-2s	2.8s
Auto-recovery	<5s	3-4s	5.5s
Auto-stop	<2s	1s	1.8s

Resource Limits

Resource	Minimum	Recommended	Maximum
Memory	100MB	256MB	512MB
CPU (idle)	0-1%	<2%	5%
CPU (active)	5-15%	<25%	50%
Disk I/O	Minimal	<10MB/s	50MB/s

Metrics

Lifecycle Metrics

Uptime: Seconds since last start
Restart count: Total restarts in current session
Failed health checks: Count in last 24 hours
Recovery attempts: Successful vs. failed recoveries
Graceful stops: Count of graceful stops vs. forced terminations

Performance Metrics

Health check latency: P50, P95, P99
Memory usage: RSS (resident set size), heap size
Request count: Total, per endpoint, per hour
Error rate: 4xx and 5xx responses
Active connections: Current WebSocket/HTTP connections
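For the latency percentiles listed above, one common choice is the nearest-rank method, sketched below. The document does not say which percentile definition BranchPy uses, so treat this as one possible way to compute P50, P95, and P99 from raw samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) {
    throw new RangeError("percentile requires at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1] as number;
}
```

With only a handful of samples, P95 and P99 collapse to the maximum, so the metric becomes meaningful only once enough health checks have been recorded.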

Diagnostic Metrics

Startup failures: Count and reasons
Port conflicts: Frequency and resolution
Orphaned processes: Detection and cleanup rate
Log volume: Bytes written per minute
Exception rate: Unhandled exceptions per hour

Dashboard (Future UI)

╔══════════════════════════════════════════════════════╗
║         🔍 BranchPy Daemon Health Dashboard          ║
╠══════════════════════════════════════════════════════╣
║ Status: 🟢 RUNNING                                   ║
║ Uptime: 2 hours 34 minutes                           ║
║ Restarts: 0 (last 24 hours)                          ║
║ Health: 100% (0 failed checks)                       ║
║ Response Time: 12ms (avg)                            ║
║ Memory: 145 MB / 512 MB (28%)                        ║
║ Active Connections: 3 WebSocket, 0 HTTP              ║
║ Request Rate: 45 req/min                             ║
╚══════════════════════════════════════════════════════╝

Related Documentation

ARCHITECTURE.md - System design and components
PROTOCOL.md - Protocol specification
PORT_MANAGEMENT.md - Port selection and conflicts
VS_CODE_INTEGRATION_GUIDE.md - Extension integration
Technical/backend/observability.md - Observability and monitoring
Technical/logging/README.md - Logging configuration and analysis

Changelog

v1.1.1 (January 23, 2026)

📝 Updated documentation to version 1.1.1
🔗 Added cross-references to observability and logging documentation
📊 Enhanced metrics tracking specification
🔧 Clarified restart policies and thresholds
✨ Added deep health checks and resource limits
📈 Updated performance targets with P99 latencies

v1.0.0 (November 7, 2025)

✅ Auto-start on feature use
✅ Auto-diagnose on startup failure
✅ Auto-recovery from crashes
✅ Auto-stop on VS Code shutdown
✅ Health monitoring with 10s intervals
✅ User notifications for all lifecycle events
✅ Orphaned daemon detection

Source Reference

This document is a consolidated version of the lifecycle management documentation from BranchPy v1.1.0.

Original Source: docs/v1.1.0/Server/LIFECYCLE.md