Skip to content

Hub-Admin: Scaling

TL;DR

Hub is stateless by design — JWT validation, session state, and relay state are stored in MariaDB and Redis — so horizontal scaling is straightforward: add hub instances behind a load balancer, point them all at the same DB and Redis, and enable sticky sessions. Database backups use mysqldump with binlog for point-in-time recovery; offsite copies go to S3/R2 with 30-day retention. Restore drills should be run quarterly to catch corrupt or incomplete backups before a real disaster. RTO target is under 15 minutes with automated failover; RPO is under 4 hours.

MetricTarget
RTO< 15 minutes
RPO< 4 hours
Hub instances2+ recommended
DB replicationMariaDB Galera (multi-master) or single primary + read replica
Redis relay stateShared across all hub instances
Backup retention30 days offsite

Multi-Region / Horizontal Scaling

Architecture Overview

  • Hub is stateless: all state lives in MariaDB (users, server registry, grants, audit logs) and Redis (relay session state)
  • Multiple hub instances share the same DB + Redis behind a load balancer
  • Each server maintains one persistent WSS connection; clients are routed to the same hub instance via sticky sessions (source IP hash)
  • Docker Swarm or Kubernetes for orchestration; docker-compose up --scale phlix-hub=2 for simple HA

Deployment Topology

yaml
# Minimal HA stack (docker-compose)
# 2 hub instances + nginx load balancer + MariaDB primary + 1 read replica + Redis
services:
  phlix-hub:
    image: phlix/hub:latest
    deploy:
      replicas: 2
    environment:
      HUB_DB_HOST: hub-db
      HUB_DB_USER: phlix_hub
      HUB_DB_NAME: hub_db
      HUB_REDIS_HOST: hub-redis
    depends_on:
      - hub-db
      - hub-redis

  nginx:
    image: nginx:latest
    ports:
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - phlix-hub

  hub-db:
    image: mariadb:10.11
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: hub_db
      MYSQL_USER: phlix_hub
      MYSQL_PASSWORD: ${DB_PASSWORD}
    volumes:
      - hub-db-data:/var/lib/mysql

  hub-redis:
    image: redis:7-alpine
    volumes:
      - hub-redis-data:/data

  hub-db-replica:
    image: mariadb:10.11
    command: --read-only
    depends_on:
      - hub-db
    environment:
      MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MYSQL_DATABASE: hub_db
      MYSQL_USER: phlix_hub
      MYSQL_PASSWORD: ${DB_PASSWORD}
      MYSQL_MASTER_HOST: hub-db

Sticky Sessions (relay affinity)

bash
# nginx upstream with ip_hash for sticky sessions
# Each server's WSS connection lands on the same hub instance
upstream phlix_hub_backend {
    ip_hash;
    server phlix-hub-1:8443;
    server phlix-hub-2:8443;
}
  • Without sticky sessions, a server's WSS connection would be routed to different hub instances, breaking relay state
  • For Kubernetes: use an Ingress with sessionAffinity: ClientIP
  • For Docker Swarm: use docker-compose up --scale phlix-hub=2 with a compatible reverse proxy

Database Backups

Full mysqldump

bash
# Daily full backup
mysqldump -h hub-db -u phlix_hub -p hub_db > hub-backup-$(date +%Y%m%d).sql

# Verify the dump is valid before archiving
grep -c "INSERT INTO" hub-backup-$(date +%Y%m%d).sql
# Expected: table count > 0

Point-in-Time Recovery (binlog)

bash
# Ensure binlog is enabled on the primary
# Check in my.cnf or via:
SHOW VARIABLES LIKE 'log_bin';

# If binlog is enabled, replay to a specific time:
mysqlbinlog --stop-datetime="2025-06-01 12:00:00" /var/lib/mysql/mysql-bin.* | \
  mysql -h hub-db -u phlix_hub -p hub_db

Incremental Backup (every 4 hours)

bash
# Copy binlog files since last full backup
cp /var/lib/mysql/mysql-bin.{n} /backup/binlog-incremental-$(date +%Y%m%d%H%M)/
# Or use mysqlbinlog to dump the active binlog since last backup:
mysqlbinlog --read-from-remote-server \
  --host=hub-db \
  --user=phlix_hub \
  --password \
  --stop-never \
  mysql-bin.000123 > hub-incremental-$(date +%Y%m%d%H%M).sql

Off-Site Copy

bash
# Copy daily full backup to S3/R2 with 30-day retention
aws s3 cp hub-backup-$(date +%Y%m%d).sql s3://your-hub-backups/hub/
# Or with rclone:
rclone copy hub-backup-$(date +%Y%m%d).sql remote:hub-backups/hub/ \
  --excludes "*.tmp"

# Retention: 30 days
rclone delete remote:hub-backups/hub/ --min-age 30d

Restore Drill

Run this quarterly. Do NOT wait for a real disaster to discover your backups are corrupt.

Step 1 — Stop Hub Instances

bash
# Stop all hub instances (avoid writes during restore)
docker-compose stop phlix-hub
# Or for Kubernetes:
kubectl scale deployment phlix-hub --replicas=0

Step 2 — Restore the Database

bash
# Restore from latest full backup
mysql -h hub-db -u phlix_hub -p hub_db < hub-backup-20250601.sql

# If point-in-time recovery is needed, replay binlog:
mysqlbinlog --stop-datetime="2025-06-01 12:00:00" \
  /var/lib/mysql/mysql-bin.* | \
  mysql -h hub-db -u phlix_hub -p hub_db

Step 3 — Restart Hub Instances

bash
# Verify DB is accessible and schema is current
docker-compose up -d phlix-hub
# Or for Kubernetes:
kubectl scale deployment phlix-hub --replicas=2

Step 4 — Verify

bash
# Verify users can log in
curl -X POST https://your-hub.com/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email":"test@yourhub.com","password":"testpassword"}'

# Verify enrolled servers show correct claim status
php bin/hub.php server:list

# Verify relay sessions resume (servers reconnect automatically after hub restart)
php bin/hub.php relay:status

Failover Playbook

Hub Instance Down

  • Detection: Load balancer health check fails (/health endpoint returns non-200)
  • Response: Load balancer automatically routes traffic to healthy instance(s)
  • Recovery: Restart crashed container/pod; no data loss (stateless)
  • RTO: < 1 minute

Database Down

  • Detection: Hub instances report DB connection errors in logs
  • Response: If using Galera multi-master, remaining nodes continue serving
  • If single primary + read replica: promote read replica to primary
    bash
    # On the read replica, stop writes and promote:
    STOP SLAVE;
    RESET SLAVE ALL;
    # Update hub connection strings to point at new primary
  • RTO: < 15 minutes (automated failover preferred)
  • RPO: < 4 hours (last full backup + binlog replay)

Redis Down

  • Symptom: Relay sessions drop; clients must reconnect after timeout (~30s)
  • Recovery: Redis restarts; server WSS connections re-establish automatically
  • Important: Relay session state is lost on Redis crash — clients reconnect and resume stream position from their own last reported position (server-side), not from hub state
  • RTO: < 5 minutes if Redis is restarted; longer if data directory recovery is needed
  • RPO: N/A for Redis (no persistent relay state needed for playback resumption)

Summary RTO/RPO

FailureRTORPO
Hub instance crash< 1 minute (LB routes to other instance)N/A (stateless)
DB primary crash< 15 minutes (automated failover or manual promote)< 4 hours
Redis crash< 5 minutes (restart; clients reconnect)N/A
Full site failure< 15 minutes (restore from backup in new region)< 4 hours

What Can Go Wrong

Galera Cluster Split-Brain (network partition)

Symptom: Two sets of Galera nodes accept writes independently; data diverges; corruption on reconnect.

Cause: Network partition without proper quorum configuration; pc.recovery not enabled.

Fix: Configure proper quorum: set pc.wait_prim=true and pc.ignore_splits=true; ensure at least 3 nodes in cluster; enable pc.recovery=true so cluster recovers state on restart; always validate with a network-partition test drill.

Prevention: Minimum 3 nodes, odd node count, proper network isolation testing in staging.

Redis Relay State Lost on Crash (sessions drop)

Symptom: All active relay sessions disappear; users see "connection lost" and must reconnect.

Cause: Redis was running without persistence (appendonly no) or Redis crashed before last AOF write.

Fix: Enable AOF persistence (appendonly yes and appendfsync everysec); after Redis restarts, servers reconnect automatically and resume from last reported stream position (server-side); relay state is not critical for playback resumption.

Prevention: Verify AOF is on; test Redis crash recovery in staging quarterly.

Backup Not Tested (restore fails when needed)

Symptom: Full disaster strikes; backup file is corrupt, incomplete, or from wrong point in time.

Cause: Backups run on schedule but are never validated; disk full at backup destination causes truncated files.

Fix: Run a full restore drill quarterly on an isolated environment; verify checksum (sha256sum) of every backup immediately after creation; store checksum alongside backup.

Prevention: Automate restore drill notification; alert if backup size drops below expected minimum.

binlog Not Enabled (point-in-time recovery impossible)

Symptom: DB crashes; full backup is 3 days old; 3 days of new users, server enrollments, and grants are lost.

Cause: log_bin was not enabled in my.cnf; PITR is not possible without it.

Fix: Enable binlog before disaster: log_bin=mysql-bin in my.cnf and restart; for existing data, take a fresh full backup immediately after enabling; document this in the operations runbook.

Prevention: Verify binlog is enabled in all DB nodes during setup; check with SHOW VARIABLES LIKE 'log_bin'; in initial deployment checklist.

Load Balancer Health Check Misconfigured (routes to unhealthy instance)

Symptom: Users see 502s or connection timeouts; hub logs show requests from unhealthy instance.

Cause: Health check interval too long (e.g., 60s), or endpoint returns 200 when instance is actually degraded (relay sessions exhausted, DB connection pool depleted).

Fix: Use a meaningful /health endpoint that checks DB, Redis, and relay session capacity; set intervall_ms: 5000 and timeout: 3s; mark instance unhealthy if any check fails; test health check manually before deploying.

Prevention: Canary deploy new hub versions with brief health check window; monitor both healthy and unhealthy state transitions in alerting.


Next Steps

BSD-3-Clause