Hub-Admin: Scaling
TL;DR
Hub is stateless by design — JWT validation, session state, and relay state are stored in MariaDB and Redis — so horizontal scaling is straightforward: add hub instances behind a load balancer, point them all at the same DB and Redis, and enable sticky sessions. Database backups use mysqldump with binlog for point-in-time recovery; offsite copies go to S3/R2 with 30-day retention. Restore drills should be run quarterly to catch corrupt or incomplete backups before a real disaster. RTO target is under 15 minutes with automated failover; RPO is under 4 hours.
| Metric | Target |
|---|---|
| RTO | < 15 minutes |
| RPO | < 4 hours |
| Hub instances | 2+ recommended |
| DB replication | MariaDB Galera (multi-master) or single primary + read replica |
| Redis relay state | Shared across all hub instances |
| Backup retention | 30 days offsite |
Multi-Region / Horizontal Scaling
Architecture Overview
- Hub is stateless: all state lives in MariaDB (users, server registry, grants, audit logs) and Redis (relay session state)
- Multiple hub instances share the same DB + Redis behind a load balancer
- Each server maintains one persistent WSS connection; clients are routed to the same hub instance via sticky sessions (source IP hash)
- Docker Swarm or Kubernetes for orchestration;
docker-compose up --scale phlix-hub=2for simple HA
Deployment Topology
# Minimal HA stack (docker-compose)
# 2 hub instances + nginx load balancer + MariaDB primary + 1 read replica + Redis
services:
phlix-hub:
image: phlix/hub:latest
deploy:
replicas: 2
environment:
HUB_DB_HOST: hub-db
HUB_DB_USER: phlix_hub
HUB_DB_NAME: hub_db
HUB_REDIS_HOST: hub-redis
depends_on:
- hub-db
- hub-redis
nginx:
image: nginx:latest
ports:
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
- phlix-hub
hub-db:
image: mariadb:10.11
environment:
MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
MYSQL_DATABASE: hub_db
MYSQL_USER: phlix_hub
MYSQL_PASSWORD: ${DB_PASSWORD}
volumes:
- hub-db-data:/var/lib/mysql
hub-redis:
image: redis:7-alpine
volumes:
- hub-redis-data:/data
hub-db-replica:
image: mariadb:10.11
command: --read-only
depends_on:
- hub-db
environment:
MYSQL_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
MYSQL_DATABASE: hub_db
MYSQL_USER: phlix_hub
MYSQL_PASSWORD: ${DB_PASSWORD}
MYSQL_MASTER_HOST: hub-dbSticky Sessions (relay affinity)
# nginx upstream with ip_hash for sticky sessions
# Each server's WSS connection lands on the same hub instance
upstream phlix_hub_backend {
ip_hash;
server phlix-hub-1:8443;
server phlix-hub-2:8443;
}- Without sticky sessions, a server's WSS connection would be routed to different hub instances, breaking relay state
- For Kubernetes: use an Ingress with
sessionAffinity: ClientIP - For Docker Swarm: use
docker-compose up --scale phlix-hub=2with a compatible reverse proxy
Database Backups
Full mysqldump
# Daily full backup
mysqldump -h hub-db -u phlix_hub -p hub_db > hub-backup-$(date +%Y%m%d).sql
# Verify the dump is valid before archiving
grep -c "INSERT INTO" hub-backup-$(date +%Y%m%d).sql
# Expected: table count > 0Point-in-Time Recovery (binlog)
# Ensure binlog is enabled on the primary
# Check in my.cnf or via:
SHOW VARIABLES LIKE 'log_bin';
# If binlog is enabled, replay to a specific time:
mysqlbinlog --stop-datetime="2025-06-01 12:00:00" /var/lib/mysql/mysql-bin.* | \
mysql -h hub-db -u phlix_hub -p hub_dbIncremental Backup (every 4 hours)
# Copy binlog files since last full backup
cp /var/lib/mysql/mysql-bin.{n} /backup/binlog-incremental-$(date +%Y%m%d%H%M)/
# Or use mysqlbinlog to dump the active binlog since last backup:
mysqlbinlog --read-from-remote-server \
--host=hub-db \
--user=phlix_hub \
--password \
--stop-never \
mysql-bin.000123 > hub-incremental-$(date +%Y%m%d%H%M).sqlOff-Site Copy
# Copy daily full backup to S3/R2 with 30-day retention
aws s3 cp hub-backup-$(date +%Y%m%d).sql s3://your-hub-backups/hub/
# Or with rclone:
rclone copy hub-backup-$(date +%Y%m%d).sql remote:hub-backups/hub/ \
--excludes "*.tmp"
# Retention: 30 days
rclone delete remote:hub-backups/hub/ --min-age 30dRestore Drill
Run this quarterly. Do NOT wait for a real disaster to discover your backups are corrupt.
Step 1 — Stop Hub Instances
# Stop all hub instances (avoid writes during restore)
docker-compose stop phlix-hub
# Or for Kubernetes:
kubectl scale deployment phlix-hub --replicas=0Step 2 — Restore the Database
# Restore from latest full backup
mysql -h hub-db -u phlix_hub -p hub_db < hub-backup-20250601.sql
# If point-in-time recovery is needed, replay binlog:
mysqlbinlog --stop-datetime="2025-06-01 12:00:00" \
/var/lib/mysql/mysql-bin.* | \
mysql -h hub-db -u phlix_hub -p hub_dbStep 3 — Restart Hub Instances
# Verify DB is accessible and schema is current
docker-compose up -d phlix-hub
# Or for Kubernetes:
kubectl scale deployment phlix-hub --replicas=2Step 4 — Verify
# Verify users can log in
curl -X POST https://your-hub.com/api/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"email":"test@yourhub.com","password":"testpassword"}'
# Verify enrolled servers show correct claim status
php bin/hub.php server:list
# Verify relay sessions resume (servers reconnect automatically after hub restart)
php bin/hub.php relay:statusFailover Playbook
Hub Instance Down
- Detection: Load balancer health check fails (
/healthendpoint returns non-200) - Response: Load balancer automatically routes traffic to healthy instance(s)
- Recovery: Restart crashed container/pod; no data loss (stateless)
- RTO: < 1 minute
Database Down
- Detection: Hub instances report DB connection errors in logs
- Response: If using Galera multi-master, remaining nodes continue serving
- If single primary + read replica: promote read replica to primarybash
# On the read replica, stop writes and promote: STOP SLAVE; RESET SLAVE ALL; # Update hub connection strings to point at new primary - RTO: < 15 minutes (automated failover preferred)
- RPO: < 4 hours (last full backup + binlog replay)
Redis Down
- Symptom: Relay sessions drop; clients must reconnect after timeout (~30s)
- Recovery: Redis restarts; server WSS connections re-establish automatically
- Important: Relay session state is lost on Redis crash — clients reconnect and resume stream position from their own last reported position (server-side), not from hub state
- RTO: < 5 minutes if Redis is restarted; longer if data directory recovery is needed
- RPO: N/A for Redis (no persistent relay state needed for playback resumption)
Summary RTO/RPO
| Failure | RTO | RPO |
|---|---|---|
| Hub instance crash | < 1 minute (LB routes to other instance) | N/A (stateless) |
| DB primary crash | < 15 minutes (automated failover or manual promote) | < 4 hours |
| Redis crash | < 5 minutes (restart; clients reconnect) | N/A |
| Full site failure | < 15 minutes (restore from backup in new region) | < 4 hours |
What Can Go Wrong
Galera Cluster Split-Brain (network partition)
Symptom: Two sets of Galera nodes accept writes independently; data diverges; corruption on reconnect.
Cause: Network partition without proper quorum configuration; pc.recovery not enabled.
Fix: Configure proper quorum: set pc.wait_prim=true and pc.ignore_splits=true; ensure at least 3 nodes in cluster; enable pc.recovery=true so cluster recovers state on restart; always validate with a network-partition test drill.
Prevention: Minimum 3 nodes, odd node count, proper network isolation testing in staging.
Redis Relay State Lost on Crash (sessions drop)
Symptom: All active relay sessions disappear; users see "connection lost" and must reconnect.
Cause: Redis was running without persistence (appendonly no) or Redis crashed before last AOF write.
Fix: Enable AOF persistence (appendonly yes and appendfsync everysec); after Redis restarts, servers reconnect automatically and resume from last reported stream position (server-side); relay state is not critical for playback resumption.
Prevention: Verify AOF is on; test Redis crash recovery in staging quarterly.
Backup Not Tested (restore fails when needed)
Symptom: Full disaster strikes; backup file is corrupt, incomplete, or from wrong point in time.
Cause: Backups run on schedule but are never validated; disk full at backup destination causes truncated files.
Fix: Run a full restore drill quarterly on an isolated environment; verify checksum (sha256sum) of every backup immediately after creation; store checksum alongside backup.
Prevention: Automate restore drill notification; alert if backup size drops below expected minimum.
binlog Not Enabled (point-in-time recovery impossible)
Symptom: DB crashes; full backup is 3 days old; 3 days of new users, server enrollments, and grants are lost.
Cause: log_bin was not enabled in my.cnf; PITR is not possible without it.
Fix: Enable binlog before disaster: log_bin=mysql-bin in my.cnf and restart; for existing data, take a fresh full backup immediately after enabling; document this in the operations runbook.
Prevention: Verify binlog is enabled in all DB nodes during setup; check with SHOW VARIABLES LIKE 'log_bin'; in initial deployment checklist.
Load Balancer Health Check Misconfigured (routes to unhealthy instance)
Symptom: Users see 502s or connection timeouts; hub logs show requests from unhealthy instance.
Cause: Health check interval too long (e.g., 60s), or endpoint returns 200 when instance is actually degraded (relay sessions exhausted, DB connection pool depleted).
Fix: Use a meaningful /health endpoint that checks DB, Redis, and relay session capacity; set intervall_ms: 5000 and timeout: 3s; mark instance unhealthy if any check fails; test health check manually before deploying.
Prevention: Canary deploy new hub versions with brief health check window; monitor both healthy and unhealthy state transitions in alerting.
Next Steps
- Hub-admin capacity planning — sizing hub hardware for your user base
- Hub claim and setup — understanding server claiming and hub identity
- Hub relay tunnel — how the WSS relay actually works under the hood
- Hub-admin overview — hub dashboard and admin CLI reference