Overview
This project explores how to coordinate LLM training when the network is the bottleneck: limited bandwidth, variable latency, occasional loss, and heterogeneous worker throughput.
- Pluggable schedulers (priority, Gale-Shapley, capability) with telemetry feedback.
- Async parameter server with adaptive staleness, gradient compression (top-k, quantize, fp16), and Byzantine-robust aggregation (trimmed mean, Krum, Bulyan); see the sketch after this list.
- WAN impairment presets (good, cellular, degraded, brownout, straggler, satellite) configurable via netem.
- Reproducibility: SQLite state snapshots, PS checkpoints, CSV experiment logs, and a CLI to resume or batch-run configs.
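To make the compression and robust-aggregation bullets concrete, here is a minimal PyTorch sketch of top-k sparsification and coordinate-wise trimmed mean. The function names, shapes, and example values are illustrative, not the project's API.

```python
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude entries of a flattened gradient."""
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values, indices, shape):
    """Scatter the retained entries back into a dense tensor of zeros."""
    dense = torch.zeros(shape, dtype=values.dtype)
    dense.view(-1)[indices] = values
    return dense

def trimmed_mean(grads, trim: int):
    """Coordinate-wise trimmed mean: drop the `trim` smallest and largest
    values at each coordinate before averaging (Byzantine-robust)."""
    stacked = torch.stack(grads)                      # (n_workers, *shape)
    sorted_vals, _ = torch.sort(stacked, dim=0)
    kept = sorted_vals[trim : stacked.shape[0] - trim]
    return kept.mean(dim=0)

# Example: 4 workers, one of them sending a corrupted gradient.
grads = [torch.randn(10, 10) for _ in range(3)] + [torch.full((10, 10), 100.0)]
robust = trimmed_mean(grads, trim=1)                  # the outlier is trimmed away
vals, idx, shape = topk_compress(grads[0], k=20)      # ship only 20 entries
restored = topk_decompress(vals, idx, shape)
```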
Project links
- GitHub repo: https://github.com/cesposo/thedataworkshop
- ED Projects thread: https://edstem.org/us/courses/82067/discussion/6862044?comment=15973163
- Docs: Architecture & CLI guides
Experiment design
WAN profiles are applied via netem presets (latency, jitter, loss, bandwidth caps). Each run logs duration, completion counts, and EWMA throughput.
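The exact preset values live in the repo's configs; as a rough illustration of how a profile could map onto a netem command (the device name and impairment numbers below are made up, not the project's presets):

```python
import subprocess

# Hypothetical impairment values; the repo's presets may differ.
PRESETS = {
    "good":      {"delay": "20ms 5ms",   "loss": "0%",   "rate": "100mbit"},
    "cellular":  {"delay": "80ms 30ms",  "loss": "1%",   "rate": "10mbit"},
    "satellite": {"delay": "600ms 50ms", "loss": "0.5%", "rate": "5mbit"},
}

def apply_profile(dev: str, name: str) -> None:
    """Attach a netem qdisc with the preset's delay/jitter, loss, and rate cap."""
    p = PRESETS[name]
    cmd = ["tc", "qdisc", "replace", "dev", dev, "root", "netem",
           "delay", *p["delay"].split(), "loss", p["loss"], "rate", p["rate"]]
    subprocess.run(cmd, check=True)   # requires root / CAP_NET_ADMIN

# apply_profile("eth0", "cellular")
```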
Basic mode validates scheduling, heartbeat health, and fault handling without the cost of running PyTorch.
ML mode runs a tiny LSTM with async PS, compression, and robust aggregation to observe convergence under WAN stress.
All runs are stored in system/runs_*.csv; checkpoints live in system/checkpoints. Configs under system/configs mirror the runs.
Architecture
The system follows a controller/worker model with a parameter server. Scheduling is pluggable; telemetry feeds capability estimates; communication supports XML-RPC and ZMQ for binary payloads.
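For binary payloads, a worker might ship a compressed gradient over ZMQ roughly as below; the socket types, framing, and address are illustrative, not the project's wire format.

```python
import numpy as np
import zmq

ctx = zmq.Context.instance()

# Parameter-server side: bind and wait for a metadata frame plus raw bytes.
pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5555")        # address is illustrative

# Worker side: ship a gradient as a small JSON header plus a binary frame.
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5555")
grad = np.random.randn(1024).astype(np.float16)
push.send_json({"dtype": str(grad.dtype), "shape": grad.shape}, flags=zmq.SNDMORE)
push.send(grad.tobytes())

# Back on the server: rebuild the array without pickling.
meta = pull.recv_json()
restored = np.frombuffer(pull.recv(), dtype=meta["dtype"]).reshape(meta["shape"])
```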
- Controller: registers workers, runs schedulers, applies WAN profiles, aggregates gradients via the parameter server, and checkpoints state.
- Workers: execute tasks, compress gradients, respect staleness bounds, and report telemetry (tokens/s, step time, epoch/batch).
- Parameter server: bounded-async coordinator with adaptive staleness; aggregation rules: mean, trimmed mean, Krum, Bulyan; optional gradient clipping/DP.
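A minimal sketch of the bounded-staleness decision the coordinator makes; the thresholds, damping rule, and names are illustrative, not the project's implementation.

```python
class StalenessGate:
    """Accept a gradient only if it is not too far behind the current model
    version; weight it down the staler it is."""

    def __init__(self, max_staleness: int = 8):
        self.max_staleness = max_staleness
        self.version = 0                     # global model version at the server

    def admit(self, worker_version: int):
        """Return an aggregation weight, or None if the gradient is rejected."""
        staleness = self.version - worker_version
        if staleness > self.max_staleness:
            return None                      # too stale: drop or ask for a resync
        return 1.0 / (1.0 + staleness)       # simple staleness-aware damping

    def widen_for_slow_links(self, ewma_rtt_s: float, step_time_s: float) -> None:
        """Adaptive part (heuristic): tolerate more staleness when the network,
        not the compute, dominates the round-trip."""
        if ewma_rtt_s > step_time_s:
            self.max_staleness = min(32, self.max_staleness + 1)

gate = StalenessGate()
gate.version = 10
print(gate.admit(worker_version=7))   # accepted with weight 0.25
print(gate.admit(worker_version=1))   # None: exceeds the staleness bound
```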
Documentation: system/docs/architecture.md, system/docs/scheduler_design.md, system/docs/training_protocol.md.
Concept: WAN deployment
Illustrative topology for a research federation. Nodes experience different RTT/loss and are matched to tasks by capacity and network profile.
Tasks are assigned by stable matching (Gale-Shapley) or capability-based scoring. Telemetry adjusts staleness bounds and compression decisions per link.
Goal: maximize gradient acceptance and throughput while tolerating lossy/slow paths.
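For the stable-matching option, a compact Gale-Shapley sketch over hypothetical workers and task shards; in the real system the preference lists would be derived from telemetry and capability scores.

```python
def gale_shapley(worker_prefs: dict, task_prefs: dict) -> dict:
    """Worker-proposing deferred acceptance; returns {task: worker}."""
    rank = {t: {w: i for i, w in enumerate(p)} for t, p in task_prefs.items()}
    free = list(worker_prefs)                 # workers with no task yet
    next_choice = {w: 0 for w in worker_prefs}
    match = {}                                # task -> worker
    while free:
        w = free.pop()
        t = worker_prefs[w][next_choice[w]]   # propose to next-preferred task
        next_choice[w] += 1
        if t not in match:
            match[t] = w
        elif rank[t][w] < rank[t][match[t]]:  # task prefers the new worker
            free.append(match[t])
            match[t] = w
        else:
            free.append(w)
    return match

# Hypothetical preferences, e.g. derived from RTT/loss and capability scores.
workers = {"w1": ["shard_a", "shard_b"], "w2": ["shard_a", "shard_b"]}
tasks   = {"shard_a": ["w2", "w1"], "shard_b": ["w1", "w2"]}
print(gale_shapley(workers, tasks))   # {'shard_a': 'w2', 'shard_b': 'w1'}
```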
Reproduce locally
cd system
pip install -r requirements.txt
pytest -q # 100+ passing tests
dist-llm-train demo-basic --config config.yaml --log-level INFO
For ML training demos, run dist-llm-train demo-ml --config config.yaml. More CLI options: system/README.md. Experiments runner: dist-llm-train run-experiments --glob "configs/*_demo.yaml".
Results at a glance
Data pulled directly from system/runs_*.csv (basic infra + ML training across WAN profiles). We report average runtime and completion across presets.
| Mode | Profile | Repeat | Duration (s) | Workers | Completed | Pending |
|---|---|---|---|---|---|---|
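The table is populated from the run logs; a minimal pandas sketch of the aggregation, assuming column names that mirror the table headers (the actual CSV schema may differ):

```python
import glob
import pandas as pd

# Assumed columns: mode, profile, repeat, duration_s, workers, completed, pending
frames = [pd.read_csv(path) for path in glob.glob("system/runs_*.csv")]
runs = pd.concat(frames, ignore_index=True)

summary = (
    runs.groupby(["mode", "profile"])
        .agg(avg_duration_s=("duration_s", "mean"),
             avg_completed=("completed", "mean"),
             n_repeats=("repeat", "count"))
        .reset_index()
)
print(summary.to_markdown(index=False))
```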
Production-readiness
Static PWA with offline cache and Lighthouse-friendly defaults. GitHub Actions can run Lighthouse CI and unit tests on push.
Install the app (PWA)
Install this site like an app for quick access and offline reading; installation is offered when your browser supports it.