Overview

This project explores how to coordinate LLM training when the network is the bottleneck: limited bandwidth, variable latency, occasional loss, and heterogeneous worker throughput.

Experiment design

WAN profiles are applied via netem presets (latency, jitter, loss, bandwidth caps). Each run logs duration, completion counts, and an exponentially weighted moving average (EWMA) of throughput.
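
A minimal sketch of the EWMA update, assuming a smoothing factor around 0.2 (the function name and default are illustrative, not taken from the codebase):

def ewma_update(prev, sample, alpha=0.2):
    # Exponentially weighted moving average: recent samples dominate,
    # older samples decay geometrically.
    if prev is None:
        return sample  # first sample seeds the average
    return alpha * sample + (1 - alpha) * prev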

Infrastructure

Basic mode validates scheduling, heartbeat health, and fault handling without the overhead of loading PyTorch.
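
A sketch of the kind of heartbeat check basic mode exercises (the 15 s timeout and the names are assumptions, not the project's actual values):

import time

HEARTBEAT_TIMEOUT_S = 15.0  # assumed cutoff; the real value would live in config

def healthy_workers(last_seen, now=None):
    # A worker counts as healthy if its last heartbeat arrived within the timeout.
    now = time.monotonic() if now is None else now
    return {w for w, t in last_seen.items() if now - t <= HEARTBEAT_TIMEOUT_S}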

Training

ML mode trains a tiny LSTM with an asynchronous parameter server, gradient compression, and robust aggregation to observe convergence under WAN stress.
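
One common compression scheme is top-k sparsification; a hedged PyTorch sketch (the project may use a different scheme or API):

import torch

def topk_compress(grad, ratio=0.01):
    # Keep only the largest-magnitude fraction of entries; ship indices + values.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices], grad.shape

def topk_decompress(indices, values, shape):
    # Scatter the surviving entries back into a dense zero tensor.
    flat = torch.zeros(torch.Size(shape).numel(), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)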

Reproducibility

All runs are stored in system/runs_*.csv; checkpoints live in system/checkpoints. Configs under system/configs mirror the runs.

Architecture

The system follows a controller/worker model with a parameter server. Scheduling is pluggable; telemetry feeds capability estimates; communication runs over XML-RPC, with ZMQ for binary payloads.
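
A plausible shape for the pluggable scheduler interface (the class and method names here are hypothetical):

from abc import ABC, abstractmethod

class Scheduler(ABC):
    # Policy hook: map pending tasks to workers given capability estimates.
    @abstractmethod
    def assign(self, tasks, capabilities):
        ...

class GreedyScheduler(Scheduler):
    def assign(self, tasks, capabilities):
        # Rank workers by estimated capability and deal tasks out round-robin.
        ranked = sorted(capabilities, key=capabilities.get, reverse=True)
        return {t: ranked[i % len(ranked)] for i, t in enumerate(tasks)}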

Controller

Registers workers, runs schedulers, applies WAN profiles, aggregates gradients via the parameter server, and checkpoints state.

Workers

Execute tasks, compress gradients, respect staleness bounds, and report telemetry (tokens/s, step time, epoch/batch).
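
A telemetry report might look like the following (field names are assumptions based on the metrics listed above):

from dataclasses import dataclass

@dataclass
class Telemetry:
    worker_id: str
    tokens_per_s: float  # throughput estimate
    step_time_s: float   # wall-clock time per training step
    epoch: int
    batch: int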

Robust PS

A bounded-async coordinator with adaptive staleness; aggregation rules: mean, trimmed mean, Krum, and Bulyan; optional gradient clipping and differential privacy (DP).
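
Hedged PyTorch sketches of two of the listed rules, trimmed mean and Krum (simplified relative to whatever the PS actually does):

import torch

def trimmed_mean(grads, trim=1):
    # Per coordinate: drop the `trim` smallest and largest values, average the rest.
    stacked = torch.stack(grads)
    sorted_vals, _ = torch.sort(stacked, dim=0)
    return sorted_vals[trim : len(grads) - trim].mean(dim=0)

def krum(grads, f=1):
    # Select the gradient whose n - f - 2 nearest peers are closest in squared L2.
    # Assumes n >= f + 3 so the neighbor slice is non-empty.
    n = len(grads)
    flat = torch.stack([g.flatten() for g in grads])
    dists = torch.cdist(flat, flat) ** 2
    scores = [torch.sort(dists[i]).values[1 : n - f - 1].sum() for i in range(n)]
    return grads[int(torch.argmin(torch.stack(scores)))]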

Documentation: system/docs/architecture.md, system/docs/scheduler_design.md, system/docs/training_protocol.md.

Concept: WAN deployment

Illustrative topology for a research federation. Nodes experience different RTT/loss and are matched to tasks by capacity and network profile.

[Topology diagram: federation nodes in Chicago, Madison, Oak Ridge, Sandia, and Boston, spanning high-throughput / low-RTT and constrained / high-RTT links.]
Matchmaking

Tasks are assigned by stable matching (Gale-Shapley) or capability-based scoring. Telemetry adjusts staleness bounds and compression decisions per link.

Goal: maximize gradient acceptance and throughput while tolerating lossy/slow paths.
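
A compact sketch of task-proposing Gale-Shapley matching (one task per worker, complete preference lists assumed; all names illustrative):

def gale_shapley(task_prefs, worker_prefs):
    # Tasks propose in preference order; each worker keeps its best proposer.
    rank = {w: {t: i for i, t in enumerate(p)} for w, p in worker_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}
    free = list(task_prefs)       # tasks still unmatched
    engaged = {}                  # worker -> task
    while free:
        task = free.pop()
        worker = task_prefs[task][next_choice[task]]
        next_choice[task] += 1
        current = engaged.get(worker)
        if current is None:
            engaged[worker] = task
        elif rank[worker][task] < rank[worker][current]:
            engaged[worker] = task
            free.append(current)  # displaced task re-enters the pool
        else:
            free.append(task)     # rejected; will propose to its next choice
    return {t: w for w, t in engaged.items()}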

Reproduce locally

cd system
pip install -r requirements.txt
pytest -q  # 100+ passing tests
dist-llm-train demo-basic --config config.yaml --log-level INFO

For ML training demos, run dist-llm-train demo-ml --config config.yaml. More CLI options: system/README.md. Experiments runner: dist-llm-train run-experiments --glob "configs/*_demo.yaml".

Results at a glance

Data is pulled directly from system/runs_*.csv (basic infrastructure and ML training runs across WAN profiles). We report average runtime and completion counts per preset.

Table columns: Mode, Profile, Repeat, Duration (s), Workers, Completed, Pending. Rows are populated from the CSVs.
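
The same averages could be reproduced offline along these lines (column names are assumptions inferred from the table header):

import glob
import pandas as pd

# Concatenate every runs CSV, then average duration and completions per preset.
runs = pd.concat(
    (pd.read_csv(p) for p in glob.glob("system/runs_*.csv")),
    ignore_index=True,
)
print(runs.groupby(["mode", "profile"])[["duration_s", "completed"]].mean())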

Production-readiness

Static PWA with offline cache and Lighthouse-friendly defaults. GitHub Actions can run Lighthouse CI and unit tests on push.
