Overview
This project explores how to coordinate LLM training when the network is the bottleneck: limited bandwidth, variable latency, occasional loss, and heterogeneous worker throughput.
- Pluggable schedulers (priority, Gale-Shapley, capability) with telemetry feedback.
- Async parameter server with adaptive staleness, gradient compression (top-k, quantize, fp16), and Byzantine-robust aggregation (trimmed mean, Krum, Bulyan); see the sketch after this list.
- WAN impairment presets (good, cellular, degraded, brownout, straggler, satellite) configurable via netem.
- Reproducibility: SQLite state snapshots, PS checkpoints, CSV experiment logs, and a CLI to resume or batch-run configs.
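To make the compression and robust-aggregation bullets concrete, here is a minimal PyTorch sketch of top-k sparsification and coordinate-wise trimmed mean. The function names, shapes, and example values are illustrative, not the project's API.

```python
import torch

def topk_compress(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude entries of a flattened gradient."""
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices, grad.shape

def topk_decompress(values, indices, shape):
    """Scatter the retained entries back into a dense tensor of zeros."""
    dense = torch.zeros(shape, dtype=values.dtype)
    dense.view(-1)[indices] = values
    return dense

def trimmed_mean(grads, trim: int):
    """Coordinate-wise trimmed mean: drop the `trim` smallest and largest
    values at each coordinate before averaging (Byzantine-robust)."""
    stacked = torch.stack(grads)                      # (n_workers, *shape)
    sorted_vals, _ = torch.sort(stacked, dim=0)
    kept = sorted_vals[trim : stacked.shape[0] - trim]
    return kept.mean(dim=0)

# Example: 4 workers, one of them sending a corrupted gradient.
grads = [torch.randn(10, 10) for _ in range(3)] + [torch.full((10, 10), 100.0)]
robust = trimmed_mean(grads, trim=1)                  # the outlier is trimmed away
vals, idx, shape = topk_compress(grads[0], k=20)      # ship only 20 entries
restored = topk_decompress(vals, idx, shape)
```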
Project links
- GitHub repo: https://github.com/cesposo/thedataworkshop
- ED Projects thread: https://edstem.org/us/courses/82067/discussion/6862044?comment=15973163
- Docs: Architecture & CLI guides
Experiment design
WAN profiles are applied via netem presets (latency, jitter, loss, bandwidth caps). Each run logs duration, completion counts, and EWMA throughput.
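The exact preset values live in the repo's configs; as a rough illustration of how a profile could map onto a netem command (the device name and impairment numbers below are made up, not the project's presets):

```python
import subprocess

# Hypothetical impairment values; the repo's presets may differ.
PRESETS = {
    "good":      {"delay": "20ms 5ms",   "loss": "0%",   "rate": "100mbit"},
    "cellular":  {"delay": "80ms 30ms",  "loss": "1%",   "rate": "10mbit"},
    "satellite": {"delay": "600ms 50ms", "loss": "0.5%", "rate": "5mbit"},
}

def apply_profile(dev: str, name: str) -> None:
    """Attach a netem qdisc with the preset's delay/jitter, loss, and rate cap."""
    p = PRESETS[name]
    cmd = ["tc", "qdisc", "replace", "dev", dev, "root", "netem",
           "delay", *p["delay"].split(), "loss", p["loss"], "rate", p["rate"]]
    subprocess.run(cmd, check=True)   # requires root / CAP_NET_ADMIN

# apply_profile("eth0", "cellular")
```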
Basic mode validates scheduling, heartbeat health, and fault handling without the cost of running PyTorch.
ML mode runs a tiny LSTM with async PS, compression, and robust aggregation to observe convergence under WAN stress.
All runs are stored in system/runs_*.csv; checkpoints live in system/checkpoints. Configs under system/configs mirror the runs.
Architecture
The system follows a controller/worker model with a parameter server. Scheduling is pluggable; telemetry feeds capability estimates; communication supports XML-RPC and ZMQ for binary payloads.
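For binary payloads, a worker might ship a compressed gradient over ZMQ roughly as below; the socket types, framing, and address are illustrative, not the project's wire format.

```python
import numpy as np
import zmq

ctx = zmq.Context.instance()

# Parameter-server side: bind and wait for a metadata frame plus raw bytes.
pull = ctx.socket(zmq.PULL)
pull.bind("tcp://127.0.0.1:5555")        # address is illustrative

# Worker side: ship a gradient as a small JSON header plus a binary frame.
push = ctx.socket(zmq.PUSH)
push.connect("tcp://127.0.0.1:5555")
grad = np.random.randn(1024).astype(np.float16)
push.send_json({"dtype": str(grad.dtype), "shape": grad.shape}, flags=zmq.SNDMORE)
push.send(grad.tobytes())

# Back on the server: rebuild the array without pickling.
meta = pull.recv_json()
restored = np.frombuffer(pull.recv(), dtype=meta["dtype"]).reshape(meta["shape"])
```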
- Controller: registers workers, runs schedulers, applies WAN profiles, aggregates gradients via the parameter server, and checkpoints state.
- Workers: execute tasks, compress gradients, respect staleness bounds, and report telemetry (tokens/s, step time, epoch/batch).
- Parameter server: bounded-async coordinator with adaptive staleness; aggregation rules: mean, trimmed mean, Krum, Bulyan; optional gradient clipping/DP.
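A minimal sketch of the bounded-staleness decision the coordinator makes; the thresholds, damping rule, and names are illustrative, not the project's implementation.

```python
class StalenessGate:
    """Accept a gradient only if it is not too far behind the current model
    version; weight it down the staler it is."""

    def __init__(self, max_staleness: int = 8):
        self.max_staleness = max_staleness
        self.version = 0                     # global model version at the server

    def admit(self, worker_version: int):
        """Return an aggregation weight, or None if the gradient is rejected."""
        staleness = self.version - worker_version
        if staleness > self.max_staleness:
            return None                      # too stale: drop or ask for a resync
        return 1.0 / (1.0 + staleness)       # simple staleness-aware damping

    def widen_for_slow_links(self, ewma_rtt_s: float, step_time_s: float) -> None:
        """Adaptive part (heuristic): tolerate more staleness when the network,
        not the compute, dominates the round-trip."""
        if ewma_rtt_s > step_time_s:
            self.max_staleness = min(32, self.max_staleness + 1)

gate = StalenessGate()
gate.version = 10
print(gate.admit(worker_version=7))   # accepted with weight 0.25
print(gate.admit(worker_version=1))   # None: exceeds the staleness bound
```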
Documentation: system/docs/architecture.md, system/docs/scheduler_design.md, system/docs/training_protocol.md.
Concept: WAN deployment
Illustrative topology for a research federation. Nodes experience different RTT/loss and are matched to tasks by capacity and network profile.
Tasks are assigned by stable matching (Gale-Shapley) or capability-based scoring. Telemetry adjusts staleness bounds and compression decisions per link.
Goal: maximize gradient acceptance and throughput while tolerating lossy/slow paths.
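For the stable-matching option, a compact Gale-Shapley sketch over hypothetical workers and task shards; in the real system the preference lists would be derived from telemetry and capability scores.

```python
def gale_shapley(worker_prefs: dict, task_prefs: dict) -> dict:
    """Worker-proposing deferred acceptance; returns {task: worker}."""
    rank = {t: {w: i for i, w in enumerate(p)} for t, p in task_prefs.items()}
    free = list(worker_prefs)                 # workers with no task yet
    next_choice = {w: 0 for w in worker_prefs}
    match = {}                                # task -> worker
    while free:
        w = free.pop()
        t = worker_prefs[w][next_choice[w]]   # propose to next-preferred task
        next_choice[w] += 1
        if t not in match:
            match[t] = w
        elif rank[t][w] < rank[t][match[t]]:  # task prefers the new worker
            free.append(match[t])
            match[t] = w
        else:
            free.append(w)
    return match

# Hypothetical preferences, e.g. derived from RTT/loss and capability scores.
workers = {"w1": ["shard_a", "shard_b"], "w2": ["shard_a", "shard_b"]}
tasks   = {"shard_a": ["w2", "w1"], "shard_b": ["w1", "w2"]}
print(gale_shapley(workers, tasks))   # {'shard_a': 'w2', 'shard_b': 'w1'}
```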
Reproduce locally
cd system
pip install -r requirements.txt
pytest -q # 100+ passing tests
dist-llm-train demo-basic --config config.yaml --log-level INFO
For ML training demos, run dist-llm-train demo-ml --config config.yaml. More CLI options: system/README.md. Experiments runner: dist-llm-train run-experiments --glob "configs/*_demo.yaml".
Results at a glance
Data pulled directly from system/runs_*.csv (basic infra + ML training across WAN profiles). We report average runtime and completion across presets.
| Mode | Profile | Repeat | Duration (s) | Workers | Completed | Pending |
|---|---|---|---|---|---|---|
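The table is populated from the run logs; a minimal pandas sketch of the aggregation, assuming column names that mirror the table headers (the actual CSV schema may differ):

```python
import glob
import pandas as pd

# Assumed columns: mode, profile, repeat, duration_s, workers, completed, pending
frames = [pd.read_csv(path) for path in glob.glob("system/runs_*.csv")]
runs = pd.concat(frames, ignore_index=True)

summary = (
    runs.groupby(["mode", "profile"])
        .agg(avg_duration_s=("duration_s", "mean"),
             avg_completed=("completed", "mean"),
             n_repeats=("repeat", "count"))
        .reset_index()
)
print(summary.to_markdown(index=False))
```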
Production-readiness
Static PWA with offline cache and Lighthouse-friendly defaults. GitHub Actions can run Lighthouse CI and unit tests on push.
Install the app (PWA)
Install this site like an app for quick access and offline reading; installation is offered when your browser supports it.