ML Engine

TPU-first ML research engine for reproducible distributed training and ablation studies.

Domain

Process — training workflow, distributed setup, evaluation loops.

When to Use

  • TPU mentions (v5e-8, v3-8, v3-64)
  • torch_xla, JAX, distributed training
  • MoE, router, Pallas, SPMD, GSPMD, FSDPv2
  • /ml, /ml-train, /ml-mesh, /ml-debug

Commands

| Command | Purpose |
| --- | --- |
| /ml [idea] | Godmode — full research scaffold |
| /ml-train | Generate training script |
| /ml-mesh | Generate mesh setup |
| /ml-debug | Debug XLA issues |
| /ml-benchmark | Benchmark attention kernel |
| /ml-migrate | Migrate old API → modern |
| /ml-port | Port PyTorch → torch_xla |
| /ml-optimize | Optimize XLA bottleneck |
| /ml-plan | Plan & template research |
| /ml-ablate | Run ablation matrix |
| /ml-checkpoint | Save/resume checkpoint |
| /ml-profile | Profile training bottleneck |

What It Provides

  • Modern torch_xla APIs (torch_xla.step(), torch_xla.sync()); see the training-step sketch after this list
  • SPMD / FSDPv2 setup; see the mesh sketch after this list
  • Attention kernel selection (Splash, Flash, SDPA); see the selection sketch after this list
  • Sharded data pipelines
  • wandb logging on multi-host
  • Reproducible ablation studies
  • Pallas custom kernels
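
A minimal sketch of the kind of training loop the engine scaffolds, using the modern torch_xla entry points named above (torch_xla.step() and torch_xla.sync()). The model, batch shapes, and learning rate are placeholder assumptions, not the engine's exact output.

```python
import torch
import torch_xla

device = torch_xla.device()
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(batch, labels):
    # torch_xla.step() marks a graph boundary so XLA compiles and runs
    # one step at a time instead of tracing an unbounded graph.
    with torch_xla.step():
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(batch), labels)
        loss.backward()
        optimizer.step()
    return loss

for _ in range(10):
    batch = torch.randn(8, 512, device=device)
    labels = torch.randn(8, 512, device=device)
    loss = train_step(batch, labels)

# Flush any pending lazy operations before reading the loss back.
torch_xla.sync()
print(loss.item())
```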
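A hedged sketch of the SPMD + FSDPv2 setup, assuming torch_xla's SPMD execution mode. The single 'fsdp' mesh axis and the toy two-layer model are illustrative assumptions; a real scaffold would size the mesh to the TPU topology (e.g. v5e-8).

```python
import numpy as np
import torch
import torch_xla
import torch_xla.runtime as xr
from torch_xla.distributed.spmd import Mesh
from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
    SpmdFullyShardedDataParallel as FSDPv2,
)

xr.use_spmd()  # enable SPMD mode before building the mesh

num_devices = xr.global_runtime_device_count()
# One-dimensional 'fsdp' mesh: parameters are sharded across all devices.
mesh = Mesh(np.arange(num_devices), (num_devices,), ('fsdp',))

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to(torch_xla.device())

# FSDPv2 shards parameters along the 'fsdp' axis of the given mesh.
sharded_model = FSDPv2(model, mesh=mesh)
```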
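A sketch of attention kernel selection. Splash attention lives on the JAX/Pallas side, so this example covers only torch_xla's experimental Pallas flash-attention kernel and the portable PyTorch SDPA fallback; the use_flash switch is an assumption for illustration.

```python
import torch
import torch.nn.functional as F
from torch_xla.experimental.custom_kernel import flash_attention

def attention(q, k, v, use_flash=True):
    # Expected layout for both paths: (batch, num_heads, seq_len, head_dim).
    if use_flash:
        # Pallas-backed flash attention kernel on TPU.
        return flash_attention(q, k, v, causal=True)
    # Fallback: PyTorch scaled dot-product attention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```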

Composability

```yaml
domain: process
composable: true
yields_to: [craft, voice]
```

License

Released under the MIT License.