TTS 50K training run — live dashboard

Generated 2026-04-19 21:19:12 · auto-refreshes every 30 min · IDLE — no training job in queue

Flow-matching DiT TTS model (779M parameters) with a trainable byte encoder (57M parameters). Hardware: 8 nodes × 8 MI250X GCDs (64 GCDs total). Global batch 2052, AdamW, cosine LR schedule with peak lr=1e-4. Primary job 17574420; resume jobs 17598698, 17617259, 17617423, 17618275.

Progress:      50,000 / 50,000 steps (100.0%)
Current step:  50,000 of 50,000
Latest loss:   0.639 (best: 0.563 | recent-50 avg: 0.639)
Step time:     2.33s (avg last 100)
Throughput:    888 samples/s (global)
Samples seen:  102.6M (batch 2052/step × 50,000 steps)
Current LR:    1.00e-05 (peak 1e-4 · cosine)
Grad norm:     0.060 (clipped to 1.0)
ETA:           0h 0m (remaining steps × 2.33s)
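
The derived cards above (samples seen, throughput, ETA) follow from three inputs: global batch, step count, and average step time. The sketch below reproduces that arithmetic; the function names are hypothetical and are not taken from the dashboard's own code. Note that 2052 / 2.33 gives roughly 881 samples/s rather than the 888 shown, which may reflect a different averaging window for the throughput card.

```python
# Figures copied from the dashboard cards.
GLOBAL_BATCH = 2052        # samples per optimizer step
TOTAL_STEPS = 50_000
CURRENT_STEP = 50_000
STEP_TIME_S = 2.33         # average over the last 100 steps


def samples_seen(step: int, batch: int = GLOBAL_BATCH) -> int:
    """Total samples consumed after `step` optimizer steps."""
    return step * batch


def throughput(batch: int = GLOBAL_BATCH, step_time_s: float = STEP_TIME_S) -> float:
    """Global samples per second implied by the average step time."""
    return batch / step_time_s


def eta(step: int, total: int = TOTAL_STEPS, step_time_s: float = STEP_TIME_S) -> str:
    """Remaining wall time (remaining steps x step time), formatted like the ETA card."""
    remaining_s = max(total - step, 0) * step_time_s
    hours, rem = divmod(int(remaining_s), 3600)
    return f"{hours}h {rem // 60}m"


if __name__ == "__main__":
    print(f"{samples_seen(CURRENT_STEP) / 1e6:.1f}M samples seen")
    print(f"{throughput():.0f} samples/s")
    print(f"ETA {eta(CURRENT_STEP)}")
```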

Current SLURM status

job       name          state    elapsed  limit  nodes  reason / nodelist
17626980  tts_demo_emo  RUNNING  1:15     30:00  1      nid005017

Job history (primary + resumes)

job       name        state                  start                end                  elapsed   exit
17574420  allocation  CANCELLED by 10035305  2026-04-17T20:43:07  2026-04-17T20:43:07  00:00:00  0:0
17598698  tts_50k     FAILED                 2026-04-18T05:07:14  2026-04-18T05:11:10  00:03:56  15:0
17617259  tts_50k     FAILED                 2026-04-18T22:02:20  2026-04-18T22:08:08  00:05:48  15:0
17617423  tts_50k     FAILED                 2026-04-18T22:21:21  2026-04-18T22:30:23  00:09:02  15:0
17618275  tts_50k     COMPLETED              2026-04-19T00:09:43  2026-04-19T13:26:13  13:16:30  0:0

[Charts: Loss over time · LR schedule · Grad norm (log y) · WER / CER over checkpoints]
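
The LR figures reported on this page (cosine schedule, peak 1e-4, 1.00e-05 at step 50,000) are consistent with a cosine decay that ends at a floor of 10% of the peak. The sketch below is one way to write such a schedule. The floor value and the absence of warmup are assumptions inferred from the cards, not settings confirmed by the run configuration, and `cosine_lr` is a hypothetical name.

```python
import math

PEAK_LR = 1e-4        # from the dashboard
MIN_LR = 1e-5         # assumption: floor inferred from the final "Current LR" value
TOTAL_STEPS = 50_000  # from the dashboard
WARMUP_STEPS = 0      # assumption: the dashboard does not state a warmup length


def cosine_lr(step: int) -> float:
    """Optional linear warmup, then cosine decay from PEAK_LR down to MIN_LR."""
    if WARMUP_STEPS > 0 and step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(TOTAL_STEPS - WARMUP_STEPS, 1)
    progress = min(max(progress, 0.0), 1.0)  # clamp so steps past the end stay at MIN_LR
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 25_000, 50_000):
        print(f"step {s:>6}: lr = {cosine_lr(s):.2e}")
```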

Eval results (WER / CER on 99 cross-speaker pairs from LibriSpeech test-clean)

step    mean WER  median WER  mean CER
 5,000  1.279     1.000       1.041
10,000  2.460     1.000       1.968
15,000  2.784     1.000       1.513
20,000  2.523     1.000       1.595
25,000  3.354     1.111       2.183
30,000  1.454     0.967       0.902
40,000  1.497     0.903       0.980
45,000  0.827     0.867       0.575
50,000  0.853     0.793       0.622

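The step-30K checkpoint was the first with median WER below 1.0 (0.967), the point this run treats as the intelligibility threshold. Mean WER at that step was still 1.454 and did not fall below 1.0 until step 45K (0.827).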
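
WER values above 1.0 in the table are possible because insertions count as errors, so a hypothesis longer than its reference can accumulate more edits than the reference has words. The sketch below shows a standard word-level edit-distance WER and then scans the median-WER column for the first checkpoint below 1.0. Whitespace tokenization with no text normalization is an assumption; the scorer actually used for these evals is not shown on this page, and the function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction of reference length; can exceed 1.0."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Median WER per checkpoint, copied from the eval table.
MEDIAN_WER = {
    5_000: 1.000, 10_000: 1.000, 15_000: 1.000, 20_000: 1.000, 25_000: 1.111,
    30_000: 0.967, 40_000: 0.903, 45_000: 0.867, 50_000: 0.793,
}


def first_step_below(scores: dict[int, float], threshold: float = 1.0) -> int:
    """Earliest checkpoint step whose score is strictly below `threshold`."""
    return next(step for step, s in sorted(scores.items()) if s < threshold)


if __name__ == "__main__":
    print(wer("the cat sat", "a dog ran off fast"))  # 3 substitutions + 2 insertions over 3 words
    print(first_step_below(MEDIAN_WER))
```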