TTS 50K demo — angry voice clone (custom reference)

Samples from the final 50K checkpoint of the flow-matching DiT TTS. The reference voice is a real angry-sounding audio clip (angry.mp3); the model clones the speaker's timbre and delivery style into each generated sample. Sampling uses a classifier-free guidance (CFG) scale of 3.0 and 30 sampling steps.
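For readers unfamiliar with the two sampling settings, here is a minimal sketch of how a CFG scale and a step count typically enter a flow-matching sampler. It is not the demo's actual code: the names `cfg_velocity`, `sample_euler`, and `toy_velocity` are hypothetical, the fixed-step Euler solver is an assumption, and the guidance formula shown is one common convention (some implementations extrapolate from the conditional prediction instead). The toy velocity field only stands in for the DiT so the snippet runs on its own.

```python
import numpy as np

def cfg_velocity(velocity_fn, x, t, cond, scale):
    """Combine conditional and unconditional velocity predictions.

    Assumed convention: v = v_uncond + scale * (v_cond - v_uncond).
    scale = 1.0 reduces to the purely conditional prediction.
    """
    v_cond = velocity_fn(x, t, cond)
    v_uncond = velocity_fn(x, t, None)  # None = conditioning dropped
    return v_uncond + scale * (v_cond - v_uncond)

def sample_euler(velocity_fn, x0, cond, scale=3.0, num_steps=30):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * cfg_velocity(velocity_fn, x, t, cond, scale)
    return x

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the DiT: pulls x toward the conditioning
    vector, or toward zero when conditioning is dropped."""
    target = np.zeros_like(x) if cond is None else cond
    return target - x

# Usage: start from Gaussian noise, as a flow-matching sampler would.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
condition = np.ones(4)
sample = sample_euler(toy_velocity, noise, condition, scale=3.0, num_steps=30)
```

Each step costs two model evaluations (conditional and unconditional), so 30 steps in this sketch means 60 forward passes per sample unless the two are batched together.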

Reference voice

Generated angry samples

Angry text | Generated audio | Whisper hypothesis
--- | --- | ---
What the heck do you think you're doing?! Get out of my house right now! | (audio) | What the heck do you think you're doing? Get out of my house right now!
I told you a hundred times not to touch my stuff! Why don't you ever listen to me?! | (audio) | I told you a hundred times to touch my stuff. Why don't you ever listen to me?
Are you kidding me?! Seriously?! I cannot believe this is happening again! | (audio) | Are you kidding me? Serious? I cannot believe this is happening again.
Just go away! I don't want to see you ever again! Leave me alone! | (audio) | I don't want you ever again! Leave me alone! Leave me alone!
How dare you say that to me?! I have never been so insulted in my entire life! | (audio) | How dare you say that to me? I have never been so insulted in my entire life.
I am absolutely sick and tired of this! Enough is enough! It ends right now! | (audio) | I am absolutely sick and tired out of this! Enough is enough! It ends right now!
You have no right to speak to me that way! Apologize immediately! | (audio) | You have no right to speak to me that way. Apologize immediately.
If you do that one more time, I swear there will be serious consequences! | (audio) | If you do that one more time, I swear there will be serious consequences.
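The Whisper hypotheses above can be scored against the input text with a word error rate (WER). The following is a minimal sketch, not the evaluation used for this demo: the helper names `normalize` and `word_error_rate` are hypothetical, and the normalization (lowercasing, dropping punctuation other than apostrophes) is an assumed choice, so that differences such as "?!" versus "?" do not count as errors.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation except apostrophes, split into words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Usage on the second sample, where the hypothesis drops the word "not".
reference = ("I told you a hundred times not to touch my stuff! "
             "Why don't you ever listen to me?!")
hypothesis = ("I told you a hundred times to touch my stuff. "
              "Why don't you ever listen to me?")
wer = word_error_rate(reference, hypothesis)  # one deletion over 18 words
```

WER treats every word equally, so a single dropped "not" scores as a small error even though it inverts the meaning of the sentence; listening to the audio remains the better check for cases like this one.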