Samples from the final 50K checkpoint of the flow-matching DiT TTS model.
The reference voice is a real, angry-sounding audio clip (angry.mp3); the
model clones the speaker's timbre and delivery style into each generated
sample. Sampling settings: classifier-free guidance (CFG) scale 3.0,
30 sampling steps.
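
To make the two sampling settings concrete, here is a minimal sketch of how a CFG scale and a step count typically enter a flow-matching sampler. It is not this model's actual code: the Euler integrator, the `velocity(x, t, cond)` signature, the guidance formula `v_uncond + scale * (v_cond - v_uncond)`, and the `toy_velocity` stand-in for the DiT are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: velocity(x, t, cond) -> dx/dt.
# cond=None stands for the unconditional branch used by CFG.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def cfg_velocity(velocity: VelocityFn, x: Vector, t: float,
                 cond: Vector, scale: float) -> Vector:
    """Blend conditional and unconditional velocities (assumed CFG convention)."""
    v_cond = velocity(x, t, cond)
    v_uncond = velocity(x, t, None)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity: VelocityFn, x0: Vector, cond: Vector,
                 scale: float = 3.0, steps: int = 30) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = cfg_velocity(velocity, x, t, cond, scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Toy stand-in for the DiT: straight-line flow toward `cond` (or the origin)."""
    target = cond if cond is not None else [0.0] * len(x)
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]


if __name__ == "__main__":
    start = [1.0, -1.0]    # plays the role of the initial noise sample
    target = [0.5, 2.0]    # plays the role of the conditioning signal
    print(euler_sample(toy_velocity, start, target, scale=1.0, steps=30))
    print(euler_sample(toy_velocity, start, target, scale=3.0, steps=30))
```

In this toy setup a scale of 1.0 follows the conditional flow alone, while a scale of 3.0 pushes the result further away from the unconditional target. That is the general effect guidance is meant to have, although the real model's behavior depends on its learned velocity field.
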

| Angry text | Generated audio | Whisper hypothesis |
|---|---|---|
| What the heck do you think you're doing?! Get out of my house right now! | | What the heck do you think you're doing? Get out of my house right now! |
| I told you a hundred times not to touch my stuff! Why don't you ever listen to me?! | | I told you a hundred times to touch my stuff. Why don't you ever listen to me? |
| Are you kidding me?! Seriously?! I cannot believe this is happening again! | | Are you kidding me? Serious? I cannot believe this is happening again. |
| Just go away! I don't want to see you ever again! Leave me alone! | | I don't want you ever again! Leave me alone! Leave me alone! |
| How dare you say that to me?! I have never been so insulted in my entire life! | | How dare you say that to me? I have never been so insulted in my entire life. |
| I am absolutely sick and tired of this! Enough is enough! It ends right now! | | I am absolutely sick and tired out of this! Enough is enough! It ends right now! |
| You have no right to speak to me that way! Apologize immediately! | | You have no right to speak to me that way. Apologize immediately. |
| If you do that one more time, I swear there will be serious consequences! | | If you do that one more time, I swear there will be serious consequences. |
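
The table above compares each input text with its Whisper transcript by eye. One common way to quantify that comparison is word error rate (WER). The sketch below is an illustration rather than the evaluation actually used here: the normalization rule (lowercase, keep only letters, digits, and apostrophes) and the helper names `normalize`, `edit_distance`, and `wer` are assumptions.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and keep only word-like tokens, so '?!' vs '?' is not an error."""
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference text contains no words")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ref = ("I told you a hundred times not to touch my stuff! "
           "Why don't you ever listen to me?!")
    hyp = ("I told you a hundred times to touch my stuff. "
           "Why don't you ever listen to me?")
    print(f"WER = {wer(ref, hyp):.3f}")
```

Under this normalization, the second row counts as one deleted word ("not") out of 18 reference words, while rows that differ only in punctuation score 0.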