Voice Clone Sample Limits: What Actually Matters

A practical guide to LyricWinter voice sample limits, official provider guidance, and what to upload for the best cloning results.


If you only remember one rule, make your voice clone sample 10 to 15 seconds of clean, single-speaker speech. LyricWinter accepts up to 30 seconds and 25MB, but some backends only use the first 12 to 15 seconds of a clip, so put the best part of the voice near the beginning.[1]

A great sample sounds like a person naturally speaking in a quiet room. It does not need music, dramatic effects, multiple takes, or a high sample rate. It needs a consistent voice, clear words, and no competing speaker.

What You Should Care About

  • Length: 10 to 15 seconds is the safest all-backend target. Use up to 30 seconds only if the whole clip is clean.
  • First seconds: Inworld and F5-TTS have short reference windows, so do not hide the best performance at the end.
  • Clean audio: background noise, music, echo, clipping, and multiple speakers matter more than file format trivia.
  • One voice: upload one speaker per voice. If a clip has two people talking, it is a bad clone sample.
  • File size: stay under LyricWinter's 25MB upload cap.
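The checklist above can be turned into a quick pre-upload check. The following is a minimal sketch, not LyricWinter code: the function name `check_sample` and the constants are hypothetical, and it assumes you already know the clip's duration and file size. It also assumes "25MB" means 25 MiB, which the upload page does not specify.

```python
# Limits taken from this guide. Names are illustrative, not a LyricWinter API.
MAX_SECONDS = 30.0
MAX_BYTES = 25 * 1024 * 1024  # assumption: 25MB is read as 25 MiB
TARGET_MIN_SECONDS = 10.0
TARGET_MAX_SECONDS = 15.0


def check_sample(duration_seconds: float, size_bytes: int) -> list[str]:
    """Return human-readable warnings for a candidate clone sample."""
    warnings = []
    if size_bytes > MAX_BYTES:
        warnings.append("file exceeds the 25MB upload cap")
    if duration_seconds > MAX_SECONDS:
        warnings.append("clip is longer than 30 seconds")
    elif duration_seconds > TARGET_MAX_SECONDS:
        warnings.append("clip is longer than 15 seconds; keep the strongest speech at the start")
    elif duration_seconds < TARGET_MIN_SECONDS:
        warnings.append("clip is shorter than 10 seconds")
    return warnings


if __name__ == "__main__":
    # A 12-second, 1 MB clip falls inside the recommended range.
    print(check_sample(12.0, 1_000_000))
```

A check like this cannot judge noise, echo, or a second speaker. Those still need a listen.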

Limits by Backend

LyricWinter uses several voice backends. The upload limit is simple, but each backend treats the uploaded sample differently.

| Backend | Provider guidance | LyricWinter behavior | User takeaway |
| --- | --- | --- | --- |
| LyricWinter upload | Not a provider limit. | One voice sample per uploaded voice, up to 30 seconds and 25MB. | Use one clean clip. Do not upload a long reel. |
| Inworld | Portal cloning documents one file, 4MB max, and trims samples longer than 15 seconds. The API can represent multiple samples. | LyricWinter currently sends one stored sample to Inworld. | Put the best 10 to 15 seconds at the start. |
| Fish Audio | Docs recommend at least 10 seconds per clip. More clean speech can help persistent voice creation. | LyricWinter caps user-uploaded samples at 30 seconds and 25MB. | Use 10 to 30 seconds if every second is clean. |
| F5-TTS | Inference docs recommend short reference clips under 12 seconds; longer references are clipped. | LyricWinter accepts 30 seconds, but F5-TTS only benefits from the short reference window. | Make the first 12 seconds strong. |
| MisoTTS | Public docs do not publish a simple voice-sample duration cap. | LyricWinter prepares the sample for the model and uses at most 30 seconds. | Use clean speech. Do not chase sample rate. |
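To see why the opening seconds matter, it can help to compute how much of a clip each backend would plausibly use. This is a rough sketch based only on the figures in the table above. The dictionary and function names are hypothetical, the Fish Audio and MisoTTS values simply reflect LyricWinter's 30-second cap, and the F5-TTS value treats "under 12 seconds" as a 12-second window.

```python
# Approximate seconds of reference audio each backend uses, per the table above.
# These are this guide's summary figures, not values read from any provider API.
REFERENCE_WINDOW_SECONDS = {
    "Inworld": 15.0,     # Portal trims samples longer than 15 seconds
    "Fish Audio": 30.0,  # bounded by the LyricWinter upload cap
    "F5-TTS": 12.0,      # short reference window
    "MisoTTS": 30.0,     # LyricWinter uses at most 30 seconds
}


def used_seconds(clip_seconds: float) -> dict[str, float]:
    """Estimate how many seconds of a clip each backend would draw on."""
    return {
        backend: min(clip_seconds, window)
        for backend, window in REFERENCE_WINDOW_SECONDS.items()
    }


def shortest_window() -> float:
    """The window every backend shares: the safest place for the best speech."""
    return min(REFERENCE_WINDOW_SECONDS.values())


if __name__ == "__main__":
    for backend, seconds in used_seconds(20.0).items():
        print(f"{backend}: {seconds:.0f}s of a 20s clip")
```

With a 20-second clip, only the first 12 seconds reach every backend under these assumptions, which is why the guide keeps pointing at the start of the recording.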

Provider Notes

Inworld: Inworld's Portal flow documents one uploaded file, 4MB max, and trimming above 15 seconds.[2] Its API examples can build a voiceSamples array from multiple audio paths.[3] LyricWinter does not expose multi-clip Inworld cloning today; it sends one stored sample.

Fish Audio: Fish Audio recommends at least 10 seconds per clip, and says more clear single-speaker speech can improve persistent voice creation.[4] LyricWinter keeps uploaded samples shorter so the same voice can work across multiple engines.

F5-TTS: F5-TTS recommends reference clips under 12 seconds and trims longer references before inference.[5] If F5-TTS matters to you, the opening 12 seconds matter most.

MisoTTS: MisoTTS uses prompt audio as voice context.[6] LyricWinter converts the sample into the format the model expects and uses up to 30 seconds. That conversion can technically change the audio a little, but users should not optimize around it. Clean speech is what matters.

What You Do Not Need To Care About

  • Internal sanitization limits: those are safety checks, not upload advice.
  • Exact sample rate: ordinary speech recordings are fine. Higher sample rate does not automatically mean a better clone.
  • Inworld multi-sample API support: useful for future product work, but not something the current LyricWinter upload flow asks you to manage.
  • Disabled or deprecated engines: RVC, Qwen3-TTS, and Zyphra are not part of this upload decision.

Best Upload Recipe

Record 10 to 15 seconds of natural speech, with one speaker, no background audio, no heavy effects, and no long silence. If the whole clip is excellent, 20 to 30 seconds can help Fish Audio and MisoTTS. If the clip gets worse after 12 seconds, cut it shorter.
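If a recording is good at the start and weaker later, cutting it down is straightforward. Below is a minimal sketch using Python's standard `wave` module. It assumes the sample is an uncompressed PCM WAV file (other formats would need a different tool), and the function name `trim_wav` and the 12-second default are illustrative choices, not part of LyricWinter.

```python
import wave


def trim_wav(src_path: str, dst_path: str, max_seconds: float = 12.0) -> float:
    """Copy at most max_seconds from the start of a PCM WAV file.

    Returns the duration of the written file in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frame_rate = src.getframerate()
        # Keep whole frames only, and never more than the file contains.
        frames_to_keep = min(src.getnframes(), int(max_seconds * frame_rate))
        frames = src.readframes(frames_to_keep)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channels, sample width, and frame rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)

    return frames_to_keep / frame_rate
```

For example, `trim_wav("take1.wav", "take1_short.wav")` would keep the first 12 seconds of a hypothetical `take1.wav`. Trimming only helps if the kept portion is the clean part, so listen to the result before uploading.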

For most users, the winning sample is not the longest one. It is the cleanest one.

References

  [1] LyricWinter - Generate
  [2] Inworld AI Documentation - Voice Cloning
  [3] Inworld API Examples - Voice Cloning
  [4] Fish Audio Documentation - Voice Cloning
  [5] F5-TTS Inference README
  [6] MisoTTS GitHub Repository