Cartesia Emotion and SSML Controls in LyricWinter
How LyricWinter uses surrounding dialogue context with Cartesia Sonic 3.5, plus the full Cartesia emotion list and manual SSML controls.
Cartesia Sonic 3.5 can infer emotional subtext from the transcript, but a generated LyricWinter clip usually contains only one line, which gives the model little to work with. LyricWinter therefore looks at nearby detected dialogue and passes Cartesia's native controls when the surrounding scene gives useful delivery context.[1]
The important difference from OpenAI TTS is the control shape. OpenAI accepts natural-language delivery instructions. Cartesia accepts documented controls: an emotion value, a speed multiplier, and a volume multiplier. LyricWinter maps the same nearby story context into those Cartesia-specific fields.
Automatic Cartesia Guidance
When a voice uses cartesia-sonic-3.5, LyricWinter can analyze up to 10 lines before and 10 lines after the current clip. If the context clearly suggests a performance direction, it sends Cartesia a generation_config with emotion, speed, and/or volume.
{
  "generation_config": {
    "emotion": "anxious",
    "speed": 1.08,
    "volume": 0.9
  }
}

If the scene is ambiguous or the line already works from the text alone, LyricWinter does not need to send a control. Cartesia's docs also warn that emotion guidance works best when it matches the transcript, so the system avoids forcing mismatched emotions.[1]
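The two ideas above, a bounded context window and controls that are sent only when they are set, can be sketched in a few lines of Python. This is an illustration, not LyricWinter's code: the function names `context_window` and `build_generation_config` are hypothetical, and the step that actually chooses an emotion from the context is left out because the article does not describe it.

```python
from typing import Optional

CONTEXT_RADIUS = 10  # up to 10 lines before and 10 after, as described above


def context_window(lines: list[str], index: int,
                   radius: int = CONTEXT_RADIUS) -> tuple[list[str], list[str]]:
    """Return up to `radius` lines before and after lines[index] (hypothetical helper)."""
    start = max(0, index - radius)
    before = lines[start:index]
    after = lines[index + 1:index + 1 + radius]
    return before, after


def build_generation_config(emotion: Optional[str] = None,
                            speed: Optional[float] = None,
                            volume: Optional[float] = None) -> dict:
    """Build the generation_config fragment, omitting any control left unset.

    Returns an empty dict when no control applies, so nothing extra is sent.
    """
    controls = {"emotion": emotion, "speed": speed, "volume": volume}
    config = {key: value for key, value in controls.items() if value is not None}
    return {"generation_config": config} if config else {}
```

Returning an empty dict for the ambiguous case keeps the "send nothing" path explicit, instead of sending a neutral emotion that might not match the transcript.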
Manual SSML Controls
Cartesia also supports SSML-like tags inside the transcript. If LyricWinter sees a Cartesia SSML tag in the line, it treats that as deliberate user direction and skips the automatic Cartesia guidance step for that clip.[2]
- <emotion value="sad"/> directs emotion.
- <speed ratio="1.2"/> changes pacing.
- <volume ratio="0.7"/> changes loudness.
- <break time="1s"/> inserts a fixed pause.
- <spell>ABC123</spell> asks Cartesia to read characters one by one.
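One way to decide whether a line already carries manual direction is a simple pattern check for the five tag names listed here. The sketch below is an assumption about how such a check could look, not LyricWinter's actual detection logic; `has_cartesia_ssml` is a hypothetical name.

```python
import re

# Matches an opening or self-closing tag whose name is one of the five
# Cartesia SSML-like tags named in this article.
CARTESIA_TAG = re.compile(
    r"<\s*(emotion|speed|volume|break|spell)\b[^>]*>",
    re.IGNORECASE,
)


def has_cartesia_ssml(line: str) -> bool:
    """Return True if the line contains at least one Cartesia SSML-like tag."""
    return CARTESIA_TAG.search(line) is not None


# Lines with a manual tag would skip automatic guidance; others would not.
for text in ['<emotion value="sad"/>I can\'t stay.', "I can't stay."]:
    mode = "manual" if has_cartesia_ssml(text) else "automatic"
    print(f"{mode}: {text}")
```

The word boundary after the tag name keeps look-alike names such as `<speedy>` from counting as a Cartesia tag.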

Complete Cartesia Emotion List
Cartesia documents primary emotions such as neutral, angry, excited, content, sad, and scared, plus this broader set of supported emotion values.[1]
Positive and high energy
happy, excited, enthusiastic, elated, euphoric, triumphant, amazed, surprised, flirtatious, joking/comedic, curious
Calm, warm, and relational
content, peaceful, serene, calm, grateful, affectionate, trust, sympathetic, anticipation, mysterious
Conflict and edge
angry, mad, outraged, frustrated, agitated, threatened, disgusted, contempt, envious, sarcastic, ironic
Sadness, fatigue, and vulnerability
sad, dejected, melancholic, disappointed, hurt, guilty, bored, tired, rejected, nostalgic, wistful, apologetic
Uncertainty and fear
hesitant, insecure, confused, resigned, anxious, panicked, alarmed, scared
Neutral and character stance
neutral, proud, confident, distant, skeptical, contemplative, determined
A Practical Workflow
- Start automatic: generate the scene normally with Cartesia Sonic 3.5 and let LyricWinter use nearby context.
- Only tag specific misses: if one line needs a clear override, add a Cartesia SSML tag to that line and regenerate the clip.
- Do not mix tag systems: Chatterbox's <sigh> and Fish Audio's [whispering] are not Cartesia controls.
- Prefer natural punctuation first: Cartesia recommends punctuation for normal pauses, reserving <break> for deliberate fixed silence.[2]
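Before regenerating a clip, a small check can flag tags that belong to another provider's system. The following is a rough sketch under two assumptions: any angle-bracket tag outside the five Cartesia names is treated as foreign, and any square-bracket cue is treated as a Fish Audio style tag. The function name `foreign_tags` is hypothetical, and a story line that uses square brackets for other reasons would produce a false positive.

```python
import re

CARTESIA_TAGS = {"emotion", "speed", "volume", "break", "spell"}

# Any angle-bracket tag (opening, closing, or self-closing); group 1 is its name.
ANGLE_TAG = re.compile(r"</?\s*([A-Za-z][\w-]*)[^>]*>")
# Any square-bracket cue such as [whispering].
SQUARE_CUE = re.compile(r"\[[^\[\]]+\]")


def foreign_tags(line: str) -> list[str]:
    """List tags in the line that are not Cartesia controls (heuristic)."""
    found = [
        match.group(0)
        for match in ANGLE_TAG.finditer(line)
        if match.group(1).lower() not in CARTESIA_TAGS
    ]
    found += SQUARE_CUE.findall(line)
    return found


print(foreign_tags("<sigh> Fine. [whispering] Go."))
print(foreign_tags('<emotion value="sad"/>I know.'))
```

An empty result means the line contains only Cartesia tags or no tags at all, so it is safe to send to a Cartesia voice as written.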
Cartesia Emotion Control FAQ
Does LyricWinter add Cartesia emotion controls automatically?
Yes. For Cartesia Sonic 3.5, LyricWinter analyzes nearby dialogue context and can pass Cartesia generation_config controls for emotion, speed, and volume.
What if I write my own Cartesia SSML tags?
LyricWinter leaves user-authored Cartesia SSML controls in the line and skips the automatic guidance step for that clip.
Can I use Chatterbox or Fish Audio tags with Cartesia?
No. Use Cartesia's speed, volume, emotion, break, and spell tags for Cartesia. Chatterbox uses angle-bracket sound tags, and Fish Audio uses square-bracket cues.