Cartesia Emotion and SSML Controls in LyricWinter

How LyricWinter uses surrounding dialogue context with Cartesia Sonic 3.5, plus the full Cartesia emotion list and manual SSML controls.


Cartesia Sonic 3.5 can infer emotional subtext from the transcript, but a generated LyricWinter clip usually contains only one line, which leaves the model little to infer from. LyricWinter therefore looks at nearby detected dialogue and passes Cartesia's native controls when the surrounding scene gives useful delivery context.[1]

The important difference from OpenAI TTS is the control shape. OpenAI accepts natural-language delivery instructions. Cartesia accepts documented controls: an emotion value, a speed multiplier, and a volume multiplier. LyricWinter maps the same nearby story context into those Cartesia-specific fields.

Automatic Cartesia Guidance

When a voice uses cartesia-sonic-3.5, LyricWinter can analyze up to 10 lines before and 10 lines after the current clip. If the context clearly suggests a performance direction, it sends Cartesia a generation_config with emotion, speed, and/or volume.

{
  "generation_config": {
    "emotion": "anxious",
    "speed": 1.08,
    "volume": 0.9
  }
}

If the scene is ambiguous or the line already works from the text alone, LyricWinter can omit the controls and let Cartesia work from the transcript. Cartesia's docs also warn that emotion guidance works best when it matches the transcript, so the system avoids forcing mismatched emotions.[1]
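As a rough illustration of the two steps described above, the sketch below slices the surrounding dialogue and builds a request fragment that includes only the controls that were actually chosen. The function names context_window and build_generation_config are hypothetical and are not LyricWinter's real code; the emotion, speed, and volume field names come from the example above, and how the context is turned into a performance direction is deliberately left out.

```python
from typing import List, Optional, Tuple


def context_window(
    lines: List[str], index: int, radius: int = 10
) -> Tuple[List[str], List[str]]:
    """Return up to `radius` lines before and after the line at `index`."""
    before = lines[max(0, index - radius):index]
    after = lines[index + 1:index + 1 + radius]
    return before, after


def build_generation_config(
    emotion: Optional[str] = None,
    speed: Optional[float] = None,
    volume: Optional[float] = None,
) -> dict:
    """Build a request fragment holding only the controls that were set.

    Returns an empty dict when no control was chosen, so an ambiguous
    scene adds nothing to the request (hypothetical helper).
    """
    controls = {"emotion": emotion, "speed": speed, "volume": volume}
    config = {name: value for name, value in controls.items() if value is not None}
    return {"generation_config": config} if config else {}
```

Calling build_generation_config(emotion="anxious", speed=1.08, volume=0.9) reproduces the JSON shown above, while calling it with no arguments returns an empty dict.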

Manual SSML Controls

Cartesia also supports SSML-like tags inside the transcript. If LyricWinter sees a Cartesia SSML tag in the line, it treats that as deliberate user direction and skips the automatic Cartesia guidance step for that clip.[2]

  • <emotion value="sad"/> directs emotion.
  • <speed ratio="1.2"/> changes pacing.
  • <volume ratio="0.7"/> changes loudness.
  • <break time="1s"/> inserts a fixed pause.
  • <spell>ABC123</spell> asks Cartesia to read characters one by one.

Figure: LyricWinter Generate page with voice control tags written in the story editor.

Write manual Cartesia SSML tags only when you want to override automatic context guidance for that line.
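
One way to picture the "skip automatic guidance" rule is a simple check for any of the five tags listed above. This is a minimal sketch, assuming a regular-expression match is good enough; the name has_cartesia_ssml is hypothetical, and LyricWinter's actual detection logic is not documented here.

```python
import re

# Matches the self-closing emotion/speed/volume/break tags and the paired
# <spell>...</spell> tag listed above. Illustrative pattern only.
CARTESIA_TAG = re.compile(
    r"<(?:emotion|speed|volume|break)\b[^<>]*/>|<spell>.*?</spell>",
    re.IGNORECASE | re.DOTALL,
)


def has_cartesia_ssml(line: str) -> bool:
    """Return True when the line contains a user-written Cartesia SSML tag."""
    return CARTESIA_TAG.search(line) is not None


def should_apply_automatic_guidance(line: str) -> bool:
    """Automatic guidance is skipped when the user has tagged the line."""
    return not has_cartesia_ssml(line)
```

Under this sketch, a line such as `<emotion value="sad"/> I miss you.` would be left exactly as written, while an untagged line would still be eligible for automatic guidance.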

Complete Cartesia Emotion List

Cartesia documents primary emotions such as neutral, angry, excited, content, sad, and scared, plus this broader set of supported emotion values.[1]

Positive and high energy

happy, excited, enthusiastic, elated, euphoric, triumphant, amazed, surprised, flirtatious, joking/comedic, curious

Calm, warm, and relational

content, peaceful, serene, calm, grateful, affectionate, trust, sympathetic, anticipation, mysterious

Conflict and edge

angry, mad, outraged, frustrated, agitated, threatened, disgusted, contempt, envious, sarcastic, ironic

Sadness, fatigue, and vulnerability

sad, dejected, melancholic, disappointed, hurt, guilty, bored, tired, rejected, nostalgic, wistful, apologetic

Uncertainty and fear

hesitant, insecure, confused, resigned, anxious, panicked, alarmed, scared

Neutral and character stance

neutral, proud, confident, distant, skeptical, contemplative, determined
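
If you keep emotion values in your own scripts or tooling, a small lookup can catch typos before a clip is generated. The sketch below copies the six groups from this page into a dictionary; the names EMOTION_GROUPS and is_supported_emotion are hypothetical, and the group labels are this article's, not Cartesia's.

```python
# Emotion values as grouped on this page; group labels are editorial.
EMOTION_GROUPS = {
    "positive": [
        "happy", "excited", "enthusiastic", "elated", "euphoric", "triumphant",
        "amazed", "surprised", "flirtatious", "joking/comedic", "curious",
    ],
    "calm": [
        "content", "peaceful", "serene", "calm", "grateful", "affectionate",
        "trust", "sympathetic", "anticipation", "mysterious",
    ],
    "conflict": [
        "angry", "mad", "outraged", "frustrated", "agitated", "threatened",
        "disgusted", "contempt", "envious", "sarcastic", "ironic",
    ],
    "sadness": [
        "sad", "dejected", "melancholic", "disappointed", "hurt", "guilty",
        "bored", "tired", "rejected", "nostalgic", "wistful", "apologetic",
    ],
    "fear": [
        "hesitant", "insecure", "confused", "resigned", "anxious", "panicked",
        "alarmed", "scared",
    ],
    "stance": [
        "neutral", "proud", "confident", "distant", "skeptical",
        "contemplative", "determined",
    ],
}

# Flatten the groups into one set for fast membership checks.
SUPPORTED_EMOTIONS = {name for group in EMOTION_GROUPS.values() for name in group}


def is_supported_emotion(value: str) -> bool:
    """Return True if the value appears in the list on this page."""
    return value.strip().lower() in SUPPORTED_EMOTIONS
```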

A Practical Workflow

  • Start automatic: generate the scene normally with Cartesia Sonic 3.5 and let LyricWinter use nearby context.
  • Only tag specific misses: if one line needs a clear override, add a Cartesia SSML tag to that line and regenerate the clip.
  • Do not mix tag systems: Chatterbox's <sigh> and Fish Audio's [whispering] are not Cartesia controls.
  • Prefer natural punctuation first: Cartesia recommends punctuation for normal pauses, reserving <break> for deliberate fixed silence.[2]
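
The "do not mix tag systems" rule can be checked before regenerating a clip. The following is a heuristic sketch, not LyricWinter behavior: it flags any angle-bracket tag whose name is not one of the five Cartesia tags, plus any square-bracket cue. The name find_foreign_tags is hypothetical, and the square-bracket rule will also flag ordinary bracketed text, so treat the result as a hint to review the line.

```python
import re
from typing import List

CARTESIA_TAG_NAMES = {"emotion", "speed", "volume", "break", "spell"}

# Any opening, closing, or self-closing angle-bracket tag; group 1 is its name.
ANGLE_TAG = re.compile(r"</?([A-Za-z][\w-]*)[^<>]*>")
# Any square-bracket cue such as [whispering].
SQUARE_CUE = re.compile(r"\[[^\[\]]+\]")


def find_foreign_tags(line: str) -> List[str]:
    """Return tags in the line that are not Cartesia SSML controls."""
    foreign = [
        match.group(0)
        for match in ANGLE_TAG.finditer(line)
        if match.group(1).lower() not in CARTESIA_TAG_NAMES
    ]
    foreign += SQUARE_CUE.findall(line)
    return foreign
```

An empty result means the line contains only Cartesia tags or no tags at all.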

Cartesia Emotion Control FAQ

Does LyricWinter add Cartesia emotion controls automatically?

Yes. For Cartesia Sonic 3.5, LyricWinter analyzes nearby dialogue context and can pass Cartesia generation_config controls for emotion, speed, and volume.

What if I write my own Cartesia SSML tags?

LyricWinter leaves user-authored Cartesia SSML controls in the line and skips the automatic guidance step for that clip.

Can I use Chatterbox or Fish Audio tags with Cartesia?

No. Use Cartesia's speed, volume, emotion, break, and spell tags for Cartesia. Chatterbox uses angle-bracket sound tags, and Fish Audio uses square-bracket cues.

References

  [1] Cartesia - Volume, Speed, and Emotion
  [2] Cartesia - SSML Tags
  [3] LyricWinter - Generate