Seedance 2.0 Prompt Guide: How to Write Reference Instructions That Actually Work
There's a specific kind of frustration that comes from knowing a tool is capable of something but not being able to get it to produce what you want. You've seen examples of the output quality. You know it's in there somewhere. But your prompts keep producing results that are adjacent to what you're after rather than actually what you're after, and you can't quite figure out what you're doing wrong.
This is where most people are when they first start working with Seedance 2.0. The multi-modal reference system is genuinely powerful, but it operates on a logic that's different from what most people are used to from text-only generation. The prompting instincts that work in one context don't always transfer cleanly to the other.
This guide is about the practical craft of writing reference instructions that produce results you can actually use. Not theory about how the model works internally — that gets speculative quickly and isn't always actionable — but the patterns of what tends to work, what tends to fail, and why.
The Fundamental Shift in How Prompts Work
In text-only AI video generation, the prompt carries almost all the creative information. Everything the model knows about what you want lives in the words you write. The prompt needs to be comprehensive because there's nothing else for the model to draw on.
In multi-modal generation with reference inputs, the prompt's role changes significantly. Your uploaded files — images, videos, audio — carry substantial visual and audio information directly. The prompt's job shifts from describing everything to directing the relationship between the inputs and the output. You're writing instructions about how the materials should be used rather than descriptions of what should be created.
This shift is subtle but consequential. Many first-time users of Seedance 2.0 write prompts that try to describe the output comprehensively, the way they would in a text-only system, while also uploading reference files. The result is often a confused output that partially follows the text description and partially follows the references, because the model is receiving conflicting instructions about what to prioritize. The prompt is saying "here's what it should look like" while the references are saying "here's what it should look like" — and those two things aren't necessarily pointing in the same direction.
The discipline is to let the references carry the visual information and use the prompt to specify how that visual information should be applied. The prompt directs; the references demonstrate.
The Tagging System: Precision Matters More Than You Think
The @image1, @video1, @audio1 tagging system is how you connect your prompt instructions to specific uploaded files. Using these tags precisely is one of the highest-leverage habits you can develop for getting consistent results.
Tags are numbered in the order you uploaded the files, and the tag names imply a separate count for each file type: @image1 is the first image you uploaded and @image2 is the second, while @video1 and @audio1 refer to the first video and the first audio file. Before writing your prompt, confirm which file corresponds to which tag, particularly when you have multiple inputs of the same type.
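If you keep a list of your files in upload order, a few lines of Python can write out the tag mapping before you start on the prompt. This is only a bookkeeping sketch, not part of Seedance itself: the function name `assign_tags` and the extension table are hypothetical, and it assumes each file type is counted separately, as the @image1 / @video1 / @audio1 naming suggests.

```python
import os
from collections import defaultdict

# Hypothetical extension-to-type table for illustration only.
# Adjust it to whatever formats you actually upload.
EXT_TO_KIND = {
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio",
}


def assign_tags(upload_order):
    """Map each file to a tag, counting each file type separately in upload order."""
    counters = defaultdict(int)
    tags = {}
    for name in upload_order:
        ext = os.path.splitext(name)[1].lower()
        kind = EXT_TO_KIND.get(ext)
        if kind is None:
            raise ValueError(f"Unrecognized file type: {name}")
        counters[kind] += 1
        tags[f"@{kind}{counters[kind]}"] = name
    return tags


if __name__ == "__main__":
    for tag, name in assign_tags(["hero.png", "walk.mp4", "garden.jpg", "beat.mp3"]).items():
        print(tag, "->", name)
```

Printing the mapping next to your draft makes it harder to write "@image2" when you meant the first image.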
When you reference a file in your prompt, be specific about what you're referencing it for. "Use @image1" is less clear than "Use @image1 as the opening frame" or "Apply the character appearance from @image1" or "Reference the color grading from @image1." The more specifically you tell the model what role each input plays, the more accurately it can act on your instructions.
When you have multiple inputs of the same type serving different purposes — say, one image as a character reference and another as a scene reference — the prompt needs to make that distinction explicit. "Use @image1 as the character reference and @image2 as the background setting" is much clearer than just mentioning both images without specifying their roles. The model can infer context from the types of images involved, but explicit instruction outperforms inference when precision matters.
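One cheap way to catch role mistakes before you generate is to compare the tags your prompt mentions against the tags you actually have. The sketch below works purely on text. The function name `check_tags` is hypothetical, and the tag pattern is an assumption based on the @image1, @video1, @audio1 naming described above.

```python
import re

# Assumed tag shape: @image, @video, or @audio followed by a number.
TAG_PATTERN = re.compile(r"@(?:image|video|audio)\d+")


def check_tags(prompt, uploaded_tags):
    """Report tags the prompt uses that were never uploaded, and uploads it never mentions."""
    used = set(TAG_PATTERN.findall(prompt))
    uploaded = set(uploaded_tags)
    return {
        "unknown": sorted(used - uploaded),  # mentioned in the prompt, not uploaded
        "unused": sorted(uploaded - used),   # uploaded, but given no role in the prompt
    }


if __name__ == "__main__":
    prompt = "Use @image1 as the character reference and @image2 as the background setting."
    print(check_tags(prompt, ["@image1", "@image2", "@video1"]))
```

An entry under "unused" is a file the prompt never assigns a role to, which is exactly the situation where the model is left to infer what you wanted.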
Separating Motion Instructions from Content Instructions
One of the most useful structural habits for writing multi-modal prompts is treating motion instructions and content instructions as separate concerns and addressing them explicitly and distinctly.
Motion instructions cover everything about how the video moves: the camera behavior, the speed and quality of movement, the way subjects move within the frame, the timing of transitions. These instructions often benefit from a video reference that demonstrates the motion rather than just describing it, but when you're using a video reference for motion, the prompt still needs to explicitly identify that this is what the reference is for. "Reference the camera movement from @video1" is clear. Just uploading a video and assuming the model will know you want the motion is not.
Content instructions cover what's actually in the frame: the subjects, the setting, the scenario, what's happening. These often come primarily from your image references and your text description rather than from video references, though a video reference can carry content information as well as motion information.
Mixing motion and content instructions in the same sentence or passage tends to produce less accurate results than addressing them in distinct parts of the prompt. "The camera slowly pushes forward through the garden to reveal the character standing at the fountain" folds a motion instruction and a content instruction into one sentence, and it asks the model to do more interpretation work than "Garden setting with stone fountain. Camera slowly pushes forward. Character reference from @image1, positioned at the fountain." The second version is less elegant prose, but it's clearer instruction.
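If you assemble prompts programmatically, or just want to enforce the habit on yourself, keeping the three kinds of instruction in separate lists makes the separation hard to skip. The helper below is a minimal sketch. `build_prompt` and its parameter names are hypothetical, and the output is simply a string you would paste into the prompt field.

```python
def build_prompt(content, motion, references):
    """Join content, motion, and reference instructions as separate sentences, in that order."""
    sentences = []
    for group in (content, motion, references):
        for line in group:
            # Normalize so every instruction ends with exactly one period.
            line = line.strip().rstrip(".")
            if line:
                sentences.append(line + ".")
    return " ".join(sentences)


if __name__ == "__main__":
    print(build_prompt(
        content=["Garden setting with stone fountain"],
        motion=["Camera slowly pushes forward"],
        references=["Character reference from @image1, positioned at the fountain"],
    ))
```

Run as is, this reproduces the second version of the garden prompt quoted above. The point of the structure is that a sentence has to be filed under one heading, which forces you to decide whether it is about motion or about content.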
What Specific Language Looks Like in Practice
Abstract creative direction tends to produce variable results. The model makes its own interpretation of words like "cinematic," "dramatic," "energetic," or "atmospheric" — and those interpretations may or may not match what you have in mind. When you have a reference that demonstrates the quality you mean by those words, use it rather than relying on the description alone. When you don't have a reference, try to translate the abstract description into something more concrete.
"Cinematic" by itself means different things to different people. "Slow moving camera with shallow depth of field and soft natural lighting, similar to the movement in @video1" gives the model much more to work with. "Energetic" could mean fast cuts, dynamic camera movement, high-contrast lighting, intense motion — all of those things, or any one of them. "Fast-paced cuts synced to the beat of @audio1 with dynamic handheld camera movement" is describing the same general quality but with enough specificity that the model's interpretation space is narrowed considerably.
The same principle applies to descriptions of subjects and settings. "A professional woman in a modern office" produces a generic output. "A professional woman in her mid-thirties, dressed in a dark blazer, in a minimalist office environment with large windows and natural light, character reference from @image1" produces something much closer to what you intended, and the character reference image removes most of the remaining ambiguity about what the character should look like.
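A quick self-check for this habit is to scan a draft for abstract words that appear in a sentence with no reference tag to anchor them. The sketch below is a rough lint over plain text. The word list is just the four examples from this section, the tag pattern is an assumption based on the @image1 style naming, and `flag_vague_sentences` is a hypothetical name.

```python
import re

# The four abstract terms used as examples in this section; extend as needed.
VAGUE_TERMS = ("cinematic", "dramatic", "energetic", "atmospheric")
# Assumed tag shape: @image, @video, or @audio followed by a number.
TAG_PATTERN = re.compile(r"@(?:image|video|audio)\d+")


def flag_vague_sentences(prompt):
    """Return (sentence, terms) pairs for sentences that use a vague term without any reference tag."""
    flagged = []
    # Naive sentence split on whitespace that follows ., !, or ?
    for sentence in re.split(r"(?<=[.!?])\s+", prompt.strip()):
        lowered = sentence.lower()
        terms = [t for t in VAGUE_TERMS if re.search(rf"\b{t}\b", lowered)]
        if terms and not TAG_PATTERN.search(sentence):
            flagged.append((sentence, terms))
    return flagged


if __name__ == "__main__":
    draft = "Make it cinematic. Energetic cuts synced to the beat of @audio1."
    for sentence, terms in flag_vague_sentences(draft):
        print(f"Consider making this concrete ({', '.join(terms)}): {sentence}")
```

A flagged sentence is not necessarily wrong. It is a prompt to ask whether you can name the camera behavior, lighting, or pacing you actually mean, or point at a reference that shows it.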
A Few Patterns Worth Knowing
Without claiming these are universal rules — the model's behavior is context-dependent and what works in one use case doesn't always transfer — a few patterns have proven consistently useful across different types of generation.
Separating "reference" instructions from "generate" instructions tends to produce cleaner results than mixing them. For example: "Reference the movement from @video1. Generate a [description] using this movement." This phrasing clearly distinguishes what you're taking from the reference from what you're creating fresh.
When you want the model to maintain a specific element precisely — a character's appearance, a product's details, a specific visual attribute — stating that requirement explicitly and emphatically tends to produce better adherence than mentioning it once in passing. "Maintain the exact character appearance from @image1 throughout the entire sequence, particularly the facial features and hair color" is more likely to produce consistent character rendering than a single reference tag without supporting instruction.
For audio sync, being explicit about where you want the sync to happen produces better results than just uploading the audio and assuming the model will figure out the synchronization. "Sync major motion changes to the beat of @audio1" gives the model a specific synchronization instruction rather than leaving it to interpret what audio sync should mean for your particular content.
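The three patterns above are regular enough to keep as reusable sentence templates. The functions below are a minimal sketch that only formats strings using the example wording from this section. Their names and parameters are hypothetical, and nothing here calls Seedance.

```python
def join_items(items):
    """Join items as 'a', 'a and b', or 'a, b and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def reference_then_generate(aspect, tag, description):
    """Say what is taken from the reference, then what is created fresh."""
    return f"Reference the {aspect} from {tag}. Generate {description} using this {aspect}."


def maintain(element, tag, details):
    """State a consistency requirement explicitly, naming the details that matter most."""
    return (
        f"Maintain the exact {element} from {tag} throughout the entire sequence, "
        f"particularly the {join_items(details)}."
    )


def sync_to_audio(what, tag):
    """Name the events that should land on the beat of the audio reference."""
    return f"Sync {what} to the beat of {tag}."


if __name__ == "__main__":
    print(reference_then_generate("movement", "@video1", "a dancer on a rooftop at dusk"))
    print(maintain("character appearance", "@image1", ["facial features", "hair color"]))
    print(sync_to_audio("major motion changes", "@audio1"))
```

Templates like these will not improve a weak idea, but they do keep the wording of your instructions consistent from one generation to the next, which makes it easier to tell whether a change in output came from the prompt or from the references.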
The craft of prompting multi-modal generation is genuinely learnable, and the learning curve flattens quickly once the underlying logic clicks. The investment in developing that skill has a direct and ongoing payoff in the quality and reliability of what you can produce. Seedance 2.0 rewards the time spent learning how to direct it well.