*Equal contribution. Work done as part of an internship at Lightricks.
Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advances in automatic video-to-audio models have increased the demand for greater user control over the process. One possible approach is to incorporate text to guide audio generation. Although existing methods support text input, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a ControlNet mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, in addition to its superior quality in terms of semantic alignment and audio-visual synchronization, the proposed model enables high textual controllability, as demonstrated in subjective and objective evaluations.
Given a silent video, CAFA generates temporally aligned sounds while enabling creative sound design through text prompts.
An iconic scene from Jurassic Park, in which water in a glass shakes due to the approaching footsteps of a T-Rex.
The video alone is not enough to determine the sound,
as the task is inherently ambiguous.
Our method leverages the
prompt "T-Rex Stomping" to generate a synchronized audio track
that aligns with both the visual timing and the artistic intent.
Explore how different models generate audio based on a given video and text prompt!
Use the buttons on the left to select a text prompt.
Each column in the table
represents the synthesis output of a different model, while the first column
displays the original video with its original audio.
GT | CAFA (ours) | MMAudio | FoleyCrafter | REWAS |
---|---|---|---|---|
Bibtex