CAFA: a Controllable Automatic Foley Artist

Roi Benita*
Technion

Michael Finkelson*
Hebrew University of Jerusalem

Tavi Halperin
Lightricks

Gleb Sterkin
Lightricks

Yossi Adi
Hebrew University of Jerusalem

*Equal contribution. Work done as part of an internship at Lightricks.

Abstract

Foley, a key element in video production, refers to the process of adding an audio signal to a silent video while ensuring semantic and temporal alignment. In recent years, the rise of personalized content creation and advances in automatic video-to-audio models have increased the demand for greater user control over the process. One possible approach is to incorporate text to guide audio generation. Although existing methods support this, challenges remain in ensuring compatibility between modalities, particularly when the text introduces additional information or contradicts the sounds naturally inferred from the visuals. In this work, we introduce CAFA (Controllable Automatic Foley Artist), a video-and-text-to-audio model that generates semantically and temporally aligned audio for a given video, guided by text input. CAFA is built upon a text-to-audio model and integrates video information through a ControlNet mechanism. By incorporating text, users can refine semantic details and introduce creative variations, guiding the audio synthesis beyond the expected video contextual cues. Experiments show that, in addition to its superior quality in terms of semantic alignment and audio-visual synchronization, our proposed model enables high textual controllability, as demonstrated in subjective and objective evaluations.
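To make the ControlNet idea mentioned in the abstract more concrete, here is a minimal toy sketch of the general pattern: a frozen base layer, a trainable copy that also receives the conditioning signal, and a zero-initialized projection that adds the copy's output back as a residual. This is an illustration of the generic ControlNet mechanism under simplifying assumptions, not CAFA's actual implementation. All names (`frozen_block`, `ControlBranch`, `controlled_forward`, `video_feat`) are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Callable, List

Vector = List[float]


def frozen_block(x: Vector) -> Vector:
    """Hypothetical stand-in for one frozen layer of a pretrained text-to-audio model."""
    return [2.0 * v + 1.0 for v in x]


class ControlBranch:
    """Trainable copy of a base block followed by a zero-initialized projection."""

    def __init__(self, block: Callable[[Vector], Vector], dim: int) -> None:
        self.block = block             # trainable copy (same function in this toy)
        self.zero_scale = [0.0] * dim  # zero init: no influence before training

    def __call__(self, x: Vector, video_feat: Vector) -> Vector:
        # Inject the (assumed, precomputed) video features into the copy's input.
        conditioned = [a + b for a, b in zip(x, video_feat)]
        hidden = self.block(conditioned)
        # Elementwise projection; it outputs zeros until the scale is trained.
        return [s * h for s, h in zip(self.zero_scale, hidden)]


def controlled_forward(x: Vector, video_feat: Vector, branch: ControlBranch) -> Vector:
    """Frozen base output plus the control branch's residual."""
    base = frozen_block(x)
    residual = branch(x, video_feat)
    return [b + r for b, r in zip(base, residual)]


x = [1.0, 2.0]
video_feat = [0.5, -1.0]
branch = ControlBranch(frozen_block, dim=2)

# Before any training, the output equals the frozen model's output.
untrained = controlled_forward(x, video_feat, branch)

# Pretend training moved the projection away from zero.
branch.zero_scale = [0.5, 0.5]
trained = controlled_forward(x, video_feat, branch)
```

The zero initialization is the key design choice in this pattern: at the start of training, the conditioned model behaves exactly like the pretrained one, so the video signal can be introduced gradually without disturbing what the base model already does.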

Creative Automated Foley with textual control

Given a silent video, CAFA generates temporally aligned sounds while enabling creative sound design through text prompts.

An iconic scene from Jurassic Park, where water in a glass shakes due to the approaching footsteps of a T-Rex.
The video alone is insufficient to infer the intended sound, as the task is inherently ambiguous.
Our method leverages the prompt "T-Rex Stomping" to generate a synchronized audio track that aligns with both the visual timing and artistic intent.

Video-to-Audio Controlled by Text Input

Explore how different models generate audio based on a given video and text prompt!
Use the buttons on the left to select a text prompt. Each column in the table
represents the synthesis output of a different model, while the first column
displays the original video with its original audio.

	GT	CAFA (ours)	MMAudio	FoleyCrafter	REWAS

BibTeX


@inproceedings{benita2025controllableautomaticfoleyartist,
      title={Controllable Automatic Foley Artist}, 
      author={Roi Benita and Michael Finkelson and Tavi Halperin and Gleb Sterkin and Yossi Adi},
      year={2025},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      url={https://arxiv.org/abs/2504.06778}, 
}

Acknowledgements

The template used to build this webpage can be found here.
It is inspired by the original template, available here.