GenCompositor: Generative Video Compositing with Diffusion Transformer
arXiv 2025
Anonymous authors
¹Anonymous institution
Abstract
Video compositing is a crucial step in video creation and film production, bridging the gap between live-action footage and the final video. Traditionally, this process demands intensive manual labor. We therefore aim to automate it with a generative method, a task we refer to as generative video compositing. Specifically, the task seeks to seamlessly inject both the identity and motion of dynamic video elements into a target video in a user-controllable manner, allowing users to effortlessly dictate the size, motion trajectory, and other attributes of the added dynamic elements in the final video. To achieve this, we introduce a novel Diffusion Transformer (DiT) pipeline specifically designed for generative video compositing. We develop a DiT-based background preservation branch to ensure content consistency between the video before and after editing. Furthermore, to seamlessly integrate dynamic foreground elements and coordinate them with their environment, we design a novel DiT fusion block. This block merges the condition tokens through self-attention, fully inheriting the foreground information while adapting it to the background, thereby producing realistic and coherent compositions. To train our model, we curate a dataset of 27K video samples, comprising complete dynamic elements and high-quality target videos. Experimental results demonstrate that our proposed method achieves promising results on the generative video compositing task.
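To make the idea of merging conditions with self-attention concrete, below is a minimal PyTorch sketch: condition tokens (e.g., foreground-element and background latents) are concatenated with the noisy video tokens into one sequence, so a single self-attention pass lets the video tokens attend to both. The class name, token shapes, and layer choices are illustrative assumptions on our part, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FusionSelfAttentionBlock(nn.Module):
    """Sketch of a DiT-style fusion block that injects conditions
    via joint self-attention. All names and shapes are assumptions."""

    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(
        self,
        video_tokens: torch.Tensor,  # (B, N_v, D) noisy target-video tokens
        cond_tokens: torch.Tensor,   # (B, N_c, D) foreground/background condition tokens
    ) -> torch.Tensor:
        n_v = video_tokens.shape[1]
        # Joint sequence: conditions and video share one attention pass,
        # so foreground identity flows into the video tokens directly.
        x = torch.cat([cond_tokens, video_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Keep only the updated video tokens; conditions are auxiliary.
        return x[:, -n_v:]

if __name__ == "__main__":
    block = FusionSelfAttentionBlock()
    video = torch.randn(2, 256, 1024)  # e.g., patchified video latents
    conds = torch.randn(2, 128, 1024)  # e.g., foreground + trajectory tokens
    out = block(video, conds)
    print(out.shape)  # torch.Size([2, 256, 1024])
```

The design choice here is that conditioning via concatenated self-attention (rather than separate cross-attention layers) lets foreground detail propagate without an information bottleneck, at the cost of a longer attention sequence.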