ObjectMover | Generative Object Motion Modeling Based on Video Priors


ObjectMover is a CVPR 2025 paper jointly proposed by the University of Hong Kong and Adobe Research. Its core goal is to address the compound challenges of moving an object within an image: harmonizing illumination, adjusting perspective, filling newly revealed occluded regions, and synchronizing shadows and reflections, all while preserving the object's identity. Because traditional methods struggle to handle these problems jointly, the research team proposes leveraging the prior knowledge of video generation models, achieving realistic object motion through sequence-to-sequence modeling.

Core Innovation Points

  1. Video prior transfer
    Object movement is treated as a special case of two-frame video generation, exploiting the cross-frame consistency learned by pre-trained video generation models (e.g., video diffusion models). The model is transferred from the video generation task to image editing by fine-tuning.
  2. Sequence-to-sequence modeling
    The object-movement task is recast as a sequence prediction problem: the inputs are the original image, the target object image, and an instruction map (marking the position and direction of the move); the output is a synthesized image with the object relocated.
  3. Synthetic dataset construction
    To address the lack of large-scale real data for object movement, a modern game engine (e.g., Unreal Engine) is used to generate high-quality synthetic data pairs covering complex lighting, material, and occlusion scenes, improving the versatility of model training.
  4. Multi-task learning strategy
    The four subtasks of object movement, removal, insertion, and video data insertion are combined; the model is trained on synthetic data and real video data in a unified framework, improving its generalization to real scenes.

Methodological framework

  1. Model architecture
    • Main task (Move): given the input image, object image, and instruction map, a diffusion Transformer generates the target frame, fusing time-step, position, and task embeddings.
    • Subtasks (Remove / Insert): similar to the main task, with the input conditions adjusted to each specific editing goal.
    • Video data insertion: extends the setup to video sequences to ensure cross-frame consistency.
  2. Technical details
    • Gaussian noise is added and then removed stepwise by the diffusion model to generate high-fidelity images.
    • Multi-task learning adapts the model to the different editing tasks.
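The denoising and task-conditioning steps above can be sketched as follows. This is not the paper's implementation: the embedding scheme, noise schedule, and the placeholder network are all simplifying assumptions, shown only to make the multi-task diffusion loop concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative task vocabulary for the shared multi-task model.
TASK_IDS = {"move": 0, "remove": 1, "insert": 2, "video_insert": 3}

def task_embedding(task, dim=8):
    """Hypothetical one-hot task embedding fused into the model's conditioning."""
    emb = np.zeros(dim)
    emb[TASK_IDS[task]] = 1.0
    return emb

def predict_noise(x_t, t, cond, task_emb):
    """Placeholder for the diffusion Transformer: in the real model this would
    predict the noise in x_t given the time step, conditioning frames
    (image, object, instruction map), and the task embedding."""
    return 0.1 * x_t  # dummy prediction, just to make the loop runnable

def sample(shape, cond, task, steps=10):
    """Minimal denoising loop: start from Gaussian noise and iteratively
    subtract the predicted noise. The schedule is a simplified stand-in
    for a proper DDPM/DDIM schedule."""
    x = rng.standard_normal(shape)
    emb = task_embedding(task)
    for t in range(steps, 0, -1):
        eps = predict_noise(x, t, cond, emb)
        x = x - eps / steps
    return x

out = sample((64, 64, 3), cond=None, task="move")
print(out.shape)  # (64, 64, 3)
```

Switching the `task` argument (e.g., to `"remove"` or `"insert"`) would reuse the same network with a different embedding, which is the essence of the unified multi-task framework.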

Experiments and Results

  • Synthetic data validation: the model's ability to handle extreme lighting, materials, and occlusion is validated on the self-built game-engine dataset.
  • Real-scene generalization: with multi-task learning, the model is robust on real image editing, e.g., accurately completing occluded regions and synchronizing shadow effects.
  • Ablation studies: the necessity of the video prior, synthetic data, and multi-task learning is verified, demonstrating each component's contribution to performance.

Application Value

ObjectMover provides a breakthrough solution for image editing, with broad applications in film and television post-production, virtual reality, and advertising design, enabling efficient and realistic object repositioning. Its video-model-based transfer learning strategy also suggests new approaches to other image generation tasks (e.g., restoration, stylization).

Research Team and Open Source

  • Authors: Xin Yu (University of Hong Kong), Tianyu Wang (Adobe Research), and others.
  • Open source: the page does not explicitly state that the code is open source, but provides a link to the paper (to be added); code may be released via GitHub or another platform in the future.

Summary: ObjectMover addresses the compound challenge of object movement in images by combining video priors with sequence modeling, setting a new benchmark for generative image editing. Its innovations in data synthesis and multi-task learning offer valuable reference points for the computer vision field.


