Bringing Objects to Life: training-free 4D generation from 3D objects through view consistent noise


¹Bar-Ilan University   ²NVIDIA

[Teaser] Two examples, each showing the input 3D mesh on the left and the resulting 4D animation on the right: an ice cream with the prompt "the ice cream is melting", and a candle with the prompt "a spell is cast through the purple flame".

Our method, 3D24D, takes a passive 3D object and a textual prompt describing a desired action. It then adds dynamics to the object based on the prompt to create a 4D animation, essentially a video viewable from any perspective. On the right, we display four 3D frames from the generated 4D animation.

Abstract


Recent advancements in generative models have enabled the creation of dynamic 4D content — 3D objects in motion — based on text prompts, which holds potential for applications in virtual worlds, media, and gaming. Existing methods provide control over the appearance of generated content, including the ability to animate 3D objects. However, their ability to generate dynamics is limited to the mesh datasets they were trained on, lacking any capability for growth or structural development. In this work, we introduce a training-free method for animating 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom general scenes while maintaining the original object's identity. We first convert a 3D mesh into a static 4D Neural Radiance Field (NeRF) that preserves the object’s visual attributes. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce a view-consistent noising protocol that aligns object perspectives with the noising process to promote lifelike movement, and a masked Score Distillation Sampling (SDS) loss that leverages attention maps to focus optimization on relevant regions, better preserving the original object. We evaluate our model on two different 3D object datasets for temporal coherence, prompt adherence, and visual fidelity, and find that our method outperforms the baseline based on multi-view training, achieving better consistency with the textual prompt in hard scenarios.


Overview


[Figure] Left: the static Mario mesh. Right: 4D results for the prompts "Mario running", "Mario waving", "Mario jumping", and "Mario walking".

Left: the input object. Right: the resulting 4D object, viewed from azimuth -60° → 60°, progressing over time.



Instead of generating a 4D dynamic object using text control only, one may want to animate an existing 3D object, like your favorite 3D toy or character. Conditioning 4D generation on 3D assets offers several advantages: it enhances control, leverages existing 3D resources efficiently, and accelerates 4D generation by using 3D as a strong initialization.

The latest advances in conditioned generation are 3D-to-4D methods that train multi-view image-to-video diffusion models. These works capture a 3D object from multiple viewpoints and generate temporally consistent videos for each view, ensuring coherence across perspectives. To achieve this consistency, they rely on large-scale datasets of multi-view videos derived from existing 4D objects. However, these 4D objects are represented as meshes, which are inherently constrained by their fixed number of vertices and faces. As a result, approaches trained on such datasets tend to be limited in handling evolution, volume change, or growth deformations.

In this work, we introduce 3D24D, a training-free method for animating 3D objects by conditioning 4D generation on textual prompts, enabling custom general scenes while maintaining the original object's identity. We take a simple approach that uses textual descriptions to govern the animation of the 3D object. First, we train a "static" 4D NeRF from the input 3D mesh, capturing the object's appearance from multiple views and replicating it across time. Then, our method modifies the 4D object using an image-to-video diffusion model, conditioning its first frame on renderings of the input object.
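To make the first stage concrete, here is a minimal PyTorch sketch of the static-initialization idea: a toy 4D field (a stand-in for the actual 4D NeRF) is trained so that every time step reproduces the same time-independent target rendered from the input mesh. The Tiny4DField class, the sampled points, and the target colors below are illustrative placeholders under our own assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class Tiny4DField(nn.Module):
        """Toy stand-in for a 4D NeRF: maps (x, y, z, t) samples to RGB."""
        def __init__(self, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(4, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3), nn.Sigmoid(),
            )

        def forward(self, xyzt):
            return self.net(xyzt)

    def static_init_step(field, optimizer, target_colors, points, t):
        """One static-initialization step: the field at an arbitrary time t is
        pushed toward the same time-independent colors rendered from the mesh,
        so the object is effectively replicated across time."""
        xyzt = torch.cat([points, t.expand(points.shape[0], 1)], dim=1)
        loss = nn.functional.mse_loss(field(xyzt), target_colors)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    field = Tiny4DField()
    opt = torch.optim.Adam(field.parameters(), lr=1e-3)
    points = torch.rand(1024, 3) * 2 - 1     # placeholder ray samples
    target_colors = torch.rand(1024, 3)      # placeholder colors rendered from the input mesh
    for _ in range(200):
        t = torch.rand(1)                    # every time step shares the same target
        static_init_step(field, opt, target_colors, points, t)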

Unfortunately, we find that applying this approach naively is insufficient, because it dramatically reduces the amount of dynamic motion. We propose two key improvements that both enhance the generation of dynamic movement and better preserve the input object. First, we design a new view-consistent noising strategy for 4D generation, which constructs a noise pattern tied to the rendered viewpoint during optimization. This association between the viewpoint and the noising process improves generation, resulting in more pronounced motion in the animated 4D output. Second, we introduce a masked variant of the Score Distillation Sampling (SDS) loss that uses attention maps obtained from the image-to-video model. The masked SDS focuses optimization on the object across temporally relevant regions of the latent space, enhancing the fidelity of object-related elements and better preserving its identity.
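The view-consistent noising idea can be illustrated with a short sketch: fixed Gaussian noise is attached to points on a sphere around the object and "rendered" with the same camera used for the object, so nearby viewpoints see overlapping noise patterns. The orthographic projection, rotation about the vertical axis, splatting, and renormalization below are simplified assumptions of ours, not the paper's exact construction.

    import math
    import torch

    def make_noise_sphere(n_points=200_000, channels=4, seed=0):
        """Fixed Gaussian noise attached to points on a unit sphere."""
        g = torch.Generator().manual_seed(seed)
        pts = torch.randn(n_points, 3, generator=g)
        pts = pts / pts.norm(dim=1, keepdim=True)
        vals = torch.randn(n_points, channels, generator=g)
        return pts, vals

    def render_view_consistent_noise(pts, vals, azimuth, res=64):
        """Project the camera-facing half of the noise sphere (orthographic,
        rotation about the y-axis), splat the attached noise into a latent-sized
        map, and renormalize each cell so it stays approximately unit Gaussian."""
        c, s = math.cos(azimuth), math.sin(azimuth)
        rot = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        p = pts @ rot.T
        keep = p[:, 2] > 0                                 # points facing the camera
        p, v = p[keep], vals[keep]
        ij = ((p[:, :2] * 0.5 + 0.5) * (res - 1)).long()   # pixel coordinates
        flat = ij[:, 1] * res + ij[:, 0]
        noise = torch.zeros(res * res, v.shape[1])
        count = torch.zeros(res * res, 1)
        noise.index_add_(0, flat, v)
        count.index_add_(0, flat, torch.ones_like(v[:, :1]))
        # The sum of k unit Gaussians has std sqrt(k); divide to renormalize.
        noise = noise / count.clamp(min=1).sqrt()
        empty = count.squeeze(1) == 0                      # cells no point landed in
        noise[empty] = torch.randn(int(empty.sum()), v.shape[1])
        return noise.view(res, res, -1).permute(2, 0, 1)   # (C, H, W)

    pts, vals = make_noise_sphere()
    eps_a = render_view_consistent_noise(pts, vals, azimuth=0.0)
    eps_b = render_view_consistent_noise(pts, vals, azimuth=0.1)   # nearby view, similar noise

Because eps_a and eps_b are rendered from the same sphere, the noise driving the diffusion model changes smoothly with the camera, which is the property the view-consistent protocol relies on.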

Pipeline


Workflow of our 3D24D approach, designed to optimize a 4D radiance field using a neural representation that captures both static and dynamic elements. First, a 4D NeRF is trained to represent the static object (plant, left), with the same 3D structure at each time step. Then, we introduce dynamics to the 4D NeRF by distilling the prior of a pre-trained image-to-video model. At each SDS step, we select a viewpoint and render the input object, the noise sphere, and the 4D NeRF from that same viewpoint. These renders, along with the textual prompt, are fed into the image-to-video model, and the SDS loss is computed to guide the generation of motion while preserving the object's identity. Rendering the noise from the sphere with the same viewpoint as the static object provides better consistency at each step.
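The SDS step described above can be sketched as follows. This is a schematic under standard SDS conventions, not the paper's exact implementation: `denoise_fn` is a stand-in for the image-to-video noise predictor (already conditioned on the first-frame render and the prompt), `attn_mask` stands for the attention-derived object mask, and the schedule and weighting are placeholders.

    import torch

    def masked_sds_step(rendered_video, noise, timestep, alphas_cumprod,
                        denoise_fn, attn_mask):
        """One masked-SDS update (schematic).
        rendered_video : (F, C, H, W) latent frames rendered from the 4D NeRF
        noise          : (F, C, H, W) view-consistent noise for the same viewpoint
        denoise_fn     : stand-in for the image-to-video noise predictor
        attn_mask      : (F, 1, H, W) attention-derived mask over object regions
        """
        a = alphas_cumprod[timestep]
        noisy = a.sqrt() * rendered_video + (1 - a).sqrt() * noise
        with torch.no_grad():
            eps_pred = denoise_fn(noisy, timestep)
        # Standard SDS gradient (eps_pred - noise), restricted by the attention
        # mask so the update focuses on object-relevant latent regions.
        grad = (1 - a) * attn_mask * (eps_pred - noise)
        # Surrogate loss whose gradient w.r.t. the rendered frames equals `grad`.
        return (grad.detach() * rendered_video).sum()

    # Toy usage with random stand-ins for the renderer and the diffusion model.
    F_, C, H, W = 8, 4, 64, 64
    frames = torch.rand(F_, C, H, W, requires_grad=True)   # from the 4D NeRF renderer
    noise = torch.randn(F_, C, H, W)                       # from the noise sphere, same viewpoint
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)     # placeholder diffusion schedule
    denoise_fn = lambda x, t: torch.randn_like(x)          # placeholder for the video model
    attn_mask = torch.ones(F_, 1, H, W)                    # placeholder attention mask
    loss = masked_sds_step(frames, noise, 500, alphas_cumprod, denoise_fn, attn_mask)
    loss.backward()    # in the real pipeline this gradient flows back into the 4D NeRF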

Results


For a collection of 3D objects, we used the Google Scanned Objects (GSO) dataset, which contains high-quality 3D scans of everyday items.


[Gallery] Each example shows the input passive 3D object next to its generated 4D result; the input prompts are:
Hulk: "the hulk is transforming"
Plant: "A blooming plant slowly grows with colorful branches expanding outward"
Candle: "a spell is cast through the flame, its fire flickering and growing"
Elephant: "an elephant grows its ears into long, powerful wings, stretching wide with a graceful flap"
Unicorn: "a unicorn grows a colorful rainbow tail"
Snowman: "a snowman is melting, water trickling down its sides into a pool"
Turtle: "A turtle has its head inside its shell"
Honey dipper: "Honey spills from a dipper"
Apple: "an apple with a bite taken out of it"
Broccoli: "broccoli is growing and blooming, its green stalks stretching upward"
Candle: "a candle is melting downward, wax dripping steadily"
Candle: "a candle is melting, wax pooling at its base"
Raccoon: "a raccoon breathing fire from a weapon"
Ice cream: "a chocolate is poured over the ice cream and drips from its side"

Examples are arranged in two columns, each with the following structure: on the left, we display the input passive 3D object; on the right, we present a video of the generated 4D object, viewed from azimuth -60° → 60°, progressing over time. The title of each example is the input prompt.

Citation



    @article{rahamim2024bringingobjectslife4d,
        title={Bringing Objects to Life: 4D generation from 3D objects},
        author={Ohad Rahamim and Ori Malca and Dvir Samuel and Gal Chechik},
        year={2024},
        eprint={2412.20422},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.20422},
    }