CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion

arXiv

Xingrui Wang, Xin Li, Zhibo Chen

University of Science and Technology of China

Campfire at night in a snowy forest with starry sky in the background.

Slow motion steam rises from a hot cup of coffee.

Slow motion sparks fly from a grinding wheel, creating a shower of light.

Bruce Lee shouts like a lion, wild fighter.

An elderly lady with a warm smile, white hair, and deep wrinkles, with the style of Salvador Dali.

Slow motion lightning illuminates the dark sky, followed by the rumble of thunder.

Two white swans gracefully swam in the serene lake.

A corgi is swimming quickly.

We propose a new tuning-free long video diffusion method built on consistency noise injection, designed to enhance the fine-grained long-term consistency of generated long videos.

Abstract

Tuning-free long video diffusion has been proposed to generate extended-duration videos with enriched content by reusing the knowledge of a pre-trained short video diffusion model without retraining. However, most works overlook fine-grained long-term video consistency modeling, resulting in limited scene consistency (i.e., unreasonable object or background transitions), especially with multiple text inputs. To mitigate this, we propose Consistency Noise Injection, dubbed CoNo, which introduces a “look-back” mechanism to enhance fine-grained scene transitions between different video clips, and designs a long-term consistency regularization to eliminate content shifts when extending video content through noise prediction. In particular, the “look-back” mechanism breaks the noise scheduling process into three essential parts, where one internal noise prediction part is injected between two video-extending parts to achieve a fine-grained transition between two video clips. The long-term consistency regularization explicitly minimizes the pixel-wise distance between the predicted noises of the extended video clip and the original one, thereby preventing abrupt scene transitions. Extensive experiments show the effectiveness of the above strategies on long video generation under both single- and multi-text-prompt conditions.

Approach


Illustration of the CoNo framework. We propose a “look-back” mechanism that inserts an internal noise prediction stage between two video extending stages to enhance scene consistency. To achieve this, we design the extending and internal initial noise shuffles and constrain the denoising trajectory using selected predicted noise (denoted as [s] in the figure). Additionally, we apply long-term consistency regularization between adjacent video clips to avoid abrupt content shifts. We obtain the final video by concatenating the frames marked with yellow boxes from different stages.

The “Look-Back” Mechanism


The “look-back” mechanism divides the video extension process into three crucial stages, where one internal noise prediction stage is inserted between two video-extending stages to ensure stable content transitions through the inherent constraint of two-sided content at each reverse step (i.e., the predicted noises from the existing frames on the left and the extending frames on the right). Since maintaining the overall initial noise group across different video clips is crucial to guarantee the same content/scene, we also propose customized noise shuffle strategies for each of the three stages. Concretely, we design a revised extending noise shuffle for the video-extending stages, which recovers the noise order of the guided frames after reversing the whole initial noise sequence, thereby avoiding reverse-order repetitive content generation. For internal noise prediction, we directly insert the initial noises at the end of the sequence into the middle position, resulting in an internal noise shuffle that preserves the same initial noise group.
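
As a concrete illustration, the following is a minimal PyTorch sketch of one plausible reading of these two shuffles. The function names, the (frames, channels, height, width) tensor layout, the choice of guided frames, and the exact insertion point are our assumptions for illustration, not the released implementation.

import torch

def extending_noise_shuffle(init_noise, num_guided):
    # Reverse the whole initial-noise sequence, then restore the original
    # (forward) order of the guided-frame noises, which now sit at the front.
    shuffled = torch.flip(init_noise, dims=[0])
    shuffled[:num_guided] = init_noise[-num_guided:]
    return shuffled

def internal_noise_shuffle(init_noise, num_tail):
    # Take the initial noises at the end of the sequence and insert them into
    # the middle position; the overall initial-noise group stays unchanged.
    body, tail = init_noise[:-num_tail], init_noise[-num_tail:]
    mid = body.shape[0] // 2
    return torch.cat([body[:mid], tail, body[mid:]], dim=0)

# Toy usage: 16 frames of 4x40x64 latent noise, 8 guided frames.
noise = torch.randn(16, 4, 40, 64)
extending_order = extending_noise_shuffle(noise, num_guided=8)
internal_order = internal_noise_shuffle(noise, num_tail=8)

Both shuffles only permute the same set of initial noises, which is what keeps the overall initial noise group of a clip intact.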

The Long-term Consistency Regularization


We propose an explicit long-term consistency regularization that minimizes the pixel-wise distance between the predicted noises of the extended video clip and those of the originally generated video clip.
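
A minimal sketch of this regularization, assuming the pixel-wise distance takes an L2 (mean squared error) form and that the two noise predictions are PyTorch tensors of shape (frames, channels, height, width); the function name and shapes are placeholders rather than the released code.

import torch
import torch.nn.functional as F

def long_term_consistency_loss(eps_extended, eps_original):
    # Pixel-wise distance between the noise predicted for the extended clip
    # and the noise predicted for the originally generated clip.
    return F.mse_loss(eps_extended, eps_original)

# Toy usage with random stand-ins for the two noise predictions.
eps_ext = torch.randn(16, 4, 40, 64)
eps_org = torch.randn(16, 4, 40, 64)
loss = long_term_consistency_loss(eps_ext, eps_org)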

Results: Comparison with Other Models


🌟 Single-prompt Longer Video Generation 🌟


The qualitative comparisons given a single text prompt.


A happy pig rolling in the mud on a sunny day.

GenL

FreeNoise

CoNo

Slow motion steam rises from a hot cup of coffee.

GenL

FreeNoise

CoNo


🌟 Multi-prompt Longer Video Generation 🌟


The qualitative comparisons given multiple text prompts.


1. A man runs on a beautiful tropical beach at sunset of 4k high resolution.

2. A man rides a bicycle on a beautiful tropical beach at sunset of 4k high resolution.

3. A man walks on a beautiful tropical beach at sunset of 4k high resolution.

4. A man reads a book on a beautiful tropical beach at sunset of 4k high resolution.

VidRD

GenL

FreeNoise

MTVG

CoNo

Quantitative Results


Ablation Study

Ablation Study for Long-term Consistency Regularization


Ablation for Long-term Consistency Regularization. “w/o” indicates the CoNo pipeline without the regularization.

Ablation Study for Internal Noise Prediction


Ablation for Internal Noise Prediction. Transition frames are marked with red boxes and the details of transitions are highlighted with yellow boxes.

More Qualitative Results of CoNo


🌟 VideoCrafter1 🌟


The base model is VideoCrafter1.


Boats on the calm, blue ocean.

Mickey Mouse is dancing on white background.

A cat watching the starry night by Vincent Van Gogh, Highly Detailed, 2K with the style of emoji.

Angelina Jolie's piercing gaze conveys power, her lips set in a determined line.

A cute and chubby giant panda is enjoying a bamboo meal in a lush forest. The panda is relaxed and content as it eats, and occasionally stops to scratch its ear with its paw.

Oprah Winfrey's warm smile, her eyes full of empathy and understanding.

A squirrel.

A musician effortlessly plays a complex melody on a keyboard, fingers dancing across the keys with precision.

1. Cherry blossoms bloom around the Japanese-style castle.
2. Leaves fall around the Japanese-style castle.
3. Snow falls around the Japanese-style castle.
4. Snow builds up in trees around the Japanese-style castle.

1. A white butterfly sits on a purple flower.
2. The color of the purple flower where the white butterfly sits turns red.
3. A white butterfly is sitting on a red flower.

1. An astronaut in a white uniform is snowboarding in the snowy hill.
2. An astronaut in a white uniform is surfing in the sea.
3. An astronaut in a white uniform is surfing in the desert.

1. The whole beautiful night view of the city is shown.
2. Heavy rain floods the city with beautiful night scenery.
3. The day dawns over the flooded city.

1. In spring, a white butterfly sits on a flower.
2. In summer, a white butterfly sits on a flower.
3. In autumn, a white butterfly sits on a flower.
4. In winter, a white butterfly sits on a flower.

1. There is a beach where there is no one.
2. The waves hit the deserted beach.
3. There is a beach that has been swept away by waves.

1. A golden retriever has a picnic on a beautiful tropical beach at sunset.
2. A golden retriever is running towards a beautiful tropical beach at sunset.
3. A golden retriever sits next to a bonfire on a beautiful tropical beach at sunset.
4. A golden retriever is looking at the starry sky on a beautiful tropical beach.

1. A Red Riding Hood girl walks in the woods.
2. A Red Riding Hood girl sells matches in the forest.
3. A Red Riding Hood girl falls asleep in the forest.
4. A Red Riding Hood girl walks towards the lake from the forest.

1. The volcano erupts in the clear weather.
2. Smoke comes from the crater of the volcano, which has ended its eruption in the clear weather.
3. The weather around the volcano turns cloudy.

1. There is a Mickey Mouse dancing through the spring forest.
2. There is a Mickey Mouse walking through the autumn forest.
3. There is a Mickey Mouse running through the winter forest.

1. A teddy bear walks on the streets of Times Square.
2. The teddy bear enters restaurants.
3. The teddy bear eats pizza.
4. The teddy bear drinks water.

1. The cartoon-style bear appears in a comic book.
2. The cartoon-style bear in the comic book jumps out into the real world.
3. The bear in the real world dances.
4. The bear in the real world sits.


🌟 LaVie 🌟


The base model is LaVie.


A bunch of autumn leaves falling on a calm lake, smooth.

A corgi is swimming quickly.

1. A white butterfly sits on a purple flower.
2. The color of the purple flower where the white butterfly sits turns red.
3. A white butterfly is sitting on a red flower.

1. A waterfall flows in the mountains under a clear sky.
2. A waterfall flows in the fall mountains under a clear sky.
3. A waterfall flows in the winter mountains under a clear sky.
4. A waterfall frozen on a mountain during a snowstorm.

1. There is a beach where there is no one.
2. The waves hit the deserted beach.
3. There is a beach that has been swept away by waves.

1. The volcano erupts in the clear weather.
2. Smoke comes from the crater of the volcano, which has ended its eruption in the clear weather.
3. The weather around the volcano turns cloudy.

BibTeX

@article{wang2024cono,
  title={CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion},
  author={Wang, Xingrui and Li, Xin and Chen, Zhibo},
  year={2024}
}