Google introduces ‘VideoPoet’, an AI model that makes videos from images, text & audio
VideoPoet integrates various tasks such as text-to-video, image-to-video, video inpainting and outpainting, video stylisation, and video-to-audio generation, all within a single LLM.


Highlights
- Google introduces ‘VideoPoet,’ a multimodal LLM that produces videos
- VideoPoet integrates multiple video generation capabilities into a unified language model
- Researchers believe VideoPoet holds promising potential for 'any-to-any' format in the future
Google has unveiled ‘VideoPoet,’ a cutting-edge large language model (LLM) that takes video generation to new heights. The multimodal model can accept text, images, video, and audio as input and generate videos from them.
Revolutionary decoder-only architecture
Google's scientists have developed VideoPoet with a 'decoder-only architecture,' allowing it to generate content for tasks it hasn't been explicitly trained on. This approach involves two key steps: pretraining and task-specific adaptation. Essentially, VideoPoet is a versatile framework customisable for various video generation tasks.
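To make that two-step recipe concrete, here is a minimal sketch in Python (PyTorch), not Google's code: a tiny decoder-only transformer is pretrained with next-token prediction on a broad token mixture, and the same weights are then adapted on task-formatted sequences. All sizes, names, and data below are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoderOnlyLM(nn.Module):
    """Toy decoder-only language model over a shared token vocabulary."""
    def __init__(self, vocab_size=1024, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), 1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                          # next-token logits

def next_token_loss(model, tokens):
    logits = model(tokens[:, :-1])                   # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

model = TinyDecoderOnlyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Phase 1: pretraining on a broad mixture of token sequences (placeholder data).
pretrain_batch = torch.randint(0, 1024, (8, 64))
opt.zero_grad(); next_token_loss(model, pretrain_batch).backward(); opt.step()

# Phase 2: task-specific adaptation reuses the same weights on task-formatted data,
# e.g. sequences laid out as "conditioning tokens followed by target video tokens".
task_batch = torch.randint(0, 1024, (8, 64))
opt.zero_grad(); next_token_loss(model, task_batch).backward(); opt.step()
```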
Unified approach
Unlike most existing video generators, which rely on diffusion models, VideoPoet integrates multiple video generation capabilities into a unified language model. A single LLM handles text-to-video, image-to-video, video inpainting and outpainting, video stylisation, and video-to-audio generation.
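As a rough illustration of what such a unified interface can look like, the sketch below (purely hypothetical, not VideoPoet's actual sequence format) frames each task as "conditioning tokens in, output tokens out" for one and the same model; the task markers and token ids are invented for the example.

```python
# Purely illustrative: several generation tasks share one token-sequence
# interface, so one model can serve them all.
TEXT_TO_VIDEO, IMAGE_TO_VIDEO, VIDEO_TO_AUDIO = "<t2v>", "<i2v>", "<v2a>"

def build_sequence(task_token, condition_tokens):
    """Prefix a task marker and the conditioning tokens; the model then
    autoregressively appends the output (video or audio) tokens."""
    return [task_token, "<bos>"] + condition_tokens

# Text-to-video: the condition is a tokenised text prompt.
seq_t2v = build_sequence(TEXT_TO_VIDEO, ["a", "raccoon", "dancing", "in", "the", "rain"])

# Image-to-video: the condition is the tokenised first frame (placeholder ids).
seq_i2v = build_sequence(IMAGE_TO_VIDEO, ["img_017", "img_512", "img_003"])

# Video-to-audio: the condition is tokenised video; the output would be audio tokens.
seq_v2a = build_sequence(VIDEO_TO_AUDIO, ["vid_101", "vid_245", "vid_009"])

for seq in (seq_t2v, seq_i2v, seq_v2a):
    print(seq)  # one model, one interface, different conditioning prefixes
```

Because every task reduces to the same next-token objective, adding a new capability is largely a matter of defining a new sequence layout rather than training a separate model.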
Key to VideoPoet's success
VideoPoet's success lies in its autoregressive model, which generates output by building on what it has already produced. Trained on video, audio, images, and text, VideoPoet relies on tokenisation, the same process that underpins natural language processing, to convert each of these modalities into discrete tokens the model can analyse and generate.
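The toy sketch below illustrates that idea under simplifying assumptions: text and video are turned into discrete tokens, and each new token is predicted from everything generated so far. The tokenisers and the prediction function are stand-ins, not VideoPoet's real components, which use learned tokenisers rather than the placeholders shown here.

```python
import random

def tokenise_text(text):
    # Placeholder text tokeniser: real systems use learned subword vocabularies.
    return [f"txt_{word}" for word in text.split()]

def tokenise_frames(n_frames):
    # Placeholder for a learned visual tokeniser that maps pixels to discrete ids.
    return [f"vid_{i:03d}" for i in range(n_frames)]

def predict_next_token(context):
    # Placeholder for the decoder-only LLM: given everything generated so far,
    # it would score the vocabulary and pick the next token. Here we fake it.
    return f"vid_{random.randrange(1000):03d}"

# Build a mixed-modality context, then extend it one token at a time.
context = tokenise_text("a cat surfing a wave") + tokenise_frames(2)
for _ in range(8):                         # each step conditions on all prior tokens
    context.append(predict_next_token(context))
print(context)
```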
Unlocking creative possibilities
Researchers believe VideoPoet holds promising potential for 'any-to-any' format in the future. Remarkably, it can even craft a short film by combining multiple video clips. While the model is not currently suited to longer videos, Google suggests overcoming this limitation by conditioning on the last second of a generated clip to predict the next second.
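The loop below sketches that chaining idea under assumed parameters: an initial one-second clip is repeatedly extended by conditioning on its final second. The frame rate and the generate_next_second stand-in are hypothetical, not part of Google's implementation.

```python
FPS = 8  # assumed frames per second for this illustration

def generate_next_second(conditioning_frames):
    # Placeholder: a real model would return FPS new frames continuing the clip.
    start = int(conditioning_frames[-1].split("_")[1]) + 1
    return [f"frame_{i:04d}" for i in range(start, start + FPS)]

clip = [f"frame_{i:04d}" for i in range(FPS)]   # initial one-second clip
for _ in range(4):                              # extend by four more seconds
    last_second = clip[-FPS:]                   # condition on the final second only
    clip.extend(generate_next_second(last_second))

print(len(clip) // FPS, "seconds of video,", len(clip), "frames")
```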
Innovative applications
VideoPoet's capabilities extend to altering the movement of objects in existing videos, as exemplified by a quirky scenario where the Mona Lisa yawns. This demonstrates the model's creative prowess in reshaping visual content.
Google's VideoPoet is not just a leap in video generation technology; it's a glimpse into the future of multimedia content creation.