FIRST Open Source SORA 2 Is HERE!
Summary
OVI is presented as the first open-source Sora 2-like model: an 11 billion parameter text-to-video and image-to-video generation model developed by Character AI and Yale University. Built on the Wan 2.2 5B video model, OVI generates 720p video at 24 frames per second with synchronized audio, from either a text prompt or an initial image. The current release is unoptimized. Running it locally requires a high-end GPU such as an NVIDIA RTX 5090, and running it in the cloud requires an instance with at least 32GB of VRAM (e.g., RunPod's RTX Pro 6000). Future optimized versions are expected to run on GPUs with less VRAM. The model supports multi-language audio generation and offers a user-friendly web UI, though it is limited to 5-second video clips and may exhibit occasional character consistency issues.
Key takeaway
For AI engineers evaluating open-source multimodal generation, OVI is a significant early release for text-to-video with audio. Its current 5-second limit and high VRAM requirement (RTX 5090 or cloud GPU) are constraints, but at 11 billion parameters it should become more broadly accessible once optimized versions are released. Explore its capabilities on a cloud platform like RunPod to understand its potential for integrating synchronized audio and video generation into your applications.
Key insights
OVI is the first open-source 11B parameter model generating video with audio from text or images.
Principles
- A multimodal model can be built on top of a smaller base model to add new capabilities such as synchronized audio.
- Early-stage models often require significant computational resources.
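The resource point above can be made concrete with a rough estimate of the memory needed just to hold 11 billion parameters. This is a back-of-the-envelope sketch, not a measurement: the numeric precisions are assumptions (the source does not say which one OVI uses), and activations, text encoders, and the VAE add further memory on top of the weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to store the weights alone, in decimal GB."""
    return num_params * bytes_per_param / 1e9


# 11 billion parameters, as stated in the summary.
OVI_PARAMS = 11e9

# Assumed precisions; which one OVI actually uses is not stated in the source.
for name, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("8-bit", 1)]:
    print(f"{name}: {weight_memory_gb(OVI_PARAMS, nbytes):.0f} GB")
```

Under these assumptions, 16-bit weights alone take about 22 GB, which is consistent with the 32GB VRAM guidance and with the expectation that lower-precision, optimized versions could fit on smaller GPUs.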
Method
OVI generates video and audio from a text prompt or a starting image. The prompt describes the video content, wraps spoken dialogue between `<S>` and `<E>` tags, and describes the audio, such as voice characteristics, between `<AUDCAP>` and `<ENDAUDCAP>` tags. Users select a solver (Euler is recommended) and can add negative prompts.
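The prompt structure described above can be sketched as a small string builder. This is an illustration only: `build_prompt` is a hypothetical helper, not part of OVI. The audio-caption tag pair defaults to `<AUDCAP>` and `<ENDAUDCAP>` on the assumption that this matches the project's README; both tag pairs are parameters so they can be changed if the README differs.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(
    scene: str,
    dialogue: Sequence[str] = (),
    audio_caption: Optional[str] = None,
    speech_tags: Tuple[str, str] = ("<S>", "<E>"),
    caption_tags: Tuple[str, str] = ("<AUDCAP>", "<ENDAUDCAP>"),  # assumed tag names
) -> str:
    """Assemble a prompt: scene description, tagged dialogue, tagged audio caption."""
    parts = [scene.strip()]
    # Wrap each spoken line in the speech tags.
    for line in dialogue:
        parts.append(f"{speech_tags[0]}{line.strip()}{speech_tags[1]}")
    # Append the audio description (e.g. voice characteristics) if given.
    if audio_caption:
        parts.append(f"{caption_tags[0]}{audio_caption.strip()}{caption_tags[1]}")
    return " ".join(parts)


prompt = build_prompt(
    "A woman stands in a sunlit kitchen and looks at the camera.",
    dialogue=["Dinner is ready."],
    audio_caption="Warm female voice, calm tone.",
)
print(prompt)
```

Real prompts may interleave dialogue with the scene description rather than appending it. The helper keeps the two separate only to make the tag structure easy to see.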
In practice
- Use RunPod with 32GB+ VRAM for OVI if you lack an RTX 5090.
- Structure OVI prompts with `<S>` and `<E>` tags around dialogue and `<AUDCAP>` and `<ENDAUDCAP>` tags around the audio description, such as voice characteristics.
- Experiment with different seeds to improve video consistency or pronunciation.
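The seed experiment in the last tip can be sketched as a simple sweep. The source does not describe a programmatic API for OVI, so `generate` is a hypothetical callable standing in for whatever actually produces a clip (a web UI run, a script, or a cloud job). The stub `fake_generate` exists only so the example runs.

```python
from typing import Callable, Dict, Iterable


def sweep_seeds(
    generate: Callable[[str, int], str],
    prompt: str,
    seeds: Iterable[int],
) -> Dict[int, str]:
    """Run the same prompt once per seed and return the outputs keyed by seed."""
    return {seed: generate(prompt, seed) for seed in seeds}


def fake_generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a real generation call; returns an output path."""
    return f"clip_seed{seed}.mp4"


results = sweep_seeds(fake_generate, "A man waves at the camera.", [1, 2, 3])
for seed, path in results.items():
    print(seed, path)
```

Keeping the prompt fixed and varying only the seed makes it easier to tell whether a consistency or pronunciation problem comes from the prompt or from sampling.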
Topics
- OVI Model
- Open-Source Video Generation
- Text-to-Video
- Image-to-Video
- GPU Requirements
Best for: AI Engineer, Machine Learning Engineer, Prompt Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Aitrepreneur.