FIRST Open Source SORA 2 Is HERE!

2025-10-04 · Source: Aitrepreneur · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Novice, long

Summary

OVI is the first open-source Sora 2-like model: an 11-billion-parameter text-to-video and image-to-video generation model developed by Character AI and Yale University. Built on the Wan 2.2 5-billion-parameter base model, OVI generates 720p video at 24 frames per second with synchronized audio, from either a text prompt or an initial image. The current release is unoptimized, so running it locally requires a high-end GPU such as an NVIDIA RTX 5090; the alternative is a cloud instance with at least 32GB of VRAM (e.g., RunPod's RTX Pro 6000). Future optimized versions are expected to run on GPUs with less VRAM. The model supports multi-language audio generation and offers a user-friendly web UI, though it is limited to 5-second clips and occasionally has trouble keeping characters consistent.

Key takeaway

For AI engineers evaluating open-source multimodal generation, OVI is a significant early release for text-to-video with audio. Its 5-second clip limit and high VRAM requirement (an RTX 5090 or a cloud GPU) are real constraints today, but at 11 billion parameters the model is modest enough that optimized versions are expected to run on GPUs with less VRAM. Explore its capabilities on a cloud platform like RunPod to judge whether synchronized audio and video generation fits your applications.

Key insights

OVI, an 11B-parameter model, is presented as the first open-source model to generate video with synchronized audio from text or images.

Principles

Smaller base models can enable new multimodal capabilities.
Early-stage models often require significant computational resources.

Method

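OVI generates video and audio from a text prompt or a starting image. A prompt describes the video content, wraps each line of spoken dialogue in `<S>` and `<E>` tags, and puts the audio description, including voice characteristics, between `<AUDCAP>` and `<ENDAUDCAP>` tags. In the web UI, users select a solver (Euler is recommended) and can add negative prompts.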
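The prompt structure described above can be sketched as a small string-building helper. This is only an illustration: `speech` and `build_prompt` are hypothetical names, not part of OVI, and the voice-description tag pair is passed in as an argument rather than hard-coded, so copy its exact spelling from the model's own documentation.

```python
from typing import Optional, Sequence, Tuple


def speech(line: str) -> str:
    """Wrap one line of spoken dialogue in the <S> ... <E> tags."""
    return f"<S>{line.strip()}<E>"


def build_prompt(
    scene: str,
    dialogue: Sequence[str],
    voice_note: str = "",
    voice_tags: Optional[Tuple[str, str]] = None,
) -> str:
    """Assemble a prompt: scene description, tagged dialogue, optional voice note.

    voice_tags is the (opening, closing) tag pair for the voice description.
    It is an input here on purpose; this sketch does not assume its spelling.
    """
    parts = [scene.strip()]
    parts.extend(speech(line) for line in dialogue)
    if voice_note and voice_tags:
        open_tag, close_tag = voice_tags
        parts.append(f"{open_tag}{voice_note.strip()}{close_tag}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "A chef in a busy kitchen looks at the camera and speaks.",
        ["Dinner is ready."],
    ))
```

Real prompts may interleave dialogue with the scene description instead of appending it; in that case `speech` can be called inline while writing the description.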

In practice

Use RunPod with 32GB+ VRAM for OVI if you lack an RTX 5090.
Structure OVI prompts with `<S>` and `<E>` tags around dialogue, and `<AUDCAP>` and `<ENDAUDCAP>` tags around the audio and voice description.
Experiment with different seeds to improve video consistency or pronunciation.
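
As a rough sanity check on the VRAM guidance above, the memory needed just to hold the model's weights can be estimated from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter), which the source does not state. It also ignores activations, the text encoder, and other runtime overhead, so real usage will be higher. `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB.

    bytes_per_param=2 assumes 16-bit weights (an assumption, not a
    documented property of OVI); use 1 for an 8-bit quantized variant.
    """
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    params = 11e9  # 11 billion parameters, as stated in the summary
    print(f"16-bit weights: {weight_memory_gib(params):.1f} GiB")
    print(f" 8-bit weights: {weight_memory_gib(params, 1):.1f} GiB")
```

Under these assumptions the weights alone take roughly 20 GiB, which is consistent with a 32GB-class GPU being the practical minimum for the unoptimized release.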

Topics

OVI Model
Open-Source Video Generation
Text-to-Video
Image-to-Video
GPU Requirements

Best for: AI Engineer, Machine Learning Engineer, Prompt Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Aitrepreneur.