Standard Intelligence: Training General Intelligence in Pixel Space

2026-04-30 · Source: Sequoia Capital · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Emerging Technologies & Innovation · Depth: Advanced, short

Summary

Standard Intelligence, a startup founded by Galen Mead and Devansh Pandey, is pursuing a contrarian approach to general computer agents: full video pre-training on raw computer use. Their thesis, published on April 30, 2026, holds that raw video, rather than text or screenshots, is the most promising source of truly scalable action data for agents. The company's model learns to predict mouse movements, clicks, and keystrokes directly from screen pixels, akin to Tesla FSD for knowledge work. Despite the computational expense of video, Standard Intelligence reports significant breakthroughs, including an 11-million-hour computer action dataset, a video encoder that is 50x more token-efficient, and a 30-petabyte storage cluster built for under $500K. Their first foundation model, FDM-1, demonstrates capabilities such as extruding CAD gears in Blender and finding software bugs. Sequoia Capital led the company's Series A funding.
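To put the storage figure in perspective, the two numbers quoted above imply a ceiling on cost per terabyte. The short calculation below is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1,000 TB) and treats the "under $500K" figure as an upper bound on total build cost, neither of which the source spells out.

```python
# Figures quoted in the summary above.
storage_pb = 30              # cluster capacity in petabytes
cost_ceiling_usd = 500_000   # "under $500K", treated here as an upper bound

# Assumption: decimal units, 1 PB = 1,000 TB.
storage_tb = storage_pb * 1_000

cost_per_tb = cost_ceiling_usd / storage_tb
print(f"under ${cost_per_tb:.2f} per TB")  # prints "under $16.67 per TB"
```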

Key takeaway

For AI engineers evaluating foundation-model paradigms, Standard Intelligence's video-first pre-training is a compelling alternative to language-centric approaches. Consider how learning from raw pixels could produce agents that generalize across complex computer tasks and handle nuanced interactions that text commands cannot express. In practice, this points toward investing in video data pipelines and efficient video encoders for future agent development.

Key insights

General computer agents can emerge from aggressively scaled raw video pre-training of computer use.

Principles

Scaling raw video data enables truly generalizable agent actions.
First-principles reasoning can overcome established domain challenges.
Aggressive data scaling fosters emergent generality in AI models.

Method

The model learns computer use by predicting the next mouse movements, clicks, and keystrokes directly from raw screen pixels.
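One way to picture this objective is as next-action prediction over screen recordings: each training example pairs a short window of frames with the input event that followed. The sketch below is a minimal illustration of that framing, not FDM-1's actual pipeline. The `Action` type, the `make_training_pairs` helper, the toy frame format, and the assumption of one recorded action per frame are all hypothetical; the source does not describe the model's action format or tokenization.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


# Hypothetical action record for illustration only.
@dataclass(frozen=True)
class Action:
    kind: str      # "move", "click", or "key"
    x: int = 0     # cursor position for "move" and "click"
    y: int = 0
    key: str = ""  # key name for "key" events


# Toy grayscale frame: rows of pixel intensities.
Frame = List[List[int]]


def make_training_pairs(
    frames: Sequence[Frame],
    actions: Sequence[Action],
    context: int,
) -> List[Tuple[List[Frame], Action]]:
    """Pair each window of `context` frames with the action that followed it.

    Assumes actions[t] is the action recorded right after frames[t].
    """
    if len(frames) != len(actions):
        raise ValueError("expected one action per frame")
    if context < 1:
        raise ValueError("context must be at least 1")
    pairs = []
    for t in range(context - 1, len(frames)):
        window = list(frames[t - context + 1 : t + 1])
        pairs.append((window, actions[t]))
    return pairs


# Four 1x1 toy frames and the action recorded after each one.
frames = [[[i]] for i in range(4)]
actions = [
    Action("move", 1, 1),
    Action("click", 1, 1),
    Action("key", key="a"),
    Action("move", 2, 2),
]
pairs = make_training_pairs(frames, actions, context=2)
print(len(pairs))  # 3 windows of two frames each
```

A model trained on such pairs would take the frame window as input and be scored on how well it predicts the paired action. A real system would replace the toy frames with encoded video and likely use much longer contexts.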

In practice

Extrude CAD gears in Blender using FDM-1.
Fine-tune FDM-1 for driving tasks in one hour.
Utilize FDM-1 to explore software state spaces for bug detection.

Topics

Video Pre-training
General Agents
FDM-1
Computer Use Automation
Data Scaling
AI Safety

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, AI Engineer, Investor

Editorial summary, takeaway, and curation by AIssential. Original article published by Sequoia Capital.