TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions
Summary
TinyFish has open-sourced BigSet, a multi-agent system that builds structured, live datasets from a single plain-English description. For instance, a user can request "YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles." The system works in four stages. First, Schema Inference, handled by Claude Sonnet, determines column names and data types. Second, an Orchestrator Agent, powered by Qwen, conducts broad entity discovery using TinyFish Search. Third, a Sub-Agent Fan-Out runs isolated agents in parallel; each is limited to six tool calls, and dataset writes go through a JS closure as a guard against prompt injection. Finally, rows are deduplicated by primary key, tagged with source attribution, and exported as CSV or XLSX. BigSet can also refresh datasets automatically at intervals such as every 30 minutes or daily. Unlike narrative-focused research tools, it produces tabular data that can be queried directly.
Key takeaway
For data engineers and MLOps engineers who want to automate data pipeline creation, BigSet generates structured, live datasets from simple text prompts. This can reduce manual data collection and transformation work, and scheduled refreshes keep analytical inputs current without continuous intervention. Consider BigSet for rapidly prototyping data sources or maintaining dynamic intelligence feeds, where its multi-agent architecture handles parallel collection and guards dataset writes.
Key insights
BigSet automates structured dataset creation and live updates from natural language using a multi-agent architecture.
Principles
- Isolate sub-agents and cap each one's tool calls (six in BigSet).
- Route dataset writes through a closure to guard against prompt injection.
- Deduplicate data using primary keys.
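The three principles above can be sketched together in TypeScript. This is a minimal illustration, not BigSet's actual code: `Dataset`, `makeInsertTool`, and `MAX_TOOL_CALLS` are hypothetical names, and for brevity the budget here counts only write calls, whereas the summary describes a limit on tool calls in general.

```typescript
// Hypothetical types for illustration; BigSet's real internals may differ.
type Row = Record<string, string | number>;

interface Dataset {
  primaryKey: string;
  rows: Map<string, Row>;
}

const MAX_TOOL_CALLS = 6; // per-sub-agent limit stated in the summary

// Returns a write tool bound to one dataset and one entity key.
// The agent-facing function accepts only column values, so text the agent
// reads on a web page cannot redirect the write to another dataset or row.
function makeInsertTool(dataset: Dataset, entityKey: string) {
  let callsUsed = 0; // private to this closure

  return function insertRow(values: Row): boolean {
    if (callsUsed >= MAX_TOOL_CALLS) {
      throw new Error("tool-call budget exhausted");
    }
    callsUsed += 1;

    if (dataset.rows.has(entityKey)) {
      return false; // duplicate primary key: keep the existing row
    }
    // Spread order makes the bound key override any key the agent supplies.
    dataset.rows.set(entityKey, { ...values, [dataset.primaryKey]: entityKey });
    return true;
  };
}

const ds: Dataset = { primaryKey: "company", rows: new Map() };
const insert = makeInsertTool(ds, "Acme");

// The agent tries to write under a different key; the closure ignores it.
const first = insert({ company: "Evil Corp", stage: "Seed" });
// A second write for the same entity is rejected as a duplicate.
const second = insert({ stage: "Series A" });
```

Because `dataset` and `entityKey` are captured by the closure rather than passed as arguments, the sub-agent has no parameter through which injected instructions could change the write target.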
Method
A Schema Inference step defines the columns, an Orchestrator agent discovers entities, and parallel sub-agents fetch and insert data. The rows are then deduplicated and exported.
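The flow described above can be outlined as a small TypeScript pipeline. This is a sketch under stated assumptions: `inferSchema`, `discoverEntities`, and `enrichEntity` are hypothetical stubs that return fixed data in place of the model and search calls, and the URLs are placeholders.

```typescript
// All names below are hypothetical; the stubs stand in for model and search calls.
interface Schema {
  columns: string[];
  primaryKey: string;
}

interface SourcedRow {
  values: Record<string, string | number>;
  source: string; // source attribution kept alongside each row
}

// Stage 1 stub: a real system would ask a model to infer the schema.
async function inferSchema(_description: string): Promise<Schema> {
  return { columns: ["company", "open_roles"], primaryKey: "company" };
}

// Stage 2 stub: broad entity discovery, here returning one duplicate.
async function discoverEntities(_description: string): Promise<string[]> {
  return ["Acme", "Globex", "Acme"];
}

// Stage 3 stub: one isolated sub-agent per entity (values are made up).
async function enrichEntity(entity: string): Promise<SourcedRow> {
  return {
    values: { company: entity, open_roles: entity.length },
    source: `https://example.com/${entity.toLowerCase()}`,
  };
}

// Stage 4a: keep the first row seen for each primary-key value.
function dedupe(rows: SourcedRow[], primaryKey: string): SourcedRow[] {
  const seen = new Map<string, SourcedRow>();
  for (const row of rows) {
    const key = String(row.values[primaryKey]);
    if (!seen.has(key)) seen.set(key, row);
  }
  return Array.from(seen.values());
}

// Stage 4b: naive CSV export (no quoting or escaping of commas).
function toCsv(schema: Schema, rows: SourcedRow[]): string {
  const header = [...schema.columns, "source"].join(",");
  const lines = rows.map((r) =>
    [...schema.columns.map((c) => String(r.values[c])), r.source].join(",")
  );
  return [header, ...lines].join("\n");
}

async function buildDataset(description: string): Promise<string> {
  const schema = await inferSchema(description);
  const entities = await discoverEntities(description);
  // Fan-out: all sub-agents run concurrently.
  const rows = await Promise.all(entities.map((e) => enrichEntity(e)));
  return toCsv(schema, dedupe(rows, schema.primaryKey));
}
```

`Promise.all` preserves input order, so deduplication keeps the row from the earliest-discovered entity. A production version would likely also bound concurrency and tolerate individual sub-agent failures, for example with `Promise.allSettled`.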
In practice
- Generate a live dataset of YC companies that are currently hiring engineers.
- Automate data refreshes for market intelligence.
- Create queryable tables from plain text.
Topics
- Multi-Agent Systems
- Structured Data Generation
- Live Datasets
- Open-Source Software
- Prompt Injection Prevention
- Data Automation
Best for: AI Architect, Machine Learning Engineer, AI Engineer, Data Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.