TinyFish Launches BigSet: An Open-Source Multi-Agent System That Builds Structured Live Datasets from Plain-English Descriptions

Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Intermediate, quick

Summary

TinyFish has open-sourced BigSet, a multi-agent system that builds structured, live datasets from a single plain-English description. For example, a user can request "YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles." The system works in four stages. First, Schema Inference, handled by Claude Sonnet, determines column names and data types. Second, an Orchestrator Agent, powered by Qwen, performs broad entity discovery using TinyFish Search. Third, in the Sub-Agent Fan-Out, isolated agents run in parallel; each is limited to six tool calls, and dataset writes go through a JS closure as a guard against prompt injection. Finally, the data is exported as CSV or XLSX with primary-key deduplication and source attribution. BigSet can also refresh datasets automatically at intervals such as every 30 minutes or daily. Unlike narrative-focused research tools, it produces tabular data that can be queried directly.
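The two sub-agent safeguards in the summary, the six-tool-call limit and the closure-bound dataset write, can be illustrated with a minimal sketch. The source says BigSet uses a JS closure; the version below shows the same idea in Python and is not BigSet's actual code. The names `make_row_writer` and `ToolBudget` are hypothetical.

```python
from typing import Callable

MAX_TOOL_CALLS = 6  # per-sub-agent limit stated in the summary


def make_row_writer(dataset: list[dict], columns: set[str],
                    entity_key: str) -> Callable[[dict], None]:
    """Return a write function bound to one dataset and one entity.

    The sub-agent is handed only this function, never the dataset itself,
    so text it reads on the web cannot redirect writes to another target.
    (Hypothetical helper, for illustration only.)
    """
    def write_row(values: dict) -> None:
        unknown = set(values) - columns
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        # The bound entity key is applied last, so the caller cannot override it.
        dataset.append({**values, "entity": entity_key})

    return write_row


class ToolBudget:
    """Counts tool calls and refuses any call past the limit (hypothetical)."""

    def __init__(self, limit: int = MAX_TOOL_CALLS) -> None:
        self.limit = limit
        self.used = 0

    def call(self, tool: Callable[..., object], *args: object) -> object:
        if self.used >= self.limit:
            raise RuntimeError("tool-call budget exhausted")
        self.used += 1
        return tool(*args)
```

Because the writer captures `dataset` and `entity_key` when it is created, the only thing a sub-agent can influence is the column values, and those are checked against the schema.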

Key takeaway

For data engineers and MLOps engineers who want to automate data pipeline creation, BigSet generates structured, live datasets from simple text prompts. This can reduce manual data collection and transformation work, and automatic refreshes keep analytical inputs current without continuous intervention. Consider using BigSet to prototype data sources quickly or to maintain continuously updated intelligence feeds, relying on its multi-agent architecture for parallel, isolated data acquisition.

Key insights

BigSet automates structured dataset creation and live updates from natural language using a multi-agent architecture.

Method

A Schema Inference agent defines the structure, an Orchestrator agent discovers entities, and parallel sub-agents fetch and insert the data; deduplication and export follow.
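The flow described above can be sketched end to end. This is a sketch under stated assumptions, not BigSet's implementation: the three agent functions are stand-ins that return fixed placeholder values instead of calling a model or a search API, and every function name, the `company` primary key, and the `example.com` URLs are hypothetical.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def infer_schema(description: str) -> list[str]:
    # Stand-in for the schema-inference model call; returns fixed columns.
    return ["company", "funding_stage", "location", "open_roles"]


def discover_entities(description: str) -> list[str]:
    # Stand-in for the orchestrator's search; the duplicate is deliberate.
    return ["Acme", "Globex", "Acme"]


def research_entity(name: str) -> dict:
    # Stand-in for one isolated sub-agent; values are placeholders, not real data.
    return {
        "company": name,
        "funding_stage": "unknown",
        "location": "unknown",
        "open_roles": 0,
        "source": f"https://example.com/{name.lower()}",
    }


def dedupe(rows: list[dict], primary_key: str) -> list[dict]:
    # Keep the first row seen for each primary-key value.
    seen: dict[str, dict] = {}
    for row in rows:
        seen.setdefault(row[primary_key], row)
    return list(seen.values())


def build_dataset(description: str) -> str:
    columns = infer_schema(description) + ["source"]  # source attribution column
    entities = discover_entities(description)
    # Fan out: one worker per entity, results returned in input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = list(pool.map(research_entity, entities))
    unique = dedupe(rows, primary_key="company")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(unique)
    return buffer.getvalue()
```

Calling `build_dataset("YC companies that are currently hiring engineers")` returns CSV text with a header row and one row per unique company. Deduplicating after the fan-out, rather than before, also catches the case where two differently named search hits resolve to the same primary key.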

Best for: AI Architect, Machine Learning Engineer, AI Engineer, Data Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.