Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
Summary
Infini-News is a new retrieval toolkit and index that provides efficient access to over 1.35 billion news articles from Common Crawl's CC-News archive, spanning August 2016 to the latest snapshot. It targets the high cost and heavy storage requirements of working with existing news corpora. The toolkit cleans article text, parses structured metadata, and enriches the corpus with language detection from the GlotLID, lingua, and CommonLingua classifiers. It also performs multi-source geographic attribution, resolving the country of origin for 83.4% of articles across 222 countries. Finally, Infini-News builds Infini-gram indexes, which are suffix-array structures, so that arbitrary text patterns can be searched across the entire archive in under a second.
Key takeaway
For Computational Social Science and NLP researchers needing access to vast news archives, Infini-News significantly reduces the computational and storage barriers. You can now perform longitudinal, cross-national media research with sub-second query times on over a billion articles, enriched with language and geographic metadata. Consider integrating Infini-News for your next large-scale text analysis project to streamline data access and processing.
Key insights
Infini-News provides efficient, queryable access to 1.35 billion Common Crawl news articles via Infini-gram indexes.
Principles
- Large-scale text corpora require efficient indexing.
- Metadata enrichment enhances corpus utility.
Method
The Infini-News pipeline extracts, cleans, and parses 1.35B articles, enriches them with language and geographic metadata, then builds suffix-array-based Infini-gram indexes over the text for rapid search.
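To illustrate why a suffix array makes arbitrary pattern search fast, here is a minimal toy sketch in Python. It is not the Infini-News or Infini-gram implementation: the function names `build_suffix_array` and `count_occurrences` are hypothetical, the index works on characters of a short in-memory string, and the construction is deliberately naive. The point is only that once suffixes are sorted, all occurrences of a pattern sit in one contiguous range that two binary searches can locate.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; it does not scale to large corpora.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text: str, sa: list[int], pattern: str, upper: bool) -> int:
    """Binary search for the start (upper=False) or end (upper=True) of the
    range of suffixes whose first len(pattern) characters equal `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of `pattern` using O(log n) prefix comparisons."""
    return _boundary(text, sa, pattern, True) - _boundary(text, sa, pattern, False)


text = "the news cycle: news about news"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "news"))    # 3
print(count_occurrences(text, sa, "sports"))  # 0
```

Each query touches only a logarithmic number of suffixes, which is the property that lets a suffix-array index answer pattern queries quickly regardless of how many matches exist.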
In practice
- Search 1.35B articles in sub-second time.
- Filter articles by language or country of origin.
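As a rough sketch of what metadata filtering could look like, the Python snippet below filters article records by language and country. The `Article` record, its field names, and the `filter_articles` helper are assumptions made for illustration; the actual Infini-News schema and query interface are not described in this summary. The country field is optional here because, per the summary, the country of origin is resolved for 83.4% of articles, so some records will have none.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Article:
    # Hypothetical record layout, not the real Infini-News schema.
    url: str
    language: str            # assumed language code, e.g. "en"
    country: Optional[str]   # assumed country code; None when unresolved


def filter_articles(
    articles: Iterable[Article],
    language: Optional[str] = None,
    country: Optional[str] = None,
) -> Iterator[Article]:
    """Yield articles matching every filter that was supplied."""
    for article in articles:
        if language is not None and article.language != language:
            continue
        if country is not None and article.country != country:
            continue
        yield article


sample = [
    Article("https://example.com/a", "en", "US"),
    Article("https://example.com/b", "en", None),
    Article("https://example.org/c", "fr", "FR"),
]
english = list(filter_articles(sample, language="en"))
print(len(english))  # 2
```

Filtering on a country code silently excludes unresolved articles, so a cross-national analysis built this way should report how many records were dropped for lacking attribution.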
Topics
- Infini-News
- Common Crawl News
- Infini-gram Indexes
- Language Detection
- Geographic Attribution
Best for: AI Scientist, Research Scientist, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.