Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Software Development & Engineering · Depth: Advanced, quick

Summary

Infini-News is a retrieval toolkit and index that provides efficient access to over 1.35 billion news articles from Common Crawl's CC-News archive, spanning August 2016 to the latest snapshot. It addresses the high cost and heavy storage requirements of existing news corpora. The toolkit cleans article text, parses structured metadata, and adds language labels from the GlotLID, lingua, and CommonLingua classifiers. Multi-source geographic attribution resolves the country of origin for 83.4% of articles, covering 222 countries. Infini-gram indexes, which are built on suffix arrays, support sub-second search for arbitrary text patterns across the entire archive.

Key takeaway

For computational social science and NLP researchers who need large news archives, Infini-News lowers the computational and storage barriers. Longitudinal, cross-national media research becomes feasible with sub-second queries over more than a billion articles enriched with language and geographic metadata. It is worth considering for large-scale text analysis projects where data access and preprocessing are the bottleneck.

Key insights

Infini-News provides efficient, queryable access to 1.35 billion Common Crawl news articles via Infini-gram indexes.

Method

The pipeline extracts, cleans, and parses 1.35 billion articles, enriches them with language and geographic metadata, and then builds suffix-array-based Infini-gram indexes for rapid search.
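
To make the suffix-array idea concrete, here is a minimal sketch of how such an index can answer arbitrary pattern queries with binary search. It is a toy, character-level illustration, not the Infini-News or Infini-gram implementation: the function names (`build_suffix_array`, `find_occurrences`) and the sample text are assumptions made for this example, and the naive construction shown here would not scale to a billion-article corpus.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start offsets of all suffixes of `text`, sorted lexicographically.

    Naive construction for illustration only; production indexes use
    far more efficient algorithms and on-disk storage.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text: str, sa: list[int], pattern: str, upper: bool) -> int:
    """Binary search over the suffix array.

    Compares only the first len(pattern) characters of each suffix.
    With upper=False, returns the first index whose prefix is >= pattern;
    with upper=True, returns the first index whose prefix is > pattern.
    """
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start offsets of every occurrence of `pattern`."""
    start = _bound(text, sa, pattern, upper=False)
    end = _bound(text, sa, pattern, upper=True)
    return sorted(sa[start:end])


# Hypothetical sample text standing in for the article corpus.
corpus = "markets fell as markets rose; markets closed"
index = build_suffix_array(corpus)
hits = find_occurrences(corpus, index, "markets")
print(len(hits), hits)
```

Because all suffixes that begin with the pattern sit next to each other in the sorted array, a query costs two binary searches regardless of how often the pattern occurs, which is the property that makes sub-second lookups plausible at archive scale.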

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.