Social-RAG: A Retrieval-Augmented Generation Pipeline for Computational Social Science Research on Telegram

Source: Paper Index on ACL Anthology · Field: Science & Research (Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies, Research Methodology & Innovation) · Depth: Expert, medium

Summary

Social-RAG is a modular Retrieval-Augmented Generation (RAG) architecture for scalable qualitative research on large, fast-moving text corpora, specifically public Telegram messages. The system prioritizes evidence traceability, auditability, and researcher control. Its key design elements are a "one post = one chunk" indexing strategy, semantic retrieval over vector embeddings with Approximate Nearest Neighbor (ANN) search, an Adaptive-K dynamic cutoff for context selection, and Maximal Marginal Relevance (MMR) re-ranking for diversity. Structured analytical instructions constrain generation to the retrieved evidence. Social-RAG was evaluated on vaccine discourse and on debates over Brazil's Lei Rouanet policy, using three language models: a local open-weight model, a cloud-hosted open-weight model, and a commercial closed model. Results indicate that the larger and closed models perform robustly on both narrative and factual tasks, while the smaller local model is better suited to exploratory narrative synthesis than to strict factual extraction.

Key takeaway

For computational social scientists analyzing large digital trace data, Social-RAG offers a framework for conducting qualitative inquiry at scale. Consider adopting its design principles, such as "one post = one chunk" indexing and Adaptive-K context selection, to maintain interpretive rigor and auditability. Be mindful of the trade-off between model size and task reliability: larger models are more dependable for factual extraction, while smaller ones can support exploratory narrative synthesis.
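As a rough illustration of these two principles, the sketch below keeps each post as a single traceable record and picks the context size from the retrieval scores. The summary does not state the exact Adaptive-K rule, so the cutoff here is an assumption: it cuts just before the largest drop between neighbouring similarity scores. The names `Post`, `adaptive_k`, `select_context`, `min_k`, and `max_k` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: str  # hypothetical identifier, kept so every claim can be traced to a post
    text: str     # the whole post is one chunk; it is never split


def adaptive_k(scores, min_k=1, max_k=20):
    """Return how many top-ranked items to keep.

    Assumed heuristic (not confirmed by the source): sort scores in
    descending order and cut just before the largest drop between
    neighbours, keeping between min_k and max_k items.
    """
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    if n <= min_k:
        return n
    last = min(max_k, n - 1)  # largest k that still has a neighbour after it
    best_k = min_k
    best_gap = ranked[min_k - 1] - ranked[min_k]
    for k in range(min_k + 1, last + 1):
        gap = ranked[k - 1] - ranked[k]
        if gap > best_gap:  # strict comparison keeps the smaller k on ties
            best_k, best_gap = k, gap
    return best_k


def select_context(scored_posts, min_k=1, max_k=20):
    """Keep the top-k (post, score) pairs, with k chosen by adaptive_k."""
    ranked = sorted(scored_posts, key=lambda pair: pair[1], reverse=True)
    k = adaptive_k([score for _, score in ranked], min_k, max_k)
    return [post for post, _ in ranked[:k]]


# Illustrative similarity scores only; they are not results from the paper.
scores = [0.91, 0.88, 0.86, 0.52, 0.50, 0.47]
posts = [(Post(f"p{i}", f"text {i}"), s) for i, s in enumerate(scores)]
kept = select_context(posts)
print([p.post_id for p in kept])
```

Because each retained item is a whole post with its identifier, the generated analysis can cite post IDs directly, which is what makes the output auditable.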

Key insights

Social-RAG enables scalable qualitative inquiry on large text corpora while preserving evidence traceability and researcher control.

Principles

Method

Social-RAG combines "one post = one chunk" indexing, semantic retrieval with ANN search, an Adaptive-K cutoff, MMR re-ranking, and structured instructions that constrain LLM generation to the retrieved evidence.
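The MMR re-ranking step can be sketched as follows. This is the standard greedy form of Maximal Marginal Relevance, not necessarily the paper's exact implementation; the function names, the trade-off weight `lam`, and the toy vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k, lam=0.7):
    """Greedy MMR: return indices of up to k documents.

    Each step picks the document maximizing
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    lam=1.0 reduces to plain relevance ranking.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D embeddings (assumed): docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.10], [1.0, 0.12], [0.7, -0.7]]
print(mmr(query, docs, k=2, lam=0.5))
print(mmr(query, docs, k=2, lam=1.0))
```

With a balanced weight the second pick skips the near-duplicate post in favor of the less similar one, while `lam=1.0` falls back to ranking by relevance alone. In a corpus of forwarded or copy-pasted Telegram messages, that diversity pressure keeps repeated posts from crowding out other evidence.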

In practice

Topics

Best for: AI Scientist, Research Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Paper Index on ACL Anthology.