Patrick Harrison: Financial NLP at S&P Global
Summary
S&P Global, a Fortune 500 financial data and technology company with a market cap above $50 billion, is developing a next-generation ESG (Environmental, Social, Governance) data set using NLP and active learning. The initiative addresses the difficulty of collecting ESG data, which is largely unstructured and, unlike conventional financial metrics, is not reported in a regulated format. The NLP pipeline frames the task as multi-label classification, using tools such as spaCy's TextCat and PyTorch-based BERT models to identify text spans that serve as evidence for hundreds of ESG attributes. The active learning lifecycle starts with a handful of expert annotations, iteratively trains and fine-tunes models, has domain experts validate the predictions, and feeds the validated results back as new training data. The process is designed to meet S&P Global's 100% accuracy guarantee and then moves into a production operations mode for ongoing data collection and model refinement across thousands of companies.
Key takeaway
If you are a Director of AI/ML building high-accuracy data products from unstructured text, prioritize human-in-the-loop systems. Even models with high F1 scores must sit inside a larger workflow that can guarantee 100% precision and recall on the delivered data, especially for critical financial or regulatory data. Implement active learning loops with domain expert validation to keep improving model performance and to generate gold-standard training data, so that the overall system meets stringent accuracy requirements.
Key insights
S&P Global builds high-accuracy ESG datasets by combining NLP (spaCy, BERT) with an active learning loop and human expert validation.
Principles
- 100% data accuracy requires human-in-the-loop systems.
- Unstructured data collection benefits from iterative active learning.
- Domain experts are crucial for high-quality labeled data.
Method
An active learning lifecycle for multi-label classification: initial expert annotations, model training (spaCy TextCat, BERT), prediction on historical documents, expert validation, and iterative feedback to refine models and generate training data.
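The lifecycle above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not S&P Global's implementation: `KeywordModel` is a toy stand-in for a trained spaCy TextCat or BERT model, the two labels are hypothetical, `ask_expert` simulates the domain expert, and selecting the least certain texts (uncertainty sampling) is one common active learning strategy that the source does not confirm.

```python
# Hypothetical ESG attribute labels, for illustration only.
LABELS = ("emissions_target", "board_diversity")


class KeywordModel:
    """Toy stand-in for a real classifier such as spaCy TextCat or BERT."""

    def __init__(self):
        self.vocab = {label: set() for label in LABELS}

    def train(self, examples):
        # examples: list of (text, set_of_labels) pairs validated by experts
        self.vocab = {label: set() for label in LABELS}
        for text, labels in examples:
            for label in labels:
                self.vocab[label].update(text.lower().split())

    def predict(self, text):
        # Score each label independently by token overlap with its vocabulary
        tokens = set(text.lower().split())
        if not tokens:
            return {label: 0.0 for label in LABELS}
        return {
            label: len(tokens & words) / len(tokens)
            for label, words in self.vocab.items()
        }


def uncertainty(scores):
    # Distance of the least certain label score from the 0.5 decision boundary
    return min(abs(score - 0.5) for score in scores.values())


def active_learning_round(model, labeled, unlabeled, ask_expert, batch_size=2):
    """Train, pick the least certain texts, and have the expert label them."""
    model.train(labeled)
    ranked = sorted(unlabeled, key=lambda text: uncertainty(model.predict(text)))
    batch = ranked[:batch_size]
    for text in batch:
        labeled.append((text, ask_expert(text)))  # expert validation step
        unlabeled.remove(text)
    return batch


# Simulated expert answers (assumed data, not from the source)
gold = {
    "the board is 40 percent women": {"board_diversity"},
    "net zero emissions pledge": {"emissions_target"},
    "quarterly revenue grew": set(),
}
labeled = [("we target net zero emissions by 2040", {"emissions_target"})]
unlabeled = list(gold)
model = KeywordModel()
for round_number in range(2):
    batch = active_learning_round(
        model, labeled, unlabeled, gold.__getitem__, batch_size=1
    )
    print(round_number, batch)
```

Each round grows the expert-validated training set, which is the part of the loop that produces gold-standard data regardless of which model sits behind `predict`.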
In practice
- Use spaCy for efficient text preprocessing and tokenization.
- Frame evidence extraction as multi-label classification.
- Integrate domain experts for high-quality data labeling.
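To make the multi-label framing concrete, here is a minimal sketch of how a text span's annotations and predictions might be represented. The label names and logits are hypothetical. The `{label: 0.0 or 1.0}` dictionary mirrors the `cats` annotation format spaCy uses for text categorization, and the independent per-label sigmoid with a 0.5 threshold is a common decoding choice assumed here, not a detail taken from the source.

```python
import math

# Hypothetical ESG attributes; the real system covers hundreds of them.
LABELS = ["emissions_target", "board_diversity", "water_usage"]


def to_cats(active_labels):
    """Multi-hot annotation: every label is independently 1.0 or 0.0."""
    return {label: 1.0 if label in active_labels else 0.0 for label in LABELS}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Score each label on its own, so a span can get zero, one, or many labels."""
    return [
        label
        for label, logit in zip(LABELS, logits)
        if sigmoid(logit) >= threshold
    ]


# One span can be evidence for several attributes at once
print(to_cats({"emissions_target", "water_usage"}))
print(decode([2.0, -3.0, 0.1]))
print(decode([-1.0, -1.0, -1.0]))
```

Unlike a softmax over mutually exclusive classes, the per-label scores do not need to sum to one, which is what lets a single span support several ESG attributes or none.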
Topics
- Financial NLP
- ESG Data
- Active Learning
- Multi-label Classification
- spaCy
- BERT
- Data Accuracy
Best for: Machine Learning Engineer, NLP Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Explosion (Explosion.ai).