Can LLM Embeddings Improve Time Series Forecasting? A Practical Feature Engineering Approach
Summary
This article explores whether integrating large language model (LLM) embeddings as engineered features can enhance time series forecasting performance. It details a practical example using daily Dow Jones Industrial Average (DJIA) closing prices and corresponding financial news headlines from 2008 to 2016. The process involves building a baseline model with traditional time series features (lagged returns, rolling statistics) and a full model that additionally incorporates 20-dimensional PCA-reduced embeddings generated from news headlines using a SentenceTransformer model ("all-MiniLM-L6-v2"). Both models, trained with `LGBMClassifier`, predict the direction of the next day's DJIA return. The baseline model achieved an accuracy of 0.5, while the full model with embeddings achieved 0.50476, indicating only a marginal, practically insignificant improvement.
Key takeaway
Data scientists evaluating new feature engineering techniques for financial time series forecasting should establish a strong baseline with traditional features before adding LLM embeddings. The marginal gain observed here (accuracy of 0.5 versus 0.50476) suggests that LLM embeddings are not a universal solution. Any improvement needs rigorous, context-specific validation across multiple experimental settings and time splits before it can be called consistent or statistically meaningful.
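One way to put the 0.5 versus 0.50476 gap in perspective is to ask how much an accuracy figure moves from sampling noise alone, and to evaluate on several later time windows rather than one. The sketch below uses only the standard library. The summary does not state the test-set size, so `N_TEST = 210` is an assumption, chosen only because 106/210 is approximately 0.50476. The helper names `accuracy_interval` and `expanding_window_splits` are hypothetical.

```python
import math
from typing import Iterator, Tuple


def accuracy_interval(accuracy: float, n_samples: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a classification accuracy."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return accuracy - z * std_err, accuracy + z * std_err


def expanding_window_splits(n_samples: int, n_splits: int, min_train: int) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges where every test block follows its training data."""
    fold = (n_samples - min_train) // n_splits
    if fold < 1:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        train_end = min_train + k * fold
        yield range(0, train_end), range(train_end, train_end + fold)


N_TEST = 210  # assumption: the summary does not report the test-set size

baseline_ci = accuracy_interval(0.5, N_TEST)
full_ci = accuracy_interval(0.50476, N_TEST)
print("baseline 95% interval:", baseline_ci)
print("full model 95% interval:", full_ci)

for train_idx, test_idx in expanding_window_splits(n_samples=1000, n_splits=4, min_train=600):
    print(f"train on rows 0-{train_idx[-1]}, test on rows {test_idx[0]}-{test_idx[-1]}")
```

Under that assumed test size, each interval is roughly plus or minus 0.07 wide, so the two accuracies are far closer to each other than the noise in either one. Repeating the comparison on each expanding-window split shows whether the embedding model wins consistently or only on one slice of history.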
Key insights
In this example, adding LLM embeddings to a time series forecasting model produced only a marginal accuracy gain, so any apparent improvement needs careful validation.
Principles
- Traditional time series features establish a robust baseline.
- Dimensionality reduction (here, PCA down to 20 components) keeps the embedding features compact alongside the traditional ones.
- Performance gains from LLM embeddings are context-dependent.
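The baseline features mentioned in the summary (lagged returns and rolling statistics) can be sketched with pandas as follows. This is a minimal illustration on synthetic prices, not the original article's code: the function name `build_baseline_features`, the number of lags, and the rolling window length are all assumptions. Each row uses only information available at that day's close, and the target is the direction of the next day's return.

```python
import numpy as np
import pandas as pd


def build_baseline_features(close: pd.Series, n_lags: int = 5, window: int = 10) -> pd.DataFrame:
    """Lagged returns and rolling statistics, plus a next-day direction target."""
    ret = close.pct_change()
    feats = pd.DataFrame(index=close.index)
    # Lag 0 is today's return, which is known once today's close is known.
    for lag in range(n_lags):
        feats[f"ret_lag_{lag}"] = ret.shift(lag)
    feats[f"roll_mean_{window}"] = ret.rolling(window).mean()
    feats[f"roll_std_{window}"] = ret.rolling(window).std()
    # Target: 1 if the next day's return is positive, else 0.
    next_ret = ret.shift(-1)
    feats["target"] = (next_ret > 0).astype(int)
    # Drop warm-up rows with missing features and the last row, which has no next day.
    keep = next_ret.notna() & feats.notna().all(axis=1)
    return feats.loc[keep]


# Synthetic price path standing in for real DJIA closes (assumption for the demo).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2015-01-01", periods=60)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 60))), index=dates)

features = build_baseline_features(close)
print(features.shape)
print(features.head())
```

Dropping the final row matters: without it, the missing next-day return would silently be labeled as a down day.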
Method
Generate LLM embeddings from related text data, reduce dimensionality via PCA, then merge with traditional time series features to train and compare forecasting models against a baseline.
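The reduce-and-merge step of this method might look like the sketch below. It uses random 384-dimensional vectors as a stand-in for real daily headline embeddings (384 is the output size of "all-MiniLM-L6-v2"), and the function name `add_reduced_embeddings` is hypothetical. Fitting PCA on the training period only is a precaution taken in this sketch to avoid leaking test-period information; the summary does not say how the original article fitted it.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def add_reduced_embeddings(
    features: pd.DataFrame,
    embeddings: pd.DataFrame,
    train_end: pd.Timestamp,
    n_components: int = 20,
) -> pd.DataFrame:
    """Fit PCA on training-period embeddings, then join the components by date."""
    common = features.index.intersection(embeddings.index)
    emb = embeddings.loc[common].sort_index()
    train_mask = emb.index <= train_end
    pca = PCA(n_components=n_components, random_state=0)
    pca.fit(emb.loc[train_mask].to_numpy())
    reduced = pd.DataFrame(
        pca.transform(emb.to_numpy()),
        index=emb.index,
        columns=[f"emb_pc_{i}" for i in range(n_components)],
    )
    # Inner join keeps only days that have both price features and text features.
    return features.join(reduced, how="inner")


# Synthetic stand-ins (assumptions): one feature column, a random target, random embeddings.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2015-01-01", periods=120)
features = pd.DataFrame(
    {"ret_lag_0": rng.normal(0, 0.01, 120), "target": rng.integers(0, 2, 120)},
    index=dates,
)
embeddings = pd.DataFrame(rng.normal(size=(120, 384)), index=dates)

train_end = dates[95]
full = add_reduced_embeddings(features, embeddings, train_end)

# Chronological split: no shuffling, the test period strictly follows the training period.
train = full.loc[full.index <= train_end]
test = full.loc[full.index > train_end]
print(full.shape, len(train), len(test))
```

The baseline model would then be trained on the non-embedding columns of `train`, and the full model on all feature columns, with both scored on the same `test` rows.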
In practice
- Use `yfinance` for stock data retrieval.
- Apply `SentenceTransformer` for text embeddings.
- Employ `LGBMClassifier` for forecasting models.
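A rough sketch of how these three libraries could fit together is shown below. It is not the original article's code. The helper names are hypothetical, `"^DJI"` is the Yahoo Finance symbol for the DJIA, the date range is an assumption based on the 2008 to 2016 span, and averaging headline vectors per day is one possible aggregation that the summary does not confirm. The third-party imports sit inside the functions because they need extra packages, a network connection, or a model download.

```python
import pandas as pd


def load_djia_close(start: str = "2008-01-01", end: str = "2016-12-31") -> pd.Series:
    """Download daily DJIA closing prices (requires network access)."""
    import yfinance as yf

    history = yf.Ticker("^DJI").history(start=start, end=end)
    close = history["Close"]
    # Strip the exchange time zone so dates line up with plain headline dates.
    close.index = close.index.tz_localize(None).normalize()
    return close


def embed_headlines(headlines):
    """Encode headline strings into 384-dimensional vectors."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(list(headlines))


def daily_mean_embedding(dates, vectors) -> pd.DataFrame:
    """Average all headline vectors that share a date (an assumed aggregation)."""
    frame = pd.DataFrame(vectors, index=pd.to_datetime(dates))
    return frame.groupby(level=0).mean()


def fit_and_score(X_train, y_train, X_test, y_test) -> float:
    """Train an LGBMClassifier and return its accuracy on the test period."""
    from lightgbm import LGBMClassifier
    from sklearn.metrics import accuracy_score

    model = LGBMClassifier(random_state=42)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Small offline demo of the aggregation step with made-up two-dimensional vectors.
demo = daily_mean_embedding(
    ["2015-01-02", "2015-01-02", "2015-01-05"],
    [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]],
)
print(demo)
```

Calling `fit_and_score` twice, once with only the traditional feature columns and once with the embedding columns added, gives the baseline-versus-full comparison the summary describes.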
Topics
- LLM Embeddings
- Time Series Forecasting
- Feature Engineering
- Financial Time Series
- Sentence Transformers
Best for: AI Engineer, Machine Learning Engineer, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by MachineLearningMastery.com.