Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The CADE (Contrastive Alignment with Direct Embedding) framework addresses the tokenization bottleneck in Time-Series Question Answering (TSQA) for Large Language Models (LLMs). Traditional Byte Pair Encoding (BPE) fragments continuous numerical values and loses critical magnitude and trend information, while prior patch-based encoders lock in a fixed temporal granularity. CADE introduces a point-wise linear encoder and an MLP projector that map each timestep directly into the LLM embedding space, preserving exact index-level access and handling variable series lengths without patching. In addition, a one-directional supervised contrastive loss aligns time-series embeddings with frozen class-name text anchors, strengthening semantic correspondence. Experiments on the Time-MQA benchmark show consistent improvements across six TSQA tasks, including raising forecasting FCR from 0.46 to 0.596 and reducing imputation MSE from 2,399,043 to 34,532 (roughly a 69-fold reduction). The 0.6B-parameter model outperforms both open-source and proprietary LLM baselines, including DeepSeek-V3.2, on numeric accuracy and discriminative understanding.

Key takeaway

Machine learning engineers integrating time series with LLMs should prioritize direct numerical embedding over standard text tokenization. Preserving the metric structure of the series yields markedly better accuracy on tasks such as forecasting and imputation. Consider a lightweight point-wise linear encoder followed by an MLP projector for continuous input: in the reported experiments, this approach outperforms BPE serialization and even larger, general-purpose LLMs on numeric tasks.

Key insights

Direct timestep embedding and contrastive alignment overcome LLM tokenization limits for time-series data.

Principles

BPE tokenization destroys the metric structure of numerical time series.
Continuous token interfaces preserve the magnitude and trend information that BPE serialization loses.
Cross-modal alignment improves shared representations across tasks.
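
To make the first principle concrete, here is a toy sketch of how subword segmentation can separate numerically adjacent values. The vocabulary and the greedy longest-match rule are hypothetical stand-ins chosen for illustration; a real BPE tokenizer applies a learned merge table, so its exact splits will differ.

```python
# Toy illustration only: a hypothetical vocabulary with greedy longest-match
# splitting, standing in for a learned BPE merge table.
TOY_VOCAB = {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
             ".", "10", "99", "100", "000"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Segment text by repeatedly taking the longest vocabulary match."""
    tokens, i = [], 0
    max_len = max(len(entry) for entry in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


for value in ("999", "1000", "1001"):
    print(value, toy_tokenize(value))
```

Under this toy vocabulary, 999 splits into 99 and 9 while 1000 splits into 100 and 0, so two values that differ by one share no tokens, whereas 1000 and 1001 share a prefix token. Token overlap therefore says little about numeric distance.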

Method

CADE maps each timestep directly into the LLM embedding space with a point-wise linear encoder and an MLP projector. A one-directional supervised contrastive loss then aligns the resulting time-series embeddings with frozen class-name text anchors.
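
The direct-embedding path described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the CADE implementation: the layer sizes, the ReLU activation, the random initialization, and the name embed_series are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the actual dimensions depend on the model being used.
D_ENC, D_HIDDEN, D_LLM = 16, 32, 64

# Point-wise linear encoder: one scalar per timestep -> D_ENC features.
W_enc = rng.normal(scale=0.1, size=(1, D_ENC))
b_enc = np.zeros(D_ENC)

# Two-layer MLP projector: D_ENC -> D_HIDDEN -> D_LLM (ReLU is an assumption).
W1 = rng.normal(scale=0.1, size=(D_ENC, D_HIDDEN))
b1 = np.zeros(D_HIDDEN)
W2 = rng.normal(scale=0.1, size=(D_HIDDEN, D_LLM))
b2 = np.zeros(D_LLM)


def embed_series(series):
    """Map a 1-D series of length T to a (T, D_LLM) array, one row per timestep."""
    x = np.asarray(series, dtype=np.float64).reshape(-1, 1)  # (T, 1)
    h = x @ W_enc + b_enc                                    # (T, D_ENC)
    h = np.maximum(h @ W1 + b1, 0.0)                         # (T, D_HIDDEN)
    return h @ W2 + b2                                       # (T, D_LLM)


short = embed_series([0.5, 1.5, 2.5])
longer = embed_series(np.linspace(0.0, 1.0, 10))
print(short.shape, longer.shape)
```

Because every timestep passes through the same weights independently, series of any length produce one embedding row per value with no patching, and row t depends only on the value at index t.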

In practice

Implement direct linear projection for time-series input to LLMs.
Use one-directional contrastive loss for semantic alignment.
Prioritize continuous token interfaces over BPE for numerical data.
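
One plausible reading of the one-directional contrastive loss is sketched below. It is an assumption, not the paper's confirmed formulation: each time-series embedding is scored against every class anchor by cosine similarity, and a softmax cross-entropy pulls it toward the anchor of its own class. Only the time-series-to-text direction is computed, and the anchors are treated as constants. The function names and the temperature of 0.07 are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def one_directional_supcon(ts_emb, anchors, labels, temperature=0.07):
    """Cross-entropy from each series embedding to its class anchor.

    ts_emb:  (N, d) time-series embeddings (the trainable side)
    anchors: (C, d) class-name text embeddings (assumed frozen)
    labels:  (N,) integer class index per series
    """
    labels = np.asarray(labels)
    z = l2_normalize(np.asarray(ts_emb, dtype=np.float64))
    a = l2_normalize(np.asarray(anchors, dtype=np.float64))
    logits = z @ a.T / temperature                      # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(labels.shape[0]), labels].mean()


anchors = np.eye(3)                   # three toy class anchors
embeddings = anchors.copy()           # each series sits exactly on one anchor
print(one_directional_supcon(embeddings, anchors, [0, 1, 2]))  # matched labels
print(one_directional_supcon(embeddings, anchors, [1, 2, 0]))  # mismatched labels
```

In the toy example the loss is near zero when each embedding matches the anchor of its label and large when the labels are shifted, which is the behavior an alignment objective of this kind should show.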

Topics

Time-Series Question Answering
Large Language Models
Direct Timestep Embedding
Contrastive Learning
Tokenization Bottleneck
Time-MQA Benchmark

Code references

YafengWu/CADE

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.