Kaggle Winners Walkthroughs: Jane Street Real-Time Market Data Forecasting with Team Patrick Yam

2026-03-23 · Source: Kaggle · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, FinTech & Digital Financial Services · Depth: Expert, long

Summary

A Kaggle Grandmaster details a winning solution for a financial market forecasting competition, built on a modified Axial Transformer. The architecture swaps one self-attention layer, the one that runs along the time axis, for a Gated Recurrent Unit (GRU) layer, and processes an entire day's data as a single 2D input (time steps by assets) to capture both time series and cross-sectional information. Feature engineering is minimal, adding only a Gaussian-ranked "time of day" feature. The training regimen uses seed ensembling with multiple random seeds instead of traditional cross-validation, combined with multitask learning and a custom R-squared loss function. The solution also applies daily online learning with an Adam optimizer and includes a fast inference module that parallelizes predictions across multiple models, running 17 models in 2.5 hours against a 9-hour limit.
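The "Gaussian-ranked time of day" feature can be pictured with a small sketch. The summary does not give the author's exact transform, so the version below is an assumption: it maps each intraday step index to a rank in (0, 1) and then through the inverse normal CDF, which is the usual meaning of a Gaussian rank transform. The function name, the `n_steps` parameter, and the `eps` clipping value are all hypothetical.

```python
from statistics import NormalDist

def gauss_rank_time_of_day(n_steps, eps=1e-3):
    """Return one roughly standard-normal score per intraday step index.

    Assumption: steps are equally spaced and indexed 0..n_steps-1.
    eps keeps the uniform rank away from 0 and 1, where the inverse
    normal CDF is undefined.
    """
    normal = NormalDist()  # standard normal, mean 0 and std 1
    scores = []
    for step in range(n_steps):
        # Rank of this step scaled to [0, 1]; a single step maps to 0.5.
        rank = step / (n_steps - 1) if n_steps > 1 else 0.5
        # Shrink into [eps, 1 - eps] before applying the inverse CDF.
        uniform = eps + (1.0 - 2.0 * eps) * rank
        scores.append(normal.inv_cdf(uniform))
    return scores

if __name__ == "__main__":
    print([round(s, 3) for s in gauss_rank_time_of_day(5)])
```

The resulting scores increase with the time of day and are symmetric around zero, so the feature arrives at the network on a scale comparable to other standardized inputs.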

Key takeaway

Data scientists building financial forecasting models should consider 2D transformer architectures that process time series and cross-sectional data together. Daily online learning helps a model adapt to market shifts, and seed ensembling can improve robustness. To meet strict deployment time limits, prioritize fast inference techniques such as replacing the separate linear modules of each ensemble member with stacked weights.

Key insights

A modified Axial Transformer with GRU and online learning excels in financial time series forecasting.

Principles

Combine time series and cross-sectional data.
Online learning is critical for dynamic financial data.
Seed ensembling improves model robustness.
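The online learning principle can be illustrated with a toy loop: predict for the day first, then take one Adam step once that day's labels are known. This is only a sketch under stated assumptions. It uses a linear model and a mean squared error gradient in NumPy instead of the author's transformer and loss, and the `Adam` class, `run_online` function, and hyperparameters are hypothetical names and values chosen for the example.

```python
import numpy as np

class Adam:
    """Minimal Adam optimizer for a single parameter array."""

    def __init__(self, shape, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)  # running mean of gradients
        self.v = np.zeros(shape)  # running mean of squared gradients
        self.t = 0                # number of steps taken so far

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def run_online(days, w, opt):
    """Predict each day with the current weights, then update on that day."""
    preds = []
    for X, y in days:
        preds.append(X @ w)                      # forecast before labels arrive
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # MSE gradient for a linear model
        w = opt.step(w, grad)                    # one daily update
    return w, preds
```

The ordering matters: each day's forecast only uses weights fitted on earlier days, so the loop never leaks labels into its own predictions.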

Method

The method involves a modified Axial Transformer with a GRU layer, minimal feature engineering, multitask learning with weighted R-squared loss, seed ensembling, daily online learning, and a parallelized inference module for speed.
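A weighted R-squared loss of the kind mentioned here might look like the sketch below. It assumes the sample-weighted, zero-mean R-squared used as the competition metric, R² = 1 − Σ w (y − ŷ)² / Σ w y², and turns it into a quantity to minimize. How the author combined the auxiliary targets is not stated in this summary, so the equal averaging across target columns in `multitask_r2_loss` is an assumption, as are the function names.

```python
import numpy as np

def weighted_r2(y_true, y_pred, weight):
    """Sample-weighted zero-mean R-squared for a single target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weight = np.asarray(weight, dtype=float)
    num = np.sum(weight * (y_true - y_pred) ** 2)
    den = np.sum(weight * y_true ** 2)
    return 1.0 - num / den

def multitask_r2_loss(y_true, y_pred, weight, eps=1e-12):
    """Mean of (1 - R-squared) over target columns.

    y_true, y_pred: arrays of shape (n_samples, n_targets).
    weight: array of shape (n_samples,), shared by all targets.
    Assumption: every target contributes equally to the loss.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = np.asarray(weight, dtype=float)[:, None]  # broadcast over targets
    num = np.sum(w * (y_true - y_pred) ** 2, axis=0)
    den = np.sum(w * y_true ** 2, axis=0) + eps   # eps guards an all-zero target
    return float(np.mean(num / den))
```

Minimizing this loss is equivalent to maximizing the average R-squared, so training optimizes the same quantity the leaderboard scores instead of a proxy such as plain mean squared error.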

In practice

Replace self-attention with GRU for time series.
Use Optuna for optimizer hyperparameter tuning.
Implement parallel inference for ensemble models.
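The idea behind parallel ensemble inference with stacked weights can be shown on a single linear layer. The sketch below is an illustration in NumPy, not the author's inference module: it assumes every ensemble member shares the same layer shape, stacks their weights along a leading model axis, and replaces the per-model loop with one batched contraction. All function names are hypothetical.

```python
import numpy as np

def loop_forward(x, weights, biases):
    """Baseline: apply each model's linear layer one after another."""
    return np.stack([x @ W + b for W, b in zip(weights, biases)])

def stacked_forward(x, W_stack, b_stack):
    """Apply all models' linear layers in a single batched operation.

    x:       (n_samples, d_in), shared input for every model
    W_stack: (n_models, d_in, d_out)
    b_stack: (n_models, d_out)
    returns: (n_models, n_samples, d_out)
    """
    out = np.einsum("nd,mdo->mno", x, W_stack)
    return out + b_stack[:, None, :]

def ensemble_predict(x, W_stack, b_stack):
    """Average the per-model outputs into one ensemble prediction."""
    return stacked_forward(x, W_stack, b_stack).mean(axis=0)
```

Both paths return the same numbers; the stacked version issues one large operation instead of many small ones, which is typically where the time saving comes from when many models must run under a fixed time budget.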

Topics

Axial Transformer
Time Series Forecasting
Online Learning
Multitask Learning
Inference Optimization

Best for: AI Data Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Kaggle.