Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery
Summary
A neuro-symbolic framework combines gradient saliency from a Mamba State Space Model (SSM) with chain-of-thought (CoT) reasoning from the DeepSeek-R1 Large Language Model (LLM) for biomarker discovery in high-dimensional RNA-seq data. The researchers trained a Mamba SSM on TCGA-BRCA RNA-seq data and extracted the top 50 genes by gradient saliency. DeepSeek-R1 then evaluated each candidate gene with structured CoT reasoning, yielding a final 17-gene set. The LLM-filtered set achieved an AUC of 0.927, outperforming both a 5,000-gene variance baseline (AUC 0.903) and the raw 50-gene saliency set (AUC 0.832), while using 294x fewer features than the baseline. A faithfulness audit against COSMIC CGC, OncoKB, and PAM50 found that only 6 of the 17 selected genes (35.3%) were validated BRCA biomarkers, and that 10 of the 16 known BRCA genes in the input were missed, including FOXA1. The authors interpret this as "selective faithfulness": targeted confounder removal drives the performance gain despite incomplete biological recall.
Key takeaway
AI engineers building genomic feature selection pipelines should consider using LLM-based reasoning to filter noisy gradient-saliency features. LLM filtering can raise classification performance by removing confounders, but high task metrics do not imply biological faithfulness. Audit the LLM-selected gene set independently against curated databases to surface its limitations and false negatives, such as FOXA1, before deploying it in critical applications.
Key insights
LLM reasoning improved classification performance (AUC 0.927 versus 0.832 for the raw saliency set) by removing confounders, even though its biological recall was imperfect.
Principles
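- Gradient saliency alone can degrade classifier performance: the raw 50-gene saliency set (AUC 0.832) underperformed the 5,000-gene variance baseline (AUC 0.903).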
- LLM reasoning can be causally necessary for performance gains.
- Task performance is an unreliable proxy for reasoning faithfulness.
Method
A Mamba SSM generates gradient saliency scores from RNA-seq data, which are then filtered by a DeepSeek-R1 LLM using structured chain-of-thought reasoning and explicit rejection/keep criteria to refine biomarker candidates.
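The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a tiny logistic model stands in for the Mamba SSM, a hard-coded `stub_judge` stands in for DeepSeek-R1's keep/reject decision, and the gene names, weights, and function names are all hypothetical placeholders.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gradient_saliency(
    samples: Sequence[Sequence[float]], weights: Sequence[float], bias: float
) -> List[float]:
    """Mean |dp/dx_j| over samples for p = sigmoid(w.x + b).

    ASSUMPTION: a logistic model replaces the Mamba SSM so the gradient
    has a closed form, dp/dx_j = p * (1 - p) * w_j.
    """
    totals = [0.0] * len(weights)
    for x in samples:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for j, w in enumerate(weights):
            totals[j] += abs(p * (1.0 - p) * w)
    return [t / len(samples) for t in totals]


def top_k(genes: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Return the k genes with the highest saliency scores."""
    ranked = sorted(zip(genes, scores), key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


def filter_candidates(
    candidates: Sequence[str], judge: Callable[[str], bool]
) -> List[str]:
    """Keep only the candidates the judge accepts (True = keep)."""
    return [gene for gene in candidates if judge(gene)]


# Toy data: placeholder gene names and weights, not taken from the paper.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
weights = [2.0, -1.5, 0.1, 0.8]
samples = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.5, 0.2, 0.1]]

saliency = gradient_saliency(samples, weights, bias=0.0)
candidates = top_k(genes, saliency, k=3)


def stub_judge(gene: str) -> bool:
    # HYPOTHETICAL stand-in for the LLM verdict: reject one "confounder".
    return gene not in {"GENE_B"}


selected = filter_candidates(candidates, stub_judge)
print(candidates, selected)
```

In a real pipeline the judge would wrap an LLM call that returns a structured keep/reject verdict with its reasoning, which is what makes the later audit step possible.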
In practice
- Use Mamba SSM for efficient high-dimensional genomic data processing.
- Apply LLM CoT for targeted confounder removal in feature selection.
- Audit LLM reasoning against ground-truth databases.
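The audit step can be made concrete with simple set arithmetic. The sketch below is illustrative only: the `audit` function and the placeholder gene IDs are assumptions, and the sets are constructed solely to reproduce the counts reported in the summary (50 candidates, 16 known BRCA genes among them, 17 selected, 6 validated). Those counts imply a recall of 6/16 = 37.5% on the known genes in the input.

```python
from typing import Dict, Set


def audit(selected: Set[str], candidates: Set[str], reference: Set[str]) -> Dict[str, object]:
    """Compare a selected gene set against a curated reference set.

    precision: fraction of selected genes found in the reference.
    recall:    fraction of reference genes present in the candidate
               pool that survived selection.
    missed:    reference genes in the candidate pool that were dropped.
    """
    known_in_input = candidates & reference
    hits = selected & reference
    missed = known_in_input - selected
    return {
        "precision": len(hits) / len(selected) if selected else 0.0,
        "recall": len(hits) / len(known_in_input) if known_in_input else 0.0,
        "missed": sorted(missed),
    }


# Placeholder IDs (NOT real gene names) sized to match the reported counts.
known = {f"known_{i}" for i in range(16)}   # known BRCA genes in the input
other = {f"other_{i}" for i in range(34)}   # remaining saliency candidates
candidates = known | other                  # 50 candidates in total
selected = {f"known_{i}" for i in range(6)} | {f"other_{i}" for i in range(11)}

report = audit(selected, candidates, reference=known)
reduction = 5000 / len(selected)            # versus the 5,000-gene baseline

print(f"precision={report['precision']:.3f}")
print(f"recall={report['recall']:.3f}")
print(f"missed={len(report['missed'])} genes, reduction={reduction:.0f}x")
```

Reporting the missed genes by name, rather than only the ratios, is what exposes false negatives such as FOXA1.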
Topics
- Mamba SSM
- LLM Chain-of-Thought
- Biomarker Discovery
- Feature Selection
- TCGA-BRCA RNA-seq
Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.