Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, long

Summary

A neuro-symbolic framework combines gradient saliency from a Mamba State Space Model (SSM) with chain-of-thought (CoT) reasoning from the DeepSeek-R1 Large Language Model (LLM) for biomarker discovery in high-dimensional RNA-seq data. The researchers trained a Mamba SSM on TCGA-BRCA RNA-seq data and extracted the top 50 genes by gradient saliency. DeepSeek-R1 then evaluated each candidate gene with structured CoT reasoning, yielding a final 17-gene set. This LLM-filtered set achieved an AUC of 0.927, outperforming both a 5,000-gene variance baseline (AUC 0.903) and the raw 50-gene saliency set (AUC 0.832) while using about 294x fewer features than the baseline. A faithfulness audit against COSMIC CGC, OncoKB, and PAM50 found that only 6 of the 17 selected genes (35.3%) were validated BRCA biomarkers, and that 10 of the 16 known BRCA genes in the LLM's input were missed, including FOXA1. This suggests "selective faithfulness": targeted confounder removal drives the performance gain despite incomplete biological recall.
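As a quick sanity check, the headline figures in the summary can be re-derived from the reported counts. The short Python sketch below uses only numbers stated above; the variable names are illustrative, and the recall value is derived here from those counts rather than quoted from the paper.

```python
# Counts and AUCs as reported in the summary above.
n_baseline_features = 5000   # variance baseline
n_selected = 17              # LLM-filtered gene set
n_validated = 6              # selected genes confirmed by the audit
n_known_in_input = 16        # known BRCA genes among the LLM's candidates

auc_llm, auc_baseline, auc_saliency = 0.927, 0.903, 0.832

feature_reduction = n_baseline_features / n_selected   # about 294.1
precision = n_validated / n_selected                    # about 0.353
n_missed = n_known_in_input - n_validated               # 10
recall = n_validated / n_known_in_input                 # 0.375 (derived, not quoted)

gain_vs_baseline = round(auc_llm - auc_baseline, 3)     # 0.024
gain_vs_saliency = round(auc_llm - auc_saliency, 3)     # 0.095

print(f"{feature_reduction:.0f}x fewer features")
print(f"precision {precision:.1%}, recall {recall:.1%}, missed {n_missed}")
print(f"AUC gain: +{gain_vs_baseline} vs baseline, +{gain_vs_saliency} vs raw saliency")
```

Read together, the numbers show a filter that lifts AUC while confirming roughly a third of its picks and recovering fewer than half of the known genes it was shown.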

Key takeaway

AI engineers building genomic feature selection pipelines should consider adding LLM-based reasoning to filter noisy gradient saliency features. LLM filtering can substantially improve classification performance by removing confounders, but high task metrics do not imply biological faithfulness. Audit the LLM-selected gene set independently against curated biological databases to expose its limitations and false negatives, such as FOXA1, before deploying it in critical applications.

Key insights

LLM filtering raised AUC from 0.832 (raw 50-gene saliency set) to 0.927 (17-gene set) by removing confounders, even though it retained only 6 of the 16 known BRCA genes in its input.

Principles

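Gradient saliency alone can degrade classifier performance: the raw 50-gene saliency set (AUC 0.832) underperformed the 5,000-gene variance baseline (AUC 0.903).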
LLM reasoning can be causally necessary for performance gains.
Task performance is an unreliable proxy for reasoning faithfulness.

Method

A Mamba SSM generates gradient saliency scores from RNA-seq data, which are then filtered by a DeepSeek-R1 LLM using structured chain-of-thought reasoning and explicit rejection/keep criteria to refine biomarker candidates.
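The two-stage flow described above (rank features by gradient saliency, then pass the top candidates through a keep/reject filter) can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: a tiny tanh network with random weights stands in for the Mamba SSM, the gene names and data are synthetic, and `hypothetical_rejects` is a placeholder for the verdicts a reasoning LLM would return.

```python
import math
import random

def logit(x, W1, b1, w2):
    """Stand-in classifier (NOT Mamba): logit = w2 . tanh(W1 x + b1)."""
    return sum(
        w * math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b, w in zip(W1, b1, w2)
    )

def logit_gradient(x, W1, b1, w2):
    """Analytic gradient of the logit with respect to each input feature."""
    grad = [0.0] * len(x)
    for row, b, w in zip(W1, b1, w2):
        h = math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
        scale = w * (1.0 - h * h)  # chain rule through tanh
        for j, wij in enumerate(row):
            grad[j] += scale * wij
    return grad

def saliency_scores(samples, W1, b1, w2):
    """Mean absolute input gradient per feature, averaged over samples."""
    totals = [0.0] * len(samples[0])
    for x in samples:
        for j, g in enumerate(logit_gradient(x, W1, b1, w2)):
            totals[j] += abs(g)
    return [t / len(samples) for t in totals]

def top_k(scores, names, k):
    """Names of the k features with the highest saliency."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [names[j] for j in ranked[:k]]

def filter_candidates(candidates, keep_fn):
    """Keep only the candidates the reviewer accepts (LLM stand-in)."""
    return [g for g in candidates if keep_fn(g)]

# Synthetic data and weights; every value here is illustrative.
random.seed(0)
n_genes, n_hidden, n_samples = 30, 4, 8
names = [f"GENE_{j}" for j in range(n_genes)]
W1 = [[random.gauss(0, 0.3) for _ in range(n_genes)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [random.gauss(0, 1) for _ in range(n_hidden)]
samples = [[random.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]

scores = saliency_scores(samples, W1, b1, w2)
candidates = top_k(scores, names, k=10)
hypothetical_rejects = set(candidates[::2])  # placeholder for LLM "reject" verdicts
selected = filter_candidates(candidates, lambda g: g not in hypothetical_rejects)
```

Keeping the reviewer behind a plain `keep_fn` callable separates ranking from filtering, so the LLM step can be swapped out or ablated without touching the saliency code.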

In practice

Use Mamba SSM for efficient high-dimensional genomic data processing.
Apply LLM CoT for targeted confounder removal in feature selection.
Audit LLM reasoning against ground-truth databases.
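The audit step can be expressed as simple set arithmetic. The sketch below assumes the reference biomarkers are available as a plain list of gene symbols; `audit_selection` and the placeholder gene names are hypothetical, and loading real entries from COSMIC CGC, OncoKB, or PAM50 is left out.

```python
def audit_selection(selected, candidates, reference):
    """Compare an LLM-selected gene set against a curated reference list.

    selected:   genes the LLM kept
    candidates: genes the LLM was shown (e.g. the top-k saliency set)
    reference:  curated biomarker symbols (assumed to be pre-loaded)
    """
    selected, candidates, reference = set(selected), set(candidates), set(reference)
    known_in_input = candidates & reference   # known biomarkers the LLM actually saw
    hits = selected & reference               # selected genes that are validated
    return {
        "precision": len(hits) / len(selected) if selected else 0.0,
        "recall": len(hits) / len(known_in_input) if known_in_input else 0.0,
        "validated": sorted(hits),
        "missed": sorted(known_in_input - selected),  # false negatives
        "unvalidated": sorted(selected - reference),  # kept, but not in the reference
    }

# Toy example with placeholder symbols, not real genes.
candidates = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8"]
reference = ["G1", "G2", "G3", "G4", "X9"]
selected = ["G1", "G2", "G5"]

report = audit_selection(selected, candidates, reference)
print(report)
```

Recall is measured against the known genes present in the candidates rather than the whole reference list, so the filter is not penalized for genes it never saw. The "missed" list is where an omission such as FOXA1 would surface.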

Topics

Mamba SSM
LLM Chain-of-Thought
Biomarker Discovery
Feature Selection
TCGA-BRCA RNA-seq

Code references

pushpakumarbalan/feature-selection

Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.