FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

FraudSMSWalker is a benchmark for evaluating agentic Large Language Models (LLMs) on cross-channel SMS-to-webpage fraud detection. It targets two limitations of existing evaluations: they either focus solely on message content or let models exploit URL and domain reputation shortcuts. The benchmark comprises 699 bilingual chains (332 fraudulent, 367 benign) spanning ten service scenarios. It masks raw URLs, hosts, domains, IPs, redirects, and reputation metadata, so models receive only the SMS context and sanitized webpage evidence. It also includes "hard benign" cases, whose login or payment elements are legitimate in context but resemble scam flows. Initial evaluations of nine web agents show that they can identify suspicious cues but struggle with benign recall and often produce weakly supported positive predictions. FraudSMSWalker therefore tests whether fraud judgments are grounded in evidence when direct reputation shortcuts are unavailable.

Key takeaway

For AI security engineers developing fraud detection systems, this research highlights a weakness: current agentic LLMs often fail to make accurate, evidence-grounded fraud judgments without URL reputation data. Prioritize training models on content alignment between the SMS and the webpage it leads to, especially for "hard benign" cases. Improve benign recall to reduce false positives, so that your systems can still detect sophisticated cross-channel scams when direct URL cues are suppressed.
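Benign recall, mentioned above, is the share of truly benign chains that a system also labels benign; a low value means many false positives. The following is a minimal sketch of that calculation. The helper name `recall_by_class` and the toy labels are assumptions for illustration and are not taken from the benchmark or its results.

```python
def recall_by_class(labels, preds, target):
    """Fraction of cases whose true label is `target` that were predicted as `target`."""
    relevant = [p for y, p in zip(labels, preds) if y == target]
    if not relevant:
        raise ValueError(f"no cases with label {target!r}")
    return sum(p == target for p in relevant) / len(relevant)


# Made-up toy data, not benchmark results.
labels = ["fraud", "benign", "benign", "fraud", "benign"]
preds = ["fraud", "fraud", "benign", "fraud", "fraud"]

benign_recall = recall_by_class(labels, preds, "benign")
fraud_recall = recall_by_class(labels, preds, "fraud")
print(f"benign recall: {benign_recall:.2f}, fraud recall: {fraud_recall:.2f}")
```

In this toy case the system catches every fraudulent chain but flags two of three benign ones, which is the kind of imbalance the takeaway warns about.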

Key insights

Agentic LLMs struggle with evidence-grounded SMS-to-webpage fraud detection when URL reputation shortcuts are removed.

Principles

Method

FraudSMSWalker evaluates web agents using URL-masked browser protocols, providing SMS context and sanitized webpage evidence for fraud judgment across ten service scenarios.
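The masking idea can be illustrated with a short sketch. The regular expressions, placeholder tokens, and the function name `mask_identifiers` below are hypothetical; this summary does not describe the benchmark's actual sanitization rules, which also cover redirects and reputation metadata.

```python
import re

# Hypothetical patterns for illustration only; not the benchmark's real rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def mask_identifiers(text: str) -> str:
    """Replace URLs, IPv4 addresses, and bare domains with fixed placeholders."""
    text = URL_RE.sub("[URL]", text)        # full URLs first, so their hosts are not re-matched
    text = IPV4_RE.sub("[IP]", text)        # then raw IPv4 addresses
    text = DOMAIN_RE.sub("[DOMAIN]", text)  # finally bare domain names
    return text


sms = "Your parcel is held. Pay at http://pay-post.example.com/x?id=1 now"
print(mask_identifiers(sms))
```

With identifiers replaced this way, a model has to judge the chain from what the SMS claims and what the page shows, rather than from how the address looks.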

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.