Can Small GenAI Language Models Rival Large Language Models in Understanding Application Behavior?

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Software Development & Engineering) · Depth: Expert, long

Summary

A study by Mohammad Meymani et al. from the Canadian Institute for Cybersecurity systematically evaluated small and large Generative AI (GenAI) language models on understanding application behavior, with a focus on malware detection. Using 10,000 samples from the SBAN dataset, the authors compared models such as DeepSeek, Phi, Llama, Qwen, and Mistral, primarily through a prompt-based strategy. Larger models, such as Qwen2.5-Coder, generally achieved higher overall accuracy. Smaller models nonetheless showed competitive precision and recall: Phi-4 mini achieved a high F1 score on the malware class (Class "1") and the second-highest F1 score on the benign class (Class "0"). Small models also offer lower computational cost, faster inference, and a better fit for resource-constrained environments, although some, such as Deepseek-coder-1.3b, showed lower recall on benign samples. The findings suggest that small GenAI models can effectively complement larger ones by balancing performance against resource efficiency.

Key takeaway

For AI Security Engineers evaluating malware detection solutions, this research indicates that small language models (SLMs) like Phi-4 mini offer a compelling balance of performance and resource efficiency. You can achieve competitive detection capabilities with significantly faster inference and lower computational costs than larger LLMs. Consider deploying SLMs in resource-constrained environments or for real-time analysis, and explore fine-tuning or model compression to further enhance their robustness.

Key insights

Small GenAI models can rival larger LLMs in malware detection, offering efficiency for resource-constrained environments.

Method

The study evaluated GenAI models for malware detection on 10,000 samples from the SBAN dataset. It compared a classification-head strategy with a zero-shot prompt-based strategy, focusing on the latter, and measured accuracy, precision, recall, and F1-score.
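
The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the prompt wording, the `build_prompt` and `parse_label` helpers, and the toy replies are all assumptions, since the summary does not reproduce the paper's prompt or outputs. The sketch turns free-text model replies into 0/1 labels and then computes accuracy plus per-class precision, recall, and F1.

```python
from typing import List, Optional, Tuple


def build_prompt(source_code: str) -> str:
    """Hypothetical zero-shot prompt; the paper's actual wording is not given here."""
    return (
        "You are a malware analyst. Classify the following program.\n"
        "Answer with a single digit: 1 for malware, 0 for benign.\n\n"
        f"{source_code}\n\nAnswer:"
    )


def parse_label(reply: str) -> Optional[int]:
    """Return the first '0' or '1' found in the reply, or None if there is none.

    Deliberately simplistic: a real parser would need stricter rules.
    """
    for ch in reply:
        if ch in "01":
            return int(ch)
    return None


def accuracy(y_true: List[int], y_pred: List[Optional[int]]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(
    y_true: List[int], y_pred: List[Optional[int]], positive: int
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when `positive` is treated as the target class.

    An unparseable reply (None) never matches a class, so it counts as a miss.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (made up for illustration): four malware and four benign samples,
# with a model that over-predicts malware.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
replies = ["1", "1", "1", "1", "1", "Label: 1", "0", "0 (benign)"]
y_pred = [parse_label(r) for r in replies]

overall = accuracy(y_true, y_pred)
malware = per_class_metrics(y_true, y_pred, positive=1)  # Class "1"
benign = per_class_metrics(y_true, y_pred, positive=0)   # Class "0"
```

In this toy run the model labels two benign samples as malware, so recall on Class "0" drops while recall on Class "1" stays perfect. That is the kind of per-class asymmetry the summary reports for some small models, and it is why overall accuracy alone can hide weak benign-class behavior.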

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.