MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Cybersecurity & Data Privacy, Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

MASCOT-Android is a curated dataset of Android malware source code, paired with an automated collection framework that addresses the scarcity of such data and the high cost of manual review. The framework supports scalable discovery of malware source code on GitHub. A key finding is that repository-level documentation, specifically README files, provides a strong signal for identifying malware source code. The framework uses a LinearSVC classifier trained on character-level TF-IDF features extracted from 8,772 malware and 25,747 benign README documents. This "README-only" model achieved 96.28% accuracy and a 1.06% false positive rate (FPR) in local evaluation. The model also outputs confidence scores, so users can adjust the decision threshold to trade FPR against coverage in practical collection scenarios.
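
The threshold trade-off described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes the classifier emits one signed confidence score per README (higher means more malware-like), and it treats "coverage" as the fraction of true malware READMEs that get flagged, which is an interpretation rather than a definition from the source. The function names and the toy scores are hypothetical.

```python
def fpr_and_coverage(scores, labels, threshold):
    """Flag items with score >= threshold and return (FPR, coverage).

    labels: 1 = malware, 0 = benign.
    Coverage is taken here to mean recall on malware (an assumption).
    """
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    negatives = labels.count(0)
    positives = labels.count(1)
    fpr = fp / negatives if negatives else 0.0
    coverage = tp / positives if positives else 0.0
    return fpr, coverage


def pick_threshold(scores, labels, max_fpr):
    """Return the lowest observed score usable as a threshold within max_fpr.

    FPR never increases as the threshold rises, so the lowest qualifying
    threshold is also the one with the highest coverage.
    """
    for t in sorted(set(scores)):
        fpr, _ = fpr_and_coverage(scores, labels, t)
        if fpr <= max_fpr:
            return t
    return float("inf")  # no candidate fits the budget: flag nothing


# Toy validation scores, invented for illustration only.
scores = [-1.2, -0.4, 0.1, -0.5, 0.9, 1.5, -0.8, 0.6]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

strict = pick_threshold(scores, labels, max_fpr=0.0)
loose = pick_threshold(scores, labels, max_fpr=0.5)
print(strict, fpr_and_coverage(scores, labels, strict))
print(loose, fpr_and_coverage(scores, labels, loose))
```

A stricter FPR budget pushes the threshold up and leaves some malware READMEs unflagged, while a looser budget recovers them at the cost of more benign repositories reaching manual review.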

Key takeaway

For AI security engineers and research scientists building Android malware detection systems, MASCOT-Android demonstrates a scalable method for collecting malware source code. Consider adding automated screening of repository documentation, specifically README files, to your malware intelligence pipeline. A LinearSVC classifier on character-level TF-IDF features can reduce manual review overhead and speed up dataset curation, and its confidence scores let you tune the decision threshold to balance false positives against coverage.

Key insights

Repository READMEs provide a strong, automatable signal for identifying Android malware source code, enabling scalable dataset collection.

Principles

Repository documentation is a strong signal of malware source code.
Automated collection reduces manual review costs.

Method

Character-level TF-IDF features are extracted from repository READMEs. A LinearSVC classifier is trained on these features to separate malware repositories from benign ones, and its confidence scores allow the decision threshold to be adjusted to balance FPR against coverage.
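
To make the feature step concrete, here is a minimal standard-library sketch of character-level TF-IDF. It is not the paper's implementation. The n-gram range (2 to 4), the lowercasing, the smoothed IDF formula, and the L2 normalisation are all assumptions chosen for illustration, and the README snippets are invented. In practice a library vectorizer (for example scikit-learn's TfidfVectorizer with analyzer="char") followed by a linear SVM would play this role.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max from lowercased text."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def fit_idf(documents, n_min=2, n_max=4):
    """Compute a smoothed inverse document frequency for every n-gram seen."""
    doc_freq = Counter()
    for doc in documents:
        # Count each n-gram at most once per document.
        doc_freq.update(set(char_ngrams(doc, n_min, n_max)))
    total = len(documents)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf_vector(text, idf, n_min=2, n_max=4):
    """Map one document to an L2-normalised sparse TF-IDF vector (a dict)."""
    counts = Counter(g for g in char_ngrams(text, n_min, n_max) if g in idf)
    weights = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


# Toy README snippets, invented for illustration only.
readmes = [
    "Android RAT with SMS stealer and keylogger payload",
    "Banking trojan builder, overlay injection for Android",
    "A simple to-do list app written in Kotlin",
    "Weather widget for Android using a public API",
]
idf = fit_idf(readmes)
vec = tfidf_vector("android keylogger", idf)

# N-grams that appear in fewer READMEs receive a higher IDF weight.
print(idf["key"] > idf["an"])
```

Character n-grams tolerate misspellings, obfuscated spellings, and mixed-language text better than word tokens, which is one plausible reason to prefer them for README screening. The sparse vectors produced here would then be passed to a linear classifier.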

In practice

Discover Android malware source code on GitHub.
Automate malware source code dataset building.
Use READMEs for initial threat screening.

Topics

Android Malware
Source Code Analysis
Dataset Curation
Machine Learning
GitHub
Cybersecurity

Best for: AI Security Engineer, AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.