VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

VectraYX-Nano is a 41.95M-parameter Spanish cybersecurity language model, trained from scratch with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model is trained on VectraYX-Sec-ES, a 170M-token Spanish corpus built through an eight-VM distributed pipeline at a cost of approximately US$25. Training follows a three-phase curriculum (conversational, then cybersecurity, then tooling), with explicit replay buffers to prevent catastrophic forgetting. Two findings stand out. First, at nano scale a higher-perplexity bootstrap corpus (OpenSubtitles-ES) yields better chat behavior than a lower-perplexity alternative (mC4-ES). Second, the emergence of tool use is gated by corpus density, not parametric capacity: one tool-use example per 20 SFT examples is sufficient for reliable tool dispatch at 42M parameters. The model is released with training scripts, configuration files, and GGUF artifacts.
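
The 1:20 density figure can be checked against an SFT dataset before training. The following is a minimal sketch, not the authors' tooling: the `uses_tool` field and both function names are hypothetical, and it assumes each SFT example is a dictionary flagged by whether it contains a tool call.

```python
from fractions import Fraction


def tool_use_density(examples):
    """Return the share of SFT examples that contain a tool call.

    Assumption: each example is a dict with a boolean "uses_tool" key.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    tool_count = sum(1 for ex in examples if ex.get("uses_tool"))
    return Fraction(tool_count, len(examples))


def meets_density(examples, minimum=Fraction(1, 20)):
    """Check whether tool-use examples make up at least `minimum` of the set."""
    return tool_use_density(examples) >= minimum


# Synthetic dataset: every 20th example is a tool-use example.
sft_examples = [{"uses_tool": i % 20 == 0} for i in range(1000)]
# Sparse dataset: only every 100th example is a tool-use example.
sparse_examples = [{"uses_tool": i % 100 == 0} for i in range(1000)]
```

Exact fractions are used so that a dataset sitting precisely at 1:20 is not rejected because of floating-point rounding.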

Key takeaway

AI engineers developing specialized, small-scale language models should choose a bootstrap corpus for how well its register matches the target use rather than for low perplexity, especially for chat applications. They should also keep tool-use examples at a density of at least 1:20 within the SFT data to enable reliable tool invocation. Because such models lack RLHF alignment, safety policies such as command filtering and output review should be layered at the runtime level rather than left to the model weights alone.
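
Runtime command filtering of the kind recommended here can be sketched as an allowlist check applied before any model-proposed command is executed. This is an illustrative assumption, not part of the released VectraYX-Nano code: the allowlist contents, the metacharacter set, and the function name `is_command_allowed` are all hypothetical.

```python
import shlex

# Hypothetical allowlist of executables the runtime is willing to dispatch.
ALLOWED_COMMANDS = {"nmap", "whois", "dig"}

# Characters that enable chaining, redirection, or substitution in a shell.
SHELL_METACHARACTERS = (";", "&", "|", ">", "<", "`", "$")


def is_command_allowed(command_line):
    """Return True only for a single allowlisted command with no shell tricks."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes and similar parse failures are rejected outright.
        return False
    if not tokens:
        return False
    if tokens[0] not in ALLOWED_COMMANDS:
        return False
    # Reject anything that could chain or redirect to another command.
    return not any(ch in command_line for ch in SHELL_METACHARACTERS)
```

An allowlist fails closed: a command the policy has never seen is rejected by default, which suits a model whose outputs have not been aligned. Output review would be a separate step applied after execution.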

Key insights

At nano scale, effective chat behavior depends on choosing a register-matched bootstrap corpus, and reliable tool use depends on a sufficient density of tool-use examples in the SFT data.

Method

VectraYX-Nano was trained from scratch using a three-phase curriculum (conversational, cybersecurity, tooling), with replay buffers of 10-25% carrying earlier-phase data into later phases. It integrates native tool invocation via the Model Context Protocol (MCP) and relies on a ratio of one tool-use example per 20 SFT examples for reliable tool dispatch.
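
One way to picture a replay buffer is as a sampling step that mixes earlier-phase examples into the current phase's data. The sketch below is an assumption about how such mixing could work, not the released training script: it treats the replay percentage as the share of the final mix drawn from earlier phases, and `build_phase_mix` is a hypothetical name.

```python
import random


def build_phase_mix(current, earlier, replay_fraction, seed=0):
    """Combine current-phase examples with replayed earlier-phase examples.

    Assumption: replay_fraction is the share of the returned mix that
    comes from earlier phases (for example 0.20 for a 20% replay buffer).
    """
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (len(current) + n_replay) = replay_fraction.
    n_replay = round(len(current) * replay_fraction / (1.0 - replay_fraction))
    # Cannot replay more examples than the earlier phases provide.
    n_replay = min(n_replay, len(earlier))
    mix = list(current) + rng.sample(list(earlier), n_replay)
    rng.shuffle(mix)
    return mix


# Phase 2 (cybersecurity) with 20% of the mix replayed from phase 1 (chat).
chat_examples = [f"chat-{i}" for i in range(100)]
security_examples = [f"sec-{i}" for i in range(80)]
phase_two = build_phase_mix(security_examples, chat_examples, replay_fraction=0.20)
```

With 80 cybersecurity examples and a 20% replay share, the mix contains 20 replayed chat examples, for 100 examples in total.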

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.