VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
Summary
VectraYX-Nano is a 41.95M-parameter Spanish cybersecurity language model, trained from scratch with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model uses a 170M-token Spanish corpus, VectraYX-Sec-ES, built through an eight-VM distributed pipeline at a cost of approximately $25 (USD). Training follows a three-phase curriculum (conversational, cybersecurity, tooling) with explicit replay buffers to prevent catastrophic forgetting. Two findings stand out. First, a higher-perplexity bootstrap corpus (OpenSubtitles-ES) yields better chat behavior at nano scale than a lower-perplexity alternative (mC4-ES). Second, tool-use emergence is gated by corpus density, not parametric capacity: one tool-use example per 20 total SFT examples (a 1:20 ratio) is sufficient for reliable tool dispatch at 42M parameters. The model is released with training scripts, configuration files, and GGUF artifacts.
Key takeaway
AI engineers building specialized, small-scale language models should choose a bootstrap corpus for its register match rather than its low perplexity, especially for chat applications. Keep tool-use examples at a density of at least 1:20 within the SFT data to enable reliable tool invocation. Because such models have no RLHF alignment, layer safety policies at the runtime level, such as command filtering and output review, rather than relying solely on model weights.
Key insights
Small language models require careful corpus selection and density for effective chat behavior and tool use.
Principles
- Bootstrap corpus register-matching is crucial for small-scale chat models.
- Tool-use emergence is gated by corpus density, not parametric capacity.
- Replay buffers prevent catastrophic forgetting in continual pre-training.
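The density principle above reduces to simple arithmetic on example counts. The following is a minimal sketch of that check, assuming each SFT example carries a boolean `uses_tool` flag. The `SFTExample` type and the helper names are hypothetical and not taken from the released training scripts.

```python
import math
from dataclasses import dataclass
from fractions import Fraction

# Threshold from the summary: 1 tool-use example per 20 total SFT examples.
MIN_TOOL_RATIO = Fraction(1, 20)


@dataclass
class SFTExample:
    """Hypothetical SFT record; `uses_tool` marks a tool-invocation example."""
    text: str
    uses_tool: bool


def tool_use_density(examples):
    """Return tool-use examples divided by total examples as an exact fraction."""
    if not examples:
        return Fraction(0)
    n_tool = sum(1 for e in examples if e.uses_tool)
    return Fraction(n_tool, len(examples))


def meets_density(examples, min_ratio=MIN_TOOL_RATIO):
    """True when the dataset reaches the minimum tool-use density."""
    return tool_use_density(examples) >= min_ratio


def extra_tool_examples_needed(n_total, n_tool, min_ratio=MIN_TOOL_RATIO):
    """Smallest x such that (n_tool + x) / (n_total + x) >= min_ratio."""
    # Solving the inequality for x gives x >= (r * n_total - n_tool) / (1 - r).
    shortfall = (min_ratio * n_total - n_tool) / (1 - min_ratio)
    return max(0, math.ceil(shortfall))
```

For instance, a dataset of 1,000 examples with only 10 tool-use examples would need 43 more tool-use examples to reach 1:20, because the added examples also increase the total.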
Method
VectraYX-Nano was trained from scratch using a three-phase curriculum (conversational, cybersecurity, tooling), with 10-25% replay buffers carried between phases. It integrates native tool invocation via the Model Context Protocol (MCP) and relies on a 1:20 ratio of tool-use examples to total SFT examples for reliable tool dispatch.
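One way to picture a replay buffer in a phased curriculum is as a fixed share of each phase's data drawn from earlier phases. The sketch below illustrates that idea only. The function name `mix_with_replay`, the sample-level mixing (the actual pipeline may work at the token level), and sampling with replacement are all assumptions, not the authors' documented implementation.

```python
import random


def mix_with_replay(current, previous, replay_fraction, seed=0):
    """Return a shuffled list in which about `replay_fraction` of the items
    are replayed from `previous` and the rest are all of `current`."""
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_replay / (len(current) + n_replay) = replay_fraction for n_replay.
    n_replay = round(replay_fraction * len(current) / (1 - replay_fraction))
    # Assumption: replayed items are sampled with replacement from earlier phases.
    replay = [rng.choice(previous) for _ in range(n_replay)] if previous else []
    mixed = list(current) + replay
    rng.shuffle(mixed)
    return mixed


# Illustrative phase data; the strings stand in for real training samples.
conversational = [f"chat-{i}" for i in range(50)]
cybersecurity = [f"sec-{i}" for i in range(90)]

# Phase 2 batch: cybersecurity data with a 10% replay of conversational data.
phase2 = mix_with_replay(cybersecurity, conversational, replay_fraction=0.10)
```

With 90 cybersecurity samples and a 10% replay fraction, the mixed phase holds 100 samples, 10 of them replayed from the conversational phase.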
In practice
- Select a bootstrap corpus that matches the desired response register.
- Implement runtime-level command filtering in MCP for safety.
- Ensure a tool-use corpus density of at least 1:20 for small models.
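Runtime-level command filtering can be as simple as an allowlist check applied before any model-proposed command reaches a shell. The sketch below is a generic illustration and does not use any MCP library API. The allowlist contents, the metacharacter list, and the function name `is_command_allowed` are hypothetical choices for this example.

```python
import shlex

# Hypothetical allowlist of executables the runtime is willing to run.
ALLOWED_COMMANDS = {"whois", "dig", "nmap"}

# Substrings that enable chaining, redirection, or substitution in a shell.
SHELL_METACHARS = (";", "&", "|", ">", "<", "`", "$(")


def is_command_allowed(command_line):
    """Return True only for a single allowlisted command with no shell tricks."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        # Unbalanced quotes or similar parse errors are rejected outright.
        return False
    if not tokens:
        return False
    if tokens[0] not in ALLOWED_COMMANDS:
        return False
    # Conservative check on the raw string: any metacharacter blocks the call.
    return not any(meta in command_line for meta in SHELL_METACHARS)
```

The check is deliberately conservative: it rejects some harmless inputs, such as a quoted argument containing `&`, in exchange for a rule that is easy to audit. Output review, the second safeguard named in the takeaway, would sit after execution and is not shown here.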
Topics
- Spanish Language Model
- Cybersecurity LLM
- Curriculum Learning
- Model Context Protocol
- Tool Use Density
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.