Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism
Summary
Formal neural network verification proves safety properties for all inputs in a specified region, but it is often limited by GPU memory: standard bound-propagation algorithms such as IBP, CROWN, and α-CROWN require the weight and relaxation-coefficient matrices to reside on a single accelerator. This work adapts Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP) to the auto_LiRPA/α,β-CROWN framework. TP shards both the weight matrices and the A-matrices across GPUs and achieves an approximately 2x peak-memory reduction at P=2 on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness can degrade. FSDP shards only the weight matrices and gathers each layer with a per-layer AllGather; it yields bounds bitwise identical to the single-GPU baseline while reducing baseline memory by 80-90% and peak memory by 34-39% on wide MLPs. FSDP also integrates with complete verification (β-CROWN + Branch-and-Bound) and with convolutional layers, enabling an "unsat" result on CIFAR-100 ResNet-large (VNN-COMP 2024). In α-CROWN+BaB mode, the primary memory bottleneck was identified as the per-neuron alpha tensors.
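To see why sharding weights can cut baseline memory far more than peak memory, a rough back-of-the-envelope sketch may help. It counts weight storage only and assumes even sharding and one fully gathered layer at a time. The layer shapes, the shard count, and the function name `fsdp_weight_memory` are hypothetical illustrations, not values or code from the paper. Real peak memory also includes activations, A-matrices, and alpha tensors, so these numbers are not expected to match the reported percentages.

```python
def fsdp_weight_memory(layer_shapes, num_shards, bytes_per_param=4):
    """Estimate per-GPU weight memory under FSDP-style sharding.

    Returns (resident_bytes, peak_bytes):
      resident: every layer's weights split evenly across num_shards
      peak:     resident shards plus the largest layer fully gathered
                (its own shard is already counted, so it is subtracted once)
    Assumption: only one layer is gathered at any time.
    """
    sizes = [rows * cols * bytes_per_param for rows, cols in layer_shapes]
    resident = sum(sizes) / num_shards
    largest = max(sizes)
    peak = resident + largest - largest / num_shards
    return resident, peak


# Hypothetical wide MLP: 784 inputs, three hidden layers of width 4096, 10 outputs.
shapes = [(4096, 784), (4096, 4096), (4096, 4096), (10, 4096)]

full, _ = fsdp_weight_memory(shapes, num_shards=1)       # single-GPU baseline
resident, peak = fsdp_weight_memory(shapes, num_shards=4)

print(f"unsharded weights:   {full / 2**20:.1f} MiB")
print(f"resident per GPU:    {resident / 2**20:.1f} MiB")
print(f"peak weights per GPU: {peak / 2**20:.1f} MiB")
```

Under these assumptions the resident footprint shrinks in proportion to the shard count, while the peak is bounded below by the size of the largest gathered layer.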
Key takeaway
For Machine Learning Engineers building robust AI systems: if GPU memory limits your formal neural network verification runs, consider Fully Sharded Data Parallelism (FSDP). In the reported experiments, FSDP reduced peak memory by 34-39% on wide MLPs while preserving bitwise identical verification bounds, and it works with complete verification methods such as β-CROWN + Branch-and-Bound. This lets you verify larger, more complex models that previously exceeded single-GPU capacity.
Key insights
Adapting Tensor Parallelism and FSDP scales neural network verification beyond single-GPU memory constraints.
Principles
- GPU memory capacity limits formal neural network verification.
- FSDP yields verification bounds bitwise identical to the single-GPU baseline.
- Per-neuron alpha tensors are a key memory bottleneck in α-CROWN+BaB.
Method
Within a bound-propagation verification framework, apply either Tensor Parallelism, which shards both the weight matrices and the A-matrices across GPUs, or Fully Sharded Data Parallelism, which shards only the weight matrices and gathers each layer with a per-layer AllGather.
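The FSDP variant can be illustrated with a minimal single-process sketch. It uses interval bound propagation (IBP) on a tiny ReLU MLP, stores each weight matrix as row shards, and rebuilds one layer at a time with a stand-in `all_gather` function. Everything here is an assumption made for illustration: the helper names, the toy weights, and the list-concatenation "collective" are not the auto_LiRPA API or the paper's implementation. The point is only that gathering the exact same weights before each layer leaves the arithmetic, and therefore the bounds, unchanged.

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ibp_linear(W, b, lo, hi):
    """Interval bounds of W x + b for all x with lo <= x <= hi."""
    center = [(l + h) / 2 for l, h in zip(lo, hi)]
    radius = [(h - l) / 2 for l, h in zip(lo, hi)]
    mid = matvec(W, center)
    dev = matvec([[abs(w) for w in row] for row in W], radius)
    new_lo = [m + bi - d for m, bi, d in zip(mid, b, dev)]
    new_hi = [m + bi + d for m, bi, d in zip(mid, b, dev)]
    return new_lo, new_hi


def relu(v):
    return [max(0.0, x) for x in v]


def shard_rows(W, num_shards):
    """Split rows into contiguous shards, one per simulated GPU."""
    step = -(-len(W) // num_shards)  # ceiling division
    return [W[i:i + step] for i in range(0, len(W), step)]


def all_gather(shards):
    """Stand-in for a collective AllGather: concatenate shards in rank order."""
    return [row for shard in shards for row in shard]


def ibp_network(layers, lo, hi, sharded=False):
    """Propagate interval bounds through linear layers with ReLU in between."""
    for i, (W, b) in enumerate(layers):
        if sharded:
            W = all_gather(W)  # full matrix is materialized for this layer only
        lo, hi = ibp_linear(W, b, lo, hi)
        if i < len(layers) - 1:
            lo, hi = relu(lo), relu(hi)
    return lo, hi


# Toy 2-3-2 network with made-up weights.
layers = [
    ([[1.0, -2.0], [0.5, 0.3], [-1.5, 0.8]], [0.1, -0.2, 0.0]),
    ([[0.7, -1.1, 0.4], [-0.3, 0.9, 1.2]], [0.0, 0.05]),
]
sharded_layers = [(shard_rows(W, 2), b) for W, b in layers]

x_lo, x_hi = [-0.1, 0.4], [0.1, 0.6]
baseline_bounds = ibp_network(layers, x_lo, x_hi)
sharded_bounds = ibp_network(sharded_layers, x_lo, x_hi, sharded=True)

print("baseline:", baseline_bounds)
print("sharded: ", sharded_bounds)
```

In a real multi-GPU setting the gather would be a collective communication call and the gathered weights would be freed after each layer; this sketch keeps only the data flow.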
In practice
- Apply FSDP for memory-efficient, bitwise identical verification bounds.
- Integrate FSDP with β-CROWN + Branch-and-Bound for complete verification.
Topics
- Neural Network Verification
- Tensor Parallelism
- Fully Sharded Data Parallelism
- GPU Memory Optimization
- Bound Propagation
- α-CROWN
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.