Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Formal neural network verification, which proves safety properties for all inputs, is often limited by GPU memory because standard bound-propagation algorithms like IBP, CROWN, and α-CROWN require weight and relaxation-coefficient matrices to reside on a single accelerator. This work adapts Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP) to the auto_LiRPA/α,β-CROWN framework. TP shards both weight and A-matrices across GPUs, achieving approximately 2x peak-memory reduction at P=2, validated on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness can degrade. FSDP, which shards only weight matrices with a per-layer AllGather, yields bounds bitwise identical to single-GPU baselines, reducing baseline memory by 80-90% and peak memory by 34-39% on wide MLPs. FSDP integrates with complete verification (β-CROWN + Branch-and-Bound) and convolutional layers, enabling an "unsat" result for CIFAR-100 ResNet-large (VNN-COMP 2024). The primary memory bottleneck in α-CROWN+BaB mode was identified as per-neuron alpha tensors.

Key takeaway

If you are hitting GPU memory limits during formal neural network verification, consider Fully Sharded Data Parallelism (FSDP). In the reported experiments, FSDP cut peak memory by 34-39% on wide MLPs while keeping verification bounds bitwise identical to the single-GPU baseline, which makes it suitable for complete verification methods such as β-CROWN + Branch-and-Bound. This lets you verify larger, more complex models that previously exceeded single-GPU capacity.

Key insights

Adapting Tensor Parallelism and FSDP scales neural network verification beyond single-GPU memory constraints.

Principles

Method

Adapt Tensor Parallelism by sharding weight and A-matrices, or Fully Sharded Data Parallelism by sharding only weight matrices with per-layer AllGather, within bound-propagation verification frameworks.
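The FSDP variant of this method can be illustrated with a minimal sketch. It is not the auto_LiRPA/α,β-CROWN implementation: it uses plain Python lists instead of GPU tensors, simulates ranks as list entries, and replaces the collective AllGather with a simple concatenation (a real setup would presumably use something like torch.distributed.all_gather). All function names here are hypothetical. It uses IBP, the simplest of the bound-propagation algorithms named above, and shows why sharding only the weights leaves the bounds unchanged: each layer's full weight matrix is reassembled exactly before the bound computation, so the arithmetic is the same as on a single device.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def ibp_linear(W: Matrix, b: Vector, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """Interval bounds of W @ x + b for all x with lo <= x <= hi (elementwise)."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = u = bias
        for w, xl, xu in zip(row, lo, hi):
            # A non-negative weight keeps the interval order; a negative one swaps it.
            l += w * xl if w >= 0 else w * xu
            u += w * xu if w >= 0 else w * xl
        out_lo.append(l)
        out_hi.append(u)
    return out_lo, out_hi


def relu_bounds(lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """ReLU is monotone, so it can be applied to both interval ends."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]


def shard_rows(W: Matrix, num_ranks: int) -> List[Matrix]:
    """Split W into contiguous row blocks, one per simulated rank."""
    chunk = (len(W) + num_ranks - 1) // num_ranks
    return [W[r * chunk:(r + 1) * chunk] for r in range(num_ranks)]


def all_gather(shards: List[Matrix]) -> Matrix:
    """Stand-in for a collective AllGather: concatenate shards in rank order."""
    return [row for shard in shards for row in shard]


def ibp_network(layers, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """Single-device baseline: every full weight matrix is resident at once."""
    for i, (W, b) in enumerate(layers):
        lo, hi = ibp_linear(W, b, lo, hi)
        if i < len(layers) - 1:
            lo, hi = relu_bounds(lo, hi)
    return lo, hi


def ibp_network_sharded(sharded_layers, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """FSDP-style pass: gather one layer's weights, use them, then drop them."""
    for i, (shards, b) in enumerate(sharded_layers):
        W = all_gather(shards)  # the full matrix exists only for this layer
        lo, hi = ibp_linear(W, b, lo, hi)
        del W  # release the gathered copy before moving to the next layer
        if i < len(sharded_layers) - 1:
            lo, hi = relu_bounds(lo, hi)
    return lo, hi


# Toy 2-3-1 MLP with illustrative weights and an input box around the origin.
layers = [
    ([[1.0, -2.0], [0.5, 0.25], [-1.0, 1.0]], [0.0, 0.1, -0.2]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
sharded_layers = [(shard_rows(W, 2), b) for W, b in layers]
x_lo, x_hi = [-0.1, -0.1], [0.1, 0.1]

baseline = ibp_network(layers, x_lo, x_hi)
sharded = ibp_network_sharded(sharded_layers, x_lo, x_hi)
print("bounds identical:", baseline == sharded)
```

In this sketch each simulated rank holds roughly 1/P of every weight matrix at rest, and a full matrix is materialized for only one layer at a time. Tensor Parallelism would additionally shard the relaxation-coefficient (A) matrices, which this sketch does not attempt to model.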

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.