Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences, Software Development & Engineering · Depth: Expert, quick

Summary

Sergei Vorobyov and Eugene Ilyushin introduce methods to scale formal neural network verification, which is often limited by GPU memory. Their work adapts two parallelism schemes from training, Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP), to the `auto_LiRPA` / α,β-CROWN verification framework. TP shards both the weight matrices and the A-matrices (the linear bound coefficient matrices) across GPUs and achieves an approximately 2x peak-memory reduction with P=2. Its bounds were confirmed sound on the VNN-COMP 2022 MNIST-FC benchmarks, but they were looser than the single-GPU bounds. FSDP shards only the weight matrices and reassembles each layer with a per-layer `AllGather`, producing bounds bitwise identical to the single-GPU baseline. On wide MLPs, FSDP reduced baseline memory by 80-90% and peak memory by 34-39%. It also integrates with complete verification (β-CROWN + Branch-and-Bound) and with convolutional layers, and it enabled a complete *unsat* result for CIFAR-100 ResNet-large (VNN-COMP 2024). The authors note that in α-CROWN+BaB mode the primary memory bottleneck is the per-neuron alpha tensors, not the weight matrices.
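The TP idea of sharding both the weight matrix and the A-matrix can be illustrated on a single backward step through a linear layer, where the coefficient matrix is updated as `A_prev = A @ W`. The NumPy sketch below is only an assumption about how such a split might look, not the authors' implementation: the ranks are simulated with Python lists, the function name `tp_backward_linear` is hypothetical, and the ReLU relaxation is not modelled, so the sketch says nothing about the tightness loss reported for TP.

```python
import numpy as np

def tp_backward_linear(A_shards, W_shards):
    """One backward step A_prev = A @ W with A and W both sharded.

    Rank p holds a column block of A and the matching row block of W,
    computes a partial product, and the partials are summed
    (the role an all-reduce would play on real GPUs).
    """
    partials = [A_p @ W_p for A_p, W_p in zip(A_shards, W_shards)]
    return np.sum(partials, axis=0)

rng = np.random.default_rng(1)
n_spec, n_out, n_in, P = 3, 12, 5, 2        # illustrative sizes only

A = rng.standard_normal((n_spec, n_out))    # coefficients w.r.t. the layer output
W = rng.standard_normal((n_out, n_in))      # layer weight matrix

# Shard along the shared dimension: columns of A, rows of W.
A_shards = np.array_split(A, P, axis=1)
W_shards = np.array_split(W, P, axis=0)

A_prev_tp = tp_backward_linear(A_shards, W_shards)
A_prev_ref = A @ W                          # unsharded reference

print("max abs difference:", np.abs(A_prev_tp - A_prev_ref).max())
```

Each simulated rank stores only `1/P` of both matrices. Because the partial products are summed in a different order than in the unsharded product, the two results should be expected to agree to floating-point tolerance rather than bit for bit.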

Key takeaway

AI architects designing robust verification systems should consider integrating Fully Sharded Data Parallelism (FSDP) into their `auto_LiRPA` / α,β-CROWN workflows. On wide MLPs, FSDP reduced baseline GPU memory by 80-90% and peak memory by 34-39% while keeping verification bounds bitwise identical to the single-GPU baseline. It also enabled complete verification of a larger model, CIFAR-100 ResNet-large. The approach makes it possible to verify larger networks without loosening the bounds, addressing a critical scaling bottleneck.

Key insights

Adapting parallelism from training can mitigate GPU memory limits in neural network verification.

Principles

Method

Tensor Parallelism (TP) shards both the weight matrices and the A-matrices across GPUs. Fully Sharded Data Parallelism (FSDP) shards only the weight matrices and reassembles each layer with a per-layer `AllGather`. Both are adapted to the `auto_LiRPA` / α,β-CROWN framework.
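The FSDP pattern described above (weights stored in shards, one layer gathered at a time) can be sketched in NumPy. This is a minimal single-process simulation under stated assumptions, not the authors' code: simulated ranks are Python lists, `all_gather` is a stand-in for a real collective, and plain interval bound propagation replaces the CROWN-family bounds computed by `auto_LiRPA`. All function names are hypothetical.

```python
import numpy as np

def shard_rows(W, P):
    """Split a weight matrix into P row blocks, one per simulated GPU."""
    return np.array_split(W, P, axis=0)

def all_gather(shards):
    """Stand-in for a collective AllGather: rebuild the full matrix."""
    return np.concatenate(shards, axis=0)

def interval_linear(W, b, lo, hi):
    """Sound interval bounds for y = W x + b when lo <= x <= hi."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def ibp(layers, lo, hi, gather=lambda W: W):
    """Propagate bounds through Linear -> ReLU -> ... -> Linear."""
    for i, (W, b) in enumerate(layers):
        W_full = gather(W)   # only one full layer is materialised at a time
        lo, hi = interval_linear(W_full, b, lo, hi)
        if i < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
    return lo, hi

rng = np.random.default_rng(0)
sizes = [8, 64, 64, 4]       # illustrative MLP widths
P, eps = 4, 0.1              # simulated GPU count and input radius

layers = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
sharded = [(shard_rows(W, P), b) for W, b in layers]
x0 = rng.standard_normal(sizes[0])

base_lo, base_hi = ibp(layers, x0 - eps, x0 + eps)
fsdp_lo, fsdp_hi = ibp(sharded, x0 - eps, x0 + eps, gather=all_gather)

total = sum(W.size for W, _ in layers)              # weights on one GPU
per_rank = sum(s[0].size for s, _ in sharded)       # resident shard on rank 0
print("weights per rank:", per_rank, "of", total)
print("bounds match baseline:",
      np.allclose(fsdp_lo, base_lo) and np.allclose(fsdp_hi, base_hi))
```

The gathered matrix is an exact copy of the original weights, so the sharded run performs the same arithmetic as the baseline. In this simulation the memory saving applies only to the weights at rest; the gathered layer and the bound tensors still occupy full size while a layer is being processed.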

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.