ConvNeXt: A ConvNet Wakes Up in the 2020s

· Source: Deep Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, medium

Summary

ConvNeXt, a convolutional neural network introduced in early 2022 by researchers at Facebook AI Research and UC Berkeley, shows that a standard ConvNet can match or outperform a Vision Transformer such as Swin Transformer once it is modernized with contemporary design choices and training recipes. Starting from a ResNet-50 at 76.1% ImageNet-1K top-1 accuracy, the researchers applied Transformer-era changes one at a time. A 300-epoch training schedule with AdamW and stronger data augmentation alone raised accuracy to 78.8%. The architectural changes that followed were: changing the stage compute ratio (blocks per stage) to (3, 3, 9, 3), replacing the stem with a non-overlapping 4x4 stride-4 ("patchify") convolution, switching to depthwise convolutions while widening the network to 96 channels, adopting an inverted bottleneck, and enlarging the depthwise kernel to 7x7. Final micro-design changes (GELU in place of ReLU, fewer activation and normalization layers per block, and LayerNorm in place of BatchNorm) brought ConvNeXt to 82.0% top-1 accuracy, ahead of Swin-T's 81.3% at a comparable FLOP budget. The step-by-step process suggests that many gains attributed to attention-based models come from the surrounding design decisions, and it positions ConvNeXt as a strong backbone for a range of computer vision tasks.
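To make the block-level changes concrete, here is a minimal PyTorch sketch of a single block that combines the ingredients listed above: a 7x7 depthwise convolution, one LayerNorm, an inverted bottleneck, and one GELU. The class name `ConvNeXtBlockSketch` and the expansion factor of 4 are assumptions for illustration and are not stated in the summary. Refinements such as layer scale and stochastic depth are left out, so treat this as a sketch and not the reference implementation.

```python
import torch
import torch.nn as nn


class ConvNeXtBlockSketch(nn.Module):
    """Illustrative block: 7x7 depthwise conv -> LayerNorm -> expand -> GELU -> project."""

    def __init__(self, dim: int = 96, expansion: int = 4):
        super().__init__()
        # Depthwise 7x7 convolution: groups == channels; padding=3 keeps H and W unchanged.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # The only normalization layer in the block, applied over the channel dimension.
        self.norm = nn.LayerNorm(dim)
        # Inverted bottleneck: widen, then project back. A 1x1 convolution is
        # equivalent to a Linear layer on a channels-last tensor.
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # expansion=4 is an assumption
        self.act = nn.GELU()  # the only activation in the block
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return shortcut + x  # residual connection


block = ConvNeXtBlockSketch(dim=96)
features = torch.randn(1, 96, 56, 56)
print(block(features).shape)  # torch.Size([1, 96, 56, 56])
```

The block preserves the input shape, so blocks can be stacked freely within a stage; changes in resolution and width would happen between stages.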

Key takeaway

For machine learning engineers evaluating vision backbones, ConvNeXt shows that a standard ConvNet, updated with modern training and architectural elements, can match or exceed a comparable Vision Transformer. Consider ConvNeXt as a drop-in backbone for image classification, object detection, and segmentation, especially if you value simplicity and efficiency over attention mechanisms. Before adopting a new architecture, check whether your current models already use a modern training recipe and modern design principles, since a large part of the reported gain came from those alone.
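As a rough illustration of what a modern training recipe looks like in code, the sketch below sets up AdamW with a long cosine-decayed schedule in PyTorch. Only the optimizer (AdamW) and the 300-epoch length come from the summary. The learning rate, weight decay, cosine schedule, and label smoothing are assumed placeholder choices, and the helper name `build_training_setup` is hypothetical. The data augmentations mentioned in the summary are not shown.

```python
import torch
import torch.nn as nn


def build_training_setup(model: nn.Module, epochs: int = 300):
    """Return (optimizer, scheduler, criterion) for a long AdamW schedule.

    All hyperparameter values are illustrative assumptions to be tuned.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
    # Cosine decay over the whole run; call scheduler.step() once per epoch.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    # Label smoothing is one common regularizer (an assumption here).
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    return optimizer, scheduler, criterion


# Tiny stand-in model so the sketch runs without a real backbone.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimizer, scheduler, criterion = build_training_setup(model)

# One illustrative optimization step on random data.
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 10, (4,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Applying a setup like this to an existing baseline first gives a fairer point of comparison before switching architectures.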

Key insights

Modernizing a ConvNet with Transformer-era training and design choices lets it match or surpass a comparable attention-based model: 82.0% top-1 for ConvNeXt versus 81.3% for Swin-T on ImageNet-1K.

Method

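A ResNet-50 was updated step by step: first with a Transformer-style training recipe, then with macro design changes (stage compute ratio, patchify stem), ResNeXt-style changes (depthwise convolution, increased width), an inverted bottleneck, larger kernel sizes, and micro design changes (GELU, LayerNorm, fewer activation and normalization layers).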
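The macro design step can be checked with a little arithmetic. The sketch below compares the patchify stem with a standard ResNet-50 stem and shows how the new stage ratio shifts blocks toward the third stage. The ResNet-50 stem layout (7x7 stride-2 convolution plus 3x3 stride-2 max pool), its (3, 4, 6, 3) block counts, and the 224-pixel input are assumptions about the baseline and are not stated in the summary.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shares(depths: tuple) -> list:
    """Fraction of all blocks that sits in each stage."""
    total = sum(depths)
    return [d / total for d in depths]


INPUT_SIZE = 224  # assumed ImageNet-style input resolution

# Assumed baseline stem: 7x7 stride-2 conv, then 3x3 stride-2 max pool.
after_conv = conv_output_size(INPUT_SIZE, kernel=7, stride=2, padding=3)
resnet_stem_out = conv_output_size(after_conv, kernel=3, stride=2, padding=1)

# Stem from the summary: one non-overlapping 4x4 stride-4 convolution.
patchify_stem_out = conv_output_size(INPUT_SIZE, kernel=4, stride=4)

RESNET50_DEPTHS = (3, 4, 6, 3)  # assumed baseline block counts
CONVNEXT_DEPTHS = (3, 3, 9, 3)  # block counts from the summary

print(resnet_stem_out, patchify_stem_out)  # 56 56
print(stage_shares(RESNET50_DEPTHS)[2])    # 0.375
print(stage_shares(CONVNEXT_DEPTHS)[2])    # 0.5
```

Both stems downsample by a factor of 4, so the patchify stem simplifies the layer without changing the resolution that the first stage sees, while the new ratio places half of all blocks in the third stage.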

Best for: AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.