ConvNeXt: A ConvNet Wakes Up in the 2020s
Summary
ConvNeXt, a convolutional neural network introduced by Facebook AI Research in early 2022, shows that a standard ConvNet can match or outperform Vision Transformers such as Swin Transformer once it is modernized with contemporary design choices and training recipes. Starting from a ResNet-50 at 76.1% ImageNet-1K top-1 accuracy, the researchers applied Transformer-era changes one step at a time. A 300-epoch training schedule with AdamW and stronger data augmentation raised accuracy to 78.8% before any architectural change. The architectural steps then changed the number of blocks per stage from (3, 4, 6, 3) to (3, 3, 9, 3), replaced the stem with a 4x4 stride-4 convolution, adopted depthwise convolutions while widening the network to 96 channels, introduced an inverted bottleneck, and enlarged the depthwise kernel to 7x7. Final micro-design changes (replacing ReLU with GELU, using fewer activation and normalization layers, and swapping BatchNorm for LayerNorm) brought ConvNeXt to 82.0% top-1 accuracy, ahead of Swin-T's 81.3% at a similar FLOP budget. The step-by-step process suggests that many gains attributed to attention-based models come from the surrounding design decisions, and it positions ConvNeXt as a strong backbone for a range of computer vision tasks.
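The accuracy figures quoted in the summary can be split into a training-recipe share and an architecture share with a few lines of arithmetic. This is only a sketch over the four numbers stated above; the variable names are illustrative.

```python
# Top-1 ImageNet-1K accuracies quoted in the summary (percent).
resnet50_baseline = 76.1
after_training_recipe = 78.8
convnext_final = 82.0
swin_t = 81.3


def gain(before, after):
    """Accuracy change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


recipe_gain = gain(resnet50_baseline, after_training_recipe)
architecture_gain = gain(after_training_recipe, convnext_final)
total_gain = gain(resnet50_baseline, convnext_final)
margin_over_swin = gain(swin_t, convnext_final)

# Fraction of the total improvement that came from the training recipe alone.
recipe_share = round(recipe_gain / total_gain, 2)

print(f"training recipe: +{recipe_gain} points")
print(f"architecture:    +{architecture_gain} points")
print(f"total:           +{total_gain} points (recipe share {recipe_share:.0%})")
print(f"margin over Swin-T: +{margin_over_swin} points")
```

On these numbers the training recipe contributes 2.7 of the 5.9 points gained, which is the basis for the claim that a sizeable part of the improvement is independent of the architecture.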
Key takeaway
For machine learning engineers evaluating vision backbones, ConvNeXt shows that a ConvNet updated with modern training and architectural elements can match or exceed comparable Vision Transformers. Consider ConvNeXt as a plug-and-play backbone for image classification, object detection, and segmentation, especially if you value the simplicity and efficiency of a pure convolutional design over attention mechanisms. Before adopting a new architecture, check that your current models already use modern training recipes and design principles.
Key insights
Modernizing ConvNets with Transformer-era training and design choices lets them match or surpass comparable attention-based models such as Swin Transformer.
Principles
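- The training recipe alone accounts for a large share of the gain: 76.1% to 78.8% before any architectural change.
- Macro design (stage ratio, stem) and micro design (activations, normalization) choices each add accuracy on top of that.
- Depthwise convolutions can take over the spatial (token) mixing role that self-attention plays in Transformers.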
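To make the depthwise point concrete, the sketch below counts the weights in a dense 7x7 convolution versus a depthwise 7x7 convolution at 96 channels (the width and kernel size mentioned in the summary). The helper `conv_weights` is a hypothetical name introduced here; it uses the standard grouped-convolution weight count and ignores biases.

```python
def conv_weights(in_channels, out_channels, kernel_size, groups=1):
    """Weight count of a 2D convolution with square kernels (bias excluded)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channels must be divisible by groups")
    return out_channels * (in_channels // groups) * kernel_size * kernel_size


channels, kernel = 96, 7

# Dense conv: every output channel looks at every input channel.
dense = conv_weights(channels, channels, kernel)

# Depthwise conv: groups == channels, so each channel has its own spatial filter.
depthwise = conv_weights(channels, channels, kernel, groups=channels)

# A 1x1 (pointwise) conv then mixes information across channels.
pointwise = conv_weights(channels, channels, 1)

print(f"dense 7x7:     {dense:,} weights")
print(f"depthwise 7x7: {depthwise:,} weights")
print(f"pointwise 1x1: {pointwise:,} weights")
print(f"dense / depthwise ratio: {dense // depthwise}x")
```

The depthwise layer mixes only spatial positions within each channel, and channel mixing is left to the 1x1 layers, which is why a large 7x7 kernel stays cheap in this design.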
Method
A ResNet-50 was updated step by step: a Transformer-style training schedule, macro design changes (stage ratio, stem), ResNeXt-style changes (depthwise convolution, increased width), an inverted bottleneck, larger kernel sizes, and micro design changes (GELU, LayerNorm).
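The result of these steps can be sketched as a shape and parameter plan. The block layout below (7x7 depthwise convolution, LayerNorm, 1x1 expansion, GELU, 1x1 projection) follows the components named above. The 4x expansion ratio, the doubling of channels per stage, the 2x downsampling between stages, and the 224-pixel input are assumptions made for illustration. The count omits the stem, downsampling layers, classifier head, and any extras a real implementation may add.

```python
def block_params(dim, kernel_size=7, expansion=4):
    """Per-layer parameter counts (weights + biases) for one block."""
    hidden = expansion * dim  # inverted bottleneck: wide in the middle
    return {
        "depthwise_conv": dim * kernel_size * kernel_size + dim,
        "layer_norm": 2 * dim,              # scale and shift per channel
        "pointwise_expand": dim * hidden + hidden,
        "gelu": 0,                          # single activation, no parameters
        "pointwise_project": hidden * dim + dim,
    }


def stage_plan(image_size=224, depths=(3, 3, 9, 3),
               dims=(96, 192, 384, 768), stem_stride=4):
    """Resolution, width, and block parameter count for each stage."""
    plan = []
    resolution = image_size // stem_stride  # 4x4 stride-4 stem
    for index, (depth, dim) in enumerate(zip(depths, dims)):
        if index > 0:
            resolution //= 2  # assumed 2x downsampling between stages
        plan.append({
            "stage": index + 1,
            "blocks": depth,
            "channels": dim,
            "resolution": resolution,
            "params_per_block": sum(block_params(dim).values()),
        })
    return plan


for stage in stage_plan():
    print(stage)
```

Most of each block's parameters sit in the two 1x1 layers, while the 7x7 depthwise layer and the single LayerNorm are comparatively small.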
In practice
- Use ConvNeXt as a plug-and-play backbone.
- Pre-train on ImageNet-22K when scaling to larger models.
- Apply modern training recipes to existing ConvNets.
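As one illustration of a modern training recipe, the sketch below computes a per-epoch learning rate with linear warmup followed by cosine decay, a schedule commonly paired with AdamW. The 300-epoch length comes from the summary. The base learning rate, the 20-epoch warmup, and the cosine shape are assumptions chosen for illustration and should be checked against the original paper's recipe before use.

```python
import math


def learning_rate(epoch, base_lr=4e-3, warmup_epochs=20,
                  total_epochs=300, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if epoch < warmup_epochs:
        # Ramp linearly from 0 at epoch 0 to base_lr at the end of warmup.
        return base_lr * epoch / warmup_epochs
    # Fraction of the post-warmup schedule completed, capped at 1.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 160, 300):
    print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.6f}")
```

The returned value can be fed to whichever optimizer wrapper your training loop uses; the function itself has no framework dependency.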
Topics
- ConvNeXt
- Convolutional Neural Networks
- Vision Transformers
- ImageNet
- Computer Vision Backbones
- Deep Learning Architecture
Best for: AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.