ConvNeXt: A ConvNet Wakes Up in the 2020s

2026-06-27 · Source: Deep Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, medium

Summary

ConvNeXt, a convolutional neural network introduced by Facebook AI Research in early 2022, demonstrates that a pure ConvNet can match or outperform Vision Transformers such as Swin Transformer when modernized with contemporary design choices and training recipes. Starting from a ResNet-50 with 76.1% ImageNet-1K top-1 accuracy, the researchers applied Transformer-era advancements one step at a time. A 300-epoch training schedule with AdamW and advanced data augmentations alone raised accuracy to 78.8%. The architectural modifications that followed were: changing the stage compute ratio from (3, 4, 6, 3) to (3, 3, 9, 3), replacing the stem with a 4x4 stride-4 convolution, adopting depthwise convolutions while widening the network to 96 channels, incorporating an inverted bottleneck, and enlarging the kernel to 7x7. Final micro-design changes (GELU activations, fewer activation and normalization layers, and LayerNorm in place of BatchNorm) brought ConvNeXt to 82.0% top-1 accuracy, ahead of Swin-T's 81.3% at a comparable FLOP budget. The process shows that many gains attributed to attention-based models stem from the surrounding design and training decisions, and the resulting model serves as a strong backbone for a range of computer vision tasks.
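The accuracy figures in the summary can be laid out as a small ledger to see how much of the total gain comes from the training recipe and how much from the architecture. This is only arithmetic on the numbers quoted above; the variable names are illustrative and not taken from the original article.

```python
# Top-1 ImageNet-1K accuracies quoted in the summary (percent).
resnet50_baseline = 76.1
after_training_recipe = 78.8
convnext_final = 82.0
swin_t = 81.3

# Round to one decimal to avoid floating-point noise in the differences.
recipe_gain = round(after_training_recipe - resnet50_baseline, 1)
architecture_gain = round(convnext_final - after_training_recipe, 1)
total_gain = round(convnext_final - resnet50_baseline, 1)
margin_over_swin_t = round(convnext_final - swin_t, 1)

# Fraction of the total improvement explained by the training recipe alone.
recipe_share = recipe_gain / total_gain

print(f"training recipe:   +{recipe_gain} points")
print(f"architecture:      +{architecture_gain} points")
print(f"total:             +{total_gain} points")
print(f"margin over Swin-T: +{margin_over_swin_t} points")
print(f"recipe share of total gain: {recipe_share:.0%}")
```

On these numbers, a little under half of the improvement over the ResNet-50 baseline comes from the training recipe before any architectural change is made.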

Key takeaway

For Machine Learning Engineers evaluating vision backbones, ConvNeXt demonstrates that a ConvNet updated with modern training and architectural elements can match or exceed comparable Vision Transformers such as Swin. Consider ConvNeXt as a plug-and-play alternative for image classification, object detection, and segmentation, especially if you value the simplicity and efficiency of convolutions over attention mechanisms. Before adopting a new architecture, check whether your current models have already been given modern training recipes and design updates, since these alone can account for a large part of the gap.

Key insights

Modernizing ConvNets with Transformer-era training and design choices lets them match or surpass comparable attention-based models such as Swin Transformer.

Principles

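Training recipes significantly impact measured architecture performance: the modern recipe alone lifts ResNet-50 from 76.1% to 78.8%.
Macro design (stage ratio, stem) and micro design (activation and normalization choices) both contribute to the final accuracy.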
Depthwise convolutions can replace attention for token mixing.
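To make the depthwise-convolution principle concrete, the sketch below counts learnable parameters for a dense 7x7 convolution, a depthwise 7x7 convolution, and a 1x1 pointwise convolution at 96 channels. The helper `conv2d_params` is a hypothetical name introduced for illustration; it applies the standard parameter-count formula for grouped convolutions and is not code from the ConvNeXt repository.

```python
def conv2d_params(c_in, c_out, kernel, groups=1, bias=True):
    """Learnable parameters of a 2D convolution with a square kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel sees only c_in / groups input channels.
    weights = (c_in // groups) * c_out * kernel * kernel
    return weights + (c_out if bias else 0)


channels = 96

# Dense 7x7: mixes space and channels in one expensive operation.
dense_7x7 = conv2d_params(channels, channels, 7)

# Depthwise 7x7 (groups == channels): mixes spatial positions per channel only.
depthwise_7x7 = conv2d_params(channels, channels, 7, groups=channels)

# Pointwise 1x1: mixes channels at each spatial position.
pointwise_1x1 = conv2d_params(channels, channels, 1)

print("dense 7x7:    ", dense_7x7)
print("depthwise 7x7:", depthwise_7x7)
print("pointwise 1x1:", pointwise_1x1)
```

Separating spatial mixing (depthwise) from channel mixing (pointwise) is what keeps a large 7x7 kernel affordable, and it loosely parallels how a Transformer block separates token mixing from the per-token MLP.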

Method

A ResNet-50 was systematically updated with Transformer training schedules, macro design (stage ratio, stem), ResNeXt-ifying (depthwise conv, width), inverted bottlenecks, large kernel sizes, and micro design (GELU, LayerNorm).
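As a rough illustration of where the resulting design spends its parameters, the sketch below counts the weights in one block (depthwise 7x7, LayerNorm, 1x1 expansion, GELU, 1x1 projection) and sums them over the (3, 3, 9, 3) stages. Several values are assumptions not stated in this summary: per-stage widths that double from 96 to 768, and an inverted-bottleneck expansion ratio of 4. The count deliberately leaves out the stem, downsampling layers, classifier head, and any per-channel scaling, so it is a lower bound on the full model rather than its actual size.

```python
STAGE_DEPTHS = (3, 3, 9, 3)          # stage compute ratio from the summary
STAGE_WIDTHS = (96, 192, 384, 768)   # assumption: width doubles at each stage
EXPANSION = 4                        # assumption: inverted-bottleneck ratio


def block_params(dim, kernel=7, expansion=EXPANSION):
    """Parameters in one simplified block of width `dim` (biases included)."""
    depthwise = dim * kernel * kernel + dim           # depthwise kxk conv
    layer_norm = 2 * dim                              # scale and shift
    expand = dim * (expansion * dim) + expansion * dim   # 1x1, dim -> 4*dim
    project = (expansion * dim) * dim + dim              # 1x1, 4*dim -> dim
    # GELU sits between expand and project and has no parameters.
    return depthwise + layer_norm + expand + project


per_stage = [
    depth * block_params(width)
    for depth, width in zip(STAGE_DEPTHS, STAGE_WIDTHS)
]
total_block_params = sum(per_stage)

for width, depth, count in zip(STAGE_WIDTHS, STAGE_DEPTHS, per_stage):
    print(f"width {width:>3} x {depth} blocks: {count:,} parameters")
print(f"all blocks: {total_block_params:,} parameters")
```

Under these assumptions the two 1x1 layers dominate each block, while the 7x7 depthwise convolution contributes only a small fraction, which is why the larger kernel adds little cost.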

In practice

Use ConvNeXt as a plug-and-play backbone.
Pre-train on ImageNet-22K when scaling up to larger model variants.
Apply modern training recipes to existing ConvNets.

Topics

ConvNeXt
Convolutional Neural Networks
Vision Transformers
ImageNet
Computer Vision Backbones
Deep Learning Architecture

Code references

facebookresearch/ConvNeXt

Best for: AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.