I recently added 15 different variants of the ConvNeXt architecture to TensorFlow Hub (TF-Hub). This post is a reflection on what it took to get there. First, we’ll discuss the implementation of ConvNeXt in Keras and how the original pre-trained parameters were ported into these models. We’ll then talk about TF-Hub’s ConvNeXt collection and what it offers.

I hope this post is useful for anyone willing to contribute models to TF-Hub, since doing it the right way can take a good amount of work.

ConvNeXt models were proposed by Liu et al. in "A ConvNet for the 2020s". They are composed of standard layers such as depthwise convolutions and layer normalization, and they use standard network topologies. Unlike recent architectures such as Vision Transformers and CoAtNet, they don’t use self-attention or any hybrid approach. The authors start with a base architecture and gradually refine it to match some of the design choices of Swin Transformers. In the process, they developed a family of models named ConvNeXt that achieves strong performance on the ImageNet-1k dataset while remaining efficient. For details, check out the original paper.

Figure 1: ConvNeXt performance (source: original paper).

# Implementation and weight porting

The ConvNeXt models are fairly easy to implement, especially with the official PyTorch codebase available for reference. As mentioned before, they can be built from the standard components provided by most major deep learning frameworks, such as JAX, PyTorch, and TensorFlow.

ConvNeXt models use the following block structure with layer scaling as introduced in Going deeper with image transformers by Touvron et al.

Figure 2: ConvNeXt block (source: original paper).

The skip connection is controlled with Stochastic Depth to induce regularization during training. Different ConvNeXt variants correspond to different depths along with different channels used in each of the stages. For example, the "tiny" variant uses the following setup:

depths = [3, 3, 9, 3]
dims = [96, 192, 384, 768]
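
The residual path described above, a layer-scaled branch whose contribution is randomly dropped during training, can be sketched in plain NumPy. This is a minimal illustration and not the code from the implementation script: the function name, the argument names, and the initial layer-scale value are assumptions made for the example.

```python
import numpy as np


def residual_with_stochastic_depth(x, branch_out, gamma, drop_rate, training, rng):
    """Add a layer-scaled branch to the skip path, dropping it per sample."""
    scaled = gamma * branch_out  # layer scale: one factor per channel
    if not training or drop_rate == 0.0:
        return x + scaled
    keep_prob = 1.0 - drop_rate
    # One keep/drop decision per sample, broadcast over height, width, channels.
    mask_shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    keep_mask = rng.random(mask_shape) < keep_prob
    # Rescale kept branches so the expected output matches inference.
    return x + scaled * keep_mask / keep_prob


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4, 96))      # NHWC feature map, 96 channels
branch_out = rng.standard_normal(x.shape)   # stand-in for the block's branch output
gamma = np.full((96,), 1e-6)                # assumed initial layer-scale value

eval_out = residual_with_stochastic_depth(x, branch_out, gamma, 0.1, False, rng)
train_out = residual_with_stochastic_depth(x, branch_out, gamma, 0.5, True, rng)
```

At inference time the branch is always kept, so the output is simply `x + gamma * branch_out`. During training, each sample either keeps the rescaled branch or falls back to the identity path.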


If you plan to populate the implemented models with the original parameters, it helps to align your architecture with the official one as much as possible. Since I went with this approach, I tried to follow the official implementation closely. My final implementation is available in this script. Note that it does not yet include the isotropic ConvNeXt models.

Weight porting is usually the most interesting part because there’s no standard recipe that works for every model. You’ll need to think about how to best align the original model parameters with your implementation.

A ConvNeXt model is divided into three main parts: (1) stem which directly operates on the input image, (2) downsample blocks that reduce the resolution of feature maps as the network progresses, and (3) stages that apply the ConvNeXt blocks shown above. This is why I organized my weight porting script such that it has a correspondence between these different parts with the original parameters. Here is an example:

for layer in stem_block.layers:
    if isinstance(layer, tf.keras.layers.Conv2D):
        # PyTorch stores conv kernels as (out, in, kH, kW);
        # Keras expects (kH, kW, in, out).
        layer.kernel.assign(
            tf.Variable(param_list[0].numpy().transpose(2, 3, 1, 0))
        )
        layer.bias.assign(tf.Variable(param_list[1].numpy()))
    elif isinstance(layer, tf.keras.layers.LayerNormalization):
        layer.gamma.assign(tf.Variable(param_list[2].numpy()))
        layer.beta.assign(tf.Variable(param_list[3].numpy()))


The most difficult bit was figuring out how to properly populate the weights of the convolutional layers in TensorFlow from PyTorch. In an earlier implementation, I was simply calling transpose() without specifying the axis order, and the resulting models performed worse than expected. Vasudev helped me figure out the correct transposition of the weight axes, after which the models behaved as expected. More about the evaluation of these models in a moment.
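
To see why the axis order matters, here is a small NumPy sketch. It assumes the earlier attempt used NumPy's argument-free `transpose()`, which reverses all axes; the array names and the non-square kernel are made up for illustration.

```python
import numpy as np

# PyTorch Conv2d weights are laid out as (out_channels, in_channels, kH, kW).
# A non-square kernel makes axis mix-ups visible in the shapes.
torch_style_kernel = np.arange(8 * 3 * 2 * 5, dtype=np.float32).reshape(8, 3, 2, 5)

# Keras Conv2D expects (kH, kW, in_channels, out_channels).
correct = torch_style_kernel.transpose(2, 3, 1, 0)

# A bare transpose() reverses all axes, giving (kW, kH, in, out):
# the channel axes land correctly but the two spatial axes are swapped.
naive = torch_style_kernel.transpose()

print(correct.shape)  # (2, 5, 3, 8)
print(naive.shape)    # (5, 2, 3, 8)

# With square kernels both shapes agree, so the mistake raises no error.
square = np.arange(8 * 3 * 4 * 4, dtype=np.float32).reshape(8, 3, 4, 4)
same_shape = square.transpose().shape == square.transpose(2, 3, 1, 0).shape
same_values = np.array_equal(square.transpose(), square.transpose(2, 3, 1, 0))
```

For square kernels, `same_shape` is true while `same_values` is false: the assignment succeeds, but every kernel is spatially transposed. That would be consistent with a model that loads cleanly yet scores lower than expected.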

Once the weights were ported successfully, the next task was to verify that the outputs of the intermediate layers matched their original counterparts. One detail to note here is that matching parameters do not guarantee matching outputs. Even if every parameter of your model equals the corresponding parameter of the original, the outputs can still differ, mainly because of mismatches between the layer configurations of the two models.
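
As a hedged illustration of such a configuration mismatch, the NumPy sketch below applies layer normalization twice with identical `gamma` and `beta` but different epsilon values. The helper function, the input scale, and both epsilon values are assumptions chosen to make the effect visible; they are not taken from the conversion script.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps):
    """Layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
x = 0.05 * rng.standard_normal((2, 7, 7, 96))  # small-variance activations
gamma, beta = np.ones(96), np.zeros(96)

# Identical parameters, different epsilon (both values are illustrative).
reference = layer_norm(x, gamma, beta, eps=1e-6)
candidate = layer_norm(x, gamma, beta, eps=1e-3)

max_abs_diff = np.max(np.abs(reference - candidate))
print(f"max abs difference: {max_abs_diff:.4f}")

# A tolerance-based check, as one might run on each intermediate output.
outputs_match = np.allclose(reference, candidate, rtol=1e-4, atol=1e-4)
```

Here `outputs_match` comes out false even though the learnable parameters are the same. Running a check like this layer by layer helps locate the first point where the two models diverge.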

The final model conversion script is available here.

# Evaluation of the models

To be more certain, it’s also important to check the evaluation metrics of the converted models on the datasets used during training. In this case, that means the top-1 accuracy of the models on the ImageNet-1k validation set.
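
For reference, top-1 accuracy is the share of samples whose highest-scoring class equals the ground-truth label. A minimal NumPy sketch follows; the function name and the toy numbers are made up and unrelated to the actual evaluation notebook.

```python
import numpy as np


def top1_accuracy(logits, labels):
    """Percentage of samples whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)
    return 100.0 * float(np.mean(predictions == labels))


# Toy batch: 4 samples, 3 classes (made-up numbers, not ImageNet results).
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.9, 0.8, 0.7],
    [0.1, 0.2, 3.0],
])
labels = np.array([0, 1, 2, 2])

print(top1_accuracy(logits, labels))  # 75.0
```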

To set up this evaluation, I developed this notebook where I closely followed the preprocessing used in the official codebase for inference. The following table reflects the top-1 accuracies of the converted models along with the original scores reported here.

| Name | Original acc@1 | Keras acc@1 |
|---|---|---|
| convnext_tiny_1k_224 | 82.1 | 81.312 |
| convnext_small_1k_224 | 83.1 | 82.392 |
| convnext_base_1k_224 | 83.8 | 83.28 |
| convnext_base_1k_384 | 85.1 | 84.876 |
| convnext_large_1k_224 | 84.3 | 83.844 |
| convnext_large_1k_384 | 85.5 | 85.376 |
| convnext_base_21k_1k_224 | 85.8 | 85.364 |
| convnext_base_21k_1k_384 | 86.8 | 86.79 |
| convnext_large_21k_1k_224 | 86.6 | 86.36 |
| convnext_large_21k_1k_384 | 87.5 | 87.504 |
| convnext_xlarge_21k_1k_224 | 87.0 | 86.732 |
| convnext_xlarge_21k_1k_384 | 87.8 | 87.68 |

Keras acc@1 refers to the scores of my implementation. The differences stem primarily from differences in the library implementations, especially in how image resizing is implemented in PyTorch and TensorFlow. My evaluation logs are available at this URL. I’d like to thank Gus from the TF-Hub team for the productive discussions during this phase.

# Publishing on TF-Hub

With the models converted as expected, I was now tasked with publishing them on TF-Hub. These models can be categorized into two different variants: (1) off-the-shelf classifiers and (2) feature extractors used for downstream tasks. This means that the 15 model variants that I had converted would actually amount to 30 models.

Whenever I publish models on TF-Hub, I try to accompany each model with the following:

• Documentation that includes references for the model, how it was exported, etc.
• Colab Notebook showing the model usage.

Doing these things (especially the documentation part) for 30 models can be quite cumbersome. Willi from the TF-Hub team supported me in automatically generating the documentation for all 30 models. The script is available here. It is built around a documentation template and can be reused whenever you publish more than one model. Additionally, I worked on a script that archives the TensorFlow SavedModels in a format accepted by TF-Hub.
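
The general idea of template-driven documentation can be sketched with Python's standard library. This is not the script mentioned above: the template text, the field names, the `_fe` suffix, and the publisher name are all hypothetical placeholders.

```python
from string import Template

# Hypothetical template; the fields and wording are placeholders.
DOC_TEMPLATE = Template(
    "# Module $publisher/$name/1\n\n"
    "$summary\n\n"
    "## Overview\n\n"
    "This is the $kind variant of `$base_name`.\n"
)


def render_docs(base_names, publisher):
    """Return {model_name: markdown} with a classifier and a feature extractor per base model."""
    docs = {}
    for base_name in base_names:
        for suffix, kind in (("", "classifier"), ("_fe", "feature extractor")):
            name = base_name + suffix
            docs[name] = DOC_TEMPLATE.substitute(
                publisher=publisher,
                name=name,
                base_name=base_name,
                kind=kind,
                summary=f"ConvNeXt {kind} ported to TensorFlow.",
            )
    return docs


docs = render_docs(
    ["convnext_tiny_1k_224", "convnext_base_1k_384"],
    publisher="example-publisher",
)
print(len(docs))  # 4
print(docs["convnext_tiny_1k_224_fe"])
```

Each base model yields two documents, which mirrors how 15 converted variants turn into 30 published models.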

I hope these scripts will be beneficial for anyone planning to contribute models to TF-Hub.

As of today, all 30 models are available on TF-Hub. They come with Colab Notebooks and documentation so that it’s easier to get started. Moreover, these TF-Hub models are not black-box SavedModels. You can load them as tf.keras.Model objects for further inspection. Here’s an example:

model_gcs_path = "gs://tfhub-modules/sayakpaul/convnext_tiny_1k_224/1/uncompressed"