Jupyter Notebooks are a great way to present your code offering a good level of interactivity, transparency, and reproducibility. However, operating with a Jupyter Notebook environment can get very challenging if you are working your way through large-scale training workflows as is common in deep learning.

If you are conducting large-scale training it is likely that you are using a powerful remote machine via SSH access. So, even if you are not using Jupyter Notebooks, problems like SSH pipe breakage, network teardown, etc. can easily occur. Consider using a powerful virtual machine on Cloud as your remote. The problem gets far worse when there’s a connection loss but you somehow forget to turn off that virtual machine to stop consuming its resources. You get billed for practically nothing when the breakdown happens until and unless you have set up some amount of alerts and fault tolerance.

To resolve these kinds of problems, we would want to have the following things in the pipeline:

  • A training workflow that is fully managed by a secure and reliable service with high availability.
  • The service should automatically provision and de-provision the resources we would ask it to configure allowing us to only get charged for what’s been truly consumed.
  • The service should also be very flexible. It must not introduce too much technical debt into our existing pipelines.

In this post, we are going to consider all of these factors and will implement them using a service called AI Platform (provided by GCP) and Docker. We will use TensorFlow and Keras to handle distributed training to develop an image classification model capable of classifying cats and dogs. Apart from deep learning-related knowledge, a bit of familiarity would be needed to fully understand this post.

All of the code developed for this post can be found here. We won’t be covering the entire codebase, instead, we will focus on the most important bits.

If you are like me, who have lost sleep over the very thought of the aforementioned problem, you will likely find this tutorial a good starting point to get around it.

Environment setup

You will need to have Docker, command-line GCP (Google Cloud Platform) tools like gcloud, and TensorFlow (2.x) installed if you are on a local machine. But if you have a billing-enabled GCP project it’s possible to get started without any significant setup.

We will use a cheap AI Platform Notebook instance as our staging machine which we will use to build our custom Docker image, push it to Google Container Registry (GCR), and submit a training job to AI Platform. Additionally, we will use this instance to create TensorFlow Records (TFRecords) from the original dataset (Cats vs. Dogs in this case) and upload them to a GCS Bucket. AI Platform notebooks come pre-configured with many useful Python libraries, Linux packages like docker, and also the command-line GCP tools like gcloud.

Note: I used an n1-standard-4 instance with TensorFlow 2.4 as the base image which costs $0.141 hourly.

Notes on the task, data pipeline, and training


As mentioned earlier, we will be training an image classification model on the Cats vs. Dogs dataset which is a moderate-sized dataset. The learning problem is a binary classification task.

Data pipeline

For setting up our data pipeline, we will first create shards of TFRecords from the original dataset. Each of the shards will contain batches of preprocessed images and their labels. This has an advantage. When we would load these shards back for training, we won’t need to do any preprocessing giving us a slight performance boost. Figure 1 demonstrates our TFRecords’ creation workflow.

Figure 1: Schematics of our TFRecord’s creation process.

Figure 1: Schematics of our TFRecord’s creation process.

As you might have already noticed that we have also thrown in another component in the mix -- a GCS Bucket. We would need to store our data on a Google Cloud Storage (GCS) Bucket since the training code won’t be executed locally. We could have used other bucket services (like AWS S3) here but TensorFlow has very unified integrations with GCS Buckets, hence. We will be using the same GCS Bucket to store our trained model and also TensorBoard logs. The total cost to store all of these will be about $1.20.

You are welcome to check out the corresponding code here. In order to streamline the TFRecords’ creation and upload process we will make use of a little shell script:

echo "Uploading TFRecords to Storage Bucket..."
echo gs://${BUCKET_NAME}

python ../trainer/create_tfrecords.py

gsutil -m cp -r train_tfr gs://${BUCKET_NAME}
gsutil -m cp -r validation_tfr gs://${BUCKET_NAME}

gsutil ls -lh gs://${BUCKET_NAME}

After creating the TFRecords we simply copy them over to a previously created GCS Bucket. You can create one by executing the following: gsutil mb ${BUCKET_NAME}.


As for the training pipeline, we will follow the steps below:

  • Load the TFRecords from GCS using CPU in a parallelized way using tf.data.Dataset.map() and feed batches of data to our model. For performance, we will also prefetch several future batches of data so that our model does not have to wait for the data to consume. Our data loader is present here in this script.
  • We will be using a pre-trained model to unleash the power of transfer learning. In particular, we will be using the DenseNet121 model that is available inside tf.keras.applications.
  • We will be training our model inside the tf.distribute.MirroredStrategy scope for distributed training. This strategy is applicable when we have a single host containing multiple GPUs. We will also be using mixed-precision training to speed up the process. The code for realizing this is here.

The training will take place on a remote machine fully managed by AI Platform.

So far, we have discussed the utilities for creating TFRecords, loading them, and building and training our model. Here’s how the code is structured in the GitHub repository mentioned at the beginning of the post:

├── Dockerfile
├── README.md
├── config.yaml
├── scripts
│   ├── train_cloud.sh
│   ├── train_local.sh
│   └── upload_tfr.sh
└── trainer
    ├── config.py
    ├── create_tfrecords.py
    ├── data_loader.py
    ├── model_training.py
    ├── model_utils.py
    ├── task.py
    └── tfr_utils.py

Next, we will be reviewing how Docker fits into all these. From there on, we will have all the recipes set up to kickstart model training.

Fitting in Docker

To submit custom training jobs to AI Platform, we need to package our code inside a Docker image. So, let’s start with that.

To build a Docker image, we first need to define a Dockerfile specifying how it should itself up. Google Container Registry (GCR) provides CUDA-configured TensorFlow containers that we can use to build custom ones. In our case, we extend one such container. Our Dockerfile looks like so:

# Use an existing CUDA-configured TensorFlow container
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-4

# Update TensorFlow to the latest version (2.4.1 at the
# time of writing).
RUN pip install -U tensorflow

# Copies the trainer code to the docker image.
COPY trainer/config.py ./trainer/config.py
COPY trainer/data_loader.py ./trainer/data_loader.py
COPY trainer/model_utils.py ./trainer/model_utils.py
COPY trainer/model_training.py ./trainer/model_training.py
COPY trainer/task.py ./trainer/task.py

# Set up the entry point to invoke the trainer.
ENTRYPOINT ["python"]
CMD ["trainer/task.py"]

After we have defined the Dockerfile, we can proceed to build it and do a round of model training by locally running it.

Building and locally running our container

We will be using GCR to manage the lifecycle of our container. To build a Docker container, one must provide a correct Image URI (Uniform Resource Identifier) and it depends on the platform you are using for managing your container. In our case, that is GCR.

For GCR, the format of the image goes like the following: gcr.io/${PROJECT_ID}/${IMAGE_REPO_NAME}:${IMAGE_TAG}, where PROJECT_ID is the ID of your GCP project and IMAGE_REPO_NAME and IMAGE_TAG are identifiers.

We then build our image and locally run it:

$ docker build -f Dockerfile -t ${IMAGE_URI} ./
$ docker run ${IMAGE_URI} \
    trainer/task.py --bucket ${BUCKET_NAME} \
    --train-pattern ${TRAIN_FILES} \
    --valid-pattern ${VALIDATION_FILES}

To make the process cleaner, we can create a shell script and put all the instructions inside it. You can follow this one to get an idea.

The first time it’s run, it’s going to take a while. But after that, all the consequent runs will use the cached resources to speed up the build. The local Docker daemon (dockerd) will first read our Dockerfile and after getting to the entry point, it will parse all the command-line arguments we provided to task.py. task.py just takes all the command-line arguments and starts the model training. TRAIN_FILES and VALIDATION_FILES are patterns to the TFRecords residing inside a GCS Bucket and they look like so -


If everything goes well, then, after a while, you should be able to see that our model has started training:

Figure 2: Docker build.

Figure 3: Local training logs.

The local Docker run is a way for us to ensure our code is running fine without any hiccups. So, it’s advisable to stop the local run after you have ensured the model is able to start training. With this, we are now ready to push our custom Docker image to GCR, and submit a training job to AI Platform.

Submitting a training job

For this step, we need to add two more lines of code:

  • After building our Docker image, we need to push it to GCR so that AI Platform can pull it to run model training.
  • Submit a training job to AI Platform.

So, let’s put these pieces together:

# Build and push the docker image
$ docker build -f Dockerfile -t ${IMAGE_URI} ./
$ docker push ${IMAGE_URI}

# Submit job
$ gcloud ai-platform jobs submit training ${JOB_NAME} \
    --region ${REGION} \
    --master-image-uri ${IMAGE_URI} \
    --config ./config.yaml \
    -- \
    trainer/task.py --bucket ${BUCKET_NAME} \
    --train-pattern ${TRAIN_FILES} \
    --valid-pattern ${VALIDATION_FILES}

Reviewing what’s going on with the gcloud command, we have:

  • region, that informs AI Platform about the region to be used for the training process. This very region is also going to be used to provision resources such as GPUs. If GPUs are to be used then it’s important to pass a region that has that support. You can know the regions that have this support from here.
  • master-image-uri is the URI of our custom Docker image.
  • Via config, we provide a specification of the kind of machine we want to use for training. This specification is provided using a YAML file and ours looks like so:

        scaleTier: CUSTOM
        # Configure a master worker with 2 V100 GPUs
        masterType: n1-standard-8 # Specify the base machine type
            count: 2
            type: NVIDIA_TESLA_V100

    The advantage of using specifications like this lies in the flexibility it provides. The gcloud ai-platform jobs submit training command has a scale-tier option through which we can pass a pre-defined machine configuration. But let’s say we want to train using multiple machines - 1 master, 3 workers, and 3 parameter servers each having different GPU and CPU configurations. The pre-defined values won’t cut here and this is where we can take the advantage of custom specifications. You can check here to know the different machine types and configurations that can be provided to AI Platform.

Important: We are using V100 GPUs because they come with Tensor cores and that is a must-have to take the advantage of mixed-precision training. We could have used other GPUs like T4, A100 as well that fit this criterion.

We have already discussed the part that follows config so we will not be reviewing that here. If the job submission is successful you should see an entry for it on the GCP console:

Figure 4: AI Platform training job list.

On the extreme right, you would notice an option called View Logs that lets us monitor our training. It’s incredibly useful to have all of your training logs stored somewhere safe without making any effort. Logging for an AI Platform training job is managed by Cloud Logging. Here’s how mine looks like:

Figure 5: Training logs. Notice the neat search filter query.

After training is complete, we can verify if all the necessary artifacts were stored inside our GCS Bucket:

Figure 6: SavedModel file and TensorBoard logs.

In our training script, we had set up the TensorBoard callback to keep track of the training progress. You can check one such log here online on tensorboard.dev. Inspecting into it, we can see that our model’s been trained well, as the validation accuracy has stabilized:

Figure 7: Accuracy plot.

As an effective practitioner, It’s important to be aware of the costs and ensure maximization of resource utilization. Now that we were able to successfully complete our model training, let’s discuss these aspects in the next and final section of the post.

Delving deep into training costs and resource utilization

AI Platform provides a number of useful metrics for the training jobs. Each job has a separate dashboard that makes it super easy to keep track of its statistics such as total training time, average resource utilization, etc.

First, we have high-level information about the job:

Figure 8: High-level information about a training job.

We can see that the job takes about 22 minutes to complete, and this includes the provisioning of resources, completing the model training, and de-provisioning the resources. We then see the total ML units consumed to run our job. The cost for this translates to:

1.79 (Consumed ML units) $\times$ USD 0.49 = USD 0.8771

You can refer to this document that details the cost calculation scheme. GCP also provides a handy estimated cost calculator that you can find here.

So far our costs are: USD 0.141 (AI Platform Notebook) + USD 1.20 (GCS) + USD 0.8771 (training job) = USD 2.2181. Let’s compare this to an AI Platform Notebook instance equipped with the similar configurations as the one we used for training:

Figure 9: Cost for an AI Platform Notebook with 2 V100 GPUs with n1-standard-8.

Coming to CPU utilization, we have some room for improvement it seems:

Figure 10: CPU utilization of our training resources.

The overall GPU utilization has a few spikes which might need some more inspections in the future:

Figure 11: GPU utilization of our training resources.


We have covered quite a lot of ground in this post. I hope by now you have an idea of how to combine tools like Docker, AI Platform to manage your large-scale training workflows in a more cost-effective and scalable way. As a next step, you could take the trained model from AI Platform and deploy the model using it. AI Platform predict jobs make it easier to expose models via REST API-like services that are fully managed by AI Platform offering things like autoscaling, authorization, monitoring, etc. If you'd like to try it out yourself, I encourage you to check out the code of this post on GitHub. You are also welcome to checkout TensorFlow Cloud that provides a set of tools making it easier to perform large-scale training with GCP.


I am thankful to Karl Weinmeister for his comments on the initial draft of this post. Also, thanks to the ML-GDE program for providing generous GCP support without which I couldn’t have executed the experiments.