In this post, I discuss how I used several Google Cloud Platform (GCP) APIs to turn two ideas into small prototypes. It includes my thought process, the problems I ran into while developing the prototypes, and my approach toward tackling them. All the code discussed in the post is available in this repository.

As a Machine Learning (ML) practitioner, I advocate understanding the underlying principles of the models and tools that I use. How deep that understanding goes varies. Sometimes it involves writing a minimal implementation of a model, and sometimes it does not involve a from-scratch implementation at all. When it does not, and the model is readily available, I prefer to put it directly to use and get a sense of its broader capabilities.

With libraries like TensorFlow, PyTorch, and Scikit-Learn, doing this has never been easier. As all of these libraries are open-source, you can access the low-level primitives of their model APIs whenever you’d like. It may require a fair amount of experience with the library you use. But as a Machine Learning practitioner, one cannot skip this practice. It’s important to have a good grip on a particular Machine Learning library for your domain of choice (structured tabular data, images, text, or audio, for example).

On the other hand, APIs that offer ML as a service allow non-ML folks to incorporate the power of Machine Learning in their applications very easily. This way developers can prototype ideas faster than ever. Some would argue that leaky abstractions can hit sooner than expected, and that this can be particularly painful in Machine Learning. Nonetheless, if you are more on the applied side of things and don’t want to worry about this aspect, that’s perfectly fine.

I wanted to revisit this idea through the lens of an ML practitioner. More precisely, I wanted to build a series of short demos using the Cloud ML APIs offered by Google Cloud Platform. The premise was simple: given an idea for an ML project, how quickly could I develop a proof of concept (PoC) around it?

The ideation phase

Let me quote Emil Wallner from this interview -

It’s important to collect objective evidence that you can apply machine learning.

With regard to successful ML practice, this statement couldn’t be more appropriate. Machine Learning has affected almost every industry in some way, and it has changed the way we develop and perceive software. Coming up with an ML application idea that has not already been implemented is actually pretty hard.

So, I ideated the prototypes drawing inspiration from what is already available. For example, Dale and Kaz of Google built this uber-cool demo that lets you transform a PDF into an audiobook. I really wanted to build something similar but in a more minimal capacity – something that could solely run on a Colab Notebook.

I decided to revisit some of the GCP ML APIs that I already knew, such as the Vision and Text-to-Speech APIs. As someone who already works in the field of Computer Vision, I was inclined to do something that involves it. So here are some initial ideas that came to mind after spending a considerable amount of time with the different API documentation available on GCP:

A pipeline that takes a short video clip, detects the entities present in the video and generates an audio clip dictating detected entity labels. This allowed me to spend some time with GCP’s Video Intelligence API.
A pipeline that takes an arXiv paper and generates an audio clip of the paper abstract. This was inspired by the demo that Dale and Kaz had already built.

Note that if you are already experienced with the Vision and Text-to-Speech APIs then these may seem very trivial.

The mental model

After these ideas, I designed a bunch of visual workflows demonstrating the steps required to realize these ideas along with the right tooling. Here’s an example -

I also like to refer to these workflows as mental models. They help me figure out the major dependencies and steps that the work may require so that I can plan accordingly. I discuss the importance of developing mental models in this blog post.

(You might have noticed that the above model is a bit different from the first initial idea - I added a logo detection block in there as well.)

Here is another workflow I developed for the second idea I mentioned above:

This is slightly different from the initial idea I had. In fact, it does not even incorporate anything related to the Vision API. If I only wanted to deal with arXiv papers, I thought using the arXiv API (I used the arXiv Python library) would be a far more reasonable option here since it already provides important information about an arXiv paper such as its categories, abstract, last updated date, and so on.

Finally, I wanted to combine the Vision and Text-to-Speech APIs for the second idea I had. In their demos, Dale and Kaz used AutoML Tables to train a model capable of classifying a paragraph of text into the following categories - “body”, “header”, “caption” and “others”. But I wanted to see if I can bypass this additional training step to filter out the abstract block of a paper and perform optical character recognition (OCR) locally. So, I came up with the following workflow -

As we can see, I am additionally using two Python libraries -

pdf2image - as the name suggests, it is for converting a PDF file to PNG.
pytesseract - this is for performing OCR locally on an image.

In the next sections, I’ll discuss the problems I faced while implementing these workflows in code, and how I went about approaching the solutions.

Building a short video descriptor

In the following sections, we will go over the main ingredients that turned out to be important while developing the prototypes. This will include some code along with the motivation for including it.

For the first two workflows, it was mostly about reading the documentation carefully and figuring out the right APIs to use. GCP provides first-class documentation for these APIs with bindings available in many different languages as you can see in the figure below -

I repurposed these code snippets for the workflows. The Python binding of the Video Intelligence API is simple to use -

You first instantiate the client and specify the features you are interested in -

from google.cloud import videointelligence

video_client = videointelligence.VideoIntelligenceServiceClient()
features = [videointelligence.enums.Feature.LABEL_DETECTION]

The API provides a range of features like label (entity) detection, logo recognition, text recognition, object tracking, and so on. Here I am only interested in performing entity detection on a per-segment basis. A user usually specifies segments if they want to analyze only part of a video. I didn’t specify any segments, and in that case, the Video Intelligence API treats the entire video as a single segment. The API also allows you to perform label detection at more granular levels, i.e. at both the shot and frame levels.

After the initialization, it was only a matter of a few keystrokes till I made my first video annotation request -

# Specify the mode in which label detection is to be performed
mode = videointelligence.enums.LabelDetectionMode.SHOT_AND_FRAME_MODE
config = videointelligence.types.LabelDetectionConfig(label_detection_mode=mode)
context = videointelligence.types.VideoContext(label_detection_config=config)
 
# Make the request
operation = video_client.annotate_video(
    input_uri=gcs_path, features=features, video_context=context)

Here I am supplying a GCS bucket path of the video I wanted to infer on. Processing the results of the operation is also straightforward -

# Block until the long-running operation finishes, then read its result.
result = operation.result()

# Process video/segment level label annotations.
# Take the first annotation result, since we sent only one video.
segment_labels = result.annotation_results[0].segment_label_annotations
video_labels = []
for segment_label in segment_labels:
    print("Video label description: {}".format(segment_label.entity.description))
    video_labels.append(segment_label.entity.description)

After I got the entity labels for the entire video, the next task was to use the Text-to-Speech API to generate an audio clip. For that, I simply followed the official tutorial and reused the code.
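As a rough illustration of this step, here is a minimal sketch that joins the detected labels into a sentence and sends it to the Text-to-Speech API. It assumes the google-cloud-texttospeech 2.x client library and configured credentials. The function names, the sentence template, and the output file name are hypothetical and are not taken from the notebook or the official tutorial.

```python
def labels_to_sentence(labels):
    """Join detected labels into one sentence for the synthesizer."""
    if not labels:
        return "No labels were detected in this video."
    return "This video contains: " + ", ".join(labels) + "."


def synthesize_to_mp3(text, out_path="labels.mp3", language_code="en-US"):
    """Send text to the Text-to-Speech API and write the audio as MP3."""
    # Imported lazily so labels_to_sentence works without the client library.
    # Assumes google-cloud-texttospeech 2.x.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=text),
        voice=texttospeech.VoiceSelectionParams(
            language_code=language_code,
            ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    # The response carries the raw audio bytes.
    with open(out_path, "wb") as audio_file:
        audio_file.write(response.audio_content)
    return out_path
```

With the labels collected earlier, usage would look like synthesize_to_mp3(labels_to_sentence(video_labels)).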

The logo detection pipeline is nearly identical, with only minor changes. In case you want to catch all the details, please follow this Colab Notebook.

I tested the entire workflow on the following video and you can see the outputs right below it -

Processing video for label annotations:

Finished processing.
Video label description: sidewalk
Video label description: street
Video label description: public space
Video label description: pedestrian

Processing video for logo detection:

Finished processing.

As for the audio clip, it came out pretty nice -

Speed-wise the entire pipeline executed pretty quickly.

I had some previous experience working with videos, so I had an idea of what was going on under the hood for the video-related parts. As for speech, I plan to get to that probably next summer (?)

A potential extension of this demo could help blind people navigate when they are outside. I developed this demo with this in mind, hence you won’t see any visual results.

Detecting, cropping, and reading an arXiv summary

I presented two different workflows for the second idea, i.e. getting the abstract of an arXiv paper and generating an audio clip of it. The workflow involving the arxiv Python library wasn’t problematic at all, so I am not going to discuss it in detail. You can always check out this fully worked out Colab Notebook in case you are interested.
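For readers who want a feel for the simpler workflow without opening the notebook, here is a minimal sketch. Note that it does not use the arxiv Python library from the notebook. Instead it queries the public arXiv API (export.arxiv.org) directly with the standard library and reads the abstract from the summary element of the Atom response. The function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by the Atom feed that the arXiv API returns.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_abstract(atom_xml):
    """Return (title, abstract) from an arXiv API Atom response."""
    root = ET.fromstring(atom_xml)
    entry = root.find(ATOM + "entry")
    if entry is None:
        raise ValueError("No entry found in the response")
    title = entry.findtext(ATOM + "title", default="")
    summary = entry.findtext(ATOM + "summary", default="")
    # Collapse line breaks and repeated spaces from the feed.
    return " ".join(title.split()), " ".join(summary.split())


def fetch_abstract(arxiv_id):
    """Query the arXiv API for one paper ID and parse the response."""
    url = "http://export.arxiv.org/api/query?id_list=" + arxiv_id
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_abstract(response.read())
```

The abstract string returned here can then be passed to the Text-to-Speech API in the same way as the video labels.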

The other workflow is a bit more involved. In there, I wanted to take an arXiv paper in PDF format, use the Vision API to get blocks of texts from it, and then locate the abstract from there like so -

But that’s not all. I also wanted to perform OCR locally on the text blocks. This allowed me to reduce the number of calls to the Vision API and thereby save some money. The final piece of the puzzle was to take the local OCR results and generate an audio clip. If you have seen the Text-to-Speech documentation, you probably noticed that this part is really not a big deal.

So, to realize this workflow here’s what I did (Colab Notebook) -

As I am only interested in the abstract of a paper, I first converted the PDF-formatted paper to PNG and serialized only the first page. I used the pdf2image library for this.
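A minimal sketch of this conversion step is shown below. It assumes pdf2image and its poppler dependency are installed. Unlike the description above, it asks pdf2image to render only the first page rather than the whole paper. The helper names, the output naming scheme, and the dpi value are hypothetical.

```python
from pathlib import Path


def png_path_for(pdf_path, out_dir="."):
    """Build the output PNG path for the first page of a PDF."""
    return Path(out_dir) / (Path(pdf_path).stem + "_page1.png")


def first_page_to_png(pdf_path, out_dir=".", dpi=200):
    """Render only the first page of a PDF and save it as a PNG."""
    # Imported lazily; pdf2image needs poppler installed on the system.
    from pdf2image import convert_from_path

    # first_page and last_page restrict rendering to page 1.
    pages = convert_from_path(str(pdf_path), dpi=dpi, first_page=1, last_page=1)
    out_path = png_path_for(pdf_path, out_dir)
    pages[0].save(out_path, "PNG")
    return out_path
```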

Next, I used the Vision API to make a document_text_detection() request for getting the dense text blocks. The code for this is again, very straightforward -

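import io

from google.cloud import vision
from google.cloud.vision import types

client = vision.ImageAnnotatorClient()
bounds = []

# image_path points to the PNG of the first page
with io.open(image_path, 'rb') as image_file:
    content = image_file.read()

image = types.Image(content=content)
response = client.document_text_detection(image=image)
document = response.full_text_annotation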

# Segregate the blocks
for page in document.pages:
    for block in page.blocks:
        bounds.append(block.bounding_box)

Then I used the example presented here to draw the bounding boxes on the input image which we saw earlier. I also reused these bounding boxes to segregate different blocks as inferred by the Vision API.
I am not going to get into the gory details of how I did the segregation. The catch here is for dense text block detection, Vision API returns polygon coordinates and not rectangular coordinates. So, I had to take polygon crops to segregate the different text blocks. (Thanks to this StackOverflow thread.)
After the segregation part, I used pytesseract to perform OCR on the segregated text blocks. In pytesseract it’s literally doable with text = pytesseract.image_to_string(image_block).
Now, an abstract is never just a handful of characters (assuming the OCR worked correctly). So I only kept those OCR’d texts whose length is greater than 1000 characters.
Even with this kind of thresholding, you’d end up with multiple text blocks that satisfy the criterion. To counter this, I first sorted the OCR’d text blocks by character length and then checked whether a text block contained at most one reference to a citation. If a text block matched this criterion, it was returned as the abstract.
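For the polygon crop mentioned in the steps above, here is one possible sketch using Pillow. This is not necessarily the approach from the StackOverflow thread. It masks everything outside the polygon with a white background and then crops to the polygon's bounding rectangle. The function names are hypothetical, and the vertices are assumed to be (x, y) tuples, e.g. [(v.x, v.y) for v in bound.vertices] for each entry in bounds.

```python
def polygon_bbox(vertices):
    """Axis-aligned bounding box (left, top, right, bottom) of a polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def crop_polygon(image, vertices, background=(255, 255, 255)):
    """Keep only the pixels inside the polygon, then crop to its bbox."""
    # Imported lazily so polygon_bbox works without Pillow installed.
    from PIL import Image, ImageDraw

    # Mask is 255 inside the polygon and 0 elsewhere.
    mask = Image.new("L", image.size, 0)
    ImageDraw.Draw(mask).polygon(list(vertices), fill=255)

    # Paste the page onto a blank canvas through the mask.
    canvas = Image.new("RGB", image.size, background)
    canvas.paste(image.convert("RGB"), (0, 0), mask)
    return canvas.crop(polygon_bbox(vertices))
```

Each cropped block can then be handed to pytesseract.image_to_string. With the blocks cropped and OCR’d, what remains is picking the abstract among them.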

Here’s how I coded it up:
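```
# Sort by length so that the longest matching block is assigned last.
texts_sorted = sorted(texts, key=len)
abstract = None
for text in texts_sorted:
    # Use `and`, not `&`: `&` binds tighter than `<=` and made the test always true.
    if text.split()[0].isupper() and text.count("[") <= 1:
        abstract = text
```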
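The uppercase criterion is meant to ensure that an abstract starts with an uppercase letter. Note that str.isupper() is applied to the whole first word here, so the check only passes when every cased character in that word is uppercase.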

I am aware that these handcrafted rules can break in many cases. But I wanted to explore this possibility anyway.
To keep the Text-to-Speech API from reading out citation markers, I stripped them from the raw text - raw_lines = re.sub(r"\[[\s*\d*\,*]*\]", "", raw_lines).
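To make the effect of this cleanup concrete, here is a small self-contained sketch. It uses a slightly stricter pattern than the one above, which is an assumption on my part: it only removes bracketed groups of comma-separated numbers, together with the whitespace in front of them. The function name is hypothetical.

```python
import re

# Matches markers like "[3]" or "[1, 2]" plus any whitespace before them.
CITATION = re.compile(r"\s*\[\s*\d+(?:\s*,\s*\d+)*\s*\]")


def strip_citations(text):
    """Remove numeric citation markers so they are not read aloud."""
    return CITATION.sub("", text)


print(strip_citations("Transformers [1, 2] work well [3]."))
```

Bracketed text that is not a list of numbers, such as "[see Appendix]", is left untouched by this pattern.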

And that’s it! After a number of trial and error rounds, I was able to get a decent output.

Final thoughts

Throughout this post, we went over two different ideas that are good prototype candidates for Machine Learning. We saw how easy it is to see these ideas in action with different ML APIs. We saw how to make these different APIs work together to solve a given problem. Now, if you are feeling excited enough, you can dive deeper into the different ML tasks we saw: detection and classification, for example. Also note that even if one is using these APIs, it’s important to be able to process the API responses properly for the project at hand.

I would like to leave you with this amazing resource provided by GCP. It includes detailed solution walkthroughs of real-world problem scenarios across a wide range of different industry verticals. They also show how to make the best use of different GCP services.

I would like to thank Karl Weinmeister for reviewing this post and for sharing his valuable feedback. Also, thanks to the GDE program for providing the GCP credit support which made these demos possible.