A Battle of Text Detectors for Mobile Deployments: CRAFT vs. EAST

In the previous post, we saw how to convert the pre-trained CRAFT model from PyTorch to TensorFlow Lite (TFLite) and run inference with the converted TFLite model. In this post, we compare the TFLite variants of the CRAFT model with another text detection model - EAST. The objective is to provide a comparative study of these two models with respect to deployment-specific factors such as inference latency, model size, and performance on dense text regions. Text detection continues to be an important use case across many verticals, so we hope this post will serve as a systematic guide for developers who are interested in exploring on-device text detection models.

More precisely, we compare the two models on the following factors, which we think are crucial when deploying them in the wild -

Visual Inspection of Performance
Model Size
Inference Latency
Memory Usage

Important

If you are interested in the conversion process and inference pipelines of the models, please refer to these notebooks - CRAFT and EAST. The pre-converted models are available on TensorFlow Hub - CRAFT and EAST.

Benchmark Setup

We used the TensorFlow Lite Benchmark tool to gather results on the inference latency and memory usage of the models, with a Redmi K20 Pro as the target device. We chose a mobile device because text detection is a common component of many mobile applications such as Google Lens.
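As a rough illustration of how such a run can be launched, the sketch below builds an `adb` command for the prebuilt `benchmark_model` binary. The on-device paths and the model file name are assumptions made for this example, not the exact setup used for this post. The `--graph` and `--num_threads` flags are part of the benchmark tool, and 4 threads matches the thread count reported for the latency runs.

```python
import shlex


def build_benchmark_command(model_path, num_threads=4,
                            binary_path="/data/local/tmp/benchmark_model"):
    """Build the adb command that runs the TFLite benchmark binary on a device.

    binary_path and model_path are on-device locations (assumed here).
    """
    return [
        "adb", "shell", binary_path,
        f"--graph={model_path}",          # TFLite model to benchmark
        f"--num_threads={num_threads}",   # CPU threads used for inference
    ]


# Hypothetical model file name, used only for illustration.
cmd = build_benchmark_command("/data/local/tmp/craft_320x320_dr.tflite")
print(shlex.join(cmd))
# To actually run it (requires adb and a connected device):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command construction separate from its execution makes it easy to loop over every resolution and quantization variant with the same settings.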

To make the comparisons fair, we consider the two models at three different image resolutions - 320x320, 640x416, and 1280x800. For each of these resolutions, we consider two post-training quantization schemes - dynamic-range and float16. Conversion of the CRAFT model to an integer-quantized variant is not yet supported, so we do not consider integer quantization (although the EAST model does support it).
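For readers unfamiliar with the two schemes, here is a minimal sketch of how they are typically selected with the TFLite converter. It assumes the model is already available as a TensorFlow SavedModel (the directory name in the usage comment is hypothetical); the actual conversion steps for CRAFT and EAST live in the notebooks mentioned above and may differ.

```python
def quantization_settings(scheme):
    """Return the converter settings (as plain names) for a quantization scheme."""
    if scheme == "dynamic-range":
        # Weights are stored as 8-bit integers; activations stay in float.
        return {"optimizations": ["DEFAULT"], "supported_types": []}
    if scheme == "float16":
        # Weights are stored as 16-bit floats.
        return {"optimizations": ["DEFAULT"], "supported_types": ["float16"]}
    raise ValueError(f"Unsupported scheme: {scheme!r}")


def convert(saved_model_dir, scheme):
    """Convert a SavedModel to TFLite flatbuffer bytes using the given scheme."""
    import tensorflow as tf  # imported lazily; only needed for the conversion

    settings = quantization_settings(scheme)
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [
        getattr(tf.lite.Optimize, name) for name in settings["optimizations"]
    ]
    if settings["supported_types"]:
        converter.target_spec.supported_types = [
            getattr(tf, name) for name in settings["supported_types"]
        ]
    return converter.convert()


# Usage (hypothetical directory name):
#   tflite_bytes = convert("east_saved_model", "float16")
#   open("east_float16.tflite", "wb").write(tflite_bytes)
```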

Visual Inspection of Performance

In this setting, we run both models and their variants (dynamic-range and float16 quantized) on a sample image that has dense text regions, and then visualize the results. We observed that both models perform fairly well on images with sparser text regions. Here’s the sample image we used -

Image is taken from the SROIE dataset.

Time to detect some text!

CRAFT - 320x320 Dynamic-Range & float16

In the dynamic-range quantization setting, we can see the model misses out on some text blocks.

Inference results from the 320x320 dynamic-range and float16 quantized CRAFT models.

With increased numerical precision, i.e. float16, we can clearly see a noticeable improvement in the results. It’s important to note that this improvement comes at the cost of increased model size.

Next up, we apply the same steps to the EAST model.

EAST - 320x320 Dynamic-Range & float16

Under dynamic-range quantization, EAST appears to perform better than CRAFT in that it detects more text blocks. Looking closely, however, the CRAFT model produces far fewer overlapping detections than EAST. When developing practical applications with text detectors, this often becomes a classic precision-recall trade-off like the one we see here, so you would want to weigh your application-specific needs when deciding which side of the trade-off to favor.
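One way to put a number on the "overlaps" seen in the figures is to count pairs of detected boxes whose intersection-over-union (IoU) exceeds a threshold. The sketch below is only an illustration: it assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, and the 0.1 threshold is an arbitrary choice. Both detectors can also emit rotated boxes, which this simple version does not handle.

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_overlapping_pairs(boxes, threshold=0.1):
    """Count unordered pairs of boxes whose IoU exceeds the threshold."""
    count = 0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) > threshold:
                count += 1
    return count


# Made-up boxes: the first two overlap, the third is far away.
boxes = [(0, 0, 10, 10), (5, 0, 15, 10), (100, 100, 110, 110)]
print(count_overlapping_pairs(boxes))
```

A lower count for the same image suggests cleaner, less redundant detections, although it says nothing about how many text blocks were missed.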

Inference results from the 320x320 dynamic-range and float16 quantized EAST models.

With increased precision, the above-mentioned points still hold, i.e. the number of overlaps is much higher for the EAST model than for its CRAFT equivalent. In this setting (float16 quantization), the detection quality of the CRAFT model is evidently superior to that of the EAST model.

As different applications may use different image resolutions, we decided to test the models at larger dimensions as well. This is what we are going to see next.

CRAFT - 640x416 Dynamic-Range & float16

On an increased resolution, the CRAFT model performs pretty well -

Inference results from the 640x416 dynamic-range and float16 quantized CRAFT models.

The float16 version at this resolution is a slam dunk (it rightly leaves out the barcode, which is not a piece of text).

EAST - 640x416 Dynamic-Range & float16

The performance of the EAST model under these settings is roughly equivalent to that of CRAFT -

Inference results from the 640x416 dynamic-range and float16 quantized EAST models.

With float16 quantization and 640x416 as the resolution, the CRAFT model is a clear winner. Notice that the EAST model is still unable to discard the barcode, which might be an important point for some applications.

Time to inspect the results for our final and highest resolution - 1280x800.

CRAFT - 1280x800 Dynamic-Range & float16

Under dynamic-range quantization, the results look acceptable. The model misses a number of text blocks, but the ones it does detect appear to be neat.

Inference results from the 1280x800 dynamic-range and float16 quantized CRAFT models.

The results from the float16 variant are tremendous (as you probably have guessed by now).

EAST - 1280x800 Dynamic-Range & float16

At this resolution, the EAST model seems to be performing well too -

Inference results from the 1280x800 dynamic-range and float16 quantized EAST models.

At this resolution too, with float16 quantization the CRAFT model beats EAST in terms of detection quality.

Model Size

When it comes to deploying models to mobile devices, model size becomes a really important factor. You may not want a heavy model that would, in turn, make your mobile application bulky. Moreover, the Google Play Store and the Apple App Store also place size restrictions on the applications they host.

In addition, heavier models tend to be slower. If your application cannot tolerate increased inference latency, then you would want to keep the model size as low as possible.

The following figure shows the size of the CRAFT and EAST models -

Model (TFLite variants) sizes of CRAFT and EAST.

The dynamic-range quantized versions of both models are well within an acceptable range with respect to size. However, the float16 variants may still be a bit heavy for some applications.
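If you want to reproduce such a size comparison for your own converted models, a small helper like the following is enough. This is a sketch, and the file names in the usage comment are hypothetical.

```python
import os


def file_size_mb(path):
    """Return the size of a file on disk in mebibytes."""
    return os.path.getsize(path) / (1024 * 1024)


def report_sizes(paths):
    """Map each model file name to its size in MB, rounded to two decimals."""
    return {os.path.basename(p): round(file_size_mb(p), 2) for p in paths}


# Usage (hypothetical file names):
#   print(report_sizes(["craft_320x320_dr.tflite", "craft_320x320_fp16.tflite"]))
```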

Inference Latency

Inference latency is also a major factor for mobile deployments, especially when your application requires near-instantaneous predictions. We now compare all the settings we considered in the visual inspection section.

To reiterate, we performed the benchmarks for this section on a Redmi K20 Pro using 4 threads. In the following figures, we present the inference latency of different variants of the CRAFT and EAST models.

Inference latency of different variants of the CRAFT model.

Inference latency of different variants of the EAST model.

As expected, inference latency increases with resolution. Inference latency is also considerably lower for all variants of the EAST model compared to CRAFT. Earlier we saw how quantization affects model performance at a particular resolution. As stated earlier, when using these models inside a mobile application, the “Size vs. Performance” trade-off becomes vital.
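The numbers in this post come from the TensorFlow Lite Benchmark tool, but for a quick sanity check during development a simple timing harness can be useful. The sketch below times any callable after a few warm-up runs. The helper name and the default run counts are our own choices for illustration, and figures measured this way will not match the benchmark tool exactly.

```python
import statistics
import time


def measure_latency_ms(fn, warmup_runs=5, timed_runs=50):
    """Time repeated calls to fn and return summary statistics in milliseconds."""
    if timed_runs < 1:
        raise ValueError("timed_runs must be at least 1")
    for _ in range(warmup_runs):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "stdev_ms": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "runs": len(samples),
    }


# Dummy workload standing in for a model invocation.
print(measure_latency_ms(lambda: sum(range(10_000)), warmup_runs=2, timed_runs=10))
```

With a TFLite model, `fn` could be the `invoke` method of a `tf.lite.Interpreter` whose tensors have been allocated and whose input has already been set.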

Important

The results for the float16 1280x800 CRAFT model could not be obtained on our target device.

Memory Usage

In this section, we shed light on the total memory allocated for the models while running the TensorFlow Lite Benchmark tool. Knowing the memory usage of these models helps us plan application releases accordingly, as not all mobile phones can meet extensive memory requirements. Based on this information, you may want to set some device requirements for an application using these models. On the other hand, if you want your application to be as device-agnostic as possible, you may want to maintain separate models according to their size and memory usage.
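The idea of maintaining separate models for different devices can be sketched as a small selection routine: given a memory budget, pick the best-quality variant that fits. Everything in this example is hypothetical, including the variant names, the memory figures (placeholders, not measurements from this post), and the quality ranking.

```python
from typing import NamedTuple, Optional, Sequence


class Variant(NamedTuple):
    name: str
    peak_memory_mb: float  # placeholder value, not a measured result
    quality_rank: int      # lower is better (1 = best detection quality)


def pick_variant(variants: Sequence[Variant],
                 memory_budget_mb: float) -> Optional[Variant]:
    """Return the best-ranked variant that fits the budget, or None if none fit."""
    fitting = [v for v in variants if v.peak_memory_mb <= memory_budget_mb]
    if not fitting:
        return None
    return min(fitting, key=lambda v: v.quality_rank)


# Placeholder catalogue for illustration only.
catalogue = [
    Variant("small", 50.0, 3),
    Variant("medium", 200.0, 2),
    Variant("large", 900.0, 1),
]
print(pick_variant(catalogue, memory_budget_mb=300.0))
```

In a real application, the catalogue would be filled with the memory figures you measure for your own variants on representative devices.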

Here, too, we consider all the settings from the previous sections. The following figures give us a sense of the memory footprint of the models -

Memory footprint of different variants of the CRAFT model.

Memory footprint of different variants of the EAST model.

In terms of detection performance, CRAFT was the winner in many cases, but if we factor in inference latency and memory footprint, the situation might need reconsideration. In other words, the best-performing model (with respect to a certain task, detection in this case) may not always be the best candidate for deployment.

Important

The results for the float16 1280x800 CRAFT model could not be obtained on our target device.

Conclusion

In this post, we presented a comparative study of two text detection models - CRAFT and EAST. We went beyond their task-specific performance and considered various essential factors that one needs to weigh when deploying these models. At this point, you may want to consider another important factor - the FPS of these models on real-time videos. Please check out this repository to see how to approach that.

Contribution

Tulasi worked on the CRAFT model while Sayak worked on the EAST model. For the purpose of this post, Tulasi focused on gathering all the relevant information for doing the comparisons while Sayak focused on the writing part.

Thanks to Khanh LeViet from the TFLite team for reviewing the post.