Create Text Recognition Mobile Application with Google Cloud Vision

MPyK · Published in The Web Tub · Jan 20, 2022 · 5 min read


Image Courtesy of RICOH

Optical Character Recognition (OCR) is a widely used technology for recognizing text inside images. It converts images containing text (typed, handwritten, or printed) into machine-readable text data. In this article, we are going to create a simple OCR application with the Google Cloud Vision API.

Text Recognition can automate tedious data entry for credit cards, receipts, and business cards. With the Cloud-based API, you can also extract text from pictures of documents, which you can use to increase accessibility or translate documents. Apps can even keep track of real-world objects, such as by reading the numbers on trains.

What are we building?

Our OCR App!

Here is what we are going to build. The app takes a picture with a mobile phone camera or a PC webcam, displays it as the first picture, then draws bounding boxes around the detected text in the second picture, and finally outputs all the recognized text at the bottom.

Get Cloud Vision API Key

To extract text information from an image, we are using the Cloud Vision API provided by Google. First, you will need to create a project in Google Cloud, then enable billing, enable the Cloud Vision API, and finally create an API key to access it.

The Cloud Vision API recognizes text in more than 100 languages and can be tested for free with the first 1,000 requests. It is built both for recognizing sparse text in images (such as photos of business cards) and for recognizing densely spaced text in pictures of scanned documents.

If you set everything up correctly, you should see something like this.

API Key

Copy the API Key `AIza…41Y` to use later in the application.
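To verify the key works, you can send a quick test request to the annotate endpoint. Here is a minimal sketch using Node 18+'s built-in fetch; the key and image content are placeholders you must replace with your own:

```js
// Quick sanity check for the API key (Node 18+ has a global fetch).
// "AIza...YOUR_KEY" and the image content are placeholders.
const KEY = "AIza...YOUR_KEY";
const URL = `https://vision.googleapis.com/v1/images:annotate?key=${KEY}`;

const body = {
  requests: [
    {
      image: { content: "<BASE64_ENCODED_IMAGE>" }, // any small test image
      features: [{ type: "TEXT_DETECTION" }],
    },
  ],
};

fetch(URL, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(body),
})
  .then((res) => res.json())
  .then((json) => console.log(JSON.stringify(json, null, 2)));
```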

Let’s Build the App

As in my previous blog, we will be using Vue 3 and Framework7 to build the app. We suggest cloning this GitHub repository; below, we will walk through the main pieces of the application.

First is the application layout.

Application layout

The application’s body is divided into 3 rows (from line 7 to line 29). The first row contains 3 elements. The canvas element, on line 16, is always shown; it is used to draw bounding boxes around all the text detected in the captured image. The video element, on line 10, is shown only on non-mobile platforms; it displays the live video from the PC’s webcam. On a mobile platform, we instead display the image element, on line 13, to render the image from the phone’s camera. The 2nd and 3rd rows are self-explanatory.
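Since the screenshot is not reproduced here, below is a condensed sketch of what such a template could look like. Component and ref names such as `isMobile` and `detectedText` are illustrative, and the line numbers mentioned above refer to the original file:

```html
<template>
  <f7-page>
    <f7-block>
      <!-- Row 1: live preview (webcam on PC, captured photo on mobile) -->
      <video v-if="!isMobile" ref="video" autoplay playsinline></video>
      <img v-else ref="image" src="static/icons/favicon.png" />
      <!-- The canvas is always shown; bounding boxes are drawn onto it -->
      <canvas ref="canvas"></canvas>
    </f7-block>
    <!-- Row 2: the scan button -->
    <f7-block>
      <f7-button fill @click="scan">Scan</f7-button>
    </f7-block>
    <!-- Row 3: all recognized text -->
    <f7-block>
      <p>{{ detectedText }}</p>
    </f7-block>
  </f7-page>
</template>
```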

Next, let’s check out the “script” part.

Setting it up

On line 3, we import the axios package to handle HTTP requests. Please install the package if you are not cloning the GitHub repo. Alternatively, you can use the browser’s built-in Fetch API, but then you have to manage the request, response, and data handling yourself.

Next, we declare the variables related to the Google Cloud Vision API from lines 14 to 29. On line 14, input the API key you created earlier. On the next line (line 15), the text-recognition API endpoint is built by concatenating the base URL with the API key. Finally, the constant REQUEST_OPTION holds the request body for the HTTP call. It accepts the image in base64 encoding (on line 20); we will later assign the base64-encoded image to this property.

Base64 encoding schemes are commonly used when binary data needs to be stored and transferred over media that are designed to deal with ASCII text. This ensures that the data remains intact, without modification, during transport.
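In the browser, this conversion is a single call. For example, a canvas can be serialized like this; the `data:` URL prefix is stripped because the API expects only the raw base64 payload:

```js
// toDataURL returns e.g. "data:image/jpeg;base64,/9j/4AAQ...",
// so we split off the prefix and keep only the base64 payload.
const base64Image = canvas.toDataURL("image/jpeg").split(",")[1];
```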

On line 24, we assign TEXT_DETECTION to the type property. TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or a traffic sign. The JSON response includes the entire extracted string, as well as the individual words and their bounding boxes. The other option is DOCUMENT_TEXT_DETECTION. It also extracts text from an image, but the response is optimized for dense text and documents: the JSON includes page, block, paragraph, word, and break information.
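Put together, that setup block might look roughly like the following sketch. The request shape and endpoint follow the public Vision API; names such as `API_KEY` and `API_ENDPOINT` are assumptions:

```js
import axios from "axios";

// Paste the key you created in the Cloud console (placeholder below).
const API_KEY = "AIza...YOUR_KEY";
const API_ENDPOINT = `https://vision.googleapis.com/v1/images:annotate?key=${API_KEY}`;

// Request body template for the Vision API. The image content is filled in
// later with the base64-encoded capture; TEXT_DETECTION asks for sparse OCR.
const REQUEST_OPTION = {
  requests: [
    {
      image: { content: "" }, // assigned before each request
      features: [{ type: "TEXT_DETECTION" }],
    },
  ],
};
```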

Moving on to the onMounted lifecycle hook: startUserMedia is called if the platform is non-mobile and supports the UserMedia web API. Otherwise, on a mobile platform, we listen to the load event of the image element and call the ocr function once the image has completely loaded, guarded by the imageLoaded flag. That flag prevents the ocr function from being called when the image is first loaded from the static placeholder file “static/icons/favicon.png”.
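In code, the hook could look roughly like this; the `isMobile` check and the `image` template ref are illustrative names:

```js
import { onMounted, ref } from "vue";

const imageLoaded = ref(false);

onMounted(() => {
  if (!isMobile && navigator.mediaDevices?.getUserMedia) {
    // Desktop: stream the webcam into the <video> element.
    startUserMedia();
  } else {
    // Mobile: run OCR once a freshly captured photo has finished loading.
    // imageLoaded skips the initial load of the placeholder favicon.
    image.value.addEventListener("load", () => {
      if (imageLoaded.value) ocr(image.value);
    });
  }
});
```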

Here is the startUserMedia function. On line 12, we hook the webcam’s stream to the video element.

startUserMedia function
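A minimal version of that function, assuming a `video` template ref, could look like this:

```js
async function startUserMedia() {
  try {
    // Ask for video only; audio is not needed for OCR.
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    // Hook the webcam stream to the <video> element.
    video.value.srcObject = stream;
  } catch (err) {
    console.error("Could not access the webcam:", err);
  }
}
```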

And here is the ocr function.

ocr

In the ocr function, we first draw the passed image argument to the canvas on line 3. It could be the video element (from the PC’s webcam) or the image element (from the phone’s camera). The canvas is then converted to a base64-encoded string, which is passed as a parameter to the recognize function. The recognize function simply calls the Cloud Vision API and, if any text is detected, invokes the drawBoundingBoxes function to draw bounding boxes around all the detected text.
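A sketch of both functions, assuming `canvas` and `detectedText` refs, might look like this:

```js
function ocr(source) {
  // Draw the current frame (video) or captured photo (img) onto the canvas.
  const ctx = canvas.value.getContext("2d");
  ctx.drawImage(source, 0, 0, canvas.value.width, canvas.value.height);
  // Serialize the canvas to base64 (without the data: URL prefix).
  const base64 = canvas.value.toDataURL("image/jpeg").split(",")[1];
  recognize(base64);
}

async function recognize(base64) {
  REQUEST_OPTION.requests[0].image.content = base64;
  const { data } = await axios.post(API_ENDPOINT, REQUEST_OPTION);
  const annotations = data.responses[0].textAnnotations;
  if (annotations && annotations.length) {
    // First record holds the full text; the rest are individual words.
    detectedText.value = annotations[0].description;
    drawBoundingBoxes(annotations.slice(1));
  }
}
```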

Now, let’s see how we can draw the result on the canvas. First, here is the result (line 13) returned from the API. The result is an array. The first record contains all of the detected text; as you can see in the description property, the lines are separated by \n. From the second record onward, each entry holds one individually detected piece of text and its corresponding bounding box in boundingPoly.vertices.

annotations.json
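Abbreviated, the returned textAnnotations array has roughly this shape; the values here are made up for illustration:

```js
[
  {
    // First record: the full text, lines separated by "\n".
    description: "Hello\nWorld",
    boundingPoly: {
      vertices: [{ x: 10, y: 8 }, { x: 120, y: 8 }, { x: 120, y: 60 }, { x: 10, y: 60 }],
    },
  },
  {
    // Subsequent records: one entry per detected word.
    description: "Hello",
    boundingPoly: {
      vertices: [{ x: 10, y: 8 }, { x: 64, y: 8 }, { x: 64, y: 30 }, { x: 10, y: 30 }],
    },
  },
  // ...
]
```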

Drawing the boxes on the canvas is done with the following two functions. The actual drawing happens from line 19 to line 22, where we draw lines from each vertex to the next to form the polygon shape.

Bounding Boxes Drawing function
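A sketch of those two functions could look like the following; styling choices such as the red stroke are illustrative:

```js
function drawBoundingBoxes(annotations) {
  const ctx = canvas.value.getContext("2d");
  ctx.strokeStyle = "red";
  ctx.lineWidth = 2;
  for (const { boundingPoly } of annotations) {
    drawPolygon(ctx, boundingPoly.vertices);
  }
}

function drawPolygon(ctx, vertices) {
  ctx.beginPath();
  ctx.moveTo(vertices[0].x, vertices[0].y);
  // Connect each vertex to the next to form the polygon,
  // then close the path back to the first vertex.
  for (let i = 1; i < vertices.length; i++) {
    ctx.lineTo(vertices[i].x, vertices[i].y);
  }
  ctx.closePath();
  ctx.stroke();
}
```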

Finally, the scan function. On a mobile platform, we use the getPicture method of the Cordova Camera plugin. In the success callback, we assign the captured value to the image element and set the imageLoaded flag to true; when the image has fully loaded, the ocr function is called via the load event we registered in the onMounted hook. Otherwise, on a non-mobile platform, we invoke the ocr function directly.

The last part
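Assuming the Cordova Camera plugin is installed, a sketch of the scan function could look like this:

```js
function scan() {
  if (isMobile) {
    // Cordova Camera plugin: capture a photo and get it back as base64.
    navigator.camera.getPicture(
      (imageData) => {
        imageLoaded.value = true;
        // Assigning src fires the "load" event, which in turn calls ocr().
        image.value.src = `data:image/jpeg;base64,${imageData}`;
      },
      (err) => console.error("Camera failed:", err),
      { destinationType: Camera.DestinationType.DATA_URL }
    );
  } else {
    // On desktop, OCR the current webcam frame directly.
    ocr(video.value);
  }
}
```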

Run on mobile device

If you would like to run this application on your phone, please sign up for a Monaca account and follow this guideline on how to build an Android or iOS application.

What is Monaca?

Cross-platform hybrid mobile app development platform and tools in the cloud

All the source code can be found here — https://github.com/yong-asial/ocr

Conclusion

In this article, we have learned how to use the Google Cloud Vision API to extract text information from an image. The image is captured from the PC’s webcam by using the UserMedia web API, and from the phone’s camera by using the Cordova Camera plugin.

Happy coding!
