Digitize Your Notes With Microsoft Computer Vision API

How many of you have piles of (paper) notes lying around that you never got around to transcribing? Wish you could search and sort your notes just like you do your computer files? Good news: Microsoft’s Cognitive Services, specifically the Computer Vision API, lets you do just that. Using artificial intelligence and machine learning, you can store your notes as text in just a few lines of code.

The API provides pre-trained models and services to categorize and tag visual data such as photos and text. In this tutorial, we’re going to build a service in Python that can read the text from handwritten notes. You can use it to digitize your shopping lists, sticky notes, reminders, and more.

When finished, you’ll have a Python program that can analyze each of your handwritten notes and spit out an associated text file with the transcribed data. For extra efficiency, you can set this up to run on a schedule so it transcribes your notes automatically.

Let’s get started.

Get an API Key and Dependencies

To start building with the Microsoft Computer Vision API, you’ll need a key. Navigate to their website and select ‘Trial’ to get a free 30-day trial account. After signing in with your Microsoft account and accepting the Terms of Service, you’ll get a customized API endpoint and two API keys. Take note of these; we’ll need them shortly.

You’ll also need Python installed. If you’re on macOS, it should already be there. If you’re on Windows or Linux, take this time to install Python. We’ll be using Python 3.6.x for this tutorial.

These are the Python libraries you’ll need to import in your file (all are part of the Python standard library except requests, which you can install by running pip install requests):

import http.client, urllib.request, urllib.parse, urllib.error, base64, requests, time, json  

Obtain Image Samples

If you have images of your notes stored on disk, it’s easy to run them through Microsoft’s Computer Vision API and transcribe them. You can also do this if you have your notes stored on a cloud service like iCloud or Dropbox. We’ll go through both ways.

You can use images of your own notes, or the sample image I created for this tutorial (referenced by URL in the code below).

Set up the Request

To transcribe the handwritten text, we’ll be using Python and the Microsoft Computer Vision API. Remember the API key and endpoint you generated from Microsoft above? You’ll need them now.

Start a file called digitizer.py and add in two variables for your endpoint and API key. You can use either of the two keys provided by Microsoft.

# Keys  
endpoint = 'https://westcentralus.api.cognitive.microsoft.com/vision/v1.0'  
api_key = 'YOUR_KEY_GOES_HERE'

Then we can build the HTTP request to send to the API. We need headers, a body, and parameters. Microsoft provides a detailed API reference for the RecognizeText operation.

headers = {  
    # Request headers.  
    # Another valid content type is "application/octet-stream".  
    'Content-Type': 'application/json',  
    'Ocp-Apim-Subscription-Key': api_key,  
}

This sets the request headers. We pass in the content type and your API key. Next, we need to specify what image to analyze:

body = {'url':'https://i.imgur.com/W2fF6uC.jpg'}

You can insert your own image in the url section, or use the sample I’ve provided.

params = {'handwriting' : 'true'}

If you set handwriting to false, the API will run its standard printed-text OCR instead.
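The headers comment above notes that "application/octet-stream" is also a valid content type; that’s how you’d submit an image straight from disk instead of by URL. Here’s a minimal sketch of how you might support both. The helper build_post_kwargs is hypothetical (not part of the API); the request shapes follow the API reference.

```python
# Hypothetical helper (not part of the API): builds the keyword
# arguments for the POST, for either an image URL or a local file path.
def build_post_kwargs(source, api_key):
    """Return kwargs for requests.request(), given a URL or a local path."""
    if source.startswith(('http://', 'https://')):
        # Remote image: send the URL as a JSON body, as in the tutorial.
        headers = {'Content-Type': 'application/json',
                   'Ocp-Apim-Subscription-Key': api_key}
        return {'headers': headers, 'json': {'url': source}}
    # Local image: send the raw bytes as an octet stream.
    with open(source, 'rb') as f:
        image_bytes = f.read()
    headers = {'Content-Type': 'application/octet-stream',
               'Ocp-Apim-Subscription-Key': api_key}
    return {'headers': headers, 'data': image_bytes}
```

You’d then post with something like requests.request('POST', endpoint + '/RecognizeText', params=params, **build_post_kwargs(source, api_key)).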

Handwriting Analysis with Python

To analyze the handwriting in our image, we’ll need to make two REST API calls. One POST will submit the image for processing. Then we’ll use a GET to retrieve the results.

Now that we’ve got the header, body, and parameters for the HTTP request, let’s put it together:

try:  
    response = requests.request('POST', endpoint + '/RecognizeText', json=body, data=None, headers=headers, params=params)

We can tell from the API reference that an HTTP status of 202 (Accepted) indicates success. It’s 202 instead of 200 because the image has been accepted for asynchronous processing rather than processed immediately. Of course, if there’s an error at this stage, we don’t want to go any further. Let’s add in some error-handling code now:

    if response.status_code != 202:
        # Display the JSON error data and exit if the REST API call failed.
        parsed = json.loads(response.text)
        print("Error:")
        print(json.dumps(parsed, sort_keys=True, indent=2))
        exit()

That’s one request out of the way. We’ve sent off our image to the Computer Vision API and it’s crunching away at the data as we speak. But how do we get the results back?

We’ll need to make another request for that. The API reference gives more information, but essentially we need to grab the Operation-Location header from the previous response; it holds the URL where the result will become available. We’ll grab that value now:

    # Grab the 'Operation-Location' from the response headers.
    operationLocation = response.headers['Operation-Location']

Now we can use this operationLocation to retrieve the results.

Note: Since handwriting recognition is an async operation, the results may not immediately be available. It can take a variable amount of time depending on your system and the image you submit. In our example, we’ll wait 10 seconds. In practice, you may want to add in auto-retries if the response isn’t ready yet.

    print('\nHandwritten text submitted. Waiting 10 seconds to retrieve the recognized text.\n')
    time.sleep(10)
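Rather than a fixed sleep, a small retry loop is more robust. Here’s a sketch of what that might look like; poll_operation is a hypothetical helper that takes any zero-argument fetch function returning the parsed JSON, and relies on the result’s status field (NotStarted / Running / Succeeded / Failed, per the API reference).

```python
import time

def poll_operation(fetch, attempts=10, delay=2):
    """Call fetch() until the operation finishes or we give up.

    fetch should return the parsed JSON from the Operation-Location URL.
    Returns the final JSON, or raises if recognition fails or never
    completes within the allotted attempts.
    """
    for _ in range(attempts):
        result = fetch()
        status = result.get('status')
        if status == 'Succeeded':
            return result
        if status == 'Failed':
            raise RuntimeError('Handwriting recognition failed')
        # Still NotStarted/Running; wait and retry.
        time.sleep(delay)
    raise TimeoutError('Recognition did not finish in time')
```

You could then replace the fixed sleep with something like parsed = poll_operation(lambda: json.loads(requests.get(operationLocation, headers=headers).text)).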

After waiting 10 seconds, hopefully the response is ready for us. We can make the second REST call at this point:

    # Execute the second REST API call and get the response.
    response = requests.request('GET', operationLocation, json=None, data=None, headers=headers, params=None)

    # Parse the JSON response body.
    parsed = json.loads(response.text)

Here you can see we’re querying the operationLocation we generated above. The response comes in the form of a JSON data object. If we wanted to, we could print out all the JSON data here. But for this example, we’re just going to grab the lines of text that the API returns to us.

To get the transcribed lines of text, we need to navigate down through the JSON object to find the right values.

    lines = parsed['recognitionResult']['lines']

In the response, each line of handwritten text gets its own entry in the JSON. This means that if a note takes up multiple lines on a piece of notebook paper, for instance, each of those lines is returned separately.

parsed['recognitionResult']['lines'] contains an array of all the lines of processed text. We can print those out now:

    for line in lines:
        print(line['text'])
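For reference, here’s a trimmed sketch of what the parsed GET response looks like. The field names follow the API reference; the text and boundingBox values are invented for illustration, and each line also carries per-word results that are omitted here.

```python
# A trimmed, invented example of the JSON returned by the GET call.
sample_response = {
    'status': 'Succeeded',
    'recognitionResult': {
        'lines': [
            {'boundingBox': [10, 20, 200, 20, 200, 60, 10, 60],
             'text': 'buy milk'},
            {'boundingBox': [10, 80, 180, 80, 180, 120, 10, 120],
             'text': 'call mom'},
        ]
    }
}

# Joining the per-line results gives the full transcription.
transcript = '\n'.join(
    line['text'] for line in sample_response['recognitionResult']['lines'])
```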

We’re almost done! Now we just need to add an except clause to catch any pesky errors:

except Exception as e:  
    print('Error:')  
    print(e)

At this point, you should have a working piece of code that can transcribe the text from your notes! But just printing the text to the terminal isn’t very useful. Let’s do something with it.

Write Transcribed Data to a File

Let’s write that data to a file. We can edit the for loop above to write each line out as well. Python’s built-in with statement handles opening and closing the file for us:

    # This opens the file for writing.
    with open('mynote.txt', 'w') as f:
        for line in lines:
            print(line['text'])
            # Write the line of text to the file.
            f.write(line['text'] + '\n')
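If you’re processing several notes, you probably want each one to end up in its own text file, named after the source image, as described at the start. A small sketch; save_transcription is a hypothetical helper, not part of the API:

```python
from pathlib import Path

def save_transcription(image_path, lines):
    """Write the recognized lines to a .txt file named after the image.

    `lines` is the list from parsed['recognitionResult']['lines'].
    Returns the path written, e.g. note1.jpg -> note1.txt.
    """
    out_path = Path(image_path).with_suffix('.txt')
    text = '\n'.join(line['text'] for line in lines)
    out_path.write_text(text + '\n')
    return out_path
```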

Extended Features

Once you’ve gotten your script up and running, you can take the next step and automate it. This step is optional, but gives an extra boost of power to your workflow.

If you want your notes to be automatically transcribed each day, for instance, you can create a cron job that runs the Python script and points it to a certain directory of files.
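The directory scan such a cron job would run might look like the sketch below. find_notes is a hypothetical helper, and transcribing each file is left to the request/response code above.

```python
import os

# File extensions we'll treat as note photos (an assumption; adjust to taste).
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def find_notes(directory):
    """Return the image files in `directory`, sorted by name."""
    matches = []
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            matches.append(os.path.join(directory, name))
    return matches
```

A cron entry would then invoke the script, which calls find_notes on your notes folder and runs each image through the API.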

What are some other ways you could improve this program? You could set up an alert that triggers an action when a note contains certain keywords, or send the transcribed data to another service for further analysis or publishing.

Do you have a cool application of the Computer Vision API? Share your projects in the comments below!