AWS Step-by-Step

Hands-On with Polly, Amazon's AI-Based Speech Synthesizer

Despite some shortcomings, Amazon Polly is much more than a cloud-based reworking of Apple's 30-year-old S.A.M. program. (But don't count on it to read algebra equations just yet.)

One of the biggest trends in the world of IT right now is artificial intelligence (AI).

Amazon Web Services (AWS) has introduced numerous offerings into the AI space, but one of the more interesting Amazon AI services for non-developers is Polly, which was released last November at the 2016 AWS re:Invent conference.

Simply put, Polly is a text-to-speech engine. When viewed through the GUI, Polly really doesn't look all that impressive or sophisticated. The GUI includes a text box where you can enter a text string, and a button that you can click to listen to your computer speak the string that you entered. You can see what this looks like in Figure 1.

Figure 1: This is the GUI interface for Amazon Polly.

I have to admit that when I saw the Polly interface for the first time, my mind instantly flashed to the 1980s. Back then, a friend had a program on his Apple II called S.A.M. that would do something almost identical. You could enter a text string and the computer would speak whatever had been typed.

At first, my friends and I were amazed by the robotic-sounding speech coming from the computer, but eventually our awe devolved into coming up with creative ways to trick the computer into swearing (it wasn't that hard to do).

The point is that text-to-speech engines have been around for decades, and if you base your opinion of Polly solely on what is shown in Figure 1 above, then it is easy to dismiss Polly as being little more than a cloud-based rehash of a 30-year-old application.

As with so much else in the world of IT, however, things are not always what they seem. For one thing, the text-to-speech engine supports a variety of languages and dialects. In some regions, there are also multiple voices available.

In preparation for this article, I spent a bit of time experimenting with the various voices associated with the English U.S. option. What I found was that some of the voices, such as "Salli," sound surprisingly lifelike. Others, such as "Joey," sound much more robotic. A few of the voices (such as "Ivy" and "Justin") even sound like children.

The thing that impressed me more than the English U.S. voices, however, was the fact that the accents are distinctly American. Being an American myself, I didn't initially notice the American accent, but then I began to experiment with the voices from other English-speaking countries and found that Polly can speak with an Australian or British accent.

Even though I probably had a little too much fun playing with Polly's voices, Polly is far more than just a simple speech engine that parrots text input. There is a very rich API that developers can use to integrate Polly-based speech into applications. That is certainly one thing that was missing from the Apple text-to-speech engine from so long ago. Even though the Apple text-to-speech engine was amusing, there was no supported method for leveraging it from outside of the program's own interface.
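To give a sense of what that integration might look like, here is a minimal Python sketch. It assumes the boto3 SDK is installed and AWS credentials are already configured. The helper names (build_speech_request, speak_to_file) and the default region are illustrative placeholders, not anything Polly prescribes. The request is assembled separately from the network call so that it can be inspected without touching AWS.

```python
def build_speech_request(text, voice_id="Salli", output_format="mp3"):
    """Assemble the keyword arguments for Polly's synthesize_speech call."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def speak_to_file(text, path, voice_id="Salli", region="us-east-1"):
    """Synthesize text with Polly and write the audio bytes to path.

    Needs boto3 and valid AWS credentials; the region is an assumption.
    """
    import boto3  # imported lazily so the request builder works without it

    polly = boto3.client("polly", region_name=region)
    response = polly.synthesize_speech(**build_speech_request(text, voice_id))
    # AudioStream is a streaming body; read() returns the encoded audio bytes.
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Calling speak_to_file("Hello from Polly", "hello.mp3") should then produce a short audio file, provided the credentials in use are allowed to perform the polly:SynthesizeSpeech action.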

As handy as the Polly API might be for basic speech integration, there are two things that really stand out to me. First, you can actually customize Polly's lexicon. If there is something that Polly isn't pronouncing quite right, you can teach Polly how you want the word to be pronounced.
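Polly lexicons are XML documents that follow the W3C Pronunciation Lexicon Specification (PLS). The sketch below shows one way such a document might be generated in Python from a simple word-to-replacement mapping, using PLS alias entries. The function names, the lexicon name and the sample entry are hypothetical, and the upload step assumes boto3 is installed with AWS credentials and a default region configured.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_alias_lexicon(aliases, lang="en-US"):
    """Return a PLS document mapping each written form to a spoken alias."""
    lexemes = []
    for written, spoken in aliases.items():
        lexemes.append(
            "  <lexeme>\n"
            f"    <grapheme>{escape(written)}</grapheme>\n"
            f"    <alias>{escape(spoken)}</alias>\n"
            "  </lexeme>"
        )
    body = "\n".join(lexemes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{body}\n"
        "</lexicon>\n"
    )


def upload_lexicon(name, aliases, lang="en-US"):
    """Store the lexicon in Polly under the given (hypothetical) name."""
    import boto3  # imported lazily; only needed for the upload itself

    polly = boto3.client("polly")
    polly.put_lexicon(Name=name, Content=build_alias_lexicon(aliases, lang))
```

After something like upload_lexicon("MyTerms", {"AWS": "Amazon Web Services"}), the stored lexicon can be applied to a request by passing LexiconNames=["MyTerms"] to synthesize_speech.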

The other thing that stands out is Polly's ability to create .MP3 files. You probably noticed the "Download .MP3" button back in Figure 1, but Polly does not limit you to creating short .MP3 files of Polly speaking a few lines of text. You can't do it through the GUI, but there is a way to upload a text file and have Polly convert it into spoken word in an .MP3 file. As an author, I am seriously considering using Polly to create audio versions of some of my books.
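One plausible way to script that conversion is sketched below in Python. Polly caps how much text a single request may contain, so the sketch breaks the file into chunks at sentence boundaries and appends each chunk's audio to one output file. The 1,500-character default is a conservative placeholder to check against the current documentation, the function names are illustrative, and simply joining the MP3 segments end to end is a simplification rather than a polished audiobook workflow. The synthesis step assumes boto3 with AWS credentials and a default region configured.

```python
import re


def split_into_chunks(text, max_chars=1500):
    """Split text into pieces of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def text_file_to_mp3(text_path, mp3_path, voice_id="Salli", max_chars=1500):
    """Read a text file, synthesize it chunk by chunk, and write one MP3."""
    import boto3  # imported lazily; only needed for the actual synthesis

    with open(text_path, encoding="utf-8") as source:
        chunks = split_into_chunks(source.read(), max_chars)
    polly = boto3.client("polly")
    with open(mp3_path, "wb") as audio_file:
        for chunk in chunks:
            response = polly.synthesize_speech(
                Text=chunk, VoiceId=voice_id, OutputFormat="mp3"
            )
            audio_file.write(response["AudioStream"].read())
    return len(chunks)
```

Splitting at sentence boundaries keeps each request from ending mid-sentence, which should help the pauses between segments sound natural.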

This brings up an important point. Although Polly seems to work really well for basic text-to-speech conversion, there are some things that Polly needs a bit of help with. For example, I wrote a book called Conversational Rocket Science. As you would probably expect of a book about rocket science and orbital mechanics, the book contains a lot of mathematical formulas. When I tried to get Polly to read the formulas, I found that even reading a simple formula was too much to ask. Polly ignored things like negative signs and fractions, and pronounced "Exp" as "E-X-P," rather than saying "exponential."

In spite of these flaws, I think that Polly is promising. It will likely be necessary, however, to tweak longer manuscripts to make them more Polly-friendly.
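As a rough illustration of the kind of tweaking I have in mind, the Python sketch below rewrites a formula string into words before it is handed to a speech engine. The function name and the three rules are my own assumptions, covering only the problems described above (leading negative signs, simple fractions and "Exp"). It is meant for short formula strings, not whole manuscripts, where a slash in ordinary prose would be rewritten as well.

```python
import re

# A minus sign at the start, or right after an operator or "(", is unary.
_UNARY_MINUS = re.compile(r"(^|[=(+*/,]\s*)-\s*(?=[\w(])")
# Simple fractions such as "x/2" or "1 / 3".
_FRACTION = re.compile(r"\b(\w+)\s*/\s*(\w+)\b")
# The function name "Exp" as a standalone word.
_EXP = re.compile(r"\bExp\b")


def make_speakable(formula):
    """Spell out signs, fractions and Exp so they are read as words."""
    formula = _UNARY_MINUS.sub(r"\1negative ", formula)
    formula = _FRACTION.sub(r"\1 over \2", formula)
    formula = _EXP.sub("exponential", formula)
    return formula
```

Rules like these would need to grow with each manuscript, and a custom lexicon entry may be the better fix for individual terms such as "Exp".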

About the Author

Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.
