We’re slowly reaching the point where artificial intelligence can replace human interaction, and we’ve just hit another milestone in that journey.
A research paper published by Google this month describes a text-to-speech system called Tacotron 2. In it, the researchers claim the AI can imitate a person talking with almost human-level accuracy.
Tacotron 2 is the second generation of the system, and it is composed of two machine learning networks. The first translates text into a spectrogram, which is a visual representation of how audio frequencies change over time. Basically, it lets the computer represent the words as a collection of sounds instead. The second network then takes that spectrogram and generates the matching audio to form words.
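To make the idea of a spectrogram more concrete, here is a minimal Python sketch that computes one from a short synthetic tone. This is not Tacotron 2's code: the system described above predicts a spectrogram from text, while this sketch goes the other way and derives one from audio, purely to show what the data looks like. The function name `spectrogram`, the frame size, the hop length, and the 8,000 Hz sample rate are all assumptions chosen for illustration.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this illustration)
FRAME_SIZE = 64     # samples per analysis frame (assumed)
HOP = 32            # samples between the starts of consecutive frames (assumed)


def spectrogram(samples, frame_size=FRAME_SIZE, hop=HOP):
    """Return a list of frames; each frame lists one magnitude per frequency bin."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window: fades the frame edges to reduce spectral leakage.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive discrete Fourier transform over the non-negative frequency bins.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A pure 1,000 Hz tone stands in for recorded speech.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone)

# Find the loudest frequency bin in the first frame and convert it to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames,", len(spec[0]), "bins, peak at", peak_hz, "Hz")
```

Each row of `spec` is one slice of time and each column is one frequency band, so the grid can be drawn as an image. For the test tone, the loudest bin in every frame sits at 1,000 Hz. Real speech would produce a far busier pattern, and it is that pattern the second network turns back into sound.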
Check out these two examples (here and here), for instance. In one, the sentence “George Washington was the first President of the United States” is spoken by a human; in the other, it is generated by Tacotron 2. We can’t tell which is which. The only way to really figure it out is to look at the URLs of the audio sources on the research page.
“That girl did a video about Star Wars Lipstick.”
But it’s not just simple text-to-speech. Tacotron 2 can also pronounce complex words, learn emphasis, and make allowances for typos.
While this is a huge step forward in AI research, it is also a breakthrough that can be applied to services immediately, especially Google Assistant. The only problem so far is that the system has been trained to speak in one particular woman’s voice. To speak like a man, or even a different woman, it would have to be trained again.