by Margarita Grubina – Jan 12, 2021 9:18:55 AM • 8 min

Four of the Most Common Synthetic Speech Problems and How to Solve Them

If you've discovered this blog, chances are you are already familiar with the concept of AI generated speech. Usually, synthetic speech is generated from text (text-to-speech). These days, we often hear about this type of speech when discussing apps like Google Maps. The natural audio quality of synthetic speech has made considerable gains in the past few years due to a revolution in artificial intelligence (deep learning).

Google Maps is light years ahead of the Stephen Hawking voice. But it still struggles with unusual words and puts emphasis in the wrong places. The problem gets even worse if a dramatic and emotional performance is desired. Imagine watching an entire movie with the characters voiced by Google Maps.

Unfortunately, artificial intelligence (AI) won't be able to completely solve this problem until it develops the ability to perform method acting and listen to the director's tips.

Here at Respeecher, we're developing a different kind of technology. We use artificial intelligence to synthesize speech, but we don't use text at all. Our software does speech-to-speech voice conversion: instead of replacing a human being, it allows a person to speak in a different voice.

You can read more about it in our article Respeecher Explained: The Speech Synthesis Software for Content Creators. But for now, let's dive into the most common problems found in synthetic speech. These problems affect all AI voices, whether generated from text or from actual speech, but not always equally.

Synthetic Speech Problems and Their Solutions

1. Pronunciation errors

There are two main types of pronunciation errors made by synthetic speech systems. The first is that text-to-speech (TTS) systems often just don't know how to pronounce a word. Think how tricky this can be for unusual words spelled in strange ways, or for homographs, words that are spelled the same but pronounced differently, like "put" the very common verb and "put" the less common verb that has to do with golf.
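The homograph case can be sketched as a lookup table. This is a minimal illustration, not how any particular TTS product works: the lexicon entries, the sense tags, and the ARPAbet-style phoneme strings are all assumptions chosen for the example.

```python
# Toy pronunciation lexicon. The entries, sense tags, and ARPAbet-style
# phoneme strings are illustrative assumptions, not a real TTS lexicon.
LEXICON = {
    "lead": {"verb": "L IY D", "noun_metal": "L EH D"},
    "bass": {"fish": "B AE S", "music": "B EY S"},
    "cat": {"default": "K AE T"},
}


def pronounce(word, sense=None):
    """Return a phoneme string, or None if the word is not in the lexicon.

    Raises ValueError when a homograph is looked up without a usable sense.
    That is the situation in which a text-only system has to guess.
    """
    entries = LEXICON.get(word.lower())
    if entries is None:
        return None  # out-of-vocabulary: the system "doesn't know" the word
    if len(entries) == 1:
        return next(iter(entries.values()))  # only one pronunciation exists
    if sense not in entries:
        raise ValueError(
            f"ambiguous homograph {word!r}: need one of {sorted(entries)}"
        )
    return entries[sense]
```

Calling `pronounce("lead", "verb")` and `pronounce("lead", "noun_metal")` returns two different phoneme strings for the same spelling. The text alone does not say which one is meant, so a real system would have to infer the sense from context.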

Speech-to-speech (STS) systems almost entirely avoid this kind of pronunciation error, and if it does happen, it is generally the fault of the source speaker, not of the system. The other type of pronunciation mistake has to do with pronouncing a sound unclearly or substituting one sound for another.

Actually, the result can be the same as the result of not knowing how to pronounce a word -- the system might substitute the "u" of one "put" for the "u" of the other. But the origin is different. Some older text-to-speech systems are hand-constructed. They are incredibly consistent in how they pronounce words, so although they may sound very unnatural (like Stephen Hawking's voice) they are essentially immune to making this mistake.

Paradoxically, this second kind of error, a wrong or unclearly pronounced sound, affects the most up-to-date systems the most, because they rely most heavily on artificial intelligence. And although it affects both TTS and STS, it affects STS more: textual input is very consistent (the same word always appears the same way), while a spoken word can be pronounced in all kinds of ways.

At Respeecher, we use many different proprietary algorithmic techniques to fight mispronunciation in voice AI, and we are always looking for new ones. But one thing that usually helps is more data. Currently, for the best results, we ask customers for one hour of recordings each from the person whose voice we are cloning and from the person whose voice will be converted.

2. Prosody issues

Modern TTS systems have good audio quality, though they still have difficulty pronouncing uncommon words. But probably the worst problem they suffer from is unnatural prosody. "Prosody" is a catch-all term for rhythm, intonation, and, in general, features of speech that span multiple words.

Prosody is difficult for TTS because to really nail it, a system needs to understand the meaning of what it is saying. There is an infinite variety of ways to say something at the prosodic level, unlike at the phonetic level, where there is typically just one way to pronounce a word (in a given dialect).
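To get a feel for how quickly the prosodic options multiply, here is a small sketch that takes one sentence and produces a variant for each possible placement of strong emphasis. It writes the variants as SSML, the W3C markup that many TTS services accept. Whether a given engine honors `<emphasis>` is engine-specific, and both the function name and the sample sentence are assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def emphasis_variants(sentence):
    """Yield one SSML string per word, with that word strongly emphasized."""
    words = [escape(w) for w in sentence.split()]
    for i in range(len(words)):
        marked = [
            f'<emphasis level="strong">{w}</emphasis>' if j == i else w
            for j, w in enumerate(words)
        ]
        yield "<speak>" + " ".join(marked) + "</speak>"


# Four words give four readings, each suggesting a different meaning.
variants = list(emphasis_variants("I never said that"))
```

This covers emphasis only. Rhythm, pitch, and pausing add further dimensions, and nothing in the text itself tells a system which variant the speaker intended.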

Speech-to-speech has a natural advantage in prosody over TTS because it excels at duplicating the source speaker's prosody (and the source speaker, hopefully, does understand the text). Respeecher's technology produces far more natural sounding prosody than TTS systems. It offers an infinite prosodic palette for content creators.

On the other hand, even if it could solve the problem of producing natural prosody, TTS would not be able to produce the perfect performance for any directorial intent. And a big part of the performance (though by no means all) is prosody. TTS is and will remain unsuitable for many applications because of this fundamental difference.

3. Vocoding and audio quality issues

Compared to pronunciation and prosody errors, vocoding and audio quality issues are a technical problem that continues to be resolved as technology improves, at least for cases where high quality training data and data to convert are available.

We all have an intuitive understanding of audio quality, but what exactly is a vocoder? What does it have to do with audio quality? Both TTS and STS systems often work internally with signals that vary much more slowly than a waveform.

This makes intuitive sense. A high-quality waveform needs to be sampled about 44,000 times per second, but the physical parameters of sound change only about 100 times per second. The control signal that the human brain supplies to create speech has even lower timing precision, especially if we consider how often we tend to change the sound we are producing.
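The gap between the two rates is easy to work out. The sketch below assumes the CD-quality sample rate of 44,100 Hz (the "about 44,000" above) and takes the figure of 100 updates per second at face value. The constant and function names are made up for the example.

```python
SAMPLE_RATE_HZ = 44_100  # assumed CD-quality rate ("about 44,000" in the text)
FRAME_RATE_HZ = 100      # assumed update rate of the slowly varying signal


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """How many waveform samples correspond to one slow-signal frame."""
    return sample_rate_hz // frame_rate_hz


def counts(duration_s, sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_hz=FRAME_RATE_HZ):
    """Return (waveform samples, slow-signal frames) for a clip of this length."""
    return round(duration_s * sample_rate_hz), round(duration_s * frame_rate_hz)


ratio = samples_per_frame()          # 441 samples for every frame
n_samples, n_frames = counts(10)     # a 10-second clip: 441000 samples, 1000 frames
```

Under these assumptions, every frame of the slow signal stands in for 441 waveform samples. That is why systems prefer to work at the frame level, and why a final step is needed to turn frames back into sound.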

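Working with a signal that varies too quickly is computationally inefficient. It can also obscure the true nature of the underlying control signals that produce speech. The vocoder is the component that converts the slowly varying internal signal back into a waveform, so any mistakes it makes are heard directly as audio quality problems.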

Some of the most common issues here are noises, clicks, and other sound artifacts that shouldn't be present in AI generated speech. In fact, it's impossible to catalog all of the vocoding issues. Many of them are hard to describe in words because they represent a variety of sound distortions.

4. Speaker identity issues

The degree to which an AI generated voice sounds like the voice of the target speaker is called speaker identity. Speaker identity problems are common for both TTS and STS technologies.

The issue usually lies in a lack of original audio data for the synthesis system to learn from. An hour-long audio recording of the original speaker should almost completely eliminate the problem. The more variety a recording contains, including different intonations, emotions, and tempos, the more accurate the AI-generated speech will be.

For cases where the client doesn't have high-resolution sources available, Respeecher has built an audio version of a super-resolution algorithm to deliver the highest resolution audio across the board. Learn more about increasing audio resolution with Respeecher.

At Respeecher, we are continually working to gain more control over the aspects of voice cloning that are possible to transfer and convert. This helps not only with mimicking speaker identity, but with accent as well.

Respeecher can help with dubbing into a foreign language while keeping the voice of the original actor, and with letting people speak in their own voices with foreign accents. Imagine hearing someone speak a language you don't know using your voice.

How to choose the right solution for voice conversion

Now that we've taken a quick glance at the main issues of AI generated speech, you are well equipped to choose the best solution for your needs.

If you need an excellent generic TTS solution, Google is one of the best options available. With its Cloud Text-to-Speech, you can expect some of the best quality on the market. However, it will still contain prosody and vocoding issues. And you cannot use it to mimic a particular person's voice like you can with STS technology.

Keep in mind that other TTS providers may have systems that sound more natural than Google's in some cases, though they may be less robust to specific phonetic issues or have worse vocoding.

Nevertheless, a major advantage of other text-to-speech providers is that they can offer voices and speaking styles that Google does not. These sometimes include custom voices, just like Respeecher provides with its speech-to-speech technology.

For additional dialogue recording (ADR) or any other use case where you need to re-create a particular voice, speech-to-speech voice conversion is a game-changer. With an hour-long original speech sample of the consenting speaker, Respeecher can help you create unlimited speech content.

Contact us today and see for yourself why Hollywood studios and sound engineers are so excited about Respeecher's AI voice generator technology.

FAQ

Synthetic speech is generated by AI speech synthesis algorithms, often using text-to-speech (TTS) or speech-to-speech (STS) technologies. In TTS, text is converted to audio, while STS modifies an existing voice into another. The result is synthetic audio that mimics natural speech, with synthetic speech quality improving through deep learning.

Respeecher’s speech-to-speech technology converts one person’s voice to another's, unlike text-to-speech (TTS), which generates audio from text. Respeecher excels in maintaining natural prosody and speaker identity, offering a more accurate and flexible solution compared to TTS, particularly for voice cloning and ethical AI applications.

Challenges in AI voice synthesis include pronunciation errors, unnatural prosody, and vocoding issues. Synthetic speech systems struggle with proper intonation and prosody (rhythm and flow), particularly when converting unusual words. Respeecher works to solve these through better data and advanced algorithms to improve synthetic audio quality.

Respeecher ensures ethical use by obtaining consent from voice owners before cloning. The company adheres to guidelines for ethical voice cloning, prioritizing the responsible application of speech-to-speech technology and mitigating misuse, such as in deepfake scenarios. This maintains transparency and respect for individuals' identities.

The future of AI voice cloning and synthetic speech includes voice conversion for dubbing, personalized AI voices, and realistic synthetic audio for entertainment, education, and accessibility. Respeecher’s advanced vocoder technology and synthetic speech innovation will play a key role in AI-powered filmmaking and ethically replicating voices for storytelling.

Glossary

Synthetic Speech

AI-generated audio created through speech-to-speech technology or TTS, using AI voice cloning software like Respeecher. It improves synthetic audio quality with advanced vocoder technology and accurate prosody in AI voices, enabling ethical voice cloning and personalized speech.

Speech-to-Speech Technology (STS)

A process in which AI voice cloning software like Respeecher converts one person's speech into another person's voice, keeping the natural prosody of the original performance and using advanced vocoder technology for high synthetic audio quality and ethical voice cloning.

AI Voice Cloning Software

Tools like Respeecher use AI speech synthesis and speech-to-speech technology to generate synthetic speech with high synthetic audio quality, ethical voice cloning, and natural prosody in AI voices.

Prosody

In AI speech synthesis, prosody refers to the rhythm, intonation, and emphasis in synthetic speech, impacting AI voice cloning software like Respeecher and speech-to-speech technology.

Vocoding

In AI speech synthesis, vocoding converts the slowly varying signals that a synthesis system works with internally into an audio waveform, which makes it central to synthetic speech quality. It's key for AI voice cloning software like Respeecher and speech-to-speech technology.

Margarita Grubina

Business Development Executive

Margarita drives Respeecher's growth through strategic market analysis and nurturing client relations. Her role is pivotal in discovering and tapping into new market opportunities, as well as maintaining strong connections with clients. She combines her industry expertise with a forward-thinking approach, ensuring Respeecher's offerings resonate with evolving market needs in the dynamic field of voice AI technology.