Sep 20, 2022
The benefits of speech synthesis in the film industry
Speech synthesis allows audio engineers to replicate anyone’s voice. Once the voice model has been created, it can be reused as many times as needed — from dubbing an actor's voice in post-production to bringing back the voice of an actor who passed away.
With voice cloning, a film or TV creator is able to streamline processes related to film production. Some of the benefits include:
- Flexibility. It’s hard to schedule top actors for voiceover or dubbing work. Our system lets you reuse any voice at scale and record whenever it suits your production.
- Saving time. No need to waste time bringing a high-demand actor back to the recording studio over and over again.
- Resurrecting voices from the past. You can bring back the voice of an actor who has passed away, or rejuvenate the voices of actors your project needs. For example, an AI voice was used in The Mandalorian for Luke Skywalker’s reveal: a cloned voice stood in for Mark Hamill, who is now 70 years old.
- Adding dialog. Decided to add a few lines after filming? Just turn on your microphone and start speaking; there’s no need to bring actors back into the studio.
- Recording any voice in any language. Looking to market to an overseas audience? Our language-agnostic technology empowers you to record in any language. Need it in Chinese, Spanish, or Italian? No problem.
- Replicating children’s voices. Kids say the darndest things — but they’re challenging to work with. With Respeecher, an adult actor sounds just like a kid.
Dubbing and localization with the help of speech synthesis
Two of the most popular application areas of speech synthesis in the film industry are dubbing and localization.
In addition to sounding unnatural, classical dubbing has one huge drawback. Having to adjust the localized text to an actor's facial expressions often means changing the meaning of the dialog itself.
In general, this results in a less pleasant experience for the viewer. Subtitles preserve the actor’s authentic voice, but viewers who have to read text at the bottom of the screen never get quite the same experience as the native audience.
This is where AI and synthetic dubbing technology come to the rescue.
The traditional dubbing process is pretty straightforward but challenging to execute.
First, a producer locates a studio that can dub in the language they need.
The producer then sends the original video material and the texts for every dialog to the studio.
The studio then casts voice actors (often the same people who voice dozens of films in their country every year).
Then the complex dubbing process begins. Actors work in the studio, reading the dialogs to match what is happening on the screen, taking into account the expressions of the original actors.
The audio directors then mix the new audio track with the video. And voila, the movie is ready to be distributed to local cinemas. This process has several significant disadvantages, both in terms of viewing experience and production.
- High cost. The exact figure is hard to pin down, but traditional dubbing can reasonably run from $100,000 to $150,000 per language for a single film.
- Slow turnaround. Although voice acting takes less time than creating original content, a proper dub can take months to complete.
- Dubbing overshadows the original acting, as mentioned at the beginning of this post.
AI and deepfake dubbing technology eliminates just about every difficulty of the traditional approach and, importantly, does not introduce new complications of its own.
In short, a neural network studies footage of an actor delivering their original dialog and learns the characteristic movements of their face.
The same network then analyzes the same features in people speaking a different language. Thus, when the foreign language dub is ready, the network can edit the original actor's face to perfectly lip-sync with the foreign dialog.
Voice cloning technology introduces an entirely new set of tools. Respeecher allows movie producers and content creators to make anyone sound as if they are someone else.
Combined with the modified facial animation, the actor’s original voice can be carried over into another language, so the dubbed voice matches the on-screen facial expressions. The dub is produced to give the impression that the actor is really speaking, say, Chinese or Japanese; viewers would never suspect that the actor cannot actually speak their language.
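As a loose illustration of the retargeting idea above, here is a toy sketch (not Respeecher’s actual pipeline): the phonemes of the dubbed dialog are mapped to mouth shapes (visemes), producing per-frame animation targets for editing the on-screen face. The phoneme table and frame rate are hypothetical, invented for this example.

```python
# Toy lip-sync retargeting sketch. The viseme table below is a made-up
# illustration, NOT a real production mapping.

# Hypothetical phoneme -> mouth-openness table (0 = closed, 1 = fully open)
VISEME_OPENNESS = {
    "m": 0.0, "b": 0.0, "p": 0.0,   # bilabials: lips closed
    "f": 0.2, "v": 0.2,             # labiodentals: nearly closed
    "i": 0.4, "e": 0.5,             # close/mid vowels
    "a": 1.0, "o": 0.7,             # open/rounded vowels
}

def lipsync_track(phonemes, frames_per_phoneme=3):
    """Expand a phoneme sequence into per-frame mouth-openness keyframes,
    linearly interpolating between successive visemes."""
    targets = [VISEME_OPENNESS.get(p, 0.5) for p in phonemes]
    track = []
    # Pair each viseme with the next one (the last repeats) and interpolate
    for a, b in zip(targets, targets[1:] + [targets[-1]]):
        for f in range(frames_per_phoneme):
            w = f / frames_per_phoneme
            track.append(round(a * (1 - w) + b * w, 3))
    return track

# Re-dub the line with different phonemes: the animation follows the new audio
print(lipsync_track(["m", "a", "m", "a"]))
```

A real system learns this mapping (and far subtler facial detail) from footage rather than using a lookup table, but the principle is the same: the new audio drives the face, not the other way around.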
So, is it possible to achieve realistic speech synthesis?
Digital voices have been around for a long time. Traditional systems, including the original voices of Siri and Alexa, stitched together prerecorded words and sound fragments, a technique known as concatenative synthesis. The result always sounded somewhat awkward, and that “I’m talking to a robot” vibe was unavoidable. Making those voices sound more natural was a laborious manual task in the absence of today’s sophisticated voice modeling algorithms.
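A toy sketch of why glued-together units sound robotic: when two prerecorded snippets are joined with mismatched phase, the waveform jumps at the seam, which the ear hears as a click. All values here (sine "units" standing in for speech snippets, the sample rate, the fade length) are illustrative only.

```python
import math

SR = 16_000  # sample rate (Hz)

def tone(freq, dur, phase=0.0):
    """A sine-wave 'unit', standing in for a prerecorded speech snippet."""
    n = int(SR * dur)
    return [math.sin(2 * math.pi * freq * t / SR + phase) for t in range(n)]

def hard_join(a, b):
    """Naive concatenation: glue units end to end, seams and all."""
    return a + b

def crossfade_join(a, b, fade=160):
    """Smoothed join: linearly crossfade the overlapping `fade` samples."""
    out = a[:-fade]
    for i in range(fade):
        w = i / fade
        out.append(a[len(a) - fade + i] * (1 - w) + b[i] * w)
    out.extend(b[fade:])
    return out

def worst_step(sig):
    """Largest sample-to-sample jump: a rough proxy for an audible click."""
    return max(abs(sig[i + 1] - sig[i]) for i in range(len(sig) - 1))

# Two units at the same pitch but mismatched phase, as with real recordings
u1 = tone(220, 0.05)
u2 = tone(220, 0.05, phase=math.pi / 2)

print(worst_step(hard_join(u1, u2)) > worst_step(crossfade_join(u1, u2)))
```

Production concatenative systems used far more careful unit selection and smoothing than this, but seam artifacts of exactly this kind were a core reason those voices never sounded fully human.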
Deep learning has changed all that. Voice designers no longer need to program the exact tempo, pronunciation, or intonation of generated speech. Instead, they can feed several hours of audio into the algorithm and the system will learn those patterns on its own.
Today, high-quality sound and the overall increase in the “humanity” of digital voices appeal to a growing number of filmmakers. Recent advances in deep learning make it possible to reproduce many of the subtleties of human speech: these voices feature natural pauses and can even convey aspiration or audible breaths. Digital voices can also change their style or emotion on demand.
AI voices are scalable and easy to work with. Unlike a recording of an actor's human voice, synthetic voices can also change timbre, emotionality, and other vocal parameters in real time, opening up new possibilities for personalizing ads.
Creating a convincing synthetic voice requires attention to detail. It also needs natural variability: slight inconsistencies, expressiveness, and the ability to deliver the same line in completely different styles depending on the context. Achieving these results requires quality samples of real human speech.
And where do those samples come from? You need to find the right voice actors to record the relevant dialog; experts then use those recordings to train and refine the deep learning models. With this in mind, voice actors’ concerns about losing their work to AI voices are unfounded. Check out this blog post where we discuss the topic in more detail.