Voice Cloning | Blog - Respeecher

Code[ish] Podcast: The Ethical and Technical Side of Deep Fakes featuring Respeecher

Written by Orysia Khimiak | Jan 28, 2021 1:02:06 PM

We were recently invited to talk about deep fakes on Code[ish], a podcast from the developer advocate team at Heroku, a Salesforce company, that explores code, technology, tools, tips, and the life of the developer.

During two episodes hosted by Julián Duque, our CEO Alex Serdiuk and CTO Dmytro Bielievtsov talk about ethical deep fakes and the technical aspects of creating them.

Listen to The Ethical Side of Deep Fakes and The Technical Side of Deep Fakes to broaden your knowledge about synthetic media, specifically voice synthesis.

The synthetic media industry is built on AI-generated media, spanning text, music, video, image, and voice generation. For example, tools like CGI and Photoshop produce synthetic media because they help people create modified content. At this point, synthesized video is much more advanced than synthesized audio.

The ethical side of deep fakes

Alex explains where Respeecher fits into all of this. We aim to revolutionize the way content is produced by bringing more flexibility to industries like entertainment, video games, and advertising through our speech-to-speech voice conversion technology.

Voice conversion use cases

  • Scale any voice: It’s hard to schedule top actors for voiceover or dubbing work. Voice cloning (or conversion) lets you scale any voice and gives you the flexibility to record new lines anytime.

  • Resurrect voices from the past: Bring back the voice of an actor who has passed away. Maybe you want to add a historical voice to a project.

  • Record any voice in any language: Ready to capture an overseas audience? Language-agnostic speech-to-speech technology lets you record in any language.

  • Add dialogue anytime: Decided to add a few lines after filming? Just turn on your microphone and start speaking - without calling an actor back into the studio.

  • Replicate children’s voices: Kids say the darndest things - but they’re challenging to work with. Read more on this topic in our article about voice conversion for children’s voices.

The technical side of deep fakes

Dmytro explains how synthetic audio is produced and why it’s hard to fake. In general terms, a few speech machine learning (ML) models are already available on the internet, but the best way to clone a voice with good quality is to use an audiobook as a sample of the original speaker and combine it with these pre-existing models.

The problem is that the outputs of these models are poor in quality, which is one reason why speech-to-speech technology is hard to fake. Linguistic variation, speech patterns, and the emotional component of speech all make voice cloning difficult.

In fact, unlike text-to-speech (TTS) technologies, which may produce dull content, speech-to-speech (STS) software generates more natural content by preserving the intonation and emotion of the original speaker.

Main concerns about the usage of unethical deep fakes

The main concern about deep fakes is the risk that they will be used unethically, to make it appear that someone said or did something that never happened. But it’s also true that every technology can be put to malicious use in the wrong hands.

Dmytro offers more detail about what deep fakes are from a technical point of view: a technology that uses deep learning and deep neural networks to fake a piece of video or audio and replace it with other video or audio content.

The term “deep fake” is now used for anything that involves synthesizing human appearance with a neural network.