by Orysia Khimiak – Jan 28, 2021 8:02:06 AM • 8 min

Code[ish] Podcast: The Ethical and Technical Side of Deep Fakes featuring Respeecher

We were recently invited to talk about deep fakes on Code[ish], a podcast from the developer advocate team at Heroku (a Salesforce company) that explores code, technology, tools, tips, and the life of the developer.

During two episodes hosted by Julián Duque, our CEO Alex Serdiuk and CTO Dmytro Bielievtsov talk about ethical deep fakes and the technical aspects of creating them.

Listen to The Ethical Side of Deep Fakes and The Technical Side of Deep Fakes to broaden your knowledge about synthetic media, specifically voice synthesis.

The synthetic media industry is built on AI-generated media, including text, music, video, image, and voice generation. Tools such as CGI and Photoshop also produce synthetic media, because they help people create modified content. At this point, synthesized video is much more advanced than synthesized audio.

The ethical side of deep fakes

Alex explains where Respeecher fits into all of this. We aim to revolutionize the way content is produced by bringing more flexibility to industries like entertainment, video games, advertising, and more through our speech-to-speech voice conversion technology.

Voice conversion use cases

It’s hard to schedule top actors for voiceover or dubbing work. Voice cloning (or conversion) allows you to scale any voice and gives you the flexibility to record new lines anytime.
Resurrect voices from the past: Bring back the voice of an actor who has passed away. Maybe you want to add a historical voice to a project.
Record any voice in any language: Ready to capture an overseas audience? Speech-to-speech language agnostic technology empowers you to record in any language.
Add dialogue anytime: Decided to add a few lines after filming? Just turn on your microphone and start speaking - without calling an actor back into the studio.
Replicate children’s voices: Kids say the darndest things - but they’re challenging to work with. Read more on this topic in our article about voice conversion for children’s voices.

The technical side of deep fakes

Dmytro explains how synthetic audio is produced and why it’s hard to fake. In general terms, a few speech machine learning (ML) models are already available on the internet, but the best way to produce a high-quality voice clone is to use an audiobook as a sample of the original speaker and combine it with these pre-existing models.

The problem is that the outputs these models produce are poor in quality, which is one reason why audio is hard to fake. Human linguistic variations and patterns, along with the emotional component of speech, make voice cloning difficult.

In fact, unlike text-to-speech (TTS) technologies, which may produce dull content, speech-to-speech (STS) software generates more natural content by preserving the intonation and emotion of the original speaker.

Main concerns about the usage of unethical deep fakes

The main concern about deep fakes is the risk that they will be used unethically, to pretend that someone said or did something that never happened. But it’s also true that all technologies have potentially malicious uses in the hands of the wrong people.

Dmytro offers more details about what deep fakes are from a technical point of view: a technology that uses deep learning and deep neural networks to fake video or audio by replacing it with another piece of video or audio content.

The term “deep fake” is now used for anything that involves synthesizing human appearance using a neural network.

The biggest potential danger of deep fakes is not their existence, but people’s inability to detect them.

Respeecher works with leading Hollywood movie studios, game developers, and major multinational corporations and has strict ethical principles.

We do not use voices without permission when this could impact the privacy of the subject or their ability to make a living. In practice, this means we will never use the voice of a private person or an actor without permission.

These principles help ensure that content produced by Respeecher can't be faked or used in abusive ways. As a further safeguard, we apply a watermark to each piece of audio: certain "artifacts" are embedded into the audio that are imperceptible to humans but easily identifiable by a computer program.
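As a rough illustration of the general idea behind audio watermarking (not Respeecher's actual method, which this post doesn't describe), here is a minimal Python sketch of a toy spread-spectrum watermark. It assumes audio is a plain list of float samples, adds a faint pseudo-random pattern derived from a secret key, and detects it by correlating the audio with that same pattern. The function names, the `strength` value, and the detection threshold are all hypothetical choices made for this example.

```python
import math
import random


def keyed_sequence(key: int, length: int) -> list[int]:
    """Return a reproducible pseudo-random sequence of +1/-1 values for a key."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(length)]


def embed_watermark(samples: list[float], key: int, strength: float = 0.01) -> list[float]:
    """Add a faint keyed +/-strength pattern to every sample (toy scheme)."""
    pattern = keyed_sequence(key, len(samples))
    return [s + strength * p for s, p in zip(samples, pattern)]


def detect_watermark(samples: list[float], key: int, strength: float = 0.01) -> bool:
    """Correlate the audio with the keyed pattern and compare to a threshold."""
    pattern = keyed_sequence(key, len(samples))
    score = sum(s * p for s, p in zip(samples, pattern)) / len(samples)
    # Assumed threshold: halfway between "no watermark" (~0) and "watermark" (~strength).
    return score > strength / 2


# Stand-in "audio": a 220 Hz tone, 200,000 samples at an assumed 16 kHz rate.
host = [0.3 * math.sin(2 * math.pi * 220 * t / 16_000) for t in range(200_000)]
marked = embed_watermark(host, key=1234)

print(detect_watermark(marked, key=1234))  # watermarked audio, matching key
print(detect_watermark(host, key=1234))    # unmarked audio
print(detect_watermark(marked, key=9999))  # watermarked audio, wrong key
```

With the matching key, the correlation score sits near `strength`, while for unmarked audio or a wrong key it stays near zero. A production watermark would also need to survive compression, resampling, and editing, which this sketch doesn't attempt.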

Conclusion

Our mission is to make sure that voice cloning technology is used in beneficial ways, according to ethical principles. Our goals are very clear and we intend to:

Educate the public about the capabilities of synthetic speech technology;
Develop automatic detection algorithms that can detect AI voices even if they have not been watermarked by us;
Work with gatekeepers of content such as Facebook and YouTube to limit the harm of voice cloning.

Follow our journey on Facebook, Twitter, LinkedIn, and YouTube, and reach out if you’d like to find out more about our AI voice generator software and how you can use it for your content creation project.

FAQ

Deep fake technology utilizes deep learning and neural networks to create realistic fake audio and video content. It’s used in media production, entertainment, advertising, and even historical recreations, allowing for synthetic content generation while raising concerns about ethical implications in its use.

Respeecher ensures ethical AI practices by never using voices without permission, protecting privacy, and applying watermarks to synthetic audio production. The company also collaborates with content gatekeepers like Facebook and YouTube to prevent the malicious use of AI voice cloning software and deep fakes.

The synthetic media industry benefits entertainment, video games, advertising, and education, with AI voice cloning software, deep fake technology, and speech-to-speech solutions providing scalability and flexibility in voice conversion and media production.

Speech-to-speech technology (STS) converts one speaker’s voice into another, preserving emotional tone and intonation, while text-to-speech (TTS) generates synthetic speech from text, often sounding less natural. STS provides more accurate AI speech synthesis, ideal for synthetic audio production and voice conversion.

Safeguards include watermarks embedded into synthetic audio, strict permissions for using voices, and collaboration with platforms like Facebook and YouTube to combat unethical deep fakes. Respeecher also works on developing algorithms for detecting synthetic media to ensure ethical AI applications.

Glossary

Deep Fake Technology

A technique using deep learning for media to create realistic fake content, such as audio and video, powered by AI voice cloning software and synthetic media industry tools.

AI Voice Cloning Software

A tool using deep learning for media and AI speech synthesis to replicate voices, enabling voice conversion technology and synthetic audio production for ethical applications.

Synthetic Media Industry

An evolving sector leveraging deep fake technology, AI voice cloning software, and synthetic audio production to create content with ethical AI applications.

Speech-to-Speech Technology (STS)

A voice conversion technology that uses AI voice cloning software for synthetic audio production, enhancing deep fake technology with ethical AI applications.

Ethical Deep Fakes

Responsible use of deep fake technology and AI voice cloning software, ensuring synthetic media production aligns with ethical AI applications and voice conversion technology.

Orysia Khimiak

PR and Comms Manager

For the past 9 years, I have been engaged in global PR for early-stage and AI startups, in particular Reface, Allset, and now Respeecher. My clients have been featured in WSJ, Forbes, Mashable, The Verge, TechCrunch, and the Financial Times. For over a year, I have been teaching the PR Basics course at Projector. During the war, I became more actively involved as a fixer and worked with the BBC, The Guardian, and The Times.