Aug 22, 2022 8:39:45 AM
What Is a Deepfake Voice?
A deepfake is a video, audio recording, or photo that appears genuine but is actually the product of manipulation by artificial intelligence (AI). The word itself is a blend of "deep learning" and "fake," and it was popularized in 2017 by an anonymous Reddit user who posted manipulated videos under that name. The underlying technology, the generative adversarial network (GAN), was introduced in 2014 by Ian Goodfellow, who later became director of machine learning in Apple's Special Projects Group.
Deepfakes are created with generative adversarial networks. A GAN pits two neural networks against each other: a generator that produces fakes and a discriminator that tries to tell them apart from real examples. Every time the discriminator catches a fake, the generator adjusts, and the two keep competing until the generator's output is convincing enough to pass for the real thing.
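The competition described above can be sketched in a few lines of code. The toy below is an illustrative numpy sketch, not any production system: a one-parameter generator learns to imitate one-dimensional "data" purely from the feedback of a logistic-regression discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def real_samples(n):
    # "Real" data: samples from a Gaussian centered at 4.0.
    return rng.normal(4.0, 0.5, size=n)

g_w, g_b = 1.0, 0.0   # generator: turns noise z into a sample, fake = g_w*z + g_b
d_w, d_b = 0.1, 0.0   # discriminator: logistic regression, outputs P(sample is real)

lr, batch = 0.05, 32
for step in range(4000):
    z = rng.normal(size=batch)
    fake = g_w * z + g_b
    real = real_samples(batch)

    # Discriminator step: push real toward label 1, fake toward label 0.
    grad_real = sigmoid(d_w * real + d_b) - 1.0   # dBCE/dlogit for label 1
    grad_fake = sigmoid(d_w * fake + d_b)         # dBCE/dlogit for label 0
    d_w -= lr * (grad_real @ real + grad_fake @ fake) / batch
    d_b -= lr * (grad_real.sum() + grad_fake.sum()) / batch

    # Generator step: adjust so the discriminator mistakes fakes for real.
    g_grad = (sigmoid(d_w * fake + d_b) - 1.0) * d_w
    g_w -= lr * (g_grad @ z) / batch
    g_b -= lr * g_grad.sum() / batch
```

After training, the generator's samples typically end up in the neighborhood of the real mean of 4.0, even though the generator never sees the real data directly; it learns only from the discriminator's verdicts.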
With the advancement of AI technology, creating deepfakes is getting easier. To produce a deepfake voice, it is enough to record your voice for a while, avoiding slips of the tongue and background noise, and then send the recording for processing to a company that provides such a service.
A couple of years ago, the most realistic audio deepfakes were created by recording a person's voice, dividing their speech into component sounds, and then recombining those sounds to form new words. Now, thanks to adversarial training, neural networks can learn from speech data of almost any quality and volume, reproducing a real person's speech faster and more accurately. As a result, where systems once required tens or even hundreds of hours of audio, realistic voices can now be generated from just a few minutes of material.
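The older, concatenative approach mentioned above can be illustrated with a toy sketch. The "units" here are placeholder numpy arrays standing in for snippets cut from a speaker's recordings; a real system stores thousands of such units along with pitch and timing metadata.

```python
import numpy as np

# Pretend each unit below was cut from a speaker's recorded speech.
# Constant arrays stand in for real audio snippets.
unit_bank = {
    "h":  np.full(160, 0.1),
    "eh": np.full(320, 0.2),
    "l":  np.full(160, 0.3),
    "ow": np.full(320, 0.4),
}

def synthesize(units, crossfade=32):
    """Concatenate recorded units, crossfading at each joint to hide seams."""
    out = unit_bank[units[0]].copy()
    for name in units[1:]:
        nxt = unit_bank[name]
        fade = np.linspace(1.0, 0.0, crossfade)
        # Overlap-add: blend the tail of the output with the head of the next unit.
        out[-crossfade:] = out[-crossfade:] * fade + nxt[:crossfade] * (1 - fade)
        out = np.concatenate([out, nxt[crossfade:]])
    return out

hello = synthesize(["h", "eh", "l", "ow"])
```

New "words" are formed simply by looking up and joining units in a new order, which is exactly why such systems needed large recorded inventories and still sounded choppy at the seams.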
The Difference Between Synthetic Voice and Deepfake Voice
Although synthetic voices and deepfake voices share the same technical basis, they carry different connotations and serve different purposes.
Both use AI to generate a clone of a person's voice, and the technology can replicate a human voice with great accuracy in tone and likeness.
However, while synthetic voices serve legitimate business, entertainment, and accessibility purposes, deepfake voices are usually associated with copying a human voice in order to deceive someone.
Respeecher develops synthetic voices and does not allow any deceptive use of its technology. We work directly with clients we trust and require written consent from voice owners. These and other safeguards help us create speech synthesis content that does not harm people.
Industries that Benefit from Synthetic Voices
Industry leaders and fast-growing AI companies prefer the term synthetic media over deepfake, both as a label and as a statement of intent. These advancements have helped synthetic media grow popular across multiple industries, including entertainment, film, marketing, healthcare, and customer service.
Below are the technology’s most common use cases:
Film and TV. Synthetic voices are used to dub an actor’s voice in post-production, and even to revive the voice of an actor who has long since passed away.
Animation. Speech synthesis software allows your animations to speak the way you want them to.
Game development. Voice synthesis is regularly used to voice video game characters so that they sound exactly like the specific people they are based on.
Podcasts and audiobooks. A narrator’s voice can be changed to the author’s voice, allowing the audience to listen to the author reading their own words.
Advertising. Speech synthesis helps tailor your ads to particular audiences, for example by using region-specific pronunciation.
Dubbing. The technology allows you to streamline the dubbing process, making it more agile.
Cross-language localization. With the recent advancements in voice cloning technology, it is possible to make one person speak another language in their own voice. Recently, Respeecher launched the campaign Speak Ukrainian, during which Abby Savage, Maye Musk, Anna Ganguzza, and other celebrities spoke fluent Ukrainian in support of the Ukrainian people and their country.
Resurrecting famous voices. Perhaps an actor passed away or quit before a project could be finished. Maybe you want to add a historical voice to a project or rejuvenate an aging one. All of this is possible with AI-generated voices.
However, many other use cases of synthetic voices exist beyond entertainment purposes. One of the most prominent ones is healthcare.
Patients suffering from voice and speech disorders often experience a great reduction in their quality of life because of the difficulty of communicating with others. Voice cloning and conversion technology offers a unique and promising solution: take whatever speech a patient can currently produce and use AI to transform it into something that sounds more natural and is easier to understand. Patients with laryngeal cancer, Parkinson’s disease, amyotrophic lateral sclerosis, multiple sclerosis, amyloidosis, vocal fold paralysis, pharyngeal cancer, and dysarthria can improve the quality of their lives with the help of voice cloning.
Deepfake Voice Threats
Researchers at the University of Chicago's SAND Lab tested voice synthesis programs openly available on GitHub. It turned out that synthesized voices can trick voice recognition systems such as Amazon Alexa, WeChat, and Microsoft Azure into responding as if the owner were speaking.
The SV2TTS program needs only five seconds of audio to create an acceptable imitation. It deceived the Microsoft Azure bot in about 30% of cases, while WeChat and Amazon Alexa failed to recognize the deepfake in 63% of cases. In a survey of real volunteers, more than half of the 200 participants could not tell that they were hearing a deepfake.
Researchers see this as a serious threat in terms of fraud and attacks on entire systems. For example, WeChat allows users to sign in to an account with their voice while Alexa allows them to use voice commands to make payments.
Similar stories keep popping up more frequently. In 2019, scammers used a deepfake voice to fool the head of a British energy company. Convinced that his boss at the German parent company was calling, he transferred more than $240,000 to the scammers.
The problem with the commercial use of voice deepfakes is that few countries have laws that specifically address them. The question of protecting the rights of the deceased with respect to the use of their voices also remains open.
In addition, no country has yet established legal procedures for having deepfakes removed. The US and China are only beginning to develop laws to regulate their use; California, for example, has banned the use of deepfakes in advertising.
The only exception is when a person's name is registered as a commercial brand, as is usually the case with celebrities. In 2020, the American YouTube channel Vocal Synthesis posted several humorous generated recordings of rapper Jay-Z's lyrics without his consent. Every video was captioned to say that the celebrity's speech was synthesized. Nevertheless, Roc Nation, the entertainment company Jay-Z founded, filed a copyright infringement claim and demanded that the videos be removed. In the end, only two of Jay-Z's four videos were taken down, since the resulting audio was recognized as a derivative work that had nothing to do with any of the rapper's songs.
There is nothing unethical about voice cloning technology itself. And although it uses the same AI technology as video deepfakes, there are significantly fewer examples of defamatory deepfake voices.
However, it is becoming more common for deepfakes to combine audio and video in order to deceive as many people as possible.
Every human being’s voice is unique. This is why some government and financial institutions use voice authentication to access private assets. In everyday life, most people also rely on their natural ability to distinguish the voices of friends and family when they cannot see them.
All this creates ideal circumstances for those with bad intentions to gain access to people's personal information or financial assets.
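To make the threat concrete, here is a sketch of how a typical voice authentication check works. Real systems use a trained speaker encoder to map each utterance to a fixed-size embedding; the random vectors below are stand-ins for that encoder's output, and the 0.8 threshold is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    # Cosine similarity between two speaker embeddings.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

enrolled = rng.normal(size=256)                           # owner's stored voiceprint
genuine  = enrolled + rng.normal(scale=0.1, size=256)     # the owner, new session
deepfake = enrolled + rng.normal(scale=0.3, size=256)     # a close synthetic clone
stranger = rng.normal(size=256)                           # an unrelated speaker

THRESHOLD = 0.8  # access granted when similarity clears this bar
for name, emb in [("genuine", genuine), ("deepfake", deepfake), ("stranger", stranger)]:
    verdict = "accepted" if cosine(enrolled, emb) >= THRESHOLD else "rejected"
    print(name, verdict)
```

The point of the sketch is that a random stranger scores near zero and is rejected, but a clone whose embedding lands close enough to the enrolled voiceprint clears the same threshold the legitimate owner does.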
Law enforcement agencies in many countries are busy establishing proper regulations for producing and using artificially synthesized voices. In the United States, the Defending Each and Every Person from False Appearances by Keeping Exploitation Subject to Accountability (DEEP FAKES Accountability) Act was introduced in Congress in 2019.
In 2020, fake news was estimated to have cost the global economy up to $78 billion. In 2019, the cybersecurity company Deeptrace reported that the number of deepfake videos circulating online had surpassed 15,000, and that number was expected to keep doubling.
Deepfakes are widely used in the political arena — to mislead voters and manipulate facts. All this can create financial risks and damage the very fabric of our society.
Controversial Media Applications
Aside from outright malicious uses, some deepfake applications in media fall short of ethical standards.
One such example would be the 2021 Anthony Bourdain deepfake controversy.
A documentary about the life of Anthony Bourdain drew backlash after the director disclosed that the producers had used deepfake voice technology: a few of Bourdain's quotes were narrated with a cloned voice because no original audio recordings of them existed.
Naturally, this raised concerns in the community. With the ability to alter historical facts, there is a grave need to ensure the ethical production of voice cloning. In this regard, the AI engineering community is constantly working to improve the recognition of audio and video deepfakes.
Ethical Principles When Using Synthetic Voices
Today, however, speech synthesis software committed to specific ethical principles is readily available. Respeecher, for its part, adheres to the following:
- Never use the voice of a private individual without permission. Voice owners should give their written consent before their speech is cloned.
- To make voice synthesis content easy to distinguish from other content, the software should embed a unique audio watermark in its products.
- Voice synthesis software should not expose a public API for creating voices.
- The voice cloning provider should only work with clients it trusts and approve only projects that meet strict ethical standards.
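The audio watermark idea from the list above can be sketched in a few lines. This is only a toy spread-spectrum scheme, not any vendor's actual method: a pseudo-random signature derived from a secret key is mixed into the audio and later recovered by correlation. A real watermark would be shaped psychoacoustically so listeners cannot hear it and would survive compression and re-recording.

```python
import numpy as np

KEY = 1234        # secret shared by embedder and detector (illustrative value)
STRENGTH = 0.05   # signature amplitude, kept well below the host signal

def signature(n, key):
    # Pseudo-random +/-1 sequence derived deterministically from the key.
    return np.random.default_rng(key).choice([-1.0, 1.0], size=n)

def watermark(audio, key=KEY):
    # Mix the keyed signature into the audio at low amplitude.
    return audio + STRENGTH * signature(audio.size, key)

def detect(audio, key=KEY):
    # Correlate against the keyed signature: a watermarked file scores
    # near STRENGTH, an unmarked file (or the wrong key) scores near zero.
    score = float(audio @ signature(audio.size, key)) / audio.size
    return score > STRENGTH / 2

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
marked = watermark(clean)
```

Because only the key holder can regenerate the signature, anyone with the key can verify that a clip came from the synthesis pipeline, while the mark stays invisible to ordinary listeners.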
As a key player in the voice cloning market, we take ethical concerns seriously. That's why we follow a strict code of ethics around voice cloning. You can read more about it on the Respeecher FAQ page.
In a Nutshell
Yes, voice cloning can be dangerous. While some use it to revolutionize movies, video games, and other creative projects, or to help people with speech disabilities, others can leverage it to deceive and rob people.
That's why sticking to the ethical principles of using such technology, educating people on the boundaries of what is permitted, and creating entertaining products that will not harm people should be your highest priority.
In addition to purely technical measures, including developing algorithms for deepfake identification and voice watermarking, Respeecher is working to democratize and educate the market.
Making voice cloning technology transparent and accessible to as many businesses and creative projects as possible will help protect the community from scammers and unethical use.