by Anna Bulakh – Feb 21, 2023 8:22:15 AM • 8 min

New Ethical Dilemma in Voice Synthesis: Vishing and Its Consequences

The emergence of technology that can generate realistic-sounding voices has created an ethical dilemma: how can we trust voice synthesis technology when criminals can use it to deceive people? In this blog, we'll look at one particular type of fraud enabled by AI voices called vishing, its potential consequences, and how to overcome this latest ethical dilemma.

AI voice synthesis, while offering innovative possibilities in various industries, also raises concerns about the misuse of synthesized voices for malicious purposes. As we delve into the challenges posed by vishing, exploring solutions and safeguards becomes imperative to ensure the responsible and secure development of this generative AI technology.

Voice synthesis introduction

When people talk about voice synthesis technology, they usually mean an advanced form of speech-to-speech (STS) conversion.

This technology clones human speech using recordings from both the target and source voices. Voice cloning operates using advanced AI/ML algorithms to produce a unique, natural voice with the same tonal characteristics as the original speaker.

STS is quite different from a common text-to-speech (TTS) application, which relies on dictionaries and annotations to convey emotion. STS instead uses recordings from the source speaker to accurately imitate their voice. Furthermore, TTS systems are poorly suited to low-resource languages because of their higher data requirements, whereas STS systems, powered by generative AI, can produce native-sounding voices despite such limitations.

Respeecher’s voice cloning stands out as one of the most advanced forms of STS. We allow creators to synthesize realistic audio content that captures all the nuances of authentic voices while carrying emotion and other subtle variations over into the new speaker's timbre.

Voice synthesis is used in various applications, enabling people and businesses to create reproductions of human voices in different ways:

Assistive technology uses voice synthesis to support people with speech impairments. It lets users personalize their synthetic voice output so that it sounds natural. A person who has lost their voice could get it back without sacrificing any degree of likeness.
Dubbing increasingly relies on voice synthesis to align dialog with the actor's lip movements, and the film industry uses it more and more to create dubbed versions of movies and similar works.
Audiobooks and podcasts also benefit from this technology, as high-quality audio recordings can be generated without requiring a particular speaker's presence during production.
Call centers and customer support are leveraging this generative AI technology to impress customers and reduce operational costs by automating customer interactions using generated synthetic voices.
In entertainment, synthesized voices have enabled engaging experiences in video games and YouTube projects.
Synthesized voices are also used for educational purposes, such as cloning the voice of historical figures to create realistic interactive experiences. Check these case studies using the voices of Richard Nixon and Manuel Rivera Morales to get a better idea.

As you can see, positive examples of using this technology abound. But as with any technology, one can find a number of different harmful applications. One of them (and the best known to date) is malicious manipulation in the social and political spheres.

Ethical concerns regarding voice synthesis technology

Recent deepfakes, such as the one of Ukraine's President Zelensky, have highlighted synthetic media's power and potential danger in modern society, showcasing the impact of gen AI in creating realistic simulations.

Such cases demonstrated how seemingly realistic footage could be generated using AI to spread false information or manipulate public opinion. This raises a critical ethical question: how should this gen AI technology be used responsibly and ethically?

In a world where synthetic media can increasingly replicate real-life events, we must strive to keep conversations honest, open, and transparent. Individuals need to know when they are being exposed to misinformation, and they must be able to tell real media from synthetic media in order to form their own opinions.

We'll discuss some ethical options below. But for now, let's look at vishing, one of the most dangerous forms of synthetic voice fraud.

Vishing with a synthetic voice

Vishing (voice phishing) is a social engineering attack where malicious actors use phone calls to target organizations or individuals for financial gain. It takes advantage of people's trust in familiar companies and brands and other forms of psychological manipulation to steal personal information such as credit card numbers and passwords.

Gender and vocals are integral to a successful vishing campaign as voice can create a sense of trust between caller and victim. Studies have shown that people perceive women as more honest and trustworthy, offering an advantage to cyber attackers in this scenario. With synthetic voice technology, attackers can now sound like anyone they choose, enabling them to launch highly effective vishing campaigns that go undetected by victims. Ethical voice cloning becomes crucial in ensuring responsible and secure use of AI voice generator technology.

The list of potential negative uses of this technology goes beyond bank fraud. The FBI recently issued a public service announcement warning of deepfakes used by fraudsters to impersonate job applicants during online interviews.

The scam is concerning as the targeted jobs involve access to customer PII, financial data, corporate IT databases, and proprietary information. Facing potential business and legal repercussions due to unauthorized access to PII, businesses must be aware of this issue and take necessary steps to prevent it, including implementing ethical voice cloning safeguards with the aid of AI voice generators.

Vishing is a growing threat, and Richey May (a financial services and IT consultancy) and Respeecher are taking steps to combat the problem.

The two companies have developed scenarios for using synthetic speech for social engineering penetration testing. This includes creating simulations where an engineer sounds like a specific person and attempts to acquire information over a call or video conferencing app. By conducting vishing tests, organizations can identify potential personnel vulnerabilities and address these issues with proper training.

Voice synthesis and code of ethics

Voice cloning technology is growing in popularity due to its ease of production and lack of regulation. Companies must exercise caution when using this gen AI technology in their development, as it can have long-lasting implications for individuals and society.

Over the past five years, Respeecher has established itself as the go-to voice cloning provider for Hollywood studios. Famous voice IP owners have chosen to work with it because of its commitment to producing outstanding cloned voices and developing strict ethical standards throughout the industry.

In addition to anti-voice phishing initiatives, Respeecher is working to develop a broader list of principles that set the standard for ethical voice cloning, incorporating gen AI considerations. Upholding AI ethics is at the core of our mission to ensure responsible and secure use of voice synthesis technology.

To ensure that our technology is not used with malicious intent, Respeecher does not provide any public API for cloning voices. We only work with trusted clients, require written consent from voice owners, and approve projects that meet our standards.
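To make the three conditions above concrete, here is a minimal sketch of how such a gate could be expressed in code. It is purely illustrative: the `CloningRequest` fields and both function names are hypothetical and do not describe Respeecher's actual review process or any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloningRequest:
    """Hypothetical record of a voice cloning request (illustrative only)."""
    client_is_trusted: bool
    has_written_consent: bool
    project_meets_standards: bool


def unmet_conditions(request: CloningRequest) -> list[str]:
    """Return a human-readable reason for every condition that is not met."""
    checks = {
        "client is not a trusted client": request.client_is_trusted,
        "written consent from the voice owner is missing": request.has_written_consent,
        "project does not meet the ethical standards": request.project_meets_standards,
    }
    return [reason for reason, passed in checks.items() if not passed]


def may_proceed(request: CloningRequest) -> bool:
    """A request proceeds only when all three conditions hold."""
    return not unmet_conditions(request)
```

The point of the sketch is that the conditions are conjunctive: a trusted client with a compliant project still cannot proceed without the voice owner's written consent.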

Additionally, Respeecher develops watermarking technology to identify Respeecher-generated content. We are working with the broader voice engineering, voice acting, and movie studio communities to educate the public, build detection algorithms, and prevent technology abuse while adhering to AI ethics.
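For readers curious about how an audio watermark can work in principle, the sketch below shows a toy spread-spectrum scheme: a low-amplitude pseudo-random sequence derived from a secret key is added to the audio, and detection correlates the audio with that same sequence. This is not Respeecher's method, which is not described here. All names (`embed_watermark`, `is_watermarked`, `sine_wave`) and the `strength` value are assumptions made for illustration, and a production watermark would also need to be inaudible and to survive compression and editing.

```python
import math
import random


def sine_wave(freq_hz: float, seconds: float, rate: int = 16000) -> list[float]:
    """Generate a test tone standing in for real audio samples."""
    count = int(seconds * rate)
    return [0.5 * math.sin(2 * math.pi * freq_hz * n / rate) for n in range(count)]


def keyed_sequence(key: int, length: int) -> list[float]:
    """Derive a reproducible sequence of +1/-1 values from a secret key."""
    rng = random.Random(key)
    return [1.0 if rng.random() < 0.5 else -1.0 for _ in range(length)]


def embed_watermark(samples: list[float], key: int, strength: float = 0.02) -> list[float]:
    """Add the keyed sequence to the audio at low amplitude."""
    mark = keyed_sequence(key, len(samples))
    return [s + strength * m for s, m in zip(samples, mark)]


def detection_score(samples: list[float], key: int) -> float:
    """Average correlation between the audio and the keyed sequence."""
    mark = keyed_sequence(key, len(samples))
    return sum(s * m for s, m in zip(samples, mark)) / len(samples)


def is_watermarked(samples: list[float], key: int, strength: float = 0.02) -> bool:
    """Flag audio whose score exceeds half the embedding strength."""
    return detection_score(samples, key) > strength / 2
```

With a few seconds of audio, for example `embed_watermark(sine_wave(440, 3.0), key=1234)`, the score of marked audio sits near `strength`, while unmarked audio or a wrong key scores near zero. That gap is what lets a detector holding the key tell the two apart.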

If you want to learn more about our ethical standards, check this page. We hope you appreciate the opportunity to use voice cloning technology in a safe and ethical manner.

FAQ

Voice synthesis technology uses Generative AI to create realistic human-like voices. It includes AI voice cloning and speech-to-speech conversion, enabling industries like entertainment, customer support, and assistive technology to replicate authentic voices for various applications.

Vishing (voice phishing) uses synthetic voices to manipulate victims into disclosing sensitive information. Attackers exploit voice cloning and social engineering tactics, sounding like trusted individuals to deceive people and gain access to personal or financial data.

Generative AI poses ethical challenges like deepfake scams and vishing, where synthetic voices and media can mislead and manipulate people. It's crucial to adopt AI ethics and safeguards to prevent misuse, ensuring responsible use in both synthetic media and voice cloning technologies.

Respeecher enforces ethical AI practices by requiring written consent from voice owners and not providing public APIs for cloning. It also employs watermarking technology to identify synthetic voices and prevent unauthorized use of voice cloning for malicious purposes like vishing.

AI ethics ensures that synthetic media, including voice cloning and deepfake technology, is used responsibly. It guides the prevention of misuse, like social engineering attacks or vishing, by promoting transparency and accountability in AI-driven projects and voice synthesis.

Yes, synthetic voices can be identified through watermarking technology and advanced detection algorithms. Respeecher embeds digital markers within cloned voices to help track and verify the authenticity of voice synthesis and AI voice cloning.

Watermarking technology embeds hidden, trackable markers into synthetic voices produced by voice cloning tools. This ensures the identification and traceability of voice synthesis content, allowing creators to detect misuse, prevent fraud, and uphold AI ethics.

Speech-to-speech conversion (STS) is a form of voice synthesis technology that uses AI voice cloning to replicate one person’s voice in another’s, offering applications in dubbing, assistive tech, and multilingual content. It enables more natural, expressive voice replacements and is different from traditional text-to-speech.

Glossary

Voice synthesis technology

A Generative AI technique that creates human-like voices through voice cloning and speech-to-speech conversion, raising ethical concerns like vishing and deepfake scams.

Vishing

A social engineering attack using voice synthesis technology and voice cloning to impersonate trusted figures, often exploiting Generative AI for deepfake scams.

AI ethics

A set of principles guiding the responsible use of Generative AI, voice cloning, and synthetic media to prevent deepfake scams, vishing, and social engineering attacks.

Generative AI

A technology that creates content like voice cloning and synthetic media, enabling speech-to-speech conversion while requiring strong AI ethics to prevent deepfake scams.

Watermarking technology

A method used to embed markers in synthetic media and voice cloning to track usage, helping to prevent deepfake scams and promote ethical AI practices.

Anna Bulakh

Head of Ethics and Partnerships

Blending a decade of expertise in international security with a passion for the ethical deployment of AI, I stand at the forefront of shaping how emerging technologies intersect with national resilience and security strategies. As the Head of Ethics and Partnerships at Respeecher, I focus on guiding ethical AI development. My role is centered around promoting the responsible use of AI, especially in synthetic media.