by Anna Bulakh – Aug 9, 2022 8:07:14 AM • 8 min

What Is Singing Voice Synthesis and Is It Even Possible?


With advancements in voice cloning, the ability to synthesize vocals to sound like another person, or sing with perfect pitch in different languages, is no longer science fiction. It is now possible to vocalize text in any tone of voice, including that of a child. But what if you want to synthesize… singing? Is AI singing possible? Let’s find out.

What is singing voice synthesis?

Singing voice synthesis (SVS) is a method of generating a singing voice from musical scores with lyrics using computer models.
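To make "musical scores with lyrics" concrete, here is a minimal sketch of what the input to an SVS model can look like once it is turned into data. It is only an illustration: the Note class, the score_to_frames function, the 10 ms frame length, and the example syllables and pitches are all assumptions made for this sketch, not part of any particular SVS system. The only fixed fact is the standard conversion from MIDI note numbers to frequency, with A4 (MIDI note 69) tuned to 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One sung note of a score (hypothetical structure for illustration)."""
    lyric: str       # syllable sung on this note
    midi_pitch: int  # MIDI note number (69 = A4)
    beats: float     # duration in beats


def midi_to_hz(midi_pitch: int) -> float:
    """Equal-temperament conversion with A4 (MIDI 69) tuned to 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def score_to_frames(notes, tempo_bpm=120.0, frame_ms=10.0):
    """Expand a score into one (lyric, pitch_hz) pair per fixed-length frame."""
    seconds_per_beat = 60.0 / tempo_bpm
    frames = []
    for note in notes:
        n_frames = round(note.beats * seconds_per_beat * 1000.0 / frame_ms)
        frames.extend([(note.lyric, midi_to_hz(note.midi_pitch))] * n_frames)
    return frames


# A made-up three-note phrase, not taken from any real song.
score = [Note("sing", 69, 1.0), Note("a", 71, 0.5), Note("song", 72, 1.5)]
frames = score_to_frames(score)
```

A real model would receive something like this frame sequence (which syllable, at which target pitch, at which moment) and predict the acoustic features of the sung voice from it.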

Singing synthesis has been developing since the 1950s and, like text-to-speech, revolves around two paradigms: statistical parametric synthesis, which uses statistical models to reproduce the features of a voice, and unit selection, in which snippets of vocal recordings are recombined on the fly. Thanks to recent advances in voice AI technology, maestros can listen to a song immediately after composing it, with no recording necessary.
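The unit selection idea can be sketched in a few lines. The example below is a generic textbook-style illustration, not Respeecher's method: each recorded snippet is reduced to a hypothetical (start pitch, end pitch) pair, the two cost functions are deliberately simplistic assumptions, and the function name select_units is invented for this sketch. It picks one snippet per note so that the snippets match the requested pitches and join smoothly, using dynamic programming over all combinations.

```python
def select_units(targets, candidates, join_weight=1.0):
    """Pick one recorded unit per target note, minimizing target + join cost.

    targets: non-empty list of target pitches in Hz, one per note.
    candidates: list of the same length; entry i is the list of units
        available for note i, each unit a (start_hz, end_hz) tuple.
    Returns the index of the chosen candidate for every note.
    """
    def target_cost(unit, target_hz):
        # How far the snippet's average pitch is from the requested pitch.
        return abs((unit[0] + unit[1]) / 2.0 - target_hz)

    def join_cost(prev_unit, unit):
        # Pitch jump at the seam between two consecutive snippets.
        return abs(prev_unit[1] - unit[0])

    # best[i][j] = (cheapest total cost ending in candidate j of note i,
    #               index of the previous candidate on that path)
    best = [[(target_cost(u, targets[0]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_weight * join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(unit, targets[i]), back))
        best.append(row)

    # Start from the cheapest final candidate and follow the backpointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return path[::-1]


# Two notes, two candidate snippets each (made-up numbers).
targets = [440.0, 494.0]
candidates = [
    [(440.0, 440.0), (438.0, 470.0)],
    [(494.0, 494.0), (470.0, 500.0)],
]
chosen = select_units(targets, candidates)
```

With the default join weight, the sketch prefers the second snippet for both notes, because they connect at 470 Hz without a jump even though each is slightly off pitch. Setting join_weight to 0 ignores the seams and picks the snippets that match each target pitch exactly.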

Modern SVS models can generate the natural singing voice of a singer in any language using the singer's original vocals and recordings of other singers in the target languages. This is called cross-lingual singing voice synthesis, and it produces remarkably realistic AI voices.

In recent years, the following technologies have been used to achieve SVS:

generic deep neural networks (DNN)
convolutional neural networks
recurrent neural networks with long short-term memory (LSTM)
generative adversarial networks (GAN)

Use cases for singing voice synthesis

Singing voice synthesis technology, powered by AI-generated voices, allows musicians and singers to instantly know how their written music will sound. It’s no longer necessary to go through the process of recording a piece of music, with all the time, money, and resources that go into it, or to hire a team to assist with recording sessions.

Another critical use case is creating music for games and other projects that demand high degrees of audio support. Recording songs with real artists is extremely expensive for video game producers. Singing voice synthesis, powered by generative AI, allows smaller indie developers to produce songs from musical scores and text using existing voices.

Artists who want to reach a global audience with their message and provide support to different groups of people all around the world can also benefit from cross-lingual singing voice synthesis. Now, with the assistance of AI singers, they have an inexpensive means of distributing their message in any language.

How does cross-lingual singing voice synthesis work? Respeecher’s example

When synthesizing the singing voice of a particular performer, specialists begin by using samples of their vocals.

In total, about an hour of an individual’s vocals is needed to construct an initial model, and 10-15 minutes of recordings are then used for the synthesis process. This meticulous approach ensures the creation of a realistic AI voice that accurately reflects the nuances and characteristics of the original performer's singing style.
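Before training, it can help to check whether the collected vocals actually add up to roughly an hour. The sketch below does that for uncompressed WAV files using Python's standard wave module. The function names and the 60-minute default are assumptions for illustration, taken from the "about an hour" figure above; it is not a description of Respeecher's tooling.

```python
import wave


def wav_duration_seconds(path):
    """Length of one PCM WAV file in seconds (frame count / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_minutes(paths):
    """Combined length of several WAV files, in minutes."""
    return sum(wav_duration_seconds(p) for p in paths) / 60.0


def enough_training_audio(paths, target_minutes=60.0):
    """True if the files together reach the target amount of audio."""
    return total_minutes(paths) >= target_minutes
```

Calling enough_training_audio with a list of file paths returns False until the recordings reach the target; passing target_minutes=10.0 would instead check the lower bound of the 10-15 minutes used for synthesis.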

These recordings are loaded into a neural network, which then generates a voice, taking into account all possible nuances. The result is a synthesized voice that is almost indistinguishable from the original.

This is how Respeecher implements cross-lingual singing voice synthesis:

On the fourth anniversary of the death of famous Swedish musician Tim Bergling, known professionally as Avicii, one of his best-known collaborators, Aloe Blacc, paid tribute to the artist. He performed and recorded Avicii’s hit “Wake Me Up” in English, Mandarin, Spanish, Italian, and French using AI voice synthesis. In doing so, his aim was to allow more people all around the world to appreciate Avicii’s talent in a deeper way.

Since Aloe’s aim was to sing the song flawlessly, not only in English but also in Mandarin, Spanish, Italian, and French, he was going to need some technological help from singing voice synthesis experts.

To get the lyrics right in every language while still following the natural beat of the song, Aloe Blacc turned to Respeecher and Metaphysic.ai.

First, Aloe Blacc recorded a video of himself singing “Wake Me Up” in English. For him to also sing in Mandarin, Spanish, Italian, and French, the Respeecher team took recordings of other singers performing the song in these languages and converted them into Blacc’s voice using generative AI technology.

Then, Metaphysic.ai was tasked with lip-syncing the video, making Blacc’s mouth movements appear natural when singing in each language. This synchronization process, combined with the use of AI-generated voice technology, ensured a seamless and authentic performance across different linguistic renditions of the song.

In a Nutshell

Thanks to singing voice synthesis technology, artists can “sing” in as many languages as they want. AI speech-to-speech technology clones an actor’s voice and reproduces it in such a way that the same material can be performed in a foreign language using the same voice. All you need is at least one native speaker of each language you want to reproduce your content in.

We encourage you to get in touch with Respeecher for a brief consultation regarding the use of our technology and scaling singing voice synthesis to meet the demands of your use case.

FAQ

Singing voice synthesis (SVS) is a process where AI-generated voices are used to produce a singing voice from musical scores and lyrics. It employs advanced neural network techniques from generative AI to produce realistic vocal performances in any language without recording a real singer.

AI-generated singing relies on neural networks that analyze vocal samples and synthesize realistic singing. By using deep learning models such as DNNs or GANs, AI can mimic a singer's voice, tone, and style, enabling the creation of songs from text without a physical recording, which saves time and resources.

Singing voice synthesis allows musicians to hear their compositions immediately, saving time and costs. It's also valuable for creating AI-powered vocal transformations in video games or media projects, helping small developers produce high-quality music without hiring real artists, and enabling global reach with cross-lingual singing voice synthesis.

Cross-lingual singing voice synthesis is highly accurate, as it uses AI-generated voices combined with precise voice modeling. By applying recordings from native singers, AI can flawlessly reproduce lyrics in different languages, maintaining the original singer’s style, tone, and emotion, as demonstrated by Respeecher's work with Aloe Blacc.

Using AI singers provides numerous benefits: instant vocal production, reduced costs, and the ability to sing in multiple languages. This technology allows artists to reach a global audience without language barriers and enables the creation of music for games or other media without hiring live performers.

AI singing can be ethical when proper consent is obtained from artists, ensuring they are fairly compensated for their voice usage. However, issues like AI voice cloning without permission or deepfake applications raise ethical concerns, which require transparent practices and respect for creators' rights.

To create a synthetic singing voice, recordings of a singer’s vocals are fed into a neural network. About an hour of original vocals is needed to train the model, after which only 10-15 minutes of recordings are required to generate the desired vocal performance in different languages or styles.

While synthetic voices can replicate singing styles and tones, they cannot fully replace the emotional depth, creativity, and nuances of a human singer. AI voice synthesis for music is a tool to assist artists and producers, but it cannot entirely replace the artistry and presence of real singers in live performances or original compositions.

Glossary

Singing voice synthesis

A technology that uses neural networks to create sung vocal performances from musical scores and lyrics, enabling cross-lingual singing voice synthesis and AI-powered vocal transformation for music composition with AI.

AI voice synthesis

A technology that uses neural networks to generate synthetic speech or singing, used for music composition with AI and AI-powered vocal transformation.

Generative Adversarial Networks (GAN)

A type of neural network architecture in which two models, a generator and a discriminator, are trained against each other. In singing voice synthesis, it is used to make AI-generated voices sound more realistic.

Cross-lingual synthesis

A technique in singing voice synthesis that enables AI-generated voices to sing in multiple languages, using AI-powered vocal transformation for global music composition.

Neural networks for audio

AI models used in singing voice synthesis and AI voice synthesis for music, enabling AI singers and cross-lingual synthesis for realistic synthetic voice technology.

Synthetic voice technology

AI-powered systems enabling singing voice synthesis, AI-generated voices, cross-lingual singing voice synthesis, and AI voice synthesis for music.

Lip-syncing with AI

AI-driven technology synchronizing mouth movements with AI-generated voices in singing voice synthesis, cross-lingual synthesis, and AI-powered vocal transformation.

Music composition software

Tools powered by AI and generative AI in music to create melodies, harmonies, and AI-generated voices for singing voice synthesis and AI-powered vocal transformation.

AI-powered vocal transformation

Technology that uses neural networks and AI-generated voices to alter or create unique vocal sounds, enabling singing voice synthesis and AI singers.

Anna Bulakh

Head of Ethics and Partnerships

Blending a decade of expertise in international security with a passion for the ethical deployment of AI, I stand at the forefront of shaping how emerging technologies intersect with national resilience and security strategies. As the Head of Ethics and Partnerships at Respeecher, I focus on guiding ethical AI development. My role is centered around promoting the responsible use of AI, especially in synthetic media.