Aug 9, 2022 8:07:14 AM
What is singing voice synthesis?
Singing voice synthesis (SVS) is a method of generating a singing voice from musical scores with lyrics using computer models.
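To make "musical scores with lyrics" concrete, here is a minimal sketch of how such an input might be represented before it reaches a synthesis model. The `Note` class, its field names, and the three-note example are illustrative assumptions, not the format of any particular SVS system. The pitch conversion is the standard equal-temperament formula with A4 tuned to 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One sung note (hypothetical structure, for illustration only)."""
    midi_pitch: int    # MIDI note number; 69 is A4
    duration_s: float  # how long the note is held, in seconds
    syllable: str      # lyric syllable sung on this note


def midi_to_hz(midi_pitch: int) -> float:
    """Convert a MIDI note number to frequency, assuming A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12)


def target_contour(score):
    """Turn a score into (syllable, frequency_hz, duration_s) targets."""
    return [(n.syllable, midi_to_hz(n.midi_pitch), n.duration_s) for n in score]


# A made-up three-note phrase, not taken from any real song.
score = [Note(69, 0.5, "sing"), Note(71, 0.5, "a"), Note(72, 1.0, "song")]
targets = target_contour(score)
```

A synthesis model would consume targets like these and produce audio; that step is where the statistical or neural model comes in and is not shown here.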
Singing synthesis has been developing since the 1950s and, like text-to-speech, revolves around two paradigms: statistical parametric synthesis, which uses statistical models to reproduce the features of a voice, and unit selection, in which snippets of vocal recordings are recombined on the fly. Thanks to recent advances in the technology, maestros can listen to a song immediately after composing it, with no recording necessary.
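The unit selection idea can be illustrated with a toy sketch: for each target syllable and pitch, pick the closest recorded snippet and join the snippets end to end. Everything here is an assumption made for illustration, including the `Unit` class, the tiny hand-written database, and the use of pitch distance as the only selection criterion. Production systems typically also score how smoothly neighboring snippets join, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A recorded snippet (hypothetical structure, for illustration only)."""
    syllable: str
    pitch_hz: float
    samples: list  # placeholder audio samples


def select_unit(database, syllable, target_hz):
    """Pick the recorded unit for this syllable whose pitch is closest to the target."""
    candidates = [u for u in database if u.syllable == syllable]
    if not candidates:
        raise KeyError(f"no recorded unit for syllable {syllable!r}")
    return min(candidates, key=lambda u: abs(u.pitch_hz - target_hz))


def synthesize(database, targets):
    """Concatenate the best-matching unit for each (syllable, pitch) target."""
    audio = []
    for syllable, target_hz in targets:
        audio.extend(select_unit(database, syllable, target_hz).samples)
    return audio


# Toy database with made-up sample values standing in for real audio.
database = [
    Unit("la", 220.0, [0.1, 0.2]),
    Unit("la", 440.0, [0.3, 0.4]),
    Unit("do", 262.0, [0.5]),
]
audio = synthesize(database, [("la", 430.0), ("do", 260.0)])
```

Statistical parametric synthesis takes the opposite approach: instead of reusing recorded snippets, a model predicts the acoustic features of each frame and the audio is generated from those predictions.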
Modern SVS models can generate the natural singing voice of a singer in any language using vocals from the original score and recordings of singers in the target languages. This is called cross-lingual singing voice synthesis.
In recent years, the following technologies have been used to achieve SVS:
- generic deep neural networks (DNN)
- convolutional neural networks
- recurrent neural networks with long short-term memory (LSTM)
Use cases for singing voice synthesis
Singing voice synthesis technology allows musicians and singers to instantly know how their written music will sound. It’s no longer necessary to go through the process of recording a piece of music, investing all the time, money, and resources that go into it. And no need to hire a team to assist with recording sessions.
Another critical use case is creating music for games and other projects that demand high degrees of audio support. Recording songs with real artists is extremely expensive for video game producers. Singing voice synthesis allows smaller indie devs to produce songs from musical scores and text using existing voices.
Artists that want to reach a global audience with their message and provide support to different groups of people all around the world can also benefit from cross-lingual singing voice synthesis. Now they have an inexpensive means of distributing their message in any language.
How does cross-lingual singing voice synthesis work? Respeecher’s example
When synthesizing the singing voice of a particular performer, specialists begin by using samples of their vocals.
In total, about an hour of an individual’s vocals is needed to construct an initial model, and 10-15 minutes of recording is used for the synthesizing process.
These recordings are loaded into a neural network, which then generates a voice, taking into account all possible nuances. The result is a synthesized voice that is almost indistinguishable from the original.
This is how Respeecher implements cross-lingual singing voice synthesis:
On the fourth anniversary of the death of famous Swedish musician Tim Bergling, known professionally as Avicii, one of his best-known collaborators, Aloe Blacc, paid tribute to the artist. He performed and recorded Avicii’s hit “Wake Me Up” in English, Mandarin, Spanish, Italian, and French. In doing so, his aim was to allow more people all around the world to appreciate Avicii’s talent in a deeper way.
Since Aloe’s aim was to sing the song flawlessly, not only in English but also in Mandarin, Spanish, Italian, and French, he was going to need some technological help from singing voice synthesis experts.
To get the lyrics right in every language while still following the natural beat of the song, Aloe Blacc turned to Respeecher and Metaphysic.ai.
First, Aloe Blacc recorded a video of himself singing “Wake Me Up” in English. So that he could also sing in Mandarin, Spanish, Italian, and French, the Respeecher team took recordings of other singers performing the song in these languages and converted them into Blacc’s voice.
Then, Metaphysic.ai was tasked with the lip-syncing: adjusting Blacc’s mouth movements in the video so that they appear natural when he sings in each language.
In a Nutshell
Thanks to singing voice synthesis technology, artists can “sing” in as many languages as they want. AI speech-to-speech technology clones a performer’s voice and reproduces it in such a way that the same material can be performed in a foreign language using the same voice. All you need is at least one native speaker of each language you intend to reproduce your content in.
We encourage you to get in touch with Respeecher for a brief consultation regarding the use of our technology and scaling singing voice synthesis to meet the demands of your use case.