
Ask Me Anything (AMA) with Respeecher CEO Alex Serdiuk, Part III

Aug 7, 2023 10:30:43 AM

Alex Serdiuk, CEO of Respeecher, answered questions from the audience in a live AMA session we recorded on YouTube. This interview is part of a series of four interviews covering topics such as the ethics of deepfakes, the synthetic media industry, voice resurrection for Hollywood movies, Respeecher in the context of war, and more.

If you haven’t read the first parts, you can do so here:

Watch the full video of the AMA session here:



[Q] Why is permission needed? 

[AS] I guess we covered it. Permission is needed because if you want to reproduce someone's voice, you use their intellectual property. The voice belongs to a particular person. And this person or those who own the rights for this intellectual property should give you permission and consent in order to replicate that voice. 

[Q] Who owns the AI? 

[AS] Content created with a voice from the Voice Marketplace Library is owned by the content maker who uses that voice. The Marketplace is licensed, and the content maker can use any of the voices there in their own pieces under a global license. So they own the particular content that's created in a particular voice.

In all other cases, when we create specific models for clients, the voice owner or their representative owns the voice. And that's the only way it works so far. We don't own any voices aside from those in the Voice Marketplace Library, and we license those to creators.

The voices themselves in the Voice Marketplace Library are owned by the voice owner, the speaker.

[Q] Why don't you make it available to everyone again?

[AS] Because we want to protect it from potential misuse; we don't want our technology to be used without those boundaries in place. I mentioned this before in another section about future applications.

[Q] Can you tell me more about your new healthcare direction? How is Respeecher’s technology helpful? 

[AS] We want to help people with speech disorders improve their quality of life. With speech disorders caused by various medical conditions, it's often the case that people can speak, but it's really hard to understand them. We hear of cases where patients say they have to repeat whatever they say four or five times in order to be understood on a phone call, including a call with their doctor.


So early this year, in February and March, our technology finally reached the necessary level of robustness, and we got some promising initial results. We started conducting trials with real patients from two universities, one in the UK and one in the US.

And now we've found out what the limitations are and what needs to be improved. Luckily, everything that needs to be improved in the healthcare direction correlates closely with our general scope of improvements. We are introducing a real-time system, which means running the conversion closer to the patient's device and getting rid of the need for an internet connection and the cloud.

Our team is very much excited about this direction. We will invest more in it. We are still defining the path in terms of what we will do next. 

So we started with patients with particular diseases, but the technology would be helpful in various cases, and we are currently exploring that. We have started cooperating with people in the community who have built different products to help those patients.

In a week or two, you will see a very interesting case study on our website, and feel free to subscribe to our newsletter to be notified about it. We send a newsletter once per month or even once every two months. 

And another interesting direction is voice banking. So there are some medical conditions when patients know that soon they will lose their voice. And we have many requests like that. We consult people on what exactly needs to be recorded, and what kind of data set they need to put together right now in order to have access to their voice. We are currently working on a couple of projects like this and we will be doing more where we train models for people who are losing their voice. 

So they would be able to use their voice further by using text-to-speech (TTS) or even using speech-to-speech (STS), just giving their model to someone they trust, like their relatives. Also in some cases, they can still speak or whisper, and that could be converted into a healthy voice. 

[Q] Have you ever done a voiceover for books? 

[AS] Yeah, we did audiobooks. We recently did an interesting audiobook for a YouTube channel, Jolly. That may be the first audiobook done using speech-to-speech synthetic voice technology.

We have a couple of projects in long-form content, including audiobooks. And that's a very exciting direction, because text-to-speech (TTS) has limitations when it comes to voicing an audiobook in a way that sounds completely natural to the listener. Those limitations are not just about prosody. They are also about particular inflections, the human touch, the ability to change style depending on the character: the things voiceover actors do when they voice audiobooks.

What our speech-to-speech technology brings to the table is the ability to change the voice of the actor voicing the audiobook into many different voices.

Some of us might want to listen to an audiobook voiced by Tom Hanks, but he can't record every audiobook himself. Imagine a publisher getting permission from Tom Hanks to convert their audiobook library into his voice. That would be amazing, and then we would not be limited to just one voice in audiobooks.

[Q] Can you please explain how your technology is better and different from text-to-speech? 

[AS] There are basically two ways to synthesize speech. The most popular is text-to-speech (TTS), and we are used to hearing it everywhere. For example, Alexa speakers, Google speakers, and chatbots use text-to-speech.

The thing is, text-to-speech technology is somewhat limited by its language models. It works within fixed vocabularies and word domains, and it often struggles with unusual names.

The bigger issue with text-to-speech is that it's limited in terms of performance. If you look at the best text-to-speech software out there, you might find a system biased towards advertising: it produces very good advertising prosody and sounds natural, but you don't have control over the full range of vocal emotions. You can make text-to-speech sound sad, excited, or calm, but that's it. That's very limited compared to what we can do with the vocal apparatus we were born with.

All those tiny inflections, all those things we produce naturally, can be done only by humans. And that keeps the human touch. 

And that's where speech-to-speech technology comes in, because the performance can be enhanced by humans. The director of a movie can say to the voice actor: "Just say it with a bit more warmth. Can you add some violet notes to this particular line?" And the human would understand and would do it.

You cannot say that to text-to-speech. And even if you can imagine a text-to-speech system able to introduce all those tiny things we just naturally have in our voice, it wouldn't be practical to use: you would need a huge soundboard with many buttons to control those tiny inflections, and it would be extremely time-consuming.

Other things are singing, whispering, crying, and emotions; text-to-speech does not usually cover those. Speech-to-speech always keeps humans in the loop. Humans are in charge of the performance, and the technology is in charge of changing their voice to sound exactly like another voice. That gives you the ability to control emotions, to go wider in terms of emotional range, and to say a line in exactly the way, with exactly the prosody, you envision. You are not relying on some text-to-speech AI that would guess how it should sound and try to resemble a human saying it. In this case, a human is saying it, and you can work with that human.
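The distinction Alex describes can be sketched in a few lines of illustrative Python. This is purely a conceptual model, not Respeecher's actual API; the names `Performance`, `text_to_speech`, and `speech_to_speech` are hypothetical. The point it captures: TTS must infer the delivery from text alone, while STS carries a real human performance through and swaps only the voice timbre.

```python
from dataclasses import dataclass

@dataclass
class Performance:
    """A spoken take: the words, how they were delivered, and whose voice."""
    text: str
    prosody: str   # e.g. "warm, hushed" -- the human delivery
    voice: str     # the timbre the audio carries

def text_to_speech(text: str, voice: str) -> Performance:
    # TTS only sees text, so the delivery is the model's best guess.
    return Performance(text=text, prosody="model's best guess", voice=voice)

def speech_to_speech(source: Performance, target_voice: str) -> Performance:
    # STS keeps the human performance intact and changes only the voice.
    return Performance(text=source.text, prosody=source.prosody, voice=target_voice)

# A voice actor records a take; conversion preserves their delivery.
actor_take = Performance(text="Once upon a time...",
                         prosody="warm, hushed",
                         voice="voice actor")
converted = speech_to_speech(actor_take, target_voice="licensed target voice")
print(converted.prosody)  # the actor's delivery survives: warm, hushed
```

In this toy model, a director's note ("more warmth") changes `actor_take.prosody` via a new human recording, and the conversion passes it straight through, whereas in the TTS path there is no knob for it at all.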

Subscribe to our newsletter to be notified when the next part is published.

