Ask Me Anything (AMA) with Alex Serdiuk, CEO of Respeecher, Part II
Jul 11, 2023 7:26:32 AM
Alex Serdiuk, CEO of Respeecher, answered questions from the audience in a live AMA session we recorded on YouTube. This interview is part of a series of four interviews that will cover topics like: ethics of deepfakes, synthetic media industry, voice resurrection for Hollywood movies, Respeecher in the context of war, and more.
If you haven’t read the first part yet, you can do so here: Ask Me Anything (AMA) with Respeecher CEO Alex Serdiuk, Part I.
[Q] Are you going to do any work with deceased singers like Freddie Mercury or Michael Jackson in the future?
[AS] We do have ongoing projects like this. We used to do quite a lot of what we call voice resurrection work. Just recently, you might have seen us on America’s Got Talent in the finale, where we helped Metaphysic do a speech conversion for Elvis Presley. So I guess that's exactly the case you're asking about.
That's a very interesting direction, where we can bring some iconic voices back to life. And if it's done ethically, with all due respect to the personality, the voice, and the owners of this IP, those could be very exciting projects - like we showed in many movies we were part of - or in the Super Bowl opening with the voice of Vince Lombardi.
Actually, personally, I'm also very excited about bringing some voices from the past back to life, because we should be thinking more about history. We should be learning from our history. And now being in Ukraine and going through the war that Russia started against us, I understand the importance of it more than ever.
And the thing is, history is not something very attractive to people now, especially to the youth. We are used to things that are very entertaining. It's really boring to watch some old movies, right? And if you can bring history to the level where it's exciting to watch, where it's entertaining, that means that we would learn more.
You would be excited to go back and read about, say, the life of Freddie Mercury after you saw the movie Bohemian Rhapsody. So that's something quite meaningful for me.
[Q] How can Respeecher be helpful if I'm creating a game?
[AS] There are actually plenty of applications in video game creation because it is a complex project that usually requires a lot of voiceovers, way more voiceovers than in a feature film or TV series. So you might be just using a system like ours.
This feature optimizes the time and quality of the voiceover you are doing. So one person can speak in many voices, and many people can speak in one voice. So that's all about the better allocation of work for voice actors and their ability to deliver more work in the same period of time.
You also might have a wider choice of voices, because in video games you often have many characters and it's common practice for one voice actor to perform in five or seven voices, though the same actor could perform in 100 voices. And that brings the quality of the game voiceover to the next level.
Also, NPCs could be done better, because the common practice for NPCs now is to use subtitling or text-to-speech. But when you apply this feature on top of text-to-speech (TTS), it makes it better. It gives a wider choice in terms of voices. But actors could also just do those NPCs, and you wouldn't be so worried about casting the right voice to fit the character, because a good actor would be able to drive any voice from the Library of Voices. And that means that the process of voiceover is very much streamlined.
Also in video game creation, this feature is already being used in the pre-production stages. When the concept is being created and you just draw a storyboard, and a group of creators is thinking about how to make this storyboard come to life, how to voiceover the game in the future - they often do table reads by one, but this table read could be enhanced too.
A creator can actually voiceover a storyboard using different voices from the library and even try different voices for different characters. And that could be a basis for casting voice actors further on. But you can also stick to a particular voice from the library and just use it in the game, knowing that a good voice actor would be able to drive this voice for the piece you are making.
[Q] Any plans to reduce pricing for integrators?
[AS] We have the Voice Marketplace, the Library of Voices. It's now $200 per month, and we have a discount for the yearly subscription. We have already dropped the requirement to add your credit card in order to try the system.
In the next few weeks, we are planning a new release that would have other pricing options. The smallest pricing option would most probably be a low double-digit dollar amount for a limited scope of conversions. So you would be able to use it for $20, $30, or $40, to be defined. Then the second pricing tier would be close to what we have right now, and we would introduce some limits on the amount of conversion for this pricing tier.
We would have one unlimited pricing tier. And we are also thinking about adding the ability to buy a package of minutes per conversion. So you will be able to just extend whatever pricing tier you have with additional minutes or hours of conversion. So it would become more affordable for small creators.
And again, it's been a roller coaster for us to bring it to life for small creators and make it a self-serve model. The system is still heavy and it's associated with some significant costs, computational costs, and we had to put this pricing tier in order to build an initial community in the Voice Marketplace of folks that would be able to invest in technology like ours because it can drastically change the process of content creation and provide us the feedback required for building a better product.
Now we are at the stage where we know what the product should look like. We know some bottlenecks and limitations, which we are solving, and we are ready to scale. And as I said, in just several weeks there will be some additional pricing tiers.
[Q] Could you please tell us more about cross-language conversion and how it works? Is there a specific language it works best with? Cross-language conversion from English to Spanish, for example? And if you could give us examples of some use cases.
[AS] Yeah, our system is essentially language agnostic. So that means that it doesn't have any language models inside and you can use it in any language. The issue we might have with the system is the accents, and most of the cross-language projects are something we are doing manually from English to Spanish.
We would need recordings of the target voice in English, something like 30-40 minutes of speech of the voice that needs to be replicated, and then we would need a Spanish performer. The Spanish voice performer should preferably be a native speaker of Spanish. And then we would apply our tech to convert new content that's created by the Spanish performer.
In some cases, we might have limitations where we are not able to fully get rid of the accent. I believe it's not the case for Spanish right now and it's never the case for converting from any other language into English. So if I record myself speaking in Ukrainian for 30 minutes, and then someone who is, let’s say, a native speaker of American English creates new content, the conversion to my voice would have 100% of their accent.
But if you go the other way, we might still not be able to fully get rid of the accent, so some slight accents might still appear in conversions. That's something we're working on solving, and I'm firmly convinced that in the very near future there will be no limitation there.
Some additional challenges could come up in cases where the languages are very different at the phoneme level. For instance, converting from English to Vietnamese. In those cases, if we take that on as a project, we might just need some additional time to gather a bigger data set of that particular language, assuming we have never worked with it before, to make our system perform better. But that's more about time than a system limitation.
Okay, so from the technology category, we are done. The next part is ethics.
[Q] What are the current boundaries of deepfake audio technology?
[AS] There are a few of them. Companies like Respeecher that create synthetic content have to put in place ethics statements and legislation. Basically, it's not okay to use someone's voice without their permission. It's not right and it's against the legislation.
You cannot use someone's identity, someone's IP, without their knowledge and without their approval. As I said, we do not allow anyone to use our technology and our services without permission from the owner of the target voice.
We also don't work on projects in politics, even if you might have permission from a politician to be able to scale themselves doing many advertisements because we don't believe that would showcase our synthetic media technology in the best possible way.
We also work with clients we trust when we introduce particular voices to the system, and that's usually big film studios. So we are sure that it's going to be used in the most ethical and respectful way, like you have seen in several Star Wars-related pieces we were a part of. Many other players in the industry have put very similar ethics codes in place. And that's extremely important because the technology of synthetic media is, in general, in its very early stages.
And as with most other technologies, it might be misused.
All technologies are commoditized over time, and that means that in several years anyone will be able to create the quality of synthetic speech that is created today. And we should be ready for some cases of misuse of the technology.
That means that all the protections that would be put in place, including deepfake detection algorithms and watermarking, which are things we also work on, will not fully protect us. Those limitations, those boundaries are not enough. What will protect us from the misuse of synthetic media technology is the way we treat information.
So we should be reasonably skeptical about what information we consume. We should think about trusted sources of information, and that's just common sense. It's not about technology. It's about the way we consume information.
A good example could be the dinosaurs in Jurassic Park, right? When we see dinosaurs in Jurassic Park, we don't think that dinosaurs exist. That's a creative project. And the reason we don't think that is that we understand there are techniques that allow filmmakers to put dinosaurs in their movies. Once we understand that visuals can be manipulated and voices can be manipulated, we just treat the information in a different way.
Respeecher invests a lot into bringing awareness about the technology's existence to a global audience, including through the work we are doing for studios. So we push hard for publicity rights and case studies, to make people think about some potential misuse of the technology. And that would mean that we are on the right path to understanding how synthetic media can be made, and then we will treat information differently.
Subscribe to our newsletter to be notified when the next part is published.