AI voices sound more realistic today than at any point in the history of synthetic voice technology.
What started as simple text-to-speech (TTS), combined with hundreds of hours of recorded dialog, has evolved into more natural-sounding AI voices, synthesized from just a couple of hours of audio.
You can check out realistic voice audio samples here and here. These are samples from Replica Studios, in which the main character faces off with a monster in a cave.
Why does this matter? The latest advancements in voice AI bring with them a host of new opportunities for creatives, game developers, game media, and more.
Leviathan Games, developer of titles featuring well-known IP such as Spider-Man and The Lord of the Rings, has started using voice AI in its dev cycle. “Creatives will always look for new frontiers to push the creative boundaries. Look at how 3D animation software has changed over the past decade,” said Wyeth Ridgway, the owner and technical director of Leviathan Games. “Pixar animators reshaped the direction of the industry by developing their own disruptive software for modeling, rendering, animation, and lighting. And, now we’re seeing parallels with voice AI technology advancements that have the potential to totally change game development.”
Voice AI is worlds apart from concatenative text-to-speech
Traditional, or concatenative, TTS works by stitching together, or concatenating, different pre-recorded sounds to form words and sentences. It requires voice actors to record hundreds of hours of dialogue, and a lot of manual work to carefully label these sounds.
Because of this, it is extremely difficult to add support for new voices with concatenative TTS.
Susan Bennett, the original voice of Siri, has said that she recorded hundreds of phrases and sentences to capture all the sound combinations in the English language, and that it took four hours a day, five days a week, for five months to complete the initial recording and updates.
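To make the stitching idea concrete, here is a minimal Python sketch of concatenation. It is an illustration under stated assumptions, not how any production engine works: the unit labels, the sine tones standing in for studio recordings, the sample rate, and the names fake_recording, UNIT_INVENTORY, and concatenate_units are all hypothetical. Real systems pick from far larger inventories and smooth the joins much more carefully.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def fake_recording(freq_hz, duration_s=0.1):
    """Stand-in for a labeled studio recording: a plain sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory: label -> pre-recorded waveform.
UNIT_INVENTORY = {
    "HH": fake_recording(200),
    "AH": fake_recording(300),
    "L": fake_recording(250),
    "OW": fake_recording(350),
}


def concatenate_units(labels, crossfade=80):
    """Stitch recorded units together, blending a few samples at each joint."""
    output = []
    for label in labels:
        if label not in UNIT_INVENTORY:
            # No recording exists for this sound, so it cannot be spoken.
            raise KeyError(f"no recording for unit {label!r}")
        unit = UNIT_INVENTORY[label]
        overlap = min(crossfade, len(output), len(unit))
        # Linear crossfade: fade out the tail of the output, fade in the new unit.
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)
            idx = len(output) - overlap + i
            output[idx] = (1 - w) * output[idx] + w * unit[i]
        output.extend(unit[overlap:])
    return output


speech = concatenate_units(["HH", "AH", "L", "OW"])
```

The lookup table is the limitation: any sound that was never recorded and labeled raises an error, and a new voice means rebuilding the entire inventory.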
Voice AI is completely different.
In late 2016, DeepMind demonstrated WaveNet, the first deep neural network that could convincingly model the human voice with far fewer audio recordings. It required very basic work to label the training data.
Since then, we’ve seen newer deep learning techniques built on LSTMs and GANs. When trained on just a few hours of audio recordings, these models learn to say words and make sounds that weren’t even part of the original training set, while also offering rich customization of emotion and expression.
You can listen to some samples of expressive tone changes here and here. These are from Replica Studios’ Agartha, a Lord of the Rings-inspired game in which the enemy is attacking a stronghold (think the Battle of Helm’s Deep from The Two Towers).
The advances in research, combined with the spread of cloud computing, mean the technology is more accessible than ever. As such, it may be precisely the right time for game developers to explore voice AI, both for its significant savings in time and cost and for its future promise of more personalized and engaging storytelling.
Dialogue prototyping meets scalability

Game development is rife with opportunities to embrace voice AI.
Think of triple-A games like Red Dead Redemption 2 or The Witcher series that have hundreds of thousands of lines of recorded dialogue. It’s a massive undertaking, and costly, given the hours it takes to book studio time with voice actors, record dialog, edit, revise the script, and re-record, as needed, during development.
Game design is an iterative process. Before launching, designers use a game prototype to test and collect user feedback on many different areas, such as the first-time user experience (FTUE), specific game mechanics, animations, player character interactions, and much more.
However, prototyping lacks a developer-friendly tool for in-game voice creation. Iterating on, refining, and perfecting dialog, including bringing voice actors back for multiple recordings, takes considerable resources, which is why game studios often forgo it.
But this is changing with the feasibility and increased accessibility of voice AI for rapid prototyping among game designers.
For smaller studios, voice AI solutions can bring significant savings while also raising the bar on production quality (as we’ve seen with animation software).
For larger studios, the benefit is time, cost and production efficiencies. Imagine how much voice AI could have positively impacted Red Dead Redemption 2’s development schedule and release date if it had been used to prototype the 500-plus hours of dialogue recording.
Reaching a turning point for immersion with voice AI
While there are hundreds of thousands of indie games today that have little-to-no voice dialogue, this could all change in the next couple years. At the same time, larger game studios could soon be exploring deeper story narratives with even more NPCs that interact with players.
Listen to these audio samples from Replica Studios’ Defense Protocols here and here. This scene is about a ship’s captain and AI (which takes inspiration from the sassy GLaDOS of Portal) dealing with an enemy attack.
As voice actors embrace using AI voice technology, game developers will have access to a rich library of AI voices to choose from for their game, while voice actors can create new revenue streams for themselves through a streamlined voice marketplace. The quality bar will rise as well.
Voice actors are beginning to welcome the change to their industry. Simon J Smith, a writer/director and voice artist, said: “Many wouldn’t expect it, but I’m optimistic about the future of voice AI and how it can help expand opportunities to license my voice, my IP. I see that voice AI is on the same evolution path as animation, and with it will bring more demand as well as accessibility to license my work for game dialog, designed by studios of any size.”
As the AI algorithms that learn human speech patterns keep improving (the steady progress of NLP), and as speech synthesis applications adopt the industry-standard Speech Synthesis Markup Language (SSML), we’re entering a stage where developers are beginning to have the necessary tools at their fingertips to create high-quality text-to-speech in-game voice, truly at scale.
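For readers who haven’t worked with SSML, the sketch below shows roughly what such markup looks like when generated from a game script. It uses only Python’s standard library and the standard SSML elements speak, prosody, and break. The build_ssml helper, its parameters, and the sample line are assumptions made for illustration, and each speech engine supports its own subset of SSML. A strict document would also declare version, language, and namespace attributes on the root element, which this sketch omits.

```python
import xml.etree.ElementTree as ET


def build_ssml(line, rate="medium", pitch="medium", pause_ms=0):
    """Wrap one line of game dialogue in SSML with basic prosody controls."""
    speak = ET.Element("speak")
    if pause_ms > 0:
        # Optional pause before the line is spoken.
        ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    # rate and pitch accept SSML keywords such as "slow", "fast", "low", "high".
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = line  # ElementTree escapes characters like & and < for us
    return ET.tostring(speak, encoding="unicode")


# Hypothetical line of dialogue, delivered quickly and at a higher pitch.
ssml = build_ssml("Hold the gate & stand firm!", rate="fast", pitch="high", pause_ms=300)
print(ssml)
```

Because the delivery lives in attributes rather than in a recording, a designer could re-render the same line with a different rate or pitch during prototyping without another studio session.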
Here are more audio samples from Replica Studios.
It’s still early, but we’re not far off from this vision. And as the technology gains momentum, game developers and content creators, voice actors and other talents will align to create this ecosystem.
Dynamic in-game player personalization
But what about use cases beyond rapid prototyping? The impact of voice AI goes much further than efficiencies and scalability.
Voice AI technologies will unlock new ways to meet the desire for more personalization in games.
Players spend hours perfecting their avatars in games such as Fallout 4 or Fortnite, and soon will in Cyberpunk 2077. From the physical appearance of their character to clothes and accessories to its gait, customization is all a part of the player experience.
The possibilities are endless for voice AI to enable truly personalized and dynamic in-game narratives with character voices.
Marco DeMiroz is a cofounder and general partner of The Venture Reality Fund who sees how voice AI could elevate VR gaming experiences with custom dialogue and gameplay. DeMiroz: “Imagine the ability to dynamically insert audio and storylines into games. A player could create their personal avatar as they currently do and now will have a plethora of funny, whimsical, and more options to choose for their avatar’s voice. And, their avatar can interact with NPCs and other characters with their own unique voices created by the player as well. Additionally, voice AI can deliver ultra-realistic and customized voices that can dynamically alter the gameplay per player based on their own skills and progression. Voices could automatically adapt to new vectors in the gameplay to give players a high quality, personalized experience.”
While real-time text-to-speech enables straightforward in-game dialogue, the future promise of voice AI technology is turning text into performances, where the game designer can create the story and script, and each player can play a role in the narrative using their own voice, or even choose a licensed celebrity voice to enact certain key characters. Picture a storefront where players could select Samuel L. Jackson to voice their avatar (much like he’s licensed his voice for the Alexa assistant).
Here’s a sample of a character from Moon Defense from Replica Studios. It’s a sci-fi game in which you play a member of an alien race.
Or envision a future where game developers could integrate esports commentator dialogue dynamically into games, such as bringing World Cup updates into FIFA or Sunday Night Football updates into Madden NFL.
For engagement as well as player retention, game developers could explore new dynamic speech elements that really push creative boundaries and unlock new types of gameplay mechanics.
Where voice AI is headed in 2021
 
Above: Starfinder is a sci-fi voice game.
As synthetic speech and the creative tools that allow for speech customization and scalability progress, we’ll see a fundamental shift in our engagement with AI-enabled digital voice technologies. It will shift from one that is primarily transactional (“Alexa, tell me the weather”) to one based on dynamic interactions and relationships between characters in any digital narrative or experience.
The underlying technology of voice AI is advancing continuously, driven by the market’s desire for more powerful creative tools and more natural-sounding synthetic voices. Voice AI technologies are currently being fueled by better data analysis and by newer approaches to modeling prosody and other vocal attributes, all of which shape how we perceive and evaluate synthetic voice quality. It’s a significant step change that no one could have predicted.
At the same time, we anticipate increased investment in digital rights management and security features for digital voice IP going into 2021, which will give rise to more voice actors and other celebrities migrating to a digital marketplace. We anticipate that as the tools, the voice synthesis technology and the marketplace form over this next year, it will further motivate content creators and game designers to embrace voice AI tech. It’s certainly a market to watch, and one to begin embracing as you consider your game development roadmap for the near future.
Shreyas Nivas is the cofounder and CEO of Replica Studios.
 Author: Shreyas Nivas
 Source: VentureBeat
