After mastering machine learning (ML)-based voice cloning and synthesis, ElevenLabs, the two-year-old AI startup founded by former Google and Palantir employees, is expanding its portfolio with a new text-to-sound model.
Teased a few hours ago, the model will let creators generate sound effects simply by describing what they imagine in words. It is expected to enrich content in a new way in the age of AI-driven digital experiences.
The model is not yet publicly available, but ElevenLabs has showcased its capabilities with a minute-long teaser featuring videos produced by OpenAI’s new Sora model and enhanced with its own AI-generated sounds. The company has also set up a signup page and is inviting potential users to join an early access waitlist.
Going beyond voice with AI sound effects
Founded in 2022, ElevenLabs has been researching AI to make audio and video content – from movies to podcasts – accessible across languages and geographies. The company has debuted a range of offerings to that end, including text-to-speech and speech-to-speech models that can produce AI speech from a given piece of content (text/audio/video) in 29 languages while preserving natural-sounding voice and emotion (and, in the speech-to-speech case, the original speaker’s voice).
While both these tools continue to see widespread adoption among enterprises and individual content producers, entirely AI-generated content is also on the rise, thanks to tools such as Runway, Pika and, most recently, OpenAI’s Sora. These products generate realistic AI videos from simple text prompts, but they lack audio by default. This is where ElevenLabs’ new model comes in, allowing users to produce sound effects for their content by describing what they want.
Once available, this offering could easily let AI creators enhance their work with the background sounds that would naturally accompany it. The sound effect can be of anything: chirping birds, moving vehicles and horns, even people talking, eating or walking on a busy street.
“At ElevenLabs, we have only ever shown our text-to-speech models in public. However, we have so much more in development. And when OpenAI announced their Sora model — which generates incredible videos but without sound — we decided to show a sneak peek of our new product line,” Luke Harries, who heads growth at ElevenLabs, wrote while resharing the X post that featured a bunch of Sora-generated videos enhanced with AI sound effects from the company’s model.
Beyond AI-generated content, the sounds produced by the new model could also be applied to speech synthesized from text, or to any other video (an Instagram clip, a commercial or a video game trailer) that needs a touch of background audio. It remains to be seen how the model will be used and what level of quality it delivers.
Sign up for early access
While ElevenLabs has not shared when it plans to launch the model publicly, the company has opened signups for early access. Interested users can head over to this page and register with their name and email, describing what they need the sound effects for. ElevenLabs is also asking early sign-ups to write a sample prompt for an AI sound effect, potentially to help optimize the model’s responses.
Once the sign-up is complete, the user is included in a waitlist and will get access when the model becomes available. The timeline, however, remains uncertain at this stage.
The new text-to-sound technology may give ElevenLabs a first-mover advantage, but several other companies active in the AI speech space could also venture into this segment, including known players such as MURF.AI, Play.ht and WellSaid Labs.
According to Market.US, the global market for such tools stood at $1.2 billion in 2022 and is projected to reach nearly $5 billion by 2032, a CAGR of just above 15.4%.
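As a quick sanity check of those figures, the cited growth rate is consistent with the two market-size estimates under the standard compound annual growth rate (CAGR) formula. The sketch below is illustrative only; the variable names and the ten-year horizon (2022 to 2032) are assumptions, not details from the report itself.

```python
# Sanity-check of the cited market figures using the standard CAGR formula:
# future_value = start_value * (1 + cagr) ** years
start_value = 1.2   # global market size in 2022, in $ billions (per the article)
cagr = 0.154        # compound annual growth rate of ~15.4% (per the article)
years = 10          # assumed horizon: 2022 -> 2032

projected = start_value * (1 + cagr) ** years
print(f"Projected 2032 market size: ${projected:.2f}B")  # roughly $5B
```

Compounding $1.2 billion at about 15.4% per year for ten years does indeed land at roughly $5 billion, matching the report's projection.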
Author: Shubham Sharma
Source: Venturebeat
Reviewed By: Editorial Team