Audio-Based Generative AI

Harmonizing Innovation with Audio-Based Generative AI

March 22, 2024

Christina Cardoza

Artificial intelligence is an umbrella term for many different technologies. Generative AI is one we hear a lot about, particularly ChatGPT. But while ChatGPT gets a whole lot of press, it’s far from the only song in the generative AI playbook. One tune that Ria Cheruvu, AI Software Architect and Generative AI Evangelist at Intel, has been excited about lately is generative AI for the audio space (Video 1).

Video 1. Ria Cheruvu, Generative AI Evangelist for Intel, explores the business and development opportunities for audio-based generative AI. (Source: insight.tech)

But generative AI of any kind can be intimidating, and developers don’t always know exactly where to start or, once they get going, how to optimize their models. Partnering with Intel can really simplify the process. For example, beginning developers can leverage the Intel® OpenVINO™ notebooks to take advantage of tutorials and sample code that will help them get started playing around with GenAI. And then, when they’re ready to take it to the next level or ready to scale, Intel will be right there with them.

Ria Cheruvu talks with us about the OpenVINO notebook repository, the real-world applications suggested by generative AI for audio, and how the kind of generative AI that works for call centers differs from the kind that can actually work for musicians.

What are the different areas of generative AI?

This space is definitely developing in terms of the types of generative AI out there. ChatGPT is not the only example of it! Text generation is a very important form of generative AI, of course, but there is also image generation, for example, using models like Stable Diffusion to produce art and prototypes and different types of images. And there’s also the audio domain, where you can start to make music, or make audio for synthetic avatars, as well as many other types of use cases.

In the audio domain, a fast runtime is especially important, and that’s one of the common pain points. You want models that are super powerful and able to generate outputs with high quality really quickly, and that takes up a lot of compute. So I’d say that the tech stack around optimizing generative AI models is definitely crucial, and it’s something I investigate as part of my day-to-day role at Intel.

What are the specific business opportunities around generative AI for audio?

It’s really interesting to think about using voice AI or conversational AI for reading in and processing audio, which is what you do with a voice agent, like a voice assistant on your phone. Compare that to generative AI for audio, where you’re actually creating the content—being able to generate synthetic avatars or voices to call and talk to, for example. And definitely the first business applications you think about are call centers, or metaverse applications where there are simulated environments that use this created audio.

But there are also some nontraditional business use cases in the creative domain, in content creation, and that’s where we start to see some of the applications related to generative AI for music. And to me this is incredibly exciting. Intel is starting to look at how generative AI can complement artists’ workflows: for example, in creating a composition and using generative AI to sample beats. There’s also a very interesting cultural element to how musicians and music producers can leverage generative AI as part of their content-creation workflows.

And so while it’s not a traditional business use case—like what you would see in call centers, or in interactive kiosks that use audio for retail—I do believe that generative AI for music has some great applications for content creation. Eventually it could also come into other types of domains where there is a need to generate sound bites, for example, creating synthetic data for AI system training.

What is the development process for generative AI for audio?

There are a couple of different ways that the generative AI domain is currently approaching this. One of them is definitely adapting the model architectures that are already out there for other types of generative AI models. For example, Riffusion is based on the architecture for Stable Diffusion, the image-generation model; it generates spectrogram images that are then converted into audio waveforms.

I was speaking recently to someone who is doing research in the music domain, and one of the things we talked about was the diversity of input data that you can give these audio-domain models. It could be notes—maybe as part of a piano composition—all the way to just waveforms or specific types of input that are specialized for use cases like MIDI formats. There’s a lot of diversity there.
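To make the difference between those input types concrete, here is a minimal sketch that turns a symbolic representation (MIDI note numbers) into a raw waveform. It is only an illustration, not how any model mentioned here handles its inputs: the sample rate, note length, and the helper names `midi_to_hz` and `render_notes` are assumptions made for this example, and the only standard piece is the equal-temperament conversion in which MIDI note 69 is A4 at 440 Hz.

```python
import math

SAMPLE_RATE = 16_000  # samples per second; an arbitrary choice for this sketch


def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to a frequency in Hz (note 69 = A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12)


def render_notes(notes, seconds_per_note=0.25, sample_rate=SAMPLE_RATE):
    """Render symbolic notes as a mono waveform of floats in [-1, 1]."""
    samples_per_note = int(seconds_per_note * sample_rate)
    waveform = []
    for note in notes:
        freq = midi_to_hz(note)
        # One plain sine tone per note; real instruments are far richer.
        for i in range(samples_per_note):
            waveform.append(math.sin(2 * math.pi * freq * i / sample_rate))
    return waveform


melody = [60, 64, 67]  # C4, E4, G4 as MIDI note numbers
waveform = render_notes(melody)
print(len(melody), "notes ->", len(waveform), "samples")
```

Three note numbers expand into thousands of samples, which hints at why a model that consumes notes and a model that consumes waveforms face very different input sizes.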

What technologies are required to train and deploy these models?

We’ve been investigating a lot of interesting generative AI workloads as part of the Intel OpenVINO toolkit and the OpenVINO Notebooks repository. We are incorporating a lot of key examples of audio generation as very useful use cases to prompt and test generative AI capabilities. We had a really fun time partnering with other teams across Intel to create Taylor Swift-type pop beats using the Riffusion model—all the way to more advanced models that generate audio to match something that someone is speaking.

And one of the things that I see with OpenVINO is being able to optimize all these models, especially when it comes to memory and model size, but also enabling flexibility between the edge and the cloud and the client.

OpenVINO really targets that optimization part. There’s a fundamental notion that generative AI models are big in terms of their size and their memory footprint, and certain elements of the foundations for all of these models (be it audio, image, or text generation) are just very large. Using compression and quantization-related techniques to halve the model footprint, we’re able to significantly reduce model size while still ensuring that performance is very similar.
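As a rough illustration of why quantization shrinks a model, the sketch below applies simple symmetric 8-bit quantization to a list of random stand-in weights and compares the storage needed before and after. It is a toy under stated assumptions, not OpenVINO's actual compression pipeline: the function names `quantize_int8` and `dequantize` are hypothetical, and the weights are random numbers rather than a real model.

```python
import random
import struct


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(1000)]  # stand-in weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage cost: 4 bytes per 32-bit float versus 1 byte per 8-bit integer.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("fp32 bytes:", fp32_bytes)
print("int8 bytes:", int8_bytes)
print("max rounding error:", max_error)
```

In this toy case, going from 32-bit floats to 8-bit integers cuts storage from 4,000 to 1,000 bytes, and starting from 16-bit weights would give a halving. The rounding error per weight stays within half a quantization step, which is the intuition behind keeping performance close to the original.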

And all of this is motivated by a very interesting notion of local development. Music creators or audio creators are looking to move toward their PCs when creating content—as well as being able to work on the cloud in terms of intensive work like gathering audio data, recording it, annotating it, and collaborating with different experts to create a data set. And then they would be able to do other workloads on a PC and say, “Okay, now let me generate some interesting pop beats locally on my system and then prototype that in a room.”

What are some examples of how developers can get started with generative AI?

One example that I really love to talk about is how exactly you take some of these OpenVINO tutorials and workloads that we’re showing in the notebooks repo and then turn them into reality. At Intel we partner with Audacity, an open-source tool for audio editing and creation. It’s really a one-stop, Photoshop kind of a tool for audio editing. And one of the things we’ve done is integrate OpenVINO with it through a plugin that we provide. Our engineering team took the code in the OpenVINO Notebooks repo from Python, converted it to C++, and then deployed it as part of Audacity.

It allows for more of that performance and memory improvement I mentioned before, but it’s also integrated directly into the same workflow that many different people who are editing and just playing around with audio are leveraging. You just highlight a sound bite and say “Generate,” and OpenVINO will generate the rest of it.

That’s an example of workflow integration that can be used for artist workflows; or to create synthetic audio for voice production for the movie industry; or for interactive kiosks in the retail industry; or for patient-practitioner conversations in healthcare. That seamless integration into workflows is the next step that Intel is very excited to drive and to help collaborate on.

What else is in store for generative AI—especially generative AI for audio?

When it comes to generative AI for audio, I think it’s “blink and you may miss it” for any particular moment in this space. It’s just amazing to see how many workloads have been added. But just looking into the near future—maybe end of year or next year—some of the developments I can start to see popping up are definitely around those workflows I mentioned before, and identifying where exactly you want to run them—is it on your local system, or is it on the cloud, or on some sort of mix of the two? That is definitely something that really interests me.

We are trying some things around audio generation on the AI PC with the Intel® Core™ Ultra and similar types of platforms, where, when you’re sitting in a room prototyping with a bunch of fellow musicians and just playing around, ideally you won’t have to access the cloud for that. Instead, you’ll be able to do it locally, export it to the cloud, and just move your workloads back and forth. And key to this is asking how we incorporate our stakeholders as part of that process: How exactly do we create generative AI solutions, instantiate them, and then maintain them over time?

Can you leave us with a final bit of generative AI evangelism?

Generative AI is kind of a flashy space right now, but almost everyone sees the value that can be extracted out of it if there is a future-proof strategy. The Intel value prop for the industry is really being able to hold the hands of developers, to show them what they can do with the technology, and also to help them every step of the way to achieve what they want.

Generative AI for audio—generative AI in general—is just moving so fast. So keep an eye on the workloads, evaluating, testing, and prototyping; they are definitely all key as we move forward into this new era of audio generation, synthetic generation, and so many more of these exciting domains.