
AI • IOT • NETWORK EDGE

Generative AI Composes New Opportunities in Audio Creation

Ria Cheruvu

Despite what many may think, generative AI extends beyond generating text and voice responses. Among its growing fields is audio-based generative AI, which harnesses AI models to create and compose fresh, original audio content, opening a world of new possibilities for developers and business solutions.

In this podcast, we discuss the opportunities presented by audio-based generative AI and provide insights into how developers can start building these types of applications. Additionally, we explore the various tools and technologies making audio-based generative AI applications possible.

Listen Here

Apple Podcasts      Spotify      Amazon Music

Our Guest: Intel

Our guest this episode is Ria Cheruvu, AI Software Architect and Generative AI Evangelist for Intel. Ria has been with Intel for more than five years in various roles, including AI Ethics Lead Architect, AI Deep Learning Researcher, and AI Research Engineer.

Podcast Topics

Ria answers our questions about:

  • (1:52) Generative AI landscape overview
  • (4:01) Generative AI for audio business opportunities
  • (6:29) Developing generative AI audio applications
  • (8:24) Available generative AI technology stack
  • (11:45) Developer resources for generative AI development
  • (14:36) What else we can expect from this space

Related Content

To learn more about generative AI, read Harmonizing Innovation with Audio-Based Generative AI and Generative AI Solutions: From Hype to Reality. For the latest innovations from Intel, follow them on X at @IntelAI and on LinkedIn.

Transcript

Christina Cardoza: Hello and welcome to the IoT Chat, where we explore not only the latest developments in the Internet of Things, but AI, computer vision, 5G, and more. Today we’re going to be talking about generative AI, but a very interesting area of generative AI, which is the audio space, with a familiar face and friend of the podcast, Ria Cheruvu from Intel. Thanks for joining us again, Ria.

Ria Cheruvu: Thank you, Christina, excited to be here.

Christina Cardoza: So, not only are you AI Software Evangelist for Intel, but you’re also a Generative AI Evangelist. So, what can you tell us about what that means and what you’re doing at Intel these days?

Ria Cheruvu: Definitely. Generative AI is one of those transformational spaces in the AI industry that’s impacting so many different sectors, from retail to healthcare, aerospace, and many other areas. I’d say that as part of being an AI evangelist, it’s our role to keep up to date and to help educate and evangelize these materials around AI.

But with generative AI that’s especially the case. The field moves so rapidly, and it can be challenging to keep up to date with what’s going on. So that’s one of the things that really excites me about being an evangelist in the generative AI space: exploring some of the newer domains and sectors that we can innovate in.

Christina Cardoza: Absolutely. And not only is there always so much going on, but take generative AI for example: there are so many different areas of generative AI. I almost think of it this way: just as artificial intelligence is an umbrella term for so many of these different technologies, generative AI is also sort of an umbrella for so many different things that you can do. I think a lot of people consider ChatGPT as generative AI, and they don’t realize that it really goes beyond that.

So that’s sort of where I wanted to start the conversation today. Could you tell us a little bit more about the generative AI landscape: where things are moving, and the different areas of development that we have, such as the text-based or audio areas?

Ria Cheruvu: Sure. I think the generative AI space is definitely developing in terms of the types of generative AI. And, exactly as you mentioned, ChatGPT is not the only type of generative AI out there, although it does represent a very important form of text generation. We also have image generation, where we’re able to generate cool art and prototypes and different types of images using models like Stable Diffusion. And then of course there’s the audio domain, which is bringing in some really unique use cases where we can start to generate music: we can start to generate audio for synthetic avatars, and so many other different types of use cases.

So I know that you mentioned the tech stack, and I think that’s especially critical when it comes to understanding what the technologies powering generative AI are. Especially with generative AI there are a couple of common pain points. One of them is achieving a fast runtime: you want these models, which are super powerful and take up a lot of compute, to be able to generate outputs really quickly and with high quality. That pertains to text and image, and all the way to audio too.

For the audio domain it’s especially important, because you have these synthetic audio elements that are being generated, or music being generated, and it’s one of those elements that we pay a lot of attention to, similar to images and text. So I’d say that the tech stack around optimizing generative AI models is definitely crucial and what I investigate as part of my day-to-day role.

Christina Cardoza: I’m looking forward to getting a little bit deeper into that tech stack that you just mentioned. I just want to call out generative AI for audio. You mentioned the music-generation portion of this, and I just want to call that out because we’ve had conversations around voice AI and conversational AI, and this is sort of separate from that area. It’s probably adjacent to it, but we’re not exactly talking about those AI avatars or chatbots that you’re communicating with and that you can have conversations with.

But, like you said, the music composition of this, the audio composition of this—so I’m curious, what are the business opportunities for generative AI for audio? Just so that we can get an understanding of the type of use cases that we’re looking at before we dive deeper a little bit into that tech stack and development.

Ria Cheruvu: Yeah, I think you brought up a great point in terms of conversational voice agents and how this actually relates. It’s really interesting to think about how we use AI for reading in and processing audio, which is what we do with a voice agent like a voice assistant on our phones, compared to generative AI for audio, where we’re actually creating this content.

And I know you mentioned, for example, being able to generate these synthetic avatars, or a voice you can communicate with and call and talk to. The first business applications we think about for those are call centers, or, again, metaverse applications where we have simulated environments with parties or actors that are operating using this audio. There are also additional use cases for interaction in those environments.

And then we go into the creative domain, and that’s where we start to see some of the generative AI for music-related applications. This, to me, is incredibly exciting, because we’re able to start to look at how generative AI can complement artists’ workflows, whether you’re creating a composition and using generative AI to figure out and sample some beats and tunes in a certain space, or to dig deeper into an existing composition. So, to me, there’s also a very interesting cultural element of how musicians and music producers can connect and leverage generative AI as part of their content-creation workflows.

So, while that’s not a traditional business use case—like what we would see in call centers, interactive kiosks that can use audio for retail, and other use cases—I also believe that generative AI for music has some great applications in the content-creation, artistic domain. And eventually that could also come into other types of domains where we need to generate certain sound bites, for example as synthetic training data for AI systems to get even better at this.

Christina Cardoza: Yeah, it’s such an exciting space. I love how you mentioned the artistic side of this. Because we see generative AI with the image creation, like you mentioned, creating all of these different types of pictures for people and paintings—things like that. So it’s interesting to see this other form that people can take and express their artistic capabilities with generative AI.

Because we talked about how you can use generative AI for text or image generation, I’m curious what development for generative AI for audio looks like. Are there similarities that developers can take from text or image generation, or is this a standalone development process?

Ria Cheruvu: That’s a great question. I think there are a couple of different ways to approach it as things currently stand in the generative AI domain. One of the approaches is definitely adapting the model architectures that are already out there when it comes to audio and music generation, and also leveraging the architectures from other types of generative AI models. Take, for example, Riffusion, which is a really popular earlier model in the generative-AI-for-audio space, although relatively speaking it’s still pretty new; with the advancements in generative AI there are just more and more models being created every day.

This particular Riffusion model is based on the architecture of Stable Diffusion, the image-generation model, in the sense that we’re actually able to generate waveforms instead of images by leveraging the Riffusion model. And there are similar variants popping up, as well as newer ones that are asking, “How do we optimize the architecture that we’re leveraging for generative AI and structure it in a way that you can generate audio sound bites or audio sound tokens or things like this that are customized for the audio space?”

I was talking to someone who is doing research in the music domain, and one of the things that we were talking about is the diversity and variety of input data that you can give these models in the audio domain—whether that’s notes, like as part of a piano composition, all the way to just waveforms, or specific input formats that are specialized for different use cases, like MIDI. There’s a lot of diversity in the types of input data and outputs that we’re expecting from these models.
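For developers who want to try this hands-on, here is a minimal sketch of driving a Riffusion-style pipeline from Python with Hugging Face diffusers and torchaudio. The prompt, the mel-spectrogram parameters, and the image-to-magnitude mapping are illustrative assumptions; the Riffusion project’s own utilities define the exact spectrogram conversion.

```python
import numpy as np
import torch
import torchaudio
from diffusers import StableDiffusionPipeline

# Load the Riffusion checkpoint, a fine-tuned Stable Diffusion model that
# outputs spectrogram images rather than pictures.
pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
image = pipe("upbeat acoustic pop with hand claps", num_inference_steps=25).images[0]

# Treat the generated image as a rough mel spectrogram (grayscale intensity as
# magnitude); the real Riffusion utilities use a calibrated mapping, so this
# reconstruction is approximate and lossy.
mel = torch.from_numpy(np.array(image.convert("L"), dtype=np.float32)[np.newaxis, ...])

# Approximately invert the mel spectrogram back to a waveform with Griffin-Lim.
to_linear = torchaudio.transforms.InverseMelScale(
    n_stft=1025, n_mels=mel.shape[1], sample_rate=44100
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=2048, hop_length=512)
waveform = griffin_lim(to_linear(mel))

torchaudio.save("riffusion_clip.wav", waveform, 44100)
```

In practice you would use the Riffusion project’s own spectrogram converter rather than this rough Griffin-Lim inversion, but the shape of the workflow is the same: prompt in, spectrogram image out, audio reconstructed from it.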

Christina Cardoza: And I assume with these models, in order to optimize them and to get them to perform well and to deploy them, there is a lot of hardware and software that’s going to go into this. We mentioned a little bit of that tech stack in the beginning. So, what types of technologies make these happen, or train these models and deploy these models, especially in the Intel space? How can developers partner with Intel to start working towards some of these generative AI audio use cases and leverage the technologies that the company has available?

Ria Cheruvu: As part of the Intel® OpenVINO toolkit, we’ve been investigating a lot of interesting generative AI workloads, but audio is definitely something that continues to come back again and again as a very useful and interesting use case, and as a way to prompt and test generative AI capabilities. I’d say that as part of the OpenVINO Notebooks repository we are incorporating a lot of key examples when it comes to audio generation—from the Riffusion model, where we had a really fun time partnering with other teams across Intel to generate pop beats similar to something that Taylor Swift would make, to some of these more advanced models, like generating audio to match something that someone is speaking. So there’s a lot of different use cases and complexity.

With OpenVINO we are really targeting that optimization part, which is based on the fundamental recognition that generative AI models are big in terms of their size and their memory footprint. Naturally, in the foundations of all of these models—be it audio, image generation, or text generation—there are certain elements that are just very large and that can be optimized further. So by using compression and quantization-related techniques, we’re able to achieve a large reduction in model footprint, often halving the model size, while also ensuring that the performance is very similar.
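A minimal sketch of that optimization step with OpenVINO and NNCF is shown below; the model file name is a placeholder for whichever generative audio model you export, and the exact size reduction depends on the model and the compression settings.

```python
import nncf
import openvino as ov

# "audio_model.onnx" is a placeholder for your exported generative audio model.
ov_model = ov.convert_model("audio_model.onnx")

# Apply default 8-bit weight compression to shrink the memory footprint
# while keeping outputs close to the original model.
compressed_model = nncf.compress_weights(ov_model)

# Save the compressed model as OpenVINO IR (.xml + .bin).
ov.save_model(compressed_model, "audio_model_int8.xml")
```

NNCF also offers lower-bit weight-compression modes for larger models, which trade a little more accuracy tuning for further size reduction.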

Then all of this is motivated by a very interesting notion of local development, where you’re starting to see music creators or audio creators looking to move towards their PCs in terms of creating content, as well as working on the cloud. So with that flexibility you’re able to essentially do what you need to do on the cloud in terms of some of your intensive work—like annotating audio data, gathering it, recording it, collaborating with different experts to create a data set that you need. And then you’re able to do some of your workloads on your PC or on your system, where you’re saying, “Okay, now let me generate some interesting pop beats locally on my system and then prototype it in a room.” Right?

So there are a lot of different use cases for local versus cloud computing. And one of the things that I see with OpenVINO is optimizing these architectures, especially the bigger elements when it comes to memory and model size, but also enabling that flexibility to traverse the edge, the cloud, and the client.
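As a rough illustration of that flexibility, the compressed IR from the sketch above can be compiled for whatever inference device a given machine exposes; the file name carries over from that sketch, and device availability varies by system.

```python
import openvino as ov

core = ov.Core()

# "AUTO" lets OpenVINO pick the best device available on this machine
# (for example an integrated or discrete GPU), falling back to the CPU.
compiled_model = core.compile_model("audio_model_int8.xml", device_name="AUTO")
```

Swapping the device name, or moving the same IR onto a cloud machine, is essentially the only change needed to shift the workload between client, edge, and cloud.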

Christina Cardoza: I always love hearing about these different tools and resources. Because generative AI—this space—it can be intimidating, and developers don’t know exactly where to start or how they can optimize their model. So I think it’s great that they can partner with Intel or use these technologies, and it really simplifies the process and makes it easy for them so they can focus on the use case, and they don’t have to worry about any of the other complications that they may come across.

And I love that you mentioned the OpenVINO Notebooks. We love the OpenVINO Notebooks repository, because you guys provide a wealth of tutorials, sample code, and information for all of these different things we talk about on the podcast—how developers can get started, experiment, and then really create their own real-world business use cases. Where else do you think developers can learn about generative AI for audio and how to develop and build it?

Ria Cheruvu: Yeah, definitely, Christina. I think we’re very excited about being able to advance a lot of the development, but also the short prototypes that you can do to actually take this forward, to partner with developers in this space, and to take it further with additional, deeper engagements and efforts.

I think, to answer your question about a deeper tech stack, one of the examples that I really love to talk about—and was able to witness firsthand as part of working through and creating this—is how exactly you take some of the tutorials and the workloads that we’re showing in the OpenVINO Notebooks repo and then turn them into a reality for your use cases.

So, at Intel we partner with Audacity, an open-source tool that essentially enables audio editing, creation, and a couple of other different efforts. It’s really this one-stop, Photoshop-kind-of tool for audio editing. And one of the things that we’ve done is integrate OpenVINO through a plugin that we provide with that platform. As part of that, what our engineering team did is take the code in the OpenVINO Notebooks repo from Python, convert it to C++, and then deploy it as part of Audacity.

So now you’re getting even more of that performance and memory improvement, but you’re also having it integrated directly into the same workflow that many different people who are looking to edit and play around with audio are leveraging. So that means that you just highlight a sound bite and then you say “Generate” with OpenVINO, and then it’ll generate the rest of it, and you’re able to compare and contrast.

So, to me, that’s an example of workflow integration, which can eventually, again, be used for artist workflows, all the way to creating synthetic audio for voice production as part of the movie industry; or, again, interactive kiosks as part of the retail industry for being able to communicate back and forth; or patient-practitioner conversations as part of healthcare. So I’d say that that seamless integration into workflows is the next step that Intel is very excited to drive and help collaborate on.

Christina Cardoza: Yeah, that’s a great point. Beginner developers can leverage some of these notebooks, or at least the samples, to get started and begin playing around with generative AI, especially generative AI for audio. But then when they’re ready to take it to the next level, ready to scale, Intel is still there with them, making sure that everything runs as smoothly and easily as possible so they can continue on their generative AI journey.

I know in the beginning we mentioned how a lot of people think of ChatGPT or text-based AI as generative AI, when it really includes all of these other forms as well. So I think it’s probably still early days in this space, and I’m looking forward to the additional opportunities that are going to come. I’m curious, from your perspective, where do you think this space is going in the next year or so? What is the future of generative AI, especially generative AI for audio? And how do you envision Intel playing a role in that future?

Ria Cheruvu: Sure. And I completely agree. I think it’s blink-and-you-may-miss-it when it comes to generative AI for audio, even with the growth of the OpenVINO Notebooks repository. As an observer and a contributor, it’s just amazing to see how many generative AI workloads around audio have continued to be added, and some of the interesting ways that we can implement and optimize them.

But I’d say, just looking into the near future, maybe the end of the year or next year or so, some of the developments that we can start to see popping up are definitely those workflows that we think about. Now we have these models and these technologies, and we’re seeing a lot of companies in the industry creating platforms and toolboxes, as they call them, for audio editing and audio generation using generative AI. So I would say that identifying where exactly you want to run these workloads—is it on your local system, or on the cloud, or some sort of mix?—is definitely something that really interests me, as I mentioned earlier.

And with Intel, some of the things that we are trying are around audio generation on the AI PC with Intel® Core Ultra and similar types of platforms: what can you achieve locally when you’re sitting in a room, prototyping music with a bunch of fellow artists, and you’re just playing around and trying things? Ideally you’re not having to access the cloud for that; you’re actually able to do it locally, export it to the cloud, and move your workloads back and forth.
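As a small illustration, OpenVINO can simply be asked which local devices it sees on such a system; on an AI PC this typically includes the CPU, the integrated GPU, and the NPU, though the exact list depends on the hardware, drivers, and OpenVINO version.

```python
import openvino as ov

# List the inference devices OpenVINO detects on this machine,
# e.g. ['CPU', 'GPU', 'NPU'] on a recent AI PC.
print(ov.Core().available_devices)
```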

So I’d say that really is the key question: what exactly is going to happen with generative AI for audio? How do we incorporate our stakeholders as part of that process—whether we’re, again, generating audio for these avatars—and how exactly do we create that, instantiate that, and then maintain it over time? I think these are a lot of the questions that are going to be coming up in the next year. And I’m excited to be collaborating with our teams at Intel and across the industry to see what we’re going to achieve.

Christina Cardoza: Great. And I love that you mentioned maintaining it over time. Because we want to make sure that anything that we do today is still going to make sense tomorrow. How can we future-proof the developments that we’re doing? And Intel is always leading the way to make sure that developers can plug and play, add new capabilities, and make their solutions more intelligent without having to rewrite their entire application. Intel has always been great at partnering with developers and letting them take advantage of the latest innovations and technologies. So I can’t wait to see where else the company takes this.

We are running a little bit out of time. So, before we go, I just want to ask you one last time, Ria, if there’s anything else about this space that we should know, or there’s any takeaways that you want to leave our listeners with today.

Ria Cheruvu: I think one takeaway is exactly what you said: there are a lot of steps toward being able to enable and deploy generative AI. It’s kind of a flashy space right now, but almost everyone sees the value that we can extract out of this if we have that future-proof strategy and mindset. I definitely couldn’t have phrased it better: the value prop, or value add, that we want to provide to the industry is really being able to hold the hands of developers, show you what you can do with the technology and the foundations, and also help you every step of the way to achieve what you want.

But I’d say, based on everything that we’ve gathered up until now, as I mentioned earlier, generative AI for audio specifically, and generative AI in general, is just moving so fast. So keeping an eye on the workloads, evaluating, testing, and prototyping is definitely key as we move forward into this new era of audio generation, synthetic generation, and so many more of these exciting domains.

Christina Cardoza: Of course we’ll also be keeping an eye on the work that you’re doing at Intel. I know you often write and publish blogs on the Intel and OpenVINO media channels and in different areas, and there are different edge reference kits being published online every day. So we’ll continue to keep an eye on this space and the work that you guys are doing.

So, just want to thank you again for joining us on the podcast and for the insightful conversation. And thank you to our listeners for tuning into this episode. Until next time, this has been the IoT Chat.

The preceding transcript is provided to ensure accessibility and is intended to accurately capture an informal conversation. The transcript may contain improper uses of trademarked terms and as such should not be used for any other purposes. For more information, please see the Intel® trademark information.

This transcript was edited by Erin Noble, copy editor.

About the Author

Christina Cardoza is an Editorial Director for insight.tech. Previously, she was the News Editor of the software development magazine SD Times and IT operations online publication ITOps Times. She received her bachelor’s degree in journalism from Stony Brook University, and has been writing about software development and technology throughout her entire career.
