Machine Learning with MindsDB

Machine Learning Simplified: With MindsDB

September 15, 2022

Christina Cardoza

Erik Bovee

Machine learning is no longer just for the AI experts of the world. With ongoing initiatives to democratize the space, it’s for business users and domain users now, too. Users no longer need to have any programming language knowledge to build and deploy machine learning models.

But democratizing machine learning does not mean data scientists and machine learning engineers are now obsolete. When machine learning becomes simplified, it means less time spent acquiring the data, transforming the data, cleaning the data, and preparing the data to train and retrain models. Instead, they can focus on the core aspects of machine learning like unlocking valuable data and enabling business results.

In this podcast, we talk to machine learning solution provider MindsDB about why machine learning is crucial to a business’ data strategy, how democratizing machine learning helps AI experts, and the meaning of in-database machine learning.

Listen Here

Our Guest: MindsDB

Our guest this episode is Erik Bovee, Vice President of Business Development for MindsDB. Erik started out as an investor in MindsDB before taking a larger role in the company. He now helps enable sophisticated machine learning at the data layer. Prior to MindsDB, he was a Venture Partner and Founder of Speedinvest as well as Vice President and Advisory Board Member at the computer networking company Stateless.

Podcast Topics

Erik answers our questions about:

(2:34) The current state of machine learning
(7:07) Giving businesses the confidence to create machine learning models
(8:48) Machine learning challenges beyond skill set
(11:24) Benefits of democratizing machine learning for data scientists
(13:39) The importance of in-database machine learning
(17:22) How data scientists can leverage MindsDB’s platform
(19:37) Use cases for in-database machine learning
(23:35) The best places to get started on a machine learning journey

Transcript

Christina Cardoza: Hello, and welcome to the IoT Chat, where we explore the latest developments in the Internet of Things. I’m your host, Christina Cardoza, Associate Editorial Director of insight.tech. And today we’re talking about machine learning as part of your data strategy with Erik Bovee from MindsDB. But before we jump into the conversation, let’s get to know our guest. Erik, welcome to the show.

Erik Bovee: Thank you, yeah, it’s great to be here.

Christina Cardoza: What can you tell us about MindsDB and your role there?

Erik Bovee: So MindsDB is a machine learning service, and I’ll get into the details. But the goal of MindsDB is to democratize machine learning, make it easier and simpler and more efficient for anybody to deploy sophisticated machine learning models and apply them to their business. I’m the Vice President of Business Development, which is a generic title with a really broad role. I’m responsible for our sales, but that’s kind of a combo of project management, some product management, kind of do everything with our customers. And then a really important aspect that I handle is our partnerships. So one of the unique things about MindsDB is that we enable machine learning directly on data in the database. So we connect to a database and allow people to run machine learning, especially on their business data, to do things like forecasting, anomaly detection. So I work with a lot of database providers, MySQL, Cassandra, MariaDB, Mongo, everybody. And that’s one of the key ways that we take our product to market: working with data stores and streaming brokers, data lakes, databases, to offer machine learning functionality to their customers. So I’m in charge of that. And also work with Intel®. Intel’s provided a lot of support. They’re very close with MariaDB, who’s one of our big partners, and Intel also provides OpenVINO™, which is a framework which helps accelerate the performance of our machine learning models. So I’m in charge of that as well.

Christina Cardoza: Great. I love how you mentioned you’re working to democratize machine learning for all. I don’t think it’s any surprise to businesses out there that machine learning has become a crucial component of a data management strategy, especially, you know, when all the influx of data is coming from all of these IoT devices, it’s difficult to sift through all of that by yourself. But a challenge is that there’s not a lot of machine learning skills to go around for everybody. So I’m wondering, let’s start off the conversation, if you can talk about what the state of machine learning adoption looks like today.

Erik Bovee: Yeah, I mean, you summed up a couple of the problems really well. The amount and the complexity of data is growing really quickly. And it’s outpacing human analytics, and even algorithmic-type analytics, traditional methods. And also, machine learning is hard. You know, finding the right people for the job is kind of difficult. These resources are scarce. But in terms of the state of the market, there are a couple of interesting angles. First, the state of the technology itself, and of core machine learning models, is amazing. You know, just the progress made over the last five to ten years is really astonishing. And cutting-edge machine learning models can just solve crazy hard real-world problems. If you look at things like what OpenAI has done with their GPT-3 large language models, which can produce human-like text, or even consumer applications: you’ve probably heard of Midjourney, which you can access via Discord, which, based on a few key words, can produce really sophisticated, remarkable art. There was a competition (I think it was in Canada recently) that a Midjourney-produced piece won, much to the annoyance of actual artists. So the technology itself can do astonishing things.

From an implementation standpoint though, I think the market has yet to benefit broadly from this. You know, even autonomous driving is still more or less in the pilot phase. And the capabilities of machine learning are amazing in dealing with big problem spaces—dynamic, real-world problems, but adapting these to consumer tech is a process. And they’re just—there are all kinds of issues that we’re tackling along the way. One is trust. You know, not just, can this thing drive me safely? But then also, how do I trust that this model’s accurate? Can I put my—the fate of my business on this forecasting model? How does it make decisions? So those are, I think those are important aspects to getting people to implement it more broadly.

And then I think one of the things that’s really apparent in the market, and as I’m dealing with customers, are some of the hurdles to implementation. So, cutting-edge machine learning resources are rare, which we said, but then also a lot of the simpler stuff, like machine learning operations, turns out to be more of a challenge, I think, than people anticipated. So the data collection, the data transformation, building all the infrastructure to do this ETL on the data: extracting it, transforming it, loading it from your database into a machine learning application, and then maintaining all this piping and all these contingencies. Model serving is another one. Your web server is not going to cut it when you’re talking about large machine learning models for all kinds of technical reasons. And these are all being solved piecemeal as we speak. But the market for that is in an early stage. Those are dependencies that are really important for broad adoption of machine learning.

But there are a few, I would say there are a few sectors where commercial rollout is moving pretty fast. And I think they’re good bellwethers for where the market is headed. Financial services is a good example and has been for a few years. Big banks, investment houses, hedge funds, they’ve got the budgets and the traditional approach to hiring around a good quant strategy. They’re moving ahead pretty quickly, and often with well-funded internal programs. Those give them a really big edge, but they’ve got the money to deploy this, and it’s this narrow business advantage for things like forecasting, algorithmic trading are tremendously important to their margins. So I’ve seen a lot of progress there. But a lot of it is also throwing money at the problem and kind of solving internally these MLOps questions, not necessarily applicable to the broader market.

The next are, I would say, industrial use cases. You had mentioned IoT. That’s where I see a lot of progress as well, especially in things like manufacturing. For example, taking tons of high-velocity sensor data and doing things like predictive maintenance. You know, what’s going to happen down the line? When will this server overheat, or something? That’s where we’ve seen a lot of implementation as well. I think those sectors, those market actors are clearly maturing quickly.

Christina Cardoza: So, great. Yeah. I want to go back to something you said about trust. Because I think trust goes a little bit both ways. Here you mentioned how businesses have to trust the solution or the AI to do this correctly and accurately. But I think there’s also a trust factor that the person deploying the machine learning models or training the machine learning models, knows what they’re doing. And so, when you democratize AI, how can business stakeholders be confident and comfortable that a business or an enterprise user is training and creating these models and getting the results that they’re looking for?

Erik Bovee: Yeah. I think a lot of that starts with the data. Really understanding your data, make sure there aren’t biases. Explainable AI has become an interesting subject over the last few years as well. Looking at visualizations, different techniques like Shapley values or counterfactuals to see where, how is this model making decisions? We did a big study on this a few years back. Actually, one of the most powerful ways of getting business decision makers on board and understanding exactly how the model operates—which is usually pretty complex even for machine learning engineers, once the model is trained what the magic is that’s going on internally is not always really clear—but one of the most powerful tools is providing counterfactual explanations. So, changing the data in subtle ways that you get a different decision. Maybe the machine learning forecast will change dramatically when one feature in the database or a few data points, or just a very slight change, and understanding where that threshold is. It’s like, here’s what’s really triggering the decision making or the forecasting on the model in which columns or which features are really important. If you can visualize those, it gives people a much better sense of what’s going on and how the decisions are weighted. That’s very important.
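To make the counterfactual idea concrete, here is a minimal Python sketch. It nudges a single feature until a model's decision flips, which exposes the threshold that is driving the decision. The scoring function, the feature names (`income`, `debt_ratio`), and the weights are hypothetical stand-ins chosen for illustration. They are not MindsDB's models or its explainability tooling.

```python
from typing import Callable, Dict, Optional


def toy_credit_model(row: Dict[str, float]) -> str:
    """Hypothetical stand-in for a trained model: approve when a weighted score reaches 0.5."""
    score = 0.6 * (row["income"] / 100_000) + 0.4 * (1 - row["debt_ratio"])
    return "approve" if score >= 0.5 else "decline"


def find_counterfactual(
    model: Callable[[Dict[str, float]], str],
    row: Dict[str, float],
    feature: str,
    step: float,
    max_steps: int = 1000,
) -> Optional[float]:
    """Change one feature in fixed steps until the decision differs from the baseline.

    Returns the first feature value that flips the decision, or None if no flip
    is found within max_steps.
    """
    baseline = model(row)
    candidate = dict(row)  # copy so the original row is left untouched
    for _ in range(max_steps):
        candidate[feature] += step
        if model(candidate) != baseline:
            return candidate[feature]
    return None


applicant = {"income": 40_000, "debt_ratio": 0.6}
decision = toy_credit_model(applicant)
flip_income = find_counterfactual(toy_credit_model, applicant, "income", 1_000)
print(decision, flip_income)
```

Reporting "the decision changes once income reaches this value" is often easier for a business stakeholder to reason about than a list of model weights.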

Christina Cardoza: Absolutely. So, I’m also curious, you know we mentioned some of the challenges, a big one being not enough skill set or data scientists available within an organization, but I think even if you do have the skills available, it’s still complex to train machine learning models or to deploy these to applications. So can you talk about some of the challenges businesses face beyond skill set?

Erik Bovee: Interestingly, skill set is one, but I think that will diminish over time. There are more and more frameworks that allow people, just data analysts or data scientists, to get access to more sophisticated machine learning features. AutoML has become a thing over the past few years. And you can go a long way with AutoML frameworks, like DataRobot or H2O. What is often challenging are some of the simple things, some of the simple operational things in the short term, on the implementation side. You know, a lot of the rocket science is already done by these fairly sophisticated core machine learning models, but a huge amount of a data scientist’s or ML engineer’s time is spent on data acquisition, data transformation, cleaning the data and encoding it, building all the pipeline for preparing this data to train and retrain a model. Then maintaining that over time.

You know, the data scientist tool set is often based on Python, which is where a lot of these pipelines are written. Python’s not necessarily, arguably not very well adapted to data transformations. And then what happens, you’ve often got this bespoke Python code written by a data scientist, and maybe things that are being done in a Jupyter Notebook somewhere, then it becomes a pain to update and maintain. What happens when your database tables change? Then what do you do? You’ve got to go back into this Python and it’s all reliant on this one engineer to kind of update everything over time. And so they—that’s, I think, the MLOps side is one of the biggest challenges. How do you do something that is efficient and repeatable and also predictable in terms of cost and overhead over time? And that’s something that we’re trying to solve.

And one of the theories behind that, behind our approach, is just to bring machine learning closer to the data and to use existing tools like SQL to do a lot of this stuff. They were very—SQL’s pretty well adapted to data transformation and manipulating data—that’s what it’s designed for. And so why not find a way where you can apply machine learning directly, via connection to your database, and use your existing tools, and not have to build any new infrastructure. So I think that’s a big pain point—one of the bigger bottlenecks that we’re trying to solve actively.

Christina Cardoza: So, you touched on this a little bit, but I’m wondering if you can expand on the benefits that the data scientists will actually see if we democratize machine learning. How can they start working with some of those business users together on initiatives for machine learning?

Erik Bovee: Yeah. So one of our goals is to give data scientists a broader tool set, and to save them a lot of time on the operational, the cleanup and the operational tasks that they have to perform on the data, and allow them really to focus on core machine learning. So the philosophy of our approach is that we take a data-centric approach to machine learning. You’ve got data sitting in the database, so why not bring the machine learning models to the database, allow you to do your data prep, to train a model, let’s say, for example, in an SQL-based database, using simple SQL with some modifications to the SQL syntax from the MindsDB standpoint. We’re not consuming database resources; you just connect MindsDB to your database. We read from the database, and then we can pipe machine learning predictions, let’s say business forecasts, for example, back to the database as tables that can then just be read like your other tables.

The benefit there for data analysts and any developer who’s maybe building some application on the front end that wants to make decisions, algorithmic trading, or, you know, anomaly detection. You want to send up an alert when something’s going wrong, or you just want to visualize it in a BI tool like Tableau is that you can use the existing code that you’ve got. You simply query the database just like you have from another application. There’s no need to build a special Python application or connect to another service. It’s simply there. And you access it just like you would access your data normally. So that’s one of the business benefits, is that it cuts down considerably on the bespoke development, is very easy to maintain in the long term, and you can use the tools you already have.

Christina Cardoza: So you mentioned you’re working to bring machine learning closer to the data, or bringing machine learning into the database. I’m wondering, is this how, traditionally, machine models have—machine learning models have been deployed, or is there another way of doing it? So, can you talk about how that compares to traditional methods—bringing it into the database versus the other ways that organizations have been doing this?

Erik Bovee: So, traditionally machine learning has been approached like people would approach a PhD project, or something. You would write a model using an existing framework like TensorFlow or PyTorch, usually writing a model in Python. You would host it somewhere, probably not with a web server; there are Ray and other frameworks that are well adapted to model serving. And then you have data you want to apply. It might be sitting all over the place, and maybe it’s in a data lake, some in Snowflake, some is in MongoDB, wherever. You write pipelines to extract that data, transform it. You often have to do some cleaning, and then data transformations and encoding. Sometimes you need to turn this data into a numerical representation, to a tensor, and then feed it into a model, train the model. The model will spit out some predictions, and then you have to pipe those back into another database, perhaps, or feed them to an application that’s making some decisions. So that would be the traditional way. So you can see there’s a bespoke model that’s been built. There’s a lot of bespoke infrastructure, pipelines, ETL that’s been done. That’s the way it’s been done in the past.
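The traditional flow described above (extract, clean, train, predict, write back) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the production data store, the `sales` and `sales_forecast` tables are hypothetical, and a one-variable least-squares fit stands in for a real TensorFlow or PyTorch model.

```python
import sqlite3

# An in-memory database stands in for the production data store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (week INTEGER, units REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (2, 12.0), (3, None), (4, 16.0), (5, 18.0)],
)

# Extract: pull the raw rows out of the database.
rows = conn.execute("SELECT week, units FROM sales").fetchall()

# Transform and clean: drop rows where the target value is missing.
clean = [(week, units) for week, units in rows if units is not None]

# Train: ordinary least squares for units = intercept + slope * week.
n = len(clean)
mean_week = sum(week for week, _ in clean) / n
mean_units = sum(units for _, units in clean) / n
slope = sum((w - mean_week) * (u - mean_units) for w, u in clean) / sum(
    (w - mean_week) ** 2 for w, _ in clean
)
intercept = mean_units - slope * mean_week

# Predict and load: write the forecasts back so an application can read them.
conn.execute("CREATE TABLE sales_forecast (week INTEGER, predicted_units REAL)")
conn.executemany(
    "INSERT INTO sales_forecast VALUES (?, ?)",
    [(week, intercept + slope * week) for week in (6, 7)],
)
forecast = conn.execute(
    "SELECT week, predicted_units FROM sales_forecast ORDER BY week"
).fetchall()
print(forecast)
```

Every step here is hand-written glue that has to be updated whenever the source tables change, which is the maintenance burden the conversation keeps returning to.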

With MindsDB, what we did is this: MindsDB has two components. One is a core suite of machine learning models. There’s an AutoML framework that does a lot of the data prep and encoding for you. And we’ve built some models of our own, also built by the community. But I forgot to mention MindsDB is one of the largest machine learning open source projects. We have close to 10,000 GitHub stars. And there’s a suite of machine learning models (regression models, gradient boosters, neural networks, all kinds of things) that are adapted to different problem sets. MindsDB can make a decision looking at your data what model best applies and choose that.

The other piece of this core, this ML core of MindsDB, is that you can bring your own model to it. So if there’s something you like particularly—Hugging Face, which is like an NLP model, language processing model—you can actually add that to the MindsDB ML core using a declarative framework. So, back in the day you would have to, if you wanted to make updates to a model or add a new model, you’d have to root around in someone else’s Python code. But we allow you to do this to select models—select the model you want, bring your own model. You can tune some of the hyper-parameters, some things like learning rate, or change weights and biases using JSON, using human-readable format. So it makes it much easier for everybody to use.
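As a rough illustration of what tuning a model declaratively with JSON can look like, the Python sketch below merges a small human-readable JSON document over a set of defaults and rejects settings it does not recognize. The key names (`model`, `learning_rate`, `n_estimators`) and the default values are hypothetical. They are not MindsDB's actual configuration schema.

```python
import json

# Hypothetical defaults; a real framework would define its own schema.
DEFAULTS = {"model": "gradient_booster", "learning_rate": 0.1, "n_estimators": 100}


def load_model_config(raw_json: str) -> dict:
    """Merge user-supplied JSON overrides onto the defaults.

    Raises ValueError for unrecognized keys so that a typo in a setting name
    fails loudly instead of being silently ignored.
    """
    overrides = json.loads(raw_json)
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown settings: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


config = load_model_config('{"model": "neural_network", "learning_rate": 0.01}')
print(config)
```

The point of the design is that changing a model or a hyper-parameter becomes an edit to a short document rather than a change to someone else's training code.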

And then the other piece of MindsDB is the database connector—a wrapper that sits around these ML models and provides a connection to whatever data source you have. It can be a streaming broker, Redis, Kafka. It can be a data lake like Snowflake. It can be an SQL-based database where MindsDB will connect to that database, and then using the natural—using the query language, native query language, you can tell MindsDB, “Read this data and train a predictor on this view or these tables or this selection of data.” MindsDB will do that and then it will make the predictions available. Within your database you can query those predictions just like you would a table. So it’s a very, very different concept than your traditional kind of homegrown, Python-based machine learning applications.

Christina Cardoza: And it sounds like, with a lot of the features that MindsDB is offering with its solution, data scientists can go in themselves and expand on their machine learning models and utilize this even more. So if you do have a data science team available within your organization, what would be the benefit of bringing MindsDB in?

Erik Bovee: This is the thing that I think it’s important to make really clear. We are not replacing anybody, and it’s not really an AutoML framework. It allows for far more sophisticated application of machine learning than just a tool that gives you a good approximation of what a hand-tuned model would do. So basically, for a machine learning engineer or a data scientist internally, MindsDB would just save a tremendous amount of, you know, that 80% of their work that goes into data wrangling: cleaning, transforming, and encoding. They don’t have to worry about that. They can really focus on the core models, selecting the data they want to train from, and then building the best models, if that suits them, or choosing from a suite of models that work pretty well within MindsDB, and then also tuning those models. A lot of the work goes into kind of adapting, changing the tuning, the hyper-parameters of a model to make sure you get the best results. We make that much simpler; you can do that in a declarative way rather than rooting around in someone’s Python code. So the whole thing is about time savings, I think, for data scientists.

And then, in the longer term, if you connect this directly to your database, what it means is you don’t have to maintain a lot of the ML infrastructure that up until now has been largely homegrown. If your database tables change, you just change a little bit of SQL—what you’re selecting and what you’re using to train a predictor. You can set up your own retraining schema. There are just lots and lots of operational time- and cost-saving measures that come with it. So it allows data scientists and machine learning engineers really to focus on their core job and produce results in a much faster way, I think.

Christina Cardoza: Great. Yeah. I love that point you made that it’s not meant to replace anybody or data science per se, but it’s really meant to boost your machine learning efforts and make things go a little bit smoother.

Erik Bovee: Yeah. In a nutshell, it just saves a data scientist tons of time and gives them a richer tool set. That’s—that was our goal.

Christina Cardoza: So, do you have any customer examples or use cases that you can talk about?

Erik Bovee: Yeah, tons. I mean, we concentrate. They fall into two buckets. We really focus on business forecasting, often on time-series data. And time-series data can be a bit tricky even for seasoned machine learning engineers, because you’ve got a high degree of cardinality. You’ll have tons of data, let’s say, where there are many, many unique values in a column, for instance—by definition that’s what a time series is—and if you could imagine you’ve got something like a retail chain that has maybe thousands of SKUs, thousands of product IDs across hundreds of retail shops, right? That’s just the complex data structure, and trying to predict what’s going to sell well—maybe a certain SKU sells well in Wichita, but it doesn’t sell well in Detroit. And how do you predict that? That’s a sticky problem to solve because of the high degree of cardinality in these large, multi-variate time series. But it also tends to be a very common type of data set for business forecasting. So we’ve really focused our cutting-edge large models on time-series forecasting. So that’s what we do. We will tell you what your business is going to look like in the future, in weeks or months.
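The retail example above comes down to many separate series hiding in one table, one per SKU and store combination. The Python sketch below shows that grouping step, assuming a hypothetical table of `(sku, store, week, units)` rows. The forecast is a simple moving average used purely as a placeholder baseline. It is not the large time-series models discussed in the conversation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rows: (sku, store, week, units sold).
sales = [
    ("sku-1", "Wichita", 1, 30), ("sku-1", "Wichita", 2, 34), ("sku-1", "Wichita", 3, 38),
    ("sku-1", "Detroit", 1, 5), ("sku-1", "Detroit", 2, 3), ("sku-1", "Detroit", 3, 4),
]

# Split the table into one ordered history per (sku, store) pair.
series: Dict[Tuple[str, str], List[int]] = defaultdict(list)
for sku, store, week, units in sorted(sales, key=lambda row: row[2]):
    series[(sku, store)].append(units)


def moving_average_forecast(history: List[int], window: int = 2) -> float:
    """Placeholder baseline: average of the most recent `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# One forecast per series; the same SKU can behave very differently by store.
forecasts = {key: moving_average_forecast(history) for key, history in series.items()}
for key, value in sorted(forecasts.items()):
    print(key, value)
```

With thousands of SKUs across hundreds of stores, the number of such series grows multiplicatively, which is why the grouping columns matter as much as the model itself.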

The other thing that we see in the use cases, so it’s forecasting on time series, and then also anomaly detection. So it’s fraudulent transactions, or, is this machine about to overheat. Getting down into the details, I can tell you, across the board, all kinds of different use cases. One very typical one is for a big cloud service provider. We do customer-conversion prediction. They have a generous free-trial tier, and we can tell them with a very high degree of accuracy, based on lots of columns in their customer data store and lots of different types of customer activity and the structure of their customer accounts, who’s likely to convert to a paying tier and when. And precisely when, which is important for their business planning. We’re working with a large infrastructure company, a telco, on network planning, capacity planning. So we can predict fairly well where network traffic is going to go, and where it’s going to be heavy and not, and where they need to add infrastructure.

We’ve also worked on manufacturing process optimization in semiconductors, which is a typical IoT case. So we can look at real-time sensor data coming in from the semiconductor process. And we can say, when do you stop and go on to the next phase of the process, and where defects are also likely to arise based on some anomaly detection on the process. That’s one we’ve been working on in one project in particular, but we’ve seen a couple like that in pilot phases. We’ve been doing credit scoring, real estate, payment-default prediction, as well, as part of the business forecasting. So, those are all typical, and, across the board, we see forecasting problems on time series.

One of the actually most enjoyable projects, it’s unique and interesting, but it’s really close to my heart, is we’re working with a big esports franchise building forecasting tools for coaching for video games. For professional video game teams. Like, how would you—what can you predict what the other team’s going to do for internal scrimmages and internal training for their teams? And what would be the best strategy given a certain situation on some complex, like MOBA games, like League of Legends or Dota 2? So that’s something we’re working on right now. They’ve already built the tools in the front end of these forecasting tools. And they’re—we’re working with very large data sets, proprietary data sets of internal training data to help them optimize their coaching practices. It’s an exotic case, but I guarantee you that’s going to grow in the future. So that’s one of the most interesting ones.

Christina Cardoza: So, lots of different use cases in ways that you can bring these capabilities into your organization efforts. But I’m wondering, in your experience, what is the best place to start on democratizing machine learning for a business? Where can businesses start using this? And where do you recommend they start?

Erik Bovee: Super easy. “Cloud.mindsdb.com.” We have a free-trial tier. It’s super easy to set up and get signed up for an account. And then we have, God knows how many, 50-plus data connectors. Wherever your data’s living, you can simply plug in MindsDB and start to run some forecasting and do some testing and see how it works. I mean, you can take it for a test drive immediately. That’s one of the first things that I would recommend that you do. The other thing is you can join our community. If you go to MindsDB.com, we’ve got a link to our community Slack and to GitHub, which is extremely active. And there you can find support and tips. And if you’re trying to solve a problem, it’s almost guaranteed someone has solved it before and is available on the Slack community.

Christina Cardoza: Great. I love when projects and initiatives have a community behind them, because it’s really important to learn what other people have been doing and to get that outside support or outside thinking that you may not have been thinking about. And I know you mentioned, Erik, in the beginning, you guys are also working with Intel on this. I should mention the IoT Chat and insight.tech as a whole are sponsored by Intel. But I’m curious how you are working with Intel and what the value of that partnership has been.

Erik Bovee: Yeah, so Intel has been extremely supportive on a number of fronts. So, obviously, Intel has a great hardware platform, and we have implemented their OpenVINO framework, which optimizes machine learning for performance on Intel hardware. So we make great performance gains that way. And on top of that, Intel provides tons of technology and kind of go-to-market opportunities. We work with them on things like this. I’ll be presenting at the end of the month, if anybody wants to come check us out at Intel Innovation in San Jose, I think it’s on the 27th, 28th, 28th, 29th of this month at the San Jose Convention Center. And we’ll have a little booth in the AI/ML part of their innovation pavilion. And I’ll be demoing how we work, running some machine learning on data in MariaDB, which is an Intel partner. Actually MariaDB introduced us to Intel, and that’s been really fruitful. Their cloud services are hosted on Intel. So if anybody wants to come and check it out, Intel has provided us this forum. So we’re extremely grateful.

Christina Cardoza: Perfect, and insight.tech will also be on the floor at Intel Innovation. So, looking forward to that demo that you guys have going on there at the end of the month. Unfortunately, we’re running towards the end of our time. I know this is a big, important topic and we could probably go on for a long time, Erik, but before we go, are there any final key thoughts or takeaways you want to leave our listeners with today?

Erik Bovee: I would just urge you to go test it out. I mean, we love the feedback. It’s actually fun. MindsDB is pretty fun to play with. That’s how I got involved. I discovered MindsDB by chance and installed it and started using it and found it was just useful in all kinds of data sets and just doing science experiments. We love it. If you take it for a test drive and provide feedback on the community Slack, we’re always looking for product improvements and people to join the community. And so we’d really welcome that: “cloud.mindsdb.com.” And thanks very much for the opportunity, Christina.

Christina Cardoza: Yep, of course. Thank you so much for joining the podcast today. It’s been a pleasure talking to you. And thanks to our listeners for tuning in. If you like this episode, please like, subscribe, rate, review, all of the above on your favorite streaming platform. And until next time, this has been the IoT Chat.

The preceding transcript is provided to ensure accessibility and is intended to accurately capture an informal conversation. The transcript may contain improper uses of trademarked terms and as such should not be used for any other purposes. For more information, please see the Intel® trademark information.

This transcript was edited by Erin Noble, copy editor.