
Take AI for a Test Drive: Democratizing ML with MindsDB

October 26, 2022

Christina Cardoza


Machine learning has become a crucial component of a data management strategy—particularly with the huge influx of data from IoT devices these days—but it can be challenging to sift through all that information. An additional challenge is the dearth of available machine learning (ML) experts. But there are businesses out there working to democratize sophisticated ML models, making it easier and more efficient for anyone to deploy them.

Machine learning solution provider MindsDB is one of those companies, and Erik Bovee, its Vice President of Business Development, wants to encourage new members of the ML community to get started. He talks to us about challenges of ML adoption, learning to trust the model, and bringing machine learning to the data rather than the other way around.

What is the state of machine learning adoption today?

The amount and complexity of data are growing really quickly, outpacing human analytics. And machine learning is hard, so finding the right people for the job is difficult. But in terms of the state of the market there are a couple of interesting angles. First, the state of the technology itself is amazing—just the progress made over the past five to 10 years is really astonishing—and cutting-edge machine learning models can solve crazy-hard, real-world problems. Look at what OpenAI has done with its GPT-3 large language models, which can produce human-like text. There’s also Midjourney, which, based on a few keywords, can produce really sophisticated, remarkable art.

From an implementation standpoint, though, I think the market has yet to benefit broadly from all of this. Even autonomous driving is still more or less in the pilot phase. Adapting these capabilities to consumer tech is a process, and all kinds of issues need to be tackled along the way. One is trust. Not just, “Can this autonomous car drive me safely?” But also, “How do I trust that this model is accurate? Can I stake the fate of my business on this forecasting model?” So I think those are important aspects of getting people to implement machine learning more broadly.

But there are a few sectors where commercial rollout is moving pretty fast, and I think they’re good bellwethers for where the market is headed. Financial services is a good example—big banks, investment houses, hedge funds. The business advantage for things like forecasting and algorithmic trading is tremendously important to their margins, and they’ve got the budgets and a traditional approach to hiring around a good quant strategy. But a lot of that is about throwing money at the problem and solving these MLOps questions internally, which is not necessarily applicable to the broader market.

I also see a lot of progress in industrial use cases, especially in manufacturing. For example, taking tons of high-velocity sensor data and doing things like predictive maintenance: What’s going to happen down the line? When will this server overheat? I think those sectors, those market actors, are clearly maturing quickly.


How does democratizing AI give business stakeholders more trust?

A lot of that starts with the data—really understanding your data, making sure there aren’t biases. Explainable AI has become an interesting subject over the past few years. One of the most powerful ways of getting business decision-makers on board and understanding exactly how the model operates is providing counterfactual explanations—that is, changing the data in subtle ways to get a different decision. That tells you what’s really triggering the decision-making or the forecasting on the model, and which columns or features are really important.
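The counterfactual idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `approve_loan` is a hypothetical fixed scoring rule standing in for a trained model, and the feature names, step sizes, and bounds are invented for the example. The search nudges one feature at a time until the decision flips, which shows which columns can actually change the outcome.

```python
from typing import Callable, Dict, Tuple

Row = Dict[str, float]


def approve_loan(row: Row) -> bool:
    """Hypothetical stand-in for a trained model: a fixed scoring rule."""
    score = (
        0.5 * row["income_k"]
        + 0.25 * row["account_age_months"]
        - 2.0 * row["missed_payments"]
    )
    return score >= 30.0


def find_counterfactuals(
    model: Callable[[Row], bool],
    row: Row,
    steps: Dict[str, float],
    bounds: Dict[str, Tuple[float, float]],
) -> Row:
    """For each feature, return the nearest value (in whole steps) that flips
    the model's decision while every other feature stays fixed.
    Features that cannot flip the decision within their bounds are omitted.
    Step sizes must be nonzero."""
    baseline = model(row)
    flips: Row = {}
    for feature, step in steps.items():
        low, high = bounds[feature]
        value = row[feature] + step
        while low <= value <= high:
            candidate = dict(row)
            candidate[feature] = value
            if model(candidate) != baseline:
                flips[feature] = value
                break
            value += step
    return flips


# Illustrative applicant; all numbers are made up for the example.
applicant = {"income_k": 50.0, "account_age_months": 12.0, "missed_payments": 1.0}
steps = {"income_k": 1.0, "account_age_months": 1.0, "missed_payments": -1.0}
bounds = {
    "income_k": (0.0, 200.0),
    "account_age_months": (0.0, 120.0),
    "missed_payments": (0.0, 10.0),
}

flips = find_counterfactuals(approve_loan, applicant, steps, bounds)
print(approve_loan(applicant), flips)
```

In this toy case the applicant is rejected, and the decision flips if income reaches 58 or account age reaches 28 months, while clearing the missed payment alone is not enough. That is the kind of statement a business stakeholder can check against intuition.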

What are some of the machine learning challenges beyond skill set?

Skill set, I think, is a challenge that will diminish over time. What is often challenging in the short term is some of the simple operational work on the implementation side. The data scientist tool set is often based on Python, which is arguably not very well adapted to data transformation. There’s often this bespoke Python code written by a data scientist, but what happens to it when your database tables change? It’s all reliant on this one engineer to update everything over time. So how do you do something that is efficient and repeatable, and also predictable in terms of cost and overhead over time? That’s something we’re trying to solve.

One of the theories behind our approach is to bring machine learning closer to the data, and to use existing tools like SQL, which is pretty well adapted to data transformation and manipulating data. Why not find a way to apply machine learning directly—via connection to your database—so you can use your existing tools and not have to build any new infrastructure? I think that’s a big pain point.

How does this benefit data scientists?

One of our goals is to give data scientists a broader tool set, and to save them a lot of time on cleanup and operational tasks, allowing them to really focus on core machine learning. You’ve got data sitting in the database, so, again, why not bring the machine learning models to the database? And we’re not consuming database resources either; you just connect MindsDB to it. We read from the database and then pipe machine learning predictions back to the database as tables, which can then be read just like any other tables you have. There’s no need to build a special Python application or connect to another service; it’s simply there. It cuts down considerably on the bespoke development, is very easy to maintain in the long term, and you can use the tools you already have.
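To make the "predictions as tables" pattern concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not MindsDB's API or implementation: the `sales` and `sales_predictions` tables are hypothetical, and a per-store average stands in for a real model. The point is only that once predictions land in a table, ordinary SQL can join them with existing data.

```python
import sqlite3
from collections import defaultdict

# In-memory database standing in for an existing production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, week INTEGER, units REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("wichita", 1, 10.0),
        ("wichita", 2, 14.0),
        ("detroit", 1, 4.0),
        ("detroit", 2, 6.0),
    ],
)

# Read history from the database and "train" a placeholder model:
# the per-store mean of past unit sales (an assumption for illustration).
units_by_store = defaultdict(list)
for store, units in conn.execute("SELECT store, units FROM sales"):
    units_by_store[store].append(units)
predictions = [(store, sum(u) / len(u)) for store, u in units_by_store.items()]

# Pipe the predictions back into the database as a regular table.
conn.execute(
    "CREATE TABLE sales_predictions (store TEXT PRIMARY KEY, predicted_units REAL)"
)
conn.executemany("INSERT INTO sales_predictions VALUES (?, ?)", predictions)
conn.commit()

# Downstream users query predictions like any other table, here via a JOIN.
report = conn.execute(
    """
    SELECT s.store, s.week, s.units, p.predicted_units
    FROM sales AS s
    JOIN sales_predictions AS p ON p.store = s.store
    WHERE s.week = 2
    ORDER BY s.store
    """
).fetchall()
print(report)
```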

How does this compare to traditional methods of deploying machine learning models?

Traditionally you would write a model using an existing framework, like TensorFlow or PyTorch, usually writing in Python. You would host it somewhere. And then you would have data you want to apply—maybe it’s in a data lake, or in Snowflake, or in MongoDB. You write pipelines to extract that data and transform it. You often have to do some cleaning, and then data transformations and encoding. The model would spit out some predictions, and then perhaps you’d have to pipe those back into another database, or feed them to an application that’s making decisions. That’s the way it’s been done in the past.
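The traditional flow described above can be sketched as a chain of small Python functions. Everything here is a simplified assumption: `extract` returns hard-coded records instead of querying a data lake, `predict` uses fixed hypothetical weights instead of a TensorFlow or PyTorch model, and `load` appends to a list instead of writing to a database. The sketch is meant to show how many bespoke steps sit between the data and the predictions.

```python
from typing import Dict, List


def extract() -> List[dict]:
    """Stand-in for pulling raw records from a data lake or database."""
    return [
        {"id": 1, "plan": "free", "logins": 12},
        {"id": 2, "plan": "pro", "logins": None},  # missing value
        {"id": 3, "plan": "pro", "logins": 30},
    ]


def clean(rows: List[dict]) -> List[dict]:
    """Drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def encode(rows: List[dict], column: str, categories: List[str]) -> List[dict]:
    """One-hot encode a categorical column into 0/1 indicator columns."""
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != column}
        for category in categories:
            out[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(out)
    return encoded


def predict(row: Dict[str, float]) -> float:
    """Hypothetical fixed weights standing in for a trained, hosted model."""
    return 0.5 * row["logins"] + 10.0 * row["plan_pro"]


def load(destination: List[dict], rows: List[dict]) -> None:
    """Stand-in for writing predictions to another database or application."""
    destination.extend(rows)


destination: List[dict] = []
features = encode(clean(extract()), "plan", ["free", "pro"])
load(destination, [{"id": r["id"], "score": predict(r)} for r in features])
print(destination)
```

Each function is a place where a schema change can break the pipeline, which is the maintenance burden the interview refers to.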

MindsDB, on the other hand, has two components. One is a core suite of machine learning models that are adapted to different problem sets. MindsDB can look at your data and make a decision about which model best applies, and choose that. Within this component you can also bring your own model: if there’s something you particularly like, you can add it to the MindsDB ML core using a declarative framework.

The other piece of MindsDB is the database connector, a wrapper that sits around these ML models and provides a connection to whatever data source you have. It can be a streaming broker, a data lake, or an SQL-based database that MindsDB connects to directly. Then, using the native query language, you can tell MindsDB, “Read this data and train a predictor on this view or these tables or this selection of data.”

What is the benefit of using MindsDB?

I think it’s important to make this really clear: We are not replacing anybody. For an internal machine learning engineer or a data scientist, MindsDB just saves a tremendous amount of the work that goes into data wrangling, cleaning, transforming, and coding. Then they can really focus on the core models, on selecting the data they want to train from, and then building the best models. So the whole thing is about time saving for data scientists.

And then, in the longer term, if you connect this directly to your database, you don’t have to maintain a lot of the ML infrastructure. If your database tables change, you just change a little bit of SQL. You can set up your own retraining schema. It all saves a data scientist tons of time and gives them a richer tool set. That’s our goal.

Can you provide some examples of use cases?

We really focus on business forecasting, often on time-series data. Imagine you’ve got something like a retail chain that has thousands of SKUs—thousands of product IDs across hundreds of retail shops. Maybe a certain SKU sells well in Wichita but doesn’t sell well in Detroit. How do you predict that? That’s a sticky problem to solve, but it also tends to be a very common type of data set for business forecasting.
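The shape of this problem, one time series per SKU and store, can be illustrated with a deliberately simple baseline. The sketch below is not how MindsDB forecasts; it swaps in a plain moving average, and the SKU names, stores, and sales figures are invented. It only shows why the data has to be grouped by both keys before any forecasting happens.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sale = Tuple[str, str, int, float]  # (sku, store, week, units)


def forecast_next_week(
    sales: List[Sale], window: int = 3
) -> Dict[Tuple[str, str], float]:
    """Moving-average baseline: for every (sku, store) pair, forecast next
    week's units as the mean of the most recent `window` observations."""
    history: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    # Sort by week so each series is in chronological order.
    for sku, store, _week, units in sorted(sales, key=lambda s: s[2]):
        history[(sku, store)].append(units)
    forecast = {}
    for key, units in history.items():
        recent = units[-window:]
        forecast[key] = sum(recent) / len(recent)
    return forecast


# Illustrative data: the same SKU sells very differently in two cities.
sales = [
    ("sku-1", "wichita", 1, 20.0),
    ("sku-1", "wichita", 2, 22.0),
    ("sku-1", "wichita", 3, 24.0),
    ("sku-1", "detroit", 1, 3.0),
    ("sku-1", "detroit", 2, 2.0),
    ("sku-1", "detroit", 3, 4.0),
]

forecast = forecast_next_week(sales, window=2)
print(forecast)
```

With thousands of SKUs across hundreds of shops, the number of separate series grows multiplicatively, which is what makes this a hard problem at scale.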

One very typical use case we have is with a big cloud service provider, where we do customer-conversion prediction. It has a generous free-trial tier, and we can tell it with a very high degree of accuracy who’s likely to convert to a paying tier, and when. We’re also working with a large infrastructure company on network planning, capacity planning. We can predict fairly well where network traffic is going to go, where it’s going to be heavy and not, and where the company needs to add infrastructure.

One of our most enjoyable projects, one that’s really close to my heart, is working with a big e-sports franchise, building forecasting tools for coaching professional video game teams. For example, predicting what the other team is going to do, for use in internal scrimmages and internal training. Or what would be the best strategy given a certain situation in MOBA games like League of Legends or Dota 2? It’s an exotic case now, but I guarantee it’s one that’s going to grow in the future.

Where is the best place for a business to start with machine learning?

Super easy: Cloud.mindsdb.com. We have a free-trial tier, and it’s super easy to set up. Wherever your data’s living, you can simply plug MindsDB in and start to run some forecasting—do some testing and see how it works. You can take it for a test drive immediately. The other thing is to join our community. At MindsDB.com we’ve got a link to our community Slack and to GitHub, which is extremely active, and you can find support and tips there.

How are you working with Intel®, and what has been the value of that partnership?

Intel has been extremely supportive on a number of fronts. Obviously, it has a great hardware platform, and we have implemented their OpenVINO™ framework. We’ve made great performance gains that way. And, on top of that, Intel provides tons of technology and go-to-market opportunities.

Any final thoughts or key takeaways to leave us with?

Go test it out. MindsDB is actually pretty fun to play with—that’s how I got involved. If you take it for a test drive, provide feedback on the community Slack. We’re always looking for product improvements and people to join the community.