MLOps For The Cowboy

Getting Things Done on Limited Rations

There are a lot of things you can do, but when you’re solo out on the ranch, with water and food running low, these are the lean MLOps practices you need to survive.

1) Design

Before you spend an ounce of resources on a project, you need to intimately understand the problem, why it’s worth solving, why existing alternatives don’t work, and the resources you will have access to for solving it. Once you uncover this information, a design will manifest naturally.

1.1) Define The Problem, and Question That it’s Worth Solving

While some might not include this in the MLOps pipeline, it’s arguably the most important step. The easiest way to fix an issue is to either expose it as a non-issue or present it as a symptom of a larger issue. Machine learning is complex, time-consuming, expensive, and fundamentally experimental. It’s vitally important to trim any excess fat off a project, and the beginning is the best time to do it.

Your goal in this phase is to understand the problem being faced, probably by the person asking you to do the project or a prospective customer for a new product. You need to understand what problem they’re facing, why they’re pressed to solve that problem, and what it would look like for them to consider it solved.

Here’s some questions you should be able to answer before kicking off a project:

  1. “Why is it useful for this problem to be solved?”

  2. “Do all stakeholders really find this project worthwhile?”

  3. “Is this the most important problem? What are some similar problems?”

  4. “Who is involved with this problem?”

  5. “What do stakeholders expect to change if the problem gets solved?”

You’re looking to verify that the stakeholders of the project have enough interest to weather the ups and downs of a new machine learning project, which has all the turbulence of a normal software project and then some. You also want to get to the point where you can guess people’s responses. If you can’t, you should ask more questions.

This general idea is co-opted from the customer discovery phase of the lean startup process. I highly recommend The Lean Startup by Eric Ries for any data scientist trying to improve their non-technical skills.

1.2) Prioritize

You now have a problem which, if it’s like any problem I’ve ever been tasked to tackle, is actually a collection of smaller sub-problems. Using a combination of your experience in software and knowledge gained about the specific problem you’re solving, you can now begin prioritizing pieces of the larger problem.

Your job is to make as many things as possible low priority. If a piece can be pushed off as an added feature, it’s low priority. If there’s not a compelling reason the customer would use a particular feature consistently, it’s low priority.

Your client came to you with a problem and, as a result, they have a mental bias to consider every aspect of that problem as high priority. It’s your job to rein in feature creep and hone your backlog of tasks down to a razor’s edge. This will give you the freedom of time and the necessary focus to be successful. The name of the game is getting the maximum utility at a minimal cost.

1.3) Resource Check

Once you’ve broken down your project into individual tasks, these are the questions you should be asking, in order:

  1. Is there a product that solves this problem?

  2. Is there a non-ML algorithm that solves this problem?

  3. Is there a pre-trained model that solves this problem?

  4. Is there a similar model I can adapt to solve this problem?

  5. Is there available data for solving this problem?

  6. Does the client have data, and in what condition is it?

  7. How do I collect data for this problem?

Too many developers jump right to planning architecture. By going through this list of questions, you can turn a three-month job into a three-day job.

1.4) Design

The design phase for a new ML project is a bit different from other types of software projects, chiefly in terms of lifecycle. In app development, for instance, once you hammer down your requirements you plan a list of tasks and then execute. Of course there are bugs which might delay progress, but any dev worth their salt wouldn’t set off to make a social media app and accidentally make a banking app.

Machine learning is fundamentally different. I liken it to carving a statue out of marble. The data (the stone) has a family of possible models inside of it, and it’s our job as data scientists to try to uncover the model we want (the statue). However, we won’t know if there’s a crack in the marble until we cut into it. We might find that an outside glance at the data looks good, but through modeling we discover hidden imbalances, insurmountable overlap, etc. The statue your client wants simply might not exist in the data.

As a result, even more so than with other software projects, the design of an ML project has to be fluid, and all the stakeholders and participants in the project need to be aware of that. Start off with the easiest, most impactful requirements for which you have the most ample resources, and design a solution with those in mind. Then, in the next phase, your job is to de-risk that design as quickly as possible.

2) Develop

This is how to go from your first design to your first deployment:

2.1) Discovery

So you’ve designed a solution. Is it any good? It’s time to find out. Set up a notebook, ingest your data, and get some results. You don’t need to worry about reaching the required performance for the project, or the deployability of the model, or anything else. Your only job in this phase is to get results that either prove or disprove the feasibility of the design.

This is incredibly important, and I can’t stress this enough: Regardless of what anyone wants or says, your job is to prove or disprove feasibility. If you try too hard to prove feasibility you’ll inevitably create some complex operation that leaks data into your test set, or otherwise compromises your testing process. This is something that junior data scientists underestimate, and senior data scientists (under pressure of deadlines, bonuses, life, and death) can attest to as being a surprisingly slippery issue.
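To make the leakage point concrete, here’s a minimal sketch (using made-up data and scikit-learn, both assumptions on my part) of the classic version of this mistake: fitting a preprocessing step on the full dataset before splitting, so information from the test rows bleeds into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data purely for illustration.
X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)

# Leaky: the scaler is fit on ALL rows, so statistics from the test set
# leak into the training features and test scores end up optimistic.
X_leaky = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, random_state=0)

# Safe: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```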

If there is ever a phase to take your time, this is the phase to do it in. By the end, you should be pretty sure whether the project will sink or float.

2.2) Data Ingestion and Versioning

Assuming you’re building a new model based on custom data, it will be critical to set up data versioning and highly recommended to finalize data ingestion before proceeding.

By Dataset Versioning I mean having a complete snapshot of your modeling data, cleaned up and ready to model on, saved as an artifact in an easy-to-access place. If you’re on AWS, that might be as a CSV or folder of CSVs in an S3 bucket. If you’re on GCP, that might be in Google Cloud Storage or Google Drive. When you want to model on more data, you create a new snapshot as a new dataset; you don’t modify or delete the old one. You do this to make sure you can re-run and replicate old code.

I generally recommend CSVs for data versioning. I’ve experimented with a few file formats, but CSVs are easy to read and aren’t tied to a particular library version, which means they play nice with different environments. The cost of larger file sizes and slower read times is often outweighed by sheer convenience. Pandas, for instance, can read a CSV straight from S3. If you’re in an enterprise with an entire data engineering department you might be cringing at this advice, but this article simply isn’t for that type of work environment.
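As a rough sketch of what this looks like in practice, here’s the kind of snapshot helper I’m describing. The bucket name is hypothetical, and reading or writing s3:// paths with pandas assumes s3fs is installed.

```python
from datetime import date

import pandas as pd

def save_snapshot(df: pd.DataFrame, name: str) -> str:
    """Write an immutable, dated copy of the modeling data; never overwrite old snapshots."""
    path = f"s3://my-project-data/snapshots/{name}_{date.today().isoformat()}.csv"
    df.to_csv(path, index=False)
    return path

def load_snapshot(path: str) -> pd.DataFrame:
    """Reload an old snapshot so past experiments can be replicated exactly."""
    return pd.read_csv(path)
```

When new data arrives you save a fresh snapshot under a new name or date, and old notebooks keep pointing at the paths they were built against.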

By Data Ingestion I mean the creation of new data. Odds are you will need some additional data, especially if you’re working for a company that’s hiring an individual or small team to do their data science work. It’s useful to know how new data is coming in, where it’s going, and that this new data will respect old data schemas and formats.
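One lightweight way to confirm that incoming data respects the old schema is to compare it against your latest versioned snapshot. This is only a sketch; the two DataFrames come from your own project.

```python
import pandas as pd

def check_schema(new_df: pd.DataFrame, reference_df: pd.DataFrame) -> None:
    """Fail loudly if freshly ingested data drifts from the versioned snapshot's schema."""
    missing = set(reference_df.columns) - set(new_df.columns)
    if missing:
        raise ValueError(f"Incoming data is missing columns: {missing}")

    mismatched = {
        col: (str(new_df[col].dtype), str(reference_df[col].dtype))
        for col in reference_df.columns
        if str(new_df[col].dtype) != str(reference_df[col].dtype)
    }
    if mismatched:
        raise ValueError(f"Column dtype mismatches (new vs. reference): {mismatched}")
```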

2.3) Experiment Until It’s Good Enough

Calling machine learning “development” sometimes seems like a misnomer. Really, you run a bunch of different tests with a bunch of different ideas a bunch of times until you end up with something that looks good on the bench.

Some people like to employ fancy training pipelines, which can be vital for large modeling projects in big companies, but it is perfectly fine to have a well-documented series of notebooks. In fact, in many ways, it’s superior. Just remember to use proper validation strategies if you’re doing hyperparameter optimization, to make sure you’re not unwittingly fitting to your test set.
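For what it’s worth, here’s a minimal sketch of the validation pattern I mean, using scikit-learn with an illustrative model and parameter grid: the test set is carved off once and scored only at the very end, while all hyperparameter decisions happen via cross-validation on the training split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Carve off a test set once and don't touch it while tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# All hyperparameter selection happens via cross-validation on the training split.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 8]},
    cv=5,
)
search.fit(X_train, y_train)

# Score the held-out test set exactly once, at the end.
print("CV score:", search.best_score_)
print("Held-out test score:", search.score(X_test, y_test))
```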

3) Deploy

3.1) Up on the Cloud

So you have a model that seems good. Of course, you don’t know it’s good; you can’t be 100% sure until it’s been exposed to a test audience and given the green light. However, results on your test partition appear promising.

There are a ton of ways to deploy a model. Some of them are simple; many of them are excessively complex. Depending on your application you might need fancy queues, batching strategies, or GPU acceleration. However, I implore you to keep it simple. You really don’t even know if this model is going to work yet. Throw it up on a server and let application developers and testers start playing with it before you invest time and effort optimizing a resource that might never get used.

I’m a fan of wrapping the model in an HTTP Flask app (or FastAPI if you want to be a bit more prod-ready), sticking it in a Docker container, and pushing it to a Docker hosting service. AWS Elastic Beanstalk is a great choice; ECR and ECS are a bit more customizable. I’m sure GCP has similar resources available. Similarly to dataset versioning, I also like storing my models along with the notebook which generated them. This is an incredibly easy way to get a high degree of traceability.
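As a reference point, here’s roughly what that wrapper looks like. The model.pkl filename and the request format are assumptions for the sake of the sketch; the real app should match however your model expects its inputs.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the pickled model once at startup (assumes it sits next to this script).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[...], [...]]}.
    features = request.get_json()["features"]
    predictions = model.predict(features).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

From there, the Dockerfile is just a Python base image, a pip install of Flask and your modeling libraries, and a command that runs this script.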

If you want to get to testing faster, you can use PythonAnywhere, which is a lightweight deployment service run by Anaconda. If you want to be a bit more kosher, but still keep it lean, I recommend BentoML.

3.2) Testing

It’s very possible that the testing strategy you implemented isn’t indicative of what key stakeholders in the project actually care about. Let them have an intuitive experience with the machine learning product, and listen to their feedback very closely. If humans like it, and the numbers in your tests look good, then congratulations. If they don’t like it, kick off another (this time much faster) round of designing, developing, and deploying.

Follow For More!

In future posts, I’ll also be describing several landmark papers in the ML space, with an emphasis on practical and intuitive explanations.

Please like, share, and follow. As an independent author, your support really makes a huge difference!

Attribution: All of the images in this document were created by Daniel Warfield, unless a source is otherwise provided. You can use any images in this post for your own non-commercial purposes, so long as you reference this article, https://danielwarfield.dev, or both.