This is a template with questions/answers I find useful when designing an ML system. It can also be useful for your ML SD interview preparation, and you can use it as a checklist.

Template

I will give a template with very quick definitions, so you can use the table of contents as a quick reference. Just remember that every decision has trade-offs, and you should at least mention them in your design doc.

Company constraints

Main goal, translated into money

Here we try to define why we want to solve this problem. Do we want to grow a product, increase revenue, cut costs? What would success look like, and what would count as failure?

Team size + deadline

For the first iteration, what resources do we have and how fast do we have to deliver?

Problem exploration

Scope, scale and users

Define the big scope, and then narrow it. Scope creep is bad! Start small. Think about the scale: requests per second, storage needed, response times. Think about the users and what's special about them (B2B vs B2C, located in one place, traffic spikes after lunch, etc.).
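A back-of-envelope estimate often settles the scale question early. A minimal sketch, where every number is a hypothetical placeholder:

```python
# Back-of-envelope capacity estimate; all numbers are hypothetical placeholders.
daily_active_users = 1_000_000
requests_per_user_per_day = 20
peak_factor = 3  # e.g., the after-lunch spike mentioned above

avg_rps = daily_active_users * requests_per_user_per_day / 86_400  # seconds per day
peak_rps = avg_rps * peak_factor

bytes_per_logged_request = 500  # features + prediction + metadata
log_gb_per_day = daily_active_users * requests_per_user_per_day * bytes_per_logged_request / 1e9

print(f"avg RPS: {avg_rps:.0f}, peak RPS: {peak_rps:.0f}, logs: {log_gb_per_day:.0f} GB/day")
```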

Metrics

Sub goals + business KPIs

Basically the key metric of success. Usually you have several; pick the most relevant one and focus on it. Don't forget about the other metrics and common sense: gaming a metric is really common and leads to a bad product.

ML metrics

Find a good proxy metric for optimization. You have to know the idea behind the most popular ones, and in detail the ones you have worked with. Here is a reference: Machine Learning Tasks and Common Metrics.
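As a quick illustration, here is a minimal scikit-learn sketch for a binary classifier (the labels and scores are made up):

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy ground truth and model scores, just to show the API.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3, 0.2, 0.7]   # predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]             # thresholded decisions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("ROC AUC:  ", roc_auc_score(y_true, y_prob))   # threshold-free ranking quality
```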

Baseline, architecture, its scale and limits

Always start with the simplest baseline to figure out the architecture and understand the problem. It can be just a formula, like a moving average for a regression task. Already at this stage you may see, for example, that it won't be possible to just use a transformer, because its latency would be too high.
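A minimal sketch of such a baseline, assuming a univariate series of observations:

```python
from collections import deque

class MovingAverageBaseline:
    """Predict the mean of the last `window` observed values."""

    def __init__(self, window: int = 7):
        self.history = deque(maxlen=window)  # old values drop out automatically

    def update(self, y: float) -> None:
        self.history.append(y)

    def predict(self) -> float:
        # Fall back to 0.0 before any data has been seen.
        return sum(self.history) / len(self.history) if self.history else 0.0

baseline = MovingAverageBaseline(window=3)
for y in [10.0, 12.0, 11.0, 13.0]:
    baseline.update(y)
print(baseline.predict())  # mean of the last 3 values: 12.0
```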

Data

Data collection and storage

What kind of data do we need now? What should we store for the future? Where will it be stored, how is it served for training/inference? Can we start collecting with our baseline model now, while we’re building something more complex? Remember, interactions not collected are lost forever.
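A minimal sketch of logging interactions next to the baseline's predictions, so the data already exists when a better model needs it (the schema and the file destination are assumptions; a real system would write to a queue or a warehouse):

```python
import json
import time

def log_interaction(features: dict, prediction: float, path: str = "interactions.jsonl") -> None:
    """Append one prediction event as a JSON line."""
    event = {
        "ts": time.time(),
        "features": features,      # everything the model saw at request time
        "prediction": prediction,  # observed outcomes (click, purchase, ...) are joined in later
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

log_interaction({"user_id": 42, "device": "mobile"}, prediction=0.73)
```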

Data processing and Feature engineering

Target variable, previous system vs manual labels

Sometimes you inherit a data bias from the previous system; sometimes you don't have any labels yet.

Preprocessing and outliers

Normalization, encoding, embeddings, text transforms like lemmatization, standardization for images, and other domain-specific steps.
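A minimal scikit-learn sketch for a mixed numeric/categorical table (the columns are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "country": ["US", "DE", "US", "FR"],
})

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age"]),                                # zero mean, unit variance
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["country"]),  # unseen values -> all zeros
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 4): one scaled column + three one-hot columns
```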

User, query, context, target features and their combination

User-based features like usage history, query (content) features like the product being recommended, context-based features like day of the week and device, …
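A tiny sketch of combining those groups, assuming a recommendations-style table (all columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "user_avg_spend": [20.0, 55.0, 12.0],  # user feature: usage history
    "item_price":     [15.0, 60.0, 40.0],  # query/content feature
    "hour_of_day":    [13, 9, 20],         # context feature
})

# Combinations often carry signal that the raw features miss:
df["price_vs_habit"] = df["item_price"] / df["user_avg_spend"]  # user x content
df["is_after_lunch"] = df["hour_of_day"].between(12, 14)        # context flag
print(df)
```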

Modelling

In-depth solution, pros and cons

Usually there are trade-offs between inference time and model size, and between explainability and accuracy. You have to know the idea behind the most popular approaches like regressions, gradient boosting, and neural networks, and know your own area in deep detail.

Model training and debugging

Model convergence, overfitting, architecture changes

What happens if the model's predictions are not good enough? How do you improve them? What is the trade-off when you increase the model size, for example? What happens when the model is overfitting? How do you fix it? What is the trade-off there?
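One standard answer to overfitting is early stopping on a validation metric. A minimal sketch, with the actual training step stubbed out by a fake loss curve:

```python
import random

def run_one_epoch(epoch: int) -> float:
    """Stub standing in for real training; returns a fake validation loss that plateaus."""
    return max(0.2, 1.0 - 0.1 * epoch) + random.uniform(0.0, 0.01)

def train_with_early_stopping(max_epochs: int = 100, patience: int = 5) -> float:
    """Stop once the validation loss hasn't improved for `patience` epochs."""
    best_val, stale_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = run_one_epoch(epoch)
        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0  # a real loop would also checkpoint the weights here
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                print(f"early stop at epoch {epoch}, best val loss {best_val:.4f}")
                break
    return best_val

train_with_early_stopping()
```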

Data drift, feature creation, upstream data bugs

How do you train on COVID-period data? How do you create features for online serving and make sure they are the same as in your offline tests? How do you handle incorrect data coming into your model? How do you catch it during inference?
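For the last question, a minimal sketch of validating inputs at inference time against bounds observed on the training set (the features and bounds are hypothetical):

```python
# Bounds observed on the training data (hypothetical values).
FEATURE_BOUNDS = {"age": (0, 120), "price": (0.0, 10_000.0)}

def validate_features(features: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks sane."""
    problems = []
    for name, (lo, hi) in FEATURE_BOUNDS.items():
        value = features.get(name)
        if value is None:
            problems.append(f"{name} is missing")
        elif not lo <= value <= hi:
            problems.append(f"{name}={value} outside training range [{lo}, {hi}]")
    return problems

print(validate_features({"age": -3, "price": 19.99}))  # ['age=-3 outside training range [0, 120]']
```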

Prediction post-processing, guardrails, fallback

Are there any additional filters on your predictions? For example, do you need to add boundaries for a regression model? What if your model predicts unsafe results? What would the fallback behavior be?
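A minimal sketch of both a guardrail and a fallback, with hypothetical bounds and a hypothetical default:

```python
def safe_predict(model_predict, features, lo=0.0, hi=100.0, default=50.0) -> float:
    """Clip predictions to a sane range; fall back to a default if the model fails."""
    try:
        prediction = model_predict(features)
    except Exception:
        return default                    # fallback: e.g., a global average or the previous system
    if prediction != prediction:          # NaN guard
        return default
    return min(max(prediction, lo), hi)   # boundary guardrail

print(safe_predict(lambda f: 140.0, {}))  # 100.0 -- clipped to the upper bound
print(safe_predict(lambda f: 1 / 0, {}))  # 50.0  -- fallback on failure
```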

Serving

Offline/online evaluation

Classic system design decisions

Here you can just open one of the usual system design templates and think about the points below.

Product integration

Is it a real-time or a batch system? Where exactly is the model located (S3 / DB blobs) and why?

Ways to speed up

Do you need any caching? How do you update it? Any problems during parallelization?
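A minimal caching sketch with a time-to-live, assuming predictions may be slightly stale (the key layout and TTL are assumptions):

```python
import time

class TTLCache:
    """Tiny prediction cache; entries expire after `ttl` seconds, so updates eventually show up."""

    def __init__(self, ttl: float = 60.0):
        self.ttl = ttl
        self.store = {}  # key -> (value, timestamp)

    def get_or_compute(self, key, compute):
        entry = self.store.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]                     # cache hit
        value = compute()                       # cache miss: run the model
        self.store[key] = (value, time.time())
        return value

cache = TTLCache(ttl=30.0)
score = cache.get_or_compute(("user_42", "item_7"), lambda: 0.83)  # hypothetical model call
```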

Model monitoring and support

Roll out, shadowing, A/B testing, early bugs

How do we start serving it? What if it goes wrong right away? Any kill switches?
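A minimal kill-switch sketch: route a small fraction of users to the new model with stable bucketing, and keep a flag that reverts everyone instantly (the config values are assumptions; in practice the flag would come from a live config service):

```python
import hashlib

ROLLOUT_FRACTION = 0.05  # serve the new model to 5% of users
KILL_SWITCH = False      # flip to True to send 100% of traffic back to the old model

def use_new_model(user_id: str) -> bool:
    """Stable per-user bucketing, so a user doesn't flip between models across requests."""
    if KILL_SWITCH:
        return False
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_FRACTION * 100

print(use_new_model("user_42"))
```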

Retraining, alarms, dashboards, data drifts

How do we support the model? What happens if something breaks in the middle of the night, and how do we find out about it? Do we have to retrain? What if there's another COVID, and how do we spot it in the data?
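A minimal drift alarm sketch using a two-sample Kolmogorov-Smirnov test on a single feature (SciPy; the alert threshold is a judgment call, not a standard):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 5_000)  # feature distribution at training time
live_feature = rng.normal(0.5, 1.0, 5_000)      # shifted distribution in production

stat, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:  # hypothetical alert threshold
    print(f"ALERT: feature drifted (KS={stat:.3f}, p={p_value:.1e}); page someone, consider retraining")
```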

Useful links