Machine Learning Engineer Interview Questions and Prep Pointers

About Machine Learning Engineer interviews

Machine Learning Engineer interviews sit at the intersection of software engineering, applied ML, and production systems thinking — and the loop is usually longer than candidates expect. A typical process runs four to six stages: a recruiter screen focused on stack alignment and scope, a hiring manager conversation probing your past ML projects end-to-end, a coding round (LeetCode-style or ML-flavoured data manipulation), an ML system design interview, an ML depth/theory round, and a behavioural or team-fit final. Interviewers are usually senior MLEs, applied scientists, or engineering managers. The coding round screens for whether you can actually ship; the ML system design round screens for whether you understand the realities of training pipelines, feature stores, online serving, monitoring, and retraining cadence. The depth round tests whether you understand what's happening inside the models you use, not just the API. Where candidates most often stumble: treating ML system design like generic web system design and forgetting data drift, label latency, training/serving skew, and offline-online evaluation gaps; being unable to defend modelling choices beyond "it worked on Kaggle"; and underestimating the engineering bar — many MLE rejections happen at the coding round, not the ML round. Strong candidates demonstrate they've owned a model in production through at least one full lifecycle, including the unglamorous parts.

Typical stages

Recruiter screen
Hiring manager interview
Coding interview
ML system design
ML depth / applied theory
Behavioural / team fit final

Common formats

Live coding (Python/SQL)
ML system design whiteboard
ML theory deep-dive
Behavioural STAR
Take-home modelling exercise
Portfolio / past project walkthrough

What hiring managers screen for

End-to-end ownership: has shipped a model to production and dealt with the aftermath (monitoring, drift, retraining)
Strong software engineering fundamentals — clean code, testing, CI/CD, not just notebook experimentation
Pragmatic modelling judgement: picks the simplest model that meets the business metric, not the most fashionable one
Fluency with ML infrastructure: feature stores, orchestration (Airflow/Kubeflow), model registries, serving frameworks
Ability to translate ambiguous product problems into a measurable ML formulation with the right offline and online metrics

Red flags to avoid

Only Kaggle or coursework experience with no production deployment story
Cannot explain training/serving skew, data leakage, or how they validated their model beyond a single train/test split
Reaches for deep learning or LLMs when a logistic regression or gradient-boosted tree would solve the problem
No awareness of monitoring, drift detection, or what would trigger a retrain
Weak coding fundamentals — struggles to write clean, tested Python outside of a notebook

Primary questions (15)

Behavioural

Tell me about an ML model you took from prototype to production. What did the journey actually look like?

Why this comes up: This is the single most common opening question because it instantly separates candidates with real production experience from those with only experimental work.

Prep pointers

Pick a project where YOU owned the deployment, not one where you handed a notebook to a platform team.
STAR Situation: anchor the business problem and the baseline (rule-based? human? older model?). STAR Task: your specific scope. STAR Action: walk through data pipeline, training, validation strategy, deployment pattern (batch vs. online), monitoring. STAR Result: business metric impact AND model performance, plus what broke after launch.
Be ready for the follow-up: 'What went wrong post-launch?' — having no answer signals you weren't actually on-call for it.
Avoid getting stuck in modelling detail; interviewers want to hear about the unglamorous infrastructure and rollout parts.
Quantify wherever possible — latency targets hit, % uplift, cost per inference.

Behavioural

Describe a time you had to push back on a stakeholder who wanted you to use a specific model or approach you disagreed with.

Why this comes up: MLEs constantly face pressure to use LLMs, deep learning, or 'the latest thing' when simpler approaches suffice — interviewers want to see backbone and judgement.

Prep pointers

Choose a story where you actually changed the outcome, not one where you complied and grumbled.
STAR Action should explicitly cover how you framed the trade-off in their language — cost, latency, maintainability, time-to-value — rather than purely technical arguments.
Show you ran a small experiment or back-of-envelope analysis to make the case evidence-based.
Avoid making the stakeholder sound stupid; the strongest version shows you understood why they wanted what they wanted.
Result should ideally include what you learned about communicating with non-technical stakeholders.

Behavioural

Tell me about a time a model you deployed underperformed in production compared to offline metrics. How did you diagnose and fix it?

Why this comes up: Training/serving skew and offline-online metric gaps are bread-and-butter MLE problems; how you debug reveals seniority.

Prep pointers

Have a specific story ready — vague 'we sometimes see drift' answers will get probed hard.
STAR Action should walk through your diagnostic hierarchy: data distribution check → feature parity check → label delay → selection bias → model staleness.
Mention specific tools or techniques (PSI, KS tests, shadow deployment, A/B comparison) where relevant.
Be honest about what surprised you — interviewers value the lesson more than the heroics.
Avoid framing this as a one-off; show you put monitoring in place to catch it earlier next time.
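The feature parity check in the diagnostic hierarchy above can be sketched as a comparison between feature values logged at serving time and values recomputed offline for the same entities. This is a minimal illustration, not a standard tool: the function name `feature_parity_report`, the dict-of-dicts layout, the feature names, and the tolerance are all assumptions.

```python
import math

def feature_parity_report(offline, online, rel_tol=1e-6):
    """Compare offline (training) and online (serving) feature values.

    offline / online: dict mapping entity_id -> dict of feature name -> value.
    Returns feature name -> fraction of shared entities whose values disagree.
    A value that is missing or None on only one side counts as a disagreement.
    """
    mismatches, totals = {}, {}
    for entity_id in offline.keys() & online.keys():
        off_row, on_row = offline[entity_id], online[entity_id]
        for name in off_row.keys() | on_row.keys():
            totals[name] = totals.get(name, 0) + 1
            off_val, on_val = off_row.get(name), on_row.get(name)
            if off_val is None or on_val is None:
                same = off_val is None and on_val is None
            elif isinstance(off_val, (int, float)) and isinstance(on_val, (int, float)):
                same = math.isclose(off_val, on_val, rel_tol=rel_tol)
            else:
                same = off_val == on_val
            if not same:
                mismatches[name] = mismatches.get(name, 0) + 1
    return {name: mismatches.get(name, 0) / totals[name] for name in totals}

# Hypothetical data: one of two users has a different 7-day count online.
offline = {
    "u1": {"txn_count_7d": 3, "avg_amount": 20.0},
    "u2": {"txn_count_7d": 5, "avg_amount": 11.5},
}
online = {
    "u1": {"txn_count_7d": 3, "avg_amount": 20.0},
    "u2": {"txn_count_7d": 4, "avg_amount": 11.5},
}
report = feature_parity_report(offline, online)
```

In a story like this, a non-zero mismatch rate on a single feature usually points at the pipeline that computes it rather than at the model.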

Behavioural

Walk me through a time you had to balance model performance against latency, cost, or interpretability constraints.

Why this comes up: Production ML is constraint-driven; this question screens for whether you've actually had to make these trade-offs rather than optimising AUC in isolation.

Prep pointers

Have concrete numbers ready: latency budget in ms, infra cost per 1000 inferences, accuracy delta you accepted.
STAR Task should make the constraint explicit and non-negotiable (e.g. p99 < 50ms for a checkout flow).
Action should cover what you tried that didn't work, not just the final solution — quantisation, distillation, feature pruning, simpler architecture.
Common failure: presenting this as a pure ML problem when the interesting trade-offs are usually engineering.
Result should tie back to business impact, not just the technical metric.
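The "concrete numbers" above are easy to get wrong under pressure, so it can help to know how they are computed. The sketch below uses the nearest-rank definition of a percentile and a deliberately simple cost model (one instance, constant load). The latency samples, instance price, and request rate are made-up values for illustration.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def cost_per_1000_inferences(hourly_instance_cost, requests_per_second):
    """Infra cost per 1000 inferences for one instance at a sustained request rate."""
    inferences_per_hour = requests_per_second * 3600
    return hourly_instance_cost / inferences_per_hour * 1000

latencies_ms = list(range(1, 101))  # hypothetical samples: 1 ms .. 100 ms
p99 = percentile_nearest_rank(latencies_ms, 99)
meets_budget = p99 < 50  # the example budget from the pointer above
cost = cost_per_1000_inferences(hourly_instance_cost=3.60, requests_per_second=100)
```

Note that the mean of these samples is about 50 ms while the p99 is 99 ms, which is why a tail-latency budget and an average-latency budget lead to different engineering decisions.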

Technical

Design an ML system to detect fraudulent transactions in real time at the scale of millions of transactions per day.

Why this comes up: Real-time, high-volume, imbalanced-class problems are a canonical MLE system design prompt because they force you to address every part of the lifecycle.

Prep pointers

Start by clarifying scope: what 'fraud' means, label availability and delay, latency budget, acceptable false positive rate, regulatory constraints.
Cover the full stack: data ingestion, feature engineering (especially streaming features), feature store (online vs. offline parity), model choice (why GBDT often beats deep learning here), serving, monitoring.
Explicitly address class imbalance strategy AND the label delay problem — fraud labels often arrive days later.
Discuss the human-in-the-loop component; pure ML systems rarely work in fraud.
Be prepared to deep-dive on any one component; interviewers will pick the area you sounded weakest on.
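If the interviewer deep-dives on streaming features, a small concrete example helps. The sketch below shows one common kind of fraud feature, a per-card transaction count over a trailing window, kept in memory with a deque. The class name `VelocityCounter`, the 60-second window, and the card identifiers are assumptions for illustration; a production system would typically keep this state in a stream processor or online feature store.

```python
from collections import defaultdict, deque

class VelocityCounter:
    """Counts events per key within a trailing time window (in seconds)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def add_and_count(self, key, timestamp):
        """Record an event and return how many events for `key` fall in
        (timestamp - window_seconds, timestamp].
        Assumes timestamps arrive in non-decreasing order per key."""
        events = self._events[key]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()  # evict events that have left the window
        return len(events)

counter = VelocityCounter(window_seconds=60)
counts = [
    counter.add_and_count("card_A", 0),
    counter.add_and_count("card_A", 10),
    counter.add_and_count("card_B", 15),
    counter.add_and_count("card_A", 50),
    counter.add_and_count("card_A", 65),  # the event at t=0 has expired by now
]
```

The same windowing logic has to be reproduced when building training data, which is where the online vs. offline parity concern comes from.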

Technical

Explain how you would detect data drift and concept drift in a deployed model, and what you'd do about each.

Why this comes up: Drift is the most common production ML failure mode, and how you handle it is a frequent depth question that separates practitioners from theorists.

Prep pointers

Be precise about the distinction: data/feature drift (P(X) changes) vs. concept drift (P(Y|X) changes) vs. label drift (P(Y) changes).
Know specific detection methods: PSI, KL divergence, KS test for features; performance degradation tracking with delayed labels for concept drift.
Discuss thresholds and alerting — at what point do you retrain vs. investigate vs. roll back?
Address the chicken-and-egg problem: detecting concept drift requires labels, which often arrive late.
Mention shadow models and champion/challenger setups as detection tools, not just deployment patterns.
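Of the detection methods named above, PSI is the one interviewers most often ask candidates to write out. A minimal sketch follows, assuming equal-width bins over the reference range (quantile bins are also common) and a small `eps` floor so empty bins do not produce log(0). The function name and sample data are illustrative.

```python
import math

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one numeric feature.

    Bin edges are equal-width over the reference range; current values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    # PSI = sum over bins of (actual - expected) * ln(actual / expected)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))

reference = [i / 100 for i in range(1000)]  # hypothetical training-time feature values
shifted = [x + 3 for x in reference]        # same shape, shifted upward
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

Cut-offs such as 0.1 and 0.25 are often quoted as rules of thumb for "investigate" and "significant shift"; they are conventions rather than statistical guarantees, so be ready to say how you would calibrate them for a given feature.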

Technical

Given a 100GB training dataset that won't fit in memory, how would you train a gradient-boosted model on it?

Why this comes up: Tests practical engineering knowledge of distributed training, out-of-core algorithms, and whether you've actually worked at scale or only on toy datasets.

Prep pointers

Cover multiple approaches: sampling strategies, distributed frameworks (XGBoost/LightGBM distributed mode, Spark MLlib), out-of-core / external memory algorithms.
Discuss when sampling is actually fine and how to validate the sample is representative.
Address feature engineering at scale — when do you push transformations to Spark/SQL vs. in-memory.
Be ready for the follow-up about evaluation: cross-validation strategy at scale, leakage risks with temporal data.
Don't jump straight to distributed training — discuss whether you actually need all 100GB first.
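For the sampling option above, one technique worth being able to sketch is reservoir sampling, which draws a uniform sample of fixed size from a stream in a single pass without loading the data into memory. The version below is the classic Algorithm R; `row_stream` is a hypothetical stand-in for lazily reading rows from disk.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Return k rows drawn uniformly at random from an iterable of unknown
    length, holding at most k rows in memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row  # keep the new row with probability k / (i + 1)
    return reservoir

def row_stream(n):
    """Hypothetical stand-in for reading rows lazily from disk."""
    for i in range(n):
        yield {"row_id": i, "label": int(i % 100 == 0)}

sample = reservoir_sample(row_stream(100_000), k=1_000)
```

A uniform sample can under-represent a rare positive class, so for imbalanced problems it is reasonable to mention stratifying by label, and for temporal data to sample within time windows rather than across them.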

Technical

How would you evaluate a recommendation model before launching it to users?

Why this comes up: Offline evaluation of ranking/recommendation systems is notoriously tricky and reveals depth of applied ML thinking beyond classification metrics.

Prep pointers

Cover offline metrics (NDCG, MAP, recall@k, hit rate) AND their limitations — they're proxies for user behaviour, not the thing itself.
Discuss counterfactual evaluation and the off-policy estimation problem when training data comes from a previous policy.
Address the cold start, popularity bias, and diversity dimensions — a model with great NDCG can still be a bad product.
Be ready to discuss interleaving, A/B testing design, and what guardrail metrics you'd monitor.
Mention business metrics (engagement, retention) and the offline-online correlation problem.
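Interviewers sometimes ask candidates to define the offline metrics rather than just name them. The sketch below implements NDCG@k (using the linear-gain form, rel / log2(position + 1)) and recall@k for a single user. The item identifiers and relevance grades are made up for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over 1-based positions i <= k."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_items, relevance_by_item, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    gains = [relevance_by_item.get(item, 0.0) for item in ranked_items]
    ideal = sorted(relevance_by_item.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant_items:
        return 0.0
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)

relevance = {"a": 3, "b": 2, "c": 1}  # hypothetical graded relevance for one user
ndcg_perfect = ndcg_at_k(["a", "b", "c", "d"], relevance, k=3)
ndcg_reversed = ndcg_at_k(["d", "c", "b", "a"], relevance, k=3)
recall_top2 = recall_at_k(["d", "c", "b", "a"], {"a", "b", "c"}, k=2)
```

Both metrics only score items the user is already known to have interacted with, which is exactly the logged-policy limitation the counterfactual evaluation pointer refers to.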

Situational

Your production model's accuracy dropped 8% overnight. Walk me through the first hour of your investigation.

Why this comes up: Incident response is part of the MLE role; this question tests systematic debugging under pressure rather than panic.

Prep pointers

Lead with what you check FIRST — usually data pipeline health, not the model. Models don't change overnight; data does.
Walk through a clear triage hierarchy: is the metric calculation correct → did inputs change → did a deploy happen → did upstream systems change → is there a label issue.
Discuss rollback criteria: at what point do you revert to the previous model version vs. continue investigating.
Mention communication — who do you tell, when, and what's your update cadence.
Avoid leaping to 'retrain the model' as the first action; that's usually the wrong move during an incident.
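The "did inputs change" step in the triage hierarchy above is often answered fastest by comparing null rates per feature between a known-good batch and the current one. A minimal sketch follows; the function name, the 5% threshold, and the sample rows are assumptions for illustration.

```python
def null_rate_changes(baseline_rows, current_rows, threshold=0.05):
    """Return features whose null rate moved by more than `threshold` (absolute)
    between a baseline batch and the current batch of input rows (dicts)."""
    def null_rates(rows):
        features = set().union(*(row.keys() for row in rows))
        return {
            f: sum(1 for row in rows if row.get(f) is None) / len(rows)
            for f in features
        }

    base, curr = null_rates(baseline_rows), null_rates(current_rows)
    changes = {}
    for feature in base.keys() | curr.keys():
        before = base.get(feature, 1.0)  # a feature absent from a batch counts as all-null
        after = curr.get(feature, 1.0)
        if abs(after - before) > threshold:
            changes[feature] = (before, after)
    return changes

# Hypothetical batches: an upstream change starts sending null countries.
yesterday = [
    {"amount": 10.0, "country": "GB"},
    {"amount": 12.0, "country": "FR"},
    {"amount": 9.0, "country": "GB"},
    {"amount": 11.0, "country": "DE"},
]
today = [
    {"amount": 10.0, "country": None},
    {"amount": 12.0, "country": None},
    {"amount": 9.0, "country": "GB"},
    {"amount": 11.0, "country": None},
]
changes = null_rate_changes(yesterday, today)
```

A check like this takes minutes to run and usually either finds the cause or lets you rule out the input pipeline before looking at deploys and labels.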

Situational

A product manager asks you to build a model to 'predict customer churn' with no further specification. How do you proceed?

Why this comes up: Translating vague business asks into well-formed ML problems is a daily MLE skill, and many candidates jump straight to modelling without scoping.

Prep pointers

Resist the urge to dive into algorithm choice; the interviewer is screening for problem framing.
Walk through the questions you'd ask: what is churn (cancellation? inactivity? what window?), what action will be taken on predictions, what's the cost of false positives vs. false negatives, what data exists.
Discuss whether ML is even the right answer — sometimes a simple recency/frequency rule is enough.
Cover how you'd define success — both offline metric AND the downstream business metric (retained revenue, intervention conversion).
Mention you'd propose a phased approach: rule-based baseline first, then iterate.
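To make the scoping questions above concrete, it can help to show that a churn definition and a rule-based baseline each fit in a few lines. The sketch below assumes churn means "no activity in the 30 days after a snapshot date" and uses a 14-day recency rule as the baseline. Both thresholds, the function names, and the customer data are illustrative choices, not recommendations.

```python
from datetime import date, timedelta

def churned(activity_dates, snapshot, window_days=30):
    """Label: True if the customer has no activity in (snapshot, snapshot + window_days]."""
    window_end = snapshot + timedelta(days=window_days)
    return not any(snapshot < d <= window_end for d in activity_dates)

def recency_rule(activity_dates, snapshot, max_idle_days=14):
    """Baseline prediction: flag as at-risk if idle for more than max_idle_days
    as of the snapshot. Uses only activity on or before the snapshot."""
    past = [d for d in activity_dates if d <= snapshot]
    if not past:
        return True
    return (snapshot - max(past)).days > max_idle_days

snapshot = date(2024, 3, 1)
customers = {
    "c1": [date(2024, 2, 28), date(2024, 3, 10)],  # active before and after the snapshot
    "c2": [date(2024, 1, 5)],                      # long idle, never returns
    "c3": [date(2024, 2, 25)],                     # recently active, then gone
}
labels = {cid: churned(dates, snapshot) for cid, dates in customers.items()}
predictions = {cid: recency_rule(dates, snapshot) for cid, dates in customers.items()}
```

Here the rule catches the long-idle customer but misses the one who was recently active and then left, which is the kind of gap that would justify moving beyond the baseline. Keeping the label window strictly after the snapshot and the features strictly on or before it also avoids leakage.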

Situational

You inherit a legacy model with no documentation, no tests, and no monitoring. What's your plan for the first 30 days?

Why this comes up: Most MLEs spend significant time on inherited systems; this reveals pragmatism, prioritisation, and engineering maturity.

Prep pointers

Lead with risk reduction, not improvement: monitoring and reproducibility come before any model changes.
Walk through a concrete sequence: understand current performance → instrument it → make training reproducible → add tests → THEN consider improvements.
Discuss how you'd build trust with stakeholders during a period where you might not ship visible improvements.
Address the temptation to rewrite everything; that's usually wrong.
Mention you'd want to find out who built it and what context exists, even informally, before changing anything.

Competency

How do you decide between a simpler interpretable model and a more complex black-box model for a given problem?

Why this comes up: Tests modelling judgement and whether you default to complexity — a major signal of MLE maturity.

Prep pointers

Frame it as a decision driven by constraints, not preference: regulatory needs, stakeholder trust, debugging cost, performance ceiling.
Discuss the 'simpler model first' principle and when you'd actually escalate complexity — only when the simple model demonstrably leaves value on the table.
Mention specific tooling (SHAP, LIME) and their limitations rather than treating interpretability as a binary.
Have a real example where you chose the simpler model and it was the right call, and ideally one where complexity was justified.
Avoid sounding dogmatic in either direction; the answer is always 'it depends' but you need to articulate ON what.

Competency

How do you approach experimentation and iteration when improving an existing model?

Why this comes up: Reveals whether you have a disciplined experimental process or chase ideas randomly — a common weakness in self-taught MLEs.

Prep pointers

Discuss having a clear baseline and a single primary metric before starting any experiment.
Cover experiment tracking (MLflow, W&B, Neptune) and why it matters — not for tooling theatre but for reproducibility and comparison.
Talk about error analysis as the highest-leverage activity: looking at where the current model fails before trying new techniques.
Mention prioritisation: rank candidate ideas by expected impact vs. effort before running them.
Avoid making this sound like a Kaggle competition; production iteration is constrained by deployment cost and risk.

Culture fit

How do you stay current with the ML field without getting distracted by every new paper or framework?

Why this comes up: Filters for engineers who can separate hype from substance — critical given the LLM/foundation model noise of recent years.

Prep pointers

Be specific about your information diet: which newsletters, papers, or communities, and crucially what you filter OUT.
Show you distinguish between 'interesting to read' and 'worth implementing' — the bar for the latter is much higher.
Mention how you evaluate whether a new technique is actually relevant to your problems vs. just trending.
Avoid name-dropping papers you haven't actually read or applied.
A strong answer acknowledges that fundamentals (statistics, software engineering) compound more than chasing new architectures.

Culture fit

Describe how you work with data scientists, software engineers, and product managers on an ML project.

Why this comes up: MLE is inherently cross-functional; companies want to know you can operate at the seams between disciplines without territorial friction.

Prep pointers

Be specific about hand-off points — where does the DS hand off, where does the SWE pick up, what do you own end-to-end.
Acknowledge the role ambiguity that exists in most companies and how you navigate it rather than complain about it.
Mention concrete artefacts you produce or consume: model cards, evaluation reports, design docs.
Discuss how you've handled disagreements with a DS over model choice or with a SWE over deployment patterns.
Avoid sounding like you do everyone else's job; show respect for what each role brings.

More practice questions (15)

Explain the bias-variance trade-off in the context of a model you've actually built.

How does XGBoost differ from a random forest, and when would you choose one over the other?

Walk me through how you'd build a feature store from scratch and why you'd need one.

How would you handle severe class imbalance in a binary classification problem?

Explain how you'd serve a deep learning model with sub-100ms p99 latency requirements.

What's the difference between batch and online inference, and how do you decide which to use?

How would you A/B test a new model against the production model? What guardrails would you set?

Tell me about a time you had to debug an ML pipeline failure that turned out to be unrelated to the model itself.

Describe a project where you had to make significant trade-offs due to limited labelled data.

A regulator asks you to explain why your model declined a specific customer's application. How do you respond?

You discover the training data has been contaminated with future information. What do you do?

How do you decide when a model is 'good enough' to deploy?

How do you document an ML system so that another engineer can take it over?

What kind of ML problems are you most drawn to, and why?

How do you feel about on-call rotations for production ML systems?
