InterviewAlly

Data Science Interview: SQL, Stats & ML Guide

A complete guide to data science interviews — from SQL window functions and hypothesis testing to ML concepts and take-home projects.

March 1, 2026 · 13 min read

Data Science · SQL · Interview Prep

Data visualization dashboard with statistical charts and machine learning model diagrams

Data science interviews are among the most demanding in tech, blending software engineering, statistics, and business thinking into a single process. This guide covers the full spectrum of data science interview questions you'll encounter — from SQL deep dives and probability puzzles to machine learning system design and product case studies.

What Data Science Interviews Test

Unlike pure software engineering loops, data science interviews evaluate a uniquely broad skill set. Expect four to six rounds covering:

  • SQL & Data Manipulation — Complex queries with window functions, CTEs, and self-joins
  • Statistics & Probability — Distributions, hypothesis testing, Bayesian reasoning, A/B test design
  • Machine Learning — Algorithm explanations, trade-offs, end-to-end ML pipeline design
  • Product Analytics / Business Case — Metric definition, metric investigation, data-driven solutions
  • Coding — Python/R data manipulation, sometimes LeetCode-style problems
  • Take-Home Project — Clean data, build a model, present findings

SQL Deep Dive: Window Functions, CTEs & Optimization

SQL is the universal language of data science interviews. If you've worked through our top 20 SQL interview questions, you have the foundations. Data science rounds push further.

Window Functions

Window functions are the most tested advanced SQL topic. You'll need fluency with ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), and running aggregates.

-- Find the second-highest salary per department
SELECT department, name, salary
FROM (
  SELECT department, name, salary,
         DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk
  FROM employees
) ranked
WHERE rnk = 2;

CTEs and Recursive Queries

A classic question: "Find users who made a purchase on three consecutive days."

WITH daily_purchases AS (
  SELECT user_id, DATE(purchase_time) as purchase_date,
         LAG(DATE(purchase_time), 1) OVER (PARTITION BY user_id ORDER BY DATE(purchase_time)) as prev_day,
         LAG(DATE(purchase_time), 2) OVER (PARTITION BY user_id ORDER BY DATE(purchase_time)) as two_days_ago
  FROM purchases
)
SELECT DISTINCT user_id
FROM daily_purchases
WHERE purchase_date = prev_day + INTERVAL '1 day'
  AND prev_day = two_days_ago + INTERVAL '1 day';

Query Optimization

  • Indexing — Add indexes on columns used in WHERE, JOIN, and ORDER BY
  • Avoid SELECT * — Only retrieve needed columns
  • EXPLAIN plans — Read execution plans to identify full table scans
  • Partitioning — For very large tables, partition by date or region
  • Materialized views — Pre-compute expensive aggregations

Statistics: Probability, Hypothesis Testing & A/B Tests

Probability Essentials

  • "You roll two dice. What's the probability the sum is 7?" — 6/36 = 1/6
  • Bayes' Theorem — Nearly always tested. Practice the "disease testing" problem.

Hypothesis Testing

  1. State H0 and H1
  2. Choose significance level (alpha, typically 0.05)
  3. Select the appropriate test (t-test, chi-squared, z-test)
  4. Compute test statistic and p-value
  5. Decide whether to reject H0

A/B Test Design

  • Metric selection — Primary metric, guardrail metrics
  • Sample size calculation — Based on baseline rate, MDE, power (80%), significance (5%)
  • Randomization unit — User-level vs. session-level
  • Duration — At least one full business cycle (1-2 weeks)
  • Novelty and primacy effects — Exclude first few days or use holdout

Machine Learning Concepts Interviewers Expect

AlgorithmTypeWhen to UseKey Trade-off
Linear RegressionRegressionContinuous target, interpretability neededAssumes linearity; sensitive to outliers
Logistic RegressionClassificationBinary outcomes, need probabilitiesLinear decision boundary; fast
Random ForestBothGeneral-purpose, reduced varianceLess interpretable; slower inference
XGBoost / LightGBMBothTabular data competitionsRequires careful hyperparameter tuning
K-MeansClusteringUnsupervised grouping, EDAMust specify K; assumes spherical clusters
Neural NetworksBothUnstructured data (images, text)Data-hungry; black box

Commonly Asked ML Questions

  • "What is the bias-variance trade-off?" — Bias: underfitting. Variance: overfitting. Goal: minimize total error.
  • "How do you handle class imbalance?" — SMOTE, undersampling, class weights, or precision-recall AUC.
  • "Explain regularization." — L1 (Lasso) drives coefficients to zero. L2 (Ridge) shrinks smoothly. Elastic Net combines both.
  • "Bagging vs. boosting?" — Bagging: parallel, reduces variance. Boosting: sequential, reduces bias.

Take-Home Projects: How to Stand Out

  1. Start with EDA — Show distributions, check for nulls, identify outliers
  2. Document assumptions — State every assumption explicitly
  3. Use a simple baseline first — Show the complex model actually improves performance
  4. Feature engineering — Create domain-relevant features
  5. Validate properly — Stratified k-fold or time-based splits
  6. Write production-quality code — Functions, type hints, clear names, README
  7. Communicate results — What you tried, what worked, what you'd do with more time

Product Analytics & Business Case Questions

These are common at Meta, Airbnb, Spotify, and Uber.

"Daily active users dropped 10% yesterday. How would you investigate?"

  1. Verify the data — Logging issue? Broken ETL? Timezone change?
  2. Segment the drop — By platform, geography, new vs. existing users
  3. Check external factors — Holidays, competitor launches, outages
  4. Check internal changes — Recent deployments, A/B tests, feature removals
  5. Quantify and communicate — Narrow down root cause and estimate impact

Building Your Data Science Interview Prep Plan

WeekFocus AreaDaily Practice
1-2SQL2-3 SQL problems per day
3-4Statistics & ProbabilityReview theory + 2 puzzles per day
5-6Machine LearningExplain one algorithm per day; build a mini project
7Product Analytics2 metric/case questions per day
8Mock Interviews3-4 full mocks with peers

Read our guide on acing the technical phone screen for tips on communicating under time constraints, and our system design guide if your target companies include ML system design rounds.

Ready to start your data science interview prep? Try InterviewAlly free and practice with real-time AI assistance across SQL, ML, and product analytics questions.