Data Science Interview: SQL, Stats & ML Guide
A complete guide to data science interviews — from SQL window functions and hypothesis testing to ML concepts and take-home projects.
Data science interviews are among the most demanding in tech, blending software engineering, statistics, and business thinking into a single process. This guide covers the full spectrum of data science interview questions you'll encounter — from SQL deep dives and probability puzzles to machine learning system design and product case studies.
What Data Science Interviews Test
Unlike pure software engineering loops, data science interviews evaluate a uniquely broad skill set. Expect four to six rounds covering:
- SQL & Data Manipulation — Complex queries with window functions, CTEs, and self-joins
- Statistics & Probability — Distributions, hypothesis testing, Bayesian reasoning, A/B test design
- Machine Learning — Algorithm explanations, trade-offs, end-to-end ML pipeline design
- Product Analytics / Business Case — Metric definition, metric investigation, data-driven solutions
- Coding — Python/R data manipulation, sometimes LeetCode-style problems
- Take-Home Project — Clean data, build a model, present findings
SQL Deep Dive: Window Functions, CTEs & Optimization
SQL is the universal language of data science interviews. If you've worked through our top 20 SQL interview questions, you have the foundations. Data science rounds push further.
Window Functions
Window functions are the most tested advanced SQL topic. You'll need fluency with ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), and running aggregates.
-- Find the second-highest salary per department
SELECT department, name, salary
FROM (
SELECT department, name, salary,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
CTEs and Recursive Queries
A classic question: "Find users who made a purchase on three consecutive days."
WITH daily_purchases AS (
SELECT user_id, DATE(purchase_time) as purchase_date,
LAG(DATE(purchase_time), 1) OVER (PARTITION BY user_id ORDER BY DATE(purchase_time)) as prev_day,
LAG(DATE(purchase_time), 2) OVER (PARTITION BY user_id ORDER BY DATE(purchase_time)) as two_days_ago
FROM purchases
)
SELECT DISTINCT user_id
FROM daily_purchases
WHERE purchase_date = prev_day + INTERVAL '1 day'
AND prev_day = two_days_ago + INTERVAL '1 day';
Query Optimization
- Indexing — Add indexes on columns used in WHERE, JOIN, and ORDER BY
- Avoid SELECT * — Only retrieve needed columns
- EXPLAIN plans — Read execution plans to identify full table scans
- Partitioning — For very large tables, partition by date or region
- Materialized views — Pre-compute expensive aggregations
Statistics: Probability, Hypothesis Testing & A/B Tests
Probability Essentials
- "You roll two dice. What's the probability the sum is 7?" — 6/36 = 1/6
- Bayes' Theorem — Nearly always tested. Practice the "disease testing" problem.
Hypothesis Testing
- State H0 and H1
- Choose significance level (alpha, typically 0.05)
- Select the appropriate test (t-test, chi-squared, z-test)
- Compute test statistic and p-value
- Decide whether to reject H0
A/B Test Design
- Metric selection — Primary metric, guardrail metrics
- Sample size calculation — Based on baseline rate, MDE, power (80%), significance (5%)
- Randomization unit — User-level vs. session-level
- Duration — At least one full business cycle (1-2 weeks)
- Novelty and primacy effects — Exclude first few days or use holdout
Machine Learning Concepts Interviewers Expect
| Algorithm | Type | When to Use | Key Trade-off |
|---|---|---|---|
| Linear Regression | Regression | Continuous target, interpretability needed | Assumes linearity; sensitive to outliers |
| Logistic Regression | Classification | Binary outcomes, need probabilities | Linear decision boundary; fast |
| Random Forest | Both | General-purpose, reduced variance | Less interpretable; slower inference |
| XGBoost / LightGBM | Both | Tabular data competitions | Requires careful hyperparameter tuning |
| K-Means | Clustering | Unsupervised grouping, EDA | Must specify K; assumes spherical clusters |
| Neural Networks | Both | Unstructured data (images, text) | Data-hungry; black box |
Commonly Asked ML Questions
- "What is the bias-variance trade-off?" — Bias: underfitting. Variance: overfitting. Goal: minimize total error.
- "How do you handle class imbalance?" — SMOTE, undersampling, class weights, or precision-recall AUC.
- "Explain regularization." — L1 (Lasso) drives coefficients to zero. L2 (Ridge) shrinks smoothly. Elastic Net combines both.
- "Bagging vs. boosting?" — Bagging: parallel, reduces variance. Boosting: sequential, reduces bias.
Take-Home Projects: How to Stand Out
- Start with EDA — Show distributions, check for nulls, identify outliers
- Document assumptions — State every assumption explicitly
- Use a simple baseline first — Show the complex model actually improves performance
- Feature engineering — Create domain-relevant features
- Validate properly — Stratified k-fold or time-based splits
- Write production-quality code — Functions, type hints, clear names, README
- Communicate results — What you tried, what worked, what you'd do with more time
Product Analytics & Business Case Questions
These are common at Meta, Airbnb, Spotify, and Uber.
"Daily active users dropped 10% yesterday. How would you investigate?"
- Verify the data — Logging issue? Broken ETL? Timezone change?
- Segment the drop — By platform, geography, new vs. existing users
- Check external factors — Holidays, competitor launches, outages
- Check internal changes — Recent deployments, A/B tests, feature removals
- Quantify and communicate — Narrow down root cause and estimate impact
Building Your Data Science Interview Prep Plan
| Week | Focus Area | Daily Practice |
|---|---|---|
| 1-2 | SQL | 2-3 SQL problems per day |
| 3-4 | Statistics & Probability | Review theory + 2 puzzles per day |
| 5-6 | Machine Learning | Explain one algorithm per day; build a mini project |
| 7 | Product Analytics | 2 metric/case questions per day |
| 8 | Mock Interviews | 3-4 full mocks with peers |
Read our guide on acing the technical phone screen for tips on communicating under time constraints, and our system design guide if your target companies include ML system design rounds.
Ready to start your data science interview prep? Try InterviewAlly free and practice with real-time AI assistance across SQL, ML, and product analytics questions.