Data Science Interview: SQL, Stats & ML Guide

Name: InterviewAlly Desktop
Author: InterviewAlly

A complete guide to data science interviews — from SQL window functions and hypothesis testing to ML concepts and take-home projects.

March 1, 2026 · 13 min read

Data Science · SQL · Interview Prep

Data visualization dashboard with statistical charts and machine learning model diagrams

Data science interviews are among the most demanding in tech, blending software engineering, statistics, and business thinking into a single process. This guide covers the full spectrum of data science interview questions you'll encounter — from SQL deep dives and probability puzzles to machine learning system design and product case studies.

What Data Science Interviews Test

Unlike pure software engineering loops, data science interviews evaluate a uniquely broad skill set. Expect four to six rounds covering:

SQL & Data Manipulation — Complex queries with window functions, CTEs, and self-joins
Statistics & Probability — Distributions, hypothesis testing, Bayesian reasoning, A/B test design
Machine Learning — Algorithm explanations, trade-offs, end-to-end ML pipeline design
Product Analytics / Business Case — Metric definition, metric investigation, data-driven solutions
Coding — Python/R data manipulation, sometimes LeetCode-style problems
Take-Home Project — Clean data, build a model, present findings

SQL Deep Dive: Window Functions, CTEs & Optimization

SQL is the universal language of data science interviews. If you've worked through our top 20 SQL interview questions, you have the foundations. Data science rounds push further.

Window Functions

Window functions are the most tested advanced SQL topic. You'll need fluency with ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), and running aggregates.

-- Find the second-highest salary per department
SELECT department, name, salary
FROM (
  SELECT department, name, salary,
         DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rnk
  FROM employees
) ranked
WHERE rnk = 2;

CTEs and Recursive Queries

A classic question: "Find users who made a purchase on three consecutive days."

WITH daily_purchases AS (
  SELECT user_id, DATE(purchase_time) as purchase_date,
         LAG(DATE(purchase_time), 1) OVER (PARTITION BY user_id ORDER BY DATE(purchase_time)) as prev_day,
         LAG(DATE(purchase_time), 2) OVER (PARTITION BY user_id ORDER BY DATE(purchase_time)) as two_days_ago
  FROM purchases
)
SELECT DISTINCT user_id
FROM daily_purchases
WHERE purchase_date = prev_day + INTERVAL '1 day'
  AND prev_day = two_days_ago + INTERVAL '1 day';

Query Optimization

Indexing — Add indexes on columns used in WHERE, JOIN, and ORDER BY
Avoid SELECT * — Only retrieve needed columns
EXPLAIN plans — Read execution plans to identify full table scans
Partitioning — For very large tables, partition by date or region
Materialized views — Pre-compute expensive aggregations

Statistics: Probability, Hypothesis Testing & A/B Tests

Probability Essentials

"You roll two dice. What's the probability the sum is 7?" — 6/36 = 1/6
Bayes' Theorem — Nearly always tested. Practice the "disease testing" problem.

Hypothesis Testing

State H0 and H1
Choose significance level (alpha, typically 0.05)
Select the appropriate test (t-test, chi-squared, z-test)
Compute test statistic and p-value
Decide whether to reject H0

A/B Test Design

Metric selection — Primary metric, guardrail metrics
Sample size calculation — Based on baseline rate, MDE, power (80%), significance (5%)
Randomization unit — User-level vs. session-level
Duration — At least one full business cycle (1-2 weeks)
Novelty and primacy effects — Exclude first few days or use holdout

Machine Learning Concepts Interviewers Expect

Algorithm	Type	When to Use	Key Trade-off
Linear Regression	Regression	Continuous target, interpretability needed	Assumes linearity; sensitive to outliers
Logistic Regression	Classification	Binary outcomes, need probabilities	Linear decision boundary; fast
Random Forest	Both	General-purpose, reduced variance	Less interpretable; slower inference
XGBoost / LightGBM	Both	Tabular data competitions	Requires careful hyperparameter tuning
K-Means	Clustering	Unsupervised grouping, EDA	Must specify K; assumes spherical clusters
Neural Networks	Both	Unstructured data (images, text)	Data-hungry; black box

Commonly Asked ML Questions

"What is the bias-variance trade-off?" — Bias: underfitting. Variance: overfitting. Goal: minimize total error.
"How do you handle class imbalance?" — SMOTE, undersampling, class weights, or precision-recall AUC.
"Explain regularization." — L1 (Lasso) drives coefficients to zero. L2 (Ridge) shrinks smoothly. Elastic Net combines both.
"Bagging vs. boosting?" — Bagging: parallel, reduces variance. Boosting: sequential, reduces bias.

Take-Home Projects: How to Stand Out

Start with EDA — Show distributions, check for nulls, identify outliers
Document assumptions — State every assumption explicitly
Use a simple baseline first — Show the complex model actually improves performance
Feature engineering — Create domain-relevant features
Validate properly — Stratified k-fold or time-based splits
Write production-quality code — Functions, type hints, clear names, README
Communicate results — What you tried, what worked, what you'd do with more time

Product Analytics & Business Case Questions

These are common at Meta, Airbnb, Spotify, and Uber.

"Daily active users dropped 10% yesterday. How would you investigate?"

Verify the data — Logging issue? Broken ETL? Timezone change?
Segment the drop — By platform, geography, new vs. existing users
Check external factors — Holidays, competitor launches, outages
Check internal changes — Recent deployments, A/B tests, feature removals
Quantify and communicate — Narrow down root cause and estimate impact

Building Your Data Science Interview Prep Plan

Week	Focus Area	Daily Practice
1-2	SQL	2-3 SQL problems per day
3-4	Statistics & Probability	Review theory + 2 puzzles per day
5-6	Machine Learning	Explain one algorithm per day; build a mini project
7	Product Analytics	2 metric/case questions per day
8	Mock Interviews	3-4 full mocks with peers

Read our guide on acing the technical phone screen for tips on communicating under time constraints, and our system design guide if your target companies include ML system design rounds.

Ready to start your data science interview prep? Try InterviewAlly free and practice with real-time AI assistance across SQL, ML, and product analytics questions.