Amazon Data Scientist Interview Preparation Guide (Mid-Level)

Data Scientist · Amazon · Mid Level · 8 rounds · Updated 6/16/2026

Amazon's Data Scientist interview process consists of an initial recruiter screen followed by two technical phone screens and five onsite rounds. The process evaluates candidates across SQL, Machine Learning, Python coding, Statistics, Algorithms, and Behavioral/Cultural fit. Interviewers assess both technical depth and ability to translate business problems into data-driven solutions. The entire process typically spans 4-6 weeks.

Interview Rounds

Recruiter Screening

30 min · 4 focus topics · culture fit

What to Expect

Your initial conversation with an Amazon recruiter focuses on your background, your motivation for the role, and your basic qualifications. It is primarily a fit assessment, not a technical evaluation. The recruiter will walk through the role expectations and the interview process, and answer initial questions about Amazon as an employer.

Tips & Advice

Research Amazon's Data Science function beforehand. Be prepared to discuss your career trajectory and why you're interested in a Data Scientist role at Amazon specifically. Highlight projects where you've made impact with data. Ask thoughtful questions about the team and role to demonstrate genuine interest. Keep answers concise and results-oriented. Mention any experience with AWS tools or large-scale data problems.

Focus Topics

Career Motivation & Growth Mindset

Articulate why you want to work at Amazon, what excites you about the Data Scientist role, and your long-term career aspirations. Discuss how you stay current with data science trends and technologies. Show genuine enthusiasm for solving ambiguous business problems.

Understanding of Amazon Data Scientist Role

Demonstrate knowledge of what Data Scientists actually do at Amazon including working with massive datasets, building predictive models, conducting statistical analysis, and partnering with business stakeholders. Show understanding of how data science drives business decisions at scale.

Communication & Collaboration Style

Discuss how you communicate technical findings to non-technical stakeholders, collaborate with engineers and product managers, and handle disagreements on technical approaches. Provide examples of working effectively across teams.

Professional Background & Relevant Experience

Clearly articulate your career path, data science projects, technical skills, and measurable impacts. Be ready to discuss 2-3 projects that showcase your ability to work end-to-end on data science problems. Emphasize domain expertise relevant to Amazon's business (e-commerce, recommendation systems, logistics, customer analytics).

Technical Phone Screen 1: SQL & Data Analysis

60 min · 4 focus topics · technical

What to Expect

A focused 45-60 minute technical interview assessing your SQL proficiency and data analysis capabilities. You'll be expected to write SQL queries to solve business problems, optimize query performance, and demonstrate understanding of databases and data manipulation. The interviewer may present business scenarios requiring you to extract insights from databases.

Tips & Advice

Practice writing complex SQL queries involving multiple joins, subqueries, window functions, and aggregations. Focus on query optimization and explaining your approach. Use an online SQL environment like LeetCode or DataLemur to practice before the interview. When given a problem, clarify requirements before coding. Explain your logic as you write. Test edge cases. Be comfortable reasoning about large datasets conceptually. Discuss the cost of your queries, such as rows scanned, join strategy, and index use. Have a structured approach to problem-solving.

Focus Topics

Python for Data Exploration & Validation

Use Python (Pandas, NumPy) to validate query results, explore data characteristics, check data quality, and perform initial analysis. Understand when to use Python vs SQL for data manipulation based on problem complexity and dataset size.

Query Optimization & Performance Analysis

Understand query execution plans, indexing strategies, and how to optimize slow queries. Learn to identify N+1 query problems and bottlenecks. Discuss trade-offs between query readability and performance. Understand database concepts like EXPLAIN plans and how to use them.

Business Metrics & Data Analysis via SQL

Calculate key business metrics like revenue, user retention, conversion rates, customer lifetime value, product performance trends, and cohort metrics using SQL. Practice defining metrics accurately and handling edge cases like null values and data quality issues.
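
As an illustration of metric definition with null handling, here is a minimal sketch that computes conversion rate per acquisition channel against an in-memory SQLite database. The `users` and `orders` tables, the `channel` column, and the function name are hypothetical and chosen only for this example. Users with a NULL channel are grouped as 'unknown', and orders with NULL revenue are treated as non-purchases.

```python
import sqlite3

def conversion_by_channel(user_rows, order_rows):
    """Return (channel, users, buyers, conversion_rate) per channel."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER, channel TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, revenue REAL)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", user_rows)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", order_rows)
    query = """
        SELECT
            COALESCE(u.channel, 'unknown')  AS channel,
            COUNT(DISTINCT u.user_id)       AS users,
            COUNT(DISTINCT o.user_id)       AS buyers,
            ROUND(1.0 * COUNT(DISTINCT o.user_id)
                      / COUNT(DISTINCT u.user_id), 4) AS conversion_rate
        FROM users u
        LEFT JOIN orders o
          ON o.user_id = u.user_id
         AND o.revenue IS NOT NULL          -- data-quality rule: NULL revenue is not a purchase
        GROUP BY COALESCE(u.channel, 'unknown')
        ORDER BY 1
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

users = [(1, "ads"), (2, "ads"), (3, None), (4, "email")]
orders = [(10, 1, 20.0), (11, 1, 5.0), (12, 3, None), (13, 4, 7.5)]
for row in conversion_by_channel(users, orders):
    print(row)
```

The LEFT JOIN keeps users who never ordered in the denominator, and COUNT(DISTINCT ...) prevents a user with several orders from being counted as several buyers.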

Complex SQL Query Writing

Master writing SQL queries involving joins (INNER, LEFT, RIGHT, FULL), self-joins, subqueries, CTEs (WITH clauses), and window functions (RANK, ROW_NUMBER, LEAD, LAG). Practice aggregations with GROUP BY and HAVING. Be able to handle business problems such as retention rates, churn, cohort analysis, and time-series metrics.

Technical Phone Screen 2: Machine Learning & Modeling

60 min · 5 focus topics · technical

What to Expect

A 45-60 minute technical interview focused on Machine Learning concepts, model development, and Python coding. You'll discuss ML algorithms, model evaluation metrics, handling data quality issues, and potentially implement a simple model or solve ML-related coding problems. Expect questions on regularization, class imbalance, and practical ML considerations.

Tips & Advice

Study both fundamental ML concepts and practical considerations. Be ready to explain how you'd build an end-to-end model for a given business problem. Know the assumptions, strengths, and weaknesses of common algorithms. Prepare Python implementations using scikit-learn. Discuss model evaluation metrics appropriate for different problems. Explain your approach to feature engineering. Practice handling scenarios like imbalanced datasets and missing values. Connect technical concepts to business impact. Walk through your debugging process for model issues.

Focus Topics

Feature Engineering & Data Preparation

Design and implement features from raw data. Understand feature scaling, encoding categorical variables, handling temporal features, and feature interactions. Discuss feature importance and selection techniques. Know when to create new features vs when existing features suffice.

Python ML Implementation & Coding

Write clean, efficient Python code for model development using scikit-learn, TensorFlow, or similar libraries. Implement data preprocessing pipelines, model training, evaluation, and prediction. Write reproducible code with proper logging and error handling. Solve ML-related coding problems efficiently.

Handling Data Quality & Class Imbalance

Develop strategies for handling missing data, outliers, and imbalanced datasets. Know techniques like oversampling, undersampling, SMOTE, adjusting class weights, and threshold adjustment. Discuss when each approach is appropriate. Handle data quality issues that arise from real-world data.
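
Two of these techniques, class weights and threshold adjustment, are simple enough to sketch without a library. The function names below are hypothetical. The weight formula, n_samples / (n_classes * class_count), is the same inverse-frequency rule that scikit-learn applies for class_weight="balanced".

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def predict_with_threshold(probabilities, threshold=0.5):
    """Turn positive-class probabilities into labels using a custom cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]

labels = [0] * 95 + [1] * 5           # 5% positives
print(balanced_class_weights(labels))  # the rare class gets the larger weight
print(predict_with_threshold([0.10, 0.30, 0.70], threshold=0.25))
```

Lowering the threshold trades precision for recall without retraining, whereas class weights change what the model learns. Which one fits depends on the relative cost of false positives and false negatives.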

Model Evaluation Metrics & Validation

Master different evaluation metrics: accuracy, precision, recall, F1-score for classification; MSE, RMSE, MAE, R-squared for regression. Understand when each metric is appropriate. Practice cross-validation techniques, train-test splits, and learning curves. Understand overfitting vs underfitting and how to detect them. Discuss the difference between optimizing for business metrics vs statistical metrics.
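
To make the classification metrics concrete, the following sketch computes them from raw labels. The function name is hypothetical, and the positive class is assumed to be encoded as 1. The sample data shows how accuracy can look acceptable while recall is poor on an imbalanced problem.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

Here the model is right on 7 of 10 rows but finds only 1 of the 3 positives, which is the kind of gap between accuracy and recall that interviewers tend to probe.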

Machine Learning Algorithms & Model Selection

Understand classification, regression, and clustering algorithms including logistic regression, decision trees, random forests, SVM, k-means, and gradient boosting. Know when to use each algorithm based on problem type, dataset size, and interpretability requirements. Discuss trade-offs between algorithms. Be able to explain how algorithms work conceptually and mathematically at a mid-level depth.

Onsite Round 1: Machine Learning & Modeling Deep Dive

60 min · 6 focus topics · technical

What to Expect

A 60-minute onsite interview with an Amazon Data Scientist diving deep into machine learning concepts, advanced modeling techniques, and your ability to translate complex business problems into ML solutions. Expect detailed discussions on model architecture, optimization, regularization, and real-world ML considerations. You may discuss a past project you led or work through a detailed ML design problem.

Tips & Advice

Prepare a detailed technical project to discuss with full understanding of trade-offs and lessons learned. Practice explaining ML concepts clearly at different technical levels. Be ready for deep-dive questions on regularization, optimization algorithms, and hyperparameter tuning. Discuss how you handle ambiguity in problem definition. Talk about measuring model impact in production. Discuss scalability challenges and solutions. Show ownership of end-to-end model lifecycle. Mention A/B testing strategies for model deployment. Have opinions backed by data on different approaches.

Focus Topics

Model Architecture Design & Deep Learning Concepts

Understand neural network architectures, activation functions, and when to use deep learning. Discuss CNNs, RNNs, and transformers at a conceptual level. Know about batch normalization, dropout, and optimization algorithms like Adam vs SGD. Be able to design appropriate architectures for different problem types.

Model Interpretability, Explainability & Debugging

Explain model predictions to non-technical stakeholders. Use techniques like SHAP, LIME, or feature importance analysis. Debug failing models systematically. Discuss when interpretability is critical vs when black-box models are acceptable. Understand model bias and fairness considerations.

Building Scalable ML Pipelines & Production Considerations

Design ML pipelines that scale to large datasets. Understand batch vs online prediction. Discuss model serving, inference optimization, and latency constraints. Know about feature stores, model versioning, and monitoring. Understand the complete ML lifecycle from experimentation to production.

End-to-End Project Ownership & Impact Measurement

Own ML projects from problem definition through deployment and monitoring. Define success metrics and measure actual impact. Iterate based on results and feedback. Collaborate with engineers, product managers, and other stakeholders. Document decisions and lessons learned. Drive projects to completion despite ambiguity and obstacles.

Advanced Regularization & Hyperparameter Tuning

Understand L1/L2 regularization, dropout, early stopping, and other regularization techniques. Know the difference between regularization methods and when to apply each. Practice hyperparameter tuning using grid search, random search, or Bayesian optimization. Understand the bias-variance trade-off deeply. Discuss cross-validation strategies for hyperparameter selection.
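
One way to build intuition for the difference between L1 and L2 is the single-coefficient case. The sketch below assumes a simplified objective, 0.5 * (w - w_ols)^2 plus a penalty, where w_ols is the unregularized estimate. Under that assumption the L2 penalty 0.5 * lam * w^2 shrinks the coefficient proportionally, while the L1 penalty lam * |w| applies soft-thresholding and can set it exactly to zero. The function names are hypothetical.

```python
def ridge_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(w_ols) - lam, 0.0)
    return magnitude if w_ols >= 0 else -magnitude

for w in (2.0, 0.3, -0.3):
    print(w, ridge_shrink(w, 0.5), lasso_shrink(w, 0.5))
```

Small coefficients survive L2 in reduced form but are zeroed out by L1, which is why L1 is often described as performing feature selection.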

Translating Business Problems to ML Solutions

Take vague business problems and define them as ML problems. Identify whether a problem requires classification, regression, clustering, or other approaches. Define success metrics aligned with business goals. Discuss data requirements, feasibility, and timeline. Handle ambiguity by asking clarifying questions and making reasonable assumptions.

Onsite Round 2: Data Analysis & A/B Testing

60 min · 4 focus topics · technical

What to Expect

A 60-minute onsite interview assessing your ability to design and analyze experiments, understand statistical testing, and drive business decisions with data. You'll work through A/B testing scenarios, design experiments for product changes, calculate statistical significance, and translate analysis into actionable recommendations. Expect discussion of metrics, sample size calculation, and common pitfalls in experimental design.

Tips & Advice

Study experimental design and A/B testing thoroughly. Practice designing experiments for real business problems. Understand statistical concepts including p-values, confidence intervals, and power analysis. Know how to calculate sample sizes. Discuss common A/B testing mistakes like peeking, multiple comparisons problem, and confounding variables. Be able to interpret results and make recommendations. Practice explaining statistical concepts to non-technical audiences. Discuss trade-offs between statistical significance and practical significance. Have opinions on experiment design choices backed by reasoning.

Focus Topics

Metrics Definition & Selection

Define appropriate metrics for different business questions. Understand leading vs lagging indicators. Design metrics that align with business goals. Discuss metric trade-offs and how metrics can be gamed. Handle metrics with long feedback loops. Understand how metrics interact and affect each other. Practice explaining metrics to business stakeholders.

Business Impact Analysis & Recommendations

Analyze experimental results in business context. Calculate return on investment or other business impact measures. Make clear recommendations based on data. Discuss confidence in conclusions. Highlight key learnings and uncertainties. Present findings to decision-makers effectively. Connect statistical results to business implications.

A/B Testing Design & Implementation

Design comprehensive A/B tests for product decisions. Define control and treatment groups clearly. Calculate required sample sizes based on baseline metrics and desired sensitivity. Discuss randomization strategies and avoiding bias. Plan analysis approach before running the experiment. Handle multiple testing corrections. Discuss trade-offs in test design.
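
The sample-size step can be sketched with the standard normal-approximation formula for comparing two proportions: n per group = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2. The function name and the example inputs are illustrative assumptions. Other formulas (pooled variance, continuity correction) give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided test of two proportions.

    baseline: control conversion rate, e.g. 0.10
    mde_abs:  minimum detectable effect as an absolute lift, e.g. 0.02
    """
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_group(0.10, 0.02))   # 10% baseline, +2 points absolute
print(sample_size_per_group(0.10, 0.01))   # halving the MDE roughly quadruples n
```

Because the effect size is squared in the denominator, sensitivity is the most expensive knob: detecting half the lift needs roughly four times the traffic.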

Statistical Hypothesis Testing & Significance

Understand the fundamentals of hypothesis testing including null/alternative hypotheses, p-values, confidence intervals, and Type I/II errors. Know when to use parametric vs non-parametric tests. Understand statistical power and its importance. Practice calculating statistical significance. Discuss the difference between statistical and practical significance.
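
For practice calculating significance by hand, here is a minimal sketch of a two-sided, pooled two-proportion z-test, a common choice for conversion-rate comparisons. The function name and the example counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value only says the difference is unlikely under the null hypothesis. Whether a lift of that size justifies shipping is the practical-significance question.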

Onsite Round 3: SQL & Database Optimization

60 min · 4 focus topics · technical

What to Expect

A 60-minute onsite technical interview focused on advanced SQL skills, query optimization, and working with large-scale datasets. You'll solve complex SQL problems, optimize existing queries, design efficient database solutions, and demonstrate understanding of database architecture. Expect discussion of indexing strategies, query execution plans, and handling billion-row databases.

Tips & Advice

Master advanced SQL techniques before this round. Practice with complex multi-table queries, window functions, and CTEs extensively. Study query optimization and use EXPLAIN plans to understand query execution. Understand indexing strategies and how they impact performance. Be ready to optimize slow queries systematically. Discuss trade-offs between different SQL approaches. Explain your reasoning for query structure choices. Practice working with large datasets conceptually. Know about database partitioning and sharding. Have opinions on when to denormalize or normalize data.

Focus Topics

Data Quality & Aggregation in SQL

Handle data quality issues directly in SQL. Deduplicate data, handle nulls appropriately, and validate data integrity. Create reliable aggregations with proper grouping and filtering. Discuss data freshness and consistency. Calculate metrics that account for data quality issues.

Large-Scale Data Handling & Architecture

Understand database architecture for handling billion+ row tables. Discuss partitioning strategies and their benefits. Know about indexes (B-tree, hash, covering indexes) and when to use each. Understand data warehouse concepts. Discuss trade-offs between query speed and storage costs. Handle scenarios where queries might be slow due to data scale.

Query Optimization & Performance Tuning

Read and interpret query execution plans. Identify performance bottlenecks using EXPLAIN ANALYZE. Optimize queries through rewriting, indexing, and structural changes. Understand join strategies and their costs. Discuss query hints and optimizer behavior. Benchmark query performance improvements. Know when to denormalize for performance.

Complex SQL Queries & Advanced Techniques

Master window functions (RANK, DENSE_RANK, ROW_NUMBER, LAG, LEAD, running aggregates), Common Table Expressions (CTEs) with multiple levels, self-joins, complex aggregations, and recursive queries. Solve business problems requiring multi-step logic. Handle data quality issues in SQL like nulls and duplicates. Optimize for both correctness and clarity.

Onsite Round 4: Algorithms & Problem Solving

60 min · 3 focus topics · technical

What to Expect

A 60-minute onsite technical interview assessing your problem-solving skills, algorithm knowledge, and coding ability under pressure. You'll solve coding problems involving data structures and algorithms, implement efficient solutions, and optimize for time and space complexity. These problems may or may not be directly ML-related but assess computational thinking and code quality.

Tips & Advice

Practice LeetCode medium to hard problems, especially those related to data manipulation, arrays, strings, and graphs. Focus on understanding problem requirements before coding. Use clear variable names and structure code logically. Test edge cases. Discuss time and space complexity of your solutions. Optimize brute force solutions. Practice coding in Python under time pressure. Explain your approach before coding. Walk through your logic as you code. Be comfortable with common data structures and algorithms. Show clean coding practices.

Focus Topics

Time & Space Complexity Analysis

Analyze algorithm complexity accurately. Understand different Big O complexities and their practical implications. Make trade-offs between time and space. Identify bottlenecks and optimize them. Discuss how complexity scales with dataset size. Know when optimization matters vs premature optimization.
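
A small example of a time-for-space trade-off: checking whether any two numbers in a list sum to a target. The problem and function names are chosen here for illustration. The first version uses O(n^2) time and O(1) extra space, and the second uses O(n) time on average at the cost of O(n) extra space for a set.

```python
def has_pair_quadratic(values, target):
    """Brute force: compare every pair. O(n^2) time, O(1) extra space."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] + values[j] == target:
                return True
    return False

def has_pair_linear(values, target):
    """Single pass with a set of seen values. O(n) average time, O(n) space."""
    seen = set()
    for value in values:
        if target - value in seen:
            return True
        seen.add(value)
    return False

print(has_pair_quadratic([3, 8, 11, 4], 15), has_pair_linear([3, 8, 11, 4], 15))
```

For a few hundred elements the brute-force version is fine, but at a million elements it performs on the order of 5 * 10^11 comparisons, so the extra memory is worth it.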

Coding Problem Solving & Implementation

Solve coding problems systematically. Understand the problem fully before coding. Design solutions considering edge cases. Implement clean, bug-free code. Test your solution thoroughly. Optimize from brute force to efficient solutions. Write readable code with meaningful variable names. Practice writing code quickly and accurately.

Data Structures & Algorithms Fundamentals

Master common data structures (arrays, linked lists, stacks, queues, heaps, trees, graphs, hash tables) and their operations. Understand algorithm paradigms like sorting, searching, dynamic programming, greedy algorithms, and graph algorithms. Know Big O notation and analyze complexity accurately. Choose appropriate data structures for problems.

Onsite Round 5: Amazon Leadership Principles & Behavioral

60 min · 5 focus topics · behavioral

What to Expect

A 60-minute onsite interview with an Amazon HR manager or senior team member assessing cultural fit, leadership principles alignment, and your soft skills. You'll discuss past experiences using the STAR method, demonstrating how you embody Amazon's leadership principles. Expect questions about handling conflict, collaborating with others, dealing with ambiguity, and driving results despite obstacles.

Tips & Advice

Research Amazon's 16 Leadership Principles thoroughly. Prepare 5-7 specific project stories using the STAR framework (Situation, Task, Action, Result). Ensure each story demonstrates different leadership principles clearly. Practice telling these stories concisely (2-3 minutes). Focus on your personal actions and impact, not just team achievements. Prepare stories showing: delivering results under pressure, making something simpler for customers, admitting mistakes, disagreeing respectfully, and innovating. Show genuine enthusiasm for Amazon's mission. Ask thoughtful questions about the team. Connect past experiences to how you'll contribute at Amazon.

Focus Topics

Teamwork, Collaboration & Cross-Functional Influence

Demonstrate ability to collaborate effectively with diverse teams including engineers, product managers, and business stakeholders. Share examples of influencing others without authority. Discuss handling disagreements respectfully and finding common ground. Show genuine interest in others' perspectives.

Amazon Leadership Principle: Invent and Simplify

Share examples of approaching problems creatively, challenging the status quo, and finding simpler solutions. Discuss balancing innovation with pragmatism. Show willingness to experiment and learn from failures. Demonstrate that you simplify for customers and teams rather than just accepting complexity.

Handling Ambiguity & Complex Situations

Discuss approaching undefined problems methodically. Show comfort making decisions with incomplete information. Share examples of clarifying unclear requirements, making reasonable assumptions, and moving forward decisively. Demonstrate ability to work effectively despite uncertainty.

Amazon Leadership Principle: Ownership

Show accountability for outcomes beyond your direct responsibilities. Discuss taking initiative on problems, following through on commitments, and not blaming external factors. Share examples of persisting despite obstacles. Demonstrate long-term thinking and wanting the best outcome even when inconvenient.

Amazon Leadership Principle: Customer Obsession

Demonstrate focus on customer needs and willingness to think long-term for customer benefit. Share examples of going beyond requirements to serve customers better. Show understanding that customer obsession drives product and technical decisions. Discuss how data science should ultimately serve customers.

Frequently Asked Data Scientist Interview Questions

Applying Data Science Techniques to Business Problems · Medium · Technical

73 practiced

Given these tables:

orders(order_id bigint, user_id bigint, order_date date, revenue numeric)
users(user_id bigint, signup_date date)

Write a PostgreSQL query that produces cohort_monthly_ltv with columns: cohort_month (date), month_number (int; 0 = signup month), users_in_cohort, month_revenue, cumulative_revenue, avg_ltv_per_user (cumulative) for the first 12 months after signup. Explain assumptions and performance tuning tips for large datasets.

Sample Answer

Approach: assign each user to a cohort defined by signup month (first day of that month), compute month_number as months elapsed between signup and order, aggregate revenue per cohort × month_number for months 0..11, compute cumulative revenue with a window, and compute avg LTV as cumulative_revenue / users_in_cohort.

SQL (Postgres):

sql

WITH cohorts AS (
  -- one row per user with their cohort month (first day of signup month)
  SELECT
    user_id,
    date_trunc('month', signup_date)::date AS cohort_month
  FROM users
),
orders_with_month_offset AS (
  -- join orders to cohort and compute month offset (0 = signup month)
  SELECT
    c.cohort_month,
    o.user_id,
    /* integer month difference between signup month and order month */
    ((date_part('year', o.order_date) - date_part('year', c.cohort_month)) * 12
     + (date_part('month', o.order_date) - date_part('month', c.cohort_month)))::int AS month_number,
    o.revenue
  FROM orders o
  JOIN cohorts c USING (user_id)
  WHERE o.order_date >= c.cohort_month  -- drop orders dated before the cohort month; same-month orders count as month 0
),
monthly AS (
  -- sum revenue per cohort_month x month_number
  SELECT
    cohort_month,
    month_number,
    SUM(revenue)::numeric(18,2) AS month_revenue
  FROM orders_with_month_offset
  WHERE month_number BETWEEN 0 AND 11
  GROUP BY 1,2
),
cohort_users AS (
  -- number of users in each cohort
  SELECT
    cohort_month,
    COUNT(*) AS users_in_cohort
  FROM cohorts
  GROUP BY 1
),
all_months AS (
  -- ensure all month_number 0..11 appear for every cohort (fill zeros)
  SELECT
    c.cohort_month,
    m.month_number
  FROM cohort_users c
  CROSS JOIN generate_series(0,11) AS m(month_number)
)
SELECT
  a.cohort_month,
  a.month_number,
  cu.users_in_cohort,
  COALESCE(m.month_revenue, 0)::numeric(18,2)      AS month_revenue,
  SUM(COALESCE(m.month_revenue,0)) OVER (PARTITION BY a.cohort_month ORDER BY a.month_number
                                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)::numeric(18,2)
    AS cumulative_revenue,
  (SUM(COALESCE(m.month_revenue,0)) OVER (PARTITION BY a.cohort_month ORDER BY a.month_number
                                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
   / NULLIF(cu.users_in_cohort,0))::numeric(18,4)  AS avg_ltv_per_user
FROM all_months a
LEFT JOIN monthly m
  ON a.cohort_month = m.cohort_month AND a.month_number = m.month_number
JOIN cohort_users cu
  ON a.cohort_month = cu.cohort_month
ORDER BY cohort_month, month_number;

Assumptions:
- signup_date and order_date are in the same timezone and valid.
- revenue is positive and stored as numeric; orders dated before the cohort month are ignored.
- cohort_month uses the first day of the signup month as its representative date.

Performance tuning for large datasets:
- Indexes: orders(user_id, order_date) and users(user_id, signup_date) help the join and the date filter.
- Pre-aggregate: build a daily/weekly orders summary table (ETL) to reduce the raw orders scanned.
- Partition orders by order_date (range partitions by year/month) to limit scanned partitions.
- Materialized view: refresh daily for fast dashboard queries.
- Avoid COUNT(DISTINCT) in hot paths; compute users_in_cohort from the users table (cheap), not from orders.
- Use appropriate numeric precision or bigint cents to avoid expensive numeric math.
- Limit scope: filter to the cohorts/months you need (e.g., last 24 months) to reduce work.

Advanced Querying with Structured Query Language · Easy · Technical

18 practiced

Given a table events(user_id, event_time, event_type), write a SQL query (Postgres/ANSI) that returns the latest event per user (user_id, event_time, event_type). Use window functions (row_number) and briefly explain why window functions may be preferred over a correlated subquery here.

Sample Answer

Approach: use row_number() partitioned by user_id ordered by event_time desc to rank events per user and select the top-ranked row (latest) for each user.

sql

SELECT user_id, event_time, event_type
FROM (
  SELECT
    user_id,
    event_time,
    event_type,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) AS rn
  FROM events
) t
WHERE rn = 1;

Why window functions are preferred over a correlated subquery:
- Performance: the window version computes ranks in one pass and can be optimized with indexes (e.g., (user_id, event_time DESC)), whereas a correlated subquery may re-scan or re-evaluate per row.
- Readability & maintainability: the intent (top-N per group) is explicit and easier to extend (top 3, ties, additional columns).
- Flexibility: you can easily change to RANK() or DENSE_RANK() to handle ties, or return top-N per user without rewriting queries.

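Edge cases:
- Ties on event_time: use RANK()/DENSE_RANK() if you want multiple rows for tied latest times, or add a tie-breaker column to the ORDER BY if you want exactly one deterministic row.
- NULL event_time: in Postgres, DESC ordering puts NULLs first by default, so use ORDER BY event_time DESC NULLS LAST if the column can be NULL.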

Model Evaluation and Validation · Easy · Technical

69 practiced

Explain what stratified sampling achieves in cross-validation. Give an example using a 10-fold stratified CV for a binary classification task with 1% positives. Why is stratification important for rare classes?
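
To see what stratification does in the 1% scenario, consider 1,000 samples with 10 positives split into 10 folds. A plain random split can leave some folds with zero positives, which makes recall undefined on those folds. A stratified split assigns each class to folds separately so every fold keeps the 1% rate. The sketch below is a deterministic, unshuffled illustration with a hypothetical function name. In practice you would shuffle within each class first, or use scikit-learn's StratifiedKFold.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, n_folds):
    """Assign each sample a fold id so that every class is spread evenly."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        # Deal the members of one class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of

labels = [1] * 10 + [0] * 990                 # 1% positives
folds = stratified_fold_ids(labels, n_folds=10)
positives_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
print(sorted(positives_per_fold.items()))     # each fold holds exactly one positive
```

With only one positive per validation fold, per-fold metrics are still noisy, so it is common to pool predictions across folds or repeat the stratified split before reporting recall or precision.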

Hypothesis Testing and Inference · Hard · Technical

29 practiced

Write Python code that implements the Benjamini-Hochberg procedure to control the false discovery rate at level q given an array of p-values. Your implementation should return the indices of hypotheses declared significant and adjusted p-values. Discuss time complexity and how to handle tied p-values or grouped hypotheses.

Sample Answer

Approach: sort p-values, compute BH thresholds k/n*q, find largest k with p_(k) <= (k/n) q, declare p_(1..k) significant. For adjusted p-values (a.k.a. q-values under BH), compute p_adj[i] = min_{j>=i} (n/j * p_(j)) and clip to 1, then reorder to original indices. Ties and grouped hypotheses: use stable sorting and average ranks for tie-aware thresholding; for grouped hypotheses, apply BH within groups and/or use hierarchical FDR (Benjamini–Bogomolov) or weighted BH with group-level weights.

python

import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """
    Benjamini-Hochberg FDR control.
    Inputs:
      pvals: 1-D array-like of p-values (floats). NaNs are excluded from the
             hypothesis count and are never declared significant.
      q: target FDR level (0 < q < 1)
    Returns:
      significant_idx: sorted np.array of original indices declared significant
      p_adj: np.array of BH-adjusted p-values in the original order (NaN preserved)
    """
    p = np.asarray(pvals, dtype=float)
    # Validate inputs
    if not (0 < q < 1):
        raise ValueError("q must be in (0,1)")
    valid = ~np.isnan(p)
    if np.any((p[valid] < 0) | (p[valid] > 1)):
        raise ValueError("p-values must lie in [0,1]")
    p_adj = np.full(p.shape, np.nan)
    valid_idx = np.flatnonzero(valid)
    m = valid_idx.size  # number of hypotheses actually tested (NaNs excluded)
    if m == 0:
        return np.array([], dtype=int), p_adj
    # Stable sort so tied p-values keep their input order
    order = valid_idx[np.argsort(p[valid_idx], kind='mergesort')]
    p_sorted = p[order]
    ranks = np.arange(1, m + 1)  # 1-based ranks
    # Adjusted p-values: p_adj_(j) = min_{t>=j} (m / t) * p_(t), clipped to 1.
    # These are computed even when nothing is rejected.
    raw_adj = (m / ranks) * p_sorted
    monotone_adj = np.minimum.accumulate(raw_adj[::-1])[::-1]
    p_adj[order] = np.clip(monotone_adj, 0, 1)
    # Step-up rule: find the largest k with p_(k) <= (k / m) * q, reject ranks 1..k
    below = p_sorted <= ranks / m * q
    if not np.any(below):
        return np.array([], dtype=int), p_adj  # no discoveries
    k = np.flatnonzero(below).max() + 1  # convert 0-based index to rank
    return np.sort(order[:k]), p_adj

Key points:
- Time complexity: dominated by sorting, O(n log n); subsequent passes are O(n). Space is O(n) for index arrays.
- Ties: stable sorting plus rank-based thresholds preserves tie-consistent behavior. If you prefer average ranks for thresholding, compute ranks with scipy.stats.rankdata(method='average') and use those ranks in the thresholds, but standard BH uses ordered p-values.
- Grouped hypotheses, options:
  - Apply BH within each group (controls FDR per group, not overall).
  - Use weighted BH (assign group weights proportional to prior info) or hierarchical procedures (e.g., Benjamini–Bogomolov) to control FDR across groups.
- Edge cases: handle NaNs, validate p-values outside [0,1], and note that all-equal p-values (ties) are handled by the monotone adjustment.

Cross Functional Collaboration and Coordination | Medium | Technical

44 practiced

You notice repeated misunderstandings about data lineage are causing duplicated work across teams. How would you create sustainable documentation and processes to reduce handoffs and ensure a single source of truth? Include tooling and governance ideas.

A and B Test Design | Easy | Technical

63 practiced

Define type I error (false positive), type II error (false negative), statistical power, significance level (alpha), and Minimum Detectable Effect (MDE). For each concept provide a practical interpretation in the context of a conversion-rate A/B test and a short note on how product trade-offs influence acceptable values.
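One way to see how these quantities trade off is the standard normal-approximation sample-size formula for comparing two proportions. The sketch below is illustrative only: the function name `sample_size_per_arm`, the 10% baseline, and the chosen MDE values are assumptions for the example, and a real test plan may use a pooled-variance or exact variant.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided two-proportion z-test.

    alpha   : significance level (Type I error rate)
    power   : 1 - beta, where beta is the Type II error rate
    mde_abs : minimum detectable effect as an absolute lift in conversion rate
    """
    p_treat = p_baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical scenario: 10% baseline conversion rate.
n_1pp = sample_size_per_arm(0.10, 0.01)      # detect a 1 percentage-point lift
n_half_pp = sample_size_per_arm(0.10, 0.005)  # detect a 0.5 percentage-point lift
n_strict = sample_size_per_arm(0.10, 0.01, alpha=0.01, power=0.90)
print(n_1pp, n_half_pp, n_strict)
```

Halving the MDE roughly quadruples the required sample, and tightening alpha or raising power also increases it, which is why the acceptable values are a product decision as much as a statistical one.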

Applying Data Science Techniques to Business Problems | Hard | System Design

68 practiced

Design an analytics pipeline that computes near real-time experiment metrics (e.g., conversion rate) with 1M events/sec ingestion and target dashboard latency < 30 seconds. Discuss streaming ingestion, stateful windowed aggregation, exactly-once processing semantics, storage choices for materialized views, consistency trade-offs, backfills, and cost optimizations. Name concrete technologies you would consider.

Sample Answer

Requirements & constraints:
- 1M events/sec ingestion; dashboard latency < 30s for experiment metrics (conversion rate).
- Support partitioned experiments, low-latency windowed aggregations, exactly-once correctness, backfills, and cost control.

High-level architecture:
- Ingest events into Kafka (self-managed or Confluent Cloud), partitioned by experiment or user ID.
- Run stream processing in Apache Flink (or Apache Beam on Dataflow).
- Serve dashboards from a materialized view store (real-time OLAP such as Apache Pinot, Druid, or ClickHouse) with a fast cache (Redis) for sub-second reads.
- Keep long-term raw events in S3 for backfills.

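Streaming ingestion & partitioning:
- Use Kafka with enough partitions (e.g., 1000+) to handle 1M events/sec; choose a partition key (experiment_id or user_id) that co-locates related events and lets consumers scale. Keying on experiment_id alone can create hot partitions when a few experiments dominate traffic, so user_id or a composite key usually spreads load better.
- Configure Kafka producers with batching, acks=all, and compression (lz4/snappy).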

Stateful windowed aggregation:
- Use Flink keyed streams with event-time processing and tumbling/sliding windows (e.g., 1s micro-windows aggregated into 30s materializations).
- Use the RocksDB state backend with incremental checkpoints to a durable store; set a low checkpoint interval (5–10s) and asynchronous snapshots.
- Use watermarks and allowed lateness to handle out-of-order events; maintain correction logic for events that arrive after their window has closed.
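The interaction between event-time windows, watermarks, and allowed lateness can be sketched in a few lines of plain Python. This is a toy model, not Flink's API: the class `TumblingWindowCounter`, the key names, and the integer-second timestamps are all assumptions for illustration, and it emits each window once after the lateness period rather than firing early and re-firing as a real engine can.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Toy event-time tumbling-window counter (illustration only, not Flink)."""

    def __init__(self, window_s=30, allowed_lateness_s=10):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.watermark = float("-inf")
        # window_start -> key -> count
        self.open_windows = defaultdict(lambda: defaultdict(int))
        self.emitted = []      # (window_start, key, count) tuples
        self.dropped_late = 0  # events that arrived after their window was finalized

    def on_event(self, event_time, key):
        window_start = (event_time // self.window_s) * self.window_s
        window_end = window_start + self.window_s
        if window_end + self.allowed_lateness_s <= self.watermark:
            # Too late: the window is already finalized, so a correction path is needed.
            self.dropped_late += 1
            return
        self.open_windows[window_start][key] += 1

    def on_watermark(self, watermark):
        self.watermark = max(self.watermark, watermark)
        for start in sorted(self.open_windows):
            if start + self.window_s + self.allowed_lateness_s <= self.watermark:
                for key, count in sorted(self.open_windows.pop(start).items()):
                    self.emitted.append((start, key, count))


counter = TumblingWindowCounter(window_s=30, allowed_lateness_s=10)
counter.on_event(5, "exp1:treatment")
counter.on_event(31, "exp1:treatment")
counter.on_event(12, "exp1:treatment")  # out of order, but its window is still open
counter.on_watermark(45)                # finalizes window [0, 30)
counter.on_event(20, "exp1:treatment")  # arrives after finalization, counted as dropped
counter.on_watermark(100)               # finalizes window [30, 60)
print(counter.emitted, counter.dropped_late)
```

The dropped event is the case the out-of-window correction logic has to cover, for example by routing it to a side output and patching the aggregate later.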

Exactly-once semantics:
- Enable Flink's exactly-once mode: checkpointed Kafka source offsets plus two-phase-commit sinks (Kafka transactions on the sink side). Use idempotent or transactional writes to the OLAP store (Pinot ingestion via Kafka, or insert APIs that support idempotency).
- Deduplicate on unique event_id keys held in state with a TTL, which gives dedupe scoped to a small window.
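The TTL-scoped deduplication idea can be illustrated with a small in-memory sketch. The class name `TtlDeduplicator` and the 600-second TTL are hypothetical; in a real job this state would live in the stream processor's keyed state with its own TTL configuration.

```python
class TtlDeduplicator:
    """Toy dedupe-by-event_id with a time-to-live (illustration only)."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.first_seen = {}  # event_id -> event time when first accepted

    def is_new(self, event_id, event_time):
        seen_at = self.first_seen.get(event_id)
        if seen_at is not None and event_time - seen_at < self.ttl_s:
            return False  # duplicate within the TTL window
        self.first_seen[event_id] = event_time
        return True

    def expire(self, now):
        # Drop entries older than the TTL so state size stays bounded.
        self.first_seen = {
            eid: t for eid, t in self.first_seen.items() if now - t < self.ttl_s
        }


dedupe = TtlDeduplicator(ttl_s=600)
results = [
    dedupe.is_new("evt-1", 0),    # first delivery
    dedupe.is_new("evt-1", 5),    # retry shortly after: rejected
    dedupe.is_new("evt-2", 6),    # different event
    dedupe.is_new("evt-1", 700),  # retry after the TTL: accepted again
]
print(results)
```

The last call shows the trade-off: a bounded TTL keeps state small, but a duplicate that arrives after the TTL is counted again, so the TTL should exceed the longest realistic retry delay.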

Materialized views & storage choices:
- For sub-30s dashboards: Apache Pinot or Druid ingesting Kafka topics into near-real-time segments; both support fast aggregations and low query latency.
- Alternative: ClickHouse with a buffer layer, or pre-aggregate into Redis for the hottest metrics.
- Store long-term raw events in S3 (Parquet) and optionally in partitioned Hive/BigQuery tables for analytics and backfills.

Consistency trade-offs:
- Strong consistency (exactly-once) increases complexity and cost (frequent checkpoints, transactional sinks). For many experiments, "at-least-once with idempotent/compensating updates" plus visible correction windows may be acceptable.
- Choose consistency per metric: conversion counts need high accuracy (exactly-once); ancillary metrics can be eventually consistent.

Backfills and reprocessing:
- Keep raw events in immutable storage (S3) and re-run batch jobs (Spark, Beam, or Flink in batch mode) to regenerate aggregates or rebuild materialized views; write results to the OLAP store via bulk-load APIs.
- Use a changelog or export of current state to resume incremental rebuilds and avoid a full recompute.

Cost optimizations:
- Pre-aggregate at the edge (client/ingest) to reduce event volume.
- Use micro-batching and compression in Kafka to reduce throughput cost.
- Tier storage: hot in Pinot/ClickHouse for the most recent 24–72h; cold in Parquet on S3.
- Autoscale stream processors; use spot instances where acceptable.
- Sample low-value events and compute incrementally for less critical metrics.

Operational concerns:
- Monitor ingestion lag, checkpoint durations, state size, and query latencies; alert on backpressure.
- Test failure modes: broker loss, task-manager restart, schema evolution.
- CI for data correctness: golden datasets, drift detection, SLA dashboards.

Concrete tech stack:
- Kafka (self-managed or Confluent Cloud), Apache Flink (RocksDB, checkpointing, exactly-once), Apache Pinot or Druid (real-time OLAP), Redis for hot caches, S3/Parquet for raw storage, Spark/Beam for backfills and batch, Grafana for observability.

Advanced Querying with Structured Query Language | Medium | Technical

21 practiced

Explain partitioning strategies for a large table events(event_date DATE, user_id, event_type, payload). Which partition key and method (range, list, hash) would you choose? Show a sample query that benefits from partition pruning and explain how pruning reduces scanned data.

Sample Answer

Best choice: range partitioning on event_date (e.g., daily or monthly) because most queries filter by time. For hot/warm data you can combine with subpartitioning (hash on user_id for write/read distribution) or list on event_type if queries target specific event types.

Example: monthly range partitions with hash subpartitions on user_id (PostgreSQL syntax; other databases differ):

sql

CREATE TABLE events (
  event_date DATE NOT NULL,
  user_id   BIGINT NOT NULL,
  event_type TEXT,
  payload   JSONB
) PARTITION BY RANGE (event_date);

-- monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01')
  PARTITION BY HASH (user_id);

CREATE TABLE events_2025_01_p0 PARTITION OF events_2025_01 FOR VALUES WITH (modulus 4, remainder 0);
-- create p1..p3 similarly

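Simple alternative: a single-level partition on event_date is sufficient in many warehouses. BigQuery supports date-partitioned tables directly; Snowflake (clustering keys over its automatic micro-partitions) and Redshift (sort keys) achieve similar pruning on event_date without explicit partition DDL.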

Query that benefits from pruning:

sql

SELECT user_id, event_type, COUNT(*)
FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31'
  AND event_type = 'purchase'
GROUP BY user_id, event_type;

Why pruning helps:
- Partition pruning uses the WHERE event_date range to skip every partition outside January 2025. Instead of scanning years of data, the engine reads only the matching monthly (or daily) partitions.
- Subpartitioning or clustering reduces IO and CPU further only when the query filters on that key. Here, clustering on event_type would help the event_type = 'purchase' predicate; the hash subpartitions on user_id are not pruned because the query has no user_id filter.
- Practical impact: fewer bytes read, lower query latency and cost (in cloud warehouses billed by scanned data), and less memory and CPU during aggregation.

Edge considerations:
- Choose partition granularity based on query patterns and retention (daily for high volume, monthly for moderate).
- Manage the partition lifecycle: drop or archive old partitions to reclaim space.
- Avoid too many tiny partitions (metadata overhead).
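To make the pruning arithmetic concrete, here is a minimal Python sketch of the overlap check a planner performs against partition bounds. The partition list and row counts are invented for illustration, and real engines apply this logic to catalog metadata rather than to a Python list.

```python
from datetime import date

# Hypothetical monthly partitions: (inclusive start, exclusive end, row count).
partitions = [
    (date(2024, 11, 1), date(2024, 12, 1), 900_000_000),
    (date(2024, 12, 1), date(2025, 1, 1), 1_100_000_000),
    (date(2025, 1, 1), date(2025, 2, 1), 1_000_000_000),
    (date(2025, 2, 1), date(2025, 3, 1), 950_000_000),
]


def prune(parts, lo, hi):
    """Keep partitions whose [start, end) range overlaps the inclusive filter [lo, hi]."""
    return [p for p in parts if p[0] <= hi and p[1] > lo]


# Same filter as the query above: event_date BETWEEN '2025-01-01' AND '2025-01-31'.
kept = prune(partitions, date(2025, 1, 1), date(2025, 1, 31))
scanned_rows = sum(rows for _, _, rows in kept)
total_rows = sum(rows for _, _, rows in partitions)
print(len(kept), round(scanned_rows / total_rows, 3))
```

Only the January partition survives the check, so with these made-up sizes the query touches about a quarter of the rows before the event_type filter is even applied.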

Model Evaluation and Validation | Easy | Technical

93 practiced

You built a multiclass classifier (5 classes). Explain the difference between macro, micro, and weighted averaging when computing F1 scores. Provide an example scenario where macro F1 is preferable to weighted F1.
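A small from-scratch computation shows how the three averages diverge on imbalanced data. The sketch below uses made-up labels for a 5-class problem in which one class dominates; the function name `f1_averages` is hypothetical, and the definitions follow the usual conventions (macro: unweighted mean of per-class F1; weighted: mean weighted by class support; micro: F1 over pooled counts).

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return (per-class F1, macro F1, micro F1, weighted F1) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_class, macro, micro, weighted


# Illustrative data: class A dominates; the model ignores C, D, and E entirely.
y_true = ["A"] * 90 + ["B"] * 4 + ["C"] * 2 + ["D"] * 2 + ["E"] * 2
y_pred = ["A"] * 90 + ["B", "B", "A", "A"] + ["A"] * 6
per_class, macro, micro, weighted = f1_averages(y_true, y_pred)
print({c: round(v, 3) for c, v in per_class.items()})
print(round(macro, 3), round(micro, 3), round(weighted, 3))
```

Micro and weighted F1 stay high because class A dominates, while macro F1 drops sharply because three classes score zero. That gap is why macro F1 is preferable when minority classes matter as much as the majority class, for example rare defect or fraud categories.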

Hypothesis Testing and Inference | Medium | Technical

35 practiced

You're running an A/B/n test with three variants and plan to look at interim results daily. Explain the statistical risks of sequential peeking, how repeated looks inflate Type I error, and describe practical approaches to allow interim monitoring while controlling error rate (alpha-spending functions, group sequential designs, and sequential probability ratio tests).

Sample Answer

Sequential peeking (checking results daily) without adjustment inflates the Type I error: each look is another chance to declare a false positive. Intuitively, if a single hypothesis test has alpha = 0.05, two independent looks give a 1 − (1 − 0.05)^2 ≈ 0.0975 chance of at least one false positive, and with many independent looks that probability approaches 1. In practice looks are correlated (they share the same accumulating data), so the inflation is slower than the independent-looks formula suggests, but the overall Type I error still rises well above the nominal alpha.
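A quick simulation makes the inflation visible. The sketch below is a simplified A/A setup under stated assumptions: a z-statistic on accumulating standard-normal data with known variance, equally spaced looks, and an unadjusted 1.96 cutoff at every look. The function name and parameters are illustrative.

```python
import math
import random


def false_positive_rate(n_looks, n_per_look=200, n_sims=4000, z_crit=1.96, seed=0):
    """Share of null experiments that cross |z| > z_crit at any of n_looks peeks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # The sum of n_per_look iid N(0, 1) observations is N(0, n_per_look),
            # so one draw per look is enough to simulate the accumulating data.
            total += rng.gauss(0.0, math.sqrt(n_per_look))
            n += n_per_look
            if abs(total / math.sqrt(n)) > z_crit:
                rejections += 1
                break  # the experimenter stops and declares a winner
    return rejections / n_sims


rate_one_look = false_positive_rate(n_looks=1)
rate_five_looks = false_positive_rate(n_looks=5)
rate_ten_looks = false_positive_rate(n_looks=10)
print(rate_one_look, rate_five_looks, rate_ten_looks)
```

A single look should land near the nominal 5%, while five and ten unadjusted looks come out markedly higher, even though the true effect is zero in every simulated experiment.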

Practical approaches to allow interim monitoring while controlling error:

- Alpha‑spending functions: pre-specify a total alpha (e.g., 0.05) and an allocation rule across looks (e.g., Lan–DeMets with O’Brien–Fleming or Pocock-like spending). Early looks get very small alpha under O’Brien–Fleming (conservative early), Pocock spreads alpha more evenly. Use when you want flexible timing of looks.

- Group sequential designs: fix the number of interim analyses in advance and apply boundary rules (critical z-values) so the overall alpha stays at its nominal level. Simple to implement when the schedule is known, for example daily looks during the first week only, or a fixed k looks.

- Sequential Probability Ratio Test (SPRT) / fully sequential methods: continuously monitor a likelihood ratio and stop when ratio crosses precomputed boundaries. SPRT is most efficient (minimizes expected sample size) but requires specifying alternative and is more complex with multiple arms.
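To illustrate how the two spending styles differ, the sketch below evaluates the commonly used Lan–DeMets spending functions of O'Brien–Fleming type, 2(1 − Φ(z_{α/2}/√t)), and Pocock type, α·ln(1 + (e − 1)t), at a few information fractions t. It reports cumulative alpha spent only; turning that into critical z-boundaries requires numerical integration over the joint distribution of the interim statistics, which dedicated group-sequential software handles. Function names are illustrative.

```python
import math
from statistics import NormalDist


def obf_spending(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (O'Brien-Fleming type)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / math.sqrt(t)))


def pocock_spending(t, alpha=0.05):
    """Cumulative alpha spent by information fraction t (Pocock type)."""
    return alpha * math.log(1 + (math.e - 1) * t)


for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"t={t:.1f}  OBF-type={obf_spending(t):.5f}  Pocock-type={pocock_spending(t):.5f}")
```

Both functions reach the full alpha at t = 1, but the O'Brien–Fleming type spends almost nothing at early looks, so early stopping needs overwhelming evidence, whereas the Pocock type spends a meaningful share from the first look.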

Practical notes for A/B/n:
- Pre-specify the monitoring plan and adjustments before starting.
- Account for multiplicity across variants (e.g., Bonferroni, or hierarchical/closed testing), or control FDR if there are many arms.
- Consider Bayesian sequential methods (posterior monitoring with decision thresholds) as an alternative: continuous monitoring is easier, but the thresholds need calibration to a frequentist alpha if stakeholders demand it.
- If you must peek daily, use an alpha‑spending approach (Lan–DeMets O’Brien–Fleming) or Bayesian credible thresholds with simulation-based calibration so Type I risk is understood and controlled.
