
Large Dataset Management and Technical Analysis Questions

Develop skills for working efficiently with large datasets: cleaning and validating data, aggregating and manipulating it efficiently, handling missing values, and identifying and managing outliers. Master advanced Excel features or learn SQL for database queries. Practice assessing data quality, learn workflows that scale with dataset size, and understand data security and privacy considerations.

Hard · Technical
In a Spark job, a hot join key causes a single task to process an outsized partition and fail with OOM. Describe how you'd debug this (which metrics to inspect) and propose concrete remedies: salting, pre-aggregation, broadcasting, custom partitioners, or sampling. For each remedy, explain the pros and cons.
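
A minimal PySpark sketch of the salting remedy, assuming a large events table skewed on user_id joined to a smaller profiles dimension table (table names, paths, and the bucket count are hypothetical):

```python
# Key salting: spread a hot join key across SALT_BUCKETS partitions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()
SALT_BUCKETS = 16  # tune to the skew observed in the Spark UI

events = spark.read.parquet("s3://bucket/events")      # hypothetical
profiles = spark.read.parquet("s3://bucket/profiles")  # hypothetical

# Add a random salt to the skewed (large) side so rows for the hot
# key land in SALT_BUCKETS different partitions instead of one.
salted_events = events.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate each profile row once per salt value so every salted
# event row still finds its match.
salts = (spark.range(SALT_BUCKETS)
         .withColumnRenamed("id", "salt")
         .withColumn("salt", F.col("salt").cast("int")))
salted_profiles = profiles.crossJoin(salts)

joined = (salted_events
          .join(salted_profiles, on=["user_id", "salt"], how="inner")
          .drop("salt"))
```

The trade-off is visible in the sketch: the small side is replicated SALT_BUCKETS times, so salting only pays off when the skewed side dominates. On Spark 3.x, adaptive query execution (spark.sql.adaptive.skewJoin.enabled) can split skewed partitions automatically, and broadcasting avoids the shuffle entirely when the small side fits in executor memory.
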
Medium · Technical
Write a SQL query or describe an approach to estimate the 95th percentile (approximate quantile) of 'response_time' in a 'requests' table with 500M rows on a data warehouse (e.g., Redshift or Postgres). Discuss trade-offs between exact and approximate methods and how you'd bound error.
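
A sketch contrasting exact and approximate formulations, with the SQL embedded as Python strings (the table and column names come from the prompt; the sampling rate is illustrative):

```python
# Exact (Postgres): percentile_cont sorts the full column, which can
# be slow and memory-hungry at 500M rows but gives the true value.
EXACT_P95 = """
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time) AS p95
FROM requests;
"""

# Approximate (Redshift): a single-pass, bounded-memory estimate.
APPROX_P95 = """
SELECT APPROXIMATE PERCENTILE_DISC(0.95)
       WITHIN GROUP (ORDER BY response_time) AS p95
FROM requests;
"""

# One empirical way to bound error: run the exact query on a random
# sample and compare it with the approximate full-table answer.
SAMPLED_P95 = """
SELECT percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time) AS p95
FROM requests TABLESAMPLE SYSTEM (1);  -- Postgres: roughly 1% of pages
"""
```
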
Hard · Technical
You must compute per-user rolling medians over the last 30 days for millions of users in a data warehouse that supports UDFs but where exact window medians are too slow. Describe a solution that uses mergeable sketches such as t-digest as an aggregate UDF to approximate rolling quantiles, including how to maintain and merge sketches incrementally and how to characterize the error.
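
A minimal in-memory sketch of the idea using the open-source tdigest package; a warehouse implementation would register the same update and merge logic as an aggregate UDF and persist one serialized digest per (user, day):

```python
from tdigest import TDigest

# Hypothetical store: one digest per (user_id, day). Digests are
# small and serializable, so in a warehouse they would live in a
# table rather than a dict.
daily_digests = {}

def record(user_id, day, value):
    """Update the current day's digest incrementally as events arrive."""
    daily_digests.setdefault((user_id, day), TDigest()).update(value)

def rolling_median(user_id, window_days):
    """Merge the trailing window's digests. t-digests merge
    associatively, so a day ages out simply by not being merged."""
    merged = TDigest()
    for day in window_days:
        digest = daily_digests.get((user_id, day))
        if digest is not None:
            merged = merged + digest  # '+' merges two digests
    return merged.percentile(50)  # approximate median
```

Error is characterized in rank terms: t-digest concentrates centroids near the tails, so extreme quantiles are tighter than the median, and the compression parameter trades sketch size against accuracy. A practical check is to compare the sketch's median against an exact median computed for a sampled subset of users.
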
Easy · Technical
Explain what "data cleaning" means when preparing very large datasets for machine learning (e.g., a 100M-row events table). Describe the common pipeline steps you would include (schema validation, type conversions, null handling, deduplication, outlier treatment, normalization), how you would instrument each step for correctness, and summarize the difference between ETL and ELT in practice for ML pipelines.
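
One way to make the per-step instrumentation concrete, sketched in pandas (column names are hypothetical; at 100M rows the same steps would typically run in Spark or the warehouse, but the logging pattern is identical):

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cleaning")

def instrument(step_name):
    """Log row counts and the overall null rate around each step, so a
    step that silently drops half the data is caught immediately."""
    def wrap(fn):
        def inner(df):
            before = len(df)
            out = fn(df)
            log.info("%s: %d -> %d rows, null rate %.4f",
                     step_name, before, len(out),
                     out.isna().mean().mean())
            return out
        return inner
    return wrap

@instrument("types")
def coerce_types(df):
    return df.assign(
        event_ts=pd.to_datetime(df["event_ts"], errors="coerce"),
        value=pd.to_numeric(df["value"], errors="coerce"))

@instrument("dedupe")
def dedupe(df):
    return df.drop_duplicates(subset=["event_id"])

@instrument("nulls")
def handle_nulls(df):
    return df.dropna(subset=["event_ts", "user_id"])

@instrument("outliers")
def clip_outliers(df):
    lo, hi = df["value"].quantile([0.001, 0.999])
    return df.assign(value=df["value"].clip(lo, hi))

def clean(df):
    for step in (coerce_types, dedupe, handle_nulls, clip_outliers):
        df = step(df)
    return df
```
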
Easy · Technical
High-level: As an ML engineer, list GDPR and PII considerations when collecting and storing user data for model training. Explain practical steps for pseudonymization/tokenization, encryption at rest and in transit, consent management, data retention policies, and how these influence who can access datasets for model development.
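
A minimal sketch of one of those steps, deterministic pseudonymization with a keyed hash; the key handling and field names are hypothetical, and in production the secret would come from a KMS or secrets manager rather than code:

```python
import hashlib
import hmac
import os

# Hypothetical key source; never hard-code a real key.
SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode("utf-8")

def pseudonymize(user_id: str) -> str:
    """Deterministic token: the same user always maps to the same
    token, so joins and per-user aggregates still work, but the token
    alone does not reveal the identity. Under GDPR this is
    pseudonymization, not anonymization: whoever holds the key can
    re-identify users, so access to the key must be restricted."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()

record = {"user_id": "alice@example.com", "response_time": 120}
record["user_id"] = pseudonymize(record["user_id"])
print(record)  # token in place of the raw identifier
```
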
