InterviewStack.io

SQL for Data Analysis Questions

Using SQL as a tool for data analysis and reporting. Focuses on writing queries to extract metrics, perform aggregations, join disparate data sources, use subqueries and window functions for trends and rankings, and prepare data for dashboards and reports. Includes best practices for reproducible analytical queries, handling time series and date arithmetic, basic query optimization considerations for analytic workloads, and when to use SQL versus built-in reporting tools in analytics platforms.

Medium · Technical
You are given an EXPLAIN ANALYZE snippet that shows a large mismatch between estimated_rows and actual_rows and a nested loop join over large tables. Explain the likely causes and provide SQL-level fixes (statistics, query rewrites, join hints). Example snippet:
Nested Loop (actual time=... rows=100000 loops=1)
  -> Index Scan (actual rows=100000)
  -> Seq Scan on big_table (actual rows=10000000)
What is your diagnosis, and which fixes would you apply?
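
One possible sketch of SQL-level remedies, assuming PostgreSQL (column names such as customer_id, country, and region are hypothetical):

-- Refresh planner statistics so row estimates reflect current data
ANALYZE big_table;

-- Sample more values for a skewed join/filter column, then re-analyze
ALTER TABLE big_table ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE big_table;

-- Capture cross-column correlation the planner otherwise assumes away
CREATE STATISTICS big_table_dep (dependencies) ON country, region FROM big_table;
ANALYZE big_table;

-- Last resort for the current session: steer the planner away from nested loops
SET enable_nestloop = off;

Core PostgreSQL has no built-in join hints; an extension such as pg_hint_plan is needed to force a specific join method when statistics and rewrites are not enough.
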
Hard · Technical
Describe how a distributed SQL engine chooses between a broadcast join (replicating the small table) and a shuffle/hash join (redistributing partitions). For a join between a 200M-row fact table and a 5M-row dimension table, propose strategies (both SQL-level and data-engineering) to force or avoid broadcast and reduce shuffle overhead. Provide examples of hints or rewrites.
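
One possible illustration, assuming Spark SQL hint syntax (fact_sales and dim_product are hypothetical tables):

-- Force the 5M-row dimension to be replicated (broadcast) to every worker
SELECT /*+ BROADCAST(d) */ f.order_id, d.product_name, f.amount
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id;

-- Avoid broadcast and request a shuffle hash join instead
SELECT /*+ SHUFFLE_HASH(d) */ f.order_id, d.product_name, f.amount
FROM fact_sales f
JOIN dim_product d ON f.product_id = d.product_id;

-- Session-level switch: disable automatic broadcast joins entirely
SET spark.sql.autoBroadcastJoinThreshold = -1;

Whether broadcasting a 5M-row dimension is viable depends on its serialized size relative to executor memory; pre-filtering or pre-projecting the dimension before the join is a common rewrite to bring it under the broadcast threshold.
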
Medium · Technical
Compute a 7-day moving average of daily active users (DAU) using PostgreSQL. Table: `events(user_id BIGINT, event_ts TIMESTAMP)`. Return columns: day (date), dau (distinct users), moving_avg_7 (7-day average). Explain how you deal with days that have zero events.
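
A minimal sketch of one approach (PostgreSQL), using generate_series to fill zero-event days before applying the window frame:

WITH calendar AS (
  -- one row per day in the observed range, even if no events occurred
  SELECT gs.d::date AS day
  FROM generate_series((SELECT min(event_ts) FROM events)::date,
                       (SELECT max(event_ts) FROM events)::date,
                       interval '1 day') AS gs(d)
),
daily AS (
  -- distinct users per day that actually had events
  SELECT event_ts::date AS day, COUNT(DISTINCT user_id) AS dau
  FROM events
  GROUP BY 1
)
SELECT c.day,
       COALESCE(d.dau, 0) AS dau,
       AVG(COALESCE(d.dau, 0)) OVER (
         ORDER BY c.day
         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS moving_avg_7
FROM calendar c
LEFT JOIN daily d ON d.day = c.day
ORDER BY c.day;

The LEFT JOIN plus COALESCE makes zero-event days contribute 0 to the 7-day average instead of being silently skipped by the window frame.
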
Easy · Technical
Given a PostgreSQL table `users` with schema:
users(
  user_id INTEGER PRIMARY KEY,
  signup_date DATE,
  country TEXT
)
Write a SQL query that returns the number of new users per month for the year 2024 (formatted YYYY-MM), including months with zero signups. Use PostgreSQL and explain how to ensure months with zero signups appear in the result (hint: calendar generation).
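
One possible sketch (PostgreSQL), where a generated calendar of month starts guarantees that empty months still appear:

WITH months AS (
  -- one row per month of 2024
  SELECT gs.m AS month_start
  FROM generate_series('2024-01-01'::date, '2024-12-01'::date,
                       interval '1 month') AS gs(m)
)
SELECT to_char(mo.month_start, 'YYYY-MM') AS month,
       COUNT(u.user_id) AS new_users   -- counts non-NULL user_ids only
FROM months mo
LEFT JOIN users u
  ON date_trunc('month', u.signup_date) = mo.month_start
GROUP BY mo.month_start
ORDER BY mo.month_start;

Using COUNT(u.user_id) rather than COUNT(*) keeps empty months at 0, because the LEFT JOIN yields a NULL user_id for months with no signups.
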
Medium · Technical
Discuss trade-offs between storing analytics data as Parquet files on S3 vs managed columnar tables in Snowflake/Redshift/Spark. Consider SQL performance, predicate pushdown, partitioning, cost, and operational complexity. Which would you pick for ad-hoc SQL analysis vs scheduled dashboards?
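
As a small illustration of the partitioning and predicate-pushdown angle, assuming Spark SQL and a hypothetical S3 path:

-- External Parquet table on S3, partitioned by date
CREATE TABLE events_ext (
  user_id    BIGINT,
  event_type STRING,
  amount     DOUBLE,
  event_date DATE
)
USING PARQUET
PARTITIONED BY (event_date)
LOCATION 's3://example-bucket/events/';

-- The partition-column filter prunes to the matching S3 prefixes;
-- the amount predicate is pushed down to the Parquet reader
SELECT event_type, COUNT(*) AS cnt
FROM events_ext
WHERE event_date BETWEEN DATE '2024-01-01' AND DATE '2024-01-07'
  AND amount > 100
GROUP BY event_type;

A managed warehouse table can offer similar pruning plus automatic clustering and statistics, at the cost of load pipelines and storage tied to the platform.
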
