Aggregation Functions and Group By Questions

Fundamentals of aggregation in Structured Query Language (SQL): the aggregate functions COUNT, SUM, AVG, MIN, and MAX, and how to use them to calculate row counts, totals, averages, minima, and maxima. Covers the GROUP BY clause for grouping rows by one or more dimensions such as customer, product, region, or time period, producing metrics like total revenue by month, average order value by product, or count of transactions by date. Covers the HAVING clause for filtering aggregated groups and how it differs from WHERE, which filters rows before aggregation. Related topics commonly tested in interviews and practical problems are also addressed: grouping by multiple columns, grouping on expressions and date truncation, using DISTINCT inside aggregates, handling NULL values, ordering and limiting grouped results, using aggregates in subqueries or derived tables, and basic performance considerations when aggregating large datasets. Practice examples include calculating monthly revenue, finding customers with more than a threshold number of orders, and identifying top products by sales.
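As a minimal sketch of the WHERE versus HAVING distinction described above, the following uses Python's built-in sqlite3 module with an in-memory database. The `orders` table, its columns, and the sample rows are assumptions made up for illustration. WHERE drops refunded rows before grouping, HAVING then drops whole groups by their aggregate, and the NULL amount shows that COUNT(*) counts rows while COUNT(amount) and SUM(amount) skip NULLs.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the clauses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL,
        status      TEXT
    );
    INSERT INTO orders VALUES
        (1, 10,  50.0, 'paid'),
        (2, 10,  70.0, 'paid'),
        (3, 10,  30.0, 'refunded'),
        (4, 20,  20.0, 'paid'),
        (5, 30, 200.0, 'paid'),
        (6, 30,  NULL, 'paid');
""")

rows = conn.execute("""
    SELECT customer_id,
           COUNT(*)      AS n_orders,   -- counts rows, including NULL amounts
           COUNT(amount) AS n_amounts,  -- counts only non-NULL amounts
           SUM(amount)   AS revenue     -- NULLs are ignored by SUM
    FROM orders
    WHERE status = 'paid'               -- row filter, applied before grouping
    GROUP BY customer_id
    HAVING SUM(amount) > 100            -- group filter, applied after aggregation
    ORDER BY revenue DESC
""").fetchall()

for row in rows:
    print(row)
```

Customer 20 survives the WHERE filter but is removed by HAVING because its paid revenue is only 20. Customer 10's refunded order never reaches the aggregate at all.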

Medium | Technical
Using the PySpark DataFrame API, write code to compute weekly revenue per product from a large orders DataFrame (order_id, product_id, price, quantity, created_at). Include optimizations such as reducing shuffle, choosing a partitioning strategy, and using map-side (partial) aggregation where applicable.
Easy | Technical
Using a sales table (sale_id, product_id, user_id, sale_amount, sale_date), write SQL to return, for each product_id, the number of unique buyers. Use COUNT(DISTINCT) and explain when COUNT(DISTINCT user_id) differs from COUNT(user_id).
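One possible answer, sketched with Python's sqlite3 module so it can be run as is. The sample rows are invented for illustration. The two counts differ whenever the same user buys a product more than once, and both ignore rows where user_id is NULL, unlike COUNT(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        product_id  TEXT,
        user_id     INTEGER,
        sale_amount REAL,
        sale_date   TEXT
    );
    -- Illustrative data: user 1 buys product A twice; one sale has no user_id.
    INSERT INTO sales VALUES
        (1, 'A', 1,    10.0, '2024-01-01'),
        (2, 'A', 1,    15.0, '2024-01-02'),
        (3, 'A', 2,     5.0, '2024-01-02'),
        (4, 'A', NULL,  7.0, '2024-01-03'),
        (5, 'B', 3,     9.0, '2024-01-03');
""")

unique_buyers = conn.execute("""
    SELECT product_id,
           COUNT(DISTINCT user_id) AS unique_buyers,  -- each user counted once
           COUNT(user_id)          AS sales_with_user, -- repeat buyers counted again
           COUNT(*)                AS all_sales        -- includes NULL user_id rows
    FROM sales
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()

for row in unique_buyers:
    print(row)
```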
Hard | System Design
You have aggregates stored in multiple shards/databases (shard_id 1..N). Propose a reliable approach to compute global group-by aggregates (e.g., total revenue per product) across shards with the lowest network overhead, and explain how to ensure correctness if shards may lag. Include SQL-level partial aggregation examples and architectural considerations.
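A rough sketch of the partial-aggregation idea, assuming each shard can run a GROUP BY locally and return one row per product. Two in-memory SQLite databases stand in for shards here; the `sales` table, the `make_shard` helper, and the use of MAX(sale_date) as a per-shard watermark are all illustrative assumptions, not a prescribed design. Only sums and counts cross the network, and the global average is derived from the merged sum and count rather than by averaging per-shard averages.

```python
import sqlite3
from collections import defaultdict

def make_shard(rows):
    """Hypothetical helper: build one in-memory database acting as a shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product_id TEXT, amount REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn

shards = [
    make_shard([("A", 10.0, "2024-01-01"), ("A", 20.0, "2024-01-02"), ("B", 5.0, "2024-01-02")]),
    make_shard([("A", 30.0, "2024-01-03"), ("C", 8.0, "2024-01-01")]),
]

# Each shard returns mergeable partials: SUM and COUNT combine by addition.
PARTIAL_SQL = """
    SELECT product_id,
           SUM(amount)    AS partial_sum,
           COUNT(amount)  AS partial_count,
           MAX(sale_date) AS latest_date
    FROM sales
    GROUP BY product_id
"""

totals = defaultdict(lambda: [0.0, 0])  # product_id -> [sum, count]
shard_watermarks = []                   # latest sale_date seen on each shard

for conn in shards:
    latest_in_shard = None
    for product_id, partial_sum, partial_count, latest_date in conn.execute(PARTIAL_SQL):
        totals[product_id][0] += partial_sum
        totals[product_id][1] += partial_count
        if latest_in_shard is None or latest_date > latest_in_shard:
            latest_in_shard = latest_date
    shard_watermarks.append(latest_in_shard)

global_revenue = {p: s for p, (s, n) in totals.items()}
global_avg = {p: s / n for p, (s, n) in totals.items()}

# Results are only known to be complete up to the slowest shard's watermark.
complete_through = min(shard_watermarks)

print(global_revenue, global_avg, complete_through)
```

Reporting the minimum watermark is one simple way to signal how far the merged result can be trusted when a shard lags. A real system would more likely track ingestion or replication positions than business dates.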
Medium | Technical
Given tables orders(order_id, customer_id, created_at) and order_items(order_item_id, order_id, product_id, price, qty), write a correct SQL query to compute total revenue per customer without double counting when joins may multiply rows. Explain your join strategy.
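One way to approach this, sketched with sqlite3 and invented sample data: aggregate order_items down to one row per order in a derived table first, then join that to orders and group by customer. Because the derived table has at most one row per order, later joins cannot fan out the revenue. The `shipments` table below is a hypothetical extra one-to-many table, added only to show how a naive multi-way join inflates totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, created_at TEXT);
    CREATE TABLE order_items (order_item_id INTEGER PRIMARY KEY, order_id INTEGER,
                              product_id TEXT, price REAL, qty INTEGER);
    -- Hypothetical table, not part of the question: one order can have many shipments.
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);

    INSERT INTO orders VALUES
        (1, 100, '2024-01-01'), (2, 100, '2024-01-05'),
        (3, 200, '2024-01-02'), (4, 300, '2024-01-03');
    INSERT INTO order_items VALUES
        (1, 1, 'p1', 10.0, 2), (2, 1, 'p2', 5.0, 1),
        (3, 2, 'p1', 10.0, 1), (4, 3, 'p3', 7.5, 2);
    INSERT INTO shipments VALUES (1, 1), (2, 1), (3, 2), (4, 3);
""")

# Naive: joining two one-to-many tables repeats each item once per shipment.
naive = conn.execute("""
    SELECT o.customer_id, SUM(oi.price * oi.qty) AS revenue
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN shipments s    ON s.order_id  = o.order_id
    GROUP BY o.customer_id
    ORDER BY o.customer_id
""").fetchall()

# Safer: collapse items to one row per order before joining.
correct = conn.execute("""
    SELECT o.customer_id,
           COALESCE(SUM(t.order_total), 0) AS revenue
    FROM orders o
    LEFT JOIN (
        SELECT order_id, SUM(price * qty) AS order_total
        FROM order_items
        GROUP BY order_id
    ) t ON t.order_id = o.order_id
    GROUP BY o.customer_id
    ORDER BY o.customer_id
""").fetchall()

print("naive:  ", naive)
print("correct:", correct)
```

The LEFT JOIN with COALESCE keeps customer 300, whose order has no items, in the result with zero revenue. The naive inner joins drop that customer and double order 1's items because it has two shipments.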
Hard | Technical
Given raw events that include duplicates, demonstrate how to deduplicate events before aggregation using ROW_NUMBER() in SQL or dropDuplicates with a stable ordering in Spark. Provide SQL that partitions by event_id, keeps the row with the latest ingestion_time, and then counts events per event_type.
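A minimal sketch of the SQL half of this question, run through Python's sqlite3 module. It assumes an `events` table with only the three columns named in the prompt, invented sample rows, and an SQLite build with window-function support (3.25 or later). ROW_NUMBER() ranks the copies of each event_id by ingestion_time, newest first, and only rank 1 reaches the aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, event_type TEXT, ingestion_time TEXT);
    -- Illustrative data: e1 arrives twice and e3 three times.
    INSERT INTO events VALUES
        ('e1', 'click', '2024-01-01T00:00:00'),
        ('e1', 'click', '2024-01-01T00:05:00'),
        ('e2', 'click', '2024-01-01T00:01:00'),
        ('e3', 'view',  '2024-01-01T00:02:00'),
        ('e3', 'view',  '2024-01-01T00:03:00'),
        ('e3', 'view',  '2024-01-01T00:04:00');
""")

raw_counts = conn.execute("""
    SELECT event_type, COUNT(*) FROM events
    GROUP BY event_type ORDER BY event_type
""").fetchall()

dedup_counts = conn.execute("""
    WITH ranked AS (
        SELECT event_id,
               event_type,
               ingestion_time,
               ROW_NUMBER() OVER (
                   PARTITION BY event_id
                   ORDER BY ingestion_time DESC   -- newest copy gets rn = 1
               ) AS rn
        FROM events
    )
    SELECT event_type, COUNT(*) AS n_events
    FROM ranked
    WHERE rn = 1                                   -- keep one row per event_id
    GROUP BY event_type
    ORDER BY event_type
""").fetchall()

print("raw:  ", raw_counts)
print("dedup:", dedup_counts)
```

If two copies of an event can share the same ingestion_time, adding a tiebreaker column to the ORDER BY keeps the choice of surviving row deterministic.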
