InterviewStack.io

AWS Data Services Questions

Specialized knowledge of Amazon Web Services for data storage, processing, analytics, and streaming. This covers object storage and data lake design with Amazon S3 (Simple Storage Service), including storage classes, lifecycle policies, and partitioning strategies; analytics and warehousing with Redshift, including columnar storage, distribution styles, compression, query optimization, and concurrency considerations; big data processing with EMR (Elastic MapReduce) for managed Spark and Hadoop clusters and their tuning; serverless extract, transform, and load (ETL) with Glue, including Data Catalog concepts, schema management, and job orchestration; and real-time data ingestion and processing with Kinesis, including producers, shards, retention, consumers, and stream processing patterns. Candidates should understand when to choose batch versus streaming architectures; how to integrate services into end-to-end data pipelines; the trade-offs around scalability, latency, consistency, security, data governance, and cost optimization; and monitoring and debugging techniques for data workloads.

Medium · Technical
27 practiced
You find thousands of small Parquet files (<1MB) under a partition prefix causing slow query planning and high S3 request cost. Propose a solution using AWS services to compact these files, schedule compaction, and prevent the problem at ingest.
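One part of the compaction step can be sketched in plain Python: deciding which small files to merge together. This is a minimal illustration, not a definitive implementation. The function name `plan_compaction`, the `(key, size_bytes)` input shape, and the 128 MiB target are assumptions made for the example; in practice a scheduled Glue or EMR Spark job would read each batch and rewrite it as a single larger Parquet file.

```python
from typing import List, Tuple

# Assumed target size for each compacted file; tune for the query engine in use.
TARGET_BYTES = 128 * 1024 * 1024


def plan_compaction(
    files: List[Tuple[str, int]], target_bytes: int = TARGET_BYTES
) -> List[List[str]]:
    """Group (key, size_bytes) pairs into batches of roughly target_bytes each.

    Greedy pass over files sorted largest first: a batch is closed as soon as
    adding the next file would push it past the target.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch lists the object keys that one compaction task would merge. The greedy grouping is deliberately simple; a tighter bin-packing strategy could reduce the number of undersized output files.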
Hard · Technical
21 practiced
Compare storing raw events in JSON versus columnar formats like Parquet/ORC in S3 for both streaming and batch pipelines. Discuss implications for schema evolution, compression, query performance, and downstream consumer flexibility.
Hard · Technical
21 practiced
An EMR Spark job is failing with frequent GC overhead and shuffle spill messages. Describe step-by-step diagnostics you would perform using the Spark UI, YARN logs and CloudWatch, and list specific tuning actions (executor memory, cores, shuffle partitions, serialization, broadcast join) to resolve the problem.
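For the shuffle-partition tuning action, a rough sizing calculation can be sketched as follows. This is a rule-of-thumb illustration only: the helper name `suggest_shuffle_partitions`, the 128 MiB per-partition target, and the rounding to a multiple of the total core count are all assumptions for the example, and the shuffle size would be read from the Spark UI stage metrics. The result is a starting value for `spark.sql.shuffle.partitions`, not a guaranteed fix.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * 1024**2,  # assumed target per partition
    total_cores: int = 1,
) -> int:
    """Estimate a shuffle partition count from the observed shuffle size.

    Sizes partitions near target_partition_bytes, then rounds up to a multiple
    of total_cores so that each wave of tasks can occupy every core.
    """
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = math.ceil(max(by_size, 1) / total_cores)
    return waves * total_cores
```

Smaller partitions reduce the memory each task needs, which tends to ease spill and GC pressure at the cost of more task-scheduling overhead.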
Easy · Technical
23 practiced
Define partitioning strategies for S3-based analytics datasets (for example event data). Suggest good partition keys and explain pitfalls such as too-fine-grained partitions, many tiny files, and hot partitions. Provide guidelines for choosing partition granularity.
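As an illustration of date-based partition keys, the sketch below builds a Hive-style S3 prefix from an event timestamp. The function name `partition_prefix`, the bucket name in the usage example, and the choice of `year`/`month`/`day`/`hour` keys are assumptions for the example; the granularity argument shows how the same data can be laid out more coarsely or finely.

```python
from datetime import datetime, timezone


def partition_prefix(base: str, event_time: datetime, granularity: str = "day") -> str:
    """Return a Hive-style partition prefix (key=value path segments) in UTC.

    event_time should be timezone-aware; it is normalized to UTC so that the
    same event always maps to the same partition.
    """
    t = event_time.astimezone(timezone.utc)
    parts = [f"year={t:%Y}", f"month={t:%m}", f"day={t:%d}"]
    if granularity == "hour":
        parts.append(f"hour={t:%H}")
    elif granularity != "day":
        raise ValueError("granularity must be 'day' or 'hour'")
    return "/".join([base.rstrip("/")] + parts) + "/"


# Usage with a hypothetical bucket name:
example = partition_prefix(
    "s3://example-bucket/events",
    datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc),
)
```

Daily partitions are often a reasonable default for event data; hourly partitions only pay off when each hour holds enough data to avoid many tiny files.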
Hard · Technical
27 practiced
(PySpark) Implement a sessionization routine for clickstream data that assigns a session_id per user based on a 30-minute inactivity timeout. Input schema: user_id string, event_time timestamp, bytes int. Output: session_id, user_id, session_start, session_end, event_count, total_bytes. Describe an efficient approach for large-scale batch processing.
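The core sessionization logic can be sketched in plain Python to make the rule concrete before scaling it out. This is a minimal single-machine illustration, not the PySpark answer itself: the function name `sessionize`, the `"<user_id>-<n>"` session_id format, and the choice that a gap of exactly 30 minutes stays in the same session are assumptions for the example. In Spark, the same idea is commonly expressed with `lag` over a window partitioned by user_id and ordered by event_time, a new-session flag, and a running sum of that flag.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

TIMEOUT = timedelta(minutes=30)  # inactivity timeout from the question


def sessionize(events):
    """Turn (user_id, event_time, bytes) tuples into one dict per session.

    Events are sorted by user and time; a new session starts when the gap
    since the user's previous event exceeds TIMEOUT.
    """
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        seq = 0
        current = None
        for _, ts, nbytes in user_events:
            if current is None or ts - current["session_end"] > TIMEOUT:
                if current is not None:
                    sessions.append(current)
                seq += 1
                current = {
                    "session_id": f"{user_id}-{seq}",  # assumed id format
                    "user_id": user_id,
                    "session_start": ts,
                    "session_end": ts,
                    "event_count": 0,
                    "total_bytes": 0,
                }
            current["session_end"] = ts
            current["event_count"] += 1
            current["total_bytes"] += nbytes
        if current is not None:
            sessions.append(current)
    return sessions
```

The sort-then-scan structure is what makes the approach scale: partitioning by user_id keeps each user's events together, so session boundaries can be found without cross-partition joins.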
