AWS Data Services Questions

Specialized knowledge of Amazon Web Services for data storage, processing, analytics, and streaming. This covers object storage and data lake design with Amazon S3 (Simple Storage Service), including storage classes, lifecycle policies, and partitioning strategies; analytics and warehousing with Redshift, including columnar storage, distribution styles, compression, query optimization, and concurrency considerations; big data processing with EMR (Elastic MapReduce) for managed Spark and Hadoop clusters and their tuning; serverless extract, transform, and load (ETL) with Glue, including Data Catalog concepts, schema management, and job orchestration; and real-time data ingestion and processing with Kinesis, including producers, shards, retention, consumers, and stream processing patterns. Candidates should understand when to choose batch versus streaming architectures, how to integrate services into end-to-end data pipelines, the trade-offs around scalability, latency, consistency, security, data governance, and cost, and how to monitor and debug data workloads.

Hard, Technical

27 practiced

(PySpark) Implement a sessionization routine for clickstream data that assigns a session_id per user based on a 30-minute inactivity timeout. Input schema: user_id string, event_time timestamp, bytes int. Output: session_id, user_id, session_start, session_end, event_count, total_bytes. Describe an efficient approach for large-scale batch processing.
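The core of the question is the gap rule itself, which can be sketched in plain Python before translating it to Spark. This is a minimal illustration, not a reference answer: the function name `sessionize`, the `user_id-sequence` format of `session_id`, and the choice that a gap of exactly 30 minutes stays in the same session are all assumptions. In PySpark the same logic is usually expressed with `lag` over a window partitioned by `user_id` and ordered by `event_time`, followed by a running sum of "new session" flags and a group-by aggregation.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter

TIMEOUT = timedelta(minutes=30)  # inactivity timeout from the question


def sessionize(events):
    """Group (user_id, event_time, bytes) tuples into per-user sessions.

    A new session starts when the gap to the previous event of the same
    user is strictly greater than TIMEOUT (assumption: exactly 30 minutes
    still belongs to the current session).
    """
    sessions = []
    # Sort by user, then time: the analogue of partitionBy + orderBy.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        seq = 0
        current = None
        for _, ts, nbytes in user_events:
            if current is None or ts - current["session_end"] > TIMEOUT:
                seq += 1
                current = {
                    # Hypothetical id format: user id plus a per-user counter.
                    "session_id": f"{user_id}-{seq}",
                    "user_id": user_id,
                    "session_start": ts,
                    "session_end": ts,
                    "event_count": 0,
                    "total_bytes": 0,
                }
                sessions.append(current)
            current["session_end"] = ts
            current["event_count"] += 1
            current["total_bytes"] += nbytes
    return sessions
```

Sorting by `(user_id, event_time)` is the expensive step; at scale it corresponds to a shuffle on `user_id`, so skewed users with very long histories are the main thing to watch for.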

Easy, Technical

25 practiced

Explain Amazon S3 consistency guarantees and how they impact data pipelines that write partitioned files and immediately query them via Athena or Redshift Spectrum. Describe common race conditions and best practices to ensure queries see complete data after an ETL job finishes.

Medium, Technical

27 practiced

You find thousands of small Parquet files (<1MB) under a partition prefix causing slow query planning and high S3 request cost. Propose a solution using AWS services to compact these files, schedule compaction, and prevent the problem at ingest.
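One piece of any compaction answer is deciding which small files get merged together. The sketch below is a simple greedy grouping in plain Python, assuming a hypothetical helper name `plan_compaction`, a file listing already fetched as `(key, size_bytes)` pairs, and a 128 MiB target that is only a common rule of thumb rather than a value given in the question.

```python
def plan_compaction(files, target_bytes=128 * 1024 * 1024):
    """Group small files so each group totals roughly target_bytes.

    files: list of (key, size_bytes) pairs for one partition prefix.
    Returns a list of groups, each a list of keys to merge into one file.
    """
    groups = []
    current, current_size = [], 0
    # Largest first, so big files close a group early and small ones fill in.
    for key, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each returned group would become one output file, for example by reading the group's keys in a Spark or Glue job and writing them back as a single Parquet file.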

Hard, Technical

23 practiced

Write pseudocode for a Spark job that compacts time-partitioned Parquet files in S3. For each partition older than N hours, the job should merge small files into larger ones, write atomically to a staging location, and then replace originals. Describe locking, idempotency, and failure recovery.
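The "older than N hours" selection and the idempotency check can be sketched independently of Spark. The example below is a plain-Python illustration under stated assumptions: partitions are named `dt=YYYY-MM-DD/hour=HH` in UTC, and `done` is a set of partitions that already carry a completion marker. Both the function name `eligible_partitions` and these conventions are hypothetical, not part of the question.

```python
from datetime import datetime, timedelta, timezone


def eligible_partitions(partitions, now, min_age_hours, done=frozenset()):
    """Return hourly partitions that are closed, old enough, and not yet compacted.

    partitions: iterable of strings like "dt=2024-01-01/hour=05" (assumed layout).
    done: partitions already compacted; skipping them makes reruns idempotent.
    """
    cutoff = now - timedelta(hours=min_age_hours)
    selected = []
    for name in partitions:
        start = datetime.strptime(name, "dt=%Y-%m-%d/hour=%H")
        start = start.replace(tzinfo=timezone.utc)
        # The partition covers [start, start + 1h); require its end <= cutoff.
        if start + timedelta(hours=1) <= cutoff and name not in done:
            selected.append(name)
    return sorted(selected)
```

Driving the job from this list means a failed run can simply be restarted: partitions that finished are in `done` and are skipped, and partitions that did not finish are recompacted from their untouched originals.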

Medium, Technical

22 practiced

You are delivering streaming events to S3 using Kinesis Data Firehose (now Amazon Data Firehose) with buffering hints (buffer_size and buffer_interval). Explain how buffer size and buffer interval affect latency, S3 file sizes, compression efficiency, and cost, and when you would tune them lower or higher.
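The interaction between the two hints comes down to which threshold is reached first. The following back-of-the-envelope model is a sketch, assuming steady throughput and a hypothetical helper name `estimate_flush`; sizes are uncompressed, on the understanding that the size hint applies before compression, so objects landing in S3 would be smaller when compression is enabled.

```python
def estimate_flush(throughput_mb_per_s, buffer_size_mb, buffer_interval_s):
    """Estimate which buffering hint triggers delivery at steady throughput.

    Returns the trigger ("size" or "interval"), the approximate time between
    delivered files, and the approximate uncompressed size of each file.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds_to_fill = buffer_size_mb / throughput_mb_per_s
    if seconds_to_fill <= buffer_interval_s:
        # High-volume stream: the size hint is reached first.
        return {
            "trigger": "size",
            "seconds_between_files": seconds_to_fill,
            "file_mb": buffer_size_mb,
        }
    # Low-volume stream: the interval expires first, producing smaller files.
    return {
        "trigger": "interval",
        "seconds_between_files": buffer_interval_s,
        "file_mb": throughput_mb_per_s * buffer_interval_s,
    }
```

For a low-volume stream the interval dominates, so raising the size hint alone changes nothing; only a longer interval yields larger files, at the price of higher delivery latency.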
