InterviewStack.io

Big Data Technologies Stack Questions

Overview of big data tooling and platforms used for data ingestion, processing, and analytics at scale. Covers frameworks such as Apache Spark and Hadoop ecosystem components (HDFS, MapReduce, YARN), data lake architectures, streaming and batch processing, and cloud-based data platforms, along with data processing paradigms, distributed storage and compute, data quality, and best practices for building robust data pipelines and analytics infrastructure.

Medium · Technical (44 practiced)
Explain YARN's role and resource management model in Hadoop clusters. Compare YARN with Kubernetes for running Spark jobs: discuss scheduling model, containerization, elasticity, isolation, and operational implications for multi-tenant data platforms.
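One concrete way to contrast the two schedulers is to look at how the same Spark job would be submitted to each. The sketch below builds illustrative `spark-submit` argument lists; the queue name, namespace, API server URL, image tag, and resource sizes are all hypothetical placeholders, not values from any real cluster.

```python
def spark_submit_args(scheduler: str, app_jar: str = "etl.jar") -> list[str]:
    """Build illustrative spark-submit arguments for YARN vs Kubernetes.

    The resource flags map one-to-one between the two; what differs is the
    master URL, how multi-tenancy is expressed (YARN queues vs K8s
    namespaces), and where isolation comes from (YARN containers vs pods
    built from a container image).
    """
    common = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--executor-memory", "8g",
        "--executor-cores", "4",
        "--num-executors", "10",
    ]
    if scheduler == "yarn":
        # YARN: executors run in NodeManager containers; tenant shares are
        # enforced by capacity/fair scheduler queues.
        specific = ["--master", "yarn", "--queue", "analytics"]
    elif scheduler == "k8s":
        # Kubernetes: each executor is a pod; isolation comes from
        # namespaces, cgroups, and the container image itself.
        specific = [
            "--master", "k8s://https://api.cluster.local:6443",
            "--conf", "spark.kubernetes.namespace=analytics",
            "--conf", "spark.kubernetes.container.image=spark:3.5.0",
        ]
    else:
        raise ValueError(f"unknown scheduler: {scheduler}")
    return common + specific + [app_jar]
```

In an answer, the point to draw out is that the `--queue` vs `--conf spark.kubernetes.namespace=...` difference is where each platform's multi-tenancy model surfaces at the job-submission boundary.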
Hard · Technical (42 practiced)
Design an online feature store for low-latency model inference. Describe feature ingestion (streaming and batch), storage choices for online serving (Redis, DynamoDB, RocksDB), freshness SLAs, TTL policies, consistency and atomic updates, and rollback strategies when features are corrected or retrained models require different feature versions.
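A useful way to structure this answer is a tiny in-memory model of the serving layer. The sketch below is a toy stand-in for a store like Redis or DynamoDB, showing three of the asked-for mechanisms: versioned feature keys, TTL-based freshness enforcement, and atomic whole-row writes with rollback. The class name, key scheme, and TTL default are all assumptions for illustration.

```python
import time


class OnlineFeatureStore:
    """Toy in-memory model of an online feature store.

    Real deployments would back this with Redis/DynamoDB/RocksDB; the
    mechanics sketched here are versioning, freshness (TTL), atomic row
    updates, and rollback of corrected feature versions.
    """

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        # (entity_id, feature_version) -> (feature dict, write timestamp)
        self._rows = {}

    def put(self, entity_id: str, features: dict, version: int) -> None:
        # Write the whole feature row in one operation so a reader never
        # observes a partially updated vector (the "atomic update" ask).
        self._rows[(entity_id, version)] = (dict(features), time.time())

    def get(self, entity_id: str, version: int):
        row = self._rows.get((entity_id, version))
        if row is None:
            return None
        features, written_at = row
        if time.time() - written_at > self.ttl:
            # Stale beyond the freshness SLA: treat as missing so the
            # caller can fall back to a default or a batch value.
            return None
        return features

    def rollback(self, entity_id: str, version: int) -> None:
        # Corrected features: drop the bad version; serving pins whichever
        # version the currently deployed model expects.
        self._rows.pop((entity_id, version), None)
```

Keeping the model version in the key is what lets a retrained model read a different feature schema while the old model keeps serving its own version.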
Hard · Technical (33 practiced)
You observe a Parquet table with hundreds of thousands of small files, causing slow queries and metadata overhead. Propose a plan to fix the small-files problem: compaction strategies, tuning writers to produce optimal file sizes, scheduling compaction jobs, and mitigations to avoid impacting SLAs during compaction.
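The core of any compaction plan is deciding which small files to merge into which output files. A minimal sketch of that step, assuming a greedy first-fit grouping toward a target output size (the 512 MB target is an illustrative choice, not a universal recommendation):

```python
def plan_compaction(file_sizes_mb: list[int], target_mb: int = 512) -> list[list[int]]:
    """Group small files into compaction batches near the target size.

    Greedy first-fit over sizes sorted descending: files already at or
    above the target are left alone; smaller files are packed together
    until adding one more would overshoot the target.
    """
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            # Already an acceptable output file; rewriting it would only
            # burn I/O during the compaction window.
            groups.append([size])
            continue
        if total + size > target_mb and current:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups
```

In practice this logic runs inside a scheduled compaction job (or a table format's built-in `OPTIMIZE`-style operation), and the SLA mitigation is to rewrite one partition's groups at a time and swap files in atomically.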
Medium · Technical (35 practiced)
Design a data quality pipeline for streaming events using tools like Deequ or Great Expectations. Specify checks for schema validation, required fields, duplicate detection, cardinality constraints, and distribution drift; where to run checks (stream vs batch); alerting thresholds; and automated remediation workflows.
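Frameworks like Deequ or Great Expectations express these rules declaratively, but the per-event logic can be sketched in plain Python. The function below is a hypothetical illustration (field names, rule labels, and the in-memory dedup set are all assumptions; a real stream job would use a keyed state store or Bloom filter for dedup):

```python
def check_event(event: dict, seen_ids: set,
                required: tuple = ("event_id", "user_id", "ts")) -> list[str]:
    """Return a list of rule violations for one streaming event.

    Covers three of the asked-for check types: required fields,
    duplicate detection, and basic schema/type validation.
    """
    violations = []
    # Required-field checks: missing or empty values.
    for field in required:
        if field not in event or event[field] in (None, ""):
            violations.append(f"missing_required:{field}")
    # Duplicate detection keyed on event_id.
    eid = event.get("event_id")
    if eid is not None:
        if eid in seen_ids:
            violations.append("duplicate_event_id")
        else:
            seen_ids.add(eid)
    # Schema/type check: timestamp must be numeric.
    if "ts" in event and not isinstance(event["ts"], (int, float)):
        violations.append("bad_type:ts")
    return violations
```

Cheap per-record checks like these run in-stream; expensive aggregate checks (cardinality, distribution drift) are better computed in batch over a window, with alerting thresholds on the aggregate results.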
Medium · Technical (45 practiced)
Explain the concept of backpressure in stream processing systems. How does backpressure manifest in a Kafka + Spark/Flink deployment, and what strategies can you apply (rate limiting, autoscaling, batching, buffer sizing, circuit breakers) to mitigate backpressure without losing events or violating SLAs?
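Of the mitigation strategies listed, rate limiting is the easiest to demonstrate concretely. Below is a minimal token-bucket throttle, a common producer-side pattern for slowing ingestion instead of dropping events when downstream lags; the rate and burst values are illustrative, and real deployments would rely on the framework's built-in backpressure (e.g. Flink's credit-based flow control) rather than hand-rolled code.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter.

    Tokens refill at `rate` per second up to `burst`; each event consumes
    one token. When the bucket is empty the caller should buffer or back
    off, propagating backpressure upstream rather than losing events.
    """

    def __init__(self, rate: float, burst: int):
        self.rate = rate              # tokens added per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # back off: buffer the event, do not drop it
```

The `burst` parameter is the knob that trades latency smoothing against buffer sizing: a larger burst absorbs spikes without throttling, at the cost of a bigger downstream queue.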
