Data Engineering & Analytics Infrastructure Topics
Data pipeline design, ETL/ELT processes, streaming architectures, data warehousing infrastructure, analytics platform design, and real-time data processing. Covers event-driven systems, batch and streaming trade-offs, data quality and governance at scale, schema design for analytics, and infrastructure for big data processing. Distinct from Data Science & Analytics (which focuses on statistical analysis and insights) and from Cloud & Infrastructure (platform-focused rather than data-flow focused).
Data Transformation and Preparation
Focuses on the technical skills and judgement required to connect to data sources, clean and shape data, and prepare datasets for analysis and visualization. Includes identifying necessary transformations such as calculations, aggregations, filtering, joins, and type conversions; deciding whether to perform transformations in the business intelligence tool or in the data warehouse or database layer; designing efficient data models and extract, transform, load (ETL) workflows; ensuring data quality, lineage, and freshness; applying performance optimization techniques such as incremental refresh and pushdown processing; and working with tools and features such as Power BI Power Query, Tableau data preparation capabilities, and SQL for database-level transformations. Also covers documentation, reproducibility, and testing of data preparation pipelines.
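As an illustration of incremental refresh, the sketch below upserts only rows changed since a stored high-water mark instead of reloading the full source. It is a minimal example under stated assumptions: the row shape, the `updated_at` column, and the function name `incremental_refresh` are hypothetical, and the strict greater-than comparison assumes no two changes share a timestamp with the watermark.

```python
from datetime import datetime


def incremental_refresh(source_rows, target, watermark):
    """Upsert rows modified after `watermark`; return the new watermark.

    `source_rows`, `target`, and the `updated_at` field are hypothetical
    stand-ins for a source table, a destination table, and an audit column.
    """
    new_watermark = watermark
    for row in source_rows:
        if row["updated_at"] > watermark:
            target[row["id"]] = row  # upsert keyed by primary key
            if row["updated_at"] > new_watermark:
                new_watermark = row["updated_at"]
    return new_watermark


source = [
    {"id": 1, "amount": 10.0, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 20.0, "updated_at": datetime(2024, 1, 2)},
    {"id": 1, "amount": 15.0, "updated_at": datetime(2024, 1, 3)},
]
target = {}
# Only rows changed after 2024-01-01 are applied on this run.
watermark = incremental_refresh(source, target, datetime(2024, 1, 1))
```

Running the function again with the returned watermark applies nothing, which is the property that makes the refresh cheap on repeated runs.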
Distributed Data Processing and Optimization
Comprehensive knowledge of processing large datasets across a cluster and practical techniques for optimizing end-to-end data pipelines in frameworks such as Apache Spark. Candidates should understand distributed computation patterns such as MapReduce and embarrassingly parallel workloads, how work is partitioned across tasks and executors, and how partitioning strategies affect data locality and performance. They should explain how and when data shuffles occur, why shuffles are expensive, and how to minimize shuffle cost using narrow transformations, careful use of repartition and coalesce, broadcast joins for small lookup tables, and map-side join approaches. Coverage should include join strategies and broadcast variables, avoiding unnecessary wide transformations, caching versus persistence trade-offs, handling data skew with salting and repartitioning, and selecting effective partition keys. Resource management and tuning topics include executor memory and overhead, cores per executor, degree of parallelism, number of partitions, task sizing, and trade-offs between processing speed and resource usage. Fault tolerance and scaling topics include checkpointing, persistence for recovery, and strategies for horizontal scaling. Candidates should also demonstrate monitoring, debugging, and profiling skills using the framework user interface and logs to diagnose shuffles, stragglers, and skew, and to propose actionable tuning changes and coding patterns that scale in distributed environments.
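The salting idea for skewed keys can be sketched without a cluster. The example below is plain Python rather than Spark code: it splits each key into several salted sub-keys, aggregates per sub-key, then combines the partial results. The function name `salted_sum` and the round-robin salt are assumptions made for determinism; random salts are also common in practice.

```python
from collections import defaultdict


def salted_sum(records, num_salts):
    """Two-stage sum: first by (key, salt), then by key alone.

    Hypothetical helper illustrating salting; not a Spark API.
    """
    # Stage 1: spread each key over `num_salts` sub-keys so that no single
    # group (and hence no single task) receives all records of a hot key.
    partial = defaultdict(int)
    for i, (key, value) in enumerate(records):
        partial[(key, i % num_salts)] += value

    # Stage 2: drop the salt and combine the much smaller partial results.
    total = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        total[key] += subtotal
    return dict(partial), dict(total)


# One hot key with 1000 records and one cold key with 10 records.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
partial, total = salted_sum(records, num_salts=8)
largest_hot_group = max(v for (k, _s), v in partial.items() if k == "hot")
```

With eight salts the largest group for the hot key holds 125 records instead of 1000, at the cost of a second, small aggregation step.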
Data Pipelines and Feature Platforms
Designing and operating data pipelines and feature platforms involves engineering reliable, scalable systems that convert raw data into production-ready features and deliver those features to both training and inference environments. Candidates should be able to discuss batch and streaming ingestion architectures, distributed processing approaches using systems such as Apache Spark and streaming engines, and orchestration patterns using workflow engines. Core topics include schema management and evolution, data validation and data quality monitoring, handling event-time semantics and operational challenges such as late-arriving data and data skew, stateful stream processing, windowing and watermarking, and strategies for idempotent and fault-tolerant processing. The role of feature stores and feature platforms includes feature definition management, feature versioning, point-in-time correctness, consistency between training and serving, online low-latency feature retrieval, offline materialization and backfilling, and trade-offs between real-time and offline computation. Feature engineering strategies, detection and mitigation of distribution shift, dataset versioning, metadata and discoverability, governance and compliance, and lineage and reproducibility are important areas. For senior- and staff-level candidates, design considerations expand to multi-tenant platform architecture, platform application programming interfaces and onboarding, access control, resource management and cost optimization, scaling and partitioning strategies, caching and hot key mitigation, monitoring and observability including service-level objectives, testing and continuous integration and continuous delivery for data pipelines, and operational practices for supporting hundreds of models across teams.
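Point-in-time correctness means a training row may only see feature values that were known at the label's timestamp. A minimal sketch of that lookup follows, assuming integer timestamps and a feature history sorted in ascending order; the function name `point_in_time_value` and the sample data are hypothetical.

```python
from bisect import bisect_right


def point_in_time_value(feature_history, as_of):
    """Return the latest feature value with timestamp <= as_of, else None.

    `feature_history` is a list of (timestamp, value) pairs sorted by
    timestamp (an assumption of this sketch).
    """
    timestamps = [ts for ts, _value in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx > 0 else None


# Feature values as they became known over time.
history = [(10, 0.2), (20, 0.5), (30, 0.9)]
# Label events as (timestamp, label).
labels = [(5, 0), (20, 1), (25, 0), (40, 1)]

# Each training row joins the label to the feature value known at that time,
# never to a later one, which avoids leaking future information.
training_rows = [
    (ts, point_in_time_value(history, ts), label) for ts, label in labels
]
```

The label at time 5 gets no feature value because none existed yet, and the label at time 25 gets the value written at time 20 rather than the later one at time 30.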
Data Reliability and Fault Tolerance
Design and operate data pipelines and stream processing systems to guarantee correctness, durability, and predictable recovery under partial failures, network partitions, and node crashes. Topics include delivery semantics such as at-most-once, at-least-once, and exactly-once, and the trade-offs among latency, throughput, and complexity. Candidates should understand idempotent processing, deduplication techniques using unique identifiers or sequence numbers, transactional and atomic write strategies, and coordinator-based or two-phase commit approaches when appropriate. State management topics include checkpointing, snapshotting, write-ahead logs, consistent snapshots for aggregations and joins, recovery of operator state, and handling out-of-order events. Operational practices include safe retries and circuit breaker patterns for downstream dependencies, dead-letter queues and reconciliation processes, strategies for replay and backfill, runbooks and automation for incident response, and failure-mode testing and chaos experiments. Data correctness topics include validation and data quality checks, schema evolution and compatibility strategies, lineage and provenance, and approaches to detect and remediate data corruption and schema drift. Observability topics cover metrics, logs, tracing, alerting for pipeline health and state integrity, and designing alerts and dashboards to detect and diagnose processing errors. The topic also includes reasoning about when exactly-once semantics are achievable versus when at-least-once delivery with compensating actions or idempotent sinks is preferable given operational and performance trade-offs.
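To make the at-least-once plus idempotent sink combination concrete, the sketch below deduplicates redelivered events by a unique identifier before applying them. The class name `IdempotentSink` and the event fields are hypothetical, and the in-memory set stands in for state that a real system would need to commit atomically with the write itself.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by a unique event id.

    Hypothetical example: a production sink would persist `seen_ids`
    together with the state change in one transaction.
    """

    def __init__(self):
        self.seen_ids = set()
        self.balance = 0

    def apply(self, event):
        # A redelivered event is recognised by its id and skipped.
        if event["event_id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["event_id"])
        self.balance += event["amount"]
        return True


# At-least-once delivery: event e1 arrives twice after a retry.
deliveries = [
    {"event_id": "e1", "amount": 100},
    {"event_id": "e2", "amount": -30},
    {"event_id": "e1", "amount": 100},
]
sink = IdempotentSink()
applied = [sink.apply(event) for event in deliveries]
```

The duplicate delivery is ignored, so the resulting balance reflects each event exactly once even though the transport delivered one of them twice.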
Apache Spark Architecture
Covers core Apache Spark architecture and programming model, including the roles of the driver and executors, cluster manager options, resource allocation, executor memory and cores, partitions, tasks, stages, and the directed acyclic graph used for job execution. Explains lazy evaluation and the distinction between transformations and actions, fault tolerance mechanisms, caching and persistence strategies, partitioning and shuffle behavior, broadcast variables and accumulators, and techniques for performance tuning and handling data skew. Compares Resilient Distributed Datasets, DataFrames, and Datasets, describing when to use each API, the benefits of the DataFrame and Spark SQL APIs driven by the Catalyst optimizer and Tungsten execution engine, and considerations for user-defined functions, serialization, checkpointing, and common data sources and formats.
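Lazy evaluation can be illustrated with a toy model in plain Python. The `LazyDataset` class below is hypothetical and is not Spark's API: its `map` and `filter` methods only record a plan, and nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy model of lazy transformations and an eager action (not Spark)."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: run the whole recorded plan over the source.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    calls.append(x)  # records when the function actually runs
    return x * 2


plan = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: only a plan exists so far
result = plan.collect()
```

Because the full plan is known before execution, an engine has the opportunity to reorder or fuse steps, which is the motivation for deferring work until an action.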
Experimentation Platforms and Infrastructure
Addresses the technical and organizational infrastructure required to run experiments at scale. Topics include randomization and assignment strategies, traffic allocation, instrumentation and metric collection pipelines, experiment configuration and rollout systems, experiment tracking and metadata, data quality and monitoring, guardrails to detect interference or contamination, automated validity checks, self-service experimentation tooling, governance and permissions, and approaches to scale experimentation across many teams while preserving statistical validity. Senior conversations include designing experiment platforms, enabling self-service and observability, and trade-offs when scaling experiment velocity across products.
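One common assignment strategy is deterministic hashing of the user and experiment identifiers, so that a user always lands in the same variant without storing the assignment. The sketch below assumes an even split and uses hypothetical names (`assign_variant`, the identifier format, the bucket count).

```python
import hashlib


def assign_variant(user_id, experiment_id,
                   variants=("control", "treatment"), buckets=1000):
    """Deterministically map a user to a variant for one experiment.

    Hypothetical helper: hashing the experiment id together with the
    user id keeps assignments independent across experiments.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % buckets
    # Split the bucket range evenly across the variants.
    return variants[bucket * len(variants) // buckets]


first = assign_variant("user-42", "checkout-button")
second = assign_variant("user-42", "checkout-button")
```

Repeated calls for the same user and experiment return the same variant, and changing the bucket boundaries is one way to adjust traffic allocation.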
Batch and Stream Processing
Covers design and implementation of data processing using batch, stream, or hybrid approaches. Candidates should be able to explain when to choose batch versus streaming based on latency, throughput, cost, data volume, and business requirements, and compare architectural patterns such as lambda and kappa. Core stream concepts include event time versus processing time, windowing strategies such as tumbling, sliding, and session windows, watermarks and late arrivals, event ordering and out-of-order data handling, stateful versus stateless processing, state management and checkpointing, and delivery semantics including exactly-once and at-least-once. Also includes knowledge of streaming and batch engines and runtimes, connector patterns for sources and sinks, partitioning and scaling strategies, backpressure and flow control, idempotency and deduplication techniques, testing and replayability, monitoring and alerting, and integration with storage layers such as data lakes and data warehouses. Interview focus is on reasoning about correctness, latency, cost, and operational complexity, and on concrete architecture and tooling choices.
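The interaction of tumbling windows, watermarks, and late arrivals can be sketched in a few lines. This example assumes integer event times, a watermark defined as the maximum event time seen minus an allowed lateness, and a policy of dropping events whose window has already closed; the function name and parameters are hypothetical, and real engines offer other policies such as side outputs.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, allowed_lateness):
    """Count events per tumbling window, dropping events that arrive too late.

    `events` is an iterable of event-time integers in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            dropped.append(event_time)  # window already closed
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# Event times in arrival order: 8 is out of order but within lateness,
# while 3 arrives after its window has been closed by the watermark.
arrivals = [1, 4, 12, 8, 25, 3]
counts, dropped = tumbling_window_counts(
    arrivals, window_size=10, allowed_lateness=5
)
```

A larger allowed lateness accepts more out-of-order data but delays when a window's result can be treated as final, which is the correctness versus latency trade-off in miniature.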
Data Quality and Governance
Covers the principles, frameworks, practices, and tooling used to ensure data is accurate, complete, timely, and trustworthy across systems and pipelines. Key areas include data quality checks and monitoring such as nullness and type checks, freshness and timeliness validation, referential integrity, deduplication, outlier detection, reconciliation, and automated alerting. Includes design of service-level agreements for data freshness and accuracy, data lineage and impact analysis, metadata and catalog management, data classification, access controls, and compliance policies. Encompasses operational reliability of data systems, including failure handling, recovery time objectives, backup and disaster recovery strategies, and observability and incident response for data anomalies. Also covers domain- and system-specific considerations, such as those for customer relationship management and sales systems: common causes of data problems, prevention strategies like input validation rules, canonicalization, deduplication, and user training, and the business impact on forecasting and operations. Candidates may be evaluated on designing end-to-end data quality programs, selecting metrics and tooling, defining roles and stewardship, and implementing automated pipelines and governance controls.
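A few of the checks named above (nullness, duplicate keys, freshness) can be expressed as a small batch validator. This is a sketch only: the function name `run_quality_checks`, the issue labels, and the sample columns are hypothetical, and a real program would route the issues to alerting rather than return a list.

```python
from datetime import datetime, timedelta


def run_quality_checks(rows, required, key, ts_field, now, max_age):
    """Return a list of (row_index, issue) pairs for a batch of rows."""
    issues = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Nullness check on required fields.
        for field in required:
            if row.get(field) is None:
                issues.append((i, f"null:{field}"))
        # Duplicate check on the business key.
        if row.get(key) in seen_keys:
            issues.append((i, "duplicate_key"))
        seen_keys.add(row.get(key))
    # Freshness check: the newest row must be recent enough.
    newest = max(
        (r[ts_field] for r in rows if r.get(ts_field) is not None),
        default=None,
    )
    if newest is None or now - newest > max_age:
        issues.append((None, "stale"))
    return issues


rows = [
    {"id": 1, "email": "a@example.com", "loaded_at": datetime(2024, 1, 1, 0)},
    {"id": 2, "email": None, "loaded_at": datetime(2024, 1, 1, 1)},
    {"id": 1, "email": "b@example.com", "loaded_at": datetime(2024, 1, 1, 2)},
]
issues = run_quality_checks(
    rows,
    required=("id", "email"),
    key="id",
    ts_field="loaded_at",
    now=datetime(2024, 1, 1, 12),
    max_age=timedelta(hours=6),
)
```

Here the batch reports one null email, one duplicate id, and a freshness violation because the newest row is ten hours old against a six-hour limit.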
Data Pipeline Scalability and Performance
Design data pipelines that meet throughput and latency targets at large scale. Topics include capacity planning, partitioning and sharding strategies, parallelism and concurrency, batching and windowing trade-offs, network and I/O bottlenecks, replication and load balancing, resource isolation, autoscaling patterns, and techniques for maintaining performance as data volume grows by orders of magnitude. Also covers approaches for benchmarking, backpressure management, cost versus performance trade-offs, and strategies to avoid hot spots.
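Capacity planning often starts with back-of-the-envelope arithmetic. The sketch below estimates how many partitions are needed for a target throughput while reserving headroom for bursts. All figures and the function name `partitions_needed` are illustrative assumptions, not measurements of any particular system.

```python
import math


def partitions_needed(peak_events_per_sec, avg_event_bytes,
                      per_partition_bytes_per_sec, headroom=0.5):
    """Estimate the partition count for a target peak throughput.

    `headroom` is the fraction of each partition's capacity left unused
    to absorb bursts and rebalancing (an assumed planning parameter).
    """
    required_bytes_per_sec = peak_events_per_sec * avg_event_bytes
    usable_per_partition = per_partition_bytes_per_sec * (1 - headroom)
    return math.ceil(required_bytes_per_sec / usable_per_partition)


# Assumed workload: 200,000 events/s at 1,000 bytes each, with each
# partition sustaining 10 MB/s and half of that held back as headroom.
today = partitions_needed(200_000, 1_000, 10_000_000)
# The same pipeline after volume grows by one order of magnitude.
after_growth = partitions_needed(2_000_000, 1_000, 10_000_000)
```

Under these assumptions the estimate grows linearly with volume, which holds only if keys are spread evenly; a hot key can saturate one partition long before the aggregate limit is reached.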