Data Engineering & Analytics Infrastructure Topics
Data pipeline design, ETL/ELT processes, streaming architectures, data warehousing infrastructure, analytics platform design, and real-time data processing. Covers event-driven systems, batch and streaming trade-offs, data quality and governance at scale, schema design for analytics, and infrastructure for big data processing. Distinct from Data Science & Analytics (which focuses on statistical analysis and insights) and from Cloud & Infrastructure (platform-focused rather than data-flow focused).
Real-Time Data Processing and Consistency
Covers architectures and trade-offs for low-latency real-time processing and the consistency constraints that accompany them. Candidates should describe streaming ingestion pipelines, stateful stream processors, message queues, and event-driven architectures. Discuss ordering and delivery semantics, idempotence, deduplication, and approaches for exactly-once or at-least-once delivery; batching and back-pressure strategies; the role of caches and materialized views in reducing latency; and how to reason about correctness versus freshness in a real-time system.
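The interplay of at-least-once delivery, idempotence, and deduplication can be sketched in a few lines. This is a minimal illustration, not a production pattern: the in-memory `seen` set stands in for the durable keyed state or transactional sink a real stream processor would use, and the message fields are invented for the example.

```python
# Sketch: at-least-once delivery made effectively exactly-once by
# deduplicating on a message ID before applying an idempotent update.
# `seen` is in-memory here; production systems persist it (keyed state
# backend, transactional sink, or an idempotent upsert in the store).

def make_idempotent_consumer():
    seen = set()   # processed message IDs (durable in a real system)
    totals = {}    # materialized aggregate: running sum per user

    def handle(message):
        msg_id = message["id"]
        if msg_id in seen:      # redelivered duplicate: drop it
            return False
        seen.add(msg_id)
        user = message["user"]
        totals[user] = totals.get(user, 0) + message["amount"]
        return True

    return handle, totals

handle, totals = make_idempotent_consumer()
# At-least-once delivery may replay a message such as "m1":
events = [
    {"id": "m1", "user": "a", "amount": 10},
    {"id": "m2", "user": "a", "amount": 5},
    {"id": "m1", "user": "a", "amount": 10},  # duplicate redelivery
]
for event in events:
    handle(event)
# totals["a"] is 15, not 25: the replayed message was deduplicated
```

The same reasoning extends to sinks: if the final write is idempotent (keyed upsert rather than blind append), upstream retries cannot inflate the result.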
Analytical Data Systems and Warehousing
Architectures and operational patterns for analytical workloads and reporting. Coverage includes data warehouses, data marts, column-oriented analytic storage, data lake and lakehouse architectures, extract-transform-load (ETL) and extract-load-transform (ELT) pipelines, batch and streaming ingestion, schema-on-read versus schema-on-write, materialized views and aggregation strategies, columnar compression and storage formats, partitioning and clustering tuned for analytic queries, cost-versus-performance trade-offs for managed cloud services, and integration with business intelligence and reporting tools. Candidates should be able to distinguish online analytical processing (OLAP) from online transaction processing (OLTP) and choose appropriate architectures and tools for large-scale analytics, including managed offerings and cost optimization strategies.
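The row-oriented versus column-oriented distinction at the heart of the OLTP/OLAP split can be made concrete with the same table in both layouts. A hedged sketch; the table, column names, and values are illustrative:

```python
# Same table in two layouts. Row-oriented storage keeps whole records
# together (good for OLTP point lookups and updates); column-oriented
# storage keeps one array per attribute (good for OLAP scans, and each
# homogeneous array compresses well).

rows = [  # row-oriented: one record per order
    {"order_id": 1, "region": "eu", "amount": 120},
    {"order_id": 2, "region": "us", "amount": 80},
    {"order_id": 3, "region": "eu", "amount": 50},
]

columns = {  # column-oriented: one array per attribute
    "order_id": [1, 2, 3],
    "region": ["eu", "us", "eu"],
    "amount": [120, 80, 50],
}

# Analytic query: total amount per region. The columnar layout lets the
# scan touch only the two columns it needs, never "order_id".
totals = {}
for region, amount in zip(columns["region"], columns["amount"]):
    totals[region] = totals.get(region, 0) + amount
# totals == {"eu": 170, "us": 80}
```

Partitioning follows the same logic at a coarser grain: splitting the column files by, say, region lets the engine skip entire partitions that a query's filter excludes.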
Data Quality and Validation
Covers the core concepts and hands-on techniques for detecting, diagnosing, and preventing data quality problems. Topics include common data issues such as missing values, duplicates, outliers, incorrect labels, inconsistent formats, schema mismatches, referential integrity violations, and distribution or temporal drift. Candidates should be able to design and implement validation checks and data profiling queries, including schema validation, column-level constraints, aggregate checks, distinct counts, null and outlier detection, and business logic tests. This topic also covers the mindset of data validation and exploration: how to approach unfamiliar datasets, validate calculations against sources, document quality rules, decide remediation strategies such as imputation, quarantine, or alerting, and communicate data limitations to stakeholders.
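The listed check types can be combined into a small profiling pass. This is a sketch under stated assumptions: the field names, the valid-country reference set, and the age range are all invented for the example, and a real pipeline would emit these findings to a quality report or quarantine table rather than a list.

```python
# Sketch of column-level validation: schema check, duplicate-key check,
# null and range-based outlier detection, and a business-logic test
# against a reference set. All names and thresholds are illustrative.

records = [
    {"user_id": 1, "age": 34, "country": "DE"},
    {"user_id": 2, "age": None, "country": "DE"},   # null age
    {"user_id": 3, "age": 180, "country": "XX"},    # outlier + bad code
    {"user_id": 1, "age": 29, "country": "FR"},     # duplicate key
]

EXPECTED_SCHEMA = {"user_id", "age", "country"}
VALID_COUNTRIES = {"DE", "FR", "US"}

def profile(records):
    issues = []
    # Schema validation: every record has exactly the expected fields
    for r in records:
        if set(r) != EXPECTED_SCHEMA:
            issues.append(("schema_mismatch", r))
    # Uniqueness check on the key column
    ids = [r["user_id"] for r in records]
    if len(ids) != len(set(ids)):
        issues.append(("duplicate_keys", len(ids) - len(set(ids))))
    # Null and outlier detection on a numeric column
    for r in records:
        if r["age"] is None:
            issues.append(("null_age", r["user_id"]))
        elif not (0 <= r["age"] <= 120):
            issues.append(("age_outlier", r["user_id"]))
    # Business-logic test: country codes must come from a reference set
    for r in records:
        if r["country"] not in VALID_COUNTRIES:
            issues.append(("unknown_country", r["user_id"]))
    return issues

issues = profile(records)
```

Each finding pairs a rule name with the offending key, which is the shape a remediation step (impute, quarantine, or alert) typically consumes.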
Data Governance and Classification
Covers frameworks and practices for classifying and governing organizational data. Candidates should demonstrate how to define classification schemes such as public, internal, confidential, and restricted, and how to identify sensitivity categories including personally identifiable information, protected health information, payment and financial data, biometric data, and location data. Explain how classification drives data handling requirements, including storage location choices, encryption and access control policies, retention and deletion schedules, data minimization, and cross-border data handling. Discuss implementation patterns such as metadata and labeling, automated discovery and classification, integration with data pipelines and applications, policy enforcement and auditing, roles and responsibilities for data stewardship, and how to align classification with legal and regulatory compliance and privacy requirements.
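The idea that classification labels drive handling requirements can be sketched as a small metadata lookup. The labels match the four-tier scheme named above, but the columns, retention periods, and the default-to-strictest rule are illustrative assumptions, not a real governance framework.

```python
# Sketch: column-level classification labels (metadata) resolved into
# handling policy. Columns, retention values, and the strictest-default
# rule are illustrative.

CLASSIFICATION = {  # column -> sensitivity label
    "email": "confidential",    # personally identifiable information
    "diagnosis": "restricted",  # protected health information
    "page_views": "internal",
    "blog_post": "public",
}

POLICY = {  # label -> handling requirements
    "public":       {"encrypt_at_rest": False, "retention_days": None},
    "internal":     {"encrypt_at_rest": True,  "retention_days": 730},
    "confidential": {"encrypt_at_rest": True,  "retention_days": 365},
    "restricted":   {"encrypt_at_rest": True,  "retention_days": 90},
}

def handling_for(column):
    """Resolve a column's handling requirements from its label.
    Unlabeled columns default to the strictest tier until classified."""
    label = CLASSIFICATION.get(column, "restricted")
    return label, POLICY[label]

label, policy = handling_for("diagnosis")
# label == "restricted"; policy requires encryption and 90-day retention
```

Pipelines and applications would consult the same lookup at write time, which is what makes auditing and policy enforcement mechanical rather than ad hoc.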