InterviewStack.io

Data Collection and Instrumentation Questions

Designing and implementing reliable data collection and the supporting data infrastructure to power analytics and machine learning. Covers event tracking and instrumentation design, decisions about which events to log and at what schema granularity, data validation and quality controls at collection time, sampling and deduplication strategies, attribution and measurement challenges, and trade-offs between data richness and cost. Includes pipeline and ingestion patterns for real-time and batch processing, scalability and maintainability of pipelines, backfill and replay strategies, storage and retention trade-offs and policy design, anomaly detection and monitoring, and the operational cost and complexity of measurement systems. Also covers privacy and compliance considerations, privacy-preserving techniques, governance frameworks, ownership models, and senior-level architecture and operationalization decisions.

Medium · System Design
Design an ingestion architecture for collecting client events from mobile apps that must support both real-time feature updates and historical training data. Requirements: ingest 100k events/sec, provide <200ms processing latency for online features, persist raw events durably to object storage (S3), enforce schema validation via a registry, and gracefully handle offline clients. Sketch components, data flow, and strategies for durability, backpressure, batching, and schema evolution.
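A small piece of such a design can be sketched in code: a client-side event batcher that buffers events, flushes on a size threshold, and sheds the oldest events when its buffer cap is hit (a simple form of backpressure for offline clients). This is an illustrative sketch, not a production SDK; the class and its parameters are hypothetical, and the "transport" is a local list standing in for an HTTP POST to the collection endpoint.

```python
import json
from collections import deque


class EventBatcher:
    """Minimal client-side batcher: buffers events, flushes when the
    batch is full, and drops the oldest events when the buffer hits
    its cap (simple backpressure handling for offline clients)."""

    def __init__(self, flush_size=100, max_buffer=1000):
        self.flush_size = flush_size
        self.max_buffer = max_buffer
        self.buffer = deque()
        self.sent_batches = []  # stand-in for a network transport

    def track(self, event):
        if len(self.buffer) >= self.max_buffer:
            self.buffer.popleft()  # shed oldest event under backpressure
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch = list(self.buffer)
        self.buffer.clear()
        # A real client would POST this batch to the ingestion endpoint
        # and retry with exponential backoff on failure.
        self.sent_batches.append(json.dumps(batch))
```

A production answer would add a time-based flush, persistent on-disk queuing for offline periods, and schema validation against the registry before send.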
Medium · Technical
Explain schema evolution strategies for event- and file-based data. Define backward, forward, and full compatibility and give concrete examples of how Avro, Protobuf, and Parquet support evolution. Describe practical patterns for renaming fields, changing types, and adding/removing fields without breaking consumers and how to use a schema registry to enforce rules.
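The core of backward compatibility can be shown in a few lines: a new reader schema adds a field with a default, so records written under the old schema still decode. The sketch below simulates what Avro's schema resolution does with defaults, using plain dicts rather than the avro library; the field names and defaults are made up for illustration.

```python
# New reader schema: 'country' was added after old records were written,
# so it carries a default that resolution fills in for old data.
NEW_SCHEMA_DEFAULTS = {"user_id": None, "event": None, "country": "unknown"}


def decode(record, defaults=NEW_SCHEMA_DEFAULTS):
    """Fill missing fields from schema defaults, mimicking how Avro
    resolves an old-schema record against a newer reader schema."""
    return {field: record.get(field, default) for field, default in defaults.items()}


old_record = {"user_id": "u1", "event": "click"}  # written before 'country' existed
print(decode(old_record))
# → {'user_id': 'u1', 'event': 'click', 'country': 'unknown'}
```

The same idea underlies registry compatibility checks: adding a field with a default is backward compatible, while renaming a field (with no alias) or changing its type without a promotion rule breaks existing consumers.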
Hard · Technical
Design a scalable near-duplicate detection solution for textual data ingestion to prevent storing near-identical user submissions at petabyte scale. Describe the use of MinHash/LSH or alternative locality-sensitive hashing techniques, index sharding, tuning for recall/precision, handling index updates (adds/deletes), and how to evaluate dedup effectiveness at scale.
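The MinHash building block the question asks about fits in a short sketch: for each of k seeded hash functions, keep the minimum hash over the token set, and estimate Jaccard similarity as the fraction of matching signature positions. This toy version uses MD5 with a seed prefix in place of a proper hash family and skips the LSH banding/sharding layer; it is illustrative only.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over the token set."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox jumped over a lazy dog".split())
sig_a, sig_b = minhash_signature(a), minhash_signature(b)
# estimated_jaccard(sig_a, sig_b) approximates the true Jaccard (here 7/10)
```

At petabyte scale the interview answer layers LSH banding on top of these signatures so that only candidate pairs sharing a band bucket are compared, with bands/rows tuned to the target recall/precision.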
Hard · Technical
Create a cost model and storage-tiering strategy for telemetry data across hot SSD-backed stores for recent queries, warm object storage (S3 standard/frequent access), and cold archives (Glacier). Given estimated request rates and ML training windows, show how you'd decide which data to place in each tier, set TTLs, and handle retrieval patterns to minimize overall cost while meeting SLAs.
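The skeleton of such a cost model is simple: for each age band of data, pick the cheapest tier whose retrieval latency still meets the band's SLA, then sum storage cost per tier. The prices and latencies below are illustrative placeholders, not real AWS pricing.

```python
# Illustrative $/GB-month and max retrieval latency (seconds) per tier.
TIERS = {
    "hot_ssd": (0.10, 0.01),
    "warm_s3": (0.023, 1.0),
    "cold_glacier": (0.004, 3600.0),
}


def cheapest_tier(required_latency_s):
    """Cheapest tier whose retrieval latency meets the SLA."""
    candidates = [
        (cost, name)
        for name, (cost, latency) in TIERS.items()
        if latency <= required_latency_s
    ]
    return min(candidates)[1]


def monthly_cost(gb_by_tier):
    """Total monthly storage cost given GB placed in each tier."""
    return sum(TIERS[tier][0] * gb for tier, gb in gb_by_tier.items())
```

Interactive dashboards over recent data force the hot tier; ML training jobs that read in bulk tolerate warm object storage; compliance-only retention goes cold. A fuller model would also price retrieval requests and egress, which often dominate for frequently restored archives.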
Easy · Technical
Explain the trade-offs between logging raw event-level data and storing only aggregated metrics for analytics and ML training. Discuss cost, queryability, the ability to reproduce experiments, debugging ability, privacy exposures, and recommended hybrid patterns that give both flexibility and cost control.
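A common hybrid pattern can be sketched directly: keep exact aggregates for every event (cheap, always queryable) plus a small raw sample retained for debugging and experiment reproduction. The class and its 1% default sample rate are assumptions for illustration.

```python
import random


class HybridCollector:
    """Hybrid pattern: full-fidelity aggregates for every event, plus a
    sampled slice of raw events kept for debugging and replay."""

    def __init__(self, sample_rate=0.01, seed=None):
        self.sample_rate = sample_rate
        self.counts = {}        # exact aggregate over all events
        self.raw_sample = []    # sampled raw events
        self._rng = random.Random(seed)

    def record(self, event):
        name = event["name"]
        self.counts[name] = self.counts.get(name, 0) + 1  # never sampled
        if self._rng.random() < self.sample_rate:
            self.raw_sample.append(event)  # raw detail, at reduced cost
```

The aggregate answers "how many" exactly; the raw sample answers "what did one of these look like" without paying to store every event. Privacy exposure also shrinks, since most raw payloads are never persisted.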
