
Data Collection and Instrumentation Questions

Designing and implementing reliable data collection and the supporting data infrastructure to power analytics and machine learning. Covers event tracking and instrumentation design, decisions about which events to log and at what schema granularity, data validation and quality controls at collection time, sampling and deduplication strategies, attribution and measurement challenges, and trade-offs between data richness and cost. Includes pipeline and ingestion patterns for real-time and batch processing, scalability and maintainability of pipelines, backfill and replay strategies, storage and retention trade-offs, retention policy design, anomaly detection and monitoring, and the operational cost and complexity of measurement systems. Also covers privacy and compliance considerations, privacy-preserving techniques, governance frameworks, ownership models, and senior-level architecture and operationalization decisions.

Hard · Technical
As a senior AI Engineer, propose a governance framework for telemetry and instrumentation that includes an ownership model, SLAs for data quality and freshness, a data-contract lifecycle (create, deprecate, enforce), compliance controls, and operational enforcement. Explain how you would measure the framework's effectiveness and how to scale governance without it becoming a bottleneck.
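One way to make the data-contract lifecycle concrete is sketched below: a minimal, hypothetical contract object with an enforcement check at ingestion. The field names, owner label, and SLA values are illustrative assumptions, not any particular framework's API.

```python
# Hypothetical sketch: a versioned data contract plus a simple ingestion-time check.
# Field names, owners, and SLA values are illustrative only.
from dataclasses import dataclass, field


@dataclass
class DataContract:
    name: str
    version: int
    owner: str                      # owning team accountable for quality SLAs
    freshness_sla_minutes: int      # max acceptable end-to-end delay
    required_fields: dict = field(default_factory=dict)  # field -> expected type
    deprecated: bool = False


def validate_event(event: dict, contract: DataContract) -> list[str]:
    """Return a list of contract violations for one event (empty list = compliant)."""
    violations = []
    if contract.deprecated:
        violations.append(f"contract {contract.name} v{contract.version} is deprecated")
    for field_name, expected_type in contract.required_fields.items():
        if field_name not in event:
            violations.append(f"missing required field: {field_name}")
        elif not isinstance(event[field_name], expected_type):
            violations.append(f"wrong type for {field_name}")
    return violations


# Example: enforce at ingestion and route violating events to a quarantine path.
contract = DataContract(
    name="checkout_click", version=2, owner="growth-analytics",
    freshness_sla_minutes=15,
    required_fields={"user_id": str, "ts": float, "item_id": str},
)
print(validate_event({"user_id": "u1", "ts": 1.0}, contract))
```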
Medium · Technical
You receive a high-volume event stream where occasionally the same user action is emitted multiple times due to client retries. Describe concrete approaches to deduplicate events in a streaming pipeline. Discuss time-windowed deduplication, stateful processing (e.g., Flink keyed state), Bloom filters for memory savings, and producer-side idempotency; include trade-offs for latency, memory, false positives, and exactly-once guarantees.
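A minimal sketch of one of the approaches named above, time-windowed deduplication with rotating Bloom filters, assuming a single process and illustrative sizing parameters; a production pipeline would typically keep equivalent state in the stream processor's keyed state instead.

```python
# Illustrative sketch: time-windowed dedup with two rotating Bloom filters.
# Rotation bounds memory; false positives drop valid events, so filter size and
# hash count trade memory against over-deduplication.
import hashlib


class RotatingBloomDeduper:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4,
                 window_seconds: int = 300):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.window = window_seconds
        self.current = bytearray(num_bits // 8)
        self.previous = bytearray(num_bits // 8)
        self.window_start = 0.0

    def _positions(self, event_id: str):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{event_id}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def _contains(self, filt: bytearray, positions) -> bool:
        return all(filt[p // 8] & (1 << (p % 8)) for p in positions)

    def is_new(self, event_id: str, timestamp: float) -> bool:
        # Rotate once per window so old IDs eventually expire.
        if timestamp - self.window_start >= self.window:
            self.previous, self.current = self.current, bytearray(len(self.current))
            self.window_start = timestamp
        positions = list(self._positions(event_id))
        if self._contains(self.current, positions) or self._contains(self.previous, positions):
            return False  # probably a duplicate (subject to false positives)
        for p in positions:
            self.current[p // 8] |= 1 << (p % 8)
        return True
```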
Medium · Technical
Implement a Python class that deduplicates events using an in-memory sliding window. The structure should keep seen event IDs with a TTL (in seconds) and evict expired IDs efficiently. Provide an add_event(event_id, timestamp) -> bool method (returning True if the event is new) and describe a background eviction approach. Assume a single process, and explain how the design changes in distributed deployments.
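A minimal single-process sketch of one possible answer follows; the class name is arbitrary, and eviction is amortized into add_event rather than run on a background thread (a periodic sweeper thread would be the alternative the prompt alludes to).

```python
# Sketch of one possible single-process answer; the add_event signature follows the prompt.
# In a distributed deployment this state would move to keyed operator state
# (e.g. Flink) or a shared store such as Redis with per-key TTLs.
from collections import OrderedDict


class SlidingWindowDeduper:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._seen = OrderedDict()  # event_id -> last-seen timestamp, insertion-ordered

    def add_event(self, event_id: str, timestamp: float) -> bool:
        """Return True if the event is new within the TTL window."""
        self._evict(timestamp)
        if event_id in self._seen:
            return False
        self._seen[event_id] = timestamp
        return True

    def _evict(self, now: float) -> None:
        # Oldest entries sit at the front, so eviction stops at the first live one.
        while self._seen:
            _, ts = next(iter(self._seen.items()))
            if now - ts >= self.ttl:
                self._seen.popitem(last=False)
            else:
                break


# Usage: duplicates within the TTL are suppressed; later re-sends pass again.
d = SlidingWindowDeduper(ttl_seconds=60)
assert d.add_event("evt-1", 0.0) is True
assert d.add_event("evt-1", 10.0) is False
assert d.add_event("evt-1", 90.0) is True
```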
Hard · Technical
You must design logging and storage strategies for ultra-high-cardinality categorical features (such as user_id or device_id) so that analytics and ML training remain feasible without unbounded growth. Evaluate techniques including hashing (fixed-size buckets), frequency thresholding (cataloging only frequent keys), embedding catalogs, and sketch summaries (Count-Min, HyperLogLog), and discuss the implications for model accuracy and privacy.
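An illustrative sketch of two of the named techniques, the hashing trick with fixed-size buckets and a Count-Min sketch used for frequency thresholding; bucket counts and sketch dimensions here are placeholder values.

```python
# Illustrative sketch: hashing trick (fixed-size buckets) plus a Count-Min sketch
# for approximate frequency counts. Sizes below are arbitrary placeholders.
import hashlib


def hash_bucket(value: str, num_buckets: int = 1_000_000) -> int:
    """Map an unbounded categorical value into a fixed-size bucket space."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


class CountMinSketch:
    """Approximate frequency counts in sub-linear memory (only overestimates)."""
    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))


# Keep an explicit catalog only for keys whose estimated frequency clears a threshold;
# everything else falls back to its hashed bucket.
cms = CountMinSketch()
for device in ["d-1", "d-1", "d-2"]:
    cms.add(device)
frequent = cms.estimate("d-1") >= 2
print(hash_bucket("d-1"), frequent)
```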
Hard · Technical
Describe the instrumentation and logs required to enable counterfactual policy evaluation and reliable A/B testing of models in production. Include details on logging randomization seeds, treatment assignment, eligibility checks, exposures (impressions), user covariates, and any necessary offline computation steps to estimate counterfactual metrics and correct for bias.
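A toy sketch of what such exposure logs can enable offline: an inverse-propensity-scoring (IPS) estimate of a candidate policy's reward. The record fields, the uniform logging propensities, and the new_policy function are assumptions for illustration only.

```python
# Illustrative sketch: exposure logs carrying assignment + propensity, and a
# simple inverse-propensity-scoring (IPS) estimate of a new policy's reward.
# Field names and the toy logging policy are assumptions for this example.

# Each record logs what the question asks for: the assignment, the probability the
# logging policy gave that action (propensity), and the observed outcome.
logs = [
    {"unit_id": "u1", "action": "rank_a", "propensity": 0.5, "reward": 1.0},
    {"unit_id": "u2", "action": "rank_b", "propensity": 0.5, "reward": 0.0},
    {"unit_id": "u3", "action": "rank_a", "propensity": 0.5, "reward": 0.0},
]


def new_policy(unit_id: str) -> str:
    """Hypothetical target policy we want to evaluate offline."""
    return "rank_a"


def ips_estimate(records, policy) -> float:
    """Average reward the target policy would have earned, reweighted by propensity."""
    total = 0.0
    for r in records:
        if policy(r["unit_id"]) == r["action"]:
            total += r["reward"] / r["propensity"]   # up-weight matching exposures
        # non-matching exposures contribute zero
    return total / len(records)


print(ips_estimate(logs, new_policy))  # ~0.67 for this toy sample
```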
