InterviewStack.io

Data Collection and Instrumentation Questions

Designing and implementing reliable data collection and the supporting data infrastructure that powers analytics and machine learning. Covers event tracking and instrumentation design, decisions about which events to log and at what schema granularity, data validation and quality controls at collection time, sampling and deduplication strategies, attribution and measurement challenges, and trade-offs between data richness and cost. Includes pipeline and ingestion patterns for real-time and batch processing, pipeline scalability and maintainability, backfill and replay strategies, storage and retention trade-offs, retention policy design, anomaly detection and monitoring, and the operational cost and complexity of measurement systems. Also covers privacy and compliance considerations, privacy-preserving techniques, governance frameworks, ownership models, and senior-level architecture and operationalization decisions.

Medium · Technical
Write a SQL query (BigQuery/standard SQL) to deduplicate event records in a table events_raw(event_id, user_id, event_name, event_time, ingestion_time, payload) keeping the earliest ingestion_time per event_id, and explain your approach and assumptions. Include handling for NULL ingestion_time.
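
One possible shape for an answer, as a sketch rather than a reference solution: rank the rows within each event_id with ROW_NUMBER() and keep the first. The example below runs the query against an in-memory SQLite database from Python so it can be tried locally. The helper name dedupe_events, the sample rows, and the event_time tiebreak are assumptions made for illustration. It also assumes that a NULL ingestion_time should sort last, so such a row survives only when no timestamped copy of the event exists. In BigQuery the same ranking could be expressed with a QUALIFY clause instead of the outer filter.

```python
import sqlite3

# Rank duplicates per event_id; non-NULL ingestion_time sorts first, then earliest.
# "ingestion_time IS NULL" evaluates to 0/1 (false/true), so NULL rows sort last.
DEDUP_SQL = """
SELECT event_id, user_id, event_name, event_time, ingestion_time, payload
FROM (
  SELECT
    e.*,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY ingestion_time IS NULL, ingestion_time, event_time
    ) AS rn
  FROM events_raw AS e
)
WHERE rn = 1
ORDER BY event_id
"""


def dedupe_events(rows):
    """Load rows into a scratch events_raw table and return one row per event_id.

    Hypothetical helper for illustration; timestamps are ISO-8601 strings so
    that text ordering matches chronological ordering.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events_raw ("
        "event_id TEXT, user_id TEXT, event_name TEXT, "
        "event_time TEXT, ingestion_time TEXT, payload TEXT)"
    )
    conn.executemany("INSERT INTO events_raw VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(DEDUP_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [
        ("e1", "u1", "click", "2024-01-01T00:00:00", "2024-01-01T00:00:05", "late copy"),
        ("e1", "u1", "click", "2024-01-01T00:00:00", "2024-01-01T00:00:02", "first copy"),
        ("e1", "u1", "click", "2024-01-01T00:00:00", None, "missing ingestion_time"),
        ("e2", "u2", "view", "2024-01-01T00:01:00", None, "only copy"),
    ]
    for row in dedupe_events(sample):
        print(row)
```

Sorting NULLs last rather than dropping them keeps events whose only copy lacks an ingestion_time. Whether that is the right choice depends on how the missing values arise, which the question leaves open.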
Medium · Technical
A pipeline transformation accidentally started emitting events with incorrect timezone-normalized timestamps, skewing daily aggregates. Explain how you would detect such a bug, roll back or fix it, and remediate affected historical aggregates. Include communication to stakeholders.
Hard · Technical
You must define an ownership model for telemetry and analytics across engineering, product, and data teams. Propose a RACI-style model that defines who owns instrumentation design, schema changes, pipeline maintenance, data quality alerts, and SLA monitoring. Explain how you would operationalize handoffs.
Hard · Technical
As PM responsible for analytics cost, design a metric-level chargeback model to attribute data platform costs to product teams based on event volume, retention, and query cost. Describe inputs, how to make it predictable, and incentives to reduce unnecessary telemetry.
Medium · Technical
Explain how idempotency and deduplication are implemented at ingestion to prevent double-counting of events when mobile clients retry transmissions. Suggest both client-side and server-side techniques and describe trade-offs of each.
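
To make the two sides of that question concrete, here is a minimal sketch under stated assumptions. The client attaches a UUID to each event once and reuses it on every retry, and the server remembers recently seen IDs for a bounded window. The class IdempotentIngestor, the helper new_event, and the in-memory dictionary are hypothetical. A real service would more likely keep the seen-ID set in a shared store so that all ingestion nodes agree.

```python
import time
import uuid


def new_event(event_name, user_id):
    """Client side: assign the ID once, at creation, and reuse it on every retry."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_name": event_name,
        "user_id": user_id,
    }


class IdempotentIngestor:
    """Server side: accept each event_id at most once within a TTL window.

    Hypothetical in-memory sketch; the clock is injectable so the expiry
    behaviour can be exercised without sleeping.
    """

    def __init__(self, ttl_seconds=86400.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}      # event_id -> expiry time
        self.accepted = []   # events that passed deduplication

    def ingest(self, event):
        """Return True if the event was stored, False if it was a duplicate."""
        now = self._clock()
        # Drop expired IDs so memory stays bounded (linear scan, fine for a sketch).
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._seen[event_id] = now + self._ttl
        self.accepted.append(event)
        return True


if __name__ == "__main__":
    ingestor = IdempotentIngestor(ttl_seconds=60.0)
    event = new_event("purchase", "u1")
    print(ingestor.ingest(event))  # first delivery
    print(ingestor.ingest(event))  # client retry with the same event_id
```

The TTL is where the trade-off shows: a short window keeps the seen-ID set small but lets a very late retry be counted twice, while a long window costs more memory or storage.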
