Classification and Regression Fundamentals Questions

Covers the core concepts and distinctions between classification and regression in supervised learning. Classification predicts discrete categories, either binary or multiclass, while regression predicts continuous numerical values. Candidates should understand how to format and encode target variables for each task, common algorithms for each family, and the theoretical foundations of representative models such as linear regression and logistic regression. For regression, know least squares estimation, coefficient interpretation, residual analysis, assumptions of the linear model, R-squared, and common loss and error measures including mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). For classification, know logistic regression with its sigmoid transformation and probability interpretation, decision trees, k-nearest neighbors, and other basic classifiers; understand loss functions such as cross-entropy and evaluation metrics including accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (ROC AUC). Also be prepared to discuss model selection, regularization techniques such as L1 and L2 regularization, handling class imbalance, calibration and probability outputs, feature preprocessing and encoding for targets and inputs, and trade-offs when choosing approaches based on problem constraints and data characteristics.
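As a quick reference for the error measures and metrics named above, here is a minimal sketch in plain Python. The function names `regression_metrics` and `precision_recall_f1` and the sample values are illustrative assumptions, not part of any library.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-squared for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    mae = sum(abs(r) for r in residuals) / n
    # R-squared is undefined when the target is constant (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical sample values, chosen only to exercise the functions.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = precision_recall_f1([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0])
print(reg)
print(clf)
```

Because RMSE squares residuals before averaging, it penalizes large errors more heavily than MAE does, which is one reason the two are often reported together.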

Medium, Technical

You're evaluating a regression model and find a few extreme outliers with very large residuals. Discuss at least four approaches for dealing with outliers in training and reporting, considering their effects on model bias and stakeholder trust.
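Two of the many possible approaches can be sketched in plain Python: a robust loss for training (Huber loss is used here as one common choice) and transparent reporting that shows metrics with and without flagged points. The residual values, the `delta` of 1.0, and the modified z-score threshold of 3.5 are assumptions for illustration only.

```python
import math
import statistics


def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear in the tails, so extremes count for less."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def flag_outliers(residuals, threshold=3.5):
    """Flag residuals whose modified z-score (median/MAD based) is extreme."""
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    if mad == 0:
        return [False] * len(residuals)
    return [abs(0.6745 * (r - med) / mad) > threshold for r in residuals]


def rmse(residuals):
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def mae(residuals):
    return sum(abs(r) for r in residuals) / len(residuals)


# Hypothetical residuals with one extreme value.
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 25.0]
flags = flag_outliers(residuals)
kept = [r for r, is_outlier in zip(residuals, flags) if not is_outlier]

# Report both views rather than silently dropping points.
report = {
    "rmse_all": rmse(residuals),
    "mae_all": mae(residuals),
    "rmse_without_flagged": rmse(kept),
    "mae_without_flagged": mae(kept),
    "n_flagged": sum(flags),
}
print(report)
```

Showing both sets of numbers, along with the count of flagged points, lets stakeholders see how much the headline metric depends on a handful of observations.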

Medium, Technical

Given the transactions table below, write a SQL query to create a binary target column 'high_value' where high_value = 1 if the user's total spend in the past 30 days exceeds $500, else 0. Use the schema:

transactions(transaction_id PK, user_id INT, amount DECIMAL(10,2), occurred_at TIMESTAMP)

Assume the output needs one row per user, including that user's latest transaction timestamp.
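One possible shape for such a query is sketched below, run through Python's built-in `sqlite3` module so it can be tried locally. Several things are assumptions: the SQLite dialect (date arithmetic differs in other engines), a fixed `:as_of` reference time standing in for "now", timestamps stored as `YYYY-MM-DD HH:MM:SS` text, and the sample rows.

```python
import sqlite3

# Conditional aggregation keeps users with no recent spend (high_value = 0).
QUERY = """
SELECT
    user_id,
    MAX(occurred_at) AS latest_occurred_at,
    CASE
        WHEN SUM(CASE
                     WHEN occurred_at >= datetime(:as_of, '-30 days') THEN amount
                     ELSE 0
                 END) > 500
        THEN 1 ELSE 0
    END AS high_value
FROM transactions
WHERE occurred_at <= :as_of
GROUP BY user_id
ORDER BY user_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        user_id INT,
        amount DECIMAL(10,2),
        occurred_at TIMESTAMP
    )
    """
)

# Hypothetical sample data.
sample_rows = [
    (1, 1, 300.00, "2024-03-10 12:00:00"),
    (2, 1, 250.00, "2024-03-20 09:30:00"),
    (3, 2, 900.00, "2024-01-15 08:00:00"),  # outside the 30-day window
    (4, 2, 100.00, "2024-03-25 18:45:00"),
    (5, 3, 500.00, "2024-03-05 10:00:00"),  # exactly 500 does not exceed 500
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", sample_rows)

result = conn.execute(QUERY, {"as_of": "2024-03-31 00:00:00"}).fetchall()
for row in result:
    print(row)
```

Filtering to the 30-day window in a `WHERE` clause instead would drop users with no recent transactions entirely, which may or may not be what the target definition intends.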

Medium, Technical

Explain what R-squared measures in linear regression. Provide two examples of when a high R-squared can be misleading in a BI report and what additional diagnostics you would include to ensure the model is reliable.
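One illustrative case, sketched under assumed data: a straight line fitted to a purely quadratic relationship (y = x^2 for x = 0..10) still yields a high R-squared, while the residuals show an obvious systematic pattern. The helper `fit_line` is a hypothetical name for a simple least-squares fit.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Assumed data: a curved relationship that a line cannot capture.
x = list(range(11))
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

mean_y = sum(y) / len(y)
ss_res = sum(r * r for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1.0 - ss_res / ss_tot

# Residual signs form long runs (+ + - ... - + +) instead of scattering
# randomly around zero, which signals a misspecified functional form.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(round(r_squared, 4), signs)
```

A residual-versus-fitted plot, or even the sign pattern printed above, exposes the problem that the single R-squared number hides.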

Hard, Technical

A classification model produces well-calibrated probability scores during training, but in production you notice overconfident probabilities (e.g., a predicted probability of 0.9 where the actual rate is about 0.6). Describe steps to diagnose the cause and methods to recalibrate probabilities in a BI scoring pipeline.
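A first diagnostic step can be sketched as a reliability table that compares the mean predicted probability with the observed positive rate per score bin. The sketch below also shows histogram binning as a deliberately simple recalibration method; Platt scaling and isotonic regression are the more commonly cited alternatives. The function names, the bin count, and the labeled production sample are all assumptions.

```python
def reliability_table(scores, labels, n_bins=5):
    """Group scores into equal-width bins and compare predicted vs observed."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    table = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        count = len(members)
        table.append({
            "bin": idx,
            "count": count,
            "mean_predicted": sum(s for s, _ in members) / count,
            "observed_rate": sum(y for _, y in members) / count,
        })
    return table


def expected_calibration_error(table):
    """Count-weighted average gap between predicted and observed rates."""
    total = sum(row["count"] for row in table)
    return sum(
        row["count"] / total * abs(row["mean_predicted"] - row["observed_rate"])
        for row in table
    )


def recalibrate(score, table, n_bins=5):
    """Histogram binning: replace a score with its bin's observed rate."""
    idx = min(int(score * n_bins), n_bins - 1)
    for row in table:
        if row["bin"] == idx:
            return row["observed_rate"]
    return score  # no labeled data for this bin, leave the score unchanged


# Hypothetical labeled production sample: scores of 0.9 are right only 60% of the time.
scores = [0.9] * 10 + [0.25] * 10
labels = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8

table = reliability_table(scores, labels)
ece = expected_calibration_error(table)
print(table)
print(round(ece, 3), recalibrate(0.9, table))
```

In practice the table used for recalibration would be fitted on a held-out labeled sample that reflects current production data, and refreshed as the data drifts.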

Medium, Technical

Describe how you would use cross-validation in a BI-focused regression problem to estimate out-of-sample MAPE (mean absolute percentage error). Specify which cross-validation strategy you would choose for i) i.i.d. data and ii) time-ordered data, and why.
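The two splitting strategies can be contrasted in a short plain-Python sketch: shuffled k-fold for i.i.d. data and expanding-window splits, where training data always precedes test data, for time-ordered data. The function names, the naive mean forecast used as a stand-in model, and the sample series are assumptions for illustration.

```python
import random


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when any true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def shuffled_kfold_splits(n, k, seed=0):
    """Shuffled k-fold: suitable when rows are i.i.d."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


def expanding_window_splits(n, k):
    """Each fold trains on everything before its test block, so no future leaks in."""
    test_size = n // (k + 1)
    splits = []
    for start in range(n - k * test_size, n, test_size):
        splits.append((list(range(start)), list(range(start, start + test_size))))
    return splits


def cross_val_mape(y, splits):
    """Average out-of-sample MAPE of a naive mean forecast across folds."""
    fold_scores = []
    for train, test in splits:
        forecast = sum(y[i] for i in train) / len(train)
        fold_scores.append(mape([y[i] for i in test], [forecast] * len(test)))
    return sum(fold_scores) / len(fold_scores)


# Hypothetical trending series.
y = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
iid_splits = shuffled_kfold_splits(len(y), 5)
time_splits = expanding_window_splits(len(y), 3)
print(cross_val_mape(y, iid_splits), cross_val_mape(y, time_splits))
```

On a trending series like this one, shuffled folds let the model see values from after the test period, so their MAPE tends to look better than what a forward-looking deployment would achieve.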
