Thursday, January 20, 2022

Data Science and ML Interview Questions


Introduction

This blog is an accumulation of all the interview questions I have faced, as well as the ones faced by my peers/friends. I will try to answer each question with a brief explanation/notes

Q1:What are some examples of discrete probability distributions

Ans: Binomial and Poisson


Q2:If X is normally distributed then what is Pr(X=2)

Ans: 0. For a continuous random variable, the probability of any single point is zero. Probabilities are obtained by integrating the density over an interval


Q3:What percentage of values lie between -2σ and 2σ for a normal distribution

Ans: About 95%. The usual rule of thumb is:

  • 68% between -1σ and 1σ
  • 95% between -2σ and 2σ
  • 99.7% between -3σ and 3σ
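
These percentages can be checked with the standard library. This is a small sketch using `statistics.NormalDist`; the helper name `coverage` is my own.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {coverage(k):.4f}")
```

The three values printed are 0.6827, 0.9545 and 0.9973, which are usually rounded to 68%, 95% and 99.7%.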


Q4:Can you think of a situation when low R2 is justified

Ans: In many B2B (business-to-business) settings, the usual economic variables explain only a small share of the variance in the data, because a lot of external factors can't be captured as regressors. A low R2 can be acceptable in such scenarios


Q5:What is the probability distribution of p value

Ans: Uniform distribution on [0, 1], provided the null hypothesis is true and the test statistic is continuous
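
A quick simulation can make this concrete. The sketch below assumes a two-sided z-test with known standard deviation and data generated with the null hypothesis true; the function name `z_test_p_value`, the sample size and the number of repetitions are all my own choices.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the run is repeatable
std_normal = NormalDist()

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - std_normal.cdf(abs(z)))

# Draw many samples for which the null hypothesis really is true.
p_values = [
    z_test_p_value([random.gauss(0.0, 1.0) for _ in range(30)])
    for _ in range(5000)
]

# Under a uniform distribution, roughly 10% of p-values fall in each tenth of [0, 1].
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
fractions = [count / len(p_values) for count in bins]
print([round(f, 3) for f in fractions])
```

Each fraction should come out close to 0.1, which is what a uniform distribution on [0, 1] looks like in a histogram.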


Q6:What is the basis of linear regression

Ans: Ordinary least squares (OLS)
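
For a single predictor, the OLS estimates have a simple closed form. Here is a minimal sketch, assuming one X variable and made-up data; `ols_fit` is a hypothetical helper name.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) that minimise the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = ols_fit(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

For this data the fitted line is roughly Y = 1.09 + 1.97 X.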


Q7:What is the basis of logistic regression

Ans: Maximum likelihood estimation (MLE)
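
To illustrate what "maximum likelihood" means here, the sketch below fits a one-variable logistic regression by plain gradient ascent on the log-likelihood. The data, the learning rate and the names `fit_logistic` and `log_likelihood` are assumptions for illustration; real libraries use faster optimisers.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of the labels under P(y=1|x) = sigmoid(b0 + b1*x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * log(p) + (1 - y) * log(1 - p)
    return total

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on the (averaged) log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Illustrative, deliberately non-separable data.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(log_likelihood(0.0, 0.0, xs, ys), log_likelihood(b0, b1, xs, ys))
```

The second log-likelihood is higher than the first, i.e. the fitted coefficients make the observed labels more probable than the starting values did.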


Q8:What is the distribution of β coefficients in regression

Ans: In large samples, the estimated β coefficients are approximately normally distributed


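Q9:If X1 and X2 are normally distributed, what will be the distribution of X1 + X2

Ans: If X1 and X2 are independent (or, more generally, jointly normal), X1 + X2 is also normally distributed. In the independent case its mean is μ1 + μ2 and its variance is σ1² + σ2². This is a property of the normal family itself and does not rely on the central limit theorem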
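
Python's `statistics.NormalDist` supports adding two independent normal variables directly, which gives a quick check. The parameter values below are made up.

```python
from statistics import NormalDist

x1 = NormalDist(mu=10.0, sigma=3.0)
x2 = NormalDist(mu=-4.0, sigma=4.0)

# Adding two NormalDist objects treats them as independent:
# the means add and the variances add.
total = x1 + x2
print(total.mean, total.variance)
```

Here the mean is 10 + (-4) = 6 and the variance is 9 + 16 = 25.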


Q10:What are the assumptions of linear regression

Ans: There are 4 assumptions:

  • The conditional distribution of the error term ui given Xi has a mean of zero
  • (Xi, Yi) are IID, i.e. independent and identically distributed
  • There is no perfect multicollinearity
  • Large outliers are unlikely


Q11:What are the evaluation metrics for linear regression

Ans: There are 2 metrics:

  • Adjusted R2 (R2adj)
  • Root Mean Square Error (RMSE)
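
Both metrics are easy to compute by hand. A minimal sketch with made-up predictions; the function names are my own.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, n_predictors):
    """R2 penalised for the number of predictors in the model."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Illustrative values only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.5]
print(rmse(y_true, y_pred))
print(r_squared(y_true, y_pred))
print(adjusted_r_squared(y_true, y_pred, n_predictors=1))
```

Adjusted R2 is always a little lower than plain R2 when the fit is imperfect, and the gap grows as more predictors are added without improving the fit.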


Q12:How do we check presence of multicollinearity in linear regression

Ans: Through the Variance Inflation Factor (VIF). A common rule of thumb is that variables with VIF values greater than 5 (some use 10) are candidates for removal
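
The VIF of a variable is 1/(1 - R2), where R2 comes from regressing that variable on the other predictors. With only two predictors, that R2 is just their squared correlation, which keeps the sketch short. The data and helper names below are made up.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x_a, x_b):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(x_a, x_b)
    return 1.0 / (1.0 - r ** 2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]   # almost exactly 2 * x1
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]    # largely unrelated to x1

print(vif_two_predictors(x1, x2))  # very large: strong multicollinearity
print(vif_two_predictors(x1, x3))  # close to 1: little multicollinearity
```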


Q13:Why do we take log of a variable in certain analysis

Ans: If a variable spans a very wide range of values or is heavily right-skewed, taking its log compresses the range and reduces the variance. This only works for strictly positive values.


Q14:If there is a categorical variable with N unique values, then why do we only take dummy variables for N-1 values

Ans: If we take dummy variables for all N values, it leads to perfect multicollinearity (the dummy variable trap), as shown below

D1 + D2 + … + DN = 1

D1 + D2 + … + DN = 1 × X0

Here X0 is the regressor for the intercept β0 and is always equal to 1, so X0 is a perfect linear combination of the N dummy variables
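
The dependence is easy to see on a tiny example. This is a sketch with a made-up categorical column; `one_hot` is a hypothetical helper, not a library function.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 dummy column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colours = ["red", "green", "blue", "green", "red"]
categories, dummies = one_hot(colours)

# With all N dummies, every row sums to 1, i.e. to the intercept column X0.
row_sums = [sum(row) for row in dummies]
print(row_sums)

# Dropping one category (here the first, "blue") breaks the exact dependence.
reduced = [row[1:] for row in dummies]
reduced_sums = [sum(row) for row in reduced]
print(reduced_sums)
```

With N-1 dummies the row sums are no longer constant, so the dummies are no longer an exact copy of the intercept column. The dropped category becomes the baseline against which the other coefficients are read.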


Q15:What metric does a decision tree use

Ans: Gini index. A decision tree uses the Gini index to choose how to split a node. Entropy (information gain) is a common alternative
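
A small sketch of how the Gini index scores a candidate split, assuming a binary split and made-up labels. The function names are my own.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["a"] * 5 + ["b"] * 5
print(gini(parent))                                   # 0.5 for a 50/50 node
print(split_gini(["a"] * 5, ["b"] * 5))               # 0.0 for a perfect split
print(split_gini(["a"] * 4 + ["b"], ["a"] + ["b"] * 4))
```

The tree prefers the split with the lowest weighted impurity, so the perfect split wins over the mixed one here.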


Q16:How can we select between two competing logistic regression models

Ans: There are 2 metrics:

  • AIC/BIC: The values of the Akaike and Bayesian information criteria should be lower
  • F1 Score: The value of the F1 score should be higher


Q17:What are the disadvantages of a decision tree

Ans: A decision tree often overfits the data. Also, a plain ensemble of decision trees trained on the same data leads to correlated trees, which offer little improvement in fit


Q18:How does a random forest work

Ans: Random forest is an ensemble model in which multiple decision trees are trained independently on randomly sampled datasets and randomly selected attributes. This reduces the correlation between the trees in the ensemble. The final prediction is made by averaging (regression) or voting (classification).


Q19:Give a scenario of overfitting in Machine Learning

Ans: If the model is trained on, let's say, 5% of the dataset, it has too few records to learn patterns that generalize, so it ends up fitting the noise in the training sample. Once it is tested on test data, it performs poorly

Overfitting can also happen when we include variables with a large number of unique/distinct values


Q20:What is sensitivity in classification model

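Ans: Sensitivity is the proportion of actual positives that the model classifies correctly. It is defined, along with the related metrics, from the confusion matrix counts: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).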

  • Accuracy: (TN+TP)/(TN+TP+FN+FP)

  • Sensitivity: Proportion of actual positives correctly classified. Also called true positive rate
    TP/(TP+FN)

  • Specificity: Proportion of actual negatives correctly classified. Also called true negative rate
    TN/(TN+FP)

  • Precision: Proportion of predicted positives that are actually positive
    TP/(TP+FP)

  • Recall: Same as sensitivity

  • F1 Score: (2 × Precision × Recall)/(Precision+Recall). Important in cases where we want to shortlist the best model among a set of competing models
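
The formulas above translate directly into code. A minimal sketch with made-up confusion matrix counts; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics listed above from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # also called recall or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Illustrative counts only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```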


Q21:Give an example of an Unsupervised Machine Learning model

Ans:

  • K Means clustering
  • Latent Dirichlet Allocation
  • Latent Semantic Analysis


Q22:Is KNN an Unsupervised or a Supervised Machine Learning model

Ans: KNN is a supervised ML model, since it predicts from labelled training examples


Q23:How do you identify cut off value in logistic regression

Ans: Normally the aim of a model building exercise is to maximise accuracy. However, we also need to be careful about how many False positive (FP) and False negative (FN) cases are generated by the model. An appropriate threshold value can be selected by plotting Sensitivity and Specificity against the cut off value and taking the intersection of these graphs as the optimal point. Raising the cut off lowers Sensitivity and raises Specificity, so neither can be maximised without hurting the other. At the point of intersection the two are equal, which balances FN against FP rather than favouring one type of error
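
The intersection approach can be sketched without a plot by scanning candidate cut off values and picking the one where Sensitivity and Specificity are closest. The labels, scores and function names below are made up for illustration.

```python
def sensitivity_specificity(y_true, scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def balanced_cutoff(y_true, scores, candidates):
    """Candidate cut off where sensitivity and specificity are closest."""
    def gap(cutoff):
        sens, spec = sensitivity_specificity(y_true, scores, cutoff)
        return abs(sens - spec)
    return min(candidates, key=gap)

# Illustrative labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.05, 0.22, 0.35, 0.45, 0.62, 0.41, 0.55, 0.72, 0.81, 0.95]
candidates = [i / 10 for i in range(1, 10)]

best = balanced_cutoff(y_true, scores, candidates)
print(best, sensitivity_specificity(y_true, scores, best))
```

For this toy data the two curves meet at a cut off of 0.5, where both Sensitivity and Specificity are 0.8. If one type of error is costlier than the other, a different cut off may be more appropriate.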


Q24:In a throw of a dice, if X is a Random Variable which denotes the face that appears on a throw, what is the probability distribution of X

Ans: It will be a discrete uniform distribution. For a fair dice, P[X=k] = 1/6 for each k = 1, 2, …, 6, so all the outcomes have the same probability.
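
A quick simulation shows the long-run frequencies settling near 1/6. This sketch assumes a fair dice and uses Python's `random` module; the number of throws is arbitrary.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

throws = [random.randint(1, 6) for _ in range(60000)]
counts = Counter(throws)

# Relative frequency of each face.
frequencies = {face: counts[face] / len(throws) for face in range(1, 7)}
for face, freq in frequencies.items():
    print(face, round(freq, 3))
```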


Q26:How do you interpret log log model in regression

Ans: The β coefficient obtained from a log log model can be interpreted as the elasticity of Y with respect to X

ln(Y) = β * ln(X) + u
ln(Y+ΔY) = β * ln(X+ΔX) + u

Subtracting the first equation from the second and using ln(1+z) ≈ z for small z:

ΔY/Y ≈ β * ΔX/X

β ≈ (ΔY/Y)/(ΔX/X)
β = Percentage change in Y/Percentage change in X
β = elasticity
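
A numeric check of the elasticity interpretation, assuming an exact relationship Y = a * X^β with no error term. The constant `a` and the value β = 1.5 are made up for illustration.

```python
def y_of_x(x, a=2.0, beta=1.5):
    """Constant-elasticity relationship: ln(y) = ln(a) + beta * ln(x)."""
    return a * x ** beta

x = 100.0
dx = 1.0  # a 1% increase in x

pct_change_x = dx / x
pct_change_y = (y_of_x(x + dx) - y_of_x(x)) / y_of_x(x)

# Ratio of percentage changes; close to beta for small changes in x.
elasticity = pct_change_y / pct_change_x
print(elasticity)
```

A 1% increase in X gives roughly a 1.5% increase in Y, so the ratio is close to β. The match is approximate because the derivation relies on small changes.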


Q27:How can we reduce overfitting in regression

Ans: There are three methods:

  • L1 regularization (Lasso regression): A penalty equal to lambda (λ) times the sum of the absolute values of the coefficients is added to the loss function. The coefficients are shrunk towards 0 and some can become exactly 0, which reduces overfitting and also acts as variable selection

  • L2 regularization (Ridge regression): A penalty equal to lambda (λ) times the sum of the squares of the coefficients is added to the loss function. The coefficients are shrunk towards 0 but are generally not reduced to exactly zero

  • Elastic Net: The penalty is a convex combination of the L1 and L2 penalties, so it combines the behaviour of Lasso and Ridge
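
The difference between the three penalties is easiest to see on a single coefficient. The sketch below assumes a simplified one-coefficient problem where b is the unpenalised (OLS) estimate and the loss is 0.5*(b - β)² plus the penalty. The function names and the `alpha` mixing parameter are my own; this is not how any particular library is implemented.

```python
def lasso_shrink(b, lam):
    """Minimiser of 0.5*(b - beta)**2 + lam*abs(beta) (soft-thresholding)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """Minimiser of 0.5*(b - beta)**2 + 0.5*lam*beta**2 (proportional shrinkage)."""
    return b / (1.0 + lam)

def elastic_net_shrink(b, lam, alpha):
    """Minimiser with penalty lam*(alpha*abs(beta) + 0.5*(1 - alpha)*beta**2)."""
    return lasso_shrink(b, lam * alpha) / (1.0 + lam * (1.0 - alpha))

ols_estimates = [3.0, 0.4, -1.2]  # illustrative unpenalised coefficients
lam = 0.5
print([lasso_shrink(b, lam) for b in ols_estimates])
print([ridge_shrink(b, lam) for b in ols_estimates])
print([elastic_net_shrink(b, lam, alpha=0.5) for b in ols_estimates])
```

Lasso sets the small coefficient (0.4) to exactly 0, Ridge only scales every coefficient down, and Elastic Net with alpha = 0.5 sits in between.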

