A data scientist is working on a binary classification problem to predict customer churn. The dataset is highly imbalanced. After training a Random Forest and a Gradient Boosting model, they find that the Gradient Boosting model has higher accuracy. Which of the following is the most likely reason for this outcome?
- A. Random Forest builds trees in parallel, which is less effective on imbalanced data than the sequential approach of Gradient Boosting.
- B. Gradient Boosting builds trees sequentially, with each tree focusing on correcting the errors of the previous one, which can be particularly effective for the minority class.
- C. Random Forest is inherently a regression algorithm and is less suited for classification tasks.
- D. Gradient Boosting is less prone to overfitting than Random Forest, especially on noisy datasets.
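The scenario above can be reproduced in code. The sketch below is a minimal, hypothetical setup using scikit-learn with a synthetic imbalanced dataset standing in for the churn data (the class ratio, sample size, and model hyperparameters are illustrative assumptions, not taken from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary dataset (~5% positive class) as a
# stand-in for customer-churn data; the 95/5 split is an assumption.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # On imbalanced data, raw accuracy can look high even for a model
    # that ignores the minority class, so minority-class F1 is printed
    # alongside it as a companion metric.
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, "
          f"minority-class F1={f1_score(y_test, pred):.3f}")
```

Comparing minority-class F1 as well as accuracy helps confirm whether boosting's sequential error-correction (option B) is genuinely helping on the rare class, rather than the accuracy gap being an artifact of the imbalance.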