MS-DS Master of Data Science Data Wrangling and Preprocessing Questions and Answers

Question 1

A data scientist is working with a dataset containing customer information, including an 'Income' feature with a significant number of missing values. The distribution of the 'Income' feature is heavily right-skewed. Which of the following methods for handling missing data is most appropriate in this situation to minimize the impact of outliers?

Accepted Answer

Imputing the missing values with the median of the 'Income' feature.

Answer

For skewed distributions, the median is a more robust measure of central tendency than the mean because it is less affected by outliers. Deleting rows with missing values could discard a large share of the data, and replacing missing values with zero would introduce bias and distort the feature's distribution.
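A minimal sketch of median imputation, using only the Python standard library. The income values are made up for illustration, and None stands in for a missing value; a real pipeline would more likely use a dataframe library.

```python
from statistics import mean, median

# Hypothetical right-skewed incomes; None marks a missing value.
incomes = [32_000, 41_000, None, 38_000, 45_000, None, 1_200_000]

# Compute statistics on the observed values only.
observed = [x for x in incomes if x is not None]
income_median = median(observed)  # middle of the sorted observed values
income_mean = mean(observed)      # pulled upward by the single large income

# Fill each missing entry with the median.
imputed = [income_median if x is None else x for x in incomes]

print(income_median)  # 41000
print(income_mean)    # 271200
print(imputed)
```

In this toy data the mean is more than six times the median because of one extreme value, which is why filling the gaps with the mean would place the imputed customers far above a typical income.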

Question 2

You are preparing a dataset for a K-Nearest Neighbors (KNN) algorithm. The dataset has features with vastly different scales: 'Age' (ranging from 20 to 70) and 'Salary' (ranging from 30,000 to 150,000). Why is feature scaling a critical step in this scenario?

Accepted Answer

It ensures that features with larger ranges do not disproportionately influence the distance calculations.

Answer

K-Nearest Neighbors is a distance-based algorithm. If features are not scaled, the 'Salary' feature, with its much larger range, would dominate the distance calculation, making the 'Age' feature almost irrelevant. Feature scaling, such as standardization or normalization, ensures all features contribute more equally.
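The effect can be seen in a small sketch that compares Euclidean distances before and after Min-Max scaling. The feature ranges come from the question; the three people and the helper names (min_max, scale) are hypothetical.

```python
import math

# Feature ranges taken from the question.
AGE_MIN, AGE_MAX = 20, 70
SAL_MIN, SAL_MAX = 30_000, 150_000

def min_max(value, lo, hi):
    """Rescale value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

def scale(person):
    age, salary = person
    return (min_max(age, AGE_MIN, AGE_MAX), min_max(salary, SAL_MIN, SAL_MAX))

# Made-up (age, salary) points.
query = (25, 50_000)
same_age = (26, 56_000)     # nearly the same age, salary 6,000 higher
same_salary = (65, 50_500)  # 40 years older, salary only 500 higher

candidates = [same_age, same_salary]

# Nearest neighbour on raw features: the salary gap dominates the distance.
raw_nearest = min(candidates, key=lambda p: math.dist(query, p))

# Nearest neighbour after scaling: both features contribute comparably.
scaled_nearest = min(candidates, key=lambda p: math.dist(scale(query), scale(p)))

print(raw_nearest)     # (65, 50500)
print(scaled_nearest)  # (26, 56000)
```

On the raw features, the person who is 40 years older is considered the nearest neighbour simply because the salary difference is smaller in absolute terms. After scaling, the neighbour choice flips to the person of nearly the same age.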

Question 3

Which of the following data wrangling tasks involves creating new variables from existing ones to better represent the underlying patterns in the data for a machine learning model?

Accepted Answer

Feature Engineering

Answer

Feature engineering is the process of using domain knowledge to create new features (variables) from the raw data that make machine learning algorithms work better. Data cleaning deals with errors and missing values, integration combines data sources, and validation checks data quality.
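As an illustration, the sketch below derives two new features from existing fields. The customer records, field names, and reference date are all hypothetical, and the code assumes every customer has at least one order.

```python
from datetime import date

# Hypothetical raw customer records.
customers = [
    {"total_spend": 1200.0, "num_orders": 8, "signup": date(2022, 1, 15)},
    {"total_spend": 90.0, "num_orders": 3, "signup": date(2023, 11, 1)},
]

# Assumed reference date for measuring customer tenure.
AS_OF = date(2024, 1, 1)

def engineer(row):
    """Return a copy of row with two derived features added."""
    out = dict(row)
    # Ratio feature: average amount spent per order.
    out["avg_order_value"] = row["total_spend"] / row["num_orders"]
    # Date-derived feature: days since signup as of the reference date.
    out["tenure_days"] = (AS_OF - row["signup"]).days
    return out

features = [engineer(c) for c in customers]

for f in features:
    print(f["avg_order_value"], f["tenure_days"])
```

Neither derived value is new information, but a model may pick up a pattern such as "long-tenure customers with high average order value" more easily from these columns than from the raw spend, order count, and date.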

Question 4

A dataset contains a categorical feature 'Education Level' with the values: 'High School', 'Bachelor's', 'Master's', 'PhD'. This is an example of what type of data, and which encoding scheme would be most appropriate to preserve its inherent order?

Accepted Answer

Ordinal data; Label Encoding

Answer

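The 'Education Level' feature is ordinal because its categories have a meaningful, inherent order. Label Encoding is suitable here because it assigns a unique integer to each category (e.g., 0, 1, 2, 3), which preserves this ranking as long as the integers are assigned in the order of the categories ('High School' = 0 through 'PhD' = 3). Some tools, such as scikit-learn's LabelEncoder, assign integers in sorted (alphabetical) order instead, so the mapping should be specified explicitly. One-Hot Encoding would treat each category as independent, losing the ordinal relationship.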
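A short sketch of integer encoding with an explicit order, contrasted with an alphabetical assignment. The variable names are illustrative, and the sample column is made up.

```python
# Category order taken from the question, lowest to highest.
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]

# Explicit mapping: the integer reflects the rank of each level.
rank = {level: i for i, level in enumerate(EDUCATION_ORDER)}

# Hypothetical column of values to encode.
levels = ["Master's", "High School", "PhD", "Bachelor's"]
encoded = [rank[x] for x in levels]
print(encoded)  # [2, 0, 3, 1]

# For contrast: integers assigned in alphabetical order.
alphabetical = {level: i for i, level in enumerate(sorted(EDUCATION_ORDER))}
print(alphabetical["Bachelor's"], alphabetical["High School"])  # 0 1
```

With the alphabetical assignment, "Bachelor's" receives a lower code than 'High School', so the integers no longer reflect the educational ranking. Supplying the order explicitly avoids this.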

Question 5

In the context of data preprocessing, what is the primary difference between Standardization and Normalization (Min-Max Scaling)?

Accepted Answer

Standardization (Z-score) transforms data to have a mean of 0 and a standard deviation of 1, while Normalization scales data to a specific range (e.g., [0, 1]).

Answer

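Standardization rescales data to have a mean of zero and a standard deviation of one (Z-score). Normalization, specifically Min-Max scaling, rescales the data to a fixed range, usually 0 to 1. Standardization does not bound values to a specific range, so standardized values can fall below -1 or above 1; this matters for algorithms that expect bounded inputs. Min-Max scaling, in turn, is more sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses the remaining values.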
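Both transformations can be written in a few lines. This is a sketch on a made-up list of values; it uses the population standard deviation (pstdev), which is an assumption, since some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

# Made-up values with mean 5 and population standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Standardization (Z-score): subtract the mean, divide by the std deviation.
mu = mean(values)
sigma = pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

# Min-Max normalization: map the minimum to 0 and the maximum to 1.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

print(z_scores)  # centred on 0, not confined to a fixed range
print(min_max)   # always within [0, 1]
```

Here the largest value maps to a Z-score of 2 but to exactly 1 under Min-Max scaling, which shows the difference in range between the two methods.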
