Data Science with Python Feature Engineering Techniques Questions and Answers

Question 1

A data scientist is preparing a dataset for a K-Nearest Neighbors (KNN) model. The dataset contains an 'age' feature (range 20-70) and an 'income' feature (range 30,000-250,000). Since KNN is a distance-based algorithm, what is the most appropriate feature scaling technique to apply and why?

Accepted Answer

Standardization (Z-score scaling), because it puts features on vastly different scales onto a comparable footing and is less distorted by outliers than Min-Max scaling.

Answer

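Standardization (Z-score scaling) rescales each feature to have a mean of 0 and a standard deviation of 1. This is crucial for distance-based algorithms like KNN, where features with larger scales (like 'income') would otherwise dominate the distance calculation. Standardization is generally less distorted by outliers than Min-Max scaling, which maps data to a fixed range (e.g., 0 to 1) defined entirely by the minimum and maximum, so a single extreme value can squeeze all other points into a narrow band. Note that the mean and standard deviation are still influenced by outliers, so standardization is more tolerant of them rather than immune. [2, 7, 24]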
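
As a minimal sketch of the effect described above, the snippet below standardizes a small, made-up age/income sample with NumPy and compares the Euclidean distance between two rows before and after scaling. The sample values are hypothetical; the population standard deviation (`ddof=0`) is used here, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical sample: columns are age (years) and income (currency units).
X = np.array([
    [20.0, 30_000.0],
    [35.0, 60_000.0],
    [50.0, 120_000.0],
    [70.0, 250_000.0],
])

# Z-score scaling: subtract each column's mean, divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

# Euclidean distance between the first two rows, before and after scaling.
raw_dist = np.linalg.norm(X[0] - X[1])
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance:    {raw_dist:.3f}")     # almost entirely the income gap
print(f"scaled distance: {scaled_dist:.3f}")  # age and income both contribute
```

Before scaling, the 15-year age gap is invisible next to the 30,000 income gap; after scaling, both features contribute on a comparable footing. In a real pipeline the mean and standard deviation should be computed on the training split only and reused for the test split.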

Question 2

You are working with a categorical feature 'product_category' which contains the values 'Electronics', 'Apparel', 'Home Goods', and 'Books'. There is no inherent order or ranking among these categories. Which encoding technique should be used to prepare this feature for a linear regression model to prevent the model from assuming a false ordinal relationship?

Accepted Answer

One-Hot Encoding, as it creates separate binary columns for each category, avoiding any implied ranking.

Answer

One-Hot Encoding is the correct choice for nominal categorical data (where no order exists) when used with linear models. It creates a new binary (0 or 1) feature for each category, preventing the model from incorrectly interpreting the categories as having a quantitative relationship. With integer labels, for example, the model could treat 'Books' (encoded as 3) as greater than 'Apparel' (encoded as 1). [19, 25, 26]
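
The encoding above can be sketched with `pandas.get_dummies`. The sample rows and the `cat` prefix are illustrative assumptions. The second call, using `drop_first=True`, is an optional variation often used with linear regression that includes an intercept, since the full set of dummy columns is perfectly collinear with the intercept.

```python
import pandas as pd

# Hypothetical sample rows for the 'product_category' feature.
df = pd.DataFrame(
    {"product_category": ["Electronics", "Apparel", "Home Goods", "Books", "Apparel"]}
)

# One binary column per category; exactly one column is 1 in each row.
encoded = pd.get_dummies(df, columns=["product_category"], prefix="cat", dtype=int)

# Optional variation: drop the first level to use it as the reference category.
encoded_ref = pd.get_dummies(
    df, columns=["product_category"], prefix="cat", drop_first=True, dtype=int
)

print(encoded)
print(encoded_ref.columns.tolist())
```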

Question 3

A dataset contains a 'last_login_date' column with a `datetime64[ns]` dtype. Which feature engineering approach is most effective for extracting cyclical patterns that could be useful for a predictive model?

Accepted Answer

Creating new numerical features such as 'day_of_week', 'month_of_year', and a binary 'is_weekend' flag.

Answer

Extracting components like the day of the week, month, or creating a flag for weekends allows a model to capture time-based patterns and seasonality (e.g., user activity might be higher on weekends or at the beginning of the month). This is a standard and highly effective technique for making datetime information useful to a model. [15, 18, 29]
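
A minimal sketch of this extraction using the pandas `.dt` accessor follows. The sample dates are made up. The sine/cosine columns at the end are an optional extension not covered by the answer above: they are one common way to tell a model that Sunday (6) and Monday (0) are adjacent rather than far apart.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of login timestamps (dtype datetime64[ns]).
df = pd.DataFrame(
    {"last_login_date": pd.to_datetime(
        ["2024-01-05", "2024-01-06", "2024-03-10", "2024-07-15"]
    )}
)

dt = df["last_login_date"].dt
df["day_of_week"] = dt.dayofweek                    # Monday=0 ... Sunday=6
df["month_of_year"] = dt.month                      # January=1 ... December=12
df["is_weekend"] = (dt.dayofweek >= 5).astype(int)  # 1 for Saturday/Sunday

# Optional extension (an assumption, not part of the answer above):
# map day_of_week onto a circle so the end of the week sits next to its start.
angle = 2 * np.pi * df["day_of_week"] / 7
df["dow_sin"] = np.sin(angle)
df["dow_cos"] = np.cos(angle)

print(df)
```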

Question 4

A feature representing 'customer_spending' in a dataset is heavily right-skewed, with most values being low but with a long tail of very high-spending customers. Many linear machine learning models perform better with normally distributed features. What is a common and effective transformation to apply to this feature to make its distribution more symmetric?

Accepted Answer

Applying a logarithmic transformation (e.g., `np.log1p`, which also handles zero values) to compress the high values far more than the low ones.

Answer

A logarithmic transformation is a powerful and common method for handling right-skewed data. It compresses the range of large values more than it compresses the range of small values, which effectively pulls the long tail in towards the center of the distribution, making it more symmetric and closer to a normal distribution. [1, 6, 14, 22]
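
The effect can be illustrated with a small sketch. The spending values are invented, and the `skewness` helper is a hypothetical function written here for illustration (a plain moment-based estimate), not a library call. `np.log1p(x)` computes log(1 + x), so zero spending maps to zero, and `np.expm1` reverses the transformation.

```python
import numpy as np

# Hypothetical right-skewed spending values, including a zero and a large outlier.
spending = np.array([0.0, 5.0, 12.0, 20.0, 35.0, 60.0, 150.0, 900.0, 12_000.0])

# log1p(x) = log(1 + x); defined at x = 0, unlike a plain log.
log_spending = np.log1p(spending)

# expm1 is the inverse, useful for mapping predictions back to the original scale.
restored = np.expm1(log_spending)


def skewness(x):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


print(f"skewness before: {skewness(spending):.2f}")
print(f"skewness after:  {skewness(log_spending):.2f}")
```

The transformed values are noticeably less skewed, although a log transform does not guarantee a normal distribution. It also requires values greater than -1, so it suits non-negative quantities such as spending.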

Question 5

What is a primary advantage of using binning (or discretization) to transform a continuous numerical feature, such as 'Age', into categorical bins (e.g., '18-25', '26-40', '41-60')?

Accepted Answer

It can help a linear model capture non-linear relationships between the feature and the target variable.

Answer

By converting a continuous feature into discrete bins, a model (especially a linear one) can learn a separate weight for each bin. This allows it to capture complex, non-linear patterns where the effect of the feature on the target variable changes across different ranges (e.g., the likelihood of purchasing a product might be high for the '18-25' age group, low for '26-40', and high again for '41-60'). [5, 8, 11, 16]
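
A minimal sketch of this binning with `pandas.cut` follows, using the three age ranges from the question. The sample ages and the exact bin edges are assumptions chosen so that the right-inclusive intervals (17, 25], (25, 40] and (40, 60] line up with the labels. One-hot encoding the bins afterwards is what lets a linear model learn a separate weight per bin.

```python
import pandas as pd

# Hypothetical sample of ages.
ages = pd.Series([18, 22, 25, 26, 33, 40, 41, 55, 60], name="Age")

# pd.cut uses right-inclusive intervals by default: (17, 25], (25, 40], (40, 60].
bins = [17, 25, 40, 60]
labels = ["18-25", "26-40", "41-60"]
age_group = pd.cut(ages, bins=bins, labels=labels)

# One binary column per bin, so a linear model can fit one weight per age range.
age_dummies = pd.get_dummies(age_group, prefix="age", dtype=int)

print(pd.concat([ages, age_group.rename("age_group"), age_dummies], axis=1))
```

Ages outside the outermost edges would become missing values, so in practice the edges should cover the full observed range.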

Data Science with Python Certification Practice Test