Free Data Science Unsupervised Learning Techniques Questions and Answers

Question 1

A data scientist is tasked with reducing the dimensionality of a large dataset containing highly correlated features. The goal is to create a smaller set of new, uncorrelated features that capture the maximum possible variance from the original data. Which unsupervised learning technique is most suitable for this purpose?

Accepted Answer

Principal Component Analysis (PCA)

Answer

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a large set of variables into a smaller one that still retains most of the information. It constructs new, mutually uncorrelated variables called principal components, ordered so that the first captures the largest share of the variance in the original data and each later component captures the largest share of what remains.
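
The idea can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names (`covariance`, `dominant_eigenpair`, `pca`) and the sample data are assumptions made for this example, and the eigenvectors are found with power iteration plus deflation as a simple stand-in for the eigendecomposition a numerical library would normally provide.

```python
import math

def covariance(rows):
    """Return (sample covariance matrix, mean-centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centered

def dominant_eigenpair(matrix, iterations=500):
    """Power iteration on a symmetric matrix: (eigenvalue, unit eigenvector)."""
    d = len(matrix)
    # Assumed starting vector; it must not be orthogonal to the dominant direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
    return sum(x * y for x, y in zip(v, mv)), v

def pca(rows, n_components):
    """Return (components, explained variance ratios, projected scores)."""
    cov, centered = covariance(rows)
    total_variance = sum(cov[a][a] for a in range(len(cov)))
    components, explained = [], []
    for _ in range(n_components):
        value, vector = dominant_eigenpair(cov)
        components.append(vector)
        explained.append(value / total_variance)
        # Deflation: remove the direction just found so the next pass finds the next one.
        cov = [[cov[a][b] - value * vector[a] * vector[b] for b in range(len(cov))]
               for a in range(len(cov))]
    scores = [[sum(x * w for x, w in zip(c, comp)) for comp in components]
              for c in centered]
    return components, explained, scores

# Two highly correlated features (illustrative data).
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
components, explained, scores = pca(data, n_components=2)
print("explained variance ratios:", explained)
```

On this data almost all of the variance falls on the first component, so keeping only that component would halve the number of features while losing very little information.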

Question 2

A retail company wants to analyze customer transaction data to discover which products are frequently purchased together. This information will be used for product placement and promotional strategies. Which of the following algorithms is designed for this type of market basket analysis?

Accepted Answer

Apriori Algorithm

Answer

Apriori is a classic algorithm for association rule mining. It identifies the frequent itemsets in a dataset and generates association rules from them, which makes it well suited to market basket analysis, where the goal is to discover relationships between products in transactional data.
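
A small sketch of the two stages, frequent itemset mining followed by rule generation, might look like this. The function names, the thresholds, and the five sample baskets are assumptions chosen for illustration; support is taken as the fraction of transactions containing an itemset, and confidence as the support of the whole itemset divided by the support of the antecedent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}

    frequent, k = {}, 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
    return frequent

def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules

# Illustrative baskets, not real data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.6)
for antecedent, consequent, sup, conf in association_rules(frequent, min_confidence=0.8):
    print(sorted(antecedent), "->", sorted(consequent), sup, conf)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.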

Question 3

A data science team is working with a dataset that has complex, non-spherical clusters and contains a significant amount of noise. The team does not know the number of clusters in advance. Which clustering algorithm would be the most effective choice for this scenario?

Accepted Answer

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Answer

DBSCAN is a density-based clustering algorithm that excels at finding arbitrarily shaped clusters and identifying noise points. Unlike K-Means, it does not require the number of clusters to be specified beforehand, making it suitable for exploratory analysis on complex datasets with outliers.
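
The behaviour described above can be illustrated with a compact, brute-force sketch. The names `dbscan`, `eps`, `min_samples`, and `NOISE` are choices made for this example (the neighborhood radius and the minimum neighbor count are the two parameters the algorithm needs in place of a cluster count), and a point is assumed to count as its own neighbor. Real implementations typically use a spatial index rather than scanning every point.

```python
import math

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Brute-force region query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Illustrative data: an elongated chain, a small square, and one outlier.
points = [(float(x), 0.0) for x in range(10)]
points += [(20.0, 20.0), (20.0, 21.0), (21.0, 20.0), (21.0, 21.0)]
points += [(50.0, 50.0)]
labels = dbscan(points, eps=1.1, min_samples=3)
print(labels)
```

The chain of ten points is far from spherical, yet it ends up as a single cluster because each point is density-reachable from its neighbor, while the isolated point is labeled as noise.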

Question 4

Which of the following is a primary disadvantage of using hierarchical clustering algorithms, especially the agglomerative type, compared to K-Means clustering?

Accepted Answer

It is computationally expensive and has high memory requirements for large datasets.

Answer

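Agglomerative hierarchical clustering is considerably more expensive than K-Means. A naive implementation takes O(n^3) time and O(n^2) memory, because it stores and repeatedly scans the full pairwise distance matrix; optimized variants reach about O(n^2) time for some linkage criteria. K-Means costs roughly O(n * k * d) per iteration for n points, k clusters, and d features, which is linear in n when k and d are fixed. Hierarchical methods are therefore much less scalable for large datasets.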
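
Some rough arithmetic makes the quadratic memory cost concrete. The helper names and the figures below (100,000 points, 8-byte floats, 10 clusters, 100 iterations) are assumptions picked for illustration, and the K-Means count ignores the per-distance cost that grows with the number of features.

```python
def pairwise_distance_bytes(n, bytes_per_value=8):
    """Memory for the n*(n-1)/2 unique pairwise distances among n points."""
    return n * (n - 1) // 2 * bytes_per_value

def kmeans_distance_evaluations(n, k, iterations):
    """Point-to-centroid distance computations over a whole K-Means run."""
    return n * k * iterations

n = 100_000
print("hierarchical distance storage (GB):", pairwise_distance_bytes(n) / 1e9)
print("K-Means distance evaluations:", kmeans_distance_evaluations(n, k=10, iterations=100))
```

Under these assumptions the distance matrix alone needs about 40 GB, while K-Means only ever holds the data and the k centroids in memory.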

Question 5

A data analyst is using an agglomerative hierarchical clustering algorithm and needs to decide how to measure the distance between clusters. They choose a method where the distance between two clusters is defined as the distance between the two closest points in the different clusters. What is this linkage criterion called?

Accepted Answer

Single Linkage

Answer

Single Linkage defines the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster. It is also known as the nearest-neighbor technique.
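
As a minimal sketch, the criterion is a single `min` over all cross-cluster point pairs. The function names and sample points are illustrative assumptions, and the merge loop is the naive version that rescans every pair of clusters on each step.

```python
import math
from itertools import combinations

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points of the two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points, n_clusters, linkage=single_linkage):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters

# Closest pair across the two clusters is (1, 0) and (3, 0).
print(single_linkage([(0, 0), (1, 0)], [(3, 0), (10, 0)]))
print(agglomerate([(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)], n_clusters=2))
```

Because only the nearest pair of points matters, single linkage tends to follow chains of closely spaced points, which lets it recover elongated clusters but also makes it sensitive to stray points that bridge two groups.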
