Machine Learning Practice Test
We set the gradient to zero to obtain the minimum or maximum of a function because:
At a local maximum or minimum of a differentiable multivariable function, the gradient is the zero vector: every partial derivative vanishes, so there is no direction of ascent or descent. Setting the gradient to zero therefore locates the candidate stationary points of the function.
Which of the following machine learning algorithms is based on the principle of bagging and is extensively used and effective?
The Random Forest algorithm builds an ensemble of Decision Trees, mostly trained with the bagging method.
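A minimal sketch of this idea using scikit-learn; the dataset and hyperparameters are illustrative, not part of the original question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy classification dataset (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is trained on a bootstrap sample of the data (bagging);
# predictions are combined by majority vote across the ensemble.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_tr, y_tr)
accuracy = forest.score(X_te, y_te)
```

Because each tree sees a different bootstrap sample, their errors are partly decorrelated, which is what makes the averaged ensemble more robust than a single tree.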
Which of the following is a good characteristic of a test dataset?
A good test dataset has a sufficiently large sample and class ratios that are representative of the population.
The following are the most regularly used metrics and tools for evaluating a classification model:
The model performance assessment for classification algorithms incorporates all of the above techniques.
What is the purpose of cross-validation?
Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set.
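A short sketch of k-fold cross-validation with scikit-learn; the model and dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds, each fold
# serves once as the held-out set, so the scores estimate how the model
# generalizes to independent data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

Averaging the per-fold scores gives a more stable performance estimate than a single train/test split.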
How do you deal with data in a dataset that is missing or corrupted?
Explanation: All of the above techniques are different ways of imputing the missing values.
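One common imputation technique, sketched with scikit-learn's `SimpleImputer` (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A small matrix with one missing entry in column 0.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Mean imputation: each missing entry is replaced by its column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
# The missing value in column 0 becomes (1 + 7) / 2 = 4.
```

`strategy` can also be `"median"` or `"most_frequent"`, matching the other imputation options the question alludes to.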
A disadvantage of decision trees is which of the following?
Allowing a decision tree to split to a very granular degree lets it learn every training point, up to perfect classification of the training set; this is overfitting.
Which of the following is the correct technique to preprocess data before performing regression or classification?
You should always normalize the data first. Otherwise, PCA and other dimensionality-reduction techniques will give different, scale-dependent results.
Why is it necessary to use second-order differencing in a time series?
If the second-order difference is positive, the time series will curve upward and if it is negative, the time series will curve downward at that time.
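A small illustration of the point, using a quadratic toy series (the numbers are illustrative):

```python
# Illustrative series with a quadratic trend: y_t = t**2.
series = [t ** 2 for t in range(8)]

def diff(xs):
    # First-order difference: xs[t] - xs[t-1]
    return [b - a for a, b in zip(xs, xs[1:])]

first = diff(series)    # still trending (linear in t)
second = diff(first)    # constant and positive: the series curves upward
```

One round of differencing removes a linear trend, but a quadratic trend only becomes stationary after second-order differencing, which is why it is needed for such series.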
In Sklearn, what is pca.components_?
pca.components_ is the set of all eigenvectors for the projection space.
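A quick sketch showing where `pca.components_` comes from; the random data is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Standardizing first keeps PCA from being dominated by feature scale.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# pca.components_ holds one eigenvector (principal axis) per row:
# shape is (n_components, n_features).
print(pca.components_.shape)  # (2, 4)
```

Each row is a unit-length direction in feature space; projecting the data onto these rows yields the principal components.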
Which of the following is a feature extraction example?
All of the above techniques transform raw data into features which can be used as inputs to machine learning algorithms.
Which of the following regularization statements is incorrect?
A large regularization coefficient results in a large regularization penalty and, therefore, a strong preference for simpler models, which can underfit the data.
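The shrinkage effect of a large penalty can be seen directly with ridge regression; the data and the two `alpha` values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# A larger alpha means a larger penalty on coefficient size,
# shrinking the model toward a simpler (flatter) fit.
small = Ridge(alpha=0.01).fit(X, y)
large = Ridge(alpha=1000.0).fit(X, y)
```

With `alpha=1000` the coefficients are pushed toward zero, which is exactly the "strong preference for simpler models" the answer describes; taken too far, the shrunken model underfits.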
Which of the following statements about Naive Bayes is correct?
Naive Bayes assumes that all the features in a data set are equally important and independent.
Which of the following scenarios will K-means clustering fail to produce satisfactory results? 1) Outliers in the data 2) Data points of various densities 3) Nonconvex data points
The K-means clustering algorithm fails to give good results when the data contains outliers, when the density of data points differs across the data space, or when the clusters have nonconvex shapes.
In text mining, which of the following approaches can be used for normalization?
Lemmatization and stemming are the techniques of keyword normalization.
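A minimal rule-based sketch of stemming. Real systems use tools such as NLTK's `PorterStemmer` and `WordNetLemmatizer`; the suffix rules below are purely illustrative:

```python
def crude_stem(word):
    # Illustrative suffix stripping -- not a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["running", "jumped", "cats"]
stems = [crude_stem(w) for w in words]  # ['runn', 'jump', 'cat']
```

Stemming chops suffixes mechanically (hence the non-word "runn"), while lemmatization maps a word to its dictionary form ("running" to "run") using vocabulary and morphology.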
How can a clustering algorithm avoid becoming caught in a bad local optima?
The K-means clustering algorithm has the drawback of converging at local minima, which can be mitigated by using multiple random initializations.
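In scikit-learn this is controlled by the `n_init` parameter; the blob dataset below is illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init=10 runs k-means from 10 random centroid initializations and
# keeps the run with the lowest inertia, reducing the chance of
# getting stuck in a bad local minimum.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```

Only the best of the ten runs (by within-cluster sum of squares) is returned, so a single unlucky initialization cannot determine the final clustering.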
After 15 iterations of gradient descent with α = 0.3, you compute J(θ). You notice that J(θ) rapidly falls before leveling out. Which of the following conclusions is most likely based on this information?
You want gradient descent to converge quickly to the minimum, so the current setting of α appears to be good.
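A toy run that reproduces this behavior; the cost function J(θ) = θ² and starting point are illustrative:

```python
# Gradient descent on J(theta) = theta**2 with alpha = 0.3.
alpha = 0.3
theta = 5.0
costs = []
for _ in range(15):
    grad = 2 * theta          # dJ/dtheta
    theta -= alpha * grad     # descent step
    costs.append(theta ** 2)  # J falls rapidly, then levels out near 0
```

The cost drops steeply in the first few iterations and then flattens as θ approaches the minimum; a learning rate that is too large would instead make J oscillate or diverge, and one that is too small would make J fall very slowly.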
Which of the following is an appropriate method for determining "k" main components?
Choose the smallest k whose cumulative explained variance reaches a chosen threshold (for example, by inspecting a scree plot). This will maintain the structure of the data while also reducing its dimension.
What is the purpose of a sentence parser?
Sentence parsers analyze a sentence and automatically build a syntax tree.
Using the automated machine learning user interface (UI), you create a machine learning model. You must guarantee that the model complies with Microsoft's transparent AI philosophy. What are your options?
Model explainability. Most businesses run on trust, and being able to open the ML “black box” helps build transparency and trust. In heavily regulated industries like healthcare and banking, it is critical to comply with regulations and best practices. One key aspect of this is understanding the relationship between input variables (features) and model output. Knowing both the magnitude and direction of the impact each feature has on the predicted value (feature importance) helps you better understand and explain the model. With model explainability, you can view feature importance as part of automated ML runs.
Different binary classification models are being evaluated by a Data Scientist. A false positive result is 5 times more
expensive than a false negative result (from a commercial standpoint).
The following criteria should be used to evaluate the models:
1) Must have a recall rate of at least 80%
2) Must have a false positive rate of 10% or less
3) Must minimize business costs
The Data Scientist creates the matching confusion matrix once each binary classification model is created.
Which confusion matrix best describes the model that meets the criteria?
The following calculations are required:
TP = True Positive
FP = False Positive
FN = False Negative
TN = True Negative
Recall = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
Cost = 5 * FP + FN
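Plugging a hypothetical confusion matrix (not the actual counts from the exam) into these formulas:

```python
# Hypothetical confusion-matrix counts, chosen for illustration.
TP, FP, FN, TN = 85, 90, 15, 810

recall = TP / (TP + FN)   # 85 / 100 = 0.85 -> meets the >= 80% criterion
fpr = FP / (FP + TN)      # 90 / 900 = 0.10 -> meets the <= 10% criterion
cost = 5 * FP + FN        # 5 * 90 + 15 = 465
```

Among the candidate models, the one that satisfies both the recall and FPR constraints with the lowest `cost` value is the correct choice.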
A Machine Learning Engineer uses the Amazon SageMaker Linear Learner algorithm to prepare a data frame for a
supervised learning task. The ML Engineer notes that the target label classes are unbalanced, and that several feature
columns have missing data. The percentage of missing values is less than 5% for the full data frame.
What should the machine learning engineer do to reduce bias caused by missing values?
Use supervised learning to predict missing values based on the values of other features. Different supervised learning approaches might have different performances, but any properly implemented supervised learning approach should provide the same or better approximation than mean or median approximation, as proposed in responses A and B. Supervised learning applied to the imputation of missing values is an active field of research.
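A minimal sketch of supervised imputation: predict the missing entries of one column from the other features. The data, the exact linear relation, and the choice of `LinearRegression` are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 2 * X[:, 0] - X[:, 1]      # column 2 depends on the other features
mask = rng.random(200) < 0.05        # ~5% of rows missing in column 2
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Train a regressor on rows where the column is observed ...
observed = ~mask
model = LinearRegression().fit(X_missing[observed, :2], X_missing[observed, 2])
# ... and predict the missing entries from the other features.
X_missing[mask, 2] = model.predict(X_missing[mask, :2])
```

Because the model exploits correlations between features, its imputations track the data much more closely than a single constant such as the column mean or median.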
A business wants to develop a fraud detection model. Due to the limited number of fraud incidents, the Data Scientist
currently does not have enough information.
Which strategy is the MOST LIKELY to catch the MOST genuine fraud cases?
When the minority class is under-represented, the Synthetic Minority Over-sampling Technique (SMOTE) adds new information by generating synthetic data points for the minority class. This technique would be the most effective in this scenario.
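A bare-bones sketch of the SMOTE idea: synthesize new minority points by interpolating between existing ones. Real SMOTE interpolates between a sample and its k nearest minority neighbors (see `imblearn.over_sampling.SMOTE`); the pairing below is a simplification, and the points are illustrative:

```python
import random

random.seed(0)
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]  # toy minority-class points

def smote_like(points, n_new):
    # A synthetic point lies on the segment between two minority samples.
    synthetic = []
    for _ in range(n_new):
        a, b = random.sample(points, 2)
        t = random.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

new_points = smote_like(minority, 5)
```

The interpolated points stay inside the region occupied by the minority class, so the classifier sees more fraud-like examples without simply duplicating the few real ones.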
A fraud detection model is built using logistic regression by a Data Scientist. While the algorithm's accuracy is 99 percent,
the model fails to detect 90 percent of fraud incidents.
What activity will ensure that the model is able to detect more than 10% of fraud cases?
Decreasing the class probability threshold makes the model more sensitive and, therefore, marks more cases as the positive class, which is fraud in this case. This will increase the likelihood of fraud detection. However, it comes at the price of lowering precision.
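The effect of lowering the threshold can be shown with a handful of illustrative predicted probabilities:

```python
# Predicted fraud probabilities for ten cases (illustrative numbers).
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]

flag_default = [p >= 0.5 for p in probs]   # default threshold 0.5
flag_lowered = [p >= 0.3 for p in probs]   # lowered threshold 0.3

# Lowering the threshold flags more cases as fraud: higher recall,
# at the price of lower precision.
print(sum(flag_default), sum(flag_lowered))  # 4 7
```

The three extra flagged cases may include more true frauds (raising recall) but also more false alarms, which is the precision trade-off the answer describes.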
In Amazon S3, a Machine Learning team has numerous huge CSV datasets. On similar-sized datasets, models
developed with the Amazon SageMaker Linear Learner algorithm have previously taken hours to train. The
training process must be accelerated by the team's leaders.
What can a Machine Learning Expert do to help with this issue?
Amazon SageMaker Pipe mode streams the data directly to the container, which improves the performance of training jobs. In Pipe mode, your training job streams data directly from Amazon S3. Streaming can provide faster start times for training jobs and better throughput. With Pipe mode, you also reduce the size of the Amazon EBS volumes for your training instances. A would not apply in this scenario. C transforms the data structure. D is a streaming ingestion solution, but is not applicable in this scenario.