FREE Master of Data Science Natural Language Processing Questions and Answers
Which grammar-based text parsing techniques can be used for noun phrase detection, verb phrase detection, subject detection, and object detection when working with text data from structured news sentences?
Dependency parsing and constituency parsing are used to extract these relationships from the text.
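As a rough illustration (assuming spaCy and its en_core_web_sm model are installed), noun phrases can be read off as noun chunks and subjects/objects from dependency labels such as nsubj and dobj; the sentence below is an illustrative placeholder:

# A minimal sketch of noun-phrase and subject/object detection with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee approved the new budget on Friday.")

# Noun phrases come from the noun chunks
print([chunk.text for chunk in doc.noun_chunks])

# Subjects and objects come from the dependency parse
for token in doc:
    if token.dep_ in ("nsubj", "nsubjpass"):
        print("subject:", token.text)
    elif token.dep_ in ("dobj", "pobj"):
        print("object:", token.text)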
What do alpha and beta hyperparameters in the Latent Dirichlet Allocation model for text categorization represent?
In the Latent Dirichlet Allocation (LDA) model used for text categorization, the two hyperparameters are interpreted as follows. Alpha (α) represents the density of topics generated within documents: it controls the topic mixture of individual documents, so a higher alpha encourages documents to contain a more diverse mixture of topics, while a lower alpha makes documents focus on fewer dominant topics. Beta (β) represents the density of terms generated within topics: it controls the word mixture within each topic, so a higher beta encourages topics to contain a more diverse set of words, while a lower beta makes topics focus on a few dominant words. So the correct statement is: alpha is the density of topics generated within documents, and beta is the density of terms generated within topics - True
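As a rough sketch, these hyperparameters can be set explicitly when training LDA with gensim, where beta is exposed under the name eta (the toy corpus below is an illustrative placeholder):

# A minimal sketch of setting alpha and beta (eta) when training LDA with gensim.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cricket", "match", "score"], ["election", "vote", "policy"],
         ["match", "team", "win"], ["policy", "government", "vote"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Low alpha -> documents dominated by few topics; low eta -> topics dominated by few words.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha=0.1, eta=0.01, passes=10, random_state=0)
print(lda.print_topics())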
While developing a machine learning model on text data, you generated a document term matrix from input data of 100K documents. Which of the following techniques can be used to reduce the dimensions of the data?
1. Latent Dirichlet Allocation
2. Latent Semantic Indexing
3. Keyword Normalization
Any of these techniques can be used to reduce the dimensionality of the data.
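For example, Latent Semantic Indexing can be sketched with scikit-learn's TruncatedSVD applied to a TF-IDF document-term matrix (the documents below are illustrative placeholders):

# A minimal sketch of dimensionality reduction via LSI (TruncatedSVD on TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["stock markets fell sharply today",
        "the team won the cricket match",
        "markets recovered after the announcement",
        "a thrilling match between two strong teams"]

tfidf = TfidfVectorizer().fit_transform(docs)        # sparse document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent dimensions
reduced = svd.fit_transform(tfidf)
print(tfidf.shape, "->", reduced.shape)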
You treated each tweet as a separate document when creating a document term matrix from the data. Which of the following statements about the document term matrix is true?
1. Removal of stopwords from the data will affect the dimensionality of the data
2. Normalization of words in the data will reduce the dimensionality of the data
3. Converting all the words to lowercase will not affect the dimensionality of the data
The two most widely used approaches for building chatbots are retrieval-based models and generative models. Which of the following is an example of a retrieval-based model and of a generative model, respectively?
Both retrieval-based and generative models have their strengths and weaknesses. Retrieval-based models can provide accurate and controlled responses since they rely on predefined rules, but they may be limited in handling unseen or out-of-context queries. Generative models, on the other hand, can produce more diverse and contextually appropriate responses, but they require more data and may sometimes generate incorrect or irrelevant responses. Each approach is suitable for different chatbot applications based on the desired level of flexibility and control.
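A minimal sketch of the retrieval-based idea, assuming a small hand-written set of question/answer pairs, is to return the canned response whose stored question is most similar to the user's query:

# A toy retrieval-based chatbot: pick the canned answer whose question
# is most similar to the query (TF-IDF + cosine similarity). The FAQ pairs are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = {
    "what are your opening hours": "We are open 9am to 5pm, Monday to Friday.",
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "where is my order": "You can track your order from the Orders page.",
}
questions = list(faq)
vectorizer = TfidfVectorizer().fit(questions)
question_vectors = vectorizer.transform(questions)

def reply(query):
    scores = cosine_similarity(vectorizer.transform([query]), question_vectors)
    return faq[questions[scores.argmax()]]

print(reply("what are your hours"))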
Which of the following features can be utilized to enhance a classification model's accuracy?
For the document term matrix question above, statements 1 and 2 are correct: removing stopwords reduces the number of features in the matrix, and normalizing words removes redundant features. Converting all words to lowercase also reduces the dimensionality, so statement 3 is false.
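A quick way to see the effect on dimensionality is to compare vocabulary sizes with and without lowercasing and stop-word removal (the tweets below are illustrative):

# A minimal sketch: preprocessing shrinks the columns of the document-term matrix.
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["The Match was GREAT", "the match was great fun", "Great match, great fun"]

raw = CountVectorizer(lowercase=False, token_pattern=r"\b\w+\b")
clean = CountVectorizer(lowercase=True, stop_words="english", token_pattern=r"\b\w+\b")

print(len(raw.fit(tweets).vocabulary_))    # larger vocabulary
print(len(clean.fit(tweets).vocabulary_))  # smaller vocabulary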
Which regular expression from the list below can be used to find the date(s) in the following text object? "The next data science meetup will take place on September 21 of this year; the previous one was on March 31, 2016."
None of the listed expressions would be able to identify the dates in this text object.
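For illustration only (this pattern is not one of the original answer options), a regular expression that would match both dates in the text could look like this:

# An illustrative pattern matching "September 21" and "March 31, 2016".
import re

text = ("The next data science meetup will take place on September 21 of this year; "
        "the previous one was on March 31, 2016.")
pattern = (r"(January|February|March|April|May|June|July|August|"
           r"September|October|November|December)\s\d{1,2}(,\s\d{4})?")
print([m.group(0) for m in re.finditer(pattern, text)])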
A combination of N consecutive keywords is known as an "N-gram."
Bigrams: Analytics Vidhya, Vidhya is, is a, a great, great source, source to, to learn, learn data, data science - 9 bigrams in total
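The list can be checked quickly with NLTK's ngrams helper:

# A quick check of the bigram count with NLTK.
from nltk import ngrams

tokens = "Analytics Vidhya is a great source to learn data science".split()
print(list(ngrams(tokens, 2)))   # 9 bigrams for a 10-token sentence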
What distinguishes the Conditional Random Field (CRF) from the Hidden Markov Model (HMM)?
The key distinction is that CRF models the conditional probability of the output given the input directly, making it a discriminative model, while HMM models the joint probability of the input and output, making it a generative model. The choice between CRF and HMM depends on the nature of the task and the type of data being modeled.
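Written out (a standard formulation, stated here for reference rather than taken from the answer above), an HMM factorizes the joint probability as P(X, Y) = Π_t P(y_t | y_{t-1}) · P(x_t | y_t), whereas a linear-chain CRF models the conditional as P(Y | X) = (1 / Z(X)) · exp(Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, X, t)), where the f_k are feature functions, the λ_k are their learned weights, and Z(X) is the normalization term summed over all possible label sequences.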
You have collected a dataset of roughly 10,000 rows containing tweet text alone. You want to build a model that classifies each tweet into three categories: positive, negative, and neutral. Given this context, which of the following models can classify the tweets?
There is no target variable, because you are given only the tweet text and nothing else. Since SVM and Naive Bayes are both supervised learning techniques, a supervised learning model cannot be trained here.
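Without labels, one option is unsupervised learning; a sketch using TF-IDF features and k-means with three clusters is shown below (the tweets are illustrative, and the clusters would still need manual inspection before being interpreted as positive, negative, or neutral):

# A minimal unsupervised sketch: cluster unlabeled tweets into 3 groups.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = ["loving the new update", "worst service ever", "just landed in delhi",
          "great game last night", "so disappointed with this phone", "reading the news"]

X = TfidfVectorizer().fit_transform(tweets)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
print(list(zip(tweets, labels)))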
While trying to extract context from text data, you came across two different sentences: "There are soldiers inside the tank." and "There is nitrogen in the tank." Which of the following techniques is best suited for resolving the word sense disambiguation in these sentences?
Option 2, the Lesk algorithm, is the only listed method that can be used here; it disambiguates the sense of a word by comparing its dictionary definitions with the surrounding context.
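NLTK ships a simplified Lesk implementation (it requires the WordNet corpus to be downloaded); a sketch of applying it to the two sentences:

# A minimal sketch of Lesk-based word sense disambiguation with NLTK.
from nltk.wsd import lesk

sent1 = "There are soldiers inside the tank".split()
sent2 = "There is nitrogen in the tank".split()

print(lesk(sent1, "tank"))  # WordNet synset chosen for "tank" in the soldiers sentence
print(lesk(sent2, "tank"))  # WordNet synset chosen for "tank" in the nitrogen sentence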
How many bigrams can be formed from the following sentence: "Analytics Vidhya is a great source to learn data science."? The sentence has 10 tokens, so 9 bigrams can be formed, as listed above.
How many trigrams can be formed from the following sentence after applying these text cleaning steps:
Stopword Removal
Replacing punctuations by a single space
“#Analytics-vidhya is a great source to learn @data science.”
After the stopwords ("is", "a", "to") are removed and the punctuation is replaced by spaces, the text becomes "Analytics vidhya great source learn data science". Trigrams: analytics vidhya great, vidhya great source, great source learn, source learn data, learn data science - 5 trigrams in total.
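A sketch of the same cleaning and trigram extraction in Python, using a small hand-picked stop-word list rather than a full library list:

# A minimal sketch of the cleaning steps followed by trigram extraction.
import re
from nltk import ngrams

text = "#Analytics-vidhya is a great source to learn @data science."
stopwords = {"is", "a", "to"}                 # illustrative stop-word list

cleaned = re.sub(r"[^\w\s]", " ", text)       # replace punctuation by a space
tokens = [w for w in cleaned.split() if w.lower() not in stopwords]
print(list(ngrams(tokens, 3)))                # 5 trigrams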
"Did you mean", a feature of Google Search, uses a variety of methods. Which of the following methods is most likely a component of it?
1. Collaborative Filtering model to detect similar user behaviors (queries)
2. Model that checks for Levenshtein distance among the dictionary terms
3. Translation of sentences into multiple languages
Levenshtein distance can be used to measure how close the typed query is to dictionary terms, and collaborative filtering can be used to examine the query patterns of similar users, so methods 1 and 2 are the likely components.
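A minimal sketch of the Levenshtein (edit) distance computation such a spell-correction component could use:

# Edit distance between a query term and a dictionary term (dynamic programming).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))            # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("recieve", "receive"))  # 2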
What part does NLP play in building the two well-known types of recommendation engines, collaborative filtering and content-based models?
When dealing with text data, NLP can be used for feature extraction, for creating text vector features, and for comparing the similarity of those features.
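For the content-based side, a sketch (with illustrative item descriptions) is to represent each item's text as a TF-IDF vector and recommend the item most similar to one the user liked:

# A minimal content-based recommendation sketch using TF-IDF and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = {
    "Article A": "neural networks for image classification",
    "Article B": "gardening tips for spring vegetables",
    "Article C": "convolutional neural networks for computer vision",
}
names = list(items)
vectors = TfidfVectorizer().fit_transform(items.values())

liked = "Article A"
scores = cosine_similarity(vectors[names.index(liked)], vectors).ravel()
scores[names.index(liked)] = -1            # do not recommend the item the user already read
print("recommend:", names[scores.argmax()])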
What percentage of the following statements regarding topic modeling is correct?
It is a supervised learning technique
LDA (Linear Discriminant Analysis) can be used to perform topic modeling
Selection of the number of topics in a model does not depend on the size of the data
The number of topic terms is directly proportional to the size of the data
None of these statements is correct (0%): topic modeling is an unsupervised technique, the LDA used for topic modeling is Latent Dirichlet Allocation rather than Linear Discriminant Analysis, the choice of the number of topics does depend on the size of the data, and the number of topic terms is not directly proportional to it.
Text data from social media platforms is among the most intuitive to work with. You are given a corpus of complete tweets collected from social media. How can a model that suggests hashtags be built?
Any of the listed techniques can be used to extract the most important terms from the corpus and suggest them as hashtags.
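One simple sketch (with illustrative tweets) is to suggest the highest-scoring TF-IDF terms of a tweet as candidate hashtags:

# A minimal hashtag-suggestion sketch: top TF-IDF terms of a tweet.
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["training a deep learning model on tweet data",
          "the cricket world cup final was thrilling",
          "deep learning beats classical models on text data"]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(tweets)
terms = vectorizer.get_feature_names_out()

row = matrix[0].toarray().ravel()          # TF-IDF scores for the first tweet
top = row.argsort()[::-1][:3]              # indices of the 3 highest-scoring terms
print(["#" + terms[i] for i in top])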