DSE - Data Science Natural Language Processing Fundamentals Questions and Answers

Question 1

A data scientist is preprocessing a large corpus of text for a machine learning model. They need to reduce words like 'running', 'ran', and 'runs' to a common base form. Which technique uses a dictionary and morphological analysis to convert a word to its true base form, ensuring the result is a valid word?

Accepted Answer

Lemmatization

Answer

Lemmatization groups the different inflected forms of a word so they can be analyzed as a single item. It uses a vocabulary and morphological analysis to return the base or dictionary form of a word, known as the lemma. For example, 'running', 'ran', and 'runs' all share the lemma 'run'. Stemming, in contrast, is a cruder, rule-based process that chops off word endings and may produce a non-dictionary form (e.g., 'studies' might become 'studi').
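The contrast between the two approaches can be illustrated with a toy sketch. The lookup table LEMMA_TABLE and the suffix list SUFFIXES below are hypothetical, hand-written stand-ins for a real vocabulary and a real stemming rule set; real lemmatizers also rely on morphological analysis rather than a plain lookup.

```python
# Toy comparison of dictionary-based lemmatization and suffix-stripping stemming.
# LEMMA_TABLE and SUFFIXES are illustrative assumptions, not a real library's data.

LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "studies": "study",
}

SUFFIXES = ("ing", "es", "s")  # checked in this order


def lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


def crude_stem(word: str) -> str:
    """Strip the first matching suffix without checking that the result is a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ["running", "ran", "runs", "studies"]:
    print(w, "-> lemma:", lemmatize(w), "| stem:", crude_stem(w))
```

In this sketch the lookup maps the irregular form 'ran' to 'run', while the suffix stripper leaves 'ran' unchanged and turns 'studies' into the non-word 'studi'.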

Question 2

In text analysis, which method assigns a higher weight to words that are frequent in a specific document but rare across the entire collection of documents, effectively down-weighting common words?

Accepted Answer

Term Frequency-Inverse Document Frequency (TF-IDF)

Answer

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure of how important a word is to a document within a corpus. The 'Term Frequency' (TF) part measures how often a word appears in a document, while the 'Inverse Document Frequency' (IDF) part down-weights words that appear in many documents. The product of the two scores gives a high weight to terms that are frequent in one document but rare across the collection.
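As a minimal sketch of the TF times IDF product, the code below assumes one common textbook variant: TF is the term count divided by document length, and IDF is the natural log of (number of documents / number of documents containing the term). Libraries often use smoothed or otherwise adjusted formulas, so their numbers may differ. The function name tf_idf and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


docs = [
    "the battery life is great",
    "the screen is bright",
    "the battery drains fast",
]
scores = tf_idf(docs)
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

With these sample documents, 'the' appears in every document, so its IDF is ln(1) = 0 and its weight is zero, while 'great' appears in only one document and receives the highest weight in the first document.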

Question 3

A data scientist is building a system to extract key information from news articles. The system needs to identify and label entities such as 'Google' as an ORGANIZATION, 'London' as a LOCATION, and '2024' as a DATE. Which NLP task is specifically designed for this purpose?

Accepted Answer

Named Entity Recognition (NER)

Answer

Named Entity Recognition (NER) is an NLP task that involves identifying and categorizing key information (entities) in text into predefined categories. These categories typically include names of persons, organizations, locations, dates, quantities, and more, which is exactly what the scenario requires.
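To make the input and output of the task concrete, here is a toy rule-based tagger. It is only a sketch: the GAZETTEER dictionary, the year pattern, and the tag_entities function are hypothetical, and practical NER systems typically use trained statistical or neural sequence-labeling models rather than fixed lists.

```python
import re

# Hypothetical gazetteer mapping known names to entity labels.
GAZETTEER = {
    "Google": "ORGANIZATION",
    "London": "LOCATION",
}

# Simplifying assumption: a four-digit year starting with 19 or 20 is a DATE.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def tag_entities(text):
    """Return (start offset, surface text, label) tuples sorted by position."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in YEAR_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return sorted(found)


entities = tag_entities("Google opened a new office in London in 2024.")
for start, surface, label in entities:
    print(start, surface, label)
```

A fixed list like this cannot handle unseen names or ambiguous words, which is why learned models are the usual choice for this task.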

Question 4

What is considered a primary limitation of the Bag-of-Words (BoW) model for text representation?

Accepted Answer

It disregards word order and semantic context.

Answer

The Bag-of-Words (BoW) model represents text by counting the occurrences of each word, completely ignoring their order, grammar, and context. This means that sentences with very different meanings, such as 'The dog chased the cat' and 'The cat chased the dog', would have identical BoW representations, which is a significant loss of semantic information.
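The two example sentences can be checked directly. This is a minimal sketch that assumes lowercasing and whitespace tokenization; the helper name bag_of_words is illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding order (lowercased, split on whitespace)."""
    return Counter(text.lower().split())


a = bag_of_words("The dog chased the cat")
b = bag_of_words("The cat chased the dog")

print(a)
print(a == b)  # True: both sentences contain the same words with the same counts
```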

Question 5

A market research firm has collected thousands of open-ended survey responses about a new product. The goal is to discover the main themes or topics (e.g., 'pricing', 'features', 'customer support') discussed in the responses without any prior labeling. Which unsupervised NLP technique is most appropriate for this task?

Accepted Answer

Latent Dirichlet Allocation (LDA)

Answer

Latent Dirichlet Allocation (LDA) is a generative statistical model used for topic modeling. It is an unsupervised technique that treats each document as a mixture of topics and each topic as a probability distribution over words, allowing it to discover abstract themes within a collection of texts without needing pre-labeled data.
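The idea of documents as topic mixtures and topics as word distributions can be sketched with a small collapsed Gibbs sampler, one common way to fit LDA. Everything here is an illustrative assumption: the function name lda_gibbs, the hyperparameters alpha and beta, the iteration count, and the tiny tokenized survey responses. A real project would normally use an established library implementation.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of token lists.

    Returns (theta, phi, vocab): per-document topic proportions,
    per-topic word probabilities, and the vocabulary order used by phi.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                           # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to
                # (doc-topic count + alpha) * (topic-word count + beta) / (topic total + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi, vocab


# Hypothetical, already-tokenized survey responses.
docs = [
    "price too high expensive price".split(),
    "expensive price cost high".split(),
    "support team helpful support fast".split(),
    "helpful support team response".split(),
]
theta, phi, vocab = lda_gibbs(docs, n_topics=2)

for k, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print("topic", k, [w for _, w in top])
```

The topics come back as unlabeled word distributions; naming them (for example 'pricing' or 'customer support') is left to the analyst, and results on a corpus this small vary with the seed and hyperparameters.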
