20 Natural Language Processing
NLP provides several tools to process human language. It is used to analyze text, allowing machines to understand how humans communicate. It is used in chatbots, spam filters, and grammar checkers.
In order to work with natural language, it's necessary to follow these steps:
- Compile Documents: The first step is to gather the documents that will be used to train the model. The resulting collection is known as the corpus.
- Featurize Documents: The second step is to extract features from the text. This step is also known as vectorization.
- Compare Features: The third step is to compare the features of the documents. This step is also known as computing the similarity.
Bag of Words
NLP transforms text into a more understandable format for machines:
Documents "Blue House" and "Red House" gets transformed into vectors:
"Blue House" => (red,blue,house) => (0,1,1)
"Red House" => (red,blue,house) => (1,0,1)
A document represented as a vector of word counts is called a bag of words.
In order to determine the similarity between two documents, we can use the cosine similarity:
cosine similarity = sim(A, B) = cos(θ) = (A · B) / (||A|| ||B||)
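For example, a minimal sketch of the cosine similarity between the two example vectors above (using numpy, which is an assumption not shown in the original notes):
import numpy as np
# bag-of-words vectors over the vocabulary (red, blue, house)
blue_house = np.array([0, 1, 1])
red_house = np.array([1, 0, 1])
# dot product divided by the product of the vector norms
similarity = blue_house.dot(red_house) / (np.linalg.norm(blue_house) * np.linalg.norm(red_house))
print(similarity)  # 0.5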
It's possible to improve the bag of words by adjusting the word counts based on their frequency in the corpus. This is called term frequency-inverse document frequency (TF-IDF).
Terminology
- Corpus: A collection of documents.
- Stop Words: Common words that are unlikely to be useful for learning, such as "and", "the", "a", etc.
- Token: A single element of a corpus, such as a word or a symbol.
- Tokenization: The process of splitting a document into tokens.
- Vector: A list of numbers that represent a document.
- Vectorization: The process of converting a document into a vector.
- Cosine Similarity: A measure of similarity between two vectors.
- Bag of Words: A vector representation of a document that describes the occurrence of words within a corpus.
TF-IDF
It's a measure of the frequency of a word in a document, adjusted for the frequency of the word in the corpus.
It's mathematically expressed as:
W(i,j) = TF(i,j) * log(N/DF(i))
Where:
- W(i,j) is the weight of the word i in the document j.
- TF(i,j) is the frequency of the word i in the document j.
- N is the number of documents in the corpus.
- DF(i) is the number of documents in the corpus that contain the word i.
Term frequency
Importance of the term within the document: TF(d, t) = number of occurrences of term t in document d.
Inverse document frequency
Importance of the term across the corpus: IDF(t) = log(N / DF(t)).
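A minimal worked example of the weight formula above, with made-up counts:
import math
N = 4    # toy corpus of 4 documents
tf = 3   # the term appears 3 times in document j
df = 2   # the term appears in 2 of the 4 documents
weight = tf * math.log(N / df)  # math.log is the natural logarithm; the base is a convention
print(weight)  # ≈ 2.08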
Python Implementation
Python has a library called "nltk" that provides several tools to work with natural language.
conda install nltk
pip3 install nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
%matplotlib inline
nltk provides an interface to download several corpora and models. The following command will open a shell to download the corpora and models:
nltk.download_shell()
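Alternatively, individual packages can be downloaded directly, for example the stopwords corpus used later:
nltk.download('stopwords')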
Dataframe
messages = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
messages.head()
Ham message count vs. spam message count:
messages.groupby('label').describe()
An important part of NLP is feature engineering; several features can be extracted from text data (a few of them are sketched below), such as:
- length of the text
- number of words
- number of characters
- number of punctuations
- number of upper case words
- number of title case words
- number of stop words
- number of special characters
- number of numerics
- number of emojis
- number of hashtags
- number of mentions
- average length of the words
- average length of the sentences
- number of sentences
- number of unique words
- number of unique words without stopwords
- number of words with only alphabets
messages['length'] = messages['message'].apply(len)
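A minimal sketch of a few more of these features (the new column names are illustrative assumptions, not from the original notes):
messages['num_words'] = messages['message'].apply(lambda m: len(m.split()))
messages['num_upper'] = messages['message'].apply(lambda m: sum(w.isupper() for w in m.split()))
messages['avg_word_len'] = messages['message'].apply(lambda m: sum(len(w) for w in m.split()) / max(len(m.split()), 1))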
Plotting
Distribution of message lengths:
messages['length'].plot.hist(bins=150)
Finding a message with 910 characters:
messages[messages['length'] == 910]['message'].iloc[0]
Plotting the distribution of the length of the messages by label:
messages.hist(column='length', by='label', bins=60, figsize=(12, 4))
It's possible to see that spam messages tend to have more characters.
Text Preprocessing
The following steps are necessary to process text data:
- Remove punctuation
- Remove stopwords
- Tokenize
- Lemmatize or Stem
import string
from nltk.corpus import stopwords

def text_process(mess):
    """
    1. remove punctuation
    2. remove stopwords
    3. return list of clean text words
    """
    # remove punctuation characters
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # remove stopwords (requires the nltk stopwords corpus to be downloaded)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
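As a quick check, the function can be applied to the first few messages to get a list of clean tokens per message:
messages['message'].head(5).apply(text_process)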
Vectorization
Vectorization is the process of converting text data into vectors. The following steps are necessary to vectorize text data:
- Count how many times a word occurs in each message (known as term frequency)
- Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)
- Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])
How many words are in the vocabulary:
print(len(bow_transformer.vocabulary_))
message4 = messages['message'][3]
# 'U dun say so early hor... U c already then say...'
bow4 = bow_transformer.transform([message4])
print(bow4)
print(bow4.shape)
# (0, 4068) 2
# (0, 4629) 1
# (0, 5261) 1
# (0, 6204) 1
# (0, 6222) 1
# (0, 7186) 1
# (0, 9554) 2
# -----------
# (1, 11425)
Get back the word at index 4068:
# check the word at index 4068
bow_transformer.get_feature_names_out()[4068]
# feature 4068 is 'U' and it appears twice in the message
Transform the entire dataframe:
messages_bow = bow_transformer.transform(messages['message'])
A sparse matrix is a matrix in which most of the elements are zero. In the interest of efficient storage, only the locations and values of the non-zero elements are stored.
It's useful because most of the cells in the matrix are empty, so it's a waste of memory to store all the zeros.
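A minimal illustration with scipy (not part of the original notebook; scipy ships as a scikit-learn dependency):
from scipy.sparse import csr_matrix
import numpy as np
dense = np.array([[0, 0, 3], [4, 0, 0]])
sparse = csr_matrix(dense)
print(sparse)  # only the coordinates and values of the non-zero entries are stored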
print('Shape of Sparse Matrix: ', messages_bow.shape)
print('Amount of Non-Zero occurrences: ', messages_bow.nnz)
The following computes the percentage of cells that are non-zero, which shows how sparse the matrix is:
sparsity = (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))
print('sparsity: {}'.format(round(sparsity)))
TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency. It's a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
The TF-IDF transformer will calculate the TF-IDF score for each word in the corpus.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer().fit(messages_bow)
Transform the message 4:
tfidf4 = tfidf_transformer.transform(bow4)
print(tfidf4)
# (0, 4068) 0.26863320404807484
# (0, 4629) 0.2283418270966581
# (0, 5261) 0.2848160159387552
# (0, 6204) 0.2891210261089915
# (0, 6222) 0.3273399941961877
# (0, 7186) 0.6208395209546331
# (0, 9554) 0.4673241898596147
Checking the IDF weight of the word 'university' in the corpus:
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['university']])
Transform the entire bag-of-words corpus into TF-IDF corpus:
messages_tfidf = tfidf_transformer.transform(messages_bow)
Training a model
It's now possible to train a model to classify a message as spam or not spam, using for example the MultinomialNB classifier.
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(messages_tfidf, messages['label'])
Checking the prediction for message 4:
print('predicted:', spam_detect_model.predict(tfidf4)[0])
# ham
print('expected:', messages.label[3])
# ham
Model Evaluation
Applying the model to all the messages:
all_predictions = spam_detect_model.predict(messages_tfidf)
print(all_predictions)
from sklearn.metrics import classification_report
print(classification_report(messages['label'], all_predictions))
These scores are overly optimistic, because the model is evaluated on the same data it was trained on. The data should instead be split into a training set and a test set:
from sklearn.model_selection import train_test_split
msg_train, msg_test, label_train, label_test = train_test_split(messages['message'], messages['label'], test_size=0.3)
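A quick sanity check on the sizes of the resulting split:
print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))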
Creating a Data Pipeline
Scikit-learn pipelines are very useful for simplifying the workflow: all the transformations applied to the data can be set up once and reused later.
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
# 1st step => use CountVectorizer to convert text messages into a matrix of token counts
('bow', CountVectorizer(analyzer=text_process)),
# 2nd step => use TfidfTransformer to compute the IDF values and compute the TF-IDF scores
('tfidf', TfidfTransformer()),
# 3rd step => use MultinomialNB classifier to train the model
('classifier', MultinomialNB())
])
Fit pipeline to the training data:
pipeline.fit(msg_train, label_train)
Predict the test data:
predictions = pipeline.predict(msg_test)
Evaluate the model:
print(classification_report(label_test, predictions))
Other models
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=text_process)),
('tfidf', TfidfTransformer()),
('classifier', RandomForestClassifier())
])
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)
print(classification_report(label_test, predictions))
from sklearn.svm import SVC
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=text_process)),
('tfidf', TfidfTransformer()),
('classifier', SVC())
])
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)
print(classification_report(label_test, predictions))
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=text_process)),
('tfidf', TfidfTransformer()),
('classifier', LogisticRegression())
])
pipeline.fit(msg_train, label_train)
predictions = pipeline.predict(msg_test)
print(classification_report(label_test, predictions))