Understanding NLP Model Development Using Python
This article aims to give newcomers a clearer perspective on learning natural language processing (NLP). Reflecting on my early experiences, I often wondered how I would apply the various concepts I was learning.
A fundamental understanding of natural language concepts is required before diving into this content. For those needing a refresher, consider reviewing the article mentioned below.
NLP — From Beginner to Expert with Python: a comprehensive guide for mastering NLP fundamentals (pub.towardsai.net)
Topics Covered:
1. Reading sentiment text files
2. Data exploration and text processing
3. Data cleaning — stopwords, stemming, and lemmatization
4. Model building — Naive Bayes
5. Saving and loading the model
Reading Sentiment Text Files
First, we will import the necessary libraries.
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')
Next, we will read the sentiment file (which can be downloaded from Kaggle) using the pandas library.
train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)
The sentiment file consists of two columns: sentiment and text, with the sentiment column containing binary values ("0" and "1").
To properly view the sentences, we need to adjust the column width.
pd.set_option('display.max_colwidth', 800)
Now, we will filter the data based on the sentiment values "1" (positive) and "0" (negative). The following code will display the first five rows of positive sentiment.
train_ds[train_ds.sentiment == 1][0:5]
The following code displays the first five rows of negative sentiment.
train_ds[train_ds.sentiment == 0][0:5]
Data Exploration and Text Processing
Data Exploration
We can examine the data's information using the info() method.
train_ds.info()
Next, we will count the positive and negative sentiments using seaborn's count plot.
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

plt.figure(figsize=(6, 5))
ax = sn.countplot(x='sentiment', data=train_ds)
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))
Text Processing
Now, we will transform the text data into a format suitable for analysis using a count vector model.
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit(train_ds.text)
feature_vector
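To get a feel for what the count vector model does, here is a minimal, self-contained sketch. The two sentences are made up purely for illustration and are not part of the dataset.

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, purely for illustration
toy_docs = ["the movie was great", "the movie was awful, truly awful"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary learned from the two sentences (newer scikit-learn versions
# rename this method to get_feature_names_out())
print(toy_vectorizer.get_feature_names())
# One row per sentence, one column per word, each cell holding a count
print(toy_matrix.toarray())

Each sentence becomes a row of word counts over the learned vocabulary, which is exactly what we will do with the full sentiment data below.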
To determine the total number of features, we can utilize the get_feature_names() method.
word = feature_vector.get_feature_names()
print("Total number of features: ", len(word))
The output indicates a total of 2132 features.
To sample some features from the list:
import random
random.sample(word, 10)
Now, we will convert the features into a sparse matrix.
train_ds_features = count_vectorizer.transform(train_ds.text)
type(train_ds_features)
The output will confirm that it is a scipy.sparse.csr.csr_matrix.
To check the dimensions of the sparse matrix:
train_ds_features.shape
The result will show (6918, 2132).
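To see why the sparse format is used, we can compare the number of stored non-zero entries with the full matrix size. This is a small check added here for illustration; it was not part of the original walkthrough.

rows, cols = train_ds_features.shape
density = train_ds_features.nnz / (rows * cols)   # fraction of cells that are non-zero
print("Non-zero entries:", train_ds_features.nnz)
print("Density: {:.4f}".format(density))

For short sentences spread over thousands of vocabulary columns, only a tiny fraction of cells are non-zero, so sparse storage saves a large amount of memory.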
Next, we'll convert the sparse matrix into a dense dataframe.
train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
To view the dataframe:
train_ds_df.head()
To check the first row of the raw data:
train_ds[0:1]
Now, we will inspect the first row of the dense matrix with selected columns:
train_ds_df.iloc[0:1, 150:157]
Counting Word Frequencies
We will count the occurrences of words and organize them into a dataframe.
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))

plt.figure(figsize=(12, 5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel('Frequency of words')
plt.ylabel('Density')
Next, we will examine words that occur only once.
len(feature_counts_df[feature_counts_df.counts == 1])
The output will show that there are 1228 words with a count of one.
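To get a feel for what these rare words look like, we can sample a few of them. This small check is added here for illustration.

rare_words = feature_counts_df[feature_counts_df.counts == 1]
rare_words.features.sample(10)   # a random sample of words that appear only once

Words that occur only once carry little signal for the classifier, which is why the next step restricts the vocabulary to the most frequent features.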
To identify the most frequently occurring words and create a dataframe from them:
count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
word = feature_vector.get_feature_names()
train_ds_features = count_vectorizer.transform(train_ds.text)
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
To view the most frequently occurring words as a dataframe:
feature_counts.sort_values('counts', ascending=False)[0:15]
Data Cleaning
Stopwords
We will now identify and eliminate stopwords, as they do not contribute meaningful information for sentiment analysis.
from sklearn.feature_extraction import text

my_stop_words = text.ENGLISH_STOP_WORDS
print("Few stop words: ", list(my_stop_words)[0:10])
The output will display a selection of common stopwords.
Additionally, we can incorporate custom stopwords:
my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci', 'da', 'harri', 'mountain', 'movie', 'movies'])
Now, we will create a new dataframe after removing the stopwords.
count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
View the new dataframe after filtering out stopwords:
feature_counts.sort_values("counts", ascending=False)[0:15]
Stemming and Lemmatization
Next, we will reduce words to their root forms using the Porter Stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()
def stem_words(doc):
    # Stem each token produced by the default CountVectorizer analyzer
    stemmed_words = (stemmer.stem(w) for w in analyzer(doc))
    # Remove stopwords (note: the set difference also drops duplicate tokens within a document)
    non_stop_words = list(set(stemmed_words) - set(my_stop_words))
    return non_stop_words
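To sanity-check the analyzer, we can run it on a made-up sentence. The sentence and the expected output below are illustrative only.

# Illustrative check; the exact order may vary because the function builds its result from a set
print(stem_words("I loved the amazing acting"))
# stemmed, non-stopword tokens, e.g. something like ['love', 'amaz', 'act']

Note that stopword filtering happens after stemming, which is why the custom list above includes stemmed forms such as 'harri'.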
Now, let's create a new dataframe with the root words.
count_vectorizer = CountVectorizer(analyzer=stem_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)

feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
feature_counts.sort_values("counts", ascending=False)[0:15]
Next, we will convert the vector matrix into a dataframe.
train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
train_ds_df['sentiment'] = train_ds.sentiment
Model Building
Naive Bayes Model
We will split the data into training and testing sets.
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)
We will utilize the Bernoulli Naive Bayes classifier.
from sklearn.naive_bayes import BernoulliNB

nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)
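Bernoulli Naive Bayes models each word as a present/absent feature, which suits our count-vector representation (it binarizes the counts internally by default). As a quick sanity check, added here for illustration, we can look at the accuracy on the training split:

# Quick sanity check (illustrative): accuracy on the training split
print("Training accuracy:", nb_clf.score(train_X.toarray(), train_y))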
To predict sentiments, we will apply the model on the test data.
test_ds_predicted = nb_clf.predict(test_X.toarray())
Next, we will print the classification report for the Naive Bayes classifier.
from sklearn import metrics

print(metrics.classification_report(test_y, test_ds_predicted))
Now, let's visualize the confusion matrix.
cm = metrics.confusion_matrix(test_y, test_ds_predicted)
sn.heatmap(cm, annot=True, fmt='.2f');
Saving and Loading the Model
We will use the pickle library to save our model.
import pickle

pickle.dump(nb_clf, open("Sentiment_Classifier_model", 'wb'))
To load the model for future predictions:
loaded_model = pickle.load(open("Sentiment_Classifier_model", 'rb'))
test_ds_predicted = loaded_model.predict(test_X.toarray())
print(metrics.classification_report(test_y, test_ds_predicted))
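To score brand-new text with the loaded model, the same fitted CountVectorizer (the one built with the stemming analyzer) has to be applied first. A minimal sketch, assuming that vectorizer is still in scope; the example reviews are made up:

# Made-up example reviews; the fitted count_vectorizer (with the stemming
# analyzer) must be applied before prediction
new_reviews = ["I absolutely loved this film, brilliant acting",
               "What a waste of time, truly terrible"]

new_features = count_vectorizer.transform(new_reviews)
print(loaded_model.predict(new_features.toarray()))   # 1 = positive, 0 = negative

In practice you would save the vectorizer alongside the model so that new text can always be transformed consistently.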
Conclusion
This article covers the foundational steps for building NLP models and transforming words into features for predictive analysis. Future articles will explore additional feature-extraction techniques such as TF-IDF and n-grams.
I hope you found this article informative. Feel free to connect with me on LinkedIn and Twitter.
Recommended Articles
1. Understanding List as Big O and Comprehension with Python Examples
2. Python Data Structures Data-types and Objects
3. Exception Handling Concepts in Python
4. Principal Component Analysis in Dimensionality Reduction with Python
5. Fully Explained K-means Clustering with Python
6. Fully Explained Linear Regression with Python
7. Fully Explained Logistic Regression with Python
8. Basics of Time Series with Python
9. Data Wrangling With Python — Part 1
10. Confusion Matrix in Machine Learning