Understanding NLP Model Development Using Python
This article aims to give newcomers a clearer perspective on learning natural language processing (NLP). Reflecting on my early experiences, I often wondered how I would apply the various concepts I was learning.
A fundamental understanding of natural language concepts is required before diving into this content. For those needing a refresher, consider reviewing the article mentioned below.
NLP — From Beginner to Expert with Python: a comprehensive guide for mastering NLP fundamentals (pub.towardsai.net)
Topics Covered:
1. Reading sentiment text files
2. Data exploration and text processing
3. Data cleaning — stopwords, stemming, and lemmatization
4. Model building — Naive Bayes
5. Saving and loading the model
Reading Sentiment Text Files
First, we will import the necessary libraries.
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')
Next, we will read the sentiment file (which can be downloaded from Kaggle) using the pandas library.
train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)
The sentiment file consists of two columns: sentiment and text, with the sentiment column containing binary values ("0" and "1").
To properly view the sentences, we need to adjust the column width.
pd.set_option('display.max_colwidth', 800)
Now, we will filter the data based on the sentiment values "1" (positive) and "0" (negative). The following code will display the first five rows of positive sentiment.
train_ds[train_ds.sentiment == 1][0:5]
The following code displays the first five rows of negative sentiment.
train_ds[train_ds.sentiment == 0][0:5]
Data Exploration and Text Processing
Data Exploration
We can examine the data's information using the info() method.
train_ds.info()
Next, we will count the positive and negative sentiments using seaborn's count plot.
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

plt.figure(figsize=(6, 5))
ax = sn.countplot(x='sentiment', data=train_ds)
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))
Text Processing
Now, we will transform the text data into a format suitable for analysis using a count vector model.
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit(train_ds.text)
feature_vector
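To get a feel for what the count vector model does, here is a minimal, self-contained sketch. The two sentences are made up purely for illustration and are not part of the dataset.

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, purely for illustration
toy_docs = ["the movie was great", "the movie was awful, truly awful"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Vocabulary learned from the two sentences (newer scikit-learn versions
# rename this method to get_feature_names_out())
print(toy_vectorizer.get_feature_names())
# One row per sentence, one column per word, each cell holding a count
print(toy_matrix.toarray())

Each sentence becomes a row of word counts over the learned vocabulary, which is exactly what we will do with the full sentiment data below.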
To determine the total number of features, we can utilize the get_feature_names() method.
word = feature_vector.get_feature_names()
print("Total number of features: ", len(word))
The output indicates a total of 2132 features.
To sample some features from the list:
import random
random.sample(word, 10)
Now, we will convert the features into a sparse matrix.
train_ds_features = count_vectorizer.transform(train_ds.text)
type(train_ds_features)
The output will confirm that it is a scipy.sparse.csr.csr_matrix.
To check the dimensions of the sparse matrix:
train_ds_features.shape
The result will show (6918, 2132).
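To see why the sparse format is used, we can compare the number of stored non-zero entries with the full matrix size. This is a small check added here for illustration; it was not part of the original walkthrough.

rows, cols = train_ds_features.shape
density = train_ds_features.nnz / (rows * cols)   # fraction of cells that are non-zero
print("Non-zero entries:", train_ds_features.nnz)
print("Density: {:.4f}".format(density))

For short sentences spread over thousands of vocabulary columns, only a tiny fraction of cells are non-zero, so sparse storage saves a large amount of memory.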
Next, we'll convert the sparse matrix into a dense dataframe.
train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
To view the dataframe:
train_ds_df.head()
To check the first row of the raw data:
train_ds[0:1]
Now, we will inspect the first row of the dense matrix with selected columns:
train_ds_df.iloc[0:1, 150:157]
Counting Word Frequencies
We will count the occurrences of words and organize them into a dataframe.
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))

plt.figure(figsize=(12, 5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel('Frequency of words')
plt.ylabel('Density')
Next, we will examine words that occur only once.
len(feature_counts_df[feature_counts_df.counts == 1])
The output will show that there are 1228 words with a count of one.
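To get a feel for what these rare words look like, we can sample a few of them. This small check is added here for illustration.

rare_words = feature_counts_df[feature_counts_df.counts == 1]
rare_words.features.sample(10)   # a random sample of words that appear only once

Words that occur only once carry little signal for the classifier, which is why the next step restricts the vocabulary to the most frequent features.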
To identify the most frequently occurring words and create a dataframe from them:
count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
word = feature_vector.get_feature_names()
train_ds_features = count_vectorizer.transform(train_ds.text)
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
To view the most frequently occurring words as a dataframe:
feature_counts.sort_values('counts', ascending=False)[0:15]
Data Cleaning
Stopwords
We will now identify and eliminate stopwords, as they do not contribute meaningful information for sentiment analysis.
from sklearn.feature_extraction import text

my_stop_words = text.ENGLISH_STOP_WORDS
print("Few stop words: ", list(my_stop_words)[0:10])
The output will display a selection of common stopwords.
Additionally, we can incorporate custom stopwords:
my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci', 'da', 'harri', 'mountain', 'movie', 'movies'])
Now, we will create a new dataframe after removing the stopwords.
count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
View the new dataframe after filtering out stopwords:
feature_counts.sort_values("counts", ascending=False)[0:15]
Stemming and Lemmatization
Next, we will reduce words to their root forms using the Porter Stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()
def stem_words(doc):
    # Stem each token produced by the default CountVectorizer analyzer
    stemmed_words = (stemmer.stem(w) for w in analyzer(doc))
    # Remove stopwords (note: the set difference also drops duplicate tokens within a document)
    non_stop_words = list(set(stemmed_words) - set(my_stop_words))
    return non_stop_words
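To sanity-check the analyzer, we can run it on a made-up sentence. The sentence and the expected output below are illustrative only.

# Illustrative check; the exact order may vary because the function builds its result from a set
print(stem_words("I loved the amazing acting"))
# stemmed, non-stopword tokens, e.g. something like ['love', 'amaz', 'act']

Note that stopword filtering happens after stemming, which is why the custom list above includes stemmed forms such as 'harri'.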
Now, let's create a new dataframe with the root words.
count_vectorizer = CountVectorizer(analyzer=stem_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)

feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
feature_counts.sort_values("counts", ascending=False)[0:15]
Next, we will convert the vector matrix into a dataframe.
train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
train_ds_df['sentiment'] = train_ds.sentiment
Model Building
Naive Bayes Model
We will split the data into training and testing sets.
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)
We will utilize the Bernoulli Naive Bayes classifier.
from sklearn.naive_bayes import BernoulliNB

nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)
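Bernoulli Naive Bayes models each word as a present/absent feature, which suits our count-vector representation (it binarizes the counts internally by default). As a quick sanity check, added here for illustration, we can look at the accuracy on the training split:

# Quick sanity check (illustrative): accuracy on the training split
print("Training accuracy:", nb_clf.score(train_X.toarray(), train_y))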
To predict sentiments, we will apply the model on the test data.
test_ds_predicted = nb_clf.predict(test_X.toarray())
Next, we will print the classification report for the Naive Bayes classifier.
from sklearn import metrics

print(metrics.classification_report(test_y, test_ds_predicted))
Now, let's visualize the confusion matrix.
cm = metrics.confusion_matrix(test_y, test_ds_predicted)
sn.heatmap(cm, annot=True, fmt='.2f');
Saving and Loading the Model
We will use the pickle library to save our model.
import pickle

pickle.dump(nb_clf, open("Sentiment_Classifier_model", 'wb'))
To load the model for future predictions:
loaded_model = pickle.load(open("Sentiment_Classifier_model", 'rb'))
test_ds_predicted = loaded_model.predict(test_X.toarray())
print(metrics.classification_report(test_y, test_ds_predicted))
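To score brand-new text with the loaded model, the same fitted CountVectorizer (the one built with the stemming analyzer) has to be applied first. A minimal sketch, assuming that vectorizer is still in scope; the example reviews are made up:

# Made-up example reviews; the fitted count_vectorizer (with the stemming
# analyzer) must be applied before prediction
new_reviews = ["I absolutely loved this film, brilliant acting",
               "What a waste of time, truly terrible"]

new_features = count_vectorizer.transform(new_reviews)
print(loaded_model.predict(new_features.toarray()))   # 1 = positive, 0 = negative

In practice you would save the vectorizer alongside the model so that new text can always be transformed consistently.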
Conclusion
This article covers the foundational steps for building NLP models and transforming words into features for predictive analysis. Future articles will explore additional feature-extraction techniques such as TF-IDF and n-grams.
I hope you found this article informative. Feel free to connect with me on LinkedIn and Twitter.
Recommended Articles
1. Understanding List as Big O and Comprehension with Python Examples
2. Python Data Structures Data-types and Objects
3. Exception Handling Concepts in Python
4. Principal Component Analysis in Dimensionality Reduction with Python
5. Fully Explained K-means Clustering with Python
6. Fully Explained Linear Regression with Python
7. Fully Explained Logistic Regression with Python
8. Basics of Time Series with Python
9. Data Wrangling With Python — Part 1
10. Confusion Matrix in Machine Learning