
Understanding NLP Model Development Using Python


This article aims to give newcomers a practical perspective on learning natural language processing (NLP). Reflecting on my early experiences, I often wondered how I would apply the various concepts I was learning.

A fundamental understanding of natural language concepts is required before diving into this content. For those needing a refresher, consider reviewing the article mentioned below.

NLP — From Beginner to Expert with Python: A comprehensive guide for mastering NLP fundamentals (pub.towardsai.net)

Topics Covered:

1. Reading sentiment text files
2. Data Exploration and Text Processing
3. Data Cleaning — Stopwords, Stemming, and Lemmatization
4. Model Building — Naive Bayes
5. Saving and Loading the Model

Reading Sentiment Text Files

First, we will import the necessary libraries.

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Next, we will read the sentiment file using the pandas library, which can be downloaded from Kaggle.

train_ds = pd.read_csv("sentiment_train", delimiter="\t")
train_ds.head(5)

The sentiment file consists of two columns: sentiment and text, with the sentiment column containing binary values ("0" and "1").

To properly view the sentences, we need to adjust the column width.

pd.set_option('display.max_colwidth', 800)

Now, we will filter the data based on the sentiment values "1" (positive) and "0" (negative). The following code will display the first five rows of positive sentiment.

train_ds[train_ds.sentiment == 1][0:5]

Similarly, the following snippet displays the first five rows of negative sentiment.

train_ds[train_ds.sentiment == 0][0:5]

Data Exploration and Text Processing

Data Exploration

We can examine the data's information using the info() method.

train_ds.info()

Next, we will count the positive and negative sentiments using seaborn's count plot.

import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

plt.figure(figsize=(6,5))
ax = sn.countplot(x='sentiment', data=train_ds)

for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))


Text Processing

Now, we will transform the text data into a format suitable for analysis using a count vector model.

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
feature_vector = count_vectorizer.fit(train_ds.text)

feature_vector

To determine the total number of features, we can utilize the get_feature_names_out() method (older scikit-learn versions used get_feature_names()).

word = list(feature_vector.get_feature_names_out())
print("Total number of features: ", len(word))

The output indicates a total of 2132 features.

To sample some features from the list:

import random
random.sample(word, 10)

Now, we will convert the features into a sparse matrix.

train_ds_features = count_vectorizer.transform(train_ds.text)
type(train_ds_features)

The output will confirm that it is a scipy.sparse.csr.csr_matrix.

To check the dimensions of the sparse matrix:

train_ds_features.shape

The result will show (6918, 2132).
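Most entries in this matrix are zeros, which is why scipy stores it in compressed sparse row format rather than as a dense array. A quick illustrative check of just how sparse it is (not part of the original walkthrough):

# Fraction of non-zero entries in the document-term matrix
density = train_ds_features.nnz / (train_ds_features.shape[0] * train_ds_features.shape[1])
print("Density: ", round(density, 4))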

Next, we'll convert the sparse matrix into a dense dataframe.

train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word

To view the dataframe:

train_ds_df.head()

To check the first row of the raw data:

train_ds[0:1]

Now, we will inspect the first row of the dense matrix with selected columns:

train_ds_df.iloc[0:1, 150:157]

Counting Word Frequencies

We will count the occurrences of words and organize them into a dataframe.

words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))

plt.figure(figsize=(12,5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel('Frequency of words')
plt.ylabel('Density')

Next, we will examine words that occur only once.

len(feature_counts_df[feature_counts_df.counts == 1])

The output will show that there are 1228 words with a count of one.
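As a side note, such rare words can also be dropped at vectorization time with CountVectorizer's min_df parameter; a minimal sketch (the threshold of 2 is an illustrative choice):

# Keep only words that appear in at least 2 documents
rare_filter = CountVectorizer(min_df=2)
rare_filter.fit(train_ds.text)
print("Vocabulary size without one-off words: ", len(rare_filter.get_feature_names_out()))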

To identify the most frequently occurring words and create a dataframe from them:

count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
word = list(feature_vector.get_feature_names_out())
train_ds_features = count_vectorizer.transform(train_ds.text)
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))

To view the most frequently occurring words as a dataframe:

feature_counts.sort_values('counts', ascending=False)[0:15]

Data Cleaning

Stopwords

We will now identify and eliminate stopwords, as they do not contribute meaningful information for sentiment analysis.

from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS

print("Few stop words: ", list(my_stop_words)[0:10])

The output will display a selection of common stopwords.

Additionally, we can incorporate custom stopwords:

my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci', 'da', 'harri', 'mountain', 'movie', 'movies'])

Now, we will create a new dataframe after removing the stopwords.

count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = list(feature_vector.get_feature_names_out())
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))

View the new dataframe after filtering out stopwords:

feature_counts.sort_values("counts", ascending=False)[0:15]


Stemming and Lemmatization

Next, we will reduce words to their root forms using the Porter Stemmer.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

def stem_words(doc):
    # Stem each token, then drop stopwords (preserving token order and counts)
    stemmed = (stemmer.stem(w) for w in analyzer(doc))
    return [w for w in stemmed if w not in my_stop_words]
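To see what this analyzer produces, you can call it directly on a sample sentence. An illustrative check (note that stemming runs before the stopword filter, so a stemmed stopword such as 'thi' from 'this' can slip through):

# Porter reduces 'amazing' to 'amaz' and strips the plural from 'films'
print(stem_words('amazing films'))  # expected: ['amaz', 'film']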

Now, let's create a new dataframe with the root words.

count_vectorizer = CountVectorizer(analyzer=stem_words, max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = list(feature_vector.get_feature_names_out())
words_counts = np.sum(train_ds_features.toarray(), axis=0)

feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
feature_counts.sort_values("counts", ascending=False)[0:15]

Next, we will convert the vector matrix into a dataframe.

train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = word
train_ds_df['sentiment'] = train_ds.sentiment

Model Building

Naive Bayes Model

We will split the data into training and testing sets.

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(train_ds_features, train_ds.sentiment, test_size=0.3, random_state=42)

We will utilize the Bernoulli Naive Bayes classifier.

from sklearn.naive_bayes import BernoulliNB

nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)
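BernoulliNB treats each word as a binary present/absent feature, which suits short sentiment snippets like these. If you would rather have repeated words carry extra weight, MultinomialNB is a drop-in alternative; a sketch, not part of the original pipeline:

from sklearn.naive_bayes import MultinomialNB

# Multinomial variant models word counts instead of binary presence
mnb_clf = MultinomialNB()
mnb_clf.fit(train_X.toarray(), train_y)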

To predict sentiments, we will apply the model to the test data.

test_ds_predicted = nb_clf.predict(test_X.toarray())

Next, we will print the classification report for the Naive Bayes classifier.

from sklearn import metrics

print(metrics.classification_report(test_y, test_ds_predicted))

Now, let's visualize the confusion matrix.

cm = metrics.confusion_matrix(test_y, test_ds_predicted)
sn.heatmap(cm, annot=True, fmt='.2f');

Saving and Loading the Model

We will use the pickle library to save our model.

import pickle

pickle.dump(nb_clf, open("Sentiment_Classifier_model", 'wb'))

To load the model for future predictions:

loaded_model = pickle.load(open("Sentiment_Classifier_model", 'rb'))
test_ds_predicted = loaded_model.predict(test_X.toarray())
print(metrics.classification_report(test_y, test_ds_predicted))
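Note that scoring brand-new text also requires the fitted CountVectorizer, since the model only understands the vocabulary built above. A minimal sketch, assuming you persist the vectorizer alongside the model (the file name Sentiment_Vectorizer is illustrative):

# Save and reload the fitted vectorizer (illustrative file name)
pickle.dump(count_vectorizer, open("Sentiment_Vectorizer", 'wb'))
loaded_vectorizer = pickle.load(open("Sentiment_Vectorizer", 'rb'))

# Vectorize a new sentence with the same vocabulary, then predict
new_X = loaded_vectorizer.transform(["what an amazing film"])
print(loaded_model.predict(new_X.toarray()))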

Conclusion

This article covered the foundational steps of building an NLP model: transforming words into features and using them for prediction. Future articles will explore additional feature-extraction techniques such as TF-IDF and n-grams.
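As a quick preview, both techniques are available through scikit-learn's vectorizers; a minimal sketch on the same data (the parameters are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighting over unigrams and bigrams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
tfidf_features = tfidf_vectorizer.fit_transform(train_ds.text)
print(tfidf_features.shape)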

I hope you found this article informative. Feel free to connect with me on LinkedIn and Twitter.

Recommended Articles

1. Understanding List as Big O and Comprehension with Python Examples
2. Python Data Structures Data-types and Objects
3. Exception Handling Concepts in Python
4. Principal Component Analysis in Dimensionality Reduction with Python
5. Fully Explained K-means Clustering with Python
6. Fully Explained Linear Regression with Python
7. Fully Explained Logistic Regression with Python
8. Basics of Time Series with Python
9. Data Wrangling With Python — Part 1
10. Confusion Matrix in Machine Learning
