
Understanding Normalization and Standardization in Python


Normalization and Standardization are two commonly confused concepts in data preprocessing. Understanding when and how to apply each method is crucial for effective data analysis.

To begin with, let's examine normalization. This process rescales your data so that every value falls within the range of 0 to 1, using the min-max formula:

X_norm = (X − X_min) / (X_max − X_min)

We'll implement this using Python with a dataset that's readily accessible.

```python
from sklearn import preprocessing
import numpy as np
import pandas as pd

# Load the California housing dataset
df = pd.read_csv("https://storage.googleapis.com/mledudatasets/california_housing_train.csv", sep=",")

# Normalize the 'total_bedrooms' column
# (note: sklearn's normalize rescales the vector by its L2 norm)
x_array = np.array(df['total_bedrooms'])
normalized_X = preprocessing.normalize([x_array])
```
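One caveat worth flagging: sklearn's `normalize` divides each sample by its vector norm, which only lands in [0, 1] because the bedroom counts are non-negative. If you specifically want min-max scaling to [0, 1], `MinMaxScaler` (or two lines of NumPy) is a closer match. A minimal sketch, using a small hypothetical sample in place of the real column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values standing in for df['total_bedrooms']
x = np.array([[120.0], [450.0], [900.0], [1300.0]])

# Min-max scaling via sklearn
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)

# Equivalent manual computation: (x - min) / (max - min)
x_manual = (x - x.min()) / (x.max() - x.min())
```

Both routes produce the same result, with the smallest value mapped to 0 and the largest to 1.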

Why normalize? Here are a few reasons:

  1. Normalization reduces sensitivity to the scale of features during training, enabling more accurate coefficient estimation.

For instance, in the California housing dataset, features like the number of bedrooms and median household income possess varying units and scales, which can complicate analysis if not addressed.

Let's observe these features without normalization.

The visualizations reveal unusual patterns, such as an implausible number of bedrooms exceeding 1000, along with significant outliers and binning discrepancies. Income also clusters at $500,000, suggesting that those above this threshold are categorized together.

Now, let's apply normalization.

After normalization, all values are confined to the 0-1 range, and outliers are less pronounced, improving feature consistency for evaluating future model outputs.

  2. Using normalization enhances analysis across multiple models.

Applying such algorithms without normalization can hinder convergence because of scaling discrepancies; normalization improves the conditioning of the data so optimization converges more readily.

  3. Normalizing mitigates variance issues during convergence, making optimization viable.
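The conditioning point can be made concrete: when features live on wildly different scales, the matrix a least-squares or gradient-based solver works with is badly conditioned, and rescaling repairs that. A small sketch with synthetic data (the feature names and ranges are illustrative assumptions, not the real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales,
# e.g. bedroom counts vs. incomes in dollars
bedrooms = rng.uniform(1, 10, size=200)
income = rng.uniform(20_000, 500_000, size=200)
X = np.column_stack([bedrooms, income])

# Condition number of X^T X, which governs how hard
# the optimization problem is for a solver
cond_raw = np.linalg.cond(X.T @ X)

# Rescale each column to [0, 1] and compare
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
cond_scaled = np.linalg.cond(X_scaled.T @ X_scaled)
```

The condition number after scaling is orders of magnitude smaller, which is exactly the "better conditioning for convergence" the text describes.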

However, it's essential to note that there are scenarios where normalization might not be appropriate. For example, if the data is already proportional, normalization might yield inaccurate estimators. Alternatively, if the relative scales of features are significant, retaining those scales could be necessary. Understanding your data and the transformations required to achieve your analytical objectives is paramount.

Additionally, some argue that centering input values around 0 (standardization) is preferable to scaling them between 0 and 1. Conducting thorough research will help clarify the specific data requirements for your models.

Now that we’ve clarified normalization, let’s delve into standardization. This technique rescales data so that the mean is 0 and the standard deviation is 1, following this formula:

z = (x − μ) / σ

where μ is the feature's mean and σ its standard deviation.
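The standardization recipe (subtract the mean, divide by the standard deviation) can be checked directly in NumPy. A quick sketch on a hypothetical array:

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standardize: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()
```

The transformed array has mean 0 and standard deviation 1 by construction.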

Why standardize? Here are key points to consider:

  1. It allows for comparison among features with different units or scales.

Using the same housing and income data, standardization facilitates feature comparison and integration into models.

When running models (e.g., logistic regression, SVMs, neural networks), standardized inputs keep the updates to the estimated weights on a comparable scale, leading to more stable training and more accurate outcomes.

Let’s see this in action with Python:

```python
from sklearn import preprocessing

# Retrieve column names
names = df.columns

# Initialize the scaler object
scaler = preprocessing.StandardScaler()

# Fit and transform the data, then rebuild the DataFrame
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
```

The results illustrate that outlier values for both bedrooms and income have been adjusted, yielding a more normal distribution for each feature. While not perfect, the data is in a significantly better state compared to when it was normalized. Given the large discrepancies in scales and units, standardization proves to be a more suitable transformation for this dataset.

  2. Standardization enhances the training process, as it improves the numerical stability of optimization problems.

For instance, in Principal Component Analysis (PCA), accurate interpretation of outputs necessitates centering features around their means. Understanding your goals and the models being utilized is critical for making informed transformation decisions.
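The PCA point can be illustrated directly: without standardization, whichever feature has the largest raw variance dominates the principal components. A small sketch on synthetic data (the two-feature setup and scales are assumptions made for the example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: two independent features on very different scales
X = np.column_stack([
    rng.normal(0, 1, 500),      # small-scale feature
    rng.normal(0, 1000, 500),   # large-scale feature
])

# Without standardization, the large-scale feature dominates
pca_raw = PCA(n_components=2).fit(X)

# With standardization, both features contribute on equal footing
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=2).fit(X_std)
```

On the raw data, the first component explains nearly all the variance simply because of the units; after standardization, the variance splits roughly evenly, which is the behavior you usually want before interpreting PCA outputs.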

However, standardizing data may lead to the loss of some information. If that information is non-essential, this process can be beneficial; otherwise, it might hinder your results.

Bonus: Binning

Before concluding, let’s explore another concept: binning values.

For instance, consider the latitude feature in our dataset, which indicates geographical coordinates. While standardization or normalization could be applied, binning offers an alternative approach.

We can create new columns representing various latitude ranges and encode values in our dataset as either 0 or 1 based on their presence within these ranges.

```python
# Define latitude ranges for the new columns
lat_range = zip(range(32, 44), range(33, 45))
new_df = pd.DataFrame()

# Iterate to create new columns with binary encoding
for r in lat_range:
    new_df["latitude_%d_to_%d" % r] = df["latitude"].apply(
        lambda l: 1.0 if r[0] <= l < r[1] else 0.0)
```

Now that we have binned values, we can assign a binary indicator for each latitude in California. This method provides an additional strategy for cleaning data in preparation for modeling.
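pandas also offers `pd.cut` plus `pd.get_dummies` as a more concise route to the same one-hot bins. A sketch on a few hypothetical latitudes standing in for the real column:

```python
import pandas as pd

# Hypothetical latitude values standing in for df["latitude"]
lat = pd.Series([32.5, 34.1, 37.7, 41.9])

# Bin into one-degree intervals [32, 33), [33, 34), ..., [43, 44)
bins = pd.cut(lat, bins=range(32, 45), right=False)

# One-hot encode the bins: one column per interval,
# a single 1 per row marking the interval the latitude falls in
one_hot = pd.get_dummies(bins)
```

Each row ends up with exactly one active bin, matching the manual loop above while leaving the interval labels to pandas.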

As always, I hope this discussion clarifies some concepts and offers practical examples for you to explore.

Cheers!

