
Understanding Normalization and Standardization in Python


Normalization and Standardization are two commonly confused concepts in data preprocessing. Understanding when and how to apply each method is crucial for effective data analysis.

To begin with, let's examine normalization. This process rescales your data, ensuring that any specific value falls within the range of 0 to 1, accomplished using the following formula:
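X_norm = (X - X_min) / (X_max - X_min)

where X_min and X_max are the minimum and maximum values of the feature.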

We'll implement this using Python with a dataset that's readily accessible.

from sklearn import preprocessing
import numpy as np
import pandas as pd

# Load the California housing dataset
df = pd.read_csv("https://storage.googleapis.com/mledudatasets/california_housing_train.csv", sep=",")

# Normalize the 'total_bedrooms' column
x_array = np.array(df['total_bedrooms'])
normalized_X = preprocessing.normalize([x_array])
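Note that preprocessing.normalize rescales the vector to unit norm rather than applying the min-max formula above; because the bedroom counts are non-negative, the result still falls between 0 and 1. If you want min-max scaling exactly as the formula describes, scikit-learn's MinMaxScaler is the more direct tool. A minimal sketch under that assumption:

from sklearn.preprocessing import MinMaxScaler

# Min-max scale 'total_bedrooms' to the [0, 1] range
min_max_scaler = MinMaxScaler()
bedrooms_min_max = min_max_scaler.fit_transform(df[['total_bedrooms']])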

Why normalize? Here are a few reasons:

  1. Normalization reduces sensitivity to the scale of features during training, enabling more accurate coefficient estimation.

For instance, in the California housing dataset, features like the number of bedrooms and median household income possess varying units and scales, which can complicate analysis if not addressed.

Let's observe these features without normalization.
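For example, here is a minimal matplotlib sketch of the raw distributions, assuming the total_bedrooms and median_income columns from the dataset loaded above:

import matplotlib.pyplot as plt

# Histograms of the raw, un-normalized features
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['total_bedrooms'].hist(ax=axes[0], bins=50)
axes[0].set_title('total_bedrooms')
df['median_income'].hist(ax=axes[1], bins=50)
axes[1].set_title('median_income')
plt.show()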

The visualizations reveal unusual patterns, such as an implausible number of bedrooms exceeding 1000, along with significant outliers and binning discrepancies. Income also clusters at $500,000, suggesting that those above this threshold are categorized together.

Now, let's apply normalization.

After normalization, all values are confined to the 0-1 range, and outliers are less pronounced, improving feature consistency for evaluating future model outputs.
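A quick check (a minimal sketch) confirms that the rescaled values stay within that range:

# Verify the normalized values lie within [0, 1]
print(normalized_X.min(), normalized_X.max())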

  2. Using normalization enhances analysis across multiple models.

If we apply these algorithms to un-normalized data, the differing feature scales can hinder convergence. Normalization conditions the data so that optimization converges more readily.

  3. Normalizing mitigates variance problems caused by differing feature scales, keeping the optimization well conditioned so that it can converge.

However, it's essential to note that there are scenarios where normalization might not be appropriate. For example, if the data is already proportional, normalization might yield inaccurate estimators. Alternatively, if the relative scales of features are significant, retaining those scales could be necessary. Understanding your data and the transformations required to achieve your analytical objectives is paramount.

Additionally, some argue that centering input values around 0 (standardization) is preferable to scaling them between 0 and 1. Conducting thorough research will help clarify the specific data requirements for your models.

Now that we’ve clarified normalization, let’s delve into standardization. This technique rescales data so that the mean is 0 and the standard deviation is 1, following this formula:
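z = (x - μ) / σ

where μ is the mean of the feature and σ is its standard deviation.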

Why standardize? Here are key points to consider:

  1. It allows for comparison among features with different units or scales.

Using the same housing and income data, standardization facilitates feature comparison and integration into models.

When running models (e.g., logistic regression, SVMs, neural networks), standardized data ensures that the estimated weights update at comparable rates across features, leading to more stable and reliable results.

Let’s see this in action with Python:

from sklearn import preprocessing

# Retrieve column names
names = df.columns

# Initialize the Scaler object
scaler = preprocessing.StandardScaler()

# Fit the scaler and transform the data
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
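A quick sanity check (a minimal sketch) shows that each column now has a mean near 0 and a standard deviation near 1:

# Each standardized column should have mean ~0 and standard deviation ~1
print(scaled_df.mean().round(3))
print(scaled_df.std().round(3))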

The results illustrate that outlier values for both bedrooms and income have been pulled in, yielding distributions that are centered and on a comparable scale for each feature. While not perfect, the data is in a significantly better state than it was after normalization. Given the large discrepancies in scales and units, standardization proves to be a more suitable transformation for this dataset.

  2. Standardization enhances the training process, as it improves the numerical stability of optimization problems.

For instance, in Principal Component Analysis (PCA), accurate interpretation of outputs necessitates centering features around their means. Understanding your goals and the models being utilized is critical for making informed transformation decisions.
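As a minimal sketch of that workflow (the choice of two components here is only for illustration), you might standardize the features before fitting PCA:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA expects mean-centered features, so standardize first
features = StandardScaler().fit_transform(df)
pca = PCA(n_components=2)
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)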

However, standardizing data may lead to the loss of some information. If that information is non-essential, this process can be beneficial; otherwise, it might hinder your results.

Bonus: Binning

Before concluding, let’s explore another concept: binning values.

For instance, consider the latitude feature in our dataset, which indicates geographical coordinates. While standardization or normalization could be applied, binning offers an alternative approach.

We can create new columns representing various latitude ranges and encode values in our dataset as either 0 or 1 based on their presence within these ranges.

# Define latitude ranges for the new columns
lat_range = zip(range(32, 44), range(33, 45))

new_df = pd.DataFrame()

# Iterate to create new columns with binary encoding
for r in lat_range:
    new_df["latitude_%d_to_%d" % r] = df["latitude"].apply(
        lambda l: 1.0 if r[0] <= l < r[1] else 0.0)

Now that we have binned values, we can assign a binary indicator for each latitude in California. This method provides an additional strategy for cleaning data in preparation for modeling.
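If you prefer pandas built-ins, pd.cut combined with pd.get_dummies gives an equivalent one-hot binning. A minimal sketch, with the caveat that pd.cut uses right-closed intervals by default, so the boundary handling differs slightly from the loop above:

# Alternative: bin latitude with pandas and one-hot encode the bins
lat_bins = pd.cut(df["latitude"], bins=range(32, 45))
lat_dummies = pd.get_dummies(lat_bins, prefix="latitude")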

As always, I hope this discussion clarifies some concepts and offers practical examples for you to explore.

Cheers!

