# Understanding Normalization and Standardization in Python

Normalization and Standardization are two commonly confused concepts in data preprocessing. Understanding when and how to apply each method is crucial for effective data analysis.

To begin with, let's examine **normalization**. This process rescales your data so that every value falls within the range of 0 to 1, accomplished using the following formula:

`X_norm = (X - X_min) / (X_max - X_min)`

We'll implement this using Python with a dataset that's readily accessible.

```python
from sklearn import preprocessing
import numpy as np
import pandas as pd

# Load the California housing dataset
df = pd.read_csv("https://storage.googleapis.com/mledudatasets/california_housing_train.csv", sep=",")

# Normalize the 'total_bedrooms' column
# (note: preprocessing.normalize rescales the vector to unit L2 norm by default,
# which keeps non-negative values within [0, 1] but is not min-max scaling)
x_array = np.array(df['total_bedrooms'])
normalized_X = preprocessing.normalize([x_array])
```
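If you specifically want the 0-to-1 min-max formula shown above, scikit-learn's `MinMaxScaler` implements it directly. A minimal sketch with synthetic values standing in for the bedroom counts:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in values for a feature like df['total_bedrooms']
x = np.array([[2.0], [5.0], [10.0], [1.0]])

# MinMaxScaler applies (x - x_min) / (x_max - x_min) per column
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x)

print(x_scaled.ravel())  # the minimum maps to 0.0 and the maximum to 1.0
```

Unlike `preprocessing.normalize`, which operates on the vector's norm, `MinMaxScaler` preserves the shape of the distribution while squeezing it into [0, 1].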

Why normalize? Here are a few reasons:

- Normalization reduces sensitivity to the scale of features during training, enabling more accurate coefficient estimation.

For instance, in the California housing dataset, features like the **number of bedrooms** and **median household income** possess varying units and scales, which can complicate analysis if not addressed.

Let's observe these features without normalization.

The visualizations reveal unusual patterns, such as an implausible number of **bedrooms** exceeding 1000, along with significant outliers and binning discrepancies. Income also clusters at $500,000, suggesting that those above this threshold are categorized together.

Now, let's apply normalization.

After normalization, all values are confined to the 0-1 range, and outliers are less pronounced, improving feature consistency for evaluating future model outputs.

- Using normalization enhances analysis across multiple models.

Applying many algorithms to unnormalized data can hinder convergence because the features sit on very different scales. Normalization conditions the data so that optimization converges more readily.

- Normalizing mitigates variance issues during convergence, making optimization viable.

However, it's essential to note that there are scenarios where normalization might not be appropriate. For example, if the data is already proportional, normalization might yield inaccurate estimators. Alternatively, if the relative scales of features are significant, retaining those scales could be necessary. Understanding your data and the transformations required to achieve your analytical objectives is paramount.

Additionally, some argue that centering input values around 0 (standardization) is preferable to scaling them between 0 and 1. Conducting thorough research will help clarify the specific data requirements for your models.

Now that we’ve clarified normalization, let’s delve into **standardization**. This technique rescales data so that the mean is 0 and the standard deviation is 1, following this formula:

`z = (x - μ) / σ`

where μ is the feature's mean and σ is its standard deviation.
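A quick sanity check of the z-score computation with NumPy, using synthetic values (not drawn from the housing dataset):

```python
import numpy as np

# Illustrative feature values
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardize: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0.0
print(z.std())   # 1.0
```

Whatever the original units, the standardized feature always ends up with mean 0 and standard deviation 1.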

Why standardize? Here are key points to consider:

- It allows for comparison among features with different units or scales.

Using the same housing and income data, standardization facilitates feature comparison and integration into models.

When running models (e.g., logistic regression, SVMs, neural networks), standardized data lets the estimated weights update at comparable rates, which typically speeds convergence and improves results.

Let’s see this in action with Python:

```python
from sklearn import preprocessing

# Retrieve column names
names = df.columns

# Initialize the scaler object
scaler = preprocessing.StandardScaler()

# Fit and transform the data, then rebuild the DataFrame
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
```

The results illustrate that outlier values for both **bedrooms** and **income** have been adjusted, yielding a more normal distribution for each feature. While not perfect, the data is in a significantly better state compared to when it was normalized. Given the large discrepancies in scales and units, standardization proves to be a more suitable transformation for this dataset.

- Standardization enhances the training process, as it improves the numerical stability of optimization problems.

For instance, in Principal Component Analysis (PCA), accurate interpretation of outputs necessitates centering features around their means. Understanding your goals and the models being utilized is critical for making informed transformation decisions.
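To see why centering matters for PCA, here is a small sketch on synthetic data (the features and scales are invented for illustration): one feature has a much larger scale than the other, and without standardization it dominates the principal components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales
X = np.column_stack([rng.normal(0, 1, 200),      # unit-scale feature
                     rng.normal(0, 1000, 200)])  # large-scale feature

# PCA on raw data: the large-scale feature swallows nearly all the variance
raw = PCA(n_components=2).fit(X)

# PCA after standardization: variance is attributed more evenly
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

print(raw.explained_variance_ratio_)         # first component ~1.0
print(scaled[-1].explained_variance_ratio_)  # roughly balanced components
```

Without scaling, the first component simply points along the large-scale feature, telling you nothing about the data's actual structure.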

However, standardizing data may lead to the loss of some information. If that information is non-essential, this process can be beneficial; otherwise, it might hinder your results.

**Bonus: Binning**

Before concluding, let’s explore another concept: **binning** values.

For instance, consider the **latitude** feature in our dataset, which indicates geographical coordinates. While standardization or normalization could be applied, binning offers an alternative approach.

We can create new columns representing various **latitude** ranges and encode values in our dataset as either 0 or 1 based on their presence within these ranges.

```python
# Define latitude ranges for the new columns
lat_range = zip(range(32, 44), range(33, 45))
new_df = pd.DataFrame()

# Create one binary column per one-degree latitude range
for r in lat_range:
    new_df["latitude_%d_to_%d" % r] = df["latitude"].apply(
        lambda l: 1.0 if r[0] <= l < r[1] else 0.0)
```

Now that we have binned values, we can assign a binary indicator for each **latitude** in California. This method provides an additional strategy for cleaning data in preparation for modeling.
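The same one-hot binning can also be expressed with pandas' built-in utilities. A sketch using `pd.cut` and `pd.get_dummies` on synthetic latitude values (stand-ins for `df["latitude"]`), with bin edges mirroring the 32-to-44 degree ranges above:

```python
import pandas as pd

# Synthetic latitude values for illustration
lat = pd.Series([32.5, 37.1, 41.9, 33.0])

# Cut into one-degree bins [32, 33), [33, 34), ..., [43, 44)
bins = pd.cut(lat, bins=range(32, 45), right=False)

# One-hot encode the bin membership
one_hot = pd.get_dummies(bins, prefix="latitude")

print(one_hot.sum(axis=1).tolist())  # each value lands in exactly one bin
```

`pd.cut` handles the interval logic, so there is no manual loop over ranges, and `right=False` makes the bins left-closed to match the `r[0] <= l < r[1]` condition used earlier.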

As always, I hope this discussion clarifies some concepts and offers practical examples for you to explore.

Cheers!

**Further Reading:**

- What’s the difference between Normalization and Standardization?