Understanding Normalization and Standardization in Python
Normalization and Standardization are two commonly confused concepts in data preprocessing. Understanding when and how to apply each method is crucial for effective data analysis.
To begin with, let's examine normalization. This process rescales your data so that every value falls within the range of 0 to 1, using the following formula:
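x_norm = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum values of the feature, so the smallest value maps to 0 and the largest to 1 (the standard min-max form of normalization).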
We'll implement this using Python with a dataset that's readily accessible.
from sklearn import preprocessing
import numpy as np
import pandas as pd
# Load dataset
df = pd.read_csv("https://storage.googleapis.com/mledudatasets/california_housing_train.csv", sep=",")
# Normalize the 'total_bedrooms' column
# (sklearn's normalize divides the vector by its Euclidean norm, so every
# value of this non-negative feature ends up between 0 and 1)
x_array = np.array(df['total_bedrooms'])
normalized_X = preprocessing.normalize([x_array])
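As a quick, illustrative sanity check (not part of the original snippet), we can confirm the rescaled values land in the expected range:

# normalized_X has shape (1, n); every value should lie between 0 and 1
print(normalized_X.min(), normalized_X.max())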
Why normalize? Here are a few reasons:
- Normalization reduces sensitivity to the scale of features during training, enabling more accurate coefficient estimation.
For instance, in the California housing dataset, features like the number of bedrooms and median household income possess varying units and scales, which can complicate analysis if not addressed.
Let's observe these features without normalization.
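One way such plots might be produced (a sketch assuming matplotlib is installed; the original figures may have been styled differently) is with simple histograms of the raw columns:

import matplotlib.pyplot as plt
# Histograms of the raw features, before any rescaling
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['total_bedrooms'].hist(ax=axes[0], bins=50)
axes[0].set_title('total_bedrooms (raw)')
df['median_income'].hist(ax=axes[1], bins=50)
axes[1].set_title('median_income (raw)')
plt.show()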
The visualizations reveal unusual patterns, such as an implausible number of bedrooms exceeding 1000, along with significant outliers and binning discrepancies. Income also clusters at $500,000, suggesting that those above this threshold are categorized together.
Now, let's apply normalization.
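To reproduce the "after" view, one sketch (reusing the same preprocessing.normalize call from above) is to normalize both columns and plot them side by side:

# Normalize both columns and plot the rescaled distributions
bedrooms_norm = preprocessing.normalize([np.array(df['total_bedrooms'])])[0]
income_norm = preprocessing.normalize([np.array(df['median_income'])])[0]
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(bedrooms_norm, bins=50)
axes[0].set_title('total_bedrooms (normalized)')
axes[1].hist(income_norm, bins=50)
axes[1].set_title('median_income (normalized)')
plt.show()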
After normalization, all values fall within the 0-1 range and the outliers are far less pronounced, which makes the features more consistent and easier to compare when we later evaluate model outputs.
- Normalization puts features on a comparable footing, which helps when trying out multiple models.
If we fed the raw, differently scaled features to these algorithms, the scaling discrepancies could slow or prevent convergence. Normalization leaves the data better conditioned, so optimization converges more reliably.
- Normalizing keeps the variance of the optimization problem from blowing up across features, which keeps convergence feasible.
However, it's essential to note that there are scenarios where normalization might not be appropriate. For example, if the data is already proportional, normalization might yield inaccurate estimators. Alternatively, if the relative scales of features are significant, retaining those scales could be necessary. Understanding your data and the transformations required to achieve your analytical objectives is paramount.
Additionally, some argue that centering input values around 0 (standardization) is preferable to scaling them between 0 and 1. Conducting thorough research will help clarify the specific data requirements for your models.
Now that we’ve clarified normalization, let’s delve into standardization. This technique rescales data so that the mean is 0 and the standard deviation is 1, following this formula:
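x_standardized = (x - μ) / σ

where μ is the mean of the feature and σ its standard deviation, so the rescaled values have mean 0 and standard deviation 1 (the usual z-score form).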
Why standardize? Here are key points to consider:
- It allows for comparison among features with different units or scales.
Using the same housing and income data, standardization facilitates feature comparison and integration into models.
When running models (e.g., logistic regression, SVMs, neural networks), standardized data means the estimated weights are updated on comparable scales, which typically produces more accurate results.
Let’s see this in action with Python:
from sklearn import preprocessing
# Retrieve column names
names = df.columns
# Initialize the Scaler object
scaler = preprocessing.StandardScaler()
# Fit the scaler to the data and transform it
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
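As a quick check (illustrative, not part of the original walkthrough), each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# Means should be ~0 and standard deviations ~1 after scaling
print(scaled_df[['total_bedrooms', 'median_income']].mean())
print(scaled_df[['total_bedrooms', 'median_income']].std())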
The results illustrate that outlier values for both bedrooms and income have been adjusted, yielding a more normal distribution for each feature. While not perfect, the data is in a significantly better state compared to when it was normalized. Given the large discrepancies in scales and units, standardization proves to be a more suitable transformation for this dataset.
- Standardization enhances the training process, as it improves the numerical stability of optimization problems.
For instance, in Principal Component Analysis (PCA), accurate interpretation of outputs necessitates centering features around their means. Understanding your goals and the models being utilized is critical for making informed transformation decisions.
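As a minimal sketch of that point (assuming scikit-learn's PCA and the scaled_df produced above), PCA can be run directly on the standardized data because its columns are already centered:

from sklearn.decomposition import PCA
# PCA expects centered features; standardization already provides that
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_df)
print(pca.explained_variance_ratio_)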
However, standardizing data may lead to the loss of some information. If that information is non-essential, this process can be beneficial; otherwise, it might hinder your results.
Bonus: Binning
Before concluding, let’s explore another concept: binning values.
For instance, consider the latitude feature in our dataset, which indicates geographical coordinates. While standardization or normalization could be applied, binning offers an alternative approach.
We can create new columns representing various latitude ranges and encode values in our dataset as either 0 or 1 based on their presence within these ranges.
# Define latitude range for new columns
lat_range = zip(range(32, 44), range(33, 45))
new_df = pd.DataFrame()
# Iterate to create new columns with binary encoding
for r in lat_range:
    new_df["latitude_%d_to_%d" % r] = df["latitude"].apply(
        lambda l: 1.0 if r[0] <= l < r[1] else 0.0)
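To see how the rows spread across those bins (a quick illustrative check, not part of the original code), we can sum each indicator column:

# Count how many rows fall into each latitude bin
print(new_df.sum())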
With these binned values, each row now carries a binary indicator for the latitude range it falls into. This gives us one more strategy for preparing data before modeling.
As always, I hope this discussion clarifies some concepts and offers practical examples for you to explore.
Cheers!