How to normalize data in Python without sklearn

Educative Answers Team

Inhaltsverzeichnis Show

Introduction
What is normalization
Normalization example
How to normalize data in Python

Left: Original Data, Right: Normalized Data

from sklearn import preprocessing import numpy as np a = np.random.random((1, 4)) a = a*20 print("Data = ", a) # normalize the data attributes normalized = preprocessing.normalize(a) print("Normalized Data = ", normalized)

RELATED TAGS

normalization

machine learning

python

Two of the best ways to normalize the input data in python with sklearn

Photo by Luke Chesser on Unsplash

In this post, I’ll try to explain how to Normalize the data in the simplest way possible.

Normalization is the process of scaling down data(in simple words). Usually while normalizing we change the scale of the data to fall between 0–1.

This article was first published on PyShark , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

In this article we will explore how to normalize data in Python.

Table of Contents

Introduction

One of the first steps in feature engineering for many machine learning models is ensuring that the data is scaled properly.

Some models, such as linear regression, KNN, and SVM, for example, are heavily affected by features with different scales.

While others, such as decision trees, bagging, and boosting algorithms generally do not require any data scaling.

The level of effect of features’ scales on mentioned models is high, and features with larger ranges of values will play a bigger role in the decision making of the algorithm since impacts they produce have larger effect on the outputs.

In such cases, we turn to feature scaling to help us find common level for all these features to be evaluated equally when training the model.

Two most popular feature scaling techniques are:

Z-Score Standardization
Min-Max Normalization

In this article, we will discuss how to perform min-max normalization of data using Python.

To continue following this tutorial we will need the following two Python libraries: sklearn and pandas.

If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following code:

pip install sklearn pip install pandas

What is normalization

In statistics and machine learning, min-max normalization of data is a process of converting original range of data to the range between 0 and 1.

The resulting normalized values represent the original data on 0-1 scale.

This will allow us to compare multiple features together and get more relevant information since now all the data will be on the same scale.

In min-max normalization, for every feature, its minimum value gets transformed into 0 and its maximum value gets transformed into 1. All values in-between get scaled to be within 0-1 range based on the original value relative to minimum and maximum values of the feature.

Suppose you have an array of numbers $A = [v_1, v_2, …, v_i]$.

We will first find the minimum and maximum values of the array: $min_A$ and $max_A$.

Then, using the min and max values we will transform each original value $v_i$ into a min-max normalized value $v’_i$ using the follwoing formula:

$$v’_i = \frac{v_i – min_A}{max_A – min_A}$$

Normalization example

In this section we will take a look at a simple example of data normalization.

Consider the following dataset with prices of different apples:

Weight in g	Price in $
300	3
250	2
800	5

And plotting this dataset should look like this:

Here we see a much larger variation of the weight compare to price, but it appears to looks like this because of different scales of the data.

The prices range is between $2 and $5, whereas the weight range is between 250g and 800g.

Let’s normalize this data!

Start with the weight feature:

Observation	$v_i$	$v_i = \frac{v_i – min_W}{max_W – min_W}$
1	300	$\frac{300-250}{800-250} = 0.09$
2	250	$\frac{250-250}{800-250} = 0$
3	800	$\frac{800-250}{800-250} = 1$

$min_W$	250
$max_W$	800

And do the same for the price feature:

Observation	$v_i$	$v_i = \frac{v_i – min_W}{max_W – min_W}$
1	3	$\frac{3-2}{5-2} = 0.33$
2	2	$\frac{2-2}{5-2} = 0$
3	5	$\frac{5-2}{5-2} = 1$

$min_W$	2
$max_W$	5

And combine the two features into one dataset:

Weight (normalized)	Price (normalized)
0.09	0.33
0	0
1	1

We can now see that the scale of the features in the dataset is very similar, and when visualizing the data, the spread between the points will be smaller:

The graph looks almost identical with the only difference being the scale of the each axis.

Now let’s see how we can recreate this example using Python!