Data Scaling for Beginners

How to scale your data to render it suitable for model building

Benjamin Obi Tayo Ph.D.

--


About the Author

Benjamin O. Tayo is a data science educator, tutor, coach, mentor, and consultant. Contact me for more information about our services and pricing: benjaminobi@gmail.com

Dr. Tayo has written close to 300 articles and tutorials in data science for educating the general public. Support Dr. Tayo’s educational mission using the links below:

PayPal: https://www.paypal.me/BenjaminTayo

CashApp: https://cash.app/$BenjaminTayo

INTRODUCTION

In the machine learning process, data scaling falls under data preprocessing, or feature engineering. Scaling your data before using it for model building can accomplish the following:

  • Scaling ensures that features have values in the same range
  • Scaling ensures that the features used in model building are dimensionless
  • Scaling can be used for detecting outliers
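To illustrate the first point, here is a minimal sketch (with made-up feature values) of two features on very different scales, the situation scaling is meant to fix:

```python
import numpy as np

# Hypothetical data: two features on very different scales
age = np.array([23.0, 35.0, 41.0, 52.0, 29.0])                    # years, tens
income = np.array([48000.0, 72000.0, 91000.0, 60000.0, 55000.0])  # dollars, tens of thousands

# Without scaling, distance-based models are dominated by the feature
# with the larger range
print(age.max() - age.min())        # range of age: 29.0
print(income.max() - income.min())  # range of income: 43000.0, over 1000x larger
```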

There are several methods for scaling data; the two most widely used are normalization and standardization.

Data Scaling Using Normalization

When data is scaled using normalization, the transformed data is calculated using this equation:

Xnorm = (X - Xmin) / (Xmax - Xmin)

where Xmin and Xmax are the minimum and maximum values of the data, respectively. The scaled data obtained is in the range [0, 1].
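The formula can be applied directly with NumPy; a minimal sketch using made-up values:

```python
import numpy as np

X = np.array([20.0, 35.0, 50.0, 80.0])

# Min-max normalization: (X - Xmin) / (Xmax - Xmin)
X_norm = (X - X.min()) / (X.max() - X.min())

print(X_norm)  # values lie in [0, 1]; the minimum maps to 0, the maximum to 1
```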

Python Implementation of Normalization

Scaling using normalization can be implemented in Python with scikit-learn's MinMaxScaler. (Note that sklearn's Normalizer is a different transform: it rescales each sample to unit norm, rather than mapping each feature to [0, 1].)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_norm = scaler.fit_transform(data)

Let X be a dataset with Xmin = 17.7 and Xmax = 71.4. The data X is shown in the figure below:
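For data in this range, the transformation can be checked by hand; a sketch using a few illustrative values between 17.7 and 71.4:

```python
Xmin, Xmax = 17.7, 71.4

def normalize(x, xmin=Xmin, xmax=Xmax):
    """Min-max normalize a single value to the range [0, 1]."""
    return (x - xmin) / (xmax - xmin)

print(normalize(17.7))   # the minimum maps to 0.0
print(normalize(71.4))   # the maximum maps to 1.0
print(normalize(44.55))  # the midpoint maps to 0.5
```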

Figure 1. Boxplot of data X with values between 17.7 and 71.4. Image by Author.

The normalized X is shown in the figure below:
