Data Normalization Techniques


The first step in making sense of data

Decimal scaling

Divide every value by 10^j, where j is the smallest integer such that the maximum absolute value falls below 1.
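
A minimal NumPy sketch (the values here are made up for illustration):

```python
import numpy as np

v = np.array([120.0, 345.0, -987.0, 42.0])

# Divide by the smallest power of ten that pushes every |value| below 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_scaled = v / 10 ** j  # e.g. -987.0 -> -0.987
```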

Min-Max Normalization (scaling to a range)

The minimum gets transformed to 0 and the maximum gets transformed to 1.

v’ = (v - min) / (max - min)

Guarantees all features will have the exact same scale but does not handle outliers well
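
A quick NumPy sketch (the helper name min_max_normalize is my own):

```python
import numpy as np

def min_max_normalize(v):
    """Map the minimum to 0 and the maximum to 1."""
    v = np.asarray(v, dtype=float)
    # Assumes v is not constant (max > min), otherwise this divides by zero
    return (v - v.min()) / (v.max() - v.min())

min_max_normalize([10.2, 14.4, 16.4])  # -> [0.0, 0.677..., 1.0]
```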

Z-Score Normalization (Standardization)

Z-score is a variation of scaling that represents the number of standard deviations away from the mean.

v’ = (v - mean) / standard deviation

Less sensitive to outliers than min-max scaling, but does not produce normalized data with the exact same scale across features
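
The same idea as a sketch (again with an illustrative helper name):

```python
import numpy as np

def z_score_normalize(v):
    """Shift to mean 0 and scale to standard deviation 1."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()  # assumes v.std() > 0

z_score_normalize([10.0, 20.0, 30.0])  # -> [-1.2247..., 0.0, 1.2247...]
```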

Log Transform

The log transform decreases the effect of outliers by compressing large magnitude differences, which can make the model more robust.

v’ = log(v)

Log scaling is helpful when a handful of your values have many points while most other values have few. This data distribution is known as a power law distribution.

Despite the common belief that the log transformation can decrease the variability of data and make data conform more closely to the normal distribution, this is usually not the case.
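
A minimal sketch; np.log needs strictly positive inputs, and np.log1p is a common variant when zeros can occur:

```python
import numpy as np

v = np.array([1.0, 10.0, 100.0, 10_000.0])

v_log = np.log(v)      # natural log; inputs must be > 0
v_log1p = np.log1p(v)  # log(1 + v), safe when values can be exactly 0
```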

Feature Clipping

If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain threshold to a fixed value.
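
A one-line sketch with np.clip (the cap values are arbitrary, chosen only for illustration):

```python
import numpy as np

v = np.array([2.0, 40.0, 55.0, 63.0, 900.0])

# Cap everything below 10 at 10 and everything above 100 at 100
v_clipped = np.clip(v, 10.0, 100.0)  # -> [10., 40., 55., 63., 100.]
```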

Buckets with equally spaced boundaries

The boundaries are fixed and each bucket covers the same range of values.
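
A sketch using np.linspace for the equally spaced boundaries and np.digitize for bucket assignment:

```python
import numpy as np

v = np.array([10.2, 14.4, 14.6, 14.9, 16.4])

# Four buckets with equally spaced boundaries over [min, max]
edges = np.linspace(v.min(), v.max(), num=5)
bucket = np.digitize(v, edges[1:-1])  # bucket index per value -> [0, 2, 2, 3, 3]
```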

Buckets with Quantile Bucketing

Create buckets that each contain (approximately) the same number of points.
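
pandas implements this directly as pd.qcut; a minimal sketch with illustrative values:

```python
import pandas as pd

v = [10.2, 14.1, 14.4, 14.5, 14.6, 14.7, 14.9, 16.4]

# Four buckets, each holding (as close as possible) the same number of points
bucket = pd.qcut(v, q=4, labels=False)  # -> [0, 0, 1, 1, 2, 2, 3, 3]
```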

Spotting Outliers with Inter-Quartile Range

IQR = Q3 - Q1

Lower Outlier Limit = Q1 - 1.5 * IQR

Upper Outlier Limit = Q3 + 1.5 * IQR

Example:

10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

Q1 (25th percentile) = 14.4

Q2 (50th percentile) = 14.6

Q3 (75th percentile) = 14.9

IQR = 0.5, Lower Limit = 13.65, Upper Limit = 15.65, so 10.2, 15.9, and 16.4 fall outside the limits and are flagged as outliers.
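
The whole example can be reproduced with Python's statistics.quantiles, whose default "exclusive" method gives the same quartiles as the hand calculation above:

```python
from statistics import quantiles

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

q1, q2, q3 = quantiles(data, n=4)  # 14.4, 14.6, 14.9
iqr = q3 - q1                      # 0.5
lower = q1 - 1.5 * iqr             # 13.65
upper = q3 + 1.5 * iqr             # 15.65

outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [10.2, 15.9, 16.4]
```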
