Data Normalization Techniques
A first step in making sense of data
Decimal Scaling
Divide each value by a power of 10 (the smallest power that brings the maximum absolute value below 1).
v’ = v / 10^j
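A minimal NumPy sketch of decimal scaling (the function name decimal_scale is illustrative, not from the source):

```python
import numpy as np

def decimal_scale(x):
    """Divide by 10^j, where j is the smallest integer that brings
    the maximum absolute value below 1."""
    x = np.asarray(x, dtype=float)
    j = int(np.floor(np.log10(np.max(np.abs(x))))) + 1
    return x / (10 ** j)

print(decimal_scale([120, -48, 7]))  # j = 3 -> [ 0.12  -0.048  0.007]
```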
Min-Max Normalization (scaling to a range)
The minimum gets transformed to 0 and the maximum gets transformed to 1.
v’ = (v - min) / (max - min)
Guarantees that all features will have the exact same scale, but does not handle outliers well.
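A minimal NumPy sketch of the formula above (the function name is illustrative):

```python
import numpy as np

def min_max_normalize(x):
    """Map values to [0, 1]: the minimum becomes 0, the maximum becomes 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

print(min_max_normalize([2, 4, 6, 10]))  # [0.   0.25 0.5  1.  ]
```

Note how a single extreme value would stretch the denominator and squash every other value toward 0, which is why outliers are a problem here.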
Z-Score Normalization (Standardization)
Z-score is a variation of scaling that represents the number of standard deviations away from the mean.
v’ = (v - mean) / standard deviation
Handles outliers better than min-max, but does not produce normalized data with the exact same scale across features.
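A matching z-score sketch (again, the function name is illustrative):

```python
import numpy as np

def z_score_normalize(x):
    """Standardize: each value becomes its distance from the mean,
    measured in standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

print(z_score_normalize([1, 2, 3, 4, 100]))
# roughly [-0.54 -0.51 -0.49 -0.46  2.0]: the outlier no longer dictates
# the scale of the remaining values the way it does with min-max
```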
Log Transform
The log transform decreases the effect of outliers by compressing large magnitude differences, making the model more robust.
v’ = log(v)
Log scaling is helpful when a handful of your values have many points while most other values have few points. This data distribution is known as a power-law distribution.
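A quick NumPy illustration of how the log compresses a range spanning several orders of magnitude:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1_000.0, 100_000.0])
print(np.log(x))  # roughly [ 0.    2.303  4.605  6.908 11.513]:
                  # five orders of magnitude collapse into a small, near-linear scale
# np.log1p(x) computes log(1 + v), a common variant when the data contains zeros
```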
Feature Clipping
If your data set contains extreme outliers, you might try feature clipping, which caps all feature values above (or below) a certain threshold at a fixed value.
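For example, with NumPy (the bounds 0 and 40 are arbitrary choices for the sketch):

```python
import numpy as np

x = np.array([-5.0, 12.0, 40.1, 3.2, 97.0])
clipped = np.clip(x, 0.0, 40.0)  # values below 0 become 0, values above 40 become 40
print(clipped)  # [ 0.  12.  40.   3.2 40. ]
```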
Buckets with equally spaced boundaries
The boundaries are fixed, and each bucket spans the same range of values.
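A sketch using np.digitize with equally spaced boundaries (the range [0, 20] and the bucket count are assumptions):

```python
import numpy as np

x = np.array([2.0, 7.5, 11.0, 14.9, 19.3])
edges = np.linspace(0, 20, 5)          # boundaries 0, 5, 10, 15, 20 -> four buckets of width 5
buckets = np.digitize(x, edges[1:-1])  # interior boundaries map each value to a bucket index
print(buckets)  # [0 1 2 2 3]
```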
Buckets with quantile boundaries (quantile bucketing)
Creating buckets that each contain roughly the same number of points.
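A matching sketch where the boundaries come from quantiles instead, so the buckets hold roughly equal counts even on skewed data:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 8.0, 9.0, 50.0, 100.0])
edges = np.quantile(x, [0.25, 0.5, 0.75])  # boundaries at the quartiles
print(edges)                               # [ 2.    5.5  19.25]
print(np.digitize(x, edges))               # [0 1 1 1 2 2 3 3]
```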
Spotting Outliers with the Interquartile Range (IQR)
IQR = Q3 - Q1
Lower Outlier Limit = Q1 - 1.5 * IQR
Upper Outlier Limit = Q3 + 1.5 * IQR
Example:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Q1 (25th percentile) = 14.4
Q2 (50th percentile) = 14.6
Q3 (75th percentile) = 14.9
IQR = 0.5, Lower Limit = 13.65, Upper Limit = 15.65, so 10.2, 15.9, and 16.4 are flagged as outliers.
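The worked example above, reproduced with Python's standard library. Note that quartile conventions vary across libraries: statistics.quantiles' default 'exclusive' method matches the hand calculation here, while NumPy's default linear interpolation would give Q3 = 14.8 on this data.

```python
import statistics

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]
q1, q2, q3 = statistics.quantiles(data, n=4)        # default 'exclusive' method
iqr = q3 - q1                                       # 0.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, q2, q3)                                   # 14.4 14.6 14.9
print(round(lower, 2), round(upper, 2))             # 13.65 15.65
print([v for v in data if v < lower or v > upper])  # [10.2, 15.9, 16.4]
```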