Feature Engineering
Understand your data by asking the following questions:
- What are the most important features in your data?
- Do I need to transform these features?
- Does the data set have missing data, such as missing column values or a low number of records for positive cases (unbalanced data)?
- Do I need to create new features based on existing features?
If the data set is not properly understood at a conceptual level, the AI model will not yield the desired results.
Handling missing column values in your dataset
a. Fill in the missing values with the mean of the other values in the same column. This is fast and easy.
b. Or use the median when the column has a lot of outliers.
c. Or replace the missing values with the most frequently occurring value.
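The three options above can be sketched with pandas as follows. This is a minimal illustration only: the DataFrame, its column names ("age", "city") and its values are made up for the example and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' contains an outlier (110), 'city' is categorical.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 110.0],
    "city": ["Pune", "Delhi", "Pune", None, "Pune"],
})

# a. Mean imputation: fast, but the outlier pulls the fill value upward.
age_mean_filled = df["age"].fillna(df["age"].mean())

# b. Median imputation: more robust when outliers are present.
age_median_filled = df["age"].fillna(df["age"].median())

# c. Most frequent value: a common choice for categorical columns.
city_mode_filled = df["city"].fillna(df["city"].mode()[0])
```

In this toy column the mean fill is 50.0 while the median fill is 32.5, which shows how a single outlier can distort mean imputation.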
Handling Unbalanced data
a. Oversampling: duplicate the existing positive records to create more records of the class you want to detect.
b. Undersampling: remove existing negative records to balance the dataset. Be careful when throwing data away. This may be useful when hardware limits make the full dataset hard to process.
c. SMOTE (Synthetic Minority Over-sampling Technique): generate new synthetic samples using methods like k-nearest neighbors.
d. Adjust the decision threshold when predicting.
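Options a, b and d can be sketched with plain NumPy as shown here. The labels, features and model scores are hypothetical, and the thresholds 0.5 and 0.3 are arbitrary values chosen for illustration. SMOTE is not shown because it usually comes from a separate library (the imbalanced-learn package provides an implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 95 negatives (0) and 5 positives (1), 3 features each.
y = np.array([0] * 95 + [1] * 5)
X = rng.normal(size=(100, 3))

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# a. Oversampling: draw positives with replacement until the classes match.
extra_pos = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
over_idx = np.concatenate([neg_idx, pos_idx, extra_pos])
X_over, y_over = X[over_idx], y[over_idx]

# b. Undersampling: keep a random subset of negatives and discard the rest.
kept_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([kept_neg, pos_idx])
X_under, y_under = X[under_idx], y[under_idx]

# d. Threshold adjustment: a lower cut-off flags more records as positive.
scores = np.array([0.10, 0.35, 0.45, 0.80])  # hypothetical model probabilities
pred_default = (scores >= 0.5).astype(int)
pred_lowered = (scores >= 0.3).astype(int)
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class ratio.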
Identifying and handling outliers
a. Calculate the variance and standard deviation, and mark data points that are more than 1 or 2 SD away from the mean.
b. Or use AWS Random Cut Forest to identify outliers.
c. Or identify outliers with the IQR (interquartile range).
d. Remove the outliers, but do it carefully. Alternatively, cap them: replace all values greater than the IQR upper limit with that upper limit, and all values lower than the IQR lower limit with that lower limit.
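The standard-deviation rule, the IQR rule and capping can be sketched with NumPy as follows. The sample values are made up, and the 1.5 x IQR multiplier is the common convention for the limits, assumed here because the notes do not state one.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])  # hypothetical

# a. Standard-deviation rule: flag points more than k SDs from the mean.
k = 2
z = (values - values.mean()) / values.std()
sd_outliers = np.abs(z) > k

# c. IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (values < lower) | (values > upper)

# d. Capping instead of removing: clip every value to the IQR limits.
capped = np.clip(values, lower, upper)
```

Both rules flag only the value 95 in this sample, and capping replaces it with the upper limit of 14.75 rather than dropping the record.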
Binning
- Bucket data based on range of values
- Quantile binning: the same number of samples in each bucket
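The two binning styles can be compared using pandas in a short sketch. The ages, bin edges and labels are hypothetical choices for illustration.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 41, 58, 63, 79])  # hypothetical ages

# Range-based binning: fixed edges, so bucket sizes can be uneven.
range_bins = pd.cut(
    ages,
    bins=[0, 18, 40, 65, 100],
    labels=["child", "young", "middle", "senior"],
)

# Quantile binning: edges are chosen so each bucket holds the same count.
quantile_bins = pd.qcut(ages, q=4, labels=["q1", "q2", "q3", "q4"])
```

With these values the range-based buckets hold 2, 2, 3 and 1 samples, while every quantile bucket holds exactly 2.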
Transforming
a. Apply a function to make the data more suitable for ML, e.g. apply a log transform to an exponential trend.
b. Encoding: encode the data in the form your model requires, e.g. one-hot encoding.
c. Scaling or normalizing: e.g. scikit-learn's MinMaxScaler.
d. Shuffling: randomize the order of your dataset.
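Each of the four transformations can be sketched in a few lines of NumPy. All arrays below are invented for illustration. The scaling is written out by hand to show the arithmetic, and it matches the formula scikit-learn's MinMaxScaler uses with its default [0, 1] range.

```python
import numpy as np

rng = np.random.default_rng(42)

# a. Log transform: an exponential trend becomes a linear one.
sales = np.array([1.0, 10.0, 100.0, 1000.0])
log_sales = np.log10(sales)

# b. One-hot encoding: one 0/1 column per category (sorted alphabetically).
colors = np.array(["red", "green", "blue", "green"])
categories, codes = np.unique(colors, return_inverse=True)
one_hot = np.eye(len(categories), dtype=int)[codes]

# c. Min-max scaling: rescale values into the [0, 1] range.
heights = np.array([150.0, 160.0, 170.0, 190.0])
scaled = (heights - heights.min()) / (heights.max() - heights.min())

# d. Shuffling: permute the rows so their order carries no information.
order = rng.permutation(len(heights))
shuffled = heights[order]
```

When features and labels live in separate arrays, the same `order` index should be applied to both so the rows stay aligned.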
Introduce Cyclic Features
Many features are cyclic in nature, e.g. they depend on the month, time of day, or day of the week. Examples: sweater sales in cold weather, bike rentals in warm seasons.
Treat each value as a point on a circle and compute the x- and y-components of that point using the sine and cosine trigonometric functions. Add these new features to the regression model.
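import numpy as np
# df is assumed to be a pandas DataFrame with 'hr' and 'mnth' columns
df['hr_sin'] = np.sin(df.hr * (2. * np.pi / 24))
df['hr_cos'] = np.cos(df.hr * (2. * np.pi / 24))
df['mnth_sin'] = np.sin(df.mnth * (2. * np.pi / 12))
df['mnth_cos'] = np.cos(df.mnth * (2. * np.pi / 12))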
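A small check illustrates why the sine/cosine features above help. As raw numbers, hour 23 and hour 0 look far apart, but on the circle they are neighbors. The helper name `encode_cyclic` is hypothetical and introduced only for this sketch.

```python
import numpy as np

def encode_cyclic(value, period):
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

# Hours 23 and 0 are one hour apart, but 23 units apart as raw numbers.
raw_gap = abs(23 - 0)
encoded_gap = np.linalg.norm(encode_cyclic(23, 24) - encode_cyclic(0, 24))

# Any two consecutive hours end up the same distance apart.
one_hour_gap = np.linalg.norm(encode_cyclic(1, 24) - encode_cyclic(0, 24))

# Hours 0 and 12 sit on opposite sides of the circle.
opposite_gap = np.linalg.norm(encode_cyclic(12, 24) - encode_cyclic(0, 24))
```

Both components are needed: sine alone cannot tell apart two hours that share the same sine value, such as 0 and 12.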