Systematized Predictive Modeling
Preprocessing
- Zero mean (subtract the mean from each predictor) to center the data.
- Divide by standard deviation to scale the data.
- DateTime (parse into components: year, month, day of week, hour)
- One-Hot Encode categorical variables
- Look for skewness; apply a log, square-root, or Box-Cox transform if necessary
- Resolve outliers (and understand their meaning); apply the spatial sign transform if the model is sensitive to outliers
- Eliminate missing data (problematic if missingness is itself predictive; tree-based models can handle missing values directly)
- Imputation/Interpolation (KNN or an intermediate regression model); a sketch follows this list
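
A minimal sketch of these preprocessing steps with scikit-learn; the column names are hypothetical, not part of the checklist:

```python
# Sketch: center/scale numeric columns, one-hot encode categoricals,
# and KNN-impute missing values. Column names are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer

num_cols = ["age", "income"]    # hypothetical numeric predictors
cat_cols = ["city", "device"]   # hypothetical categorical predictors

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", KNNImputer(n_neighbors=5)),  # KNN imputation
        ("scale", StandardScaler()),            # zero mean, unit variance
    ]), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
# preprocess.fit_transform(df) would yield the model-ready matrix.
```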
Exploratory Data Analysis
- Maximal Information Coefficient Matrix / Correlation Matrix
- Box-Plot Everything
- Scatter-Plot Every Pair of Features
- Pivot Tables
- Group by particular features
- Histogram Everything
- Outlier Analysis
- Transform Variables (Square, Cube, Inverse, Log) and Plot
- Summary (Mean, Mode, Minimum, Maximum, Upper/Lower Quartiles, Identify Outliers)
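
A quick pass over most of these checks takes a few lines of pandas/matplotlib; `data.csv` is a hypothetical stand-in for the real dataset:

```python
# Sketch: summary statistics, correlations, histograms, box plots,
# and a scatter matrix for a hypothetical DataFrame.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")     # hypothetical file
num = df.select_dtypes("number")

print(df.describe())             # mean, quartiles, min/max per column
print(num.corr())                # pairwise correlation matrix

num.hist(figsize=(12, 8))        # histogram everything
num.boxplot(figsize=(12, 6))     # box-plot everything
pd.plotting.scatter_matrix(num, figsize=(12, 12))
plt.show()
```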
Data Reduction
- Principal Component Analysis
- Linear Discriminant Analysis (For Classification)
- Feature Selection (use only the components that account for the majority of the variance when modeling)
- Remove Low/Zero Variance Predictors
- Remove multicollinear / heavily correlated features
- Isomap
- Lasso
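
A sketch of two of these steps, near-zero-variance filtering and PCA keeping enough components for ~95% of the variance (the random matrix stands in for real predictors):

```python
# Sketch: drop near-zero-variance predictors, then reduce with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

X = np.random.RandomState(0).randn(200, 30)        # stand-in for real data

X_filtered = VarianceThreshold(threshold=1e-4).fit_transform(X)

pca = PCA(n_components=0.95)                       # keep ~95% of the variance
X_reduced = pca.fit_transform(X_filtered)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```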
Algorithms for Regression
- Linear Regression
- Ridge Regression / Lasso / Elastic Net
- Best Subset Selection
- Forward and Backward Stepwise, Stagewise
- Partial Least Squares
- Principal Components Regression
- Neural Networks
- CNN
- RNN / LSTM
- Multivariate Adaptive Regression Splines
- Support Vector Regressor
- K-Nearest Neighbors
- Regression Decision Trees
- Bagged Trees
- Random Forests
- Extremely Randomized Trees (Extra-Trees)
- Gradient Boosted Trees
- Generalized Linear Model
- Generalized Additive Model
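
One way to screen several of the regressors above at once is cross-validated RMSE; a sketch on synthetic data, with illustrative hyperparameters only:

```python
# Sketch: compare a few regressors with 5-fold cross-validated RMSE.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=0)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "enet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "rf": RandomForestRegressor(n_estimators=200, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "svr": SVR(kernel="rbf", C=1.0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: RMSE = {-scores.mean():.2f}")
```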
Evaluating Regression
- RMSE
- MAE
- Median Absolute Error
- R2
- Visualization
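
Each of these is a one-liner in scikit-learn; a sketch on hypothetical true/predicted values:

```python
# Sketch: RMSE, MAE, median absolute error, R^2, and a predicted-vs-observed plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             median_absolute_error, r2_score)

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical observed values
y_pred = np.array([2.8, 5.4, 2.9, 6.5])   # hypothetical predictions

print("RMSE: ", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:  ", mean_absolute_error(y_true, y_pred))
print("MedAE:", median_absolute_error(y_true, y_pred))
print("R^2:  ", r2_score(y_true, y_pred))

plt.scatter(y_true, y_pred)               # predicted vs. observed
plt.xlabel("observed"); plt.ylabel("predicted")
plt.show()
```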
Algorithms for Classification
- Logistic Regression
- L1, L2, or Elastic Net regularization
- Discriminant Analysis
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- Neural Networks
- CNN
- RNN
- LSTM
- Support Vector Classifier
- K-Nearest Neighbors
- Naive Bayes
- Classification Trees
- Bagged Trees
- Random Forests
- Extremely Randomized Trees (Extra-Trees)
- Gradient Boosted Trees
- Generalized Additive Model
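
A sketch screening a few of the classifiers above by cross-validated AUC on synthetic data:

```python
# Sketch: compare a few classifiers with 5-fold cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

for clf in [LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
            LinearDiscriminantAnalysis(),
            GaussianNB(),
            RandomForestClassifier(n_estimators=200, random_state=0)]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(type(clf).__name__, round(scores.mean(), 3))
```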
Evaluating Classification
- ROC Curve
- Confusion Matrix
- F1 Score
- Heat Map (e.g., of the confusion matrix)
- Overall accuracy rate
- Kappa Statistic
- Sensitivity
- Specificity
- AUC
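
A sketch computing the metrics above from hypothetical labels and predicted scores:

```python
# Sketch: confusion matrix, accuracy, F1, kappa, sensitivity/specificity, AUC.
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, f1_score,
                             cohen_kappa_score, recall_score,
                             roc_auc_score, roc_curve)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                   # hypothetical labels
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])  # model scores
y_pred = (y_score >= 0.5).astype(int)                         # default 0.5 cutoff

print(confusion_matrix(y_true, y_pred))
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("F1:         ", f1_score(y_true, y_pred))
print("kappa:      ", cohen_kappa_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))               # TPR
print("specificity:", recall_score(y_true, y_pred, pos_label=0))  # TNR
print("AUC:        ", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points for the ROC curve
```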
Unsupervised Learning
- K-Means
- K-Means++
- K-Medoids
- Hierarchical Agglomerative Clustering
- Single Linkage, Complete Linkage, Average Linkage, Centroid Criterion
- Principal Components Analysis
- Spectral Clustering
- Affinity Propagation
- Biclustering
- Gaussian Mixture Model
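
A sketch running a few of these on synthetic blobs:

```python
# Sketch: k-means (with k-means++ init), agglomerative clustering,
# and a Gaussian mixture model on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
hac = AgglomerativeClustering(n_clusters=4, linkage="average").fit(X)
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

print(km.labels_[:10])
print(hac.labels_[:10])
print(gmm.predict(X)[:10])   # soft assignments via gmm.predict_proba(X)
```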
Classification Class Imbalance
- Model Tuning (Tune Parameters For Sensitivity)
- Alternate Cutoffs (Using ROC Curve)
- Adjusting Prior Probability
- Unequal Case Weights
- Down-Sampling the majority class
- Up-Sampling the minority class
- Alter Cost Function
- Dynamic Structure (Cascade of classifiers)
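
A sketch of two of these remedies: unequal case weights via `class_weight`, and an alternate cutoff read off the ROC curve (maximizing TPR - FPR, i.e. Youden's J, is one common choice and an assumption here, not part of the checklist):

```python
# Sketch: class weights plus an alternate probability cutoff from the ROC curve.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Imbalanced synthetic data: ~90% majority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

probs = clf.predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, probs)
cutoff = thresholds[np.argmax(tpr - fpr)]   # Youden's J statistic
y_pred = (probs >= cutoff).astype(int)
print("chosen cutoff:", round(float(cutoff), 3))
```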
Feature Evaluation
- Coefficients in Linear Models
- Random Forest Importances (variance reduction for regression, Gini/information gain for classification)
- Pearson Correlation with Outcome
- Maximal Information Coefficient (MIC)
- Distance Correlation
- Model with/without feature
- Randomly shuffle a single feature across data points and measure the drop in model quality (permutation importance)
- Lasso Automatic Selection
- Mean Decrease Accuracy
- Stability Selection
- Recursive Feature Elimination
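
A sketch of two of these: permutation importance (mean decrease in accuracy) and recursive feature elimination:

```python
# Sketch: permutation importance and recursive feature elimination.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X, y)
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)      # score drop when each feature is shuffled

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(rfe.support_)               # boolean mask of the surviving features
```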
Parameter Tuning
- Cross Validation
- Bootstrap
- Grid Search
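
A grid search sketch; the estimator and parameter grid are illustrative only:

```python
# Sketch: exhaustive grid search with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```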
Text Features
- n-Grams
- Word Vector Representations (word2vec)
- Bag of words
- Word counts
- Lengths
- TF-IDF
- Term frequency weighted by inverse document frequency (a term's rarity across documents)
- Topic Modeling (LDA)
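
A sketch of bag-of-words counts (with n-grams) and tf-idf weighting on toy documents:

```python
# Sketch: n-gram counts and tf-idf features from toy documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]

counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)  # unigrams + bigrams
tfidf = TfidfVectorizer().fit_transform(docs)                     # tf-idf weights
print(counts.shape, tfidf.shape)
```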
Modeling Techniques
- Feature Engineering
- Basis Expansions
- Combine Features
- Averages, medians, variances, sums, differences, maximums/minimums, and counts
- Stacking (using output of one algorithm as input to the next)
- Internal Prediction
- Blending (especially with diverse, decorrelated models)
- Account For Missing Data (It can be information)
- External Data
- Acquire Domain Knowledge for Feature Engineering
- Random Forest / Gradient-Boosted Tree importances for feature exploration
- Clustering for feature creation
- Distance to Class Centroid
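
A stacking sketch with scikit-learn's `StackingRegressor`, where out-of-fold predictions from the base models become inputs to a final estimator (the particular base models are illustrative):

```python
# Sketch: stacking -- base-model predictions feed a final estimator.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=15, noise=10, random_state=0)

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)), ("svr", SVR())],
    final_estimator=Ridge(),
    cv=5,   # out-of-fold predictions become the meta-features
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))   # R^2 on the training data
```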