Data Preparation for Machine Learning | Data Cleaning, Data Transformation, Data Reduction
Data preparation is one of the most difficult steps in any Machine Learning (ML) project. Every dataset is specific to its project, and no two predictive modeling projects are alike, but a common set of steps applies to almost all of them.
This post assumes that the reader has a basic understanding of ML concepts and terminology.
Machine learning models can be divided into four broad categories based on what they're used for:
- Classification: categorizing instances
- Regression: predicting continuous values
- Clustering: finding logical groupings that exist in a dataset
- Dimensionality reduction: extracting only the most significant features to train an ML model
If the data fed into an ML model is of poor quality, the model itself will be of poor quality (garbage in, garbage out). There are a huge number of problems that are encountered when working with data in the real world.
- Insufficient data: models trained on too little data make poor predictions and tend to either overfit or underfit
Overfitting: the model has memorized the training data and does not generalize to new data
Underfitting: the model is unable to capture the relationships in the data
- Too much data: the excess can be outdated historical data (too many rows) or too many columns (the curse of dimensionality). Dimensionality can be reduced through feature selection/engineering or dimensionality reduction techniques
- Non-representative data: the collected data contains errors that can significantly affect the ML model, or the classes/groups are not represented in proportion (biased data); errors are addressed by cleaning, and imbalance by oversampling or undersampling
- Missing data: solved by data deletion or data imputation
Data deletion: delete an entire record when a single value is missing, although this can introduce bias
Data imputation: infer the missing value from known data, e.g., fill in missing values with the column mean, interpolate from nearby values, build an ML model to predict the missing value, or sort the records and use the immediately prior value (hot-deck imputation)
- Duplicate data: solved by applying de-duplication
- Outliers: data points that differ significantly from the other data points in the dataset. Records can be identified as outliers by measuring their distance from the mean or from a fitted line. Once identified, these records can be dropped, capped, or set to the mean.
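The cleaning steps listed above can be sketched with pandas. This is a minimal illustration, not a recipe: the toy `age`/`income` table is made up, and the cutoff of 2 standard deviations for flagging outliers is an assumption chosen for this small sample.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one duplicated record, one extreme income.
df = pd.DataFrame({
    "age":    [25.0, 30.0, np.nan, 35.0, 35.0, 40.0, 28.0, 33.0, 45.0, 50.0],
    "income": [40.0, 50.0, 45.0, 60.0, 60.0, 55.0, 48.0, 52.0, 58.0, 1000.0],
})

# Data deletion: drop every record that has at least one missing value.
deleted = df.dropna()

# Data imputation: fill missing values with the column mean.
imputed = df.fillna(df.mean())

# Hot-deck style imputation: sort the records, then reuse the prior record's value.
hot_deck = df.sort_values("income").ffill()

# De-duplication: keep the first occurrence of each identical record.
deduped = imputed.drop_duplicates()

# Outliers: measure the distance from the mean in standard deviations (z-score).
mean, std = deduped["income"].mean(), deduped["income"].std()
z = (deduped["income"] - mean) / std
is_outlier = z.abs() > 2  # assumed cutoff for this sketch

# Capping: pull flagged values back to the cutoff instead of dropping them.
capped = deduped["income"].clip(lower=mean - 2 * std, upper=mean + 2 * std)

print(deduped[is_outlier])
```

Dropping the flagged records (`deduped[~is_outlier]`) or setting them to the mean are the other two options mentioned in the list.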
There are two types of data, numeric and categorical, and all other forms of data, such as text, images, and video, must be converted to one of these two forms.
- Numeric (continuous) — predicted using regression models. ML algorithms typically do not work well with numeric data with different scales.
- Categorical — predicted using classification models. Categorical data has to be converted to numeric form before applying any ML model.
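One common way to convert categorical data to numeric form is one-hot encoding. The sketch below uses pandas `get_dummies` on a made-up `color` column; the column name and values are purely illustrative.

```python
import pandas as pd

# Hypothetical categorical feature.
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one indicator column per distinct category.
encoded = pd.get_dummies(colors, columns=["color"])

print(list(encoded.columns))
```

Each record ends up with exactly one indicator set, so no artificial ordering is imposed on the categories.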
Comparing different numeric features is hard because their range, average, and dispersion can be very different. Standardization centers the data and rescales it so that every column has a mean of zero and unit variance. It is applied to features (columns) and essentially expresses the data in terms of z-scores. It is most useful for transforming attributes that have a Gaussian distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of zero and a standard deviation of one. Because the mean and standard deviation are themselves affected by extreme values, this process is sensitive to outliers in the dataset.
z = [xi - mean(x)]/stdev(x)
Mean is a measure of central tendency and the standard deviation is a measure of dispersion.
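As a minimal sketch, the z-score formula above can be checked against scikit-learn's `StandardScaler`. The two-column array is made up, with the second column on a scale 100 times larger than the first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Standardize each column (feature) independently.
X_std = StandardScaler().fit_transform(X)

# The same result by hand: z = (x - mean) / stdev, computed per column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std)
```

After scaling, both columns hold the same values, which is what makes features with different units comparable.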
RobustScaler output does not change very much in the presence of outliers. This scaler removes the median and scales the data according to a quantile range (by default the interquartile range, IQR). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
z = [xi - median(x)]/interquartile range(x)
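A small sketch of the formula above using scikit-learn's `RobustScaler` with its default quantile range. The single-column data, with one extreme value of 100, is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median is 3 and the IQR is 4 - 2 = 2, so each value becomes (x - 3) / 2.
robust = RobustScaler().fit_transform(x)

print(robust.ravel())  # [-1.  -0.5  0.   0.5 48.5]
```

The four ordinary values keep a sensible spread around zero; only the extreme value lands far away, instead of compressing everything else as it would under mean/standard-deviation scaling.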
Normalization is the process of scaling input vectors individually to unit norm (a magnitude of one), and it is applied to records (rows/vectors) rather than to features. It is closely related to cosine similarity: once two non-zero vectors have been L2-normalized, their dot product is their cosine similarity. Normalization is often applied in document modeling and with clustering algorithms. If we plan to use a clustering model with cosine similarity as the measure of how similar two data points are, it's extremely helpful to normalize the feature input vectors during preprocessing.
Different vector norms:
- L1 (Manhattan distance): the sum of the absolute values of the vector's components; a natural way of measuring distance between vectors
- L2 (Euclidean distance): the distance of the vector's coordinates from the origin of the vector space; this is the usual definition of vector magnitude
- Max: the largest absolute value among the elements of the vector
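The three norms can be compared with scikit-learn's `normalize` function, which rescales each row. The two rows below are made up and deliberately point in the same direction, so they become identical after normalization.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical records: the second row is the first row doubled.
X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

l1 = normalize(X, norm="l1")    # rows sum to 1 in absolute value: [3/7, 4/7]
l2 = normalize(X, norm="l2")    # rows have Euclidean length 1:   [0.6, 0.8]
mx = normalize(X, norm="max")   # largest element of each row is 1: [0.75, 1.0]

# After L2 normalization the dot product of two rows is their cosine similarity.
cosine = l2[0] @ l2[1]
print(cosine)
```

A cosine similarity of 1 here reflects that the two records differ only in magnitude, not in direction.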
Binarization thresholds numeric features to get boolean values: values greater than the threshold map to one, while values less than or equal to the threshold map to zero. With the default threshold of zero, only positive values map to one. Binarization is a common operation on text count data, where the analyst can decide to consider only the presence or absence of a feature rather than, for instance, the number of occurrences. It can also be used as a pre-processing step for estimators that consider boolean random variables.
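As a short sketch, scikit-learn's `Binarizer` applies this thresholding. The word-count matrix is made up: each row stands for a document and each column for a term.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical term counts for two documents.
counts = np.array([[0.0, 2.0, 5.0],
                   [1.0, 0.0, 0.0]])

# With threshold 0, any positive count becomes 1 (term present), zero stays 0.
present = Binarizer(threshold=0.0).fit_transform(counts)

print(present)  # [[0. 1. 1.]
                #  [1. 0. 0.]]
```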
A discretization transform will map numerical variables onto discrete values. Values for the variable are grouped into discrete bins and each bin is assigned a unique integer such that the ordinal relationship between the bins is preserved. The use of bins is often referred to as binning or k-bins, where k refers to the number of groups to which a numeric variable is mapped. Different methods for grouping the values into k discrete bins can be used and common techniques are:
- Uniform — each bin has the same width in the span of possible values for the variable
- Quantile — each bin has the same number of values, split based on percentiles
- Clustered — clusters are identified and examples are assigned to each group.
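The uniform and quantile strategies can be contrasted with scikit-learn's `KBinsDiscretizer`. The skewed sample below is made up; the clustered strategy (`strategy="kmeans"` in scikit-learn) is mentioned only in a comment and not run here.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed feature: nine small values and one large one.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0],
              [5.0], [6.0], [7.0], [8.0], [100.0]])

# Uniform: three equal-width bins over the range 0..100.
uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
uniform_bins = uniform.fit_transform(x).ravel().astype(int)

# Quantile: three bins holding roughly the same number of values.
quantile = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
quantile_bins = quantile.fit_transform(x).ravel().astype(int)

# (strategy="kmeans" would give the clustered variant.)
print(np.bincount(uniform_bins, minlength=3))
print(np.bincount(quantile_bins, minlength=3))
```

With equal-width bins almost every value falls into the first bin because of the single large value, whereas quantile bins spread the records far more evenly.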
Data can be excessive in two ways: too many records (rows) or too many features (columns).
Outdated historical data can become a serious problem, and deciding which records and features still matter usually requires a subject matter expert.
Reducing complexity (rows and columns) comes with certain trade-offs:
- Information loss: only a subset of features is used to train the model, so the information in the dropped features is lost
- Computationally intensive: preprocessing steps/transformations can be complex and hence computationally intensive
- Transformed features are hard to interpret
- Performance degradation: a model trained on the reduced data may perform worse
The curse of dimensionality refers to the problems that arise as the number of features (X) increases. Working with high-dimensional data can lead to problems with:
- Visualizing: Exploratory Data Analysis (EDA) is essential for identifying outliers and detecting anomalies, but higher-dimensional data is hard to visualize and is often not explored properly before an ML model is applied.
- Training: Training is the process of finding the best model parameters. If a model is not trained for long enough, its parameters may not converge to the best possible values, which results in a bad model. The number of parameters to be found increases with dimensionality, so training can become extremely time-consuming and expensive, especially in the cloud (where charges are based on computing resources).
- Prediction: Once an ML model is fully trained, prediction in its simplest form means looking at a test instance and finding which training instances it is similar to. As dimensionality grows, the size of the search space that the model has to look within explodes. When the dataset has a large number of dimensions, every instance is very far away from the other instances. A higher number of variables/features also results in a higher risk of overfitting the training dataset.
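The claim that every instance ends up far from every other instance can be illustrated with a small NumPy experiment. This is a sketch under made-up conditions (uniform random points, 200 samples, a fixed seed); `distance_contrast` is a hypothetical helper that measures how much closer the nearest point is than the farthest one.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_features, n_points=200):
    """Relative gap between the farthest and nearest point from a random query."""
    X = rng.random((n_points, n_features))
    query = rng.random(n_features)
    d = np.linalg.norm(X - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast shrinks as dimensionality grows: "nearest" stops meaning much.
contrast = {n: distance_contrast(n) for n in (2, 10, 100, 1000)}
for n, c in contrast.items():
    print(n, round(c, 3))
```

In low dimensions the nearest neighbour is many times closer than the farthest point; in high dimensions the two distances are almost the same, which is why similarity-based prediction degrades.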
Reducing complexity can be divided into two broad categories — Feature Selection and Dimensionality Reduction.
Feature selection is the process of choosing the most relevant X variables from the existing dataset. Feature selection can be further classified into:
- Filter method: Features are selected based on statistical techniques such as variance thresholding, the Chi-square test, and ANOVA.
(i) Variance thresholding is based on the principle that if all points have the same value for an X variable, then that variable adds no information.
Hence columns that have a variance below a certain threshold can be dropped. An ML model is thus built with only those features that have a high variance above a minimum threshold.
(ii) Chi-square test is typically used in classification models where both X and Y are categorical. For each X variable/feature, the Chi-square test evaluates whether that X variable and the target Y variable are independent; if they are independent, that X feature can be dropped. The test checks whether the observed data deviates from what would be expected if the two variables were independent. If there is a significant deviation from the expected values, that variable is relevant. Under the hood, the test sums, over all categories, the squared difference between the observed and expected counts divided by the expected count. The result is a chi-square statistic, together with a p-value that gives the significance of that statistic.
(iii) ANOVA (Analysis of Variance) looks across multiple groups and compares their means to produce one score and one significance value indicating how different the groups are. Under the hood, the ANOVA F-test checks whether the mean of Y differs across the distinct values (categories) of X. If the average Y value is not significantly different from one X category to the next, it is inferred that X does not influence Y, and hence the X variable can be dropped.
- Embedded method: The relevant/significant features are selected as part of the model training process itself. Certain machine learning models, such as decision trees and lasso regression, assign an importance or significance to each feature when trained on a dataset. This feature importance is then used to select the features for training the final ML model.
- Wrapper method: Features are chosen by building a number of candidate models, each trained on a different subset of features. The candidate that performs best on held-out data is chosen, and the features used to train it are the features used in the final model. Examples of wrapper methods are forward and backward stepwise regression. Wrapper methods can get computationally intensive because many candidate models have to be built and trained.
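The three families of feature selection can be sketched with scikit-learn on the bundled iris dataset. Several choices here are assumptions for illustration: the variance threshold of 0.2, keeping `k=2` features, and using recursive feature elimination (`RFE`) as a stand-in for backward stepwise selection. Note also that scikit-learn's `chi2` only requires non-negative features, so it is applied to the continuous iris measurements here rather than to truly categorical X.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    RFE, SelectFromModel, SelectKBest, VarianceThreshold, chi2, f_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 records, 4 features, 3 classes

# Filter (i): drop columns whose variance is below an assumed threshold.
X_vt = VarianceThreshold(threshold=0.2).fit_transform(X)

# Filter (ii): keep the 2 features with the highest chi-square statistic.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

# Filter (iii): keep the 2 features with the highest ANOVA F-score.
X_anova = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Embedded: train a decision tree and keep features with above-average importance.
tree_selector = SelectFromModel(DecisionTreeClassifier(random_state=0))
X_embedded = tree_selector.fit_transform(X, y)

# Wrapper: repeatedly refit a model and discard the weakest feature (backward style).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)

print(X_vt.shape, X_chi2.shape, X_anova.shape, X_embedded.shape, X_rfe.shape)
```

The filter methods never train a predictive model, the embedded method trains one, and the wrapper method trains several, which mirrors the increasing computational cost described above.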
Dimensionality reduction is the process of transforming the original X variables and projecting them onto new dimensions. Dimensionality reduction techniques used to reduce complexity are:
- Projection: Find new, better axes and reorient the current dataset so that it is expressed along them. Examples are Principal Components Analysis (PCA), Factor Analysis, and Linear Discriminant Analysis (LDA). PCA and Factor Analysis are unsupervised and do not use the target, so they are commonly used ahead of regression type ML models; LDA needs class labels and is therefore used with classification type ML models.
- Manifold learning: Used when the structure of the data in higher dimensions is non-linear. This technique involves unrolling the data so that the twists and turns in higher dimensions are smoothed out when the data is expressed in lower dimensionality. Manifold learning works best when the data lies along a rolled-up surface, such as a swiss roll or an S-curve: the data has a simpler form in lower dimensionality but is curved into a more complex form in higher dimensionality. Examples of manifold learning algorithms are Multidimensional Scaling (MDS), Isomap, Locally Linear Embedding (LLE), Kernel PCA, and t-SNE.
- Auto-encoding: Uses neural networks to find latent/significant features in the data and extract efficient representations of complex data.
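Projection and manifold learning can be sketched with scikit-learn as follows. The datasets (iris and a generated swiss roll), the choice of 2 output dimensions, and `n_neighbors=10` for Isomap are all assumptions for illustration; auto-encoding is not shown because it needs a neural network library.

```python
from sklearn.datasets import load_iris, make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import Isomap

X, y = load_iris(return_X_y=True)  # 4 original features

# Projection without labels: PCA finds the axes of greatest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Projection with labels: LDA finds the axes that best separate the classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# Manifold learning: unroll a 3-D swiss roll into 2 dimensions with Isomap.
roll, _ = make_swiss_roll(n_samples=500, random_state=0)
roll_2d = Isomap(n_neighbors=10, n_components=2).fit_transform(roll)

print(X_pca.shape, X_lda.shape, roll_2d.shape)
print(pca.explained_variance_ratio_)
```

The printed explained variance ratio gives a rough sense of how much of the original spread the two PCA axes retain.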
API Reference - scikit-learn 0.23.1 documentation