In this post, the third part of our series on the overfitting problem, we will discuss how to avoid overfitting with feature selection and dimensionality reduction.
As we said before, overfitting is a memorization problem: the model memorizes the training data instead of learning general patterns from it. If we do not prevent it, our model will most likely fail on unseen data.
Today, we will see how feature selection and dimensionality reduction techniques help us avoid that.
Feature Selection
Let’s start with feature selection. Feature selection is a crucial process in model building that aims to identify the most essential features from a dataset. This involves selecting features that are consistent, unique, and directly related to the problem at hand. By carefully reducing the number of features, we can improve the performance of predictive models while also minimizing the computational resources required for model training. This is especially vital as datasets become increasingly large and complex.
From this definition, we can pick out the points that help reduce the risk of overfitting:
- Feature selection removes unimportant features (columns) such as ‘Name’, ‘Id’, or ‘Ticket Number’. These columns have no impact on our model’s predictions, so dropping them reduces the complexity of our model.
- At the same time, feature selection reduces the problem of high dimensionality (having a large number of columns, which increases the processing time of our model).
- Feature selection reduces variance. Variance measures how much the model’s outputs change when the training data changes. We do not want too much of this sensitivity, so we expect the variance to be low; high variance increases the risk of overfitting.
- It decreases the noise in the model (noise is meaningless or irrelevant data that disrupts the real patterns in our dataset). Like people, models work better in less noisy environments.
- It enhances the model’s ability to generalize.

So this is exactly the target we are aiming for. Here, I will show you how to do it on a customer dataset (a synthetic dataset generated by AI). Let’s look at the data:
| customerId | customerName | buyAmount | buyPrice | tax | totalPrice |
| --- | --- | --- | --- | --- | --- |
| 4896 | Eve | 19 | 41.93 | 63.73 | 860.4 |
| 8254 | David | 10 | 75.75 | 60.6 | 818.1 |
| 7690 | David | 9 | 39.53 | 28.46 | 384.23 |
| 2271 | David | 7 | 13.34 | 7.47 | 100.85 |
| 2692 | Alice | 12 | 84.99 | 81.59 | 1101.47 |
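For reference, here is how this sample data could be loaded into a pandas DataFrame, so the snippets below have something concrete to work with (df_name is just an illustrative variable name):
# build the sample customer data as a DataFrame
import pandas as pd
df_name = pd.DataFrame({
    'customerId': [4896, 8254, 7690, 2271, 2692],
    'customerName': ['Eve', 'David', 'David', 'David', 'Alice'],
    'buyAmount': [19, 10, 9, 7, 12],
    'buyPrice': [41.93, 75.75, 39.53, 13.34, 84.99],
    'tax': [63.73, 60.6, 28.46, 7.47, 81.59],
    'totalPrice': [860.4, 818.1, 384.23, 100.85, 1101.47],
})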
Here we have unimportant features like ‘customerId’ and ‘customerName’, so we can drop them. As a quick example in Python 3.10:
# import pandas (if you have not already)
import pandas as pd
# drop the column/columns we do not need
df_name = df_name.drop(['customerId', 'customerName'], axis=1)
- df_name: the name of the variable holding your DataFrame.
- axis=1: this parameter tells pandas to drop columns (rather than rows).
So we are done here. Now we continue with the ‘buyAmount’, ‘buyPrice’, ‘tax’, and ‘totalPrice’ columns. If you want to take feature selection further, you can inspect the correlations between these columns with the .corr() method and remove the columns that are only weakly correlated with the target variable (y).
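For instance, here is a minimal sketch of that correlation check, assuming ‘totalPrice’ is the target we want to predict:
# correlation matrix of the remaining numeric columns
corr = df_name.corr()
print(corr)
# correlations with the (assumed) target column 'totalPrice'
print(corr['totalPrice'].sort_values(ascending=False))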
We are done with the feature selection part. Let’s continue with dimensionality reduction.
Dimensionality Reduction
Dimensionality reduction is a data processing technique that reduces complexity by transforming high-dimensional data into a low-dimensional space. This transformation makes data easier to understand and process by removing noise and unnecessary variations while preserving important information.
In contrast to feature selection, which drops columns entirely, dimensionality reduction transforms the data, for example from a 3-dimensional space into a 2-dimensional space. Here is what it does:
- By compressing the data, it reduces noise and at the same time increases computational efficiency. Just as in the feature selection section, noise is reduced because the data is projected into a lower-dimensional space. The smaller representation also means we get faster results with less memory.
- Dimensionality reduction techniques take correlations between features into account and combine them into a smaller number of components with minimal distortion, so the model does not learn the same information twice.
- It makes data easier to understand and can even reveal hidden structures beneath it. When we see data in a simpler visual form, we naturally understand it better and observe the relationships between points more clearly.
- It creates a regularization effect and thus prevents the model from becoming overly complex.

That is enough theory for dimensionality reduction. We can illustrate it with PCA (Principal Component Analysis) and MDS (Multidimensional Scaling). Briefly:
- PCA: A statistical method that finds new, mutually uncorrelated variables (principal components) representing the largest portion of the variability in a dataset.
- MDS: A data analysis technique used to visualize similarity or difference data by transforming it into a low-dimensional geometric space.
PCA
# import PCA
from sklearn.decomposition import PCA
# create a PCA object that keeps 2 components
pca = PCA(n_components=2)
# apply PCA to the feature matrix x (rows = samples, columns = features)
principalComponents = pca.fit_transform(x)
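In practice, it helps to scale the features first so that no single column dominates the components. Here is a minimal sketch on the customer dataset from above, assuming df_name still holds the numeric columns and ‘totalPrice’ is the target we keep aside:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# feature matrix: the numeric columns except the (assumed) target 'totalPrice'
x = df_name[['buyAmount', 'buyPrice', 'tax']].values
# standardize the features before PCA
x_scaled = StandardScaler().fit_transform(x)
# reduce from 3 features to 2 principal components
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x_scaled)
# fraction of the original variance each component preserves
print(pca.explained_variance_ratio_)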
MDS
# import MDS
from sklearn.manifold import MDS
# create an MDS object; 'precomputed' means we will pass a pairwise distance matrix
mds = MDS(n_components=2, dissimilarity='precomputed')
# apply MDS to a precomputed distance matrix and get the 2D embedding
embedding = mds.fit_transform(distance_matrix)
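For context, here is a minimal sketch of how such a distance matrix can be built, reusing the scaled matrix x_scaled from the PCA sketch above (the Euclidean metric and random_state are illustrative choices):
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS
# pairwise Euclidean distances between the scaled samples
distance_matrix = pairwise_distances(x_scaled, metric='euclidean')
# embed the samples in 2 dimensions while preserving these distances as well as possible
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
embedding = mds.fit_transform(distance_matrix)
print(embedding.shape)  # (n_samples, 2)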
You can use methods like these for dimensionality reduction on your dataset. Afterwards, you use the resulting low-dimensional representations directly in your model training.
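As a final illustration, here is a minimal sketch of feeding the reduced features into a simple model, assuming ‘totalPrice’ is the target and using LinearRegression purely as an example:
from sklearn.linear_model import LinearRegression
# (assumed) target column
y = df_name['totalPrice'].values
# train on the 2 principal components instead of the raw features
model = LinearRegression()
model.fit(principalComponents, y)
print(model.score(principalComponents, y))  # R^2 on the training data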
That is all for now. We will continue this series until we have introduced all of the overfitting techniques. You can return to the main article from here.
Bibliography
Heavy AI. (n.d.). Feature selection. Retrieved August 16, 2024, from https://www.heavy.ai/technical-glossary/feature-selection
