Feature Selection For Machine Learning

Justin Fernandez
Oct 21, 2020 · 5 min read


You must choose wisely

To predict the occurrence of an event or its outcome, one needs data or knowledge about the domain of that event. Without this information, any prediction is just a guess with no evidence to back it up. This is how we as humans form a good idea of what will happen next in our lives: we have accumulated a great deal of data and evidence to base it on. Computers work the same way; we (the operators) need to provide them with data in a specific structure so they can identify trends and make predictions. This process is known as machine learning, which is defined as:

The use and development of computer systems that are able to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and draw inferences from patterns in data. -Oxford Languages

One key thing to focus on is the data we provide our systems so they can do the heavy lifting. The data used as input to a model needs to be analyzed and tested to evaluate how much information it gives about the output we wish to predict. But how can we tell whether a piece of data will be useful for predicting other data? Feature selection is the answer: it allows us to filter out data that is less useful and to identify which data points (also known as features) give the most information about the data we wish to predict (also known as the target variable). There are many different methods for feature selection, but I am going to cover the two main ones: the filter method and the wrapper method.

Filter Method

The filter method is a very useful form of feature selection because it can be applied before we even start making predictions. This allows us to filter down to the most useful features before we have decided which model we are going to use. But what actually is the filter method? The main idea is to use proven statistical tests to evaluate each feature individually on how much information it gives us about the target variable. This produces a ranking of all the features, allowing us to filter out the variables that score the lowest.
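To make this concrete, here is a minimal sketch of the filter idea: score every feature on its own against the target, then rank. The score here is the absolute Pearson correlation, and the column names and data are made up purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: two informative features and one pure-noise feature.
X = pd.DataFrame({
    "sqft": rng.normal(1500, 300, 200),
    "bedrooms": rng.integers(1, 6, 200),
    "noise": rng.normal(0, 1, 200),   # carries no signal on purpose
})
y = 100 * X["sqft"] + 5000 * X["bedrooms"] + rng.normal(0, 10000, 200)

# Score each feature independently, then rank from most to least informative.
scores = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
print(scores.sort_values(ascending=False))
```

The noise column ends up at the bottom of the ranking, which is exactly the kind of feature a filter method lets us drop.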


Some commonly used filter selectors are SelectKBest, SelectPercentile, and GenericUnivariateSelect, among others. For brevity I am going to explain SelectKBest and how it can be paired with different statistical tests to obtain the best k features for our model. In essence, SelectKBest keeps only the k features that rank highest under the chosen scoring method and filters out the rest. There are many statistical tests you can use, but which ones apply depends on whether you are performing classification or regression. The quick distinction between the two: classification seeks to identify whether an object belongs to one group or another (does this data describe a cat or a dog?), while regression seeks to assign a continuous value to an object (what is the price of a house with certain characteristics?).
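As a sketch of how this looks in code, here is SelectKBest paired with the ANOVA F-test (f_classif) on a small classification dataset; the choice of dataset and k=2 are arbitrary assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep only the 2 features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("original shape:", X.shape)         # (150, 4)
print("reduced shape:", X_reduced.shape)  # (150, 2)
print("per-feature scores:", selector.scores_)
```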

In the case of regression, some well-known scoring methods are simple univariate regression tests and mutual information for regression. For classification, the proven ranking tests are chi-squared, ANOVA, and mutual information for classification. The formulas for these tests are quite complicated, but with the magic of sklearn we only need to know how to interpret the output in order to use them effectively. You can find all the information on these here.
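For regression, a similar sketch pairs SelectKBest with the two scoring functions mentioned above, f_regression and mutual_info_regression, on synthetic data; the dataset and k=5 are arbitrary assumptions here.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import (SelectKBest, f_regression,
                                        mutual_info_regression)

# Synthetic regression data: 10 features, only 4 of which carry signal.
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)

# Compare which features each scoring method would keep.
for score_func in (f_regression, mutual_info_regression):
    selector = SelectKBest(score_func=score_func, k=5).fit(X, y)
    kept = selector.get_support(indices=True)
    print(score_func.__name__, "keeps feature indices:", kept)
```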

The Wrapper Method

The wrapper method differs from the filter method in that it uses the model's prediction score as its evaluation criterion. In other words, we use our data to make predictions on the target variable and then change the input feature vector in some way to improve the prediction results. This can be done by addition or by elimination over many iterations. With the addition (forward) approach, we start from an empty set and iteratively add the feature that improves the score the most, stopping once the score no longer improves. With the elimination (backward) approach, we start with the full data set and iteratively remove the features that provide the least information until we reach the best score.
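Here is a rough sketch of the forward-addition version using scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the estimator and the choice of three features are arbitrary assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The "wrapped" model whose cross-validated score drives feature selection.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start from no features, add one at a time.
sfs = SequentialFeatureSelector(model, n_features_to_select=3,
                                direction="forward", cv=5)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))
# direction="backward" would instead start from all features and
# iteratively drop the one whose removal hurts the score the least.
```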


With all this effort, what do we hope to gain from feature selection?

The most important benefit of feature selection is, hopefully, an increase in prediction accuracy. The goal of machine learning is to predict the outcome of an event based on the evidence available before the outcome occurs. This could be predicting whether it's going to rain, the outcome of a sports game, or whether someone taking out a loan will default on it. Our goal is to increase our ability to predict the true outcome of an event.

The second benefit of feature selection is reducing the dimensionality of the input feature vector, which allows for faster predictions. In some cases we are making predictions in real time, so reducing the number of features lets predictions be produced much faster.

Even if we have all the time in the world, we still benefit from a reduction in features because we are identifying the features that give the most information about the target variable. For example, if we can produce the same results or better with fewer features, we have identified features that provide no benefit to our model and only add noise.

This is the true goal of building machine learning models: finding out what information allows us to predict what our target variable will be.
