From Raw Data to Meaningful Features for Your ML Model
We all know that a good ML model needs a large, preprocessed dataset with good features. But the question is: how do we find good features among all the attributes in the dataset?
It depends on how we define our goal, or what we are actually going to predict. Let’s see what characteristics a good feature should have.
First, it should be identifiable as a possible feature. Assume we are going to train a model to predict the value of a house. What features might we need? Location, square footage, number of rooms, previous selling price, the age of the house, and so on. But do we need to know who owns the house now? Obviously not! The point is, different problems in the same domain may need different features. To identify the features, we should be well aware of what we are actually trying to solve, because every feature should be related to the objective.
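To make this concrete, here is a minimal sketch of separating objective-related attributes from irrelevant ones in the house-price example. The column names and values are assumptions invented for illustration, not from any real dataset:

```python
import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative assumptions.
raw = pd.DataFrame({
    "location": ["downtown", "suburb"],
    "sqft": [1200, 1800],
    "num_rooms": [3, 4],
    "prev_price": [250_000, 310_000],
    "house_age_years": [15, 5],
    "current_owner": ["A. Smith", "B. Jones"],  # unrelated to the objective
    "price": [280_000, 350_000],
})

# Keep only attributes related to the objective (predicting price);
# drop attributes like the current owner that carry no predictive signal.
features = raw.drop(columns=["current_owner", "price"])
target = raw["price"]
print(list(features.columns))
```

The decision of which columns to drop comes from domain knowledge of the objective, not from the code itself.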
Second, it should be known or available at prediction time. Say we are going to predict the sales for the next three days. But in reality, the sales data arrives in our data warehouse at least a month later; we do not get it in real time. So we cannot use the previous day’s sales as a feature for our prediction. Again, it all depends on our objective: some data is known immediately, while other data is not available in real time.
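One common way to respect this constraint is to build lag features only as recent as the reporting delay allows. Here is a sketch; the sales values and the 3-day delay are assumptions for illustration (the article's example delay is a month):

```python
import pandas as pd

# Hypothetical daily sales series; values are illustrative assumptions.
sales = pd.Series(
    [100, 120, 90, 110, 105, 130, 95],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
    name="sales",
)

# Sales arrive in the warehouse with a delay, so a 1-day lag feature would
# not actually be known at prediction time. Only use lags at least as old
# as the reporting delay (3 days here, purely as a stand-in).
REPORTING_DELAY_DAYS = 3
lagged = sales.shift(REPORTING_DELAY_DAYS).rename("sales_lag_available")
df = pd.concat([sales, lagged], axis=1)
print(df)
```

The first few rows of the lag column are missing (NaN), which mirrors reality: at those prediction times, the delayed data simply did not exist yet.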
Third, we need numeric data with meaningful magnitude. We use mathematical and statistical operations inside the neurons of a neural network, so our features must be numeric. That doesn’t mean the raw data always has to be numeric: string and categorical data can also be features, but we must represent them as numeric values in a meaningful way. For example, word2vec is a method for vectorizing text data for ML models. The key point is that to get better predictions, we need as many meaningful numeric features related to our model as possible.
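For categorical data, a simpler technique than word2vec is one-hot encoding, which turns each category into its own 0/1 column instead of imposing a fake ordering. A minimal sketch, with made-up location values:

```python
import pandas as pd

# Hypothetical categorical feature; values are illustrative assumptions.
df = pd.DataFrame({"location": ["downtown", "suburb", "downtown", "rural"]})

# Encoding categories as arbitrary integers (downtown=0, suburb=1, ...)
# would give them a meaningless magnitude and ordering. One-hot encoding
# avoids that: each category becomes its own binary indicator column.
encoded = pd.get_dummies(df["location"], prefix="loc")
print(encoded)
```

Each row now has exactly one indicator set, and no category is implied to be "larger" than another.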
Fourth, we should have enough sample data for each feature value; otherwise it cannot be a good feature. Say we include location as a feature for our model, but our training data only contains entries from a single location, while in the end the model has to predict prices for houses in other locations. Then there is no point in using location to train the model. If we have entries from three different locations, with at least a certain number of entries from each, then location is a good feature to have.
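A quick sanity check for this is counting examples per category before committing to the feature. The locations, counts, and the minimum-count threshold below are all illustrative assumptions, not a universal rule:

```python
import pandas as pd

# Hypothetical training data; category counts are illustrative assumptions.
df = pd.DataFrame({"location": ["A"] * 50 + ["B"] * 40 + ["C"] * 2})

MIN_EXAMPLES_PER_VALUE = 10  # illustrative cutoff, chosen per problem
counts = df["location"].value_counts()

# A category backed by too few examples gives the model almost nothing
# to learn from; flag such values before using the feature.
rare_values = counts[counts < MIN_EXAMPLES_PER_VALUE].index.tolist()
print(counts.to_dict())
print(rare_values)
```

Rare values flagged this way can be dropped, merged into an "other" bucket, or backfilled with more data before training.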
The last one is YOU! Bring your human insight. You should have expertise related to the problem you are trying to solve with this model, along with a curious, questioning mind.
Before I conclude, I need to mention one thing: features do not enter your model's life only once. You can always come back, add or remove features, and retrain your model whenever needed.
I hope you now understand how to select good features for your problem or model. Are there any other characteristics we should focus on while selecting features?