Data Prep For Machine Learning

Online, I hear a lot of people recommending Kaggle as the best starting place for learning machine learning (ML). To be contentious, I believe that recommendation might not be accurate. I am not saying it is a bad place to learn or that it is not helpful. What I am saying, however, is that it might not be the best.

In the real world, it is very uncommon to be handed a perfectly curated, clean dataset ready for training a model like on Kaggle. Realistically, you get asked vague questions like, “Who makes most of the complaints?”, or, “How can we make our chatbot smarter?” (these are real questions I have been asked in the past).

So, unlike Kaggle, where you download your dataset, you first need to work out what is actually being asked. Then you need to work out how to solve the issue before you even get to the data, let alone train a model!

For today, I am going to assume we have done those first steps of understanding the problem, planning some potential solutions, and collecting our data. Now, the next step is to understand, clean and prepare the dataset for a model to ingest it. And that is what I want to talk about.

Broad Data Cleaning and Preparation Procedures

Photo by Gil Ribeiro on Unsplash

So, most of the common techniques used in data cleaning and preparation fall into one of four buckets.

The first broad procedure is (1) filtering. In this stage, one filters out data points such as duplicates or errors. Sometimes there are outliers that can throw the model out of whack, and these could also be removed. Sometimes there is even irrelevant data that can simply be cut out.

Next, you can (2) normalise the data so that there is uniformity in the dataset. For example, your model might be trained on data collected from various locations. So, you would have some measurements in metric units and others in imperial units. Or maybe address information is written down slightly differently. Normalising makes sure that all the data are comparable.

Thirdly, there are some data (3) augmentation methods that can be applied to add information back into the dataset. Some people choose to impute missing data points using one of a number of different routines. One could also add information from other sources to enrich the data’s predictive power.

And finally, you can (4) aggregate the data in a number of ways. For example, maybe there are some underrepresented categories in the data. This could lead to poor estimates because there isn’t a large enough sample size to train a model that can predict accurately for these categories. So, we can group these categories with others. Addresses could be grouped into areas, or ages into age brackets, for example.

So, let’s go into a bit more detail.


Photo by Tyler Nix on Unsplash

Filtering is one of the first steps in cleaning and preparing data for a machine learning model. Sometimes there is unwanted data in your dataset, errors, or even outliers. The key here is to remove anything unwanted.


There may be situations where the duplicates are not exact. For example, a user might enter their details into a system twice, on different days. The result could be that John Doe becomes Doe John the second time round.

Before a dataset is ready to be used, it should go through a basic deduplication stage first.
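As a rough illustration of catching the "John Doe" vs "Doe John" case, here is a minimal sketch in plain Python. The record layout (`name` and `email` fields) and the matching rule are assumptions made for the example, not a general-purpose deduplication method; real data usually needs fuzzier matching.

```python
def name_key(full_name: str) -> tuple[str, ...]:
    """Build a key that ignores case and the order of the name parts."""
    return tuple(sorted(full_name.lower().split()))


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each (name key, email) pair."""
    seen = set()
    unique = []
    for record in records:
        # Hypothetical rule: same name parts and same email means same person.
        key = (name_key(record["name"]), record["email"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"name": "John Doe", "email": "john@example.com"},
    {"name": "Doe John", "email": "John@example.com "},
    {"name": "Jane Roe", "email": "jane@example.com"},
]
unique_records = deduplicate(records)
```

Here the second record is dropped because it produces the same key as the first, even though the two rows are not exact duplicates.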


If you can spot the errors, it makes sense to remove them from the dataset. If you leave them in, they could cause the model to fail in training due to NaN values, or they could bias the predictions.
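A simple way to picture this step is a validity check applied to every row. The sketch below assumes a hypothetical `age` field and an arbitrary plausible range of 0 to 120; the right checks depend entirely on your own data.

```python
import math


def is_valid(row: dict) -> bool:
    """Return True when the (hypothetical) age field holds a plausible number."""
    age = row.get("age")
    if not isinstance(age, (int, float)):
        return False  # missing or non-numeric entry
    if math.isnan(age):
        return False  # NaN values can break training later on
    return 0 <= age <= 120  # assumed plausible range


rows = [
    {"age": 34},
    {"age": float("nan")},
    {"age": -5},
    {"age": None},
    {"age": 61},
]
clean_rows = [row for row in rows if is_valid(row)]
```

Only the rows with ages 34 and 61 survive the filter.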



Photo by Clark Van Der Beken on Unsplash

So, you’re dealing with imperfect data, and you find that there are quite a few inconsistencies in the data. Maybe it isn’t wrong, and you want to keep the data, but it definitely isn’t uniform. This is where normalising comes in handy!


Maybe someone labelled themselves as ‘mail’ rather than ‘male’. The model would then see an extra, spurious category. Some simple data normalisation can greatly reduce the number of categories in the data and improve prediction quality.

Naming Conventions

When formatting data for a machine learning model to ingest, even ‘Meals’ vs ‘meals’ would create two categories. This splits the sample size for both categories. Ultimately, it would work to the detriment of your model, so it is best to normalise the data where possible.
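To make the ‘mail’ vs ‘male’ and ‘Meals’ vs ‘meals’ problems concrete, here is a small sketch that trims whitespace, lowercases each label, and applies a lookup of known misspellings. The `CORRECTIONS` table is a hypothetical example that you would build by inspecting your own data.

```python
from collections import Counter

# Hypothetical lookup of known misspellings, built by inspecting the data.
CORRECTIONS = {"mail": "male", "femail": "female"}


def normalise_label(raw: str) -> str:
    """Trim whitespace, lowercase, then fix known misspellings."""
    label = raw.strip().lower()
    return CORRECTIONS.get(label, label)


raw_labels = ["Male", "male ", "mail", "Female", "FEMALE"]
counts_before = Counter(raw_labels)
counts_after = Counter(normalise_label(label) for label in raw_labels)
```

Before normalising, the five raw labels count as five separate categories. Afterwards there are only two, `male` and `female`.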


Metric vs Imperial Numbering
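For mixed units, one option is to convert every measurement to a single system before training. The sketch below assumes hypothetical height records that carry a unit alongside each value, and converts everything to centimetres.

```python
# Conversion factors to centimetres (1 inch = 2.54 cm, 1 foot = 30.48 cm).
TO_CM = {"cm": 1.0, "m": 100.0, "in": 2.54, "ft": 30.48}


def to_centimetres(value: float, unit: str) -> float:
    """Convert a length to centimetres, failing loudly on unknown units."""
    try:
        return value * TO_CM[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown unit: {unit!r}") from None


# Hypothetical (value, unit) pairs collected from different locations.
measurements = [(180, "cm"), (1.75, "m"), (70, "in"), (6, "ft")]
heights_cm = [round(to_centimetres(value, unit), 2) for value, unit in measurements]
```

Raising an error on an unrecognised unit is a deliberate choice here: silently passing the value through would leave the dataset just as inconsistent as before.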



Photo by Andrew Wulf on Unsplash

Now, I want to distinguish augmenting a dataset from feature engineering. Say you have a dataset of customer reviews. Feature engineering derives new features from the data, for example review length, sentiment, or the ratio of uppercase letters to total letters.
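For contrast with augmentation, this is roughly what feature engineering on a review could look like. It is only a sketch: the feature names are my own, and sentiment is left out because it would need a separate model or lexicon.

```python
def review_features(text: str) -> dict:
    """Derive a few simple features from a single review."""
    letters = [ch for ch in text if ch.isalpha()]
    uppercase = [ch for ch in letters if ch.isupper()]
    return {
        "length": len(text),
        "word_count": len(text.split()),
        # Share of letters that are uppercase; 0.0 when there are no letters.
        "uppercase_ratio": len(uppercase) / len(letters) if letters else 0.0,
    }


features = review_features("GREAT phone, terrible battery")
```

Note that the number of rows stays the same. Each review simply gains extra columns.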

Data augmentation, on the other hand, is simply creating more data without creating whole new features. In the reviews example, you could programmatically replace words with synonyms to get new data points to train a model. You could also paraphrase sentences, reorder prepositional phrases, and more! With image data, you can flip, rotate, or stretch the images. There are “lots of ways to skin the cat”, as they say.
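Synonym replacement can be sketched in a few lines. The tiny `SYNONYMS` dictionary below is a made-up stand-in for a proper thesaurus, and the seeded random generator is only there to make the example repeatable.

```python
import random

# Hypothetical, hand-written synonym table standing in for a real thesaurus.
SYNONYMS = {
    "good": ["great", "decent"],
    "bad": ["poor", "awful"],
    "fast": ["quick", "speedy"],
}


def augment(sentence: str, rng: random.Random) -> str:
    """Replace every word that has known synonyms with a random synonym."""
    words = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)


rng = random.Random(0)  # seeded so the sketch is repeatable
original = "good screen but bad battery"
variants = {augment(original, rng) for _ in range(10)}
```

Each variant keeps the same structure and, ideally, the same label as the original review, which is what lets it serve as an extra training example.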


Roughly speaking, sometimes it makes sense to fill the gaps with the average value of all the other data points. Alternatively, you can find the top N nearest rows and take the average of those. This step is often debated because sometimes missing data contains information in itself. Say someone does not want to answer a particular question on a questionnaire. This might indicate something about the person. In this case, it might make sense to create another feature that flags the entry as missing a data point, to see if there is any predictive power there.
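Here is a minimal sketch of the simplest version of this idea: fill the gaps with the mean of the observed values, and keep a separate flag recording which entries were missing. The `incomes` list is invented for illustration, and missing values are assumed to be stored as `None`.

```python
from statistics import mean


def impute_with_mean(values):
    """Fill None entries with the mean of the observed values.

    Also returns a parallel list of flags marking which entries were missing,
    so that missingness can be used as a feature. Raises StatisticsError
    if every value is missing.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


incomes = [40_000, None, 55_000, 61_000, None]  # hypothetical survey answers
imputed, was_missing = impute_with_mean(incomes)
```

Keeping `was_missing` as its own column lets the model decide whether the fact that a value was absent carries any signal.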


One approach is synthetic resampling, which uses some kind of function to generate new, synthetic data points similar to those in the groups we are oversampling. The other approach is simply copying or slicing and dicing existing data to get new data. This is a handy approach when data are relatively scarce and one needs to increase the sample size to train a more accurate model.
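The simpler of the two approaches, copying existing rows, can be sketched as random oversampling: rows from the smaller groups are duplicated at random until every group is as large as the biggest one. The `churned` label and the row layout are assumptions for the example.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, rng):
    """Duplicate rows from smaller classes until all classes are the same size."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to top the group up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical imbalanced dataset: six "no" rows and two "yes" rows.
rows = [{"id": i, "churned": "no"} for i in range(6)]
rows += [{"id": i, "churned": "yes"} for i in range(6, 8)]

balanced = random_oversample(rows, "churned", random.Random(42))
class_counts = Counter(row["churned"] for row in balanced)
```

If you use this, it is generally safer to oversample only the training split, so that copies of the same row do not end up in both the training and the evaluation data.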


Photo by Jørgen Håland on Unsplash

Aggregation is the final step that can be done to prepare your data for model ingestion. There may be cases where you have a categorical variable that you think would have strong predictive power in the model. The issue might arise when you have only a few data points in some of the categories. The result would be that the model cannot generalise well based on a small handful of data points.

‘Other’ Category
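One way to handle the sparse categories described above is to fold anything below a minimum count into a single ‘Other’ bucket. This is a minimal sketch; the `min_count` threshold and the city names are arbitrary choices for illustration.

```python
from collections import Counter


def group_rare(categories, min_count):
    """Replace categories seen fewer than min_count times with 'Other'."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else "Other" for c in categories]


# Hypothetical column with two common values and three rare ones.
cities = ["Auckland"] * 5 + ["Wellington"] * 4 + ["Gore", "Bluff", "Taupo"]
grouped = group_rare(cities, min_count=3)
grouped_counts = Counter(grouped)
```

The three cities that appear only once each are merged, so the model sees three categories instead of five.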


In some instances, I have used clustering to get general clusters of addresses. I can then use location clusters as categorical variables. Another approach could be to use country or city as the variable rather than specific addresses.
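To give a feel for the clustering idea, here is a small k-means sketch in plain Python that turns coordinates into a cluster label. Everything specific in it is an assumption: the addresses are taken to be already geocoded into hypothetical (latitude, longitude) pairs, plain Euclidean distance is used as a rough proxy for real distance, and the farthest-point initialisation is just a simple deterministic choice. In practice you would probably reach for a library implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans_labels(points, k, iterations=10):
    """Assign each point to one of k clusters using basic k-means."""
    # Farthest-point initialisation: start from the first point, then keep
    # adding the point that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Hypothetical geocoded addresses in two well-separated areas.
points = [
    (-45.031, 168.662),
    (-45.035, 168.658),
    (-36.848, 174.763),
    (-36.852, 174.768),
    (-45.029, 168.665),
]
location_cluster = kmeans_labels(points, k=2)
```

The resulting `location_cluster` list can then be used as a categorical variable in place of the raw addresses.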


This is a good approach when trying to make a model that is more robust to overfitting.


We talked about 4 broad transformations a dataset can go through. These transformations ensure that the data comes out on the other side containing some nuggets of gold.

The first was to filter unwanted errors and data. Then we discussed approaches to normalise and standardise the data. Once the data are tidy, we can augment the data using external information, predictions, or synthetic data. Finally, it is worth aggregating and grouping the data where it makes sense.

We dive into detail on how we can prepare data and fit models in our new machine learning micro-credential provided by QRC.

Find out more about our Machine Learning Fundamentals micro-credential here.

Originally posted on Medium by Gio at QRC.
