
Online, I hear a lot of people recommending Kaggle as the best starting place for learning machine learning (ML). To be a bit contrarian, I don’t think that recommendation quite holds up. I am not saying it is a bad place to learn or that it is not helpful. What I am saying, however, is that it might not be the best.
In the real world, it is very uncommon to be handed a perfectly curated, clean dataset ready for training a model like on Kaggle. Realistically, you get asked vague questions like, “Who makes most of the complaints?”, or, “How can we make our chatbot smarter?” (these are real questions I have been asked in the past).
So, unlike Kaggle, where you just download your dataset, you need to work out what is actually being asked. Then you need to work out how to solve the problem before you even get to the data, let alone train a model!
For today, I am going to assume we have done those first steps of understanding the problem, planning some potential solutions, and collecting our data. Now, the next step is to understand, clean and prepare the dataset for a model to ingest it. And that is what I want to talk about.
Broad Data Cleaning and Preparation Procedures

So, most of the common techniques used in data cleaning and preparation fall into one of four buckets.
The first broad procedure is (1) filtering. In this stage, one filters out data points like duplicates or errors. Sometimes there are outliers that can throw the model out of whack, and these can be removed too. Sometimes there is even irrelevant data that can simply be cut out.
Next, you can (2) normalise the data so that there is uniformity in the dataset. For example, your model might be trained on data collected from various locations, so some measurements might be in metric units and others not. Or maybe address information is written down slightly differently. Normalising makes sure that all the data are comparable.
Thirdly, there are (3) augmentation methods that can be applied to add information back into the dataset. Some people choose to impute missing data points using a number of different routines. You can also add information from other sources to enrich the data’s predictive power.
And finally, you can (4) aggregate the data in a number of ways. For example, maybe there are some underrepresented categories in the data. This could lead to poor estimates because there isn’t a large enough sample size to train a model that can predict accurately for these categories. So, we can group these categories with others. Addresses could be grouped into areas, or ages into age brackets, for example.
So, let’s go into a bit more detail.
Filtering

Filtering is one of the first steps in cleaning and preparing data for a machine learning model. Sometimes there is unwanted data in your dataset, errors, or even outliers. The key here is to remove anything unwanted.
Duplicates
Duplicates can be troublesome because they can bias the model toward a data point that is overrepresented in the dataset. Much like if your vote counted twice as much as your mate’s when choosing somewhere to eat.
There may be situations where the duplicates are not exact. For example, a user might enter their details into a system twice, each time on a different day. The result could be that John Doe becomes Doe John the second time ’round.
Before a dataset is ready to be used, it should go through a basic deduplication stage first.
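A minimal sketch of both cases with pandas (the table, column names, and the name-key trick are all made up for illustration):

```python
import pandas as pd

# A toy customer table with an exact duplicate and a swapped-name near-duplicate.
df = pd.DataFrame({
    "first_name": ["John", "John", "Doe"],
    "last_name":  ["Doe",  "Doe",  "John"],
    "age":        [34, 34, 34],
})

# Exact duplicates are the easy part: drop rows that match on every column.
df = df.drop_duplicates()

# Near-duplicates need a fuzzier key. An order-insensitive name key collapses
# "John Doe" and "Doe John" into the same entry.
df["name_key"] = df[["first_name", "last_name"]].apply(
    lambda row: " ".join(sorted(s.lower() for s in row)), axis=1
)
df = df.drop_duplicates(subset=["name_key", "age"]).drop(columns="name_key")
print(df)
```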
Errors
Sometimes there are errors in the data. Sensor readings could be inconsistent and drop out at times, or an entry may have been saved even though it was incomplete. Maybe there are missing data points! There are many different types of errors and myriad ways they can arise.
If you can spot the errors, it makes sense to remove them from the dataset. If you leave them in the dataset, it could cause the model to fail in training due to NaN values, or it could bias the predictions.
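In pandas, a quick pass might look something like this (the -999 sentinel value and the column names are purely illustrative):

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "sensor_id":   [1, 1, 2, 2],
    "temperature": [21.5, np.nan, 19.8, -999.0],  # -999 standing in for a dropout
})

# Treat the known error sentinel as missing, then drop rows we cannot trust.
readings["temperature"] = readings["temperature"].replace(-999.0, np.nan)
clean = readings.dropna(subset=["temperature"])
print(clean)
```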
Outliers
I’ve had to deal with data that humans entered manually, and I think every data scientist will have to deal with that kind of data at least once in their career. Human error can produce strange outliers. For example, someone may have entered ‘100’ as ‘10’ by simply missing the ‘0’ key once. The difference is a factor of 10! Whether the outlier comes from sensors, humans, or otherwise, it can really throw off a machine learning model. Dropping outliers is super important.
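One common rule of thumb for spotting them is the 1.5 × IQR rule. A minimal sketch with made-up prices:

```python
import pandas as pd

prices = pd.Series([100, 102, 98, 101, 10, 99, 103])  # the 10 is a likely typo for 100

# Flag points more than 1.5 * IQR outside the quartiles and keep the rest.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
mask = prices.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(prices[mask])  # the suspicious 10 is filtered out
```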
Normalising

So, you’re dealing with imperfect data, and you find that there are quite a few inconsistencies in the data. Maybe it isn’t wrong, and you want to keep the data, but it definitely isn’t uniform. This is where normalising comes in handy!
Typos
Mr Stevens has the same age, first name, and height as Mr Stephens. Everything points to them most likely being the same person, yet they appear as two separate entries in the database. Simple typos can create entirely new entries, biasing the data all over again!
Maybe someone labelled themselves as ‘mail’ rather than ‘male’. This would leave the model confused about how many categories there really are. Some simple data normalisation can greatly reduce the number of categories in the data and improve prediction quality.
Naming Conventions
Imagine you’re looking at accounting/financial information that has expenditure categories. Someone might label their claim as ‘meal’, and another might put ‘dinner’, and yet another, ‘dining out’. They practically say the same thing. Therefore, it makes sense for these costs to be grouped together though they have a slightly different naming convention.
When formatting data for a machine learning model to ingest, even ‘Meals’ vs ‘meals’ would create two categories. This splits the sample size for both categories. Ultimately, it would work to the detriment of your model, so it is best to normalise the data where possible.
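A simple way to handle both the typos and the naming conventions in pandas is to lowercase everything and then map known variants onto one canonical label (the mapping below is just an example; in practice you build it by eyeballing the category counts):

```python
import pandas as pd

expenses = pd.DataFrame({"category": ["meal", "Dinner", "dining out", "Meals", "taxi"]})

# Strip and lowercase first, then collapse known variants onto one label.
canonical = {"meal": "meals", "dinner": "meals", "dining out": "meals"}
expenses["category"] = (
    expenses["category"].str.strip().str.lower().replace(canonical)
)
print(expenses["category"].value_counts())
```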
Addresses
Another case might be user-defined addresses. Imagine one customer has an address with the city ‘New York’ and another with ‘New York, NY’. These are almost certainly the same place, yet the strings are different. Over hundreds or thousands of data points, you can end up with an extremely large number of distinct location strings (or any category, for that matter). As a result, the sample size for each category shrinks and there are far too many of them. In this case, you can normalise by geocoding the addresses so that every entry maps to a standard location name and point on the globe.
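If you go the geocoding route, a library like geopy can resolve free-text addresses to a canonical place name and coordinates. A rough sketch (it calls the public Nominatim service, so it needs a network connection and a respectful request rate):

```python
from geopy.geocoders import Nominatim  # pip install geopy

# Nominatim requires a descriptive user_agent and should be queried gently.
geolocator = Nominatim(user_agent="address-normalisation-demo")

for raw in ["New York", "New York, NY"]:
    location = geolocator.geocode(raw)
    if location is not None:
        # Both strings should resolve to the same canonical place and coordinates.
        print(raw, "->", location.address, (location.latitude, location.longitude))
```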
Metric vs Imperial Numbering
When working with international data, you may find that some data points use the imperial system and others the metric system. Imagine an inventory dataset where measurements are in centimetres but some items are in inches. That pan is either small or large, but there is no way to know unless the units are standardised.
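A tidy way to handle this is to keep a unit column and convert everything to one system up front. A toy example in pandas (the items and measurements are made up):

```python
import pandas as pd

inventory = pd.DataFrame({
    "item":   ["pan A", "pan B"],
    "length": [30.0, 12.0],
    "unit":   ["cm", "in"],
})

# Convert everything to centimetres so the lengths are directly comparable.
to_cm = {"cm": 1.0, "in": 2.54}
inventory["length_cm"] = inventory["length"] * inventory["unit"].map(to_cm)
print(inventory)
```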
Scaling
This is a big topic, but generally speaking, it usually helps a machine learning model to converge more efficiently if all the variables are in a similar range. Another reason for scaling the variables in the dataset is to ensure that no single variable is overweighted in the model. For example, when predicting my car’s value, I used the number of kms on the odometer as well as the age of the car. The first is in the tens or hundreds of thousands, while the other is in the tens, so the kms travelled would outweigh the age by a factor of roughly 10,000x. So, I scaled the kms down by a factor of 10,000 so that the variables are in roughly the same ballpark. This gave me much more consistent and meaningful coefficients after training the model.
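In my case a simple divide did the job, but a more general option is something like scikit-learn’s StandardScaler, which puts every column on a comparable scale. A small sketch with made-up odometer and age values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # pip install scikit-learn

# Odometer readings (tens or hundreds of thousands of km) next to car age (in years).
X = np.array([[150_000, 12],
              [ 60_000,  4],
              [ 95_000,  7]], dtype=float)

# Standardising gives each column zero mean and unit variance, so neither
# variable dominates purely because of its units.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```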
Augmenting

Now, I want to distinguish augmenting a dataset from feature engineering. Say you have a dataset of customer reviews. Feature engineering derives new features from the data: for example, review length, sentiment, or the ratio of uppercase letters to total letters.
Data augmentation, on the other hand, is simply creating more data without creating whole new features. In the reviews example, you could programmatically replace words with synonyms to get new data points to train a model. You could also paraphrase sentences, reorder prepositional phrases, and more! With images, you can flip, rotate, or stretch them. There are “lots of ways to skin the cat”, as they say.
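As a toy example of the synonym idea, here is a tiny sketch with a hand-rolled synonym dictionary (a real project would lean on a proper thesaurus or an NLP augmentation library):

```python
import random

# A hypothetical, hand-rolled synonym table just for illustration.
SYNONYMS = {"great": ["excellent", "fantastic"], "bad": ["poor", "terrible"]}

def augment_review(text, seed=None):
    """Return a new review with some words swapped for a random synonym."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        words.append(rng.choice(options) if options else word)
    return " ".join(words)

print(augment_review("great food but bad service", seed=0))
```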
Imputing
Sometimes there are missing data points for any number of reasons. For example, a sensor may have failed to send a signal for a few moments, or someone forgot to enter a value. Imputing means filling in the gaps. There are all kinds of ways to do this, and it’s a bigger topic than we have time for here.
Roughly speaking, sometimes it might make sense to fill the gaps with the average value of all the other data points. Maybe you can find the top N nearest rows and take the average of those. This step is often debated because sometimes missing data contains information in itself. Say, someone does not want to answer a particular question on a questionnaire. This might indicate something about the person. In this case, it might make sense to create another feature to label the entry as missing a data point to see if there is any predicting power there.
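A rough sketch of both ideas with scikit-learn’s imputers, plus a ‘was missing’ flag, on a made-up table:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer  # pip install scikit-learn

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 58_000],
})

# Keep a "was missing" flag first, in case the missingness itself carries signal.
flags = df.isna().astype(int).add_suffix("_was_missing")

# Mean imputation is the simplest option; KNNImputer averages the nearest rows instead.
mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(pd.concat([mean_filled, flags], axis=1))
```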
Oversampling
Another way of augmenting a dataset is to oversample. The use case for oversampling is to increase the number of data points for underrepresented categories. This may help the model predict more accurately when these categories come up as inputs. Adding data points to a category can be done in a number of ways.
One approach is synthetic resampling, which uses some kind of function to generate new, synthetic data points similar to the group being oversampled. The other approach is simply copying, or slicing and dicing, existing data to get new rows. This is a handy approach when data are relatively scarce and one needs to increase the sample size to train a more accurate model.
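The copy-with-replacement version is easy to sketch in pandas (SMOTE-style synthetic sampling lives in packages like imbalanced-learn); the data below is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["white"] * 8 + ["gold"] * 2,
    "price":  [20, 21, 19, 22, 20, 23, 21, 20, 35, 33],
})

# Naive oversampling: resample each group with replacement until every
# category matches the size of the largest one.
majority_size = df["colour"].value_counts().max()
balanced = pd.concat([
    grp.sample(majority_size, replace=True, random_state=0)
    for _, grp in df.groupby("colour")
])
print(balanced["colour"].value_counts())
```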
Aggregating

Aggregation is the final step that can be done to prepare your data for model ingestion. There may be cases where you have a categorical variable that you think would have strong predictive power in the model. The issue might arise when you have only a few data points in some of the categories. The result would be that the model cannot generalise well based on a small handful of data points.
‘Other’ Category
To solve this issue, a common solution is to create an ‘Other’ (or equivalent) category. For example, imagine that you want to predict the value of a car and colour is one of the variables. However, in the sample you are using to train the model, there are only a few gold or green cars; there are lots of white or black cars, but some colours are rare. Rather than asking the model to learn from a handful of examples per rare colour, you can group all the odds and ends into a single ‘Other’ category and let the model learn how ‘not one of the main colours’ affects the value.
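A quick sketch of that grouping in pandas, with a made-up colour column and an arbitrary threshold of five cars:

```python
import pandas as pd

cars = pd.DataFrame({
    "colour": ["white"] * 50 + ["black"] * 40 + ["gold"] * 2 + ["green"] * 1,
})

# Any colour seen fewer than five times gets lumped into 'Other'.
counts = cars["colour"].value_counts()
rare = counts[counts < 5].index
cars["colour_grouped"] = cars["colour"].where(~cars["colour"].isin(rare), "Other")
print(cars["colour_grouped"].value_counts())
```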
Locations
Perhaps you’ve got data that contains addresses. You normalise them, but it turns out that the address information is too granular. This means the model cannot generalise about cities or neighbourhoods. So, what you can do is group data points to a general area.
In some instances, I have used clustering to get general clusters of addresses. I can then use location clusters as categorical variables. Another approach could be to use country or city as the variable rather than specific addresses.
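As a sketch of the clustering idea, k-means over latitude/longitude pairs gives you a cluster label you can use as a categorical feature (the coordinates below are made up, and for points spread across the globe you would want a proper geographic distance rather than plain Euclidean):

```python
import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

# Latitude/longitude pairs for a handful of customer addresses (invented).
coords = np.array([
    [40.71, -74.00], [40.73, -73.99],    # New York-ish
    [34.05, -118.24], [34.06, -118.25],  # Los Angeles-ish
])

# Cluster the points and use the cluster label as a categorical variable.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(labels)  # two addresses per cluster
```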
Discretising
Another situation might be that your data is a bit sparse, such that you’ve only got one person in the dataset who is aged 94. Maybe these extreme cases can be aggregated into their own category, or maybe age could be grouped into age brackets. Effectively, a numerical value gets converted into a categorical or less granular numerical one.
This is a good approach when trying to make a model more robust to overfitting.
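pandas’ cut function does exactly this kind of bracketing; a small sketch with made-up ages and arbitrary bin edges:

```python
import pandas as pd

ages = pd.Series([23, 35, 47, 61, 94])

# Convert a numeric column into ordered age brackets.
brackets = pd.cut(ages, bins=[0, 30, 50, 70, 120],
                  labels=["<30", "30-49", "50-69", "70+"])
print(brackets.value_counts().sort_index())
```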
Summary
Making sure to have clean and useful data is a fundamental and critical part of deriving insight and value from data. Great predictions and decisions come much more easily when the data are in a usable state.
We talked about four broad transformations a dataset can go through. These transformations ensure that the data comes out on the other side containing some nuggets of gold.
The first was to filter unwanted errors and data. Then we discussed approaches to normalise and standardise the data. Once the data are tidy, we can augment the data using external information, predictions, or synthetic data. Finally, it is worth aggregating and grouping the data where it makes sense.
We dive into detail on how we can prepare data and fit models in our new machine learning micro-credential provided by QRC.
Find more about our Machine Learning Fundamentals micro-credential here.
Originally posted on Medium by Gio at QRC.