Every time I applied for a job, I inevitably got asked the awkward question about salary: “What salary are you expecting?” I am not going to lie: a few years ago, I was not too sure how to answer. Now, however, the answer is simple. It’s your market value!
What is your market value? Simply put, it is the going rate employers are willing to pay for people with your skills, experience, and education. Of course, there are other variables that come into the equation, but those are often the first things employers look at. The first impression, if you will.
So, when asking for a given salary, you should have a sense of what alternative employers are paying. That way, the market stays competitive and no one gets conned into paying more, or receiving less, than the fair market rate.
What does this all have to do with machine learning? Well, we can get this estimate for our market value simply by observing the data. And this is what I did! I took thousands of current listings and trained a model to predict the fair salary for an input CV. So how did I do it?
Collecting the Data
Based in New Zealand, I collected data from the top job listing site in NZ, seek.co.nz. At the time of writing, there were over 34,500 listings in New Zealand!
I was able to gather a sample of listings relevant to me, along with their salary ranges.
The salary range is required as input when an employer lists the job, so it could be seen as the salary they are willing to offer for their job. The middle point of that range is going to be the labels for training the predictive model.
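A minimal sketch of that labelling step, assuming each listing carries a lower and an upper bound for its salary range. The function name and the example listings are hypothetical, not taken from the scraped data.

```python
def salary_midpoint(low: float, high: float) -> float:
    """Return the middle of an advertised salary range."""
    if low > high:
        raise ValueError("low must not exceed high")
    return (low + high) / 2


# Hypothetical listings: (title, range low, range high) in NZD.
listings = [
    ("Data Analyst", 70_000, 90_000),
    ("Senior Data Scientist", 130_000, 160_000),
]

# One training label per listing.
labels = [salary_midpoint(low, high) for _, low, high in listings]
print(labels)  # [80000.0, 145000.0]
```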
I chose to narrow down the sample to jobs in categories that are relevant to me (for example, I know I have no tradie skills, unfortunately!). I was left with a sample of approximately 4,000 job listings.
Once I had the data, a big part of the job was done. Now, it was a matter of finding patterns in the data. In theory, there should be predictors that can explain the salary offer. For example, if the job requires 10 years of experience, it stands to reason that it offers a higher salary than a job that requires just 1.
So, years of experience required or seniority should be a predictor. Unfortunately, there is no field for ‘required years of experience’; so, I would need to somehow determine this just from the text in the job title or description.
This variable was contentious for me because I found that you could describe seniority in terms of the salary range. If the job’s salary range is $150k-$200k+, then chances are that the role is an executive-level job, for example. The issue with using this approach is that it is basically using the answers to create the predictor. Now, I decided to allow this ‘encoding’ of the salary range information into a categorical ‘seniority’ label because the model will take input from the user about their seniority to predict the market value. I should be able to say whether I am a junior, intermediate, or senior, which will flow into the prediction.
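To make the idea concrete, here is a small sketch of how a salary mid-point could be turned into a categorical seniority label. The band edges and the function name are assumptions for illustration only; the cut-offs actually used are not stated here.

```python
from bisect import bisect_right

# Hypothetical band edges in NZD; the real cut-offs are not given in the article.
BAND_EDGES = [80_000, 120_000, 150_000]
BAND_NAMES = ["junior", "intermediate", "senior", "executive"]


def seniority_from_midpoint(midpoint: float) -> str:
    """Map a salary mid-point to a coarse seniority label."""
    # bisect_right counts how many edges are <= midpoint,
    # which is exactly the index of the matching band.
    return BAND_NAMES[bisect_right(BAND_EDGES, midpoint)]


for value in (65_000, 100_000, 145_000, 175_000):
    print(value, seniority_from_midpoint(value))
```

Because the label is derived from the target, it leaks some of the answer into the features. That is acceptable here only because the user supplies their own seniority at prediction time.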
I tried modelling the market value without this variable and, as expected, it performed worse than the version that is given your seniority.
Another likely predictor would be the job type. Some jobs are in higher demand than others. If you have tech skills, say, the salary offers are much higher than for administrative skills. This dynamic arises from simple supply and demand in the labour market, so there should be a variable that accounts for job type. Fortunately, Seek has job categories, so this is an easy variable to add to the model.
Below, I have calculated the average salary offer for several top categories relevant to data analytics. Information and communication technology jobs were, on average, higher than others.
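A calculation like this can be sketched in a few lines. The rows below are made-up placeholders rather than real listings, and the variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (category, mid-point salary) pairs, not real data.
rows = [
    ("Information & Communication Technology", 120_000),
    ("Information & Communication Technology", 100_000),
    ("Accounting", 85_000),
    ("Accounting", 75_000),
    ("Administration & Office Support", 60_000),
]

# Group the mid-points by category.
by_category = defaultdict(list)
for category, midpoint in rows:
    by_category[category].append(midpoint)

# Average each group and print from highest to lowest.
average_by_category = {c: mean(v) for c, v in by_category.items()}
for category, avg in sorted(average_by_category.items(), key=lambda kv: -kv[1]):
    print(f"{category}: ${avg:,.0f}")
```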
Job Requirements and Desired Skills
The variables to capture job categories, locations, and seniority were not enough to capture the nuance for specific jobs. To do this, I trained a named entity recogniser (NER) model to extract job skills and requirements from the job listing. For example, from the string:
“Strong experience with Azure Databricks, R and Python. Proven experience in writing algorithms.”
I get the following: [‘experience with Azure Databricks’, ‘R and Python’, ‘writing algorithms’].
Do that over a whole job listing, and you cut down the words to the most critical bits of information.
With these skill descriptions, I could create a ‘job description text embedding’ using a pre-trained language model. I could then convert the text into machine-readable vectors.
In essence, I convert text into vectors like so:
To distil the information contained in the text vectors, I ran the skills vectors through a dimension reduction algorithm to reduce noise in the data and the number of dimensions for later training of the model. In two dimensions, the listings show some ‘groupings’ of similar jobs.
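The text-to-vector and reduction steps can be sketched as follows. Two substitutions are made so that the example runs offline, and both are assumptions: TF-IDF vectors stand in for the pre-trained language model embeddings, and TruncatedSVD stands in for the dimension reduction algorithm, which is not named above. The skill strings are invented.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical extracted skill phrases, one string per listing.
skills = [
    "experience with Azure Databricks, R and Python, writing algorithms",
    "Python, SQL, building machine learning models",
    "accounts payable, payroll, Xero",
    "payroll, reconciliations, accounts receivable",
]

# Stand-in for the embedding step: sparse TF-IDF vectors.
vectors = TfidfVectorizer().fit_transform(skills)

# Reduce to two dimensions so that the listings can be plotted.
reducer = TruncatedSVD(n_components=2, random_state=0)
coords = reducer.fit_transform(vectors)
print(coords.shape)  # (4, 2)
```

With real embeddings, listings that ask for similar skills should land close together in the reduced space, which is where the groupings come from.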
Choosing the Features
Once I had all the features I thought would be relevant, I then needed to cull the irrelevant ones and keep the best predictors. To do this, I used a beautiful tool called Featurewiz. This library automates the feature selection pipeline in two stages. First, it pairs up highly correlated features and keeps the one with the most explanatory power. Then, it recursively fits an XGBoost model to find the most important features.
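The first stage, pruning correlated features, can be illustrated with a simplified sketch. This is not Featurewiz's actual implementation or API: the function, the threshold, and the synthetic columns are all hypothetical, and the recursive XGBoost stage is not reproduced.

```python
import numpy as np


def drop_correlated(X, y, names, threshold=0.8):
    """For each highly correlated feature pair, keep the feature that
    correlates more strongly with the target and drop the other."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    target_corr = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    dropped = set()
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if i in dropped or j in dropped:
                continue
            if corr[i, j] > threshold:
                dropped.add(i if target_corr[i] < target_corr[j] else j)
    return [name for k, name in enumerate(names) if k not in dropped]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.3 * rng.normal(size=200)  # noisy near-duplicate of a
c = rng.normal(size=200)            # independent feature
y = 3 * a + c + 0.1 * rng.normal(size=200)

kept = drop_correlated(np.column_stack([a, b, c]), y, ["a", "b", "c"])
print(kept)
```

Here `b` is a noisy copy of `a`, so the pair is flagged as redundant and the copy with the weaker link to the target is dropped.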
Fitting a Model
Finally, with the data and the best subset of features, I was ready to fit a model to make predictions based on my own skills and experience.
To start off with, it is wise to train up a simple model like multiple linear regression to gain a better understanding of the data and create a baseline performance. Right off the bat, it was clear that seniority is a significant predictor (unsurprisingly). Job categories are good predictors too.
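A baseline of this kind might look like the sketch below, assuming scikit-learn and a one-hot encoded seniority feature. The data is synthetic, and the salary premiums are invented purely so that the fitted coefficients have something to recover.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
levels = ["junior", "intermediate", "senior"]
# Hypothetical salary premiums over the junior level, in NZD.
premium = {"junior": 0, "intermediate": 30_000, "senior": 60_000}

seniority = rng.choice(levels, size=300)
salary = (
    65_000
    + np.array([premium[s] for s in seniority])
    + rng.normal(0, 5_000, size=300)
)

# One-hot encode, dropping "junior" so that it acts as the reference level.
X = np.column_stack([(seniority == lvl).astype(float) for lvl in levels[1:]])

baseline = LinearRegression().fit(X, salary)
for lvl, coef in zip(levels[1:], baseline.coef_):
    print(f"{lvl}: +${coef:,.0f} over junior")
print("R^2:", round(baseline.score(X, salary), 3))
```

Reading the coefficients directly is the main appeal of a linear baseline: each one is the estimated salary difference relative to the reference level.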
With all the features included, the model performed relatively well. Out-of-sample predictions were usually within ±$10,000 of the actual mid-point salary offer (so within the average offer range of about $20,000). I have plotted the final model's predictions against the actual test data below:
I chose to fit over 10 different regression models to see which one had the best cross-validated performance. I found that LightGBM worked particularly well, so I chose to keep it.
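The comparison can be sketched with scikit-learn's cross-validation helper. This example uses four scikit-learn regressors on synthetic data as stand-ins; it does not include LightGBM or the full list of models tried, and the choice of mean absolute error as the metric is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the job listings.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in candidates.items():
    # scikit-learn reports negated MAE so that higher is always better.
    cv = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    scores[name] = -cv.mean()

best = min(scores, key=scores.get)
for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE {mae:.1f}")
print("best:", best)
```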
The model already performed quite well, but I wanted to tune it a little further to make sure that I had all of the best model parameters possible.
So I split the data into training, validation, and test samples. I then used an awesome library called Optuna, an optimisation engine, to automatically search for the set of parameters that would maximise the model's performance on the validation sample.
Once complete, my model was ready to make some predictions!
It is important to test the model on data that it has never seen before. That is why I trained on one sample, validated and tuned on another, and finally tested on a completely separate one. I’ve plotted predicted salaries against the real mid-point salary offer for all listings in the test dataset below:
Interestingly, there are some salary clumps around $65k, $100k, $130k, $160k and $210k+. I postulate that employers gravitate toward round numbers and compete at those price points.
Now, the model is complete! The performance is also accurate enough to make predictions within a range. So, it’s just a matter of inputting my own info through the model to generate a prediction.
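One simple way to check a "within a range" claim is to count how often a prediction falls inside a fixed tolerance of the actual value. The sketch below uses invented numbers and a hypothetical helper name; it is not the evaluation code behind the results above.

```python
def share_within(actual, predicted, tolerance=10_000):
    """Fraction of predictions within +/- tolerance of the actual value."""
    hits = sum(abs(a - p) <= tolerance for a, p in zip(actual, predicted))
    return hits / len(actual)


# Hypothetical mid-point offers and model predictions, in NZD.
actual = [65_000, 100_000, 130_000, 160_000]
predicted = [70_000, 92_000, 145_000, 158_000]

print(share_within(actual, predicted))  # 0.75
```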
Because one of the strongest predictors was the level of seniority, the predicted salary goes up in jumps with each level (as you can see in the plot above). The model highlights that adding skillsets may not significantly add to your market value compared to becoming a senior from a junior.
Now, becoming a senior or an executive comes with time AND skills, so it would be interesting to rerun the model without the seniority variables. This way, you would be able to see what one’s market value is without specifying the seniority.
Today, I showed another practical use of machine learning: estimating one’s own market value. This is a great starting point for initial salary negotiations or negotiating raises.
As in any machine learning project, I started by collecting relevant, high-quality data. From there, I picked and created the best predictors based on intuition. I reduced the set of predictors to the best ones, trained several models, and tuned the best one. From there, it was a matter of making relevant predictions.
At Queenstown Resort College (QRC), we are launching our new micro-credential in machine learning in beautiful New Zealand. Come join us and learn the skills you need for your next career steps!