Important Factors in Salary Determination

DATA6200, Assignment 1

Author

Jason Ives

Overview

I undertook this analysis to gain insight into which industries typically have the highest salaries, and to better understand what factors drive those salary levels. I examined age, years of experience, and education level, both within and across industries, to determine whether they relate to compensation levels in a meaningful way.

Data Cleaning

Loading the Data

To begin, I loaded the core libraries (the tidyverse and readxl), read in the data file, and added a key column to it.

Later, as I got to know the data better, it became apparent that certain cases did not represent unique and meaningful observations, either in the form of duplicate records or as bad data. I returned to this section to designate those cases for removal from the data set at the earliest possible point.

Code
##UTILITY FUNCTION REFERENCE
#str(df) - view df structure
#sessionInfo() - view loaded packages
#rm(list = ls()) - clear environment
#sum(is.na(col)) - count NA / NaN values in a column

##FILES TO INCLUDE IN FINAL PACKAGE
##ask_a_manager.xlsx
##RprtRateXchgCln_20201001_20250930.csv
##list-one.xls

gc()

##ADD LIBRARIES
library(tidyverse)
library(readxl)


##DECLARE CUSTOM FUNCTIONS
##CONVENIENCE FUNCTION FOR VIEWING DISTINCT VALUES
Show_Distinct <- function(df, param) {
  df |> 
    distinct({{param}})
}
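
##USAGE EXAMPLE (HYPOTHETICAL; THE industry COLUMN NAME ONLY EXISTS AFTER THE LATER RENAME)
#Show_Distinct(salary_data, industry)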

##DIDN'T WORK
# Word_Scoring <- function(df_words, df_scores, word_col, top_score_word, top_score, by_cols) {
#   df_words |> 
#     mutate(temp_word_lower = toupper({{word_col}})) |>
#     left_join(df_scores, by = by_cols) |> 
#     (mutate({{word_col}} = case_when({{top_score}} < keyword_count ~ {{word_col}}, .default = {{top_score_word}}))) |> 
#     (mutate({{top_score}} = case_when({{top_score}} < keyword_count ~ keyword_count, .default = {{top_score}}))) |> 
#     
#     select(-temp_word_lower) |> 
#     select(-keyword_count)
# 
#   return(df_words)
# }


##IMPORT DATA
salary_data <- read_excel('ask_a_manager.xlsx')

##ADD A KEY COLUMN TO THE DATA
salary_data <- salary_data |> 
  mutate(key = row_number()) |> 
  relocate(key)

##DROP DUPLICATE ROWS
salary_data <- salary_data |> 
  distinct(across(-c(key, Timestamp)), .keep_all = TRUE)

##DROP BAD DATA ROWS
salary_data <- salary_data |> 
  filter(key != 28022) |>  ##INVALID ENTRY
  filter(key != 28016) |>  ##0 SALARY
  filter(key != 28031) |>  ##ANNOTATED AS A TEST CASE
  filter(key != 28008) |>  ##INVALID ENTRY (LOW SALARY)
  filter(key != 28010) |>  ##INVALID ENTRY (NONSENSE TITLE)
  filter(key != 28037) |>  ##INVALID ENTRY (LOW SALARY, NONSENSE TITLE)
  filter(key != 28050) |>  ##INVALID ENTRY (LOW SALARY, NONSENSE TITLE)
  filter(key != 28056)  ##SALARY OF OVER 6 BILLION

When loading the data I received a warning related to data type coercion. I reviewed the input Excel file and the loaded data, and determined that R had automatically coerced a cell from the text “00” to a numeric 0, which was appropriate. I did not take any additional action regarding this.

Code
##EXAMINE DATA RELATED TO COERCION WARNING
names(salary_data[26564, 6])
salary_data[[26564, 6]]

General Cleaning

During my review of the coercion warning, I noticed that the column names in the input data set contain the full question text. This is useful information, but not appropriate for a data table, so I moved the question text to a new codebook data frame and renamed the data set columns following tidy data principles.

After renaming the columns, I was able to return to this point later to implement a small number of one-off data edits, and to drop the income_context column, since it did not offer much value to this analysis.

Code
##CREATE DF FOR CODEBOOK LABELS
codebook <-  data.frame('var_name' = "", 'question_text' = names(salary_data))

##RENAME COLUMNS AND APPEND QUESTION LABELS TO CODEBOOK
var_names <- c('key', 'timestamp', 'age', 'industry', 'title', 'title_context', 
                        'salary', 'addtl_compensation', 'currency', 'currency_sp', 
                        'income_context', 'country', 'state', 'city', 'work_experience_overall', 
                        'work_experience_field', 'education', 'gender', 'ethnicity')

names(salary_data) <- var_names

codebook$var_name <- var_names


##UPDATE SPECIFIC DATA CELLS USING NEW COL NAMES
salary_data <- salary_data |> 
  mutate(salary = case_when(
    key == 26467 ~ 50000,  ##SALARY OVER 5 MILLION, SINCE ADDTL COMPENSATION STARTS WITH 4, AND LAST 2 DIGITS OF SALARY ARE 4, IT IS LIKELY THIS IS A CASE OF TYPING THE FIRST DIGIT OF ADDTL COMPENSATION AS THE LAST DIGIT OF SALARY (TWICE)
    .default = salary))

##DROP INCOME_CONTEXT
salary_data <- salary_data |>
  select(-income_context)

I also examined the data types within the loaded data, and began converting some of the columns to a more appropriate data type.

Code
##REVIEW DATA TYPES AND STRUCTURES
str(salary_data)

##CONVERT ADDTL_COMPENSATION TO NUMERIC
salary_data <- salary_data |> 
  mutate(addtl_compensation = case_when(
    is.na(addtl_compensation) ~ "0", 
    .default = addtl_compensation)) |> 
  mutate(addtl_compensation = as.numeric(addtl_compensation))

##DEFINE AGE AND WORK_EXP VARIABLES AS FACTORS
age_levels = c("under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over")
work_experience_levels = c("1 year or less", "2 - 4 years", "5-7 years", "8 - 10 years", "11 - 20 years", "21 - 30 years", "31 - 40 years", "41 years or more")

salary_data <- salary_data |> 
  mutate(age = factor(age, levels = age_levels)) |>
  mutate(work_experience_overall = factor(work_experience_overall, levels = work_experience_levels)) |> 
  mutate(work_experience_field = factor(work_experience_field, levels = work_experience_levels))

Standardizing Industry Values Using Keyword Scoring

The industry column contained over 1,100 unique values. It had some loose categorization to it, but it could not be treated as a factor until it was standardized. I checked the frequencies and found that only 24 values occur more than 100 times, yet those account for 91% of all industry entries. With that in mind, I decided on a multi-phase approach to standardizing the industry column.

  1. Replace each industry entry with the single keyword that best categorizes the entry as a whole.
    i. Break the industry data into curated sets of single words.
    ii. Record word frequencies across the entire data set.
    iii. Assign a frequency score to each word in each industry entry, based on its frequency across the whole data set as calculated in step ii.
    iv. Assign the word with the highest frequency score for each entry as its industry.
  2. Keep any resulting groupings with a frequency of more than 100. Recode the rest as “Other”.
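The frequency check behind that decision can be sketched as follows. This is an illustrative reconstruction rather than the exact code I ran, using the industry column name as renamed above.

Code
##SKETCH OF THE FREQUENCY CHECK (ILLUSTRATIVE RECONSTRUCTION)
industry_freq <- salary_data |> 
  count(industry, sort = TRUE)

##HOW MANY VALUES OCCUR MORE THAN 100 TIMES, AND WHAT SHARE OF ALL ROWS THEY COVER
industry_freq |> 
  filter(n > 100) |> 
  summarise(value_count = n(), row_share = sum(n) / nrow(salary_data))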
Code
##INDUSTRY

##BUILD A UNIQUE WORD LIST BY SPLITTING BASED ON SPACES, SYMBOLS, CONJUNCTIONS, ETC
keyword_results <- salary_data |> 
  mutate(keywords = str_replace_all(industry, "[^\\w\\s]", ".")) |> 
  mutate(keywords = str_replace_all(keywords, "[0-9]", ".")) |> 
  mutate(keywords = str_replace_all(keywords, "\\.OR\\.", ".")) |> 
  mutate(keywords = str_replace_all(keywords, "\\.AND\\.", ".")) |>
  mutate(keywords = str_replace_all(keywords, "\\.SO\\.", ".")) |>
  mutate(keywords = str_replace_all(keywords, "\\.+", " ")) |> 
  mutate(keywords = str_squish(str_trim(keywords)))

##RECODE 2 WORD INDUSTRIES SO THEY ARE TREATED AS ONE WORD
keyword_results <- keyword_results |> 
  mutate(keywords = str_replace_all(keywords, regex("health care", ignore_case = TRUE), "HEALTHCARE")) |> 
  mutate(keywords = str_replace_all(keywords, regex("social work", ignore_case = TRUE), "SOCIAL_WORK")) |> 
  mutate(keywords = str_replace_all(keywords, regex("skilled trade", ignore_case = TRUE), "SKILLED_TRADE")) |> 
  mutate(keywords = str_replace_all(keywords, regex("real estate", ignore_case = TRUE), "REAL_ESTATE"))

##COMBINE THE KEYWORDS INTO A SINGLE STRING, THEN SPLIT IT BACK INTO A VECTOR BY SPACE
raw_keywords <- tibble(occurrence = toupper(unlist(strsplit(keyword_results$keywords, split = " "))))

#COUNT UNIQUE KEYWORDS
industry_keywords <- raw_keywords |> 
  count(occurrence) |> 
  rename(unique_keyword = occurrence, keyword_count = n)

##MANUALLY EDIT SCORES ON CERTAIN WORDS TO EMPHASISE OR EXCLUDE THEM
industry_keywords <- industry_keywords |> 
  mutate(keyword_count = case_when(
    unique_keyword == "B" ~ -1, 
    unique_keyword == "R" ~ -1, 
    unique_keyword == "D" ~ -1, 
    unique_keyword == "OR" ~ -1, 
    unique_keyword == "AND" ~ -1, 
    unique_keyword == "SO" ~ -1, 
    unique_keyword == "ASSOCIATION" ~ -1, 
    unique_keyword == "ADMINISTRATION" ~ -1, 
    unique_keyword == "PUBLIC" ~ -1, 
    unique_keyword == "MEDIA" ~ 1000,
    unique_keyword == "LIBRARY" ~ 2000,
    unique_keyword == "BIOSCIENCE" ~ 25,
    .default = keyword_count))

With the keyword frequency scores in hand, I was able to build the keyword scoring tibble that would track the scores for each word in each observation. The largest set of keywords had 24 distinct words, so I created the scoring data frame to store up to 24 words, along with a frequency score for each. I then calculated the scores for each word column, using left_join to look up that word's score in the scoring table, and at the same time assessed whether that word scored higher than the existing high-scoring word. After repeating that process for each of the 24 keyword columns, I was left with a single column containing the highest-scoring keyword for each observation.

Code
##SPLIT EACH INDUSTRY ENTRY INTO INDIVIDUAL WORDS IN INDUSTRY_SCORING TABLE
score_matrix <- str_split_fixed(keyword_results$keywords, " ", 24)
colnames(score_matrix) <- paste0("word", 1:24)

industry_scoring <- as_tibble(score_matrix) |> 
  mutate(across(where(is.character), toupper)) |> 
  mutate(key = keyword_results$key) |> 
  mutate(top_word = word1) |> 
  relocate(key, top_word)

##FIND TOP SCORING WORD FOR EACH INDUSTRY
##SCORE WORD 1, WHICH WAS USED TO INITIALIZE TOP_WORD
industry_scoring <- industry_scoring |> 
  mutate(temp_word_upper = toupper(top_word)) |>
  left_join(industry_keywords, by = c("temp_word_upper" = "unique_keyword")) |>
  rename(top_count = keyword_count) |>
  relocate(top_count, .before = word1)

##SCAN WORDS 2-24, REPLACING TOP WORD / SCORE WHENEVER A HIGHER SCORING WORD IS FOUND
for (word_col in paste0("word", 2:24)) {
  industry_scoring <- industry_scoring |> 
    mutate(temp_word_upper = toupper(.data[[word_col]])) |>
    left_join(industry_keywords, by = c("temp_word_upper" = "unique_keyword")) |> 
    mutate(top_word = case_when(
      top_count < keyword_count ~ .data[[word_col]], 
      .default = top_word)) |> 
    mutate(top_count = case_when(
      top_count < keyword_count ~ keyword_count, 
      .default = top_count)) |> 
    select(-keyword_count)
}

##TEMP COLUMN REMOVAL
industry_scoring <- industry_scoring |> 
  select(-temp_word_upper)

Finally, I manually tuned the top scoring word results to further consolidate and clarify the industries. In the end, I was able to join a new factor-type industry column, containing 25 unique industries, to the salary data table.

Code
##CODING EQUIVALENT ENTRIES TO A SINGLE WORD
industry_scoring <- industry_scoring |> 
  mutate(top_word = case_when(
    startsWith(top_word, "LIBRARY") ~ "LIBRARIES",
    startsWith(top_word, "EDUC") ~ "EDUCATION",
    startsWith(top_word, "PHARMA") ~ "PHARMACEUTICAL", 
    startsWith(top_word, "VET") ~ "VETERINARY", 
    startsWith(top_word, "RELIG") ~ "RELIGION", 
    startsWith(top_word, "FAITH") ~ "RELIGION", 
    startsWith(top_word, "CHAPLAIN") ~ "RELIGION", 
    startsWith(top_word, "SUMMER") ~ "CHILDCARE", 
    startsWith(top_word, "FUNDING") ~ "FUNDRAISING", 
    startsWith(top_word, "TRADE") ~ "SKILLED_TRADE", 
    startsWith(top_word, "DEFENCE") ~ "DEFENSE", 
    startsWith(top_word, "NON") ~ "NONPROFIT", 
    startsWith(top_word, "CLINICAL") ~ "RESEARCH", 
    .default = top_word))

##REMOVING UNDERSCORES USED FOR JOINING BEFORE SPLIT
industry_scoring <- industry_scoring |> 
  mutate(top_word = str_replace_all(top_word, "_", " "))

##RECODE REMAINING NA TO NO RESPONSE
industry_scoring <- industry_scoring |> 
  mutate(top_word = case_when(
    is.na(top_word) ~ "NO RESPONSE", 
    .default = top_word))

##RECALCULATE SCORES
industry_scoring <- industry_scoring |> 
  add_count(top_word, name = "top_word_count") |> 
  relocate(top_word_count, .before = word1) |>
  select(-top_count)

##RECODE ANY ITEM WITH < .5% OF TOTAL INDUSTRY COUNT AS OTHER
industry_scoring <- industry_scoring |> 
  mutate(top_word = case_when(
    top_word_count < round(nrow(industry_scoring) * .005) ~ "OTHER", 
    .default = top_word))

##UPDATE INDUSTRY COLUMN IN SALARY_DATA WITH NEW VALUES
salary_data <- salary_data |> 
  mutate(industry = industry_scoring$top_word)

##CLEANUP INDUSTRY INTERIM OBJECTS
rm(industry_keywords)
rm(industry_scoring)
rm(keyword_results)
rm(raw_keywords)
rm(score_matrix)

gc()

salary_data$industry <- factor(salary_data$industry)

Sorting Title Data into Bins

Titles are a very subjective matter, and their meaning and nomenclature can vary widely across companies, industries, and locations. That subjectivity makes the title variable a less than ideal candidate for a full refactoring like the one I undertook with the industry variable. Instead, I decided to group the titles into four bins. I did some research [1] to get a better sense of how job titles might be categorized, and then referenced a thesaurus to create keyword lists for title categorization.

Eventually I defined two separate boolean variables to cross-cut the title data. One is management / not management, and the other is senior / not senior. Together they identify senior management [1, 1], management [1, 0], senior staff [0, 1], and all others without any clear indicator [0, 0].

Code
##TITLE

##WORD LISTS FOR CATEGORIZATION
#management_keywords <- c("manage", "manager","management", "supervisor", "director", "chief", "principal", "owner", "president")
#senior_keywords <- c("senior","snr", "ranking", "upper", "lead", "officer", "expert", "sme", "executive", "principal")

##ADD BOOLEAN COLUMNS AND DEFINE
salary_data <- salary_data |> 
  mutate(title_management = case_when(
    str_detect(title, "(?i)manage") ~ TRUE,  ##ALSO MATCHES MANAGER / MANAGEMENT
    str_detect(title, "(?i)supervisor") ~ TRUE,
    str_detect(title, "(?i)director") ~ TRUE,
    str_detect(title, "(?i)chief") ~ TRUE,
    str_detect(title, "(?i)principal") ~ TRUE,
    str_detect(title, "(?i)owner") ~ TRUE,
    str_detect(title, "(?i)president") ~ TRUE,
    .default = FALSE))

salary_data <- salary_data |> 
  mutate(title_senior = case_when(
    str_detect(title, "(?i)senior") ~ TRUE, 
    str_detect(title, "(?i)snr") ~ TRUE,
    str_detect(title, "(?i)ranking") ~ TRUE,
    str_detect(title, "(?i)upper") ~ TRUE,
    str_detect(title, "(?i)lead") ~ TRUE,
    str_detect(title, "(?i)officer") ~ TRUE,
    str_detect(title, "(?i)expert") ~ TRUE,
    str_detect(title, "(?i)sme") ~ TRUE,
    str_detect(title, "(?i)executive") ~ TRUE,
    str_detect(title, "(?i)principal") ~ TRUE,
    .default = FALSE))

##DROP TITLE COLUMNS AND ORGANIZE
salary_data <- salary_data |> 
  select(-title) |> 
  select(-title_context) |> 
  relocate(title_management, title_senior, .after = industry)

Curating Country Using Substring Detection

Country contained a fairly small number of true unique values, but they were formatted in a wide variety of ways. To reconcile this, I iterated through a manual review-and-update cycle:

  1. Review the data based on frequency, looking for similar data that represent a particular country but are not in the preferred format.
  2. Identify a specific substring that can be used to reconcile several different values that are actually referencing the same country.
  3. Implement the substring based recoding, and begin again at step 1.

As the scope of mis-formatted values narrowed, I began to reconcile single entries. If there was some other reliable indicator in the data (city combined with currency, for instance), I would recode the entry to capture the country; otherwise I recoded it to NA.
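Step 1 of that cycle can be sketched as a simple frequency review. This is an illustrative reconstruction rather than the exact code used:

Code
##REVIEW COUNTRY VALUES BY FREQUENCY TO SPOT VARIANT SPELLINGS (ILLUSTRATIVE)
salary_data |> 
  count(country, sort = TRUE) |> 
  print(n = 50)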

Code
##MAKE COUNTRY ALL CAPS
salary_data <- salary_data |> 
  mutate(country = toupper(country))

##RECODE COUNTRY ENTRIES
 salary_data <- salary_data |>
   mutate(country = str_replace_all(country, "\\.", "")) |> 
   mutate(country = case_when(
    str_detect(country, "UNITED S") ~ "UNITED STATES", 
    str_detect(country, "JAPAN") ~ "JAPAN", 
    str_detect(country, "CAN") ~ "CANADA", 
    str_detect(country, "NETHERL") ~ "NETHERLANDS", 
    str_detect(country, "AUSTRAL") ~ "AUSTRALIA", 
    str_detect(country, "XICO") ~ "MEXICO", 
    str_detect(country, "AUSTRI") ~ "AUSTRIA", 
    ##MATCH UKRAINE BEFORE THE BROADER "UK" PATTERN
    str_detect(country, "UKRAINE") ~ "UKRAINE", 
    str_detect(country, "UK") ~ "UNITED KINGDOM", 
    str_detect(country, "ITALY") ~ "ITALY", 
    str_detect(country, "CHINA") ~ "CHINA", 
    str_detect(country, "UNITED K") ~ "UNITED KINGDOM", 
    str_detect(country, "ENGLA") ~ "UNITED KINGDOM", 
    str_detect(country, "SCOTLAND") ~ "UNITED KINGDOM", 
    str_detect(country, "WALES") ~ "UNITED KINGDOM", 
    str_detect(country, "BRITAIN") ~ "UNITED KINGDOM", 
    str_detect(country, "LONDON") ~ "UNITED KINGDOM", 
    str_detect(country, "NORTHERN IRELAND") ~ "UNITED KINGDOM", 
    str_detect(country, "WORLDWIDE") ~ "UNITED STATES", 
    str_detect(country, "BURMA") ~ "MYANMAR", 
    country == "US" ~ "UNITED STATES", 
    country == "US>" ~ "UNITED STATES", 
    country == "ISA" ~ "UNITED STATES", 
    country == "🇺🇺" ~ "UNITED STATES", 
    country == "Y" ~ "UNITED STATES", 
    country == "UNITED Y" ~ "UNITED STATES", 
    country == "IS" ~ "UNITED STATES", 
    country == "UNITEF STATED" ~ "UNITED STATES", 
    country == "THE US" ~ "UNITED STATES", 
    country == "UNITES STATES" ~ "UNITED STATES", 
    country == "U S" ~ "UNITED STATES", 
    country == "USD" ~ "UNITED STATES", 
    country == "UA" ~ "UNITED STATES", 
    country == "VIRGINIA" ~ "UNITED STATES", 
    country == "HARTFORD" ~ "UNITED STATES", 
    country == "CALIFORNIA" ~ "UNITED STATES", 
    country == "UNITER STATEZ" ~ "UNITED STATES", 
    country == "UNIITED STATES" ~ "UNITED STATES", 
    country == "USS" ~ "UNITED STATES", 
    country == "US OF A" ~ "UNITED STATES", 
    country == "UXZ" ~ "UNITED STATES", 
    country == "UNITES KINGDOM" ~ "UNITED KINGDOM", 
    country == "CZECHIA" ~ "CZECH REPUBLIC", 
    country == "EUROPE" ~ "CZECH REPUBLIC", 
    country == "CATALONIA" ~ "SPAIN", 
    country == "NEW ZEALAND AOTEAROA" ~ "NEW ZEALAND", 
    country == "AOTEAROA NEW ZEALAND" ~ "NEW ZEALAND", 
    country == "FROM NEW ZEALAND BUT ON PROJECTS ACROSS APAC" ~ "NEW ZEALAND", 
    country == "NZ" ~ "NEW ZEALAND", 
    country == "CSNADA" ~ "CANADA", 
    country == "LUXEMBURG" ~ "LUXEMBOURG", 
    country == "REMOTE (PHILIPPINES)" ~ "PHILIPPINES", 
    country == "NL" ~ "NETHERLANDS", 
    country == "NEDERLAND" ~ "NETHERLANDS", 
    country == "FROM ROMANIA, BUT FOR AN US BASED COMPANY" ~ "ROMANIA", 
    country == "I WORK FOR A UAE-BASED ORGANIZATIO" ~ "UNITED STATES", 
    country == "SAN FRANCISCO" ~ "UNITED STATES", 
    country == "COMPANY IN GERMANY I WORK FROM PAKISTAN" ~ "PAKISTAN", 
    country == "I WORK FOR AN US BASED COMPANY BUT I'M FROM ARGENTINA" ~ "ARGENTINA", 
    country == "UAE" ~ "UNITED ARAB EMIRATES",
    country == "POLICY" ~ "CANADA",
    country == "HONG KONH" ~ "CHINA",
    country == "PANAMÁ" ~ "PANAMA",
    country == "AFRICA" ~ NA, 
    key == 11070 ~ "INDIA",
    key == 15016 ~ "UNITED STATES", 
    key == 25549 ~ "ARGENTINA", 
    str_detect(country, "USA") ~ "UNITED STATES", 
    str_detect(country, "AMERICA") ~ "UNITED STATES", 
    str_detect(country, "PUERTO RICO") ~ "UNITED STATES", 
    str_detect(country, "HONG KONG") ~ "CHINA", 
    str_detect(country, "WE DON'T GET RAISES, WE GET QUARTERLY BO") ~ NA, 
    str_detect(country, "CONTRACTS") ~ NA, 
    str_detect(country, "GLOBAL") ~ NA, 
    str_detect(country, "CURRENTLY FINANCE") ~ NA, 
    str_detect(country, "DEDUCTED FOR BENEFITS") ~ NA, 
    str_detect(country, "REMOTE FROM WHEREVER I WANT") ~ NA, 
    str_detect(country, "BONUS BASED ON MEETING YEARLY GOALS") ~ NA,
    str_detect(country, "I EARN COMMISSION ON SALES IF I MEET QUOTA") ~ NA, 
    str_detect(country, "INTERNATIONAL") ~ NA, 
    str_detect(country, "I WAS BROUGHT IN ON THIS SALARY TO HELP WITH THE EHR") ~ NA, 
    str_detect(country, "COUNTRY WITHHELD") ~ NA, 
    str_detect(country, "NA") ~ NA, 
    str_detect(country, "STATES") ~ "UNITED STATES", 
   .default = country))

Currency Conversion

Cleaning and Consolidating Currencies Using String Detection

Starting with the currency_sp column, I followed a cleaning process similar to the one used for country, focused on curated substring detection and value replacement. The cleaned currency_sp data was then consolidated with the currency data and further refined; one key step was using the now-clean country data to split the combined AUD/NZD currency into two separate currencies.

Referencing Historical Exchange Rate Data to Convert Currencies to CAD

I downloaded exchange rate data [2] and ISO 4217 code data [3] and, after a bit of cleaning to make the data sets compatible, created a reference data frame for joining relevant exchange rate data to the analysis data set. Rates were selected by timestamp date and three-letter ISO 4217 currency code, so that each conversion used the rate in effect for the currency in question at the time the response was captured. Conversions were then calculated for the salary and addtl_compensation columns. At this point I also created a new total_compensation_cad column, capturing the salary and additional compensation in a single value.

As a final step, I dropped all observations with an unrealistically low total compensation from the dataset. I set the cutoff at 23,400, which is the annualized Canadian minimum wage at the end of 2021 [4] ($15/hr × 40 hours × 52 weeks = $31,200) minus 25%. This decision was based on the original question text for salary, which indicated that data should be entered in a standardized format representing a 40-hour work week and 52 weeks of work a year.
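The conversion and cutoff described above can be sketched as follows. This is an illustrative reconstruction: the reference data frame name (exchange_rates) and its column names (rate_date, iso_code, rate_to_cad) are assumptions, not the exact objects used.

Code
##ILLUSTRATIVE SKETCH OF THE CONVERSION AND CUTOFF (HYPOTHETICAL OBJECT NAMES)
#salary_data <- salary_data |> 
#  mutate(rate_date = as.Date(timestamp)) |> 
#  left_join(exchange_rates, by = c("rate_date" = "rate_date", "currency" = "iso_code")) |> 
#  mutate(salary_cad = salary * rate_to_cad, 
#         addtl_compensation_cad = addtl_compensation * rate_to_cad, 
#         total_compensation_cad = salary_cad + addtl_compensation_cad) |> 
#  filter(total_compensation_cad >= 23400)  ##2021 MINIMUM WAGE (31200) MINUS 25%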

Code
##CURRENCY CLEANING
##REPLACE SPECIFIC CURRENCY VALUES WITH 3 LETTER CODE
salary_data <- salary_data |> 
  mutate(currency_sp = case_when(
    currency_sp == "Peso Argentino" ~ "ARS", 
    currency_sp == "canadian" ~ "CAD", 
    currency_sp == "Canadian" ~ "CAD", 
    currency_sp == "Indian rupees" ~ "INR", 
    currency_sp == "BRL (R$)" ~ "BRL", 
    currency_sp == "Mexican pesos" ~ "MXN", 
    currency_sp == "PLN (Polish zloty)" ~ "PLN", 
    currency_sp == "czech crowns" ~ "CZK", 
    currency_sp == "Norwegian kroner (NOK)" ~ "NOK", 
    currency_sp == "ILS/NIS" ~ "ILS", 
    currency_sp == "NIS (new Israeli shekel)" ~ "ILS", 
    currency_sp == "US Dollar" ~ "USD", 
    currency_sp == "RMB (chinese yuan)" ~ "RMB", 
    currency_sp == "Taiwanese dollars" ~ "TWD", 
    currency_sp == "Philippine Peso" ~ "PHP", 
    currency_sp == "KRW (Korean Won)" ~ "KRW", 
    currency_sp == "ILS (Shekel)" ~ "ILS", 
    currency_sp == "China RMB" ~ "RMB", 
    currency_sp == "AUD Australian" ~ "AUD", 
    currency_sp == "Polish Złoty" ~ "PLN", 
    currency_sp == "Philippine peso (PHP)" ~ "PHP",
    currency_sp == "Australian Dollars" ~ "AUD", 
    currency_sp == "Argentinian peso (ARS)" ~ "ARS", 
    currency_sp == "Israeli Shekels" ~ "ILS", 
    currency_sp == "PhP (Philippine Peso)" ~ "PHP", 
    currency_sp == "Argentine Peso" ~ "ARS", 
    currency_sp == "Philippine Pesos" ~ "PHP", 
    currency_sp == "INR (Indian Rupee)" ~ "INR", 
    currency_sp == "Danish Kroner" ~ "DKK", 
    currency_sp == "Korean Won" ~ "KRW", 
    currency_sp == "Euro" ~ "EUR", 
    currency_sp == "Mexican Pesos" ~ "MXN", 
    currency_sp == "THAI BAHT" ~ "THB", 
    currency_sp == "Thai Baht" ~ "THB", 
    currency_sp == "croatian kuna" ~ "HRK", 
    currency_sp == "PLN (Zwoty)" ~ "PLN", 
    currency_sp == "Rupees" ~ "INR", 
    currency_sp == "Singapore Dollara" ~ "SGD", 
    currency_sp == "BR$" ~ "BRL", 
    currency_sp == "RM" ~ "MYR", 
    currency_sp == "RS" ~ "INR", 
    currency_sp == "FF" ~ "FRF", 
    .default = currency_sp))

##SET REMAINING currency_sp THAT IS NOT 3 LETTERS, AND ITEMS THAT ARE NOT A CURRENCY, TO NA
salary_data <- salary_data |> 
  mutate(currency_sp = case_when(
    str_length(currency_sp) != 3 ~ NA, 
    currency_sp == "0" ~ NA, 
    currency_sp == "NA" ~ NA, 
    currency_sp == "N/A" ~ NA, 
    currency_sp == "-" ~ NA, 
    .default = currency_sp))

##TOUPPER, REPLACE OTHER IN CURRENCY, RECONCILE REMAINING ISSUES, THEN REMOVE DUPE currency_sp VALUES
salary_data <- salary_data |>
  mutate(currency_sp = toupper(currency_sp)) |> 
  mutate(currency = case_when(
    !is.na(currency_sp) & currency == "Other" ~ currency_sp, 
    .default = currency)) |> 
  mutate(currency = case_when(
    currency_sp == "KWD" & currency == "USD" ~ "KWD", 
    currency_sp == "COP" & currency == "USD" ~ "COP", 
    currency_sp == "AUD" & currency == "AUD/NZD" ~ "AUD", 
    currency_sp == "NZD" & currency == "AUD/NZD" ~ "NZD", 
    currency_sp == "NOK" & currency == "EUR" ~ "NOK", 
    currency_sp == "NOK" & currency == "GBP" ~ "NOK", 
    currency_sp == "DKK" & currency == "USD" ~ "DKK", 
    .default = currency)) |> 
  mutate(currency_sp = case_when( 
    currency_sp == "PKR" & currency == "USD" ~ "USD", 
    .default = currency_sp)) |> 
  mutate(currency_sp = case_when(
    currency_sp == currency ~ NA, 
    .default = currency_sp))

##DROP CURRENCY_SP
salary_data <- salary_data |> 
  select(-currency_sp)

##USE COUNTRY INFO TO SEPARATE AUD FROM NZD
salary_data <- salary_data |> 
  mutate(currency = case_when(
    currency == "AUD/NZD" & country == "AUSTRALIA" ~ "AUD", 
    currency == "AUD/NZD" & country == "NEW ZEALAND" ~ "NZD", 
    .default = currency))

##MANUALLY TRANSFORM OUTSTANDING AUD/NZD AND OTHER VALUES BASED ON LOCATION, AND RECODE CHINA/TAIWAN CURRENCIES TO MATCH RATE CODES USED BELOW
salary_data <- salary_data |>
  mutate(currency = case_when(
    key == 17427 ~ "CAD",
    key == 25511 ~ "USD", 
    key == 25549 ~ "ARS", 
    currency == "RMB" ~ "CNY", 
    currency == "NTD" ~ "TWD", 
    country == "UNITED STATES" & currency == "Other" ~ "USD", 
    country == "MALAYSIA" & currency == "Other" ~ "MYR", 
    country == "INDIA" & currency == "Other" ~ "INR", 
    .default = currency))

##CURRENCY CONVERSION
##VALUES TO CAD BASED ON TIMESTAMP AND CONVERSION RATES AT TIME OF COLLECTION FROM THE TABLES BELOW
##EXCHANGE RATES: https://fiscaldata.treasury.gov/datasets/treasury-reporting-rates-exchange/treasury-reporting-rates-of-exchange
##CONVERSION FROM COUNTRY TO ISO 4217 CODE: https://www.six-group.com/en/products-services/financial-information/market-reference-data/data-standards.html
exchange_rates <- read_csv('RprtRateXchgCln_20201001_20250930.csv', 
                           col_names = c('record_date', 'country', 'country_currency', 'rate', 'effective_date'), 
                           col_types = cols(record_date = col_character(), 
                                            country = col_character(), 
                                            country_currency = col_character(), 
                                            rate = col_character(), 
                                            effective_date = col_character(), 
                                            .default = col_character()))
iso_4217_codes <- read_excel('list-one.xls', col_names = c("temp1", "temp2", "temp3", "temp4", "temp5", "temp6"))

##PREPARE ISO 4217 DATA
iso_colnames <- iso_4217_codes |> 
  slice(4) |> 
  str_replace(" ", "_") |> 
  as.character() |> 
  tolower()

iso_4217_codes <- iso_4217_codes |> 
  rename_with(~ iso_colnames, everything()) |> 
  filter(row_number() > 3) |> 
  filter(is.na(fund)) |> 
  mutate(i_key = str_c(toupper(entity), toupper(str_split_i(currency, " ", -1)), sep = "-")) |> 
  relocate(i_key)

#PREPARE EXCHANGE RATE DATA
exchange_rates <- exchange_rates |> 
  mutate(e_key = str_c(toupper(country_currency))) |> 
  relocate(e_key) |> 
  slice(-1) |> 
  mutate(e_key = case_when(
    country_currency == "Euro Zone-Euro" ~ "EUROPEAN UNION-EURO", 
    country_currency == "United Kingdom-Pound" ~ "UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND (THE)-STERLING", 
    country_currency == "Trinidad & Tobago-Dollar" ~ "TRINIDAD AND TOBAGO-DOLLAR", 
    country_currency == "Czech Republic-Koruna" ~ "CZECHIA-KORUNA", 
    country_currency == "Philippines-Peso" ~ "PHILIPPINES (THE)-PESO", 
    country_currency == "Turkey-New Lira" ~ "TÜRKİYE-LIRA", 
    country_currency == "Israel-Shekel" ~ "ISRAEL-SHEQEL", 
    country_currency == "China-Renminbi" ~ "CHINA-RENMINBI", 
    country_currency == "Taiwan-Dollar" ~ "TAIWAN (PROVINCE OF CHINA)-DOLLAR",
    country_currency == "Korea-Won" ~ "KOREA (THE REPUBLIC OF)-WON", 
    .default = e_key)) |> 
  left_join(select(iso_4217_codes, i_key, alphabetic_code), by = c("e_key" = "i_key")) |> 
  mutate(alphabetic_code = case_when(
    e_key == "CROATIA-KUNA" ~ "HRK", 
    .default = alphabetic_code))

##CALCULATE EXCHANGE RATE BASED ON CLOSEST HISTORICAL QUARTER-END RATE
##NOTE: A MONTH THAT ROUNDS TO 0 (JANUARY) IS CLOSEST TO DEC 31 OF THE PRIOR YEAR, SO THE YEAR IS DECREMENTED IN THAT CASE
salary_data <- salary_data |> 
  mutate(rate_month = round(month(timestamp) / 3) * 3) |> 
  mutate(exchange_rate_date = as.character(ymd(str_c(
    as.character(year(timestamp) - (rate_month == 0)), 
    case_when(rate_month == 0 ~ "12", .default = as.character(rate_month)), 
    case_when(rate_month %in% c(6, 9) ~ "30", .default = "31"), 
    sep = "-")))) |> 
  select(-rate_month) |> 
  mutate(temp_cad = "CAD") |> 
  left_join(distinct(select(exchange_rates, alphabetic_code, effective_date, rate)), by = c("temp_cad" = "alphabetic_code", "exchange_rate_date" = "effective_date")) |> 
  rename(cad_rate = rate) |>
  select(-temp_cad) |> 
  left_join(distinct(select(exchange_rates, alphabetic_code, effective_date, rate)), by = c("currency" = "alphabetic_code", "exchange_rate_date" = "effective_date")) |> 
  rename(currency_rate = rate)

##REMOVE TEMPORARY DFS
rm(iso_4217_codes, iso_colnames, exchange_rates)

##CONVERT RATES IN SALARY_DATA
salary_data <- salary_data |> 
  mutate(salary_cad = round((as.numeric(cad_rate) / as.numeric(currency_rate)) * as.numeric(salary), digits = 2)) |> 
  mutate(addtl_compensation_cad = round((as.numeric(cad_rate) / as.numeric(currency_rate)) * as.numeric(addtl_compensation), digits = 2)) |>
  mutate(total_compensation_cad = salary_cad + addtl_compensation_cad) |> 
  relocate(salary_cad, addtl_compensation_cad, total_compensation_cad, exchange_rate_date, cad_rate, currency_rate, .after = addtl_compensation)

##DROP OBSERVATIONS WITH UNREALISTICALLY LOW COMPENSATION
##IN 2021 CANADA SET MINIMUM WAGE TO $15/HR. THE ANNUALIZED EQUIVALENT (AS REQUESTED IN THE ORIGINAL SURVEY) IS 31,200
##REDUCED BY 25% TO ALLOW FOR MINIMUM WAGE DIFFERENCES GLOBALLY: 23,400 CUTOFF
salary_data <- salary_data |> 
  filter(total_compensation_cad > 23400)

Data Cleaning - Wrap-up

Breaking State out Horizontally

In accordance with tidy data principles, I split the “select all that apply” state column into 11 separate columns, enough to hold every state selected in the longest entry.

Defining a Few Format Rules for Consolidation of City Values

City had a broad range of values, so I implemented a few key formatting rules to clean and consolidate the values, while preserving the diversity of the data.

  1. Removed any non-city location references
  2. Recoded any cities with fewer than 5 occurrences to Other
  3. Standardized obvious non-standard values for the remaining cities

Defining Education and Gender as Factors

This was a straightforward re-typing, except for a single value in gender that required a manual recode first.

Creating Boolean Columns for Each Category in Ethnicity

Since the “select all that apply” ethnicity question has a smaller, fixed set of options, I took a different approach from state and created a separate Boolean column for each category. I assigned TRUE or FALSE to each cell based on what had been selected in each particular observation. It is worth noting that the “Prefer not to reply” option was bundled with the “Another option not listed” option during data collection, so I won’t be able to distinguish between those two categories.

Code
##STATE
##SPLIT THE MULTI-SELECT STATE COLUMN INTO 11 COLUMNS (ONE PER SELECTION IN THE LONGEST ENTRY);
##SHORTER ENTRIES ARE PADDED WITH NA AND THE ORIGINAL COLUMN IS DROPPED
salary_data <- salary_data |> 
  mutate(state = toupper(state)) |> 
  separate_wider_delim(state, delim = ", ", names = str_c("state_", 1:11), too_few = "align_start")

##CITY
salary_data <- salary_data |> 
  mutate(city = toupper(city)) |> 
  mutate(city = str_remove(city, "\\,\\s*.*?$")) |> 
  mutate(city = str_remove(city, "\\(\\s*.*?$")) |> 
  mutate(city = str_remove(city, "GREATER")) |> 
  mutate(city = str_remove(city, "METRO")) |> 
  mutate(city = str_remove(city, "AREA")) |> 
  mutate(city = str_remove(city, "SUBURBAN")) |> 
  mutate(city = str_remove(city, "SUBURBS")) |> 
  mutate(city = str_remove(city, "SUBURB")) |> 
  mutate(city = str_replace_all(city, "\\bST\\s", "ST. ")) |> ##WORD BOUNDARY SO E.G. "WEST HOLLYWOOD" IS NOT MANGLED
  mutate(city = str_replace_all(city, "\\bSAINT\\s", "ST. ")) |> 
  mutate(city = str_squish(city)) ##ALSO COLLAPSES DOUBLE SPACES LEFT BY THE REMOVALS ABOVE

city_count <- salary_data |> 
  count(city) |> 
  rename(city_num = n)

salary_data <- salary_data |> 
  left_join(city_count, by = "city") |> 
  mutate(city = case_when(
    city_num < 5 ~ "OTHER",
    country == city ~ "OTHER",
    city == "HOME" ~ "OTHER",
    city == "WORK FROM HOME" ~ "OTHER",
    city == "PREFER NOT TO SAY" ~ "OTHER",
    city == "PREFER NOT TO ANSWER" ~ "OTHER",
    city == "NONE" ~ "OTHER",
    city == "NA" ~ "OTHER",
    city == "N/A" ~ "OTHER",
    city == "UK" ~ "OTHER",
    city == "ONTARIO" ~ "OTHER",
    city == "SMALL TOWN" ~ "OTHER",
    is.na(city) ~ "OTHER",
    city == "-" ~ "OTHER",
    city == "NY" ~ "NEW YORK",
    city == "NYC" ~ "NEW YORK",
    city == "LA" ~ "LOS ANGELES",
    city == "DC" ~ "WASHINGTON D.C.",
    city == "SF" ~ "SAN FRANCISCO",
    city == "D.C." ~ "WASHINGTON D.C.",
    city == "WASHINGTON DC" ~ "WASHINGTON D.C.",
    city == "DISTRICT OF COLUMBIA" ~ "WASHINGTON D.C.",
    city == "VARIOUS" ~ "OTHER",
    str_detect(city, "REMOTE") ~ "OTHER",
    str_detect(city, "WFH") ~ "OTHER",
    .default = city)) |>
  select(-city_num)

rm(city_count)

##EDUCATION
education_levels <- c("High School", "Some college", "College degree", "Master's degree", "PhD", "Professional degree (MD, JD, etc.)") 

salary_data <- salary_data |> 
  mutate(education = factor(education, levels = education_levels))

#GENDER
gender_levels <- c("Man", "Woman", "Non-binary", "Other or prefer not to answer")

salary_data <- salary_data |> 
  mutate(gender = case_when(
    gender == "Prefer not to answer" ~ "Other or prefer not to answer", 
    .default = gender)) |> 
  mutate(gender = factor(gender, levels = gender_levels))

#ETHNICITY
salary_data <- salary_data |> 
  mutate(eth_white = case_when(
    str_detect(ethnicity, "White") ~ TRUE, 
    .default = FALSE)) |> 
  mutate(eth_m_east_n_af = case_when(
    str_detect(ethnicity, "African") ~ TRUE, 
    .default = FALSE)) |> 
  mutate(eth_asian = case_when(
    str_detect(ethnicity, "Asian") ~ TRUE, 
    .default = FALSE)) |> 
  mutate(eth_hispanic = case_when(
    str_detect(ethnicity, "Latino") ~ TRUE, 
    .default = FALSE)) |> 
  mutate(eth_other_ref = case_when(
    str_detect(ethnicity, "prefer") ~ TRUE, 
    .default = FALSE)) |> 
  mutate(eth_native_am_ak = case_when(
    str_detect(ethnicity, "Alaska") ~ TRUE, 
    .default = FALSE))

Defining Education Level as an Average Number of Years In Higher Education

As a final step, I transformed education level into an estimated number of years spent on higher education5.

Code
##REFACTORING EDUCATION LEVELS
#education_levels <- c("High School", "Some college", "College degree", "Master's degree", "PhD", "Professional degree (MD, JD, etc.)") 
salary_data <- salary_data |> 
  mutate(higher_education_years = case_when(
    education == "High School" ~ 0, 
    education == "Some college" ~ 2, 
    education == "College degree" ~ 4, 
    education == "Master's degree" ~ 5.5, 
    education == "PhD" ~ 10, 
    education == "Professional degree (MD, JD, etc.)" ~ 10, 
    .default = NA))

Analysis and Visualization

The relationship between employment and compensation is often a singularly powerful driver in how people live their lives, yet we sometimes have a poor understanding of the forces affecting that relationship. Analyzing some of the factors affecting compensation levels will enable workers to make better-informed choices about their employment, and it will allow employers to contextualize their approach to compensation, leading to better employee recruitment and retention.

The search for employment often starts with a decision about where to look for work. Understanding how median compensation levels in a particular industry compare to other industries can help calibrate worker expectations and provide context for long term career prospects.

Code
##CREATE DF FOR USE IN PLOT
compensation_by_industry <- salary_data |> 
  select(industry, salary_cad, addtl_compensation_cad, total_compensation_cad) |> 
  drop_na()

##CALCULATE MEDIAN TOTAL COMPENSATION
total_comp_cad_median <- compensation_by_industry |> 
  summarise(median_value = median(total_compensation_cad))

##BOXPLOT BY INDUSTRY INCLUDING AN OVERALL MEDIAN LINE
ggplot(
  data = compensation_by_industry, 
  mapping = aes(
    x = fct_reorder(industry, total_compensation_cad, .fun = median), 
    y = total_compensation_cad)) + 
  geom_boxplot(
    fill = "green4", 
    alpha = .5, 
    outlier.alpha = 1, 
    #outlier.color = "green4", 
    outlier.shape = 1) + 
  coord_flip(ylim = c(30000, 450000)) + ##NOTE - REF COMP AS Y, IND AS X
  scale_y_continuous(breaks = seq(50000, 450000, by = 50000), 
                     label = c("50K", "100K", "150K", "200K", "250K", "300K", "350K", "400K", "450K")) +
  geom_hline(yintercept = total_comp_cad_median$median_value, color = "firebrick2", linetype = 2, linewidth = .8) + 
  # stat_summary(fun = mean, geom = "errorbar", aes(ymax = after_stat(y), ymin = after_stat(y)),
  #              width = .75, color = "blue") + 
  labs(
    x = "",
    y = "Total Compensation (CAD)",
    title = "Where the money is",
    subtitle = "A look at median compensation levels across industries",
    caption = "\nData from the ask a manager data set, collected between April 2021 and September 2024.\nhttps://www.askamanager.org/2024/05/data-from-13000-peoples-real-life-salaries.html\nCompensation values not adjusted for inflation over time."
  ) + 
  theme_minimal() + 
  theme(panel.grid.minor = element_blank(), 
        panel.grid.major.y = element_blank(), 
        plot.caption = element_text(size = 7, hjust = 0))

Technical and expertise-based industries show median compensation levels well above the overall median, while industries in social sectors and those requiring less specialized expertise fall significantly below it.

But variance in total compensation levels across industries is only part of the picture. Workers must also consider what factors will contribute to continued growth throughout their career. Is it necessary to get a degree? Is there any benefit to staying in a particular field over time? Will I earn more as I get older? An assessment of how changes in age, education, and experience impact overall compensation might help provide some answers.

Code
##NOTE: I CONSULTED WITH GOOGLE GEMINI ON HOW BEST TO APPROACH CAPTURING AND PLOTTING THE LINEAR REGRESSION COEFFICIENTS AND OTHER LM DATA

##MULTIPLE LINEAR REGRESSION TO ISOLATE THE IMPACT OF EACH FACTOR
salary_age_exp_edu_mlr <- lm(total_compensation_cad ~ age + work_experience_field + education, data = salary_data)

##ISOLATE KEY DATA FROM MLR FOR PLOTTING
mlr_rows <- rownames(summary(salary_age_exp_edu_mlr)$coefficients)
salary_age_exp_edu_coef <- data.frame(summary(salary_age_exp_edu_mlr)$coefficients)
names(salary_age_exp_edu_coef) = c("estimate", "std_error","t", "p")

##CONVERT TO TIBBLE, CLEAN UP, DEFINE FACTOR LEVELS, CALCULATE CONFIDENCE INTERVAL
salary_age_exp_edu_coef <- tibble(salary_age_exp_edu_coef) |> 
  mutate(factor_labels = mlr_rows) |> 
  relocate(factor_labels) |> 
  slice(-1) |> ##DROP INTERCEPT
  mutate(fctr_group = case_when(
    str_detect(factor_labels, "^age") ~ "age", 
    str_detect(factor_labels, "^work") ~ "work_experience", 
    str_detect(factor_labels, "^education") ~ "education", 
    .default = NA)) |> 
  mutate(factor_labels = str_remove_all(factor_labels, str_c(fctr_group, ".*?_field"))) |> 
  mutate(factor_labels = str_remove_all(factor_labels, str_c(fctr_group, ".*?-\\s"))) |> 
  mutate(factor_labels = str_remove_all(factor_labels, fctr_group)) |> 
  mutate(factor_labels = factor(factor_labels, levels = c(age_levels, work_experience_levels, education_levels))) |> 
  relocate(fctr_group) |> 
  mutate(fctr_group = case_when(
    str_detect(fctr_group, "age") ~ "AGE", 
    str_detect(fctr_group, "work_experience") ~ "WORK EXPERIENCE", 
    str_detect(fctr_group, "education") ~ "EDUCATION", 
    .default = NA)) |> 
  mutate(plus_ci = estimate + (std_error * 1.96)) |> 
  mutate(minus_ci = estimate - (std_error * 1.96 )) |> 
  group_by(fctr_group)

rm(mlr_rows, salary_age_exp_edu_mlr)

##PLOT MLR COEFFICIENTS VS AGE / EXP / EDUC, RIBBON IS CONFIDENCE INTERVAL BASED ON MLR S.E. AND P < .05
ggplot(
  data = salary_age_exp_edu_coef,  
  mapping = aes(
    x = factor_labels, 
    y = estimate, 
    group = 1)) + 
  geom_line() + 
  geom_ribbon(aes(ymin = minus_ci, ymax = plus_ci), fill = "pink", alpha = .2) + 
  facet_wrap(~ fctr_group, scales = "free") + 
  labs(
  x = "", 
  y = "Impact on Compensation",
  title = "Making an impact",
  subtitle = "Taken independently, how much do differences in key factors account for compensation levels?",
  caption = "\nImpact represents multiple linear regression coefficients. Bands represent 95% confidence intervals.\n\nData from the ask a manager data set, collected between April 2021 and September 2024.\nhttps://www.askamanager.org/2024/05/data-from-13000-peoples-real-life-salaries.html\nCompensation values not adjusted for inflation over time.") + 
  theme_minimal() + 
  theme(panel.grid.minor = element_blank(), 
        axis.text = element_blank(), 
        panel.grid.major = element_blank(), 
        plot.caption = element_text(size = 7, hjust = 0), 
        panel.border = element_rect(color = "gray80", fill = NA, linewidth = 0.5))

By using multiple linear regression to isolate the impacts of age, education, and work experience on overall compensation, it becomes clear that age does very little to drive compensation levels. Education and work experience, however, show a strong positive relationship with overall compensation. The pink bands represent 95% confidence intervals at each point.
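As a quick illustration of how those bands behave, the estimate ± 1.96 × standard error construction used in the plotting code closely matches R's built-in confint() when the sample is large. A toy example on simulated data (not the survey data):

```r
## TOY MODEL: COMPARE MANUAL NORMAL-APPROXIMATION CI TO confint()
set.seed(1)
toy <- data.frame(x = rnorm(500))
toy$y <- 2 * toy$x + rnorm(500)   # true slope = 2

m   <- lm(y ~ x, data = toy)
est <- coef(m)["x"]
se  <- summary(m)$coefficients["x", "Std. Error"]

manual_ci <- c(est - 1.96 * se, est + 1.96 * se)  # normal-approximation band
confint(m, "x", level = 0.95)                     # t-based interval; nearly identical here
```

With 500 observations the t quantile (≈1.965 at 498 degrees of freedom) is close enough to 1.96 that the two intervals agree to a few decimal places.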

This makes it clear that increases in education and work experience can have a powerful impact on overall compensation, while age seems to have virtually none. But a potential worker choosing to go to school would be right to wonder whether all education is equal. An advanced degree may not be required in all industries, and even where an industry does have higher average education levels, will compensation scale accordingly? It is worth examining average education levels across industries to understand the payoff for those years of education.

Code
avg_education_by_industry <- salary_data |>
  select(industry, total_compensation_cad, higher_education_years) |> 
  drop_na() |> 
  group_by(industry) |>
  summarise(mean_ed_years = mean(higher_education_years), sd_ed_years = sd(higher_education_years), 
            median_comp_cad = median(total_compensation_cad), median_comp_per_yr_ed = median(total_compensation_cad / higher_education_years), .groups = "drop") |>
  arrange(median_comp_per_yr_ed) |>
  mutate(industry = factor(industry, levels = industry))

ggplot(
  data = avg_education_by_industry, 
  mapping = aes(
    x = industry, 
    y = mean_ed_years, 
    fill = median_comp_per_yr_ed)) + 
  geom_col(
    #alpha = .7
  ) + 
  geom_errorbar(aes(ymin = mean_ed_years - sd_ed_years, ymax = mean_ed_years + sd_ed_years), color = "gray40") + 
  scale_fill_gradient(low = "green4", high = "gold2") + 

  coord_flip() + 
  scale_y_continuous(breaks = c(2, 4, 5.5, 10)) +  
  labs(
  x = "",
  y = "Average Years of Education",
  title = "Time is money...or is it?",
  subtitle = "An examination of compensation as it relates\nto higher education levels across industries",
  caption = "\nData from the ask a manager data set, collected between April 2021 and September 2024.\nhttps://www.askamanager.org/2024/05/data-from-13000-peoples-real-life-salaries.html\nCompensation values not adjusted for inflation over time.", 
  fill = "Median compensation\nper year of higher ed (CAD)"
   ) + 
  theme_minimal() + 
  theme(panel.grid.minor = element_blank(), 
        panel.grid.major.x = element_line(color = "gray80"), 
        panel.grid.major.y = element_blank(), 
        plot.caption = element_text(size = 7, hjust = 0), 
        legend.title = element_text(size = 8))

Compensation does not seem to scale evenly as education increases. The decline in median compensation per year of higher education among highly educated workers suggests that education can reach a point of diminishing returns.

While it seems that not all education is equal in terms of how much median compensation increases per year of schooling, the overall positive impact of education on compensation is clear. But education was only one of the factors that showed a strong relationship with compensation. The other was experience, and it merits a closer look as well. One approach is to define some key milestones in career progression and see whether compensation differs for workers who have achieved them.

Code
salary_by_title <- salary_data |>
  select(total_compensation_cad, title_management, title_senior) |> 
  drop_na() |> 
  group_by(title_management, title_senior) |>
  summarise(
    q1_compensation_cad = quantile(total_compensation_cad, probs = .25), 
    median_compensation_cad = quantile(total_compensation_cad, probs = .5), 
    q3_compensation_cad = quantile(total_compensation_cad, probs = .75), 
    .groups = "drop")

ggplot(data = salary_by_title, 
       aes(x = title_management, y = title_senior, fill = median_compensation_cad)) + 
  geom_tile() + 
  geom_text(aes(x = title_management + 1, y = title_senior + 1, label = str_c(as.character(round(q1_compensation_cad, -3) %/% 1000), "K - ", as.character(round(q3_compensation_cad, -3)  %/% 1000), "K"))) + 
  scale_fill_gradient(low = "darkseagreen1", high = "green3") + 
  scale_x_discrete(labels = c("NON-MANAGER", "MANAGER")) + 
  scale_y_discrete(labels = c("NON-SENIOR", "SENIOR")) + 
  labs(
  x = "", 
  y = "",
  title = "What's in a name?",
  subtitle = "As workers climb the career ladder, their median compensation climbs as well",
  caption = "\nCompensation ranges represent 25th and 75th percentiles.\n\nData from the ask a manager data set, collected between April 2021 and September 2024.\nhttps://www.askamanager.org/2024/05/data-from-13000-peoples-real-life-salaries.html\nCompensation values not adjusted for inflation over time.", 
  fill = "Median compensation (CAD)") + 
  theme_minimal() + 
  theme(panel.grid.minor = element_blank(), 
        panel.grid.major = element_blank(), 
        plot.caption = element_text(size = 7, hjust = 0))

A keyword analysis of worker titles across industries reveals two equivalent paths to the highest levels of pay. Career progression through the seniority tier is virtually identical to progression through the management tier, and workers who can bridge the two tiers effectively are the most well compensated.
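The title_management and title_senior flags are constructed earlier in the document; as a minimal sketch of the keyword-flag idea, something like the following would work, though the keyword lists here are hypothetical and the actual lists used may differ:

```r
## HYPOTHETICAL SKETCH: DERIVE BOOLEAN TITLE FLAGS FROM KEYWORDS (ILLUSTRATIVE KEYWORD LISTS)
library(stringr)

titles <- c("Senior Data Analyst", "Engineering Manager", "Junior Developer")

title_senior     <- str_detect(str_to_upper(titles), "SENIOR|LEAD|PRINCIPAL")
title_management <- str_detect(str_to_upper(titles), "MANAGER|DIRECTOR|HEAD")
```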

This analysis has generated a better understanding of compensation across industries, uncovered key factors that have a positive impact on overall compensation levels, and examined those factors in more depth. With this insight, employees and employers alike have a set of tools to better understand and make decisions about employment goals, compensation, and career progression.

Code
##NOTE - DURING WRAP-UP OF THE ASSIGNMENT, I FELT LIKE THIS PLOT DIDN'T ADD ENOUGH TO MERIT INCLUSION, SO I DROPPED IT.  IT WORKS, JUST UNCOMMENT AND RELOCATE FIG.CAP TO VIEW

# fig.cap = "A small scale case study of workers in the education industry in Canada provides some insight into how overall compensation can vary by location.  Due to the small sample sizes associated with this case study, it is unlikely that it portrays any relationships of statistical significance."}
# #| message: false
# 
# top_education_salary_by_city_ca<- salary_data |>
#   select(industry, total_compensation_cad, country, city) |> 
#   filter(country == "CANADA") |> 
#   group_by(city) |> 
#   mutate(median_comp_by_city_ca = median(total_compensation_cad)) |> 
#   filter(industry == "EDUCATION") |> 
#   filter(n() > 5) |> 
#   ungroup() |> 
#   drop_na()
# 
# ggplot(data = top_education_salary_by_city_ca, aes(
#   x = fct_reorder(city, total_compensation_cad, .fun = median), 
#   y = total_compensation_cad)) + 
#   geom_boxplot(
#     outlier.shape = NA, 
#     fill = "red3", 
#     alpha = .5) +
#   geom_point(aes(
#     x = city, 
#     y = median_comp_by_city_ca), 
#     shape = 23, fill = "white", size = 3) + 
#   scale_y_continuous(breaks = seq(25000, 150000, by = 25000), 
#                      label = c("25K", "50K", "75K", "100K", "125K", "150K")) + 
#   geom_label(x = 2.25, y = 138000, label = "\U25C7 - Overall median compensation", size = 3.1, label.size = NA) + 
#   labs(
#   x = "",
#   y = "Median total compensation (CAD)",
#   title = "How do Canadian cities* compensate their educators?",
#   subtitle = "A case study comparing median compensation in the education sector to overall median compensation",
#   caption = "\n* - For cities with more than 5 data points\n\nData from the ask a manager data set, collected between April 2021 and September 2024.\nhttps://www.askamanager.org/2024/05/data-from-13000-peoples-real-life-salaries.html\nCompensation values not adjusted for inflation over time.", 
#   fill = "Median compensation\nper year of higher ed (CAD)"
#    ) + 
#   theme_minimal() + 
#   theme(panel.grid.minor = element_blank(), 
#         panel.grid.major.y = element_line(color = "gray85"), 
#         panel.grid.major.x = element_blank(), 
#         plot.caption = element_text(size = 7, hjust = 0), 
#         legend.title = element_text(size = 8), 
#         axis.text.x = element_text(size = 8))
# 

Footnotes

  1. https://ca.indeed.com/career-advice/career-development/job-level↩︎

  2. https://fiscaldata.treasury.gov/datasets/treasury-reporting-rates-exchange/treasury-reporting-rates-of-exchange↩︎

  3. https://www.six-group.com/en/products-services/financial-information/market-reference-data/data-standards.html↩︎

  4. https://www.peninsulacanada.ca/resource-hub/minimum-wage/ontario-minimum-wage-history/↩︎

  5. https://www.canada.ca/en/immigration-refugees-citizenship/services/settle-canada/education/school-types/post-secondary.html↩︎