ML 匹配同一列内的字符串数据 - R

问题描述 投票:0回答:1

我有一个个人工作数据集以及某些职业的薪资信息,并且我正在尝试创建一个通过模糊匹配标准化工作名称的子集。具体来说,月薪为 4000 美元的“成本会计师”职位和月薪为 5000 美元的“财务会计师”将在名为“会计师”的新列下匹配,该列计算具有相似名称的职位的平均值。

这是迄今为止我的代码: #上传包

library(stringr)
library(dplyr)
# Print data example with specific columns
dput(job_posts[1:20,c(4,27)])

输出:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless", "Accountant", 
"Calls Admin", "Financial Accountant", "Sales Representative", 
"Procurement Assistant", "Water Quality Analyst", "Resident Engineer", 
"Cost Accountant", "Product Specilaist-2", "Operations Coordinator"
), monthly_income = c(NA, 8500, NA, 20000, 15000, NA, 3500, NA, 
NA, 4000, NA, 500, NA, 5000, NA, 8500, 20000, 9000, 4100, 4500)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

我已按照此处的说明进行操作,这给了我一个良好的开端,因为它标记了已匹配的其他行/观察结果,但我无法按照我在前面的示例中解释的那样标准化职位名称。

# fuzzy matching for job titles, so that similar jobs are stored in one df
job_posts$matched <- sapply(job_posts$jobtitle,agrep,job_posts$jobtitle)
# Print data example with specific columns
dput(job_posts[1:10,c(4,27,28)])

输出:

structure(list(jobtitle = c("PE Teacher", "Accountant", 
"Dewatering Supervisor", "sales account manager", "Sales Lead", 
"Assistant Housekeeping Manager", "Quality Manager", "Approval Officer", 
"Logistics", "Systems Engineer - Networking/Wireless"), monthly_income = c(NA, 
8500, NA, 20000, 15000, NA, NA, NA, NA, NA), matched = list(`PE Teacher` = c(1L, 
1111L), `Accountant` = 2L, 
    `Dewatering Supervisor` = 3L, `sales account manager` = c(4L, 
    1242L, 1309L, 1524L, 1783L), `Sales Lead` = c(5L, 1984L), 
    `Assistant Housekeeping Manager` = 6L, `Quality Manager` = c(7L, 
    196L, 650L, 1856L, 2330L), `Approval Officer` = 8L, Logistics = c(9L, 
    71L, 129L, 176L, 362L, 444L, 446L, 587L, 655L, 935L, 1413L, 
    1508L, 1835L, 2176L, 2300L, 2370L, 2657L, 2685L, 2770L), 
    `Systems Engineer - Networking/Wireless` = 10L)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

当前的 df 如下所示:

jobtitle                 avg_wage
Financial Accountant     $5000   
Cost Accountant          $4000
Retail Accountant        $4000

期望的结果如下,平均工资基于所有会计工资的平均值,而不是“成本会计师”或“财务会计师”,所有会计工作都类似于“会计师”

jobtitle       avg_wage
Accountant     $4333  
dataframe machine-learning dplyr pattern-matching match
1个回答
0
投票

我想这就是你想要的?虽然我不完全确定:

library(tidyverse)

# the same as the smallest example dataframe you gave, with an extra irrelevant row for demonstration
data <- data.frame(
  jobtitle = c("Financial Accountant", "Cost Accountant", "Retail Accountant", "Instagram Influencer"),
  avg_wage = c("$5000", "$4000", "$4000", "$1000")
)

# same with this
job_groups <- c("Accountant", "Butcher", "Baker", "Candlestick Maker")

# basically what's happening here is we're looking for the job group in each job title, removing NA values, then if there's no job group in the title, we're returning NA, else returning the job title(s)
mutate(data, grp = map_chr(jobtitle, ~ str_extract(.x, job_groups) %>% {.[!is.na(.)]} %>% if (length(.) == 0) NA_character_ else .))

输出:

              jobtitle avg_wage        grp
1 Financial Accountant    $5000 Accountant
2      Cost Accountant    $4000 Accountant
3    Retail Accountant    $4000 Accountant
4 Instagram Influencer    $1000       <NA>
© www.soinside.com 2019 - 2024. All rights reserved.