Flagging outliers and counting their occurrences

Question  Votes: 1  Answers: 3

Dear friends, my data frame is as follows:

raw_data <- data.frame("id" = 1:5, 
                       "salary" = c(10000,15000,20000,40000,50000), 
                       "expenditure" = c(10000,15000,20000,30000,40000))

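If salary is greater than 15000, I flag it as an outlier, and if expenditure is greater than 10000, I flag it as an outlier as well. My question is how to count the number of outliers recorded for a specific id.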

r
3 Answers
1
vote

Here is a dplyr solution:

library(dplyr)

raw_data %>% 
  mutate(salary_flag =
           ifelse(salary > 15000, 1, 0),
         expenditure_flag = ifelse(expenditure > 10000, 1, 0)) %>% 
  group_by(id) %>% 
  mutate(total_outlier = sum(salary_flag) + sum(expenditure_flag))

This flags salary and expenditure row by row, then groups by id and, for each id, adds the sum of all salary_flag values to the sum of all expenditure_flag values.

     id salary expenditure salary_flag expenditure_flag total_outlier
  <int>  <dbl>       <dbl>       <dbl>            <dbl>         <dbl>
1     1  10000       10000           0                0             0
2     2  15000       15000           0                1             1
3     3  20000       20000           1                1             2
4     4  40000       30000           1                1             2
5     5  50000       40000           1                1             2
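With one row per id, as in the question, the grouping step makes no visible difference. As a minimal sketch of what it does once an id repeats, the example below uses a hypothetical data frame (`repeated`, with ids 1, 1, 1, 2, 2, not the asker's original data) and collapses to one row per id with `summarise`. The column names `salary_outliers` and `expenditure_outliers` are my own choice for illustration.

```r
library(dplyr)

# Hypothetical data: same values as the question, but ids repeat
repeated <- data.frame(
  id = c(1, 1, 1, 2, 2),
  salary = c(10000, 15000, 20000, 40000, 50000),
  expenditure = c(10000, 15000, 20000, 30000, 40000)
)

# One row per id: count the rows above each threshold, then add the counts
per_id <- repeated %>%
  group_by(id) %>%
  summarise(
    salary_outliers = sum(salary > 15000),
    expenditure_outliers = sum(expenditure > 10000),
    total_outlier = salary_outliers + expenditure_outliers
  )

print(per_id)
```

Here id 1 has one salary outlier and two expenditure outliers (3 in total), and id 2 has two of each (4 in total).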

If you only care about the total number of outliers, @MartinGal suggested a very nice alternative:

raw_data %>% 
  group_by(id) %>% 
  mutate(total_outlier = sum(salary > 15000, expenditure > 10000))

This gives us:

     id salary expenditure total_outlier
  <int>  <dbl>       <dbl>         <int>
1     1  10000       10000             0
2     2  15000       15000             1
3     3  20000       20000             2
4     4  40000       30000             2
5     5  50000       40000             2

Edit:

This seems to give the final result you want:

raw_data %>% 
  group_by(id) %>% 
  summarise(count = sum(salary>15000, expenditure>10000),
            value = min(salary)) %>% 
  mutate(title = "salary") %>% 
  select(id, title, value, count)

Which gives you:

     id title  value count
  <int> <chr>  <dbl> <int>
1     1 salary 10000     0
2     2 salary 15000     1
3     3 salary 20000     2
4     4 salary 40000     2
5     5 salary 50000     2

0
votes

In data.table, it would look like this:

library(data.table)
setDT(raw_data)  # convert the data.frame to a data.table by reference
raw_data[, flag0 := (salary > 15000) + (expenditure > 10000)]
raw_data[, flag := sum(flag0), by = "id"]

Here flag0 is the row-wise flag (you can drop it later if you don't need it), and flag is the final result.

Edit: Seeing your reply to @Matt, it looks like you want the totals for salary and expenditure separately. You could do something like this:

raw_data[, flag_salary := as.integer(salary > 15000)]
raw_data[, flag_expenditure := as.integer(expenditure > 10000)]
raw_data[, flag_salary := sum(flag_salary), by = "id"]
raw_data[, flag_expenditure := sum(flag_expenditure), by = "id"]
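The `:=` form above keeps every row and repeats the per-id total on each of them. If one row per id is enough, a rough sketch of the same counts as a summary table could look like the following. It uses a hypothetical table `dt` with repeated ids (1, 1, 1, 2, 2) so the grouping actually has something to add up; that data is an assumption for illustration, not the asker's.

```r
library(data.table)

# Hypothetical data with repeated ids, for illustration only
dt <- data.table(
  id = c(1, 1, 1, 2, 2),
  salary = c(10000, 15000, 20000, 40000, 50000),
  expenditure = c(10000, 15000, 20000, 30000, 40000)
)

# .() in j returns a new table with one row per id
per_id <- dt[, .(flag_salary = sum(salary > 15000),
                 flag_expenditure = sum(expenditure > 10000)),
             by = id]

print(per_id)
```

Which form to use depends on whether you still need the original rows next to the counts.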

0
votes

You can try the following:

raw_data <- data.frame("id" = 1:5, 
                       "salary" = c(10000,15000,20000,40000,50000), 
                       "expenditure" = c(10000,15000,20000,30000,40000))

raw_data$SaleryOutlier <- ifelse(
    raw_data$salary > 15000, TRUE, FALSE)

raw_data$ExpenditureOutlier <- ifelse(
    raw_data$expenditure > 10000, TRUE, FALSE)

You can then use the aggregate function to summarize the data, for example with FUN=sum for each id. It should look like this:

aggregate(raw_data[c("SaleryOutlier", "ExpenditureOutlier")], by = list(id = raw_data$id), FUN = sum)

This works because TRUE is counted as 1 (and FALSE as 0) when summed.
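As a quick illustration of that coercion in base R, using three of the salaries from the question (the variable names `flags` and `n_outliers` are just for this sketch):

```r
# Comparing a numeric vector to a threshold gives a logical vector
flags <- c(10000, 15000, 20000) > 15000
print(flags)

# Logical values become 0/1 when treated as numbers
print(as.integer(flags))

# So summing a logical vector counts the TRUE values
n_outliers <- sum(flags)
print(n_outliers)
```

Only 20000 is strictly greater than 15000, so the count is 1.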

I hope this helps.

Edit

Based on your comment, I think you are looking for this:

raw_data <- data.frame("id" = c(1, 1, 1, 2, 2), 
                       "salary" = c(10000,15000,20000,40000,50000), 
                       "expenditure" = c(10000,15000,20000,30000,40000))

raw_data$SaleryOutlier <- ifelse(
  raw_data$salary > 15000, TRUE, FALSE)

raw_data$ExpenditureOutlier <- ifelse(
  raw_data$expenditure > 10000, TRUE, FALSE)

raw_data_aggregate <- aggregate(raw_data[c("SaleryOutlier", "ExpenditureOutlier")], by = list(id = raw_data$id), FUN = sum)

raw_data_aggregate$count <- raw_data_aggregate$SaleryOutlier + raw_data_aggregate$ExpenditureOutlier