Dear friends, my data frame is as follows:
raw_data <- data.frame("id" = 1:5,
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
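If salary is greater than 15000, I flag it as an outlier; if expenditure is greater than 10000, I flag it as an outlier as well. The question is how to count the number of outliers recorded for a particular id.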
Here is a dplyr solution:
library(dplyr)
raw_data %>%
  # 1 if the value exceeds its threshold, 0 otherwise
  mutate(salary_flag = ifelse(salary > 15000, 1, 0),
         expenditure_flag = ifelse(expenditure > 10000, 1, 0)) %>%
  group_by(id) %>%
  mutate(total_outlier = sum(salary_flag) + sum(expenditure_flag))
You flag salary and expenditure, then group by id, and for each id add the sum of all salary_flag values to the sum of all expenditure_flag values.
     id salary expenditure salary_flag expenditure_flag total_outlier
  <int>  <dbl>       <dbl>       <dbl>            <dbl>         <dbl>
1     1  10000       10000           0                0             0
2     2  15000       15000           0                1             1
3     3  20000       20000           1                1             2
4     4  40000       30000           1                1             2
5     5  50000       40000           1                1             2
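Since every id occurs only once in the sample data, the grouping step has no visible effect there. A minimal sketch of the same idea on data where ids repeat, collapsing to one row per id with summarise; the data frame example_data and the column names salary_outliers and expenditure_outliers are made up for illustration and are not from the question:

```r
library(dplyr)

# Hypothetical data with repeated ids (an assumption, not the asker's data)
example_data <- data.frame(
  id = c(1, 1, 1, 2, 2),
  salary = c(10000, 15000, 20000, 40000, 50000),
  expenditure = c(10000, 15000, 20000, 30000, 40000)
)

outlier_counts <- example_data %>%
  group_by(id) %>%
  summarise(
    # summing a logical vector counts the TRUE values
    salary_outliers = sum(salary > 15000),
    expenditure_outliers = sum(expenditure > 10000),
    total_outlier = salary_outliers + expenditure_outliers
  )

print(outlier_counts)
```

Using summarise instead of mutate returns one row per id, which may be closer to a "count per id" than repeating the total on every row.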
If you only care about the total number of outliers, @MartinGal offered a very nice alternative:
raw_data %>%
group_by(id) %>%
mutate(total_outlier = sum(salary>15000, expenditure>10000))
which gives us:
     id salary expenditure total_outlier
  <int>  <dbl>       <dbl>         <int>
1     1  10000       10000             0
2     2  15000       15000             1
3     3  20000       20000             2
4     4  40000       30000             2
5     5  50000       40000             2
Edit:
This seems to give the final result you want:
raw_data %>%
group_by(id) %>%
summarise(count = sum(salary>15000, expenditure>10000),
value = min(salary)) %>%
mutate(title = "salary") %>%
select(id, title, value, count)
which gives you:
     id title  value count
  <int> <chr>  <dbl> <int>
1     1 salary 10000     0
2     2 salary 15000     1
3     3 salary 20000     2
4     4 salary 40000     2
5     5 salary 50000     2
In data.table, it would look like
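library(data.table)
setDT(raw_data)  # convert the data.frame to a data.table by reference so := works
raw_data[, flag0 := (salary > 15000) + (expenditure > 10000)]
raw_data[, flag := sum(flag0), by = "id"]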
Here flag0 is the row-wise flag (you can drop it later if you don't need it), and flag is the final result.
Edit: Seeing your reply to @Matt, it looks like you want separate totals for salary and expenditure. You could do something like
raw_data[, flag_salary := as.integer(salary > 15000)]
raw_data[, flag_expenditure := as.integer(expenditure > 10000)]
raw_data[, flag_salary := sum(flag_salary), by = "id"]
raw_data[, flag_expenditure := sum(flag_expenditure), by = "id"]
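The := calls above keep every original row and repeat the per-id totals on each of them. If one row per id is preferred, a rough sketch along these lines might work; example_dt and per_id are hypothetical names, and the repeated ids are an assumption made only so that the grouping does something:

```r
library(data.table)

# Hypothetical data with repeated ids (an assumption, not the asker's data)
example_dt <- data.table(
  id = c(1, 1, 1, 2, 2),
  salary = c(10000, 15000, 20000, 40000, 50000),
  expenditure = c(10000, 15000, 20000, 30000, 40000)
)

# Aggregate to one row per id; summing logicals counts the TRUE values
per_id <- example_dt[, .(
  flag_salary = sum(salary > 15000),
  flag_expenditure = sum(expenditure > 10000)
), by = id]

# Combined count per id, added by reference
per_id[, flag := flag_salary + flag_expenditure]

print(per_id)
```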
You can try the following:
raw_data <- data.frame("id" = 1:5,
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
raw_data$SaleryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
Then you can use the aggregate function to summarise the data, for example with FUN=sum for each id. It should look like
aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)
This works because TRUE is treated as 1 (and FALSE as 0) when summed.
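To illustrate the coercion, and as a possible variant that sums only the two flag columns rather than every column (including id itself), here is a small base R sketch using the formula interface of aggregate. The data frame example_data with repeated ids is an assumption for illustration; the flag column names are the ones used above:

```r
# Logical values are coerced to 1/0 by sum()
n_true <- sum(c(TRUE, FALSE, TRUE))

# Hypothetical data with repeated ids (an assumption, not the asker's data)
example_data <- data.frame(
  id = c(1, 1, 1, 2, 2),
  salary = c(10000, 15000, 20000, 40000, 50000),
  expenditure = c(10000, 15000, 20000, 30000, 40000)
)
example_data$SaleryOutlier <- example_data$salary > 15000
example_data$ExpenditureOutlier <- example_data$expenditure > 10000

# Sum only the flag columns within each id
per_id <- aggregate(cbind(SaleryOutlier, ExpenditureOutlier) ~ id,
                    data = example_data, FUN = sum)

print(per_id)
```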
I hope this helps.
Edit
Based on your comment, I think you are looking for
raw_data <- data.frame("id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
raw_data$SaleryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
raw_data_aggregate <- aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)
raw_data_aggregate$count <- raw_data_aggregate$SaleryOutlier + raw_data_aggregate$ExpenditureOutlier