I have two data frames. dfA contains users and a given value:
dfA
User Value
[1] 1 54
[2] 2 12
[3] 3 7
[4] 4 123
[5] 5 74
dfB contains value ranges and the multiplier I want to add to dfA:
dfB
Min Max Mult
[1] 0 50 0
[2] 50 80 0.5
[3] 80 100 0.8
[4] 100 120 1
[5] 120 1000 1.2
So the desired result is to add the multiplier from dfB to dfA:
dfA
User Value Mult
[1] 1 54 0.5
[2] 2 12 0
[3] 3 7 0
[4] 4 123 1.2
[5] 5 74 0.5
I have tried this code (which works for a single value), but it does not work on the data frame:
dfA$Mult <- print(subset(dfB, dfA$Value > dfB$Min & dfA$Value < dfB$Max)$Mult)
Thanks in advance!
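One likely reason the attempt in the question fails is that > and < compare the two columns element by element (row 1 of dfA with row 1 of dfB, and so on) rather than testing each Value against every range. A minimal sketch, with dfA and dfB rebuilt from the tables in the question; hits and matched are illustrative names only:

```r
dfA <- data.frame(User = 1:5, Value = c(54, 12, 7, 123, 74))
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Element-wise comparison: row i of dfA is only tested against row i of dfB
hits <- dfA$Value > dfB$Min & dfA$Value < dfB$Max
hits

# No row passes here, so subset() returns zero rows and there is
# nothing of length 5 to assign to dfA$Mult
matched <- subset(dfB, hits)$Mult
length(matched)
```

With a single value on the left-hand side the comparison is recycled against every row of dfB, which is why the same expression appears to work for one value at a time.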
In base R, you can use sapply over each Value in dfA:
dfA$mult <- sapply(dfA$Value, function(x) with(dfB, Mult[x >= Min & x <= Max]))
dfA
# User Value mult
#1 1 54 0.5
#2 2 12 0.0
#3 3 7 0.0
#4 4 123 1.2
#5 5 74 0.5
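With inclusive comparisons on both ends (x >= Min & x <= Max), a Value that sits exactly on a boundary, such as 50, matches two rows, and sapply then returns a list rather than a numeric vector. Below is a minimal sketch of a half-open variant. It assumes ranges should be read as [Min, Max) and that unmatched values should become NA; neither is stated in the question. lookup_mult is a hypothetical helper name.

```r
dfA <- data.frame(User = 1:5, Value = c(54, 12, 7, 123, 74))
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Hypothetical helper: return the multiplier of the single matching
# range, or NA when the value matches no range (or more than one)
lookup_mult <- function(x, ranges) {
  hit <- ranges$Mult[x >= ranges$Min & x < ranges$Max]
  if (length(hit) == 1) hit else NA_real_
}

# vapply() guarantees a numeric vector with one element per Value
dfA$mult <- vapply(dfA$Value, lookup_mult, numeric(1), ranges = dfB)
dfA

# Boundary and out-of-range checks
on_boundary  <- lookup_mult(50, dfB)    # falls in [50, 80)
out_of_range <- lookup_mult(2000, dfB)  # matches nothing
```

Using vapply instead of sapply is a design choice here: it fails loudly if the helper ever returns something other than a single number.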
Data
dfA <- structure(list(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L)),
                 row.names = c(NA, -5L), class = "data.frame")
dfB <- structure(list(Min = c(0L, 50L, 80L, 100L, 120L),
                      Max = c(50L, 80L, 100L, 120L, 1000L),
                      Mult = c(0, 0.5, 0.8, 1, 1.2)),
                 class = "data.frame", row.names = c(NA, -5L))
Since the intervals in dfB partition a larger range, we could also use findInterval or cut to match the values in dfA to the intervals in dfB. Using findInterval:
findInterval(x = dfA$Value, vec = c(dfB$Min[1], dfB$Max))
#> [1] 2 1 1 5 2
Combining this to create a new Mult column in dfA, we can write:
dfA$Mult <- with(dfB, Mult[findInterval(x = dfA$Value, vec = c(Min[1], Max))])
dfA
#>   User Value Mult
#> 1    1    54  0.5
#> 2    2    12  0.0
#> 3    3     7  0.0
#> 4    4   123  1.2
#> 5    5    74  0.5
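This answer mentions cut as an alternative but only shows findInterval. Here is a minimal sketch of what a cut version might look like, assuming left-closed intervals [Min, Max) to mirror findInterval's default behaviour; breaks and idx are illustrative names:

```r
dfA <- data.frame(User = 1:5, Value = c(54, 12, 7, 123, 74))
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Break points: the first Min followed by every Max
breaks <- c(dfB$Min[1], dfB$Max)

# labels = FALSE returns the interval index instead of a factor;
# right = FALSE makes the intervals closed on the left: [Min, Max)
idx <- cut(dfA$Value, breaks = breaks, right = FALSE, labels = FALSE)

dfA$Mult <- dfB$Mult[idx]
dfA
```

Values outside the break points come back as NA from cut, so they end up with an NA multiplier rather than an error.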
Disclaimer: if the intervals in dfB do not align nicely, using findInterval becomes more tedious, in which case Ronak's approach may be simpler.
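Whether the intervals align can be checked before relying on findInterval. A small sketch, assuming dfB is sorted by Min and that "aligned" means each Max equals the Min of the next row; is_contiguous and gappy are hypothetical names used only for illustration:

```r
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Hypothetical helper: TRUE when every range ends where the next begins
is_contiguous <- function(ranges) {
  n <- nrow(ranges)
  if (n < 2) return(TRUE)
  all(ranges$Max[-n] == ranges$Min[-1])
}

aligned <- is_contiguous(dfB)

# Illustrative counter-example: open a gap between 80 and 90
gappy <- dfB
gappy$Min[3] <- 90
not_aligned <- is_contiguous(gappy)
```

When the check fails, a per-range comparison like the sapply approach is probably the safer choice, since it does not assume a single shared vector of break points.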
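The approaches below work out which row of dfB each Value in dfA belongs to, i.e. which Min-Max range it falls in. The last one should be faster, because it filters dfB right away, once per Value, and then uses only the filtered and stacked Mult values to define the new dfA variable. However, it uses a parallel map nested inside another map, all inside mutate(), so I bet it is hard to read for anyone not very familiar with purrr. I would love to see some comparison benchmarks, though! Also note that the task leaves the edge cases between the Mult brackets unclear: if the value is 50, is Mult 0 or 0.5? Here I chose the higher Mult.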
dfA <- structure(list(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L)),
                 row.names = c(NA, -5L), class = "data.frame")
dfB <- structure(list(Min = c(0L, 50L, 80L, 100L, 120L),
                      Max = c(50L, 80L, 100L, 120L, 1000L),
                      Mult = c(0, 0.5, 0.8, 1, 1.2)),
                 class = "data.frame", row.names = c(NA, -5L))

library(tidyverse)

# Cross join, then keep the rows where Value falls in [Min, Max)
crossing(dfA, dfB) %>%
  filter(Value >= Min, Value < Max) %>%
  select(-Min, -Max)
#> # A tibble: 5 x 3
#>    User Value  Mult
#>   <int> <int> <dbl>
#> 1     1    54   0.5
#> 2     2    12   0
#> 3     3     7   0
#> 4     4   123   1.2
#> 5     5    74   0.5

# Slightly more verbose and yet slightly DRYer
# (between() is inclusive on both ends, so Max - 1 only works for integer values)
crossing(dfA, dfB) %>%
  filter(list(Value, Min, Max - 1) %>% pmap_lgl(between)) %>%
  select(-Min, -Max)
#> # A tibble: 5 x 3
#>    User Value  Mult
#>   <int> <int> <dbl>
#> 1     1    54   0.5
#> 2     2    12   0
#> 3     3     7   0
#> 4     4   123   1.2
#> 5     5    74   0.5

# Definitely faster, and DRY, and yet more verbose as well as harder to read
dfA %>%
  mutate(
    Mult = Value %>%
      map_dbl(
        ~ dfB %>%
          filter(list(.x, Min, Max - 1) %>% pmap_lgl(between)) %>%
          pull(Mult)
      )
  )
#>   User Value Mult
#> 1    1    54  0.5
#> 2    2    12  0.0
#> 3    3     7  0.0
#> 4    4   123  1.2
#> 5    5    74  0.5

Created on 2019-09-29 by the reprex package (v0.3.0)
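The edge case raised in this answer (what should happen at exactly 50) can be made concrete with findInterval's left.open argument. A minimal base R sketch, assuming the same dfB as in the question: the default treats intervals as [Min, Max) and so picks the higher Mult at a boundary, while left.open = TRUE treats them as (Min, Max] and picks the lower one. higher and lower are illustrative names.

```r
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

breaks <- c(dfB$Min[1], dfB$Max)

# Default: 50 belongs to [50, 80), i.e. the higher multiplier
higher <- dfB$Mult[findInterval(50, breaks)]

# left.open = TRUE: 50 belongs to (0, 50], i.e. the lower multiplier
lower <- dfB$Mult[findInterval(50, breaks, left.open = TRUE)]

c(higher = higher, lower = lower)
```

Which convention is right depends on how the ranges in dfB are meant to be read, which the question does not say.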
dfA <- structure(list(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L)),
                 row.names = c(NA, -5L), class = "data.frame")
dfB <- structure(list(Min = c(0L, 50L, 80L, 100L, 120L),
                      Max = c(50L, 80L, 100L, 120L, 1000L),
                      Mult = c(0, 0.5, 0.8, 1, 1.2)),
                 class = "data.frame", row.names = c(NA, -5L))

# we add a mult column to dfA and set all its values to NA
dfA$mult <- NA_real_

# now we create a function which takes a single value from dfA as input
# and returns the desired multiplier from dfB (NA if no range matches)
mult_fun <- function(x) {
  for (j in seq_len(nrow(dfB))) {
    # half-open range [Min, Max): with strict > and < a boundary value
    # such as 50 would match no row and the assignment below would fail
    if (x >= dfB$Min[j] && x < dfB$Max[j]) {
      return(dfB$Mult[j])
    }
  }
  NA_real_
}

# now we use mult_fun to get the multiplier for every value in dfA
for (i in seq_len(nrow(dfA))) {
  dfA$mult[i] <- mult_fun(dfA$Value[i])
}
dfA

Output:
  User Value mult
1    1    54  0.5
2    2    12  0.0
3    3     7  0.0
4    4   123  1.2
5    5    74  0.5
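One of the answers above asks for comparison benchmarks. What follows is only a rough sketch of how such a comparison might be set up in base R, not a measurement: the 10,000 random values are synthetic (an assumption for illustration), timings will differ from machine to machine, and only the sapply and findInterval approaches are compared, both using half-open ranges [Min, Max).

```r
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Synthetic input, for illustration only
set.seed(1)
values <- sample(0:999, 1e4, replace = TRUE)
breaks <- c(dfB$Min[1], dfB$Max)

# Per-value lookup with sapply
t_sapply <- system.time(
  res_sapply <- sapply(values, function(x) with(dfB, Mult[x >= Min & x < Max]))
)

# Vectorised lookup with findInterval
t_find <- system.time(
  res_find <- dfB$Mult[findInterval(values, breaks)]
)

# Both approaches should agree before their timings are compared
same_result <- isTRUE(all.equal(res_sapply, res_find))

c(sapply = t_sapply[["elapsed"]], findInterval = t_find[["elapsed"]])
```

For more careful timing, a dedicated benchmarking package would be a better fit than system.time, which only gives a single coarse measurement.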