Look up a DataFrame value between the values of two columns in another DataFrame

Question (votes: 1, answers: 4)

I have two data frames. dfA contains users and a given value:

dfA
    User Value 
[1]    1    54
[2]    2    12
[3]    3     7
[4]    4   123
[5]    5    74

dfB contains ranges of values and the multiplier I want to add to dfA:

dfB
     Min    Max   Mult 
[1]    0     50      0
[2]   50     80    0.5
[3]   80    100    0.8
[4]  100    120      1
[5]  120   1000    1.2

So the desired result is to add the multiplier from dfB to dfA:

dfA
    User Value  Mult 
[1]    1    54   0.5
[2]    2    12     0
[3]    3     7     0
[4]    4   123   1.2
[5]    5    74   0.5

I have tried this code (which works for a single value), but it does not work on the data frame:

dfA$Mult <- print(subset(dfB, dfA$Value > dfB$Min & dfA$Value < dfB$Max)$Mult)

Thanks in advance!

r
4 Answers
1
vote
In base R, you can use sapply to look up the multiplier for each Value in dfA:

dfA$mult <- sapply(dfA$Value, function(x) with(dfB, Mult[x >= Min & x <= Max]))
dfA
#  User Value mult
#1    1    54  0.5
#2    2    12  0.0
#3    3     7  0.0
#4    4   123  1.2
#5    5    74  0.5

Data

dfA <- structure(list(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L)),
                 row.names = c(NA, -5L), class = "data.frame")

dfB <- structure(list(Min = c(0L, 50L, 80L, 100L, 120L),
                      Max = c(50L, 80L, 100L, 120L, 1000L),
                      Mult = c(0, 0.5, 0.8, 1, 1.2)),
                 class = "data.frame", row.names = c(NA, -5L))
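One thing worth checking with the sapply approach is what happens when a Value sits exactly on a bracket boundary: with `>=` and `<=` on both ends, a Value of 50 matches two rows of dfB, so sapply would return a list rather than a numeric vector. The following is a minimal sketch of one way around that, assuming half-open brackets (Min inclusive, Max exclusive), which the question does not specify. The helper name `lookup_mult` and the sixth user with Value 50 are hypothetical and added only for illustration.

```r
# Same data as in the question, plus one hypothetical user (User 6)
# whose Value lies exactly on a bracket boundary.
dfA <- data.frame(User = 1:6, Value = c(54, 12, 7, 123, 74, 50))
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Hypothetical helper: return the multiplier of the single bracket that
# contains x, treating brackets as [Min, Max) (an assumption).
# Returns NA if no bracket (or more than one) matches.
lookup_mult <- function(x, ranges) {
  hit <- ranges$Mult[x >= ranges$Min & x < ranges$Max]
  if (length(hit) == 1L) hit else NA_real_
}

# vapply guarantees a numeric vector with one element per Value
dfA$Mult <- vapply(dfA$Value, lookup_mult, numeric(1), ranges = dfB)
dfA
```

Using vapply instead of sapply makes the call fail loudly if the helper ever returns something other than a single number, rather than silently producing a list column.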

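2
votes
If the intervals in dfB partition a larger range into consecutive intervals, as in the example, we can also use findInterval or cut to match the values in dfA to the intervals in dfB. Using findInterval: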

findInterval(x = dfA$Value, vec = c(dfB$Min[1], dfB$Max))
#> [1] 2 1 1 5 2

Combining this with the creation of the new Mult column in dfA, we can write:

dfA$Mult <- with(dfB, Mult[findInterval(x = dfA$Value, vec = c(Min[1], Max))])
dfA
#>   User Value Mult
#> 1    1    54  0.5
#> 2    2    12  0.0
#> 3    3     7  0.0
#> 4    4   123  1.2
#> 5    5    74  0.5

Disclaimer: if the intervals in dfB do not line up this neatly, using findInterval becomes more tedious, in which case Ronak's approach may be simpler.
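The answer mentions cut as an alternative without showing it. A minimal sketch of what that could look like follows, under the same assumption that the dfB intervals are consecutive and that each bracket includes its Min but not its Max (`right = FALSE`). The variable names `breaks` and `idx` are illustrative only.

```r
dfA <- data.frame(User = 1:5, Value = c(54, 12, 7, 123, 74))
dfB <- data.frame(Min  = c(0, 50, 80, 100, 120),
                  Max  = c(50, 80, 100, 120, 1000),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Break points: the first Min followed by every Max
breaks <- c(dfB$Min[1], dfB$Max)

# labels = FALSE makes cut return the integer index of the interval;
# right = FALSE makes the intervals closed on the left: [Min, Max)
idx <- cut(dfA$Value, breaks = breaks, right = FALSE, labels = FALSE)

# Use the interval index to pick the matching multiplier
dfA$Mult <- dfB$Mult[idx]
dfA
```

Values outside the overall range (here, below 0 or at or above 1000) get an NA index from cut, and therefore an NA multiplier.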


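0
votes
Some tidyverse solutions. The first two do a cross join between the two tables (which could be very expensive for large tables) and then filter the result down to the rows where the original dfA Value falls within the Min and Max range. The last one will definitely be faster, since it filters dfB right away, once per dfA Value, and then defines the new dfA variable Mult from the filtered and stacked Mult values. However, it uses a parallel map nested inside another map, all inside mutate(), so I bet it is hard to read for people who are not too familiar with purrr. I would love to see some comparison benchmarks, though!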

Also note that the task is unclear about the edge cases between Mult brackets: if Value is 50, is Mult 0 or 0.5? Here I choose the higher Mult.

dfA <- structure(list(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L)),
                 row.names = c(NA, -5L), class = "data.frame")

dfB <- structure(list(Min = c(0L, 50L, 80L, 100L, 120L),
                      Max = c(50L, 80L, 100L, 120L, 1000L),
                      Mult = c(0, 0.5, 0.8, 1, 1.2)),
                 class = "data.frame", row.names = c(NA, -5L))

library(tidyverse)

crossing(dfA, dfB) %>%
  filter(Value >= Min, Value < Max) %>%
  select(-Min, -Max)
#> # A tibble: 5 x 3
#>    User Value  Mult
#>   <int> <int> <dbl>
#> 1     1    54   0.5
#> 2     2    12   0
#> 3     3     7   0
#> 4     4   123   1.2
#> 5     5    74   0.5

# Slightly more verbose and yet slightly DRYer
crossing(dfA, dfB) %>%
  filter(list(Value, Min, Max - 1) %>% pmap_lgl(between)) %>%
  select(-Min, -Max)
#> # A tibble: 5 x 3
#>    User Value  Mult
#>   <int> <int> <dbl>
#> 1     1    54   0.5
#> 2     2    12   0
#> 3     3     7   0
#> 4     4   123   1.2
#> 5     5    74   0.5

# Definitely faster, and DRY, and yet more verbose as well as harder to read
dfA %>%
  mutate(
    Mult = Value %>% map_dbl(
      ~ dfB %>%
        filter(list(.x, Min, Max - 1) %>% pmap_lgl(between)) %>%
        pull(Mult)
    )
  )
#>   User Value Mult
#> 1    1    54  0.5
#> 2    2    12  0.0
#> 3    3     7  0.0
#> 4    4   123  1.2
#> 5    5    74  0.5

Created on 2019-09-29 by the reprex package (v0.3.0)
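For readers who want to see the cross-join-then-filter idea without the tidyverse, here is a rough base R sketch of the same technique. It relies on `merge(..., by = NULL)` producing the Cartesian product of the two data frames, and it assumes the same edge-case choice as above (a Value equal to a boundary gets the higher Mult). The names `pairs` and `matched` are illustrative only.

```r
dfA <- data.frame(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L))
dfB <- data.frame(Min  = c(0L, 50L, 80L, 100L, 120L),
                  Max  = c(50L, 80L, 100L, 120L, 1000L),
                  Mult = c(0, 0.5, 0.8, 1, 1.2))

# Cross join: every row of dfA paired with every row of dfB (5 x 5 = 25 rows)
pairs <- merge(dfA, dfB, by = NULL)

# Keep only the pairs where Value falls inside [Min, Max)
matched <- pairs[pairs$Value >= pairs$Min & pairs$Value < pairs$Max,
                 c("User", "Value", "Mult")]

# Restore the original dfA order and tidy up the row names
matched <- matched[order(matched$User), ]
rownames(matched) <- NULL
matched
```

The intermediate table has nrow(dfA) * nrow(dfB) rows, which is why this style of solution becomes expensive as either table grows.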


0
votes
Here is the code:

dfA <- structure(list(User = 1:5, Value = c(54L, 12L, 7L, 123L, 74L)),
                 row.names = c(NA, -5L), class = "data.frame")

dfB <- structure(list(Min = c(0L, 50L, 80L, 100L, 120L),
                      Max = c(50L, 80L, 100L, 120L, 1000L),
                      Mult = c(0, 0.5, 0.8, 1, 1.2)),
                 class = "data.frame", row.names = c(NA, -5L))

# add a mult column to dfA and set all its values to NA
dfA$mult <- NA_real_

# function that takes a single value from dfA
# and returns the matching multiplier from dfB
mult_fun <- function(x) {
  for (j in seq_len(nrow(dfB))) {
    if (x > dfB$Min[j] && x < dfB$Max[j]) {
      return(dfB$Mult[j])
    }
  }
  # no bracket matched (e.g. x lies exactly on a boundary)
  NA_real_
}

# use mult_fun to get the multiplier for every value in dfA
for (i in seq_len(nrow(dfA))) {
  dfA$mult[i] <- mult_fun(dfA$Value[i])
}
