Next out Winners是指一起参加某场比赛的参赛者(RaceA),在该场比赛后,每个参赛者参加另一场比赛,RaceB(i)。RaceB(i)代表RaceA的每个参赛者接下来参加的下一场比赛。RaceB(i)可以是不同的,也可以是相同的(不要求是相同的比赛)。
Next Out Winners是RaceA(和其他比赛)的指标,反映了在某个指定时间范围内(例如90天)赢得RaceB(i)的选手数量。在指定的时间范围之后进行的比赛不包括在计算Next Out Winners的时候。
我的问题是,如何使用tidyverse来计算这个问题?我假设purrr函数将发挥突出的作用。
解决问题的逻辑:key inputs: 赛事、日期、选手和完赛位置.data: 赛事数据框,按升序排列,即日期、赛号和选手。
关键函数(s): map2_xxx或pmap函数,用于计算所有在90天内再次参加比赛并获胜(完赛=1)的马匹。
疑问:在未来90天内再次参加比赛并获胜(完赛率=1)的所有马匹的总和。: 我们的函数如何知道每匹马只回顾1场下一场比赛)。) 遍历数据框中的每一场比赛,效率会非常低。我已经创建了一个小样本数据框架,并将其包含在一个reprex中。我还在继续努力,但欢迎大家提供任何协助或建议。
Reprex:
library(tidyverse)
library(lubridate)
library(reprex)
#> Warning: package 'reprex' was built under R version 3.5.3
jdiff <- function(x, y, t="equal", i=0, j=0){
case_when(
t == 'equal'~ if_else(x ==y,1L,0L),
t == 'rng' ~ if_else(x >= y-i & x <= y+j,1L,0L),
TRUE ~ 0L
)
}
df<-tibble(runner=c("D.Wottle","D.Wottle","D.Wottle","D.Wottle","D.Wottle","D.Wottle","C.Hottle","C.Hottle","C.Hottle","C.Hottle","C.Hottle","C.Hottle","JJ.Watt","JJ.Watt","JJ.Watt","JJ.Watt","JJ.Watt","JJ.Watt"),
race_number=c(6,5,4,3,2,1,6,5,4,3,2,1,6,5,4,3,2,1),
race_venue=c("xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx","xyx"),
race_date=c(ymd('20170625'),ymd('20170524'),ymd('20170420'),ymd('20170329'),ymd('20170308'),ymd('20170215'),ymd('20170625'),ymd('20170524'),ymd('20170410'),ymd('20170329'),ymd('20170304'),ymd('20170215'),ymd('20170615'),ymd('20170524'),ymd('20170428'),ymd('20170329'),ymd('20170301'),ymd('20170225')),
distance=c(1400L,1400L,1600L,1400L,1500L,1400L,1400L,1400L,1600L,1400L,1500L,1400L,1400L,1400L,1600L,1400L,1500L,1400L),
finish=c(1L,2L,2L,1L,2L,3L,2L,3L,3L,2L,1L,1L,3L,1L,1L,3L,3L,2L)
) %>%
arrange(race_date, race_number, finish) %>%
# Calculation here to add Next out winner for each Race
#
mutate(next_out_winners = 0) %>%
group_by(race_date, race_number)
#group by races to show the next out winner for each race
#group_by(race_dte, race_number)
print(df,36)
#> # A tibble: 18 x 7
#> # Groups: race_date, race_number [12]
#> runner race_number race_venue race_date distance finish next_out_winners
#> <chr> <dbl> <chr> <date> <int> <int> <dbl>
#> 1 C.Hottle 1 xyx 2017-02-15 1400 1 0
#> 2 D.Wottle 1 xyx 2017-02-15 1400 3 0
#> 3 JJ.Watt 1 xyx 2017-02-25 1400 2 0
#> 4 JJ.Watt 2 xyx 2017-03-01 1500 3 0
#> 5 C.Hottle 2 xyx 2017-03-04 1500 1 0
#> 6 D.Wottle 2 xyx 2017-03-08 1500 2 0
#> 7 D.Wottle 3 xyx 2017-03-29 1400 1 0
#> 8 C.Hottle 3 xyx 2017-03-29 1400 2 0
#> 9 JJ.Watt 3 xyx 2017-03-29 1400 3 0
#> 10 C.Hottle 4 xyx 2017-04-10 1600 3 0
#> 11 D.Wottle 4 xyx 2017-04-20 1600 2 0
#> 12 JJ.Watt 4 xyx 2017-04-28 1600 1 0
#> 13 JJ.Watt 5 xyx 2017-05-24 1400 1 0
#> 14 D.Wottle 5 xyx 2017-05-24 1400 2 0
#> 15 C.Hottle 5 xyx 2017-05-24 1400 3 0
#> 16 JJ.Watt 6 xyx 2017-06-15 1400 3 0
#> 17 D.Wottle 6 xyx 2017-06-25 1400 1 0
#> 18 C.Hottle 6 xyx 2017-06-25 1400 2 0
创建于2020-05-14 重读包 (v0.3.0)
期望的输出(手工计算)
我没有用 map
功能--我使用了 lead
nextouts <- df %>%
# These are clutter
select(- race_venue, -distance) %>%
arrange(runner, race_date) %>%
group_by(runner) %>%
# Use 'lead' to compare this race to the next one.
mutate(next_finish = lead(finish, default= 0),
next_date = lead(race_date, default = NA),
is_next_out_win =
(next_finish == 1 & next_date - race_date <= 90)) %>%
ungroup()
目前看来效果不错
# A tibble: 18 x 7
runner race_number race_date finish next_finish next_date is_next_out_win
<chr> <dbl> <date> <int> <dbl> <date> <lgl>
1 C.Hottle 1 2017-02-15 1 1 2017-03-04 TRUE
2 C.Hottle 2 2017-03-04 1 2 2017-03-29 FALSE
3 C.Hottle 3 2017-03-29 2 3 2017-04-10 FALSE
4 C.Hottle 4 2017-04-10 3 3 2017-05-24 FALSE
5 C.Hottle 5 2017-05-24 3 2 2017-06-25 FALSE
6 C.Hottle 6 2017-06-25 2 0 NA FALSE
7 D.Wottle 1 2017-02-15 3 2 2017-03-08 FALSE
8 D.Wottle 2 2017-03-08 2 1 2017-03-29 TRUE
9 D.Wottle 3 2017-03-29 1 2 2017-04-20 FALSE
10 D.Wottle 4 2017-04-20 2 2 2017-05-24 FALSE
11 D.Wottle 5 2017-05-24 2 1 2017-06-25 TRUE
12 D.Wottle 6 2017-06-25 1 0 NA FALSE
13 JJ.Watt 1 2017-02-25 2 3 2017-03-01 FALSE
14 JJ.Watt 2 2017-03-01 3 3 2017-03-29 FALSE
15 JJ.Watt 3 2017-03-29 3 1 2017-04-28 TRUE
16 JJ.Watt 4 2017-04-28 1 1 2017-05-24 TRUE
17 JJ.Watt 5 2017-05-24 1 3 2017-06-15 FALSE
18 JJ.Watt 6 2017-06-15 3 0 NA FALSE
看看每场比赛的结果。
nextoutsR <- nextouts %>%
arrange(race_number)
# A tibble: 18 x 7
runner race_number race_date finish next_finish next_date is_next_out_win
<chr> <dbl> <date> <int> <dbl> <date> <lgl>
1 C.Hottle 1 2017-02-15 1 1 2017-03-04 TRUE
2 D.Wottle 1 2017-02-15 3 2 2017-03-08 FALSE
3 JJ.Watt 1 2017-02-25 2 3 2017-03-01 FALSE
4 C.Hottle 2 2017-03-04 1 2 2017-03-29 FALSE
5 D.Wottle 2 2017-03-08 2 1 2017-03-29 TRUE
6 JJ.Watt 2 2017-03-01 3 3 2017-03-29 FALSE
7 C.Hottle 3 2017-03-29 2 3 2017-04-10 FALSE
8 D.Wottle 3 2017-03-29 1 2 2017-04-20 FALSE
9 JJ.Watt 3 2017-03-29 3 1 2017-04-28 TRUE
10 C.Hottle 4 2017-04-10 3 3 2017-05-24 FALSE
11 D.Wottle 4 2017-04-20 2 2 2017-05-24 FALSE
12 JJ.Watt 4 2017-04-28 1 1 2017-05-24 TRUE
13 C.Hottle 5 2017-05-24 3 2 2017-06-25 FALSE
14 D.Wottle 5 2017-05-24 2 1 2017-06-25 TRUE
15 JJ.Watt 5 2017-05-24 1 3 2017-06-15 FALSE
16 C.Hottle 6 2017-06-25 2 0 NA FALSE
17 D.Wottle 6 2017-06-25 1 0 NA FALSE
18 JJ.Watt 6 2017-06-15 3 0 NA FALSE
看起来每场比赛都有一个下一个赢家(除了最后一场) 这很合理,因为每场比赛都只有三匹马,而且是跑马。
我们来总结一下。 记住,你可以把 "真 "和 "假 "相加,因为它们是1,0。
nextOutWinsInRace <- nextouts %>%
group_by(race_number) %>%
summarise(nextOutWinCount = sum(is_next_out_win))
R> nextOutWinsInRace
# A tibble: 6 x 2
race_number nextOutWinCount
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 0
叫它收工。