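There is a DataCamp exercise on computing the probability of winning an NBA series. The Cavs and the Warriors are playing a seven-game championship series. The first team to win four games wins the series. Each team has a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computes the probability with a Monte Carlo simulation: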
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when game is lost and 1 when won.
  sum(x) >= 4
})
mean(results)
Here is their other approach, which computes the probability exactly with simple code:
# Assign a variable 'n' as the number of remaining games.
n<-6
# Assign a variable `outcomes` as a vector of possible game outcomes: 0 indicates a loss and 1 a win for the Cavs.
outcomes<-c(0,1)
# Assign a variable `l` to a list of all possible outcomes in all remaining games. Use the `rep` function on `list(outcomes)` to create list of length `n`.
l<-rep(list(outcomes),n)
# Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
possibilities<-expand.grid(l) # My comment: note how this produces 64 combinations.
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results<-rowSums(possibilities)>=4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
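As a cross-check on the enumeration above: the number of wins in six independent 50-50 games follows a binomial distribution, so the same probability can be read off `pbinom`. This is only a sketch, and the names `p_enum` and `p_binom` are mine, not DataCamp's.

```r
# Exact probability by enumerating all 2^6 outcomes of the remaining games
possibilities <- expand.grid(rep(list(c(0, 1)), 6))
p_enum <- mean(rowSums(possibilities) >= 4)

# Same quantity from the binomial distribution: P(X >= 4) = 1 - P(X <= 3)
p_binom <- 1 - pbinom(3, size = 6, prob = 0.5)

# Both should equal 22/64, since choose(6,4) + choose(6,5) + choose(6,6) = 22
c(enumeration = p_enum, binomial = p_binom)
```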
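Question / problem:
Both approaches give roughly the same probability of winning the series, ~0.34. However, the concept and the code design seem flawed. For example, the code (which samples six games) allows combinations such as: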
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for game G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for game G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
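As you can see, these are impossible. Once the Cavs win the first four of the remaining six games, no more games should be played. Likewise, once they lose the first three of the remaining six games, no more games should be played. So these combinations should not be included when calculating the probability of winning the series. Some combinations are double counted.
Here is what I did to omit some of the combinations that cannot happen in real life.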
library(dplyr) # provides %>%, mutate() and filter()
outcomes<-c(0,1)
l<-rep(list(outcomes),6)
possibilities<-expand.grid(l)
possibilities<-possibilities %>% mutate(rowsums=rowSums(possibilities)) %>% filter(rowsums<=4)
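But I cannot get rid of the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. That is because no more games are played after three losses in a row, so these are essentially the same series counted more than once.
There are too many conditions for me to filter them one by one. There must be a more efficient and intuitive way to do this. Can someone give me some hints on how to sort out this whole mess?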
Here is one way:
library(dplyr)
outcomes<-c(0,1)
l<-rep(list(outcomes),6)
possibilities<-expand.grid(l)
possibilities %>%
  mutate(rowsums=rowSums(cur_data()),
         anti_sum = rowSums(!cur_data())) %>%
  filter(rowsums<=4, anti_sum <= 3)
We rely on the fact that R coerces 0 to the logical value FALSE, so negating a 0 gives TRUE, which counts as 1 when summed. See
sum(!0)
for a short example.
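Another way to look at the filtering, offered as a sketch rather than as DataCamp's or this answer's method: cut each of the 64 rows off at the game where the series would really end (four Cavs wins, or three more losses after game 1), then collapse the duplicates. The helper `truncate_series` and the other names are hypothetical. Each distinct real series of length k is weighted by 0.5^k, which is exactly the share of the 64 padded rows that collapse onto it.

```r
# All 2^6 padded outcomes for games 2-7 (1 = Cavs win, 0 = Cavs loss)
possibilities <- expand.grid(rep(list(c(0, 1)), 6))

# Hypothetical helper: keep games only up to the one that decides the series.
# The Cavs already lost game 1, so they need 4 wins; 3 more losses end it.
truncate_series <- function(x) {
  decided <- which(cumsum(x) == 4 | cumsum(1 - x) == 3)[1]
  x[seq_len(decided)]
}

# One string per row, e.g. "0,0,0" for three straight losses
real_series <- apply(possibilities, 1,
                     function(x) paste(truncate_series(x), collapse = ","))
distinct <- unique(real_series)

# Games actually played in each distinct series, and whether the Cavs won it
games <- strsplit(distinct, ",")
games_played <- lengths(games)
cavs_win <- vapply(games, function(g) sum(as.numeric(g)) == 4, logical(1))

n_series <- length(distinct)                  # series that can really happen
total    <- sum(0.5^games_played)             # weights should add up to 1
p_cavs   <- sum(0.5^games_played[cavs_win])   # probability the Cavs win

c(n_series = n_series, total = total, p_cavs = p_cavs)
```

Under these assumptions there are 35 distinct real series and the Cavs' probability comes out at 22/64 = 0.34375, the same as the padded enumeration. That suggests the extra rows are not harmful double counting: a series decided after k games appears in 2^(6-k) padded rows of probability 1/64 each, which adds up to 0.5^k.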
I ran the following code:
results=replicate(10000,{
n=6
outcomes=c(0,1)
l<-sample(rep(outcomes,n),6)
sum(l)>=4
}
)
mean(results)
My result is 0.2821. What am I doing wrong compared with the calculations above?
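A likely explanation, judging from the code alone: `sample()` draws without replacement by default, so `sample(rep(outcomes, n), 6)` picks 6 items from a pool of six 0s and six 1s instead of playing six independent games. That makes the number of wins hypergeometric rather than binomial. The sketch below compares the two exact values; the variable names are mine.

```r
# Six independent 50-50 games: binomial, P(at least 4 wins)
p_with_replacement <- 1 - pbinom(3, size = 6, prob = 0.5)

# Six draws without replacement from six 1s and six 0s: hypergeometric
p_without_replacement <- 1 - phyper(3, m = 6, n = 6, k = 6)

# Simulation with independent games, as in the DataCamp setup
set.seed(1)
sim <- mean(replicate(10000, sum(sample(0:1, 6, replace = TRUE)) >= 4))

c(binomial = p_with_replacement,
  hypergeometric = p_without_replacement,
  simulated = sim)
```

The hypergeometric value is 262/924, about 0.2835, which is close to the 0.2821 reported. Replacing the sampling line with `sample(outcomes, 6, replace = TRUE)` should bring the simulation back to roughly 0.34.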