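The scenario is drag racing... Sometimes a driver races against a competitor, and sometimes a driver races alone. The drivers and their skill levels are always completely random. Each race ends after lap 12, and one race is held per day for 10 years. There are hundreds of drivers. An independent observer recorded data during the races, including driver speed, but only for one of the drivers! So data is missing. Here are the first 6 rows of the data: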
df <- data.frame(
Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
Driver_level = c("A", "C", "D", "A", "B", "B"),
Driver_speed = c(96, 91, 89, 94, 88, 99),
Competitor= c("Yes", "Yes", "Yes", "Yes", "No", "No"),
Comp_name= c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
Comp_level= c("B", "B", "D", "A", "NA", "NA"),
Comp_speed= c("???", "???", "???", "???", "NA", "NA"),
Race_day= c(165, 165, 72, 72, 92, 65),
Lap_number= c(9, 9, 12, 12, 8, 4),
Humidity= c(33, 33, 88, 88, 12, 55),
Temperature= c(28, 28, 12, 12, 20, 28)
)
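Each row represents a different driver, but I need to fill in the competitor's speed! I will enter the speeds by hand here to show what needs to happen for the rest of the dataset.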
df_1 <- data.frame(
Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
Driver_level = c("A", "C", "D", "A", "B", "B"),
Driver_speed = c(96, 91, 89, 94, 88, 99),
Competitor= c("Yes", "Yes", "Yes", "Yes", "No", "No"),
Comp_name= c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
Comp_level= c("B", "B", "D", "A", "NA", "NA"),
Comp_speed= c(91, 96, 94, 89, "NA", "NA"),
Race_day= c(165, 165, 72, 72, 92, 65),
Lap_number= c(9, 9, 12, 12, 8, 4),
Humidity= c(33, 33, 88, 88, 12, 55),
Temperature= c(28, 28, 12, 12, 20, 28)
)
Can you help me?
This is an ideal use case for left_join.
Your data
df <- data.frame(
Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
Driver_level = c("A", "C", "D", "A", "B", "B"),
Driver_speed = c(96, 91, 89, 94, 88, 99),
Competitor= c("Yes", "Yes", "Yes", "Yes", "No", "No"),
Comp_name= c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
Comp_level= c("B", "B", "D", "A", "NA", "NA"),
Comp_speed= c("???", "???", "???", "???", "NA", "NA"),
Race_day= c(165, 165, 72, 72, 92, 65),
Lap_number= c(9, 9, 12, 12, 8, 4),
Humidity= c(33, 33, 88, 88, 12, 55),
Temperature= c(28, 28, 12, 12, 20, 28)
)
We load the dplyr package
#install.packages("dplyr") #if you don't have it
library(dplyr)
Let's drop the current Comp_speed column, which only holds the "???" placeholder values.
df <- df %>% select(-Comp_speed)
Let's create a second data frame that contains only the name and the speed, renaming Driver_speed to Comp_speed on the fly.
df2 <- df %>%
select(Driver_name, Comp_speed = Driver_speed)
Now we can left_join the data frame df with df2, matching Comp_name in df against Driver_name in df2.
df_updated <- df %>% left_join(df2, by = c("Comp_name" = "Driver_name"))
#> Warning: Column `Comp_name`/`Driver_name` joining factors with different
#> levels, coercing to character vector
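A note on scaling this up: the question describes hundreds of drivers racing daily for 10 years, so the same name presumably appears on many rows, and joining on the name alone would then match a driver against every observation of the competitor. The sketch below adds Race_day and Lap_number to the join keys. That these two columns identify one observation is an assumption, and the toy data frame races, its second race day, and the names lookup and races_keyed are made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical values): Rick and Julie meet on two different days.
races <- data.frame(
  Driver_name  = c("Rick", "Julie", "Rick", "Julie"),
  Driver_speed = c(96, 91, 90, 93),
  Comp_name    = c("Julie", "Rick", "Julie", "Rick"),
  Race_day     = c(165, 165, 166, 166),
  Lap_number   = c(9, 9, 12, 12),
  stringsAsFactors = FALSE
)

# Lookup table: one row per driver, race day and lap.
lookup <- races %>%
  select(Driver_name, Race_day, Lap_number, Comp_speed = Driver_speed)

# Match on the competitor's name AND on the race day and lap,
# so each row picks up exactly one competitor speed.
races_keyed <- races %>%
  left_join(lookup, by = c("Comp_name"  = "Driver_name",
                           "Race_day"   = "Race_day",
                           "Lap_number" = "Lap_number"))
```

With the name as the only key, each of the four toy rows would match two lookup rows, giving eight rows instead of four.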
Here is the resulting data frame
df_updated
#>   Driver_name Driver_level Driver_speed Competitor Comp_name Comp_level
#> 1        Rick            A           96        Yes     Julie          B
#> 2       Julie            C           91        Yes      Rick          B
#> 3      Denver            D           89        Yes    Johnny          D
#> 4       Johny            A           94        Yes    Denver          A
#> 5   Cassandra            B           88         No        NA         NA
#> 6     Phillip            B           99         No        NA         NA
#>   Race_day Lap_number Humidity Temperature Comp_speed
#> 1      165          9       33          28         91
#> 2      165          9       33          28         96
#> 3       72         12       88          12         NA
#> 4       72         12       88          12         89
#> 5       92          8       12          20         NA
#> 6       65          4       55          28         NA
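Row 3 ends up with Comp_speed NA even though Denver had a competitor. The sample data spells the competitor "Johnny" in Comp_name but "Johny" in Driver_name, so the join finds no match. A minimal base R sketch for listing such unmatched names before joining is shown here. The two vectors are copied from the sample data, and driver_names, comp_names and unmatched are names introduced for illustration.

```r
# Names copied from the sample data in the question.
driver_names <- c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip")
comp_names   <- c("Julie", "Rick", "Johnny", "Denver", "NA", "NA")

# Competitor names that never appear as a driver.
# The string "NA" marks drivers who raced alone, so it is excluded first.
unmatched <- setdiff(comp_names[comp_names != "NA"], driver_names)
print(unmatched)
```

The source does not say which spelling is correct. Once the two columns agree, the same join should fill in 94 for that row, which is the value the question expects.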
We need to left join df to itself. The expression !names(df) %in% c("Comp_speed") drops the variable Comp_speed from the first data frame, x.
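The code for this self-join is not shown, so what follows is only one possible reading of it: a base R sketch that uses merge with all.x = TRUE as the left join. It uses three rows from the sample data, with a real NA instead of the string "NA" for the driver who raced alone. The names x, y and df_merged are assumptions.

```r
# Three rows from the sample data; NA marks a driver without a competitor.
df <- data.frame(
  Driver_name  = c("Rick", "Julie", "Cassandra"),
  Driver_speed = c(96, 91, 88),
  Comp_name    = c("Julie", "Rick", NA),
  Comp_speed   = c("???", "???", NA),
  stringsAsFactors = FALSE
)

# x: every column of df except the placeholder Comp_speed.
x <- df[, !names(df) %in% c("Comp_speed")]

# y: each driver's speed, relabelled so it can be looked up by Comp_name.
y <- data.frame(Comp_name  = df$Driver_name,
                Comp_speed = df$Driver_speed,
                stringsAsFactors = FALSE)

# all.x = TRUE keeps the rows of x that have no match in y (a left join).
df_merged <- merge(x, y, by = "Comp_name", all.x = TRUE)
```

Unlike left_join, merge sorts the result by the key and moves the key column to the front, so the row and column order differs from df.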