I have two tables that hold complementary information.
dataset_a <- data.frame(id = 1:10, country = c(rep("England", 5), rep("Northern Ireland", 5)), population = c(10000, 12000, 20000, 20000, 15000, 2500, 2800, 9000, 10110, 11000), health_rank = c(6500:6504, rep(NA, 5)), health_decile_numeric = c(rep("4", 5), rep(NA, 5)), health_decile_text = c(rep("Top 40%", 5), rep(NA, 5)))
dataset_b <- data.frame(id = 1:10, country = c(rep("England", 5), rep("Northern Ireland", 5)), health_rank = c(rep(NA, 5), 850:854), health_decile_numeric = c(rep(NA, 5), rep("2", 5)), health_decile_text = c(rep(NA, 5), rep("Top 20%", 5)))
> dataset_a
id country population health_rank health_decile_numeric health_decile_text
1 1 England 10000 6500 4 Top 40%
2 2 England 12000 6501 4 Top 40%
3 3 England 20000 6502 4 Top 40%
4 4 England 20000 6503 4 Top 40%
5 5 England 15000 6504 4 Top 40%
6 6 Northern Ireland 2500 NA <NA> <NA>
7 7 Northern Ireland 2800 NA <NA> <NA>
8 8 Northern Ireland 9000 NA <NA> <NA>
9 9 Northern Ireland 10110 NA <NA> <NA>
10 10 Northern Ireland 11000 NA <NA> <NA>
> dataset_b
id country health_rank health_decile_numeric health_decile_text
1 1 England NA <NA> <NA>
2 2 England NA <NA> <NA>
3 3 England NA <NA> <NA>
4 4 England NA <NA> <NA>
5 5 England NA <NA> <NA>
6 6 Northern Ireland 850 2 Top 20%
7 7 Northern Ireland 851 2 Top 20%
8 8 Northern Ireland 852 2 Top 20%
9 9 Northern Ireland 853 2 Top 20%
10 10 Northern Ireland 854 2 Top 20%
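dataset_a is much larger than dataset_b: dataset_a has over 2.5 million rows, while dataset_b has only a few thousand (Northern Ireland data versus data for the rest of the UK).
I need to join them on a common variable such as id, updating the shared variables while leaving the variables that exist only in dataset_a untouched. What I have done so far is: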
new_dataset <- left_join(dataset_a, dataset_b,
by = "id")
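But now I have duplicated columns: health_rank.x and health_rank.y, health_decile_numeric.x and health_decile_numeric.y, and so on. Deleting rows 6 to 10 of dataset_a (the Northern Ireland records) and adding them back with the new information is not an option, because the real dataset contains many more variables that are not in dataset_b, such as population and plenty of others, and they need to stay in place.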
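How can I merge the two data tables so that rows are updated instead of new columns being added? I am looking for a result table like this: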
> new_dataset
id country population health_rank health_decile_numeric health_decile_text
1 1 England 10000 6500 4 Top 40%
2 2 England 12000 6501 4 Top 40%
3 3 England 20000 6502 4 Top 40%
4 4 England 20000 6503 4 Top 40%
5 5 England 15000 6504 4 Top 40%
6 6 Northern Ireland 2500 850 2 Top 20%
7 7 Northern Ireland 2800 851 2 Top 20%
8 8 Northern Ireland 9000 852 2 Top 20%
9 9 Northern Ireland 10110 853 2 Top 20%
10 10 Northern Ireland 11000 854 2 Top 20%
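As I understand it, you can simply index the Northern Ireland rows. You mentioned there are extra columns, so I created a variable, wantcols, that you can modify: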
# columns you want to update
wantcols <- names(dataset_a[,-c(1:3)])
# [1] "health_rank" "health_decile_numeric" "health_decile_text"
dataset_a[dataset_a$country %in% "Northern Ireland", wantcols] <-
dataset_b[dataset_b$country %in% "Northern Ireland", wantcols]
Output:
# id country population health_rank health_decile_numeric health_decile_text
# 1 1 England 10000 6500 4 Top 40%
# 2 2 England 12000 6501 4 Top 40%
# 3 3 England 20000 6502 4 Top 40%
# 4 4 England 20000 6503 4 Top 40%
# 5 5 England 15000 6504 4 Top 40%
# 6 6 Northern Ireland 2500 850 2 Top 20%
# 7 7 Northern Ireland 2800 851 2 Top 20%
# 8 8 Northern Ireland 9000 852 2 Top 20%
# 9 9 Northern Ireland 10110 853 2 Top 20%
# 10 10 Northern Ireland 11000 854 2 Top 20%
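Note that this works for your example data, but I'm not sure whether it will scale to your real data. It also assumes the Northern Ireland rows appear in the same number and order in both tables. If it doesn't work, let me know and I can update the answer.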
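If the rows of the two tables are not guaranteed to line up, the same update can be keyed on id instead of on position. Below is a minimal base R sketch of that idea, not a tested solution for the real 2.5-million-row data. The small tables are stand-ins with the same columns as the question's, and the helper names upd, idx and found are made up for illustration. It assumes id is unique in dataset_a.

```r
# Small stand-ins for the question's tables (same columns, fewer rows).
dataset_a <- data.frame(
  id = 1:4,
  country = c("England", "England", "Northern Ireland", "Northern Ireland"),
  population = c(10000, 12000, 2500, 2800),
  health_rank = c(6500L, 6501L, NA, NA),
  health_decile_numeric = c("4", "4", NA, NA),
  health_decile_text = c("Top 40%", "Top 40%", NA, NA),
  stringsAsFactors = FALSE
)
# dataset_b is deliberately in a different row order and has an empty England row.
dataset_b <- data.frame(
  id = c(4L, 3L, 1L),
  country = c("Northern Ireland", "Northern Ireland", "England"),
  health_rank = c(851L, 850L, NA),
  health_decile_numeric = c("2", "2", NA),
  health_decile_text = c("Top 20%", "Top 20%", NA),
  stringsAsFactors = FALSE
)

# Columns present in both tables, excluding the key and the country label.
wantcols <- setdiff(intersect(names(dataset_a), names(dataset_b)),
                    c("id", "country"))

# Keep only the rows of dataset_b that carry the new information.
upd <- dataset_b[dataset_b$country %in% "Northern Ireland", ]

# For each update row, find the row of dataset_a with the same id.
idx <- match(upd$id, dataset_a$id)
found <- !is.na(idx)  # skip ids that do not occur in dataset_a

# Overwrite only the matched rows and the shared columns.
dataset_a[idx[found], wantcols] <- upd[found, wantcols]
```

Columns that exist only in dataset_a, such as population, are never touched, and no .x/.y columns are created because no join takes place. match() is vectorised, so this should be workable on large tables, though I have not benchmarked it. If you prefer to stay in dplyr, rows_patch() (dplyr 1.0.0 or later) is worth a look, since it fills only the NA values of the first table from the second by key; I have not tried it on data of this shape.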