How do I join two tables in R, updating the NAs in the first table with data from the second?

I have two tables whose information complements each other.

dataset_a <- data.frame(id = 1:10, country = c(rep("England", 5), rep("Northern Ireland", 5)), population = c(10000, 12000, 20000, 20000, 15000, 2500, 2800, 9000, 10110, 11000), health_rank = c(6500:6504, rep(NA, 5)), health_decile_numeric = c(rep("4", 5), rep(NA, 5)), health_decile_text = c(rep("Top 40%", 5), rep(NA, 5)))

dataset_b <- data.frame(id = 1:10, country = c(rep("England", 5), rep("Northern Ireland", 5)), health_rank = c(rep(NA, 5), 850:854), health_decile_numeric = c(rep(NA, 5), rep("2", 5)), health_decile_text = c(rep(NA, 5), rep("Top 20%", 5)))

> dataset_a
   id          country population health_rank health_decile_numeric health_decile_text
1   1          England      10000        6500                     4            Top 40%
2   2          England      12000        6501                     4            Top 40%
3   3          England      20000        6502                     4            Top 40%
4   4          England      20000        6503                     4            Top 40%
5   5          England      15000        6504                     4            Top 40%
6   6 Northern Ireland       2500          NA                  <NA>               <NA>
7   7 Northern Ireland       2800          NA                  <NA>               <NA>
8   8 Northern Ireland       9000          NA                  <NA>               <NA>
9   9 Northern Ireland      10110          NA                  <NA>               <NA>
10 10 Northern Ireland      11000          NA                  <NA>               <NA>

> dataset_b
   id          country health_rank health_decile_numeric health_decile_text
1   1          England          NA                  <NA>               <NA>
2   2          England          NA                  <NA>               <NA>
3   3          England          NA                  <NA>               <NA>
4   4          England          NA                  <NA>               <NA>
5   5          England          NA                  <NA>               <NA>
6   6 Northern Ireland         850                     2            Top 20%
7   7 Northern Ireland         851                     2            Top 20%
8   8 Northern Ireland         852                     2            Top 20%
9   9 Northern Ireland         853                     2            Top 20%
10 10 Northern Ireland         854                     2            Top 20%

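dataset_a is much larger than dataset_b. dataset_a has over 2.5 million rows, while dataset_b has only a few thousand (it holds the data for Northern Ireland, as opposed to the rest of the UK).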

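I need to join on a common variable, such as id, update the variables the two tables share, and leave the variables that exist in only one table untouched. What I have done so far is: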

library(dplyr)
new_dataset <- left_join(dataset_a, dataset_b,
                         by = "id")

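But now I have duplicated columns: health_rank.x, health_rank.y, health_decile_numeric.x, health_decile_numeric.y, and so on. Deleting rows 6 to 10 of dataset_a (the Northern Ireland records) and adding them back with the new information is not an option, because the real dataset contains more variables that are not in dataset_b and that need to stay in place, such as population, among many others.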

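How can I merge the two data frames so that rows are updated instead of new columns being added? I am looking for a result table like this: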

> new_dataset
   id          country population health_rank health_decile_numeric health_decile_text
1   1          England      10000        6500                     4            Top 40%
2   2          England      12000        6501                     4            Top 40%
3   3          England      20000        6502                     4            Top 40%
4   4          England      20000        6503                     4            Top 40%
5   5          England      15000        6504                     4            Top 40%
6   6 Northern Ireland       2500         850                     2            Top 20%
7   7 Northern Ireland       2800         851                     2            Top 20%
8   8 Northern Ireland       9000         852                     2            Top 20%
9   9 Northern Ireland      10110         853                     2            Top 20%
10 10 Northern Ireland      11000         854                     2            Top 20%
Tags: r, dataframe, dplyr, left-join

1 answer

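As I understand it, you can simply index the Northern Ireland rows. You mentioned there are additional columns, so I created a variable, wantcols, that you can modify: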

# columns you want to update
wantcols <- names(dataset_a[,-c(1:3)])

# [1] "health_rank" "health_decile_numeric" "health_decile_text"   

dataset_a[dataset_a$country %in% "Northern Ireland", wantcols] <-
  dataset_b[dataset_b$country %in% "Northern Ireland", wantcols]

Output:

#    id          country population health_rank health_decile_numeric health_decile_text
# 1   1          England      10000        6500                     4            Top 40%
# 2   2          England      12000        6501                     4            Top 40%
# 3   3          England      20000        6502                     4            Top 40%
# 4   4          England      20000        6503                     4            Top 40%
# 5   5          England      15000        6504                     4            Top 40%
# 6   6 Northern Ireland       2500         850                     2            Top 20%
# 7   7 Northern Ireland       2800         851                     2            Top 20%
# 8   8 Northern Ireland       9000         852                     2            Top 20%
# 9   9 Northern Ireland      10110         853                     2            Top 20%
# 10 10 Northern Ireland      11000         854                     2            Top 20%
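
The assignment above lines rows up by position, so it relies on the Northern Ireland rows appearing in the same number and order in both tables. A minimal base R sketch that aligns on id instead is shown here. It assumes id is unique in dataset_b, and it reverses dataset_b only to demonstrate that row order no longer matters; idx and fill are illustrative names, not part of the original answer.

```r
# Example tables from the question (stringsAsFactors = FALSE keeps text as character)
dataset_a <- data.frame(
  id = 1:10,
  country = c(rep("England", 5), rep("Northern Ireland", 5)),
  population = c(10000, 12000, 20000, 20000, 15000, 2500, 2800, 9000, 10110, 11000),
  health_rank = c(6500:6504, rep(NA, 5)),
  health_decile_numeric = c(rep("4", 5), rep(NA, 5)),
  health_decile_text = c(rep("Top 40%", 5), rep(NA, 5)),
  stringsAsFactors = FALSE
)
dataset_b <- data.frame(
  id = 1:10,
  country = c(rep("England", 5), rep("Northern Ireland", 5)),
  health_rank = c(rep(NA, 5), 850:854),
  health_decile_numeric = c(rep(NA, 5), rep("2", 5)),
  health_decile_text = c(rep(NA, 5), rep("Top 20%", 5)),
  stringsAsFactors = FALSE
)

# Reverse dataset_b to show that the update does not depend on row order
dataset_b <- dataset_b[rev(seq_len(nrow(dataset_b))), ]

# Columns to update
wantcols <- c("health_rank", "health_decile_numeric", "health_decile_text")

# For each row of dataset_a, the position of the same id in dataset_b
# (NA where dataset_b has no such id)
idx <- match(dataset_a$id, dataset_b$id)

for (col in wantcols) {
  # Touch only cells that are NA in dataset_a and have a matching id in dataset_b
  fill <- is.na(dataset_a[[col]]) & !is.na(idx)
  dataset_a[[col]][fill] <- dataset_b[[col]][idx[fill]]
}
```

Because only NA cells are written, columns such as population and any existing non-missing values in dataset_a stay as they were.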

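Note that this works for your sample data, but I am not sure whether it will scale to your real data. If it does not, let me know and I can update the answer.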
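
If staying within dplyr is preferred, two possible sketches follow. They are not part of the original answer. The first uses rows_patch(), which fills only the NA cells of the first table from the matching row of the second; it assumes dplyr 1.0.0 or later, that id is unique in dataset_b, and that every id in dataset_b also exists in dataset_a (by default it errors otherwise). The second keeps the left_join() from the question and collapses each .x/.y pair with coalesce(). The names update_cols, patched, joined and coalesced are illustrative.

```r
library(dplyr)

# Example tables from the question (stringsAsFactors = FALSE keeps text as character)
dataset_a <- data.frame(
  id = 1:10,
  country = c(rep("England", 5), rep("Northern Ireland", 5)),
  population = c(10000, 12000, 20000, 20000, 15000, 2500, 2800, 9000, 10110, 11000),
  health_rank = c(6500:6504, rep(NA, 5)),
  health_decile_numeric = c(rep("4", 5), rep(NA, 5)),
  health_decile_text = c(rep("Top 40%", 5), rep(NA, 5)),
  stringsAsFactors = FALSE
)
dataset_b <- data.frame(
  id = 1:10,
  country = c(rep("England", 5), rep("Northern Ireland", 5)),
  health_rank = c(rep(NA, 5), 850:854),
  health_decile_numeric = c(rep(NA, 5), rep("2", 5)),
  health_decile_text = c(rep(NA, 5), rep("Top 20%", 5)),
  stringsAsFactors = FALSE
)

# Option 1: patch the NA cells of dataset_a with values from dataset_b, matched on id.
# Non-missing values in dataset_a and columns absent from dataset_b are left alone.
patched <- rows_patch(dataset_a, dataset_b, by = "id")

# Option 2: join as in the question, then merge each .x/.y pair into one column.
update_cols <- c("health_rank", "health_decile_numeric", "health_decile_text")
joined <- left_join(dataset_a, dataset_b[c("id", update_cols)], by = "id")
for (col in update_cols) {
  # coalesce() takes the .x value and falls back to .y where .x is NA
  joined[[col]] <- coalesce(joined[[paste0(col, ".x")]], joined[[paste0(col, ".y")]])
}
# Keep only the original columns, in their original order
coalesced <- joined[names(dataset_a)]
```

Both options match on id rather than on row position, so they do not depend on the two tables being sorted the same way.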
