How to match data on a specific variable in R


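I have a sample file containing 2% of the total population, drawn from the 2016 Iranian census (1.5 million records out of a total population of 75 million). Below I use 22 people as an example: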

sample <- structure(list(household.ID = c(16523634, 16523634, 16523634, 16523634,16525912, 
    16525912, 16540127,16540127, 16598050, 16598050, 16611764,16611764, 16611764, 16643309, 
    16643309, 16652356, 16652356,16652356, 16672105, 16672105, 16672105,16672105
    ),Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230, 
    16542499,16545263, 16616975, 16620223, 16633984,16642611, 16650837, 16646986, 16650210, 
    16660335, 16665128,16668381, 16676674, 16681528, 16685073,16687491
    ),Relatshinship= c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),birth.year= 
    c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,1342,1367,1343,1336 ,1321 
     ,1326,1367,1338,1352,1372,1381),Gender  = c(1,2 ,1,2,1,2,1 ,2,1 ,1,1 ,2,1 ,1 ,2 ,1,2 ,2,1 
     ,2,1,1),age    = c(49,47,19 ,18,38,28,78,75,66 ,21,56 ,52 ,28,51 ,58 ,74 ,68 ,27  ,56 
    ,43,23 ,13),marriage.stuatus= c(1,1 ,4 ,4 ,1,1,1,1,2,4,1,1 ,1,1 ,1,1,1 ,4,1 ,1,4 ,4),   
    number.of.children.ever.born= c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA),    
    number.of.living.children  = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA)), 
    row.names = c(NA, -22L),class = "data.frame")

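I want to build a fertility history for women. To do that, I need to match children to their mothers. My data has a column that records each person's relationship to the head of the household: code 1 is the head, 2 is the head's wife, 3 is a child, 4 is a son-in-law or daughter-in-law, 5 is a grandchild, 6 is a father or mother, and so on. For example, in my data the first household (ID: 16523634) has 4 members: the head, here a man (Gender: 1 = male, 2 = female), his wife, and two children, a son (age 19) and a daughter (age 18). Long story short, I need to match the children (who also appear in the data) to the mother in the same household, so that for each mother I can build columns giving the age of each matched child. I would like my data to end up looking like this: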

H.ID      M.ID      B.year  Gender  Age  1st child  2nd child  3rd child
16523634  16529443  1348    2       47   19         18         NA

Tags: r, match

1 Answer

Starting data

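When I loaded sample from the OP, two of the vectors in the data frame had a length of only 21, so I appended an NA to each just to get a valid starting point. This is what I used: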

library(tidyverse)

sample <-
  tibble(
    household.ID = c(
      16523634,16523634,16523634,16523634,16525912,16525912,
      16540127,16540127,16598050,16598050,16611764,16611764,
      16611764,16643309,16643309,16652356,16652356,16652356,
      16672105,16672105,16672105,16672105),
    Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230, 
                 16542499,16545263, 16616975, 16620223, 16633984,16642611, 
                 16650837, 16646986, 16650210, 16660335, 16665128,16668381, 
                 16676674, 16681528, 16685073,16687491),
    Relatshinship = c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),
    birth.year = c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,
                   1342,1367,1343,1336 ,1321,1326,1367,1338,1352,1372,1381),
    Gender = c(1,2,1,2,1,2,1,2,1,1,1,2,1,1,2,1,2,2,1,2,1,1),
    age = c(49,47,19,18,38,28,78,75,66,21,56,52,28,51,58,74,68,27,56,43,23,13),
    marriage.stuatus = c(1,1,4,4,1,1,1,1,2,4,1,1,1,1,1,1,1,4,1,1,4,4),
    # the trailing NA in each of the next two vectors was added to pad them to length 22
    number.of.children.ever.born = c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,
                                     3,NA,3,NA,NA,2,NA,NA),
    number.of.living.children = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,
                                  3,NA,3,NA,NA,2,NA,NA))

It may seem cruel, but let's start by separating the mothers and the children from their families:

df_mothers <- sample %>% 
  filter(Relatshinship == 2) %>% 
  print()

# A tibble: 7 × 9
  household.ID Member.ID Relatshinship birth.year Gender   age marriage.stuatus number.of.children.ever.born number.of.living.children
         <dbl>     <dbl>         <dbl>      <dbl>  <dbl> <dbl>            <dbl>                        <dbl>                     <dbl>
1     16523634  16529443             2       1348      2    47                1                            2                         2
2     16525912  16529230             2       1367      2    28                1                           NA                        NA
3     16540127  16545263             2       1319      2    75                1                            6                         4
4     16611764  16642611             2       1342      2    52                1                            2                         2
5     16643309  16650210             2       1336      2    58                1                            3                         3
6     16652356  16665128             2       1326      2    68                1                            3                         3
7     16672105  16681528             2       1352      2    43                1                            2                         2

df_children <- sample %>% 
  filter(Relatshinship == 3) %>% 
  group_by(household.ID) %>% 
  arrange(household.ID, desc(age)) %>% 
  # ordinal() comes from the scales package, which library(tidyverse) does not attach
  mutate(birth.order = scales::ordinal(row_number())) %>% 
  select(-c(marriage.stuatus, number.of.children.ever.born, number.of.living.children)) %>%
  print()

# A tibble: 7 × 7
# Groups:   household.ID [5]
  household.ID Member.ID Relatshinship birth.year Gender   age birth.order
         <dbl>     <dbl>         <dbl>      <dbl>  <dbl> <dbl> <chr>      
1     16523634  16532250             3       1376      1    19 1st        
2     16523634  16534992             3       1377      2    18 2nd        
3     16598050  16620223             3       1374      1    21 1st        
4     16611764  16650837             3       1367      1    28 1st        
5     16652356  16668381             3       1367      2    27 1st        
6     16672105  16685073             3       1372      1    23 1st        
7     16672105  16687491             3       1381      1    13 2nd        

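OK, that groups the children by household and shows their birth order, but you want each household's children squeezed into a single row (like my sisters squeezed into one bedroom), so pivot_wider() to the rescue: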

df_children_pivot <- df_children %>% 
  pivot_wider(id_cols = household.ID,
              names_from = birth.order,
              names_glue = "{birth.order}_born",
              values_from = age) %>% 
  print()

# A tibble: 5 × 3
# Groups:   household.ID [5]
  household.ID `1st_born` `2nd_born`
         <dbl>      <dbl>      <dbl>
1     16523634         19         18
2     16598050         21         NA
3     16611764         28         NA
4     16652356         27         NA
5     16672105         23         13
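
The desired output in the question also has a third-child column, which pivot_wider() only creates if some household actually has three co-resident children. Here is a minimal sketch of one way to guarantee a fixed set of columns, assuming a cap of three children. The toy children data frame and the wanted vector are illustrative names of mine, not part of the code above:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the children rows (ages copied from the example data)
children <- data.frame(
  household.ID = c(16523634, 16523634, 16672105, 16672105, 16611764),
  age          = c(19, 18, 23, 13, 28)
)

wide <- children %>%
  arrange(household.ID, desc(age)) %>%
  group_by(household.ID) %>%
  # label the oldest child "1st", the next "2nd", and so on (assumes at most 3)
  mutate(birth.order = c("1st", "2nd", "3rd")[row_number()]) %>%
  ungroup() %>%
  pivot_wider(id_cols = household.ID,
              names_from = birth.order,
              names_glue = "{birth.order}_born",
              values_from = age)

# Add any expected column that the data did not produce, filled with NA
wanted <- c("1st_born", "2nd_born", "3rd_born")
for (col in setdiff(wanted, names(wide))) wide[[col]] <- NA_real_
wide <- wide[, c("household.ID", wanted)]

print(wide)
```

With real data you would want to set the cap from the largest household rather than hard-coding three.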

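Now, because our cruelty only goes so far, let's take those children we tore away from their families and squeezed into a single row, and reunite them with their mothers: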

df_reunited <- df_mothers %>% 
  left_join(df_children_pivot, by = "household.ID") %>% 
  select(household.ID,Member.ID,birth.year,Gender,age,`1st_born`:last_col()) %>% 
  print()

# A tibble: 7 × 7
  household.ID Member.ID birth.year Gender   age `1st_born` `2nd_born`
         <dbl>     <dbl>      <dbl>  <dbl> <dbl>      <dbl>      <dbl>
1     16523634  16529443       1348      2    47         19         18
2     16525912  16529230       1367      2    28         NA         NA
3     16540127  16545263       1319      2    75         NA         NA
4     16611764  16642611       1342      2    52         28         NA
5     16643309  16650210       1336      2    58         NA         NA
6     16652356  16665128       1326      2    68         27         NA
7     16672105  16681528       1352      2    43         23         13
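
As a rough sanity check, the number of children matched in the household can be compared with the count the mother reports. The sketch below assumes number.of.living.children is kept in the joined table (the select() above drops it). The reunited data frame is a hand-copied three-row stand-in, and children.matched and children.elsewhere are names I made up:

```r
library(dplyr)

# Three mothers copied by hand from the output above, with their reported counts
reunited <- data.frame(
  household.ID              = c(16523634, 16540127, 16611764),
  number.of.living.children = c(2, 4, 2),
  `1st_born`                = c(19, NA, 28),
  `2nd_born`                = c(18, NA, NA),
  check.names = FALSE
)

# Count the non-missing *_born columns in each row
born_cols <- select(reunited, ends_with("_born"))
reunited$children.matched <- rowSums(!is.na(born_cols))

checked <- reunited %>%
  mutate(children.elsewhere = number.of.living.children - children.matched)

print(checked)
```

A positive children.elsewhere presumably just means that some of the reported children are not listed in the household with code 3, for example because they have moved out, so the resulting fertility history only covers co-resident children.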

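Now, I've assumed above that left_join() is appropriate and that every child has a mother present, but I'd guess you may have orphans with no mother in the household, so depending on your real data you may need to change the type of join you use. But I think this is what you were after.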
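
To see which children fall through the cracks, something along these lines might help. It is only a sketch on a five-row toy table copied from the example (household 16598050 has a head and a child but no member with code 2). The names people, unmatched_children and all_pairs are mine:

```r
library(dplyr)

# Toy rows copied from the example: one full household, one without a code-2 member
people <- data.frame(
  household.ID  = c(16523634, 16523634, 16523634, 16598050, 16598050),
  Member.ID     = c(16527193, 16529443, 16532250, 16616975, 16620223),
  Relatshinship = c(1, 2, 3, 1, 3),
  age           = c(49, 47, 19, 66, 21)
)

mothers  <- filter(people, Relatshinship == 2)
children <- filter(people, Relatshinship == 3)

# Children whose household has no code-2 member; a left_join() from mothers drops them
unmatched_children <- anti_join(children, mothers, by = "household.ID")
print(unmatched_children)

# full_join() keeps unmatched rows from both sides, with NA where there is no partner
all_pairs <- full_join(
  select(mothers,  household.ID, mother.ID = Member.ID),
  select(children, household.ID, child.ID = Member.ID, child.age = age),
  by = "household.ID"
)
print(all_pairs)
```

Whether those unmatched children should be kept, dropped, or attached to a female head of household is a decision about your data that I can't make from the sample.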
