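I have a sample file containing 2% of the total population, drawn from Iran's 2016 census (1.5 million records out of a total population of 75 million). Here are 22 individuals as an example: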
sample <- structure(list(household.ID = c(16523634, 16523634, 16523634, 16523634,16525912,
16525912, 16540127,16540127, 16598050, 16598050, 16611764,16611764, 16611764, 16643309,
16643309, 16652356, 16652356,16652356, 16672105, 16672105, 16672105,16672105
),Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230,
16542499,16545263, 16616975, 16620223, 16633984,16642611, 16650837, 16646986, 16650210,
16660335, 16665128,16668381, 16676674, 16681528, 16685073,16687491
),Relatshinship= c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),birth.year=
c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,1342,1367,1343,1336 ,1321
,1326,1367,1338,1352,1372,1381),Gender = c(1,2 ,1,2,1,2,1 ,2,1 ,1,1 ,2,1 ,1 ,2 ,1,2 ,2,1
,2,1,1),age = c(49,47,19 ,18,38,28,78,75,66 ,21,56 ,52 ,28,51 ,58 ,74 ,68 ,27 ,56
,43,23 ,13),marriage.stuatus= c(1,1 ,4 ,4 ,1,1,1,1,2,4,1,1 ,1,1 ,1,1,1 ,4,1 ,1,4 ,4),
number.of.children.ever.born= c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA),
number.of.living.children = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,3,NA,3,NA,NA,2,NA)),
row.names = c(NA, -22L),class = "data.frame")
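I want to build a fertility history for women. To do that, I need to match children to their mothers. My data has a column that records each person's relationship to the head of the household: code 1 is the head, 2 is the head's wife, 3 is a child, 4 is a son-in-law or daughter-in-law, 5 is a grandchild, 6 is a father or mother, and so on. For example, the first household in my data (ID: 16523634) has 4 members: the head, who here is a man (Gender code: 1 = male, 2 = female), his wife, and two children, a son (age 19) and a daughter (age 18). Long story short, I need to match the children (who also appear in the data) to the mother in their household, so that each mother gets one row with a column holding the age of each matched child. I would like my data to end up looking like this:
H.ID | M.ID | B.year | Gender | age | first child | second child | third child |
---|---|---|---|---|---|---|---|
16523634 | 16529443 | 1348 | 2 | 47 | 19 | 18 | NA |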
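When I loaded `sample` from the OP, two of the vectors in the data frame had length 21 rather than 22, so I padded them with NAs just to get a valid starting point. Here is what I used: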
library(tidyverse)
sample <-
tibble(
household.ID = c(
16523634,16523634,16523634,16523634,16525912,16525912,
16540127,16540127,16598050,16598050,16611764,16611764,
16611764,16643309,16643309,16652356,16652356,16652356,
16672105,16672105,16672105,16672105),
Member.ID= c(16527193, 16529443, 16532250, 16534992,16527527, 16529230,
16542499,16545263, 16616975, 16620223, 16633984,16642611,
16650837, 16646986, 16650210, 16660335, 16665128,16668381,
16676674, 16681528, 16685073,16687491),
Relatshinship = c(1,2,3,3,1,2,1,2,1,3,1,2,3,1,2,1,2,3,1,2,3,3),
birth.year = c(1346,1348,1376,1377,1357,1367,1316,1319,1329,1374,1339,
1342,1367,1343,1336 ,1321,1326,1367,1338,1352,1372,1381),
Gender = c(1,2,1,2,1,2,1,2,1,1,1,2,1,1,2,1,2,2,1,2,1,1),
age = c(49,47,19,18,38,28,78,75,66,21,56,52,28,51,58,74,68,27,56,43,23,13),
marriage.stuatus= c(1,1,4,4,1,1,1,1,2,4,1,1,1,1,1,1,1,4,1,1,4,4),
number.of.children.ever.born= c(NA,2,NA,NA,NA,NA,NA,6,NA,NA,NA,2,NA,NA,
3,NA,3,NA,NA,2,NA,NA),
number.of.living.children = c(NA,2,NA,NA,NA,NA,NA,4,NA,NA,NA,2,NA,NA,
3,NA,3,NA,NA,2,NA,NA))
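If it helps to see where vectors come up short before padding them, here is a minimal sketch using base R's `lengths()`. The list `cols` and its contents are hypothetical toy data, not the OP's columns, and padding at the end is only an assumption about where the missing values belong.

```r
# Hypothetical miniature of the problem: one vector is shorter than the others
cols <- list(
  id  = c(1, 2, 3, 4),
  age = c(30, 25, 40, 35),
  ceb = c(NA, 2, NA)            # only 3 values instead of 4
)

# Length of every element, and the names of the ones that fall short
col_lengths <- lengths(cols)
target      <- max(col_lengths)
short       <- names(col_lengths)[col_lengths < target]

# Pad the short vectors with trailing NAs (assumes the gap is at the end)
cols[short] <- lapply(cols[short], function(x) c(x, rep(NA, target - length(x))))

fixed <- as.data.frame(cols)
```

With real data it is worth checking where the values actually went missing, since trailing NAs can shift a value onto the wrong person.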
It may seem cruel, but let's start by separating the mothers and the children from their families:
df_mothers <- sample %>%
filter(Relatshinship == 2) %>%
print()
# A tibble: 7 × 9
household.ID Member.ID Relatshinship birth.year Gender age marriage.stuatus number.of.children.ever.born number.of.living.children
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16523634 16529443 2 1348 2 47 1 2 2
2 16525912 16529230 2 1367 2 28 1 NA NA
3 16540127 16545263 2 1319 2 75 1 6 4
4 16611764 16642611 2 1342 2 52 1 2 2
5 16643309 16650210 2 1336 2 58 1 3 3
6 16652356 16665128 2 1326 2 68 1 3 3
7 16672105 16681528 2 1352 2 43 1 2 2
df_children <- sample %>%
filter(Relatshinship == 3) %>%
group_by(household.ID) %>%
arrange(household.ID,desc(age)) %>%
mutate(birth.order = scales::ordinal(row_number())) %>%  # ordinal() is from scales, which library(tidyverse) does not attach
select(-c(marriage.stuatus, number.of.children.ever.born, number.of.living.children)) %>%
print()
# A tibble: 7 × 7
# Groups: household.ID [5]
household.ID Member.ID Relatshinship birth.year Gender age birth.order
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 16523634 16532250 3 1376 1 19 1st
2 16523634 16534992 3 1377 2 18 2nd
3 16598050 16620223 3 1374 1 21 1st
4 16611764 16650837 3 1367 1 28 1st
5 16652356 16668381 3 1367 2 27 1st
6 16672105 16685073 3 1372 1 23 1st
7 16672105 16687491 3 1381 1 13 2nd
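OK, that gets you the children grouped by household with their birth order among the children present, but you want each household's children stuffed into a single row (like my sisters were stuffed into one bedroom), so it's `pivot_wider()` to the rescue: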
df_children_pivot <- df_children %>%
pivot_wider(id_cols = household.ID,
names_from = birth.order,
names_glue = "{birth.order}_born",
values_from = age) %>%
print()
# A tibble: 5 × 3
# Groups: household.ID [5]
household.ID `1st_born` `2nd_born`
<dbl> <dbl> <dbl>
1 16523634 19 18
2 16598050 21 NA
3 16611764 28 NA
4 16652356 27 NA
5 16672105 23 13
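To make the reshaping step easier to follow in isolation, here is a small sketch on hypothetical toy data (the data frame `kids` is made up for illustration). It shows how `names_glue` builds the new column names and how a household without a second child gets `NA` in that column.

```r
library(tidyr)

# Hypothetical toy data: household 1 has two children, household 2 has one
kids <- data.frame(
  household.ID = c(1, 1, 2),
  birth.order  = c("1st", "2nd", "1st"),
  age          = c(19, 18, 21)
)

# One row per household; one column per birth order, filled with the child's age
wide <- pivot_wider(
  kids,
  id_cols     = household.ID,
  names_from  = birth.order,
  names_glue  = "{birth.order}_born",
  values_from = age
)
```

Columns are created in the order the birth-order labels first appear, which is why sorting by age within household beforehand matters.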
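Now, because our cruelty only goes so far, let's take those children we pulled away from their families and stuffed into a single row, and reunite them with their mothers: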
df_reunited <- df_mothers %>%
left_join(df_children_pivot, by = "household.ID") %>%
select(household.ID,Member.ID,birth.year,Gender,age,`1st_born`:last_col()) %>%
print()
# A tibble: 7 × 7
household.ID Member.ID birth.year Gender age `1st_born` `2nd_born`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 16523634 16529443 1348 2 47 19 18
2 16525912 16529230 1367 2 28 NA NA
3 16540127 16545263 1319 2 75 NA NA
4 16611764 16642611 1342 2 52 28 NA
5 16643309 16650210 1336 2 58 NA NA
6 16652356 16665128 1326 2 68 27 NA
7 16672105 16681528 1352 2 43 23 13
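Note that I assumed above that `left_join()` is appropriate and that every child has a mother present in the household. I would guess your real data includes children with no mother in the household, so depending on the data you may need to change the type of join. But I think this is what you are after.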
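One way to see which children would be lost by the join is `anti_join()`, sketched below on hypothetical toy data (the data frame `toy` is made up; only the column names follow the OP's). In the sample above, household 16598050 is such a case: it has a child but no member coded 2, so that child never shows up in `df_reunited`.

```r
library(dplyr)

# Hypothetical toy data: household 2 has a child (code 3) but no wife (code 2)
toy <- data.frame(
  household.ID  = c(1, 1, 1, 2, 2),
  Relatshinship = c(1, 2, 3, 1, 3),
  age           = c(50, 48, 20, 66, 21)
)

mothers  <- filter(toy, Relatshinship == 2)
children <- filter(toy, Relatshinship == 3)

# Children whose household has no matching mother row
unmatched_children <- anti_join(children, mothers, by = "household.ID")
```

If `unmatched_children` is not empty, you can decide whether to drop those children, keep them with a `full_join()`, or handle them separately.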