Fuzzy join two data frames using data.table

Question (votes: 2, answers: 1)

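I have been using fuzzyjoin to join two data frames, but because of memory problems the join fails with cannot allocate memory of…. So I am trying to join the data with data.table instead. A sample of the data is below.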

df1 looks like:

        ID     f_date               ACCNUM    flmNUM start_date   end_date
1    50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30
2  1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19
3   804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05
4  1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13
5  1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20
6   858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24
7     2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17
8  1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03
9  1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10
10   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20

df2 looks like:

     ID       date fyear     at     lt
1 50341 1998-12-31  1998 104382  94973
2 50341 1999-12-31  1999 190692 175385
3 50341 2000-12-31  2000 179519 163347
4 50341 2001-12-31  2001 203638 186030
5 50341 2002-12-31  2002 190453 173620
6 50341 2003-12-31  2003 200235 181955

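I will focus on ID = 50341. If df2$date falls within the period between df1$start_date and df1$end_date, the rows should be joined. Here df2$date = 2002-12-31 lies between the df1 start 2002-09-07 and end 2003-08-30, so this row is joined.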

I run the following code and get the corresponding output:

df1$f_date <- as.Date(df1$f_date)
df2$date <- as.Date(df2$date)

df1$start_date <- df1$f_date + 183
df1$end_date <- df1$f_date + 540

library(fuzzyjoin)
final_data <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  match_fun = list(`==`, `<`, `>=`)
)

final_data

Output:

      ID.x     f_date               ACCNUM    flmNUM start_date   end_date    ID.y       date fyear         at         lt
1    50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30   50341 2002-12-31  2002 190453.000 173620.000
2  1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19 1067983 2010-12-31  2010 372229.000 209295.000
3   804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05  804753 2004-12-31  2004    982.265    383.614
4  1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13 1090727 2013-12-31  2013  36212.000  29724.000
5  1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20 1467858 2010-12-31  2010 138898.000 101739.000
6   858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24      NA       <NA>    NA         NA         NA
7     2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17    2488 2016-12-31  2016   3321.000   2905.000
8  1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03      NA       <NA>    NA         NA         NA
9  1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10 1467858 2017-12-31  2017 212482.000 176282.000
10   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20   14693 2016-04-30  2015   4183.000   2621.000

Here we can see that ID = 50341 is joined correctly.

When I try the data.table approach, I get this output:

Code:

library(data.table)
dt_final_data <- setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]

Output:

         ID       date fyear         at         lt     date.1     f_date               ACCNUM    flmNUM
 1:   50341 2002-09-07  2002 190453.000 173620.000 2003-08-30 2002-03-08 0001104659-02-000656   2571187
 2: 1067983 2010-05-27  2010 372229.000 209295.000 2011-05-19 2009-11-25 0001047469-09-010426  91207220
 3:  804753 2004-11-13  2004    982.265    383.614 2005-11-05 2004-05-14 0001193125-04-088404   4805453
 4: 1090727 2013-11-21  2013  36212.000  29724.000 2014-11-13 2013-05-22 0000712515-13-000022  13865105
 5: 1467858 2010-08-28  2010 138898.000 101739.000 2011-08-20 2010-02-26 0001193125-10-043035  10640035
 6:  858877 2019-08-02    NA         NA         NA 2020-07-24 2019-01-31 0001166691-19-000005  19556540
 7:    2488 2016-08-25  2016   3321.000   2905.000 2017-08-17 2016-02-24 0001193125-16-476010 161452982
 8: 1478242 2004-09-11    NA         NA         NA 2005-09-03 2004-03-12 0001193125-04-039482   4664082
 9: 1467858 2017-08-18  2017 212482.000 176282.000 2018-08-10 2017-02-16 0001555280-17-000044  17618235
10:   14693 2016-04-28  2015   4183.000   2621.000 2017-04-20 2015-10-28 0001193125-15-356351 151180619
dt_final_data

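start_date from df1 has now become date, and end_date from df1 has become date.1. As a result, the original date column from df2 is gone, and it is one of the important dates for checking whether the merge makes sense.

Two questions:

How can I keep all the date columns, as in the fuzzyjoin example? The way data.table renames the columns makes checking the join a bit confusing.

Is the code/logic correct? I have looked over the joined data several times and it "looks" correct.

Data 1: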

df1 <- 
    structure(list(ID = c(50341L, 1067983L, 804753L, 1090727L, 1467858L, 
858877L, 2488L, 1478242L, 1467858L, 14693L), f_date = structure(c(11754, 
14573, 12552, 15847, 14666, 17927, 16855, 12489, 17213, 16736
), class = "Date"), ACCNUM = c("0001104659-02-000656", "0001047469-09-010426", 
"0001193125-04-088404", "0000712515-13-000022", "0001193125-10-043035", 
"0001166691-19-000005", "0001193125-16-476010", "0001193125-04-039482", 
"0001555280-17-000044", "0001193125-15-356351"), flmNUM = c(2571187L, 
91207220L, 4805453L, 13865105L, 10640035L, 19556540L, 161452982L, 
4664082L, 17618235L, 151180619L), 
start_date = structure(c(11937, 14756, 12735, 16030, 14849, 18110, 17038, 
                         12672, 17396, 16919), class = "Date"), 
end_date = structure(c(12294, 15113, 13092, 16387, 15206, 18467, 17395, 13029,
                       17753, 17276), class = "Date")
), row.names = c(NA, -10L), class = "data.frame")

Data 2:

df2 <-
    structure(list(ID = c(2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 14693L, 14693L, 
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 
14693L, 14693L, 14693L, 50341L, 50341L, 50341L, 50341L, 50341L, 
50341L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 
1478242L, 1478242L, 1478242L, 1478242L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L), date = structure(c(10591, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 10346, 10711, 11077, 11442, 
11807, 12172, 12538, 12903, 13268, 13633, 13999, 14364, 14729, 
15094, 15460, 15825, 16190, 16555, 16921, 17286, 17651, 10591, 
10956, 11322, 11687, 12052, 12417, 10591, 10956, 11322, 11687, 
12052, 12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974, 
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896, 10591, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 14609, 14974, 15339, 15705, 
16070, 16435, 16800, 17166, 17531, 17896, 10438, 10803, 11169, 
11534, 11899, 12264, 12630, 12995, 13360, 13725, 14091, 14456, 
14821, 15186, 15552, 15917, 16282, 16647, 17013, 17378, 17743
), class = "Date"), fyear = c(1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 
2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2018L), at = c(4252.968, 4377.698, 
5767.735, 5647.242, 5619.181, 7094.345, 7844.21, 7287.779, 13147, 
11550, 7675, 9078, 4964, 4954, 4000, 4337, 3767, 3109, 3321, 
3540, 4556, 122237, 131416, 135792, 162752, 169544, 180559, 188874, 
198325, 248437, 273160, 267399, 297119, 372229, 392647, 427452, 
484931, 526186, 552257, 620854, 702095, 707794, 1494, 1735, 1802, 
1939, 2016, 2264, 2376, 2624, 2728, 3551, 3405, 3475, 3383, 3712, 
3477, 3626, 4103, 4193, 4183, 4625, 4976, 104382, 190692, 179519, 
203638, 190453, 200235, 257389, 274730, 303100, 323969, 370782, 
448507, 479921, 476078, 186192, 148883, 91047, 136295, 138898, 
144603, 149422, 166344, 177677, 194520, 221690, 212482, 227339, 
17067, 23043, 21662, 24636, 26357, 28909, 33026, 35222, 33210, 
39042, 31879, 31883, 33597, 34701, 38863, 36212, 35471, 38311, 
40377, 45403, 50016, 436.485, 660.891, 616.411, 712.302, 779.279, 
859.34, 982.265, 1303.629, 1491.39, 1689.956, 1880.988, 2148.567, 
2422.79, 3000.358, 3704.468, 4098.364, 4530.565, 5561.984, 5629.963, 
6469.311, 6708.636, NA, NA, 2322.917, 2499.153, 3066.797, 3305.832, 
3926.316, 21208, 22742, 22549, 8916.705, 14725, 32870, 35238, 
37795, 37107, 35594, 33883, 43315, 53340, 58734, 68128, 81130, 
87095, 91759, 101191, 105134, 113481, 121652, 129818, 108784), 
    lt = c(2247.919, 2398.425, 2596.068, 2092.187, 3151.916, 
    3938.395, 3993.516, 3700.954, 7072, 8295, 7588, 7354, 3951, 
    3364, 3462, 3793, 3580, 3521, 2905, 2929, 3290, 63190, 72232, 
    72799, 103453, 104116, 102218, 102216, 106025, 137756, 149759, 
    153820, 161334, 209295, 223686, 235864, 260446, 283159, 293630, 
    334495, 350141, 355294, 677, 818, 754, 752, 705, 1424, 1291, 
    1314, 1165, 1978, 1680, 1659, 1488, 1652, 1408, 1998, 2071, 
    2288, 2621, 3255, 3660, 94973, 175385, 163347, 186030, 173620, 
    181955, 241738, 253490, 272218, 303516, 363134, 422932, 452164, 
    460442, 190443, 184363, 176387, 107340, 101739, 105612, 112422, 
    123170, 141653, 154197, 177615, 176282, 184562, 9894, 10569, 
    11927, 14388, 13902, 14057, 16642, 18338, 17728, 26859, 25099, 
    24187, 25550, 27593, 34130, 29724, 33313, 35820, 39948, 44373, 
    46979, 165.342, 281.954, 272.694, 317.463, 338.035, 363.494, 
    383.614, 541.81, 571.972, 556.242, 568.693, 567.769, 517.373, 
    689.557, 870.818, 930.7, 964.597, 1691.6, 1702.016, 1683.963, 
    1780.247, NA, NA, 3292.513, 3858.197, 3734.282, 4009.844, 
    4261.997, 12348, 14384, 15595, 1766.98, 3003, 6328, 8096, 
    9124, 9068, 9678, 10699, 19397, 21850, 24332, 29451, 36845, 
    39836, 40458, 42063, 48473, 53774, 58067, 63681, 65580)), row.names = c(NA, 
-163L), class = "data.frame")
Tags: r data.table fuzzyjoin

1 answer (1 vote)

To clarify terminology:

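The data.table approach to your problem does not require a fuzzy join [at least not in the sense of inexact matching]. Instead, you just want to join on data.table columns using the non-equi binary operators >=, >, <= and/or <. In data.table terminology these are called "non-equi joins".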
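As a minimal sketch of what a non-equi join looks like, here is a toy example. The tables windows and obs and their column names are made up for illustration and are not part of the question's data; the join condition mirrors the one in the question (equality on the id, a strict lower bound and an inclusive upper bound on the date).

```r
library(data.table)

# Hypothetical toy data: one date window per id, several dated observations.
windows <- data.table(
  id        = c(1L, 2L),
  win_start = as.Date(c("2020-01-01", "2020-01-01")),
  win_end   = as.Date(c("2020-06-30", "2020-06-30"))
)
obs <- data.table(
  id    = c(1L, 1L, 2L),
  date  = as.Date(c("2019-12-31", "2020-03-31", "2021-03-31")),
  value = c(10, 20, 30)
)

# Non-equi join: equality on id, inequalities on the window bounds.
# x.date asks for the date column from obs, the x side of obs[windows, ...].
res <- obs[windows,
           .(id, win_start, win_end, obs_date = x.date, value),
           on = .(id, date > win_start, date <= win_end)]
print(res)
```

Every row of windows is kept (the default nomatch = NA behaves like a left join onto windows), so id 2, which has no observation inside its window, comes back with NA in obs_date and value.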

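Since your first attempt used library(fuzzyjoin), it is understandable that you titled your question "Fuzzy join two data frames using data.table". (No problem, just clarifying for readers.)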

Solution using data.table non equi joins to compare date columns:

You were very close to a working data.table solution:

dt_final_data <- setDT(df2)[df1, 
                            on = .(ID, date > start_date, date <= end_date)]

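To make it work as expected, just add a data.table j expression that selects the columns you want, in the order you want them, and prefix the problem column with x. (this tells data.table to return the column from the x side of the dt_x[dt_i, ] join). For example, refer to the column as x.date, as shown here: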

dt_final_data <- setDT(df2)[df1, 
                            .(ID, f_date, ACCNUM, flmNUM, start_date, end_date, x.date, fyear, at, lt), 
                            on = .(ID, date > start_date, date <= end_date)]

This now gives you the following output:

dt_final_data
         ID     f_date               ACCNUM    flmNUM start_date   end_date     x.date fyear         at         lt
 1:   50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30 2002-12-31  2002 190453.000 173620.000
 2: 1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19 2010-12-31  2010 372229.000 209295.000
 3:  804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05 2004-12-31  2004    982.265    383.614
 4: 1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13 2013-12-31  2013  36212.000  29724.000
 5: 1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20 2010-12-31  2010 138898.000 101739.000
 6:  858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24       <NA>    NA         NA         NA
 7:    2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2016-12-31  2016   3321.000   2905.000
 8: 1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03       <NA>    NA         NA         NA
 9: 1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10 2017-12-31  2017 212482.000 176282.000
10:   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 2016-04-30  2015   4183.000   2621.000

As shown above, the result for ID = 50341 now has date = 2002-12-31. In other words, the date column in the result (shown as x.date) now comes from df2$date.
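On the second question (whether the logic is correct): the on clause date > start_date, date <= end_date expresses the same condition as the fuzzyjoin call, where start_date < date and end_date >= date. Once the real date from the x side is kept, the join can also be checked mechanically instead of by eye. The following is only a sketch on hypothetical toy tables (windows, obs and their columns are assumed names, not the question's data):

```r
library(data.table)

# Hypothetical stand-ins for df1 (windows) and df2 (observations).
windows <- data.table(
  id        = c(1L, 2L),
  win_start = as.Date(c("2002-09-07", "2019-08-02")),
  win_end   = as.Date(c("2003-08-30", "2020-07-24"))
)
obs <- data.table(
  id    = c(1L, 1L, 1L),
  date  = as.Date(c("2001-12-31", "2002-12-31", "2003-12-31")),
  fyear = c(2001L, 2002L, 2003L)
)

joined <- obs[windows,
              .(id, win_start, win_end, obs_date = x.date, fyear),
              on = .(id, date > win_start, date <= win_end)]

# Every matched row must satisfy win_start < obs_date <= win_end.
matched   <- joined[!is.na(obs_date)]
in_window <- matched$obs_date > matched$win_start &
             matched$obs_date <= matched$win_end
stopifnot(all(in_window))

# Windows without any observation are kept, with NA on the obs side.
n_unmatched <- joined[is.na(obs_date), .N]
print(joined)
```

The same stopifnot() check can be run on dt_final_data by substituting start_date, end_date and x.date for the toy column names.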

You can of course rename the x.date column in the j expression:

setDT(df2)[ df1, 
            .(ID, 
              f_date, 
              ACCNUM, 
              flmNUM, 
              start_date, 
              end_date, 
              my_result_date_name = x.date, 
              fyear, 
              at, 
              lt), 
            on = .(ID, date > start_date, date <= end_date)]
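If listing every column in j is inconvenient, another option that is sometimes used is to duplicate the date column before joining, so that one copy takes part in the join and the other copy survives untouched. This is a sketch on hypothetical toy data; obs_date is an assumed helper column name, not something from the question.

```r
library(data.table)

# Hypothetical toy data.
windows <- data.table(
  id        = 1L,
  win_start = as.Date("2020-01-01"),
  win_end   = as.Date("2020-06-30")
)
obs <- data.table(
  id    = c(1L, 1L),
  date  = as.Date(c("2019-12-31", "2020-03-31")),
  value = c(10, 20)
)

# Keep an untouched copy of the date; only `date` is used in the join.
obs[, obs_date := date]

res <- obs[windows, on = .(id, date > win_start, date <= win_end)]

# obs_date still holds the real observation date from obs.
print(res[, .(id, obs_date, value)])
```

Note that := adds the column to obs by reference, so the helper column stays on the table after the join unless it is removed again.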

Why does data.table (currently) rename columns in non-equi joins and return data from a different column:

This explanation from @ScottRitchie sums it up very well:

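When performing any join, only one copy of each key column is returned in the result. Currently, the column from i is returned, and labelled with the column name from x, making equi joins consistent with the behaviour of base merge().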

The above makes sense if you remember that, prior to version 1.9.8, data.table had no non-equi joins.
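The effect can be reproduced on a tiny example. This sketch assumes a data.table version that still behaves as described in the question (1.12.2 does; later versions may differ), and the tables windows and obs are hypothetical:

```r
library(data.table)

# Hypothetical toy data.
windows <- data.table(
  id        = 1L,
  win_start = as.Date("2020-01-01"),
  win_end   = as.Date("2020-06-30")
)
obs <- data.table(
  id    = 1L,
  date  = as.Date("2020-03-31"),
  value = 20
)

# No j expression: the join columns are returned the default way.
res_default <- obs[windows, on = .(id, date > win_start, date <= win_end)]
print(res_default)

# The column labelled `date` carries the value of win_start from i,
# not the observation date 2020-03-31 from obs.
print(res_default$date)
```

Under that assumption, the column labelled date holds 2020-01-01 (the window start) and a second column holds the window end, which is exactly the renaming seen in the question.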

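Up to and including the current 1.12.2 version of data.table, this (and several overlapping issues) has been the source of much discussion on the data.table GitHub issue list. For example, "possible inconsistency in non-equi join, returning join columns #3437" and "SQL-like column return for non-equi and rolling joins #2706" are just two of them.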

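However, keep an eye on this GitHub issue: following on from the discussions above, the keen analytical minds of the data.table team are working to make this less confusing in some (hopefully not too distant) future version: Both columns for rolling and non-equi joins #3093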
