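I'm trying to use sqldf to filter a data frame by a date range, as in the example code below. My data looks like the sample data below. The datedf data frame returned by sqldf has no records, even though the SHV data frame does contain records in that date range. Can anyone see what I'm doing wrong and tell me how to filter by a date range in sqldf? Dates are always tricky for me.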
Code:
datedf<-sqldf("select field1
,fieldDate
from SHV
where fieldDate between '2004-01-01' and '2005-01-01'
")
Data:
dput(SHV[1:50,c("field1","fieldDate")])
structure(list(field1 = c(1378L, 1653L, 1882L, 2400L,
2305L, 2051L, 2051L, 2051L, 1796L, 2054L, 2568L, 1290L, 1804L,
1804L, 3855L, 1297L, 2321L, 2321L, 2321L, 2071L, 2071L, 2074L,
2588L, 1567L, 1317L, 1317L, 808L, 808L, 1321L, 2350L, 1586L,
2613L, 1590L, 2614L, 2107L, 1340L, 1085L, 1085L, 2365L, 1344L,
1601L, 1858L, 1603L, 1603L, 1860L, 2376L, 1355L, 1867L, 2382L,
1872L), fieldDate = structure(c(12551, NA, NA, 14057, 15337,
12919, 13336, 10325, 14984, 15643, 12864, 11242, 10749, 11207,
10602, NA, 12646, 15649, NA, NA, NA, NA, NA, 17015, 13938, NA,
16693, NA, NA, 12634, 12614, 10689, 12755, 10844, 11375, 4899,
17298, 10905, 11450, NA, 10330, 15429, 12634, 10504, 12625, 11081,
10939, NA, 12934, 11176), class = "Date")), .Names = c("field1",
"fieldDate"), row.names = c(NA, 50L), class = "data.frame")
In this data sample, you have no records in the date range tested below (the rows of NA come from the missing dates):
SHV[SHV$fieldDate >= "2010-01-01" & SHV$fieldDate < "2011-01-01",]
field1 fieldDate
NA NA <NA>
NA.1 NA <NA>
NA.2 NA <NA>
NA.3 NA <NA>
NA.4 NA <NA>
NA.5 NA <NA>
NA.6 NA <NA>
NA.7 NA <NA>
NA.8 NA <NA>
NA.9 NA <NA>
NA.10 NA <NA>
NA.11 NA <NA>
NA.12 NA <NA>
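The rows of NA appear because comparing a missing date yields NA, and indexing a data frame with NA returns an all-NA row. A minimal base R sketch, using a hypothetical three-row stand-in called SHV_small (not the full data from the question), shows one way to drop them by testing for missing dates first:

```r
# Hypothetical three-row stand-in for SHV (values copied from the sample data)
SHV_small <- data.frame(
  field1 = c(1378L, 1653L, 2400L),
  fieldDate = as.Date(c(12551, NA, 14057), origin = "1970-01-01")
)

# Exclude missing dates explicitly so they do not turn into NA rows
in_range <- !is.na(SHV_small$fieldDate) &
  SHV_small$fieldDate >= as.Date("2004-01-01") &
  SHV_small$fieldDate <= as.Date("2005-01-01")

in_range_rows <- SHV_small[in_range, ]
in_range_rows
```

Wrapping the condition in which() would have the same effect, since which() discards NA values.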
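According to the sqldf() documentation, dates need to be formatted as their numeric values in order to be processed as dates. This can be done with sprintf() when generating the SQL query.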
SHV <- structure(list(field1 = c(1378L, 1653L, 1882L, 2400L,
2305L, 2051L, 2051L, 2051L, 1796L, 2054L, 2568L, 1290L, 1804L,
1804L, 3855L, 1297L, 2321L, 2321L, 2321L, 2071L, 2071L, 2074L,
2588L, 1567L, 1317L, 1317L, 808L, 808L, 1321L, 2350L, 1586L,
2613L, 1590L, 2614L, 2107L, 1340L, 1085L, 1085L, 2365L, 1344L,
1601L, 1858L, 1603L, 1603L, 1860L, 2376L, 1355L, 1867L, 2382L,
1872L), fieldDate = structure(c(12551, NA, NA, 14057, 15337,
12919, 13336, 10325, 14984, 15643, 12864, 11242, 10749, 11207,
10602, NA, 12646, 15649, NA, NA, NA, NA, NA, 17015, 13938, NA,
16693, NA, NA, 12634, 12614, 10689, 12755, 10844, 11375, 4899,
17298, 10905, 11450, NA, 10330, 15429, 12634, 10504, 12625, 11081,
10939, NA, 12934, 11176), class = "Date")), .Names = c("field1",
"fieldDate"), row.names = c(NA, 50L), class = "data.frame")
library(sqldf)
sqlStmt <- paste("select field1, fieldDate from SHV",
"where fieldDate between ",
sprintf("%d and %d",as.Date('2004-01-01','%Y-%m-%d'),
as.Date('2005-01-01','%Y-%m-%d')))
datedf<-sqldf(sqlStmt)
datedf
> datedf
field1 fieldDate
1 1378 2004-05-13
2 2321 2004-08-16
3 2350 2004-08-04
4 1586 2004-07-15
5 1590 2004-12-03
6 1603 2004-08-04
7 1860 2004-07-26
>
The sprintf() statement converts the dates to numeric values, which ensures that the between operator in SQL works correctly.
> sqlStmt
[1] "select field1, fieldDate from SHV where fieldDate between 12418 and 12784"
>
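The numbers in that query are simply the dates counted in days since 1970-01-01, which is how R stores the Date class internally. The following base R sketch (the variable names lower, upper, query, and first_date are illustrative, not part of the original answer) rebuilds the same bounds and converts one stored value from the sample data back to a date:

```r
# A Date is stored as the number of days since 1970-01-01
lower <- as.integer(as.Date("2004-01-01"))
upper <- as.integer(as.Date("2005-01-01"))

# Build the same query text with the numeric bounds
query <- sprintf(
  "select field1, fieldDate from SHV where fieldDate between %d and %d",
  lower, upper
)
query

# Going the other way: the first fieldDate value in the sample data
first_date <- as.Date(12551, origin = "1970-01-01")
first_date
```

Converting with as.integer() before calling sprintf() makes the numeric intent explicit rather than relying on how sprintf() coerces a Date object.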
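According to this article, date fields should be converted to character before running sqldf.
Before we pass any dates to SQLdf, we need to convert them to strings first. Otherwise, SQLdf will try to treat them as numbers, which causes a lot of heartache.
...
Instead, we should convert the DateCreated column to a string rather than a date. Then SQL will actually convert it from a string to a date.
Confused? Imagine me when I was trying to figure this out myself.
So your code could be: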
SHV$fieldDate <- as.character(SHV$fieldDate)
datedf <- sqldf("
SELECT
field1,
fieldDate
FROM SHV
WHERE fieldDate between '2004-01-01' and '2005-01-01'
--WHERE '2004-01-01' <= fieldDate --and fieldDate <= '2005-01-01'
ORDER BY fieldDate
")
# Both should equal 7. Verify that null rows are handled as desired.
nrow(datedf)
sum(as.Date('2004-01-01') <= SHV$fieldDate & SHV$fieldDate <= as.Date('2005-01-01'), na.rm=TRUE)
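I wish the article explained better when a variable holding dates gets converted to an actual date. If you're looking for more, this SO response by @g-grothendieck takes a different approach and equates the data types within the sqldf query.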
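One way to reason about the character-conversion approach: dates written as YYYY-MM-DD sort in the same order as text as they do as dates, so a string comparison against '2004-01-01' and '2005-01-01' picks out the same rows. This is a small base R sketch of that idea with made-up dates, not a description of what sqldf does internally:

```r
# Illustrative dates only; not taken from the question's data
dates <- as.Date(c("2003-12-31", "2004-05-13", "2005-01-01", "2005-01-02"))
as_text <- as.character(dates)

# Compare as text, then as real dates
text_match <- as_text >= "2004-01-01" & as_text <= "2005-01-01"
date_match <- dates >= as.Date("2004-01-01") & dates <= as.Date("2005-01-01")

# TRUE when both comparisons select the same rows
identical(text_match, date_match)
```

This only holds when every value uses the same zero-padded year-month-day layout; other date formats would not compare correctly as strings.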