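Expand ranges defined by "from" and "to" columns

Question (votes: 0, answers: 11)

This question is also known as "converting a start-end dataset to a panel dataset".

I have a data frame that contains the "name" of US presidents and the years in which their terms started and ended (the "from" and "to" columns). Here is a sample: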

presidents <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to = c(2001, 2009, 2012)
)
presidents
#>             name from   to
#> 1   Bill Clinton 1993 2001
#> 2 George W. Bush 2001 2009
#> 3   Barack Obama 2009 2012

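I want to create a data frame with two columns ("name" and "year"), with one row for each year a president was in office. So I need to create a regular sequence with every year from "from" to "to". Here is my expected output: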

name           year
Bill Clinton   1993
Bill Clinton   1994
...
Bill Clinton   2000
Bill Clinton   2001
George W. Bush 2001
George W. Bush 2002
... 
George W. Bush 2008
George W. Bush 2009
Barack Obama   2009
Barack Obama   2010
Barack Obama   2011
Barack Obama   2012

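I know that I can use

data.frame(name = "Bill Clinton", year = seq(1993, 2001))

to expand things for a single president, but I can't figure out how to iterate over every president.

How do I do this? I feel like I should know this, but I'm drawing a blank.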

Update 1

OK, I've tried both solutions, and I'm getting an error:

foo<-structure(list(name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"), from = c(1885, 1889, 1893), to = c(1889, 1893, 1897)), .Names = c("name", "from", "to"), row.names = 22:24, class = "data.frame")
ddply(foo, "name", summarise, year = seq(from, to))
Error in seq.default(from, to) : 'from' must be of length 1

Tags: r, dataframe, sequence

11 answers

20 votes

Here's a data.table solution. It has the nice (if minor) feature of leaving the presidents in the order in which they were supplied:

library(data.table)
dt <- data.table(presidents)
dt[, list(year = seq(from, to)), by = name]
#               name year
#  1:   Bill Clinton 1993
#  2:   Bill Clinton 1994
#  ...
#  ...
# 21:   Barack Obama 2011
# 22:   Barack Obama 2012

Edit: To handle presidents with non-consecutive terms, use this instead:

dt[, list(year = seq(from, to)), by = c("name", "from")]
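
To see why the extra grouping column matters, here is a minimal base R sketch (not the data.table code above) using the foo data from the question's "Update 1". Grouping by name alone puts both Cleveland terms in one group, so from and to have length 2 there; keying on name plus from makes every group a single row. The key, groups and expanded names are only for illustration.

```r
foo <- data.frame(
  name = c("Grover Cleveland", "Benjamin Harrison", "Grover Cleveland"),
  from = c(1885, 1889, 1893),
  to   = c(1889, 1893, 1897),
  stringsAsFactors = FALSE
)

# Grouping by name alone: the Cleveland group holds two start years,
# which is what makes seq(from, to) complain.
cleveland <- foo[foo$name == "Grover Cleveland", ]
length(cleveland$from)

# Grouping by name AND from: every group is exactly one row.
key <- paste(foo$name, foo$from)
groups <- split(foo, factor(key, levels = unique(key)))
expanded <- do.call(rbind, lapply(groups, function(g) {
  data.frame(name = g$name, year = seq(g$from, g$to),
             stringsAsFactors = FALSE)
}))
rownames(expanded) <- NULL
expanded
```

This assumes no president has two terms starting in the same year; if that could happen, a row index would be a safer key.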

18 votes

You can use the plyr package:

library(plyr)
ddply(presidents, "name", summarise, year = seq(from, to))
#              name year
# 1    Barack Obama 2009
# 2    Barack Obama 2010
# 3    Barack Obama 2011
# 4    Barack Obama 2012
# 5    Bill Clinton 1993
# 6    Bill Clinton 1994
# [...]

If it's important that the data be sorted by year, you can use the arrange function:

df <- ddply(presidents, "name", summarise, year = seq(from, to))
arrange(df, df$year)
#              name year
# 1    Bill Clinton 1993
# 2    Bill Clinton 1994
# 3    Bill Clinton 1995
# [...]
# 21   Barack Obama 2011
# 22   Barack Obama 2012

Edit 1: Following @edgester's "Update 1", a more appropriate approach is to use adply to account for presidents with non-consecutive terms:

adply(foo, 1, summarise, year = seq(from, to))[c("name", "year")]

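14 votes

An alternative tidyverse approach using unnest and map2. No matter how many data columns you have (e.g. name), they will all appear correctly in the new data frame.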

library(tidyverse)

presidents %>%
  mutate(year = map2(from, to, seq)) %>%
  unnest(year) %>%
  select(-from, -to)

#              name  year
# 1    Bill Clinton  1993
# 2    Bill Clinton  1994
# ...
# 21   Barack Obama  2011
# 22   Barack Obama  2012

Prior to tidyr v1.0.0, one could create the variable as part of unnest().

presidents %>%
  unnest(year = map2(from, to, seq)) %>%
  select(-from, -to)

8 votes

Here's a dplyr solution:

library(dplyr)

# the data
presidents <- 
structure(list(name = c("Bill Clinton", "George W. Bush", "Barack Obama"
), from = c(1993, 2001, 2009), to = c(2001, 2009, 2012)), .Names = c("name", 
"from", "to"), row.names = 42:44, class = "data.frame")

# the expansion of the table
presidents %>%
    rowwise() %>%
    do(data.frame(name = .$name, year = seq(.$from, .$to, by = 1)))

# the output
Source: local data frame [22 x 2]
Groups: <by row>

             name  year
            (chr) (dbl)
1    Bill Clinton  1993
2    Bill Clinton  1994
3    Bill Clinton  1995
4    Bill Clinton  1996
5    Bill Clinton  1997
6    Bill Clinton  1998
7    Bill Clinton  1999
8    Bill Clinton  2000
9    Bill Clinton  2001
10 George W. Bush  2001
..            ...   ...

h/t: https://stackoverflow.com/a/24804470/1036500


7 votes

Two base solutions (d is the input data frame, e.g. presidents).

Using sequence:

len = d$to - d$from + 1
data.frame(name = d$name[rep(1:nrow(d), len)], year = sequence(len, d$from))
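
As far as I know, the two-argument form sequence(len, d$from) is only available from R 4.0.0 onward. On older versions, a sketch like the following should give the same years by shifting the one-argument sequence(); d is assumed here to be the presidents data from the question.

```r
d <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012),
  stringsAsFactors = FALSE
)

len <- d$to - d$from + 1

# sequence(len) counts 1..len[i] within each range; adding the repeated
# start year (minus 1) turns that into from[i]..to[i].
year <- rep(d$from, len) + sequence(len) - 1

expanded <- data.frame(name = rep(d$name, len), year = year)
expanded
```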

Using mapply:

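# SIMPLIFY = FALSE keeps the result a list even when all ranges have equal length
l <- mapply(`:`, d$from, d$to, SIMPLIFY = FALSE)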
data.frame(name = d$name[rep(1:nrow(d), lengths(l))], year = unlist(l))

#              name year
# 1    Bill Clinton 1993
# 2    Bill Clinton 1994
# ...snip
# 8    Bill Clinton 2000
# 9    Bill Clinton 2001
# 10 George W. Bush 2001
# 11 George W. Bush 2002
# ...snip
# 17 George W. Bush 2008
# 18 George W. Bush 2009
# 19   Barack Obama 2009
# 20   Barack Obama 2010
# 21   Barack Obama 2011
# 22   Barack Obama 2012

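As @Esteis noted in a comment, when expanding ranges there may well be several columns that need to be repeated (not just "name", as in the OP). In that case, instead of repeating the values of a single column, simply repeat the rows of the whole data frame, except for the "from" and "to" columns. A simple example: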

d = data.frame(x = 1:2, y = 3:4, names = c("a", "b"),
               from = c(2001, 2011), to = c(2003, 2012))
#   x y names from   to
# 1 1 3     a 2001 2003
# 2 2 4     b 2011 2012

len = d$to - d$from + 1

cbind(d[rep(1:nrow(d), len), setdiff(names(d), c("from", "to"))],
      year = sequence(len, d$from))

    x y names year
1   1 3     a 2001
1.1 1 3     a 2002
1.2 1 3     a 2003
2   2 4     b 2011
2.1 2 4     b 2012

3 votes

Here's a quick base-R solution, where Df is your data.frame:

do.call(rbind, apply(Df, 1, function(x) {
  data.frame(name=x[1], year=seq(x[2], x[3]))}))

It gives some warnings about row names, but appears to return the correct data.frame.
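
The warnings presumably come from x[1] carrying the column name as a length-one names attribute; apply() also converts each row to character first. A minimal sketch that sidesteps both by iterating over the columns with Map instead (Df is assumed to be the presidents data from the question; pieces and out are illustrative names):

```r
Df <- data.frame(
  name = c("Bill Clinton", "George W. Bush", "Barack Obama"),
  from = c(1993, 2001, 2009),
  to   = c(2001, 2009, 2012),
  stringsAsFactors = FALSE
)

# Map walks the three columns in parallel, so from/to stay numeric
# and no row is coerced to a character vector.
pieces <- Map(
  function(n, f, t) data.frame(name = n, year = seq(f, t),
                               stringsAsFactors = FALSE),
  Df$name, Df$from, Df$to
)

# Drop the list names so rbind does not build row names from them.
out <- do.call(rbind, unname(pieces))
rownames(out) <- NULL
out
```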


2 votes

Another option using tidyverse could be to gather the data into long format, group_by name, and create a sequence between the from and to dates.

library(tidyverse)

presidents %>%
  gather(key, date, -name) %>%
  group_by(name) %>%
  complete(date = seq(date[1], date[2]))%>%
  select(-key) 

# A tibble: 22 x 2
# Groups:   name [3]
#   name          date
#   <chr>        <dbl>
# 1 Barack Obama  2009
# 2 Barack Obama  2010
# 3 Barack Obama  2011
# 4 Barack Obama  2012
# 5 Bill Clinton  1993
# 6 Bill Clinton  1994
# 7 Bill Clinton  1995
# 8 Bill Clinton  1996
# 9 Bill Clinton  1997
#10 Bill Clinton  1998
# … with 12 more rows

2 votes

Another solution using dplyr and tidyr. It correctly preserves any data columns you have.

library(magrittr) # for pipes

df <- data.frame(
    tata = c('toto1', 'toto2'),
    from = c(2000, 2004),
    to = c(2001, 2009),
    measure1 = rnorm(2),
    measure2 = 10 * rnorm(2)
)

   tata from   to measure1 measure2
1 toto1 2000 2001   -0.575   -10.13
2 toto2 2004 2009   -0.258    17.37

df %>% 
  dplyr::rowwise() %>%
  dplyr::mutate(year = list(seq(from, to))) %>%
  dplyr::select(-from, -to) %>%
  tidyr::unnest(c(year))

# A tibble: 8 x 4
  tata  measure1 measure2  year
  <chr>    <dbl>    <dbl> <int>
1 toto1   -0.575    -10.1  2000
2 toto1   -0.575    -10.1  2001
3 toto2   -0.258     17.4  2004
4 toto2   -0.258     17.4  2005
5 toto2   -0.258     17.4  2006
6 toto2   -0.258     17.4  2007
7 toto2   -0.258     17.4  2008
8 toto2   -0.258     17.4  2009

1 vote

Use by to create a by list L of data.frames, one data.frame per president, and then rbind them together. No packages are used.

L <- by(presidents, presidents$name, with, data.frame(name, year = from:to))
do.call("rbind", setNames(L, NULL))

If you don't mind the row names, the last line can be simplified to:

do.call("rbind", L)

1 vote

An addition to the tidyverse solutions could be:

df %>%
 uncount(to - from + 1) %>%
 group_by(name) %>%
 transmute(year = seq(first(from), first(to)))

   name            year
   <chr>          <dbl>
 1 Bill Clinton    1993
 2 Bill Clinton    1994
 3 Bill Clinton    1995
 4 Bill Clinton    1996
 5 Bill Clinton    1997
 6 Bill Clinton    1998
 7 Bill Clinton    1999
 8 Bill Clinton    2000
 9 Bill Clinton    2001
10 George W. Bush  2001

0 votes

Here's another base R solution that should be pretty fast:

a <- lapply(1:nrow(presidents),function(a){
  data.frame(
    name=rep(presidents$name[a],presidents$to[a]-presidents$from[a]+1),
    year=presidents$from[a]:presidents$to[a]
    )
  })
do.call('rbind',a)

Any other data you want to keep can be handled just like the name column.
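
A rough sketch of what that could look like, assuming a small data frame d with a hypothetical extra column party that is not in the original data. Rather than listing each column by hand, it repeats the whole row for every year and drops from and to.

```r
# `party` and its values are made up purely for illustration.
d <- data.frame(
  name  = c("a", "b"),
  party = c("X", "Y"),
  from  = c(2001, 2011),
  to    = c(2003, 2012),
  stringsAsFactors = FALSE
)

keep <- setdiff(names(d), c("from", "to"))

pieces <- lapply(seq_len(nrow(d)), function(i) {
  years <- d$from[i]:d$to[i]
  # Repeat row i once per year, keeping every non-range column.
  data.frame(d[rep(i, length(years)), keep, drop = FALSE],
             year = years, row.names = NULL)
})

out <- do.call(rbind, pieces)
out
```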

I was curious how it would benchmark against Jason Morgan's other base R solution above:

f_max <- function(x){
  a <- lapply(1:nrow(x),function(a){
    data.frame(
      name=rep(x$name[a],x$to[a]-x$from[a]+1),
      year=x$from[a]:x$to[a]
    )
  })
  do.call(rbind,a)
}

f_jason <- function(y){
  do.call(rbind, apply(y, 1, function(x) {
    data.frame(name=x[1], year=seq(x[2], x[3]))}))
}

f_combined <- function(y){
  do.call(rbind, apply(y, 1, function(x) {
    data.frame(name=rep(x[1],as.numeric(x[3])-as.numeric(x[2])+1), year=x[2]:x[3])}))
}

r <- f_jason(presidents)
all(r==f_max(presidents))
all(r==f_combined(presidents))
library(microbenchmark)
res <- microbenchmark(f_jason(presidents),f_combined(presidents),f_max(presidents))
print(res, order="mean")

                   expr   min     lq    mean median      uq    max neval cld
      f_max(presidents) 436.9 462.75 561.074 482.30  521.25 5601.0   100  a 
 f_combined(presidents) 566.5 605.95 796.029 639.70  723.60 7548.7   100   b
    f_jason(presidents) 770.2 829.70 998.108 906.15 1011.85 4891.0   100   b