我有一个csv
文件,其中行1-5
代表一个状态,5-10
代表另一个等等...我也有一个列,每个状态重复1970,1980,..,2010
年。在R
(虽然我不反对Excel中的解决方案,如果这更容易),我希望每个州计算当年和1970年之间的百分比差异,即Alabama 1990
它将是(AL 1990 - AL 1970)/(AL 1970)
,并将其添加到新的数据表中的列,以便我可以将其导出到csv
。
State, Year, Num
AL, 1970, 1
AL, 1980, 2
AL, 1990, 3
AL, 2000, 4
AL, 2010, 6
输出将是一列
pct_change
0
1
2
3
5
dplyr
包中包含函数first
,它提供了获取组的第一个值的简单方法。因此,如果我们安排Year
使1970年将成为每组的第一个值,当我们group_by(State)
时,我们可以使用first(Num)
来获得代表1970年价值的Num
的第一个值:
# Example data with 2 states
df <- structure(list(State = c("AL", "AL", "AL", "AL", "AL", "TX",
"TX", "TX", "TX", "TX"), Year = c(1970L, 1980L, 1990L, 2000L,
2010L, 1970L, 1980L, 1990L, 2000L, 2010L), Num = c(1, 2, 3, 4,
6, 5, 2, 10, 12, 6)), class = "data.frame", row.names = c(NA,
-10L))
library(dplyr)
df %>%
arrange(State, Year) %>%
group_by(State) %>%
mutate(perc_diff = 100 * (Num - first(Num))/first(Num))
# A tibble: 10 x 4
# Groups: State [2]
State Year Num perc_diff
<chr> <int> <dbl> <dbl>
1 AL 1970 1 0
2 AL 1980 2 100
3 AL 1990 3 200
4 AL 2000 4 300
5 AL 2010 6 500
6 TX 1970 5 0
7 TX 1980 2 -60
8 TX 1990 10 100
9 TX 2000 12 140
10 TX 2010 6 20
我们可以使用data.table
。将'data.frame'转换为'data.table'(setDT(df)
),order
转换为'State','year'在i
中,按'State'分组,得到'Num'与first
值'的差异' Num'并指定(:=
)创建'perc_diff'
library(data.table)
setDT(df)[order(State, Year), perc_diff :=
100 * (Num - first(Num))/first(Num), State][]
# State Year Num perc_diff
# 1: AL 1970 1 0
# 2: AL 1980 2 100
# 3: AL 1990 3 200
# 4: AL 2000 4 300
# 5: AL 2010 6 500
# 6: TX 1970 5 0
# 7: TX 1980 2 -60
# 8: TX 1990 10 100
# 9: TX 2000 12 140
#10: TX 2010 6 20
或者使用base R
v1 <- with(df, ave(Num, State, FUN = function(x) x[1]))
df$perc_diff <- with(df, 100 * (Num - v1)/v1)
df <- structure(list(State = c("AL", "AL", "AL", "AL", "AL", "TX",
"TX", "TX", "TX", "TX"), Year = c(1970L, 1980L, 1990L, 2000L,
2010L, 1970L, 1980L, 1990L, 2000L, 2010L), Num = c(1, 2, 3, 4,
6, 5, 2, 10, 12, 6)), class = "data.frame", row.names = c(NA,
-10L))
使用R
基础tapply
解决方案
df <- df[with(df, order(State, Year)), ]
df$pct_change <- unlist( tapply(df$Num, df$State, function(x) 100 * (x - x[1]) / x[1]) )
> df
State Year Num pct_change
1 AL 1970 1 0
2 AL 1980 2 100
3 AL 1990 3 200
4 AL 2000 4 300
5 AL 2010 6 500
6 TX 1970 5 0
7 TX 1980 2 -60
8 TX 1990 10 100
9 TX 2000 12 140
10 TX 2010 6 20