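I am trying to find items whose sales peaked over a run of consecutive years. My problem is that some items sold in the past and have since been discontinued, but they still need to be part of the analysis.
I have worked through some for loops in R, but I am not sure how to sum over consecutive years and compare that sum with the other local maxima in the same dataset. For example: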
Year Item Sales
2001 Trash Can 100
2002 Trash Can 125
2003 Trash Can 90
2004 Trash Can 97
2002 Red Balloon 23
2003 Red Balloon 309
2004 Red Balloon 67
2005 Red Balloon 8
1998 Blue Bottle 600
1999 Blue Bottle 565
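Given the data above, if I wanted the 2-year sales peak, I would want the output to be Blue Bottle 1165 (sum of 1998 and 1999), Red Balloon 376 (sum of 2003 and 2004), and Trash Can 225 (sum of 2001 and 2002). If I wanted a 3-year peak, however, Blue Bottle would not qualify, because it only has 2 years of data.
If I wanted the 3-year sales peak, I would want the output to be Red Balloon 399 (sum of 2002 to 2004) and Trash Can 315 (sum of 2001 to 2003).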
In SQL, you can use window functions. To get the eligible 2-year sales:
select item, two_year_sales, year
from (select t.*,
             sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
             row_number() over (partition by item order by year) as seqnum
      from t
     ) t
where seqnum >= 2;
And to get the peak:
select t.*
from (select item, two_year_sales, year,
             max(two_year_sales) over (partition by item) as max_two_year_sales
      from (select t.*,
                   sum(sales) over (partition by item order by year rows between 1 preceding and current row) as two_year_sales,
                   row_number() over (partition by item order by year) as seqnum
            from t
           ) t
      where seqnum >= 2
     ) t
where two_year_sales = max_two_year_sales;
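One caveat worth checking before relying on this: the frame `rows between 1 preceding and current row` counts rows, not calendar years, so the sums only cover consecutive years when each item has a row for every year in its range. A quick check in R, as a rough sketch; the data frame name `dat` and the variable `has_gap` are assumptions made for illustration:

```r
# Sample data from the question
dat <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# TRUE for any item whose sorted years are not all exactly one year apart
has_gap <- tapply(dat$Year, dat$Item, function(y) any(diff(sort(y)) != 1))
print(has_gap)
```

For the sample data every item comes back FALSE, so the row-based frame and a year-based window agree here.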
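I can only partially help you, with SQL: use GROUP BY and HAVING. The HAVING clause filters out every item that does not have the specified minimum number of years of historical data.
Check whether this query fits your requirements.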
SELECT
item
, count(*) as num_years
, sum(Sales) as local_max
from [your_table]
where year between [year_ini] and [year_end]
group by item
having count(*) >= [number_of_years]
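For readers working in R, the same filter might look roughly like the following. This is only a sketch: the range 2001 to 2004 and the minimum of 3 years are example values standing in for the placeholders in the query, and the names `in_range`, `min_years`, and `res` are made up for illustration. Like the query, it totals one fixed year range rather than searching for the best run of years.

```r
# Sample data from the question
dat <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

min_years <- 3  # example value for [number_of_years]

# WHERE year BETWEEN 2001 AND 2004 (example values)
in_range <- dat[dat$Year >= 2001 & dat$Year <= 2004, ]

# GROUP BY item: count of rows and total sales per item
num_years <- tapply(in_range$Sales, in_range$Item, length)
total     <- tapply(in_range$Sales, in_range$Item, sum)
res <- data.frame(item = names(total),
                  num_years = as.vector(num_years),
                  local_max = as.vector(total))

# HAVING count(*) >= min_years
res <- res[res$num_years >= min_years, ]
print(res)
```

With these example values, Red Balloon (3 years, 399) and Trash Can (4 years, 412) remain, and Blue Bottle drops out because it has no rows in the range.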
An R solution using tidyverse and RcppRoll:
#Loading the packages and your data as a `tibble`
library("RcppRoll")
library("dplyr")
tbl <- tribble(
~Year, ~Item, ~Sales,
2001, "Trash Can", 100,
2002, "Trash Can", 125,
2003, "Trash Can", 90,
2004, "Trash Can", 97,
2002, "Red Balloon", 23,
2003, "Red Balloon", 309,
2004, "Red Balloon", 67,
2005, "Red Balloon", 8,
1998, "Blue Bottle", 600,
1999, "Blue Bottle", 565
)
# Set the number of consecutive years
n <- 2
# Compute the rolling sums (assumes data to be sorted) and take max
res <- tbl %>%
group_by(Item) %>%
mutate(rollingsum = roll_sumr(Sales, n)) %>%
summarize(best_sum = max(rollingsum, na.rm = TRUE))
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle 1165
#2 Red Balloon 376
#3 Trash Can 225
Setting n <- 3 produces a different res:
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle -Inf
#2 Red Balloon 399
#3 Trash Can 315
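If the -Inf row for Blue Bottle is unwanted, one option is to drop items with fewer than n years of data before taking the maximum. Below is a minimal base R sketch of that idea, not a replacement for the code above. It assumes each item has no gaps in its years; `peak_base` is a hypothetical helper name, not a function from any package.

```r
# Sample data from the question
dat <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = rep(c("Trash Can", "Red Balloon", "Blue Bottle"), times = c(4, 4, 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# Hypothetical helper: best sum of k consecutive rows per item,
# dropping items that have fewer than k rows
peak_base <- function(data, k) {
  data <- data[order(data$Item, data$Year), ]
  best <- sapply(split(data$Sales, data$Item), function(s) {
    if (length(s) < k) return(NA_real_)
    # Differences of cumulative sums give the total of every k-row window
    cs <- c(0, cumsum(s))
    max(cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)])
  })
  best <- best[!is.na(best)]
  data.frame(Item = names(best), Sum = unname(best))
}

peak_base(dat, 2)  # Blue Bottle 1165, Red Balloon 376, Trash Can 225
peak_base(dat, 3)  # Red Balloon 399, Trash Can 315
```

Returning NA for short items and filtering afterwards keeps the per-item function simple; the same effect could also be had by filtering groups on their row count before computing the rolling sums.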
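Read the data dat (shown reproducibly in the Note at the end) into a zoo series with one column per Item, then convert it to a ts series tt (which fills in any missing years with NA). Then use rollsumr to get the sum of every k consecutive years for each Item, find the maximum of those sums for each Item, stack the result into a data frame, and omit any NA rows. The function Max is like max(x, na.rm = TRUE), except that if x is all NAs it returns NA instead of -Inf and does not issue a warning. stack puts the item column second, so use 2:1 to reverse the columns, and add nicer names.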
library(zoo)
Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
peak <- function(data, k) {
tt <- as.ts(read.zoo(data, split = "Item"))
s <- na.omit(stack(apply(rollsumr(tt, k), 2, Max)))
setNames(s[2:1], c("Item", "Sum"))
}
peak(dat, 2)
## Item Sum
## 1 Blue Bottle 1165
## 2 Red Balloon 376
## 3 Trash Can 225
peak(dat, 3)
## Item Sum
## 2 Red Balloon 399
## 3 Trash Can 315
Note: the input, in reproducible form, is assumed to be:
dat <-
structure(list(Year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2003L,
2004L, 2005L, 1998L, 1999L), Item = c("Trash Can", "Trash Can",
"Trash Can", "Trash Can", "Red Balloon", "Red Balloon", "Red Balloon",
"Red Balloon", "Blue Bottle", "Blue Bottle"), Sales = c(100L,
125L, 90L, 97L, 23L, 309L, 67L, 8L, 600L, 565L)), row.names = c(NA,
-10L), class = "data.frame")