I have some data in a SQL database and I'd like to calculate the slope. The data has the following layout:
Date | Keyword | Score
2012-01-10 | ipad | 0.12
2012-01-11 | ipad | 0.17
2012-01-12 | ipad | 0.24
2012-01-10 | taco | 0.19
2012-01-11 | taco | 0.34
2012-01-12 | taco | 0.45
I'd like the final output to look like this, produced by creating a new table with SQL:
Date | Keyword | Score | Slope
2012-01-10 | ipad | 0.12 | 0.06
2012-01-11 | ipad | 0.17 | 0.06
2012-01-12 | ipad | 0.24 | 0.06
2012-01-10 | taco | 0.19 | 0.13
2012-01-11 | taco | 0.34 | 0.13
2012-01-12 | taco | 0.45 | 0.13
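To complicate matters, not all keywords have 3 dates; some have only 2, for instance.
The simpler the SQL the better, since my database is proprietary and I'm not sure which functions are available, although I know it can do OVER(PARTITION BY). Thank you!
UPDATE: I define the slope as that of the best-fit line y = mx + p; in Excel it is =SLOPE().
Here is another real example of the kind I usually handle in Excel: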
Date | Keyword | Score | Slope
1/22/2012 | water bottle | 0.010885442 | 0.000334784
1/23/2012 | water bottle | 0.011203949 | 0.000334784
1/24/2012 | water bottle | 0.008460835 | 0.000334784
1/25/2012 | water bottle | 0.010363991 | 0.000334784
1/26/2012 | water bottle | 0.011800716 | 0.000334784
1/27/2012 | water bottle | 0.012948411 | 0.000334784
1/28/2012 | water bottle | 0.012732459 | 0.000334784
1/29/2012 | water bottle | 0.011682568 | 0.000334784
The cleanest one I could make:
    SELECT
        Scores.Date, Scores.Keyword, Scores.Score,
        -- least-squares slope: (N*Sxy - Sx*Sy) / (N*Sxx - Sx*Sx)
        (N * Sum_XY - Sum_X * Sum_Y) / (N * Sum_X2 - Sum_X * Sum_X) AS Slope
    FROM Scores
    INNER JOIN (
        -- one row of sums per keyword; x = date as a number, y = score
        SELECT
            Keyword,
            COUNT(*) AS N,
            SUM(CAST(Date AS float)) AS Sum_X,
            SUM(CAST(Date AS float) * CAST(Date AS float)) AS Sum_X2,
            SUM(Score) AS Sum_Y,
            SUM(Score * Score) AS Sum_Y2,  -- not needed for the slope itself
            SUM(CAST(Date AS float) * Score) AS Sum_XY
        FROM Scores
        GROUP BY Keyword
    ) G ON G.Keyword = Scores.Keyword;
It uses simple linear regression to calculate the slope.
Result:
Date | Keyword | Score | Slope
2012-01-22 | water bottle | 0,010885442 | 0,000334784345222076
2012-01-23 | water bottle | 0,011203949 | 0,000334784345222076
2012-01-24 | water bottle | 0,008460835 | 0,000334784345222076
2012-01-25 | water bottle | 0,010363991 | 0,000334784345222076
2012-01-26 | water bottle | 0,011800716 | 0,000334784345222076
2012-01-27 | water bottle | 0,012948411 | 0,000334784345222076
2012-01-28 | water bottle | 0,012732459 | 0,000334784345222076
2012-01-29 | water bottle | 0,011682568 | 0,000334784345222076
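For anyone who wants to try the query without access to the original database, here is a minimal sketch that runs the same sums in an in-memory SQLite database through Python's sqlite3 module. SQLite and its julianday() function are assumptions of this sketch (they stand in for CAST(Date AS float)); the table and column names follow the question, and the helper name slopes_per_keyword is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question.
rows = [
    ("2012-01-10", "ipad", 0.12),
    ("2012-01-11", "ipad", 0.17),
    ("2012-01-12", "ipad", 0.24),
    ("2012-01-10", "taco", 0.19),
    ("2012-01-11", "taco", 0.34),
    ("2012-01-12", "taco", 0.45),
]

# julianday() is SQLite's date-to-number conversion (days, as a float).
# It replaces CAST(Date AS float) here; that substitution is an assumption.
QUERY = """
SELECT s.Date, s.Keyword, s.Score,
       (N * Sum_XY - Sum_X * Sum_Y) / (N * Sum_X2 - Sum_X * Sum_X) AS Slope
FROM Scores AS s
INNER JOIN (
    SELECT Keyword,
           COUNT(*)                               AS N,
           SUM(julianday(Date))                   AS Sum_X,
           SUM(julianday(Date) * julianday(Date)) AS Sum_X2,
           SUM(Score)                             AS Sum_Y,
           SUM(julianday(Date) * Score)           AS Sum_XY
    FROM Scores
    GROUP BY Keyword
) AS G ON G.Keyword = s.Keyword
ORDER BY s.Keyword, s.Date
"""


def slopes_per_keyword(data):
    """Load the rows into a throwaway table and return (date, keyword, score, slope)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Scores (Date TEXT, Keyword TEXT, Score REAL)")
    conn.executemany("INSERT INTO Scores VALUES (?, ?, ?)", data)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = slopes_per_keyword(rows)
for day, keyword, score, slope in result:
    print(day, keyword, score, round(slope, 4))
```

The slopes should come out close to 0.06 for ipad and 0.13 for taco, which is the output the question asks for. Keywords with only 2 dates work the same way; a keyword with a single date would make the denominator zero.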
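Each database system seems to take a different approach to converting dates to numbers:
MySQL: TO_SECONDS(date) or TO_DAYS(date)
Oracle: TO_NUMBER(TO_CHAR(date, 'J')) or date - TO_DATE('1','yyyy')
MS SQL Server: CAST(date AS float) (or the equivalent CONVERT)

If you define the slope as just the slope from the earliest point to the latest point, and if the score only increases with date, then you can get the output above with this: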
    SELECT *
    FROM scores
    JOIN
        (SELECT keyword,
                -- two-argument DATEDIFF(end, start) is MySQL syntax
                (MAX(score) - MIN(score)) / DATEDIFF(MAX(date), MIN(date)) AS slope
         FROM scores
         GROUP BY keyword) a
    USING (keyword);
However, if you want a linear regression, or if scores can go down as well as up over time, then you'll need something more complex.
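To see how much the two definitions can differ when scores are not monotonic, here is a small Python sketch using the "water bottle" rows from the question. The function names are made up for illustration; endpoint_slope mirrors the (MAX - MIN) / date-range query and regression_slope is the usual least-squares slope.

```python
from datetime import date

# (date, score) pairs for "water bottle", copied from the question.
points = [
    (date(2012, 1, 22), 0.010885442),
    (date(2012, 1, 23), 0.011203949),
    (date(2012, 1, 24), 0.008460835),
    (date(2012, 1, 25), 0.010363991),
    (date(2012, 1, 26), 0.011800716),
    (date(2012, 1, 27), 0.012948411),
    (date(2012, 1, 28), 0.012732459),
    (date(2012, 1, 29), 0.011682568),
]


def endpoint_slope(pts):
    """(MAX(score) - MIN(score)) divided by the days between first and last date."""
    scores = [s for _, s in pts]
    days = (max(d for d, _ in pts) - min(d for d, _ in pts)).days
    return (max(scores) - min(scores)) / days


def regression_slope(pts):
    """Least-squares slope, with x measured in days since the first date."""
    first = min(d for d, _ in pts)
    xs = [(d - first).days for d, _ in pts]
    ys = [s for _, s in pts]
    n = len(pts)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x * sum_x)


print(endpoint_slope(points))    # roughly 0.00064
print(regression_slope(points))  # roughly 0.00033, the Excel SLOPE() value
```

On this data the min/max shortcut gives nearly twice the regression slope, because the lowest and highest scores are not the first and last points.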
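The cast from date to decimal did not give me the correct result; it is not linear in the date. Using TO_DAYS(date_field) instead gives the correct result.
Use SUM(CONVERT(float,datediff(dd,'1/1/1900',date_field))) instead of SUM(CAST(date_field as float))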
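One plausible reason a plain numeric cast is "not linear" is that some systems read a date as a YYYYMMDD-style number; that is an assumption about the commenter's setup, not something stated above. The Python sketch below, with made-up helper names, contrasts that reading with a day count from a fixed reference date, which is what TO_DAYS and the datediff expression provide.

```python
from datetime import date


def yyyymmdd(d):
    """Date read as a plain number, e.g. 2012-01-31 -> 20120131."""
    return d.year * 10000 + d.month * 100 + d.day


def days_since(d, ref=date(1900, 1, 1)):
    """Whole days elapsed since a fixed reference date (linear in the date)."""
    return (d - ref).days


jan31, feb01 = date(2012, 1, 31), date(2012, 2, 1)

# One calendar day apart, but the YYYYMMDD reading jumps across the month boundary.
print(yyyymmdd(feb01) - yyyymmdd(jan31))      # 70
print(days_since(feb01) - days_since(jan31))  # 1
```

Any reference date works for the slope, since shifting every x by the same constant leaves it unchanged.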