What is the difference between the cube, rollup and groupBy operators?

Question

The question is pretty much in the title. I can't find any detailed documentation on the differences.

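I did notice one difference: when I swap the cube and groupBy function calls, I get different results. With cube, the result often contains a lot of null values in the expressions I grouped by.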

Answer 1

These are not intended to work in the same way. groupBy is simply the equivalent of the GROUP BY clause in standard SQL. In other words,

table.groupBy($"foo", $"bar")

is equivalent to:

SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar

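cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies the aggregate expressions to every possible combination of the grouping columns. Let's say you have data like this: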

val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")

df.show

// +---+---+
// |  x|  y|
// +---+---+
// |foo|  1|
// |foo|  2|
// |bar|  2|
// |bar|  2|
// +---+---+

and you compute cube(x, y) with count as the aggregation:

df.cube($"x", $"y").count.show

// +----+----+-----+     
// |   x|   y|count|
// +----+----+-----+
// |null|   1|    1|   <- count of records where y = 1
// |null|   2|    3|   <- count of records where y = 2
// | foo|null|    2|   <- count of records where x = foo
// | bar|   2|    2|   <- count of records where x = bar AND y = 2
// | foo|   1|    1|   <- count of records where x = foo AND y = 1
// | foo|   2|    1|   <- count of records where x = foo AND y = 2
// |null|null|    4|   <- total count of records
// | bar|null|    2|   <- count of records where x = bar
// +----+----+-----+

A function similar to cube is rollup, which computes hierarchical subtotals from left to right:

df.rollup($"x", $"y").count.show
// +----+----+-----+
// |   x|   y|count|
// +----+----+-----+
// | foo|null|    2|   <- count where x is fixed to foo
// | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to  2
// | foo|   1|    1|   ...
// | foo|   2|    1|   ...
// |null|null|    4|   <- count where no column is fixed
// | bar|null|    2|   <- count where x is fixed to bar
// +----+----+-----+

Just for comparison, let's look at the result of a plain groupBy:

df.groupBy($"x", $"y").count.show

// +---+---+-----+
// |  x|  y|count|
// +---+---+-----+
// |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+

To summarize:

With plain GROUP BY, every row is included exactly once, in its corresponding summary.

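With GROUP BY CUBE(..), every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the output shown above is equivalent to something like this (assuming we could use NULL placeholders):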

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y,    COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y

GROUP BY ROLLUP(...) is similar to CUBE, but works hierarchically, filling in columns from left to right:

SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x,    NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x,    y,    COUNT(*) FROM table GROUP BY x, y
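The two UNION ALL queries above can also be read as lists of grouping sets: CUBE uses every subset of the columns, ROLLUP only the left-to-right prefixes. Below is a minimal sketch of that idea using plain Scala collections instead of Spark. The names GroupingSetsSketch, aggregate, cubeSets, rollupSets and groupBySets are made up for illustration and are not part of any Spark API, and None stands in for the NULL placeholder.

```scala
object GroupingSetsSketch {
  // Same rows as the DataFrame in the answer: (x, y)
  val rows = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L))

  // A grouping set says which columns are kept: (keepX, keepY).
  val groupBySets = Seq((true, true))                                // GROUP BY x, y
  val rollupSets  = Seq((true, true), (true, false), (false, false)) // prefixes only
  val cubeSets    = rollupSets :+ ((false, true))                    // every subset

  // Count the rows per key for each grouping set.
  // A column that is not kept becomes None (the NULL placeholder).
  def aggregate(sets: Seq[(Boolean, Boolean)]): Map[(Option[String], Option[Long]), Int] =
    sets.flatMap { case (keepX, keepY) =>
      rows
        .groupBy { case (x, y) => (Option(x).filter(_ => keepX), Option(y).filter(_ => keepY)) }
        .map { case (key, group) => key -> group.size }
    }.toMap

  val plainCounts  = aggregate(groupBySets)
  val rollupCounts = aggregate(rollupSets)
  val cubeCounts   = aggregate(cubeSets)

  def main(args: Array[String]): Unit = {
    println(s"groupBy rows: ${plainCounts.size}")
    println(s"rollup rows:  ${rollupCounts.size}")
    println(s"cube rows:    ${cubeCounts.size}")
    cubeCounts.foreach(println)
  }
}
```

The only difference between rollupSets and cubeSets here is the extra (false, true) set, that is, the "GROUP BY y" branch, which is exactly the branch missing from the ROLLUP query above.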

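ROLLUP and CUBE come from data warehousing extensions, so if you want a better understanding of how they work you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented.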

Answer 2

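1. If you don't want nulls, first remove them, for example:

val dfWithoutNull = df.na.drop("all", Seq("colName1", "colName2"))

The expression above drops nulls from the original DataFrame (with "all", a row is dropped only when every listed column is null; colName1 and colName2 are placeholders for your column names). Note that this only removes nulls already present in the source data. The nulls that cube and rollup produce for subtotal rows are still there.

2. groupBy you already know, I guess.

3. rollup and cube are GROUPING SETS operators. rollup is a multidimensional aggregation that treats the elements hierarchically, whereas cube, rather than treating the elements hierarchically, does the same thing across all dimensions. You can try grouping_id to understand the level of aggregation of each row.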
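To make the grouping_id suggestion concrete, here is a minimal sketch assuming Spark 2.x or later running locally. The object name, the app name and the column alias gid are arbitrary choices for illustration. grouping_id() tags each output row with its aggregation level, so a null that marks a subtotal can be told apart from a null that was already in the data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, grouping_id}

object GroupingIdSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-id-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")

    // gid is a bit mask over the grouping columns: a set bit means the
    // column was aggregated away (shown as null) in that row.
    val cubed = df.cube($"x", $"y")
      .agg(grouping_id().as("gid"), count("*").as("count"))

    cubed.orderBy($"gid").show()

    // Keep only the rows grouped by both x and y (no subtotal rows),
    // which should match the plain groupBy result.
    cubed.filter($"gid" === 0).drop("gid").show()

    spark.stop()
  }
}
```

Filtering on gid rather than on isNull is the safer choice when the grouping columns can legitimately contain nulls.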
