How does Spark compute the mean and stddev of a string column?


I have the following data (only a snippet is shown):

DEST_COUNTRY_NAME   ORIGIN_COUNTRY_NAME count
United States   Romania 15
United States   Croatia 1
United States   Ireland 344
Egypt   United States   15

I read the file with the inferSchema option set to true and then call describe. This seems to work fine.

scala> val data = spark.read.option("header", "true").option("inferSchema","true").csv("./data/flight-data/csv/2015-summary.csv")
scala> data.describe().show()
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|            370002|
+-------+-----------------+-------------------+------------------+
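For reference, describe is essentially a convenience; the same statistics can be computed explicitly with agg. A minimal sketch against the data DataFrame above (the aggregate functions live in org.apache.spark.sql.functions):

scala> import org.apache.spark.sql.functions._
scala> data.agg(count("count"), mean("count"), stddev("count"), min("count"), max("count")).show()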

If I don't specify inferSchema, all columns are treated as strings.

scala> val dataNoSchema = spark.read.option("header", "true").csv("./data/flight-data/csv/2015-summary.csv")
dataNoSchema: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> dataNoSchema.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)
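As an aside, a third option avoids inference altogether: supplying an explicit schema, so that count is read as a numeric column from the start. A sketch, assuming the same file path as above:

scala> import org.apache.spark.sql.types._
scala> val schema = StructType(Seq(
     |   StructField("DEST_COUNTRY_NAME", StringType),
     |   StructField("ORIGIN_COUNTRY_NAME", StringType),
     |   StructField("count", IntegerType)))
scala> val dataTyped = spark.read.option("header", "true").schema(schema).csv("./data/flight-data/csv/2015-summary.csv")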

Question 1) Why does Spark still compute mean and stddev values for the last column, count?

scala> dataNoSchema.describe().show();
+-------+-----------------+-------------------+------------------+
|summary|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|             count|
+-------+-----------------+-------------------+------------------+
|  count|              256|                256|               256|
|   mean|             null|               null|       1770.765625|
| stddev|             null|               null|23126.516918551915|
|    min|          Algeria|             Angola|                 1|
|    max|           Zambia|            Vietnam|               986|
+-------+-----------------+-------------------+------------------+

Question 2) If Spark now interprets count as a numeric column, then why is the max value 986 and not 370002 (as in the data DataFrame)?

apache-spark
1 Answer

Spark SQL aims to be compliant with the SQL standard and therefore uses the same evaluation rules, transparently coercing types to satisfy an expression when needed (see for example my answer to PySpark DataFrames - filtering using comparisons between columns of different types).
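The same transparent coercion appears outside aggregation as well. For example, filtering the all-string dataNoSchema frame with a numeric comparison works because Spark inserts a cast into the plan; a small sketch (the exact cast target can differ between Spark versions):

scala> dataNoSchema.filter($"count" > 1000).explain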

For describe, this means that the max and the mean / stddev cases are fundamentally different:

  • max is meaningful for strings (using lexicographic ordering), so no coercion is required:

    Seq.empty[String].toDF("count").agg(max("count")).explain

    == Physical Plan ==
    SortAggregate(key=[], functions=[max(count#69)])
    +- Exchange SinglePartition
       +- SortAggregate(key=[], functions=[partial_max(count#69)])
          +- LocalTableScan <empty>, [count#69]

  • mean and stddev are not, so the argument is cast to double:

    Seq.empty[String].toDF("count").agg(mean("count")).explain

    == Physical Plan ==
    *(2) HashAggregate(keys=[], functions=[avg(cast(count#81 as double))])
    +- Exchange SinglePartition
       +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(count#81 as double))])
          +- LocalTableScan <empty>, [count#81]
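This also answers Question 2: with dataNoSchema, max returns the lexicographically largest string, not the largest number. Both points can be checked directly; a minimal sketch (assuming org.apache.spark.sql.functions._ is imported as in the snippets above; the explicit cast is one way to recover the numeric maximum):

scala> Seq("986", "370002").toDF("count").agg(max("count")).show()
// "986" wins, because as strings "9" sorts after "3"

scala> dataNoSchema.agg(max(col("count").cast("int"))).show()
// should yield 370002, matching the inferSchema DataFrame above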