ORC stripe size setting problem in PySpark or Scala

Question

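I am having trouble setting the stripe size, index stride, and index creation on ORC files with PySpark. I get roughly 2000 stripes for a 1.2GB file, whereas with a 256MB stripe size I expected only 5 stripes.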
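As a rough check of that expectation, here is a small sketch of the arithmetic. It assumes every stripe fills to the configured size and ignores compression and writer memory limits, so it is only an estimate; the function name `expected_stripes` is made up for illustration.

```python
import math


def expected_stripes(file_size_bytes, stripe_size_bytes):
    """Estimate the stripe count, assuming each stripe fills to the configured size."""
    return math.ceil(file_size_bytes / stripe_size_bytes)


file_size = 1.2 * 1024 ** 3        # the 1.2GB file from the question
stripe_size = 256 * 1024 ** 2      # 268435456 bytes, the value passed to orc.stripe.size

print(expected_stripes(file_size, stripe_size))  # 1.2GB / 256MB = 4.8, rounded up
```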

I tried the following options:

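Setting .option on the DataFrame writer. The compression setting passed through .option works, but the other .option settings have no effect. Looking at the .option method in the DataFrame class, it seems to apply only to compression and not to stripe size, index, or stride.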

# custom_field, sort_field_1, sort_field_2 and s3_location are placeholders
(df
 .repartition("custom_field")
 .sortWithinPartitions("custom_field", "sort_field_1", "sort_field_2")
 .write.format("orc")
 .option("compression", "zlib")  # only this option worked
 .option("preserveSortOrder", "true")
 .option("orc.stripe.size", "268435456")
 .option("orc.row.index.stride", "true")
 .option("orc.create.index", "true")
 .save(s3_location))

Creating an empty Hive table with the ORC settings above and loading it with Spark's saveAsTable and insertInto methods. The resulting table has more stripes than expected.

# hive_table_name is a placeholder
(df
 .repartition("custom_field")
 .sortWithinPartitions("custom_field", "sort_field_1", "sort_field_2")
 .write.format("orc")
 .mode("append")
 .saveAsTable(hive_table_name))  # also tried .insertInto(hive_table_name)

For both options, I enabled the following configurations:

spark.sql("set spark.sql.orc.impl=native")
spark.sql("set spark.sql.orc.enabled=true")
spark.sql("set spark.sql.orc.cache.stripe.details.size=268435456")

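Please let me know if I am missing any code, any DataFrame-writer-level method, or any Spark-session-level configuration that would let us get the desired result.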

Answer 1

0 votes

"orc.row.index.stride" should be a numeric value (a row count, such as the ORC default of 10000), not "true".

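To illustrate the point, here is a minimal sketch that builds the writer options with a numeric stride. The helpers `orc_write_options` and `with_options` are hypothetical names, not part of PySpark; the only real API assumed is `DataFrameWriter.option`. Whether Spark actually passes these ORC properties through may depend on the Spark version and ORC implementation in use, which this thread does not confirm.

```python
def orc_write_options(stripe_size_bytes=256 * 1024 * 1024, row_index_stride=10000):
    """Build ORC writer options as strings (hypothetical helper).

    The stride is a number of rows between index entries, so booleans are rejected.
    """
    if isinstance(row_index_stride, bool) or not isinstance(row_index_stride, int):
        raise TypeError("orc.row.index.stride must be an integer row count")
    return {
        "compression": "zlib",
        "orc.stripe.size": str(stripe_size_bytes),
        "orc.row.index.stride": str(row_index_stride),
        "orc.create.index": "true",
    }


def with_options(writer, options):
    """Apply each key/value pair through the writer's .option method."""
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer


# Usage with PySpark, assuming df and s3_location already exist:
# with_options(df.write.format("orc"), orc_write_options()).save(s3_location)
```

Keeping the options in one dictionary makes it easy to log or compare exactly what was sent to the writer when the stripe count comes out differently than expected.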