使用spark读取镶木地板文件

Question

我想使用spark读取parquet文件并一一处理每个文件的内容。我试图使用以下方法来实现它

    spark.read
      .option("wholetext", "true")
      .option("compression", "none")
      .textFile("/test/part-00001-d3a107e9-ead6-45f0-bccf-fadcecae45bb-c000.zstd.parquet")

还尝试了许多不同的类似方法，但 Spark 似乎正在以某种方式修改文件内容，因为在读取文件时可能添加了一些选项。

我的最终目标是使用 Scala 中的 okhttp 客户端在 Clickhouse 中加载这些文件。我尝试加载的文件没有损坏，并且当未在 Spark 中使用时，Clickhouse 成功处理了它。但是，当我尝试通过以下方法使用 Spark 时，Clickhouse 会响应

std::exception. Code: 1001, type: parquet::ParquetException, e.what() = Couldn't deserialize thrift: TProtocolException: Invalid data

当我尝试打印从文件中读取的内容时，我看到了这个

Europe/Moscoworg.apache.spark.version3.4.1)org.apache.spark.sql.parquet.row.metadata�{"type":"struct","fields":[{"name":"field1","type":"integer","nullable":true,"metadata":{}},{"name":"field2","type":{"type":"array","elementType":"integer","containsNull":true},"nullable":true,"metadata":{}},{"name":"field3","type":{"type":"struct","fields":[{"name":"x","type":"integer","nullable":true,"metadata":{}},{"name":"y","type":"integer","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}org.apache.spark.legacyDateTimeJparquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)L^PAR1

其中看起来没有任何数据，并且似乎只有元数据。

我的问题是如何使用 Spark 读取字符串格式的二进制镶木地板文件。

Answer 1

如果您尝试使用spark.sparkContext.binaryFiles（BinaryFiles代码）会怎么样？

val files: RDD[(String, PortableDataStream)] = spark.sparkContext.binaryFiles("/path/to/parquet/files/")
files.foreach { case (path, stream) =>
  // Use the `stream` PortableDataStream to get InputStream and read the binary content
  // Process or write the binary content to Clickhouse
}

另一种方法是使用 Java nio API :

import java.nio.file.{Files, Paths}

val bytes = Files.readAllBytes(Paths.get("/path/to/parquet/file"))

使用spark读取镶木地板文件

问题描述投票：0回答：1

1个回答

最新问题

使用spark读取镶木地板文件

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1