How to write a Parquet file in Java using Apache Arrow

Question

I am trying to write data from Java to Apache Parquet. So far I have followed the example at https://arrow.apache.org/cookbook/java/schema.html#creating-fields, using Apache Arrow to create a dataset in Arrow format.

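The question is: how do I then write it out as Parquet? Also, do I need Apache Arrow at all to output the data as a Parquet file? Or can I serialize the data directly with Apache Parquet and write it to a Parquet file?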

What I have done:

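import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;
import static java.util.Arrays.asList;

try (BufferAllocator allocator = new RootAllocator()) {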
    Field name = new Field("name", FieldType.nullable(new ArrowType.Utf8()), null);
    Field age = new Field("age", FieldType.nullable(new ArrowType.Int(32, true)), null);
    Schema schemaPerson = new Schema(asList(name, age));
    try(
        VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schemaPerson, allocator)
    ){
        VarCharVector nameVector = (VarCharVector) vectorSchemaRoot.getVector("name");
        nameVector.allocateNew(3);
        nameVector.set(0, "David".getBytes());
        nameVector.set(1, "Gladis".getBytes());
        nameVector.set(2, "Juan".getBytes());
        IntVector ageVector = (IntVector) vectorSchemaRoot.getVector("age");
        ageVector.allocateNew(3);
        ageVector.set(0, 10);
        ageVector.set(1, 20);
        ageVector.set(2, 30);
        vectorSchemaRoot.setRowCount(3);
        File file = new File("randon_access_to_file.arrow");
        try (
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())
        ) {
            writer.start();
            writer.writeBatch();
            writer.end();
            System.out.println("Record batches written: " + writer.getRecordBlocks().size() + ". Number of rows written: " + vectorSchemaRoot.getRowCount());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

But this produces an Arrow file, not Parquet. Any ideas on how to output it as a Parquet file? Do I need Arrow to generate Parquet files, or can I use Parquet directly?

Answer 1

Arrow Java does not yet support writing Parquet files, but you can use the Parquet library to do it.

Some code in the Arrow dataset test classes may help. See

org.apache.arrow.dataset.ParquetWriteSupport;
org.apache.arrow.dataset.file.TestFileSystemDataset;

The second class contains tests that use the utilities from the first.

You can find them on GitHub: https://github.com/apache/arrow/tree/master/java/dataset/src/test/java/org/apache/arrow/dataset
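
To illustrate what "using Parquet directly" involves: Parquet's Java writer API (ParquetWriter<T> in parquet-hadoop) takes one record per write call, so data held in columns, as in Arrow vectors, has to be walked row by row first. Below is a minimal sketch of that pivot using only the JDK. The class ColumnsToRows and its toRows method are hypothetical names made up for this example, plain lists stand in for VarCharVector and IntVector, and the actual Parquet write call is left as a comment because it needs the Parquet and Hadoop dependencies on the classpath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Arrow or Parquet: pivots column data into rows.
final class ColumnsToRows {

    // Turns named columns into one ordered map per row.
    // Each column must hold at least rowCount values.
    static List<Map<String, Object>> toRows(Map<String, List<?>> columns, int rowCount) {
        List<Map<String, Object>> rows = new ArrayList<>(rowCount);
        for (int i = 0; i < rowCount; i++) {
            Map<String, Object> row = new LinkedHashMap<>();
            for (Map.Entry<String, List<?>> column : columns.entrySet()) {
                // A null element is kept as a null cell, matching the nullable fields in the question.
                row.put(column.getKey(), column.getValue().get(i));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Same sample data as the question, held as plain lists instead of Arrow vectors.
        Map<String, List<?>> columns = new LinkedHashMap<>();
        columns.put("name", List.of("David", "Gladis", "Juan"));
        columns.put("age", List.of(10, 20, 30));

        for (Map<String, Object> row : toRows(columns, 3)) {
            // With a real Parquet writer, each record would be handed to its write(...) call here.
            System.out.println(row);
        }
    }
}
```

With real Arrow vectors the inner loop would read each cell from the vector at index i instead of from a list, and each row would be converted into whatever record type the chosen Parquet writer expects.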

Answer 2

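As the POM shows, the tests in the Arrow Java implementation use the parquet-hadoop library. This is somewhat unfortunate at the moment, because parquet-hadoop depends on Hadoop libraries such as hadoop-common, which is notorious for its large dependency chain (and numerous CVEs).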

Even the latest version of hadoop-common has 15 CVEs. Other language implementations of Arrow, such as C++ or Rust, do not need this, and they can be used for easier Parquet integration.
