How to handle read errors in pyarrow read_csv

Question

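I am trying out Apache Arrow, but I get a column count error when reading a CSV file. How can I skip the offending rows? This is easy in pandas, but I don't know how to do the same thing in pyarrow. I just want to skip the problematic rows.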


from pyarrow import csv

test_arrow = csv.read_csv(test_file)

This yields:

ArrowInvalid                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 test_arrow = csv.read_csv( core_test_file)

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1261, in pyarrow._csv.read_csv()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1270, in pyarrow._csv.read_csv()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: CSV parse error: Expected 340 columns, got 679: rtee,2024-01-02,0,10.89,10.89,408,46,1409,5,11460,43.73,37.693,83.51,50.68,43.89,58.05,103.217,6 ...
Tags: pyarrow, read.csv

1 Answer

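You can use the parse_options parameter of csv.read_csv to skip rows with errors. Pass a csv.ParseOptions whose invalid_row_handler callable returns 'skip' for rows that should be dropped and 'error' for rows that should still raise. The handler below only skips invalid rows that start with "# "; any other invalid row still raises ArrowInvalid: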

from pyarrow import csv

def skip_comment(row):
    if row.text.startswith("# "):
        return 'skip'
    else:
        return 'error'

parse_options = csv.ParseOptions(invalid_row_handler=skip_comment)
test_arrow = csv.read_csv(test_file, parse_options=parse_options)

This example is taken from the pyarrow documentation.
