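I'm trying out Apache Arrow, but I get a column count error on some rows. How can I skip those rows? This is easy in Pandas, but I don't know how to do the same thing in pyarrow. I just want to skip the problematic rows.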
from pyarrow import csv
test_arrow = csv.read_csv(test_file)
This produces:
ArrowInvalid Traceback (most recent call last)
Cell In[6], line 1
----> 1 test_arrow = csv.read_csv( core_test_file)
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1261, in pyarrow._csv.read_csv()
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1270, in pyarrow._csv.read_csv()
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: CSV parse error: Expected 340 columns, got 679: rtee,2024-01-02,0,10.89,10.89,408,46,1409,5,11460,43.73,37.693,83.51,50.68,43.89,58.05,103.217,6 ...
You can skip rows with errors by passing an invalid_row_handler through the parse_options parameter of csv.read_csv:
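from pyarrow import csv

def skip_comment(row):
    # row.text holds the raw text of the row that failed to parse
    if row.text.startswith("# "):
        return 'skip'   # drop this row and keep reading
    else:
        return 'error'  # any other invalid row still raises ArrowInvalid

parse_options = csv.ParseOptions(invalid_row_handler=skip_comment)
test_arrow = csv.read_csv(test_file, parse_options=parse_options)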
This example is taken from the pyarrow documentation.