HuggingFace load_dataset error (.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow')


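I am following a tutorial to fine-tune a model, but I am stuck on a load_dataset error that I cannot resolve. For context, the tutorial first uploads this dataset to HF, and I successfully uploaded an identical copy of it to my own account.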

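However, when I run the script that downloads the dataset, things go wrong. If I download the original dataset, the process runs smoothly and all files are fetched correctly. But when I try to download my own copy, it seems that only part of it is downloaded (the cache is created down to the 0.0.0 folder you can see in the error message, but nothing after that).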

The command I am running is

dataset = load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2", split="train")
and the error log I get is as follows:

Downloading data files: 100%|████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|████████████████████████████████████████| 3/3 [00:00<00:00, 198.67it/s] 
Traceback (most recent call last):
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1852, in _prepare_split_single
    writer = writer_class(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
    self.stream = self._fs.open(fs_token_paths[2][0], "wb")
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\spec.py", line 1241, in open
    f = self._open(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 184, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 315, in __init__
    self._open()
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 320, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Felipe Bandeira/.cache/huggingface/datasets/FelipeBandeiraPoatek___parquet/FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 11, in <module>
    main()
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 7, in main   
    dataset_tester.test("FelipeBandeiraPoatek/invoices-donut-data-v2")
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\tools\donut\dataset_tester.py", line 10, in test
    dataset = load_dataset(dataset_name, split="train")
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1749, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1892, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

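I have not found any solution, and I cannot figure out why the original dataset downloads fine while my (identical) copy does not. Any clues?

(What I have tried:

  1. Checking the function that downloads the dataset
  2. Checking the error log
  3. Deleting the folder on my computer where the downloads are stored and repeating the process
  4. Cloning the files from the original repository into my own repository

In every case, I can download the dataset correctly from the original repository but not from my own. The same error keeps occurring.)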

filenotfoundexception huggingface huggingface-datasets
1 Answer

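I ran into this problem too. In the end I found that the file name was too long and exceeded the system's limit on name length. Shortening the file name fixed it!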
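To check whether this explanation fits the traceback in the question, here is a minimal sketch that measures the failing path against the classic Windows MAX_PATH limit of 260 characters. It assumes long-path support is not enabled on the machine. The helper exceeds_max_path and the short cache folder C:/hf are hypothetical names used only for illustration.

```python
# Classic Windows path limit (MAX_PATH). Assumption: long-path support is
# not enabled, so the limit applies to the full cache file path.
WINDOWS_MAX_PATH = 260


def exceeds_max_path(path: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the path is too long for the classic Windows limit.

    MAX_PATH includes the terminating NUL character, so a path of
    `limit` characters or more cannot be opened.
    """
    return len(path) >= limit


# Default cache root taken from the FileNotFoundError in the question.
default_cache = "C:/Users/Felipe Bandeira/.cache/huggingface/datasets"

# Remainder of the failing path, copied from the same error message.
cache_suffix = (
    "/FelipeBandeiraPoatek___parquet"
    "/FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf"
    "/0.0.0"
    "/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete"
    "/parquet-validation-00000-00000-of-NNNNN.arrow"
)

failing_path = default_cache + cache_suffix
print(len(failing_path), exceeds_max_path(failing_path))

# Hypothetical shorter cache root; any short, writable folder would do.
short_cache = "C:/hf"
shortened_path = short_cache + cache_suffix
print(len(shortened_path), exceeds_max_path(shortened_path))

# A possible workaround under the same assumption is to point the cache at
# the shorter folder via the cache_dir argument of load_dataset:
#
#   from datasets import load_dataset
#   dataset = load_dataset(
#       "FelipeBandeiraPoatek/invoices-donut-data-v2",
#       split="train",
#       cache_dir="C:/hf",  # hypothetical folder
#   )
```

If the first check reports that the path is over the limit, that would also explain why the original dataset loads fine: a shorter user or repository name produces a shorter cache path. Renaming the dataset repository to something shorter, as described above, or moving the cache to a shorter folder should both bring the path back under the limit.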
