How can I download a HuggingFace dataset via the HuggingFace CLI while keeping the original filenames?


I downloaded a dataset hosted on HuggingFace via the HuggingFace CLI as follows:

pip install huggingface_hub[hf_transfer]
huggingface-cli download huuuyeah/MeetingBank_Audio --repo-type dataset --local-dir-use-symlinks False 

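However, the downloaded files don't have their original filenames. Instead, their hashes (git-sha or sha256, depending on whether they are LFS files) are used as filenames: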

--- /home/dernonco/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/blobs ---------------------------------------------
                         /..                                                                                                       
   12.9 GiB [##########]  b581945ddee5e673fa2059afb25274b1523f270687b5253cb8aa72865760ebc0
    3.9 GiB [###       ]  86ebd2861a42b27168d75f346dd72f0e2b9eaee0afb90890beff15d025af45c6
    3.9 GiB [##        ]  f9b81739ee30450b930390e1155e2cdea1b3063379ba6fd9253513eba1ab1e05
    3.7 GiB [##        ]  e54c7d123ad93f4144eebdca2827ef81ea1ac282ddd2243386528cd157c02f36
    3.7 GiB [##        ]  736e225a7dd38a7987d0745b1b2f545ab701cfdf1f639874f5743b5bfb5cb1e1
    3.7 GiB [##        ]  0687246c92ec87b54e1c5fe623a77b650c02e6884e17a6f0fb4052a862d928d0
    3.6 GiB [##        ]  2becb5f9878b95f1b12622f50868f5855221985f05910d7cc759e6be074e6b8e
    3.5 GiB [##        ]  2208068c69b39c46ee9fac862da3c060c58b61adcaee1b3e6aa5d6d5dd3eba86
    3.5 GiB [##        ]  caf87e71232cbb8a31960a26ba30b9412c15893c831ef118196c581cfd3a3779
    3.4 GiB [##        ]  dc88cbf0ef45351bdc1f53c4396466d3e79874803719e266630ed6c3ad911d6a
    3.4 GiB [##        ]  f05f7fb3b55b6840ebc4ada5daa28742bbae6ad4dcc35781dc811024f27a1b4e
    3.4 GiB [##        ]  88bd831618b36330ef5cd84b7ccbc4d5f3f55955c0b223208bc2244b27fb2d78
    3.4 GiB [##        ]  bf80943b3389ddbeb8fb8a56af2d7fa5d09c5af076aac93f54ad921ee382c77d
    3.3 GiB [##        ]  83b2627e644c9ad0486e3bd966b02f014722e668d26b9d52394c974fcf2fdcf8
    3.2 GiB [##        ]  e52e7b086dabd431b25cf309e1fe513190543e058f4e7a2d8e05b22821ded4fe
    3.2 GiB [##        ]  4fe583348f3ac118f34c7b93b6a187ba4e21a5a7f5b6ca1a6adbce1cc6d563a9
    3.2 GiB [##        ]  ae6b6faca3bbd75e7ca99ccf20b55b017393bf09022efb8459293afffe06dc6e
    3.1 GiB [##        ]  5865379a894f8dc40703bdc1093d45fda67d5e1a742a2eebddd37e1a00f067fd
    3.1 GiB [##        ]  cd346324b29390a589926ccab7187ae818cf5f9fcbaf8ecc95313e6cdfab86bc
    3.0 GiB [##        ]  914eb2b1174a662e3faebac82f6b5591a54def39a9d3a7e5ab2347ecc87a982f
    2.9 GiB [##        ]  24789f33332e8539b2ee72a0a489c0f4d0c6103f7f9600de660d78543ade9111
    2.9 GiB [##        ]  35e8da5f831b36416c9569014c58f881a0a30c00db9f3caae0d7db6a8fd3c694
    2.8 GiB [##        ]  d5127e0298661d40a343d58759ed6298f9d2ef02d5c4f6a30bd9e07bc5423317
    2.8 GiB [##        ]  1b4e1951da2462ca77d94d220a58c97f64caa2b2defe4df95feed9defcee6ca7
    2.8 GiB [##        ]  75a4725625c095d98ecef7d68d384d7b1201ace046ef02ed499776b0ac02b61e
    2.8 GiB [##        ]  fefbbc3e87be522b7e571c78a188aba35bd5d282cf8f41257097a621af64ff60
 Total disk usage: 184.8 GiB  Apparent size: 184.8 GiB  Items: 85                                          

How can I download a HuggingFace dataset via the HuggingFace CLI while keeping the original filenames?

python download dataset huggingface-datasets
1 Answer

You have to look in the snapshots folder:

/home/username/.cache/huggingface/hub/datasets--huuuyeah--MeetingBank_Audio/snapshots

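It contains the original, human-readable filenames. However, these entries are symlinks that point to the blob files, which use hashes as filenames. One can replace each symlink with the actual file (stored in blobs), which gives the files their original names.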
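To make the layout concrete, here is a minimal sketch that builds a tiny mock of the cache in a temporary directory, shows where a snapshot entry points, and then copies the snapshot out with symlinks dereferenced. The revision name rev0, the file name meeting.zip and the shortened blob name are hypothetical stand-ins, not the real contents of the dataset. Copying this way leaves the cache untouched, but it needs free space roughly equal to the size of the dataset.

```shell
#!/bin/sh
set -e

# Throwaway mock of the cache layout (hypothetical names, tiny file).
cache=$(mktemp -d)
mkdir -p "$cache/blobs" "$cache/snapshots/rev0/Audio"
printf 'fake audio bytes\n' > "$cache/blobs/b581945d"

# Snapshot entries are relative symlinks that point back into blobs/.
ln -s ../../../blobs/b581945d "$cache/snapshots/rev0/Audio/meeting.zip"

# Show which blob the readable name resolves to.
readlink "$cache/snapshots/rev0/Audio/meeting.zip"

# Copy the snapshot while following symlinks (-L), so the destination
# holds regular files under their original names.
dest=$(mktemp -d)
cp -RL "$cache/snapshots/rev0/." "$dest/"

ls -l "$dest/Audio"
```

For a real download, the source would be the revision directory under snapshots and the destination any directory with enough free space.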

To replace the symlinks with the actual files on Linux, one can use u1686_grawity's script:

script.sh

#!/bin/sh
set -e
for link; do
    test -h "$link" || continue

    dir=$(dirname "$link")
    reltarget=$(readlink "$link")
    case $reltarget in
        /*) abstarget=$reltarget;;
        *)  abstarget=$dir/$reltarget;;
    esac

    rm -fv "$link"
    cp -afv "$abstarget" "$link" || {
        # on failure, restore the symlink
        rm -rfv "$link"
        ln -sfv "$reltarget" "$link"
    }
done

Run it with:

find . -type l -exec /path/to/script.sh {} +
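Before running this on a cache of close to 200 GiB, it may be worth rehearsing on a small mock tree and confirming that no symlinks remain afterwards. The sketch below does that under stated assumptions: the directory and file names are hypothetical, and the inline loop is a simplified stand-in for script.sh (it copies each target to a temporary file and then renames it over the link), not the script itself.

```shell
#!/bin/sh
set -e

# Small mock tree: one blob and one snapshot symlink (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/blobs" "$work/snapshots/rev0"
printf 'payload\n' > "$work/blobs/abc123"
ln -s ../../blobs/abc123 "$work/snapshots/rev0/data.bin"

cd "$work/snapshots"
before=$(find . -type l | wc -l | tr -d ' ')

# Stand-in for script.sh: copy each link's target next to the link,
# then rename the copy over the link so the name now holds a real file.
find . -type l -exec sh -c '
    for link; do
        tmp=$link.tmp.$$
        cp -L "$link" "$tmp" && mv -f "$tmp" "$link"
    done
' sh {} +

after=$(find . -type l | wc -l | tr -d ' ')
echo "symlinks before: $before, after: $after"
```

The same before and after count, run from the real snapshots directory, gives a quick check that every link was replaced.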