Cannot download the c4 dataset using Dataflow in Colab

Problem description

I want to download the c4 dataset. The catalog page, https://www.tensorflow.org/datasets/catalog/c4, recommends using Dataflow. I followed the steps described at https://www.tensorflow.org/datasets/beam_datasets in Google Colab.

Packages:

!pip install -q tensorflow-datasets
!pip install -q apache-beam[gcp]

This is the cell I am trying to run in Colab:

%env DATASET_NAME=c4/en
%env GCP_PROJECT=......
%env GCS_BUCKET=gs://c4-dump
%env DATAFLOW_JOB_NAME=c4-en-gen

!echo "tensorflow_datasets[$DATASET_NAME]" > /tmp/beam_requirements.txt

!python -m tensorflow_datasets.scripts.download_and_prepare \
  --datasets=$DATASET_NAME \
  --data_dir=$GCS_BUCKET \
  --beam_pipeline_options="runner=DataflowRunner,project=$GCP_PROJECT,job_name=$DATAFLOW_JOB_NAME,staging_location=$GCS_BUCKET/binaries,temp_location=$GCS_BUCKET/temp,requirements_file=/tmp/beam_requirements.txt"

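This is almost identical to the code in the tutorial. However, no Dataflow job is created in the Dataflow tab, and the download appears to be running locally. Here is the output log: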

env: DATASET_NAME=c4/en
env: GCP_PROJECT=ai-vs-covid19
env: GCS_BUCKET=gs://c4-dump
env: DATAFLOW_JOB_NAME=c4-en-gen
2020-03-31 02:18:46.297213: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
I0331 02:18:49.098738 139869050173312 download_and_prepare.py:180] Running download_and_prepare for datasets:
c4/en
I0331 02:18:49.099436 139869050173312 download_and_prepare.py:181] Version: "None"
I0331 02:18:50.353859 139869050173312 dataset_builder.py:202] Load pre-computed datasetinfo (eg: splits) from bucket.
I0331 02:18:50.468347 139869050173312 dataset_info.py:431] Loading info from GCS for c4/en/2.2.1
I0331 02:18:50.522799 139869050173312 download_and_prepare.py:130] download_and_prepare for dataset c4/en/2.2.1...
I0331 02:18:50.560583 139869050173312 driver.py:124] Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
I0331 02:18:50.683776 139869050173312 driver.py:124] Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
I0331 02:18:51.189772 139869050173312 dataset_builder.py:310] Generating dataset c4 (gs://c4-dump/c4/en/2.2.1)
Downloading and preparing dataset c4/en/2.2.1 (download: 6.96 TiB, generated: 816.78 GiB, total: 7.76 TiB) to gs://c4-dump/c4/en/2.2.1...

followed by many lines like these:

Dl Completed...:   0% 0/18 [00:38<?, ? url/s]
Dl Completed...:   0% 0/18 [00:38<?, ? url/s]
Dl Completed...:   0% 0/18 [00:39<?, ? url/s]I0331 02:19:33.506697 139869050173312 download_manager.py:256] Downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-18/segments/1555578517558.8/wet/CC-MAIN-20190418101243-20190418123243-00326.warc.wet.gz into gs://c4-dump/downloads/comm.s3_craw-data_CC-MAIN-2019-18_segm_1555iQS7Yn3hZ3JmwClTiCNY5qtVgGfQQAObrCqx7cMloOg.gz.tmp.1bbeb83abada465287dcecabb0e4f4b0...

Am I missing something, or is this just a preparation stage? My main concern is that I do not see any Dataflow job running.

Thanks!

UPD: I tried the same approach on a compute instance, with the same result.
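Before launching, it can help to confirm that the string passed to --beam_pipeline_options really contains the runner and locations you expect. The sketch below assumes the flag is a comma-separated list of key=value pairs, as its format in the cell above suggests. The helper name parse_pipeline_options and the project and bucket values are hypothetical placeholders, not part of tensorflow_datasets.

```python
def parse_pipeline_options(options):
    """Split 'k1=v1,k2=v2' into a dict; a key without '=' maps to an empty string."""
    parsed = {}
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty fragments such as a trailing comma
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


# Placeholder values for illustration only; substitute your own project and bucket.
bucket = "gs://my-bucket"
options = (
    "runner=DataflowRunner,project=my-project,job_name=c4-en-gen,"
    f"staging_location={bucket}/binaries,temp_location={bucket}/temp,"
    "requirements_file=/tmp/beam_requirements.txt"
)

parsed = parse_pipeline_options(options)
for key in ("runner", "project", "job_name", "staging_location", "temp_location"):
    print(f"{key:>17}: {parsed.get(key, '<missing>')}")
```

If runner is missing or is not DataflowRunner, the pipeline would not be expected to show up in the Dataflow tab. In the log above the options look correct, so this check only rules out a malformed options string.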

google-colaboratory apache-beam tensorflow-datasets dataflow
1 Answer

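I just updated the tfds-nightly package so that the raw files are downloaded on the Dataflow workers instead of on the manager. Please try version 2.1.0.dev202003312203 and let me know if you run into any problems.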
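To check whether an installed nightly is at least the build named above, one option is to compare the numeric .dev suffix. This is only a sketch: it assumes the suffix is a build timestamp (as 202003312203 suggests), it ignores the base version number, and the helper names and the second sample version are hypothetical.

```python
import re

# Version named in the answer above.
FIXED_BUILD = "2.1.0.dev202003312203"


def dev_timestamp(version):
    """Return the numeric '.devN' suffix of a version string, or None if absent."""
    match = re.search(r"\.dev(\d+)$", version)
    return int(match.group(1)) if match else None


def is_at_least_fixed_build(version, fixed=FIXED_BUILD):
    """True/False for dev builds; None when the version has no '.dev' suffix."""
    installed = dev_timestamp(version)
    if installed is None:
        return None  # stable release: cannot tell from the version string alone
    return installed >= dev_timestamp(fixed)


# The middle entry is a made-up older nightly, used only to show a False result.
for candidate in ("2.1.0.dev202003312203", "2.1.0.dev202003300105", "2.1.0"):
    print(candidate, "->", is_at_least_fixed_build(candidate))
```

In a notebook, the installed version string can be read from tfds.__version__ after import tensorflow_datasets as tfds and passed to is_at_least_fixed_build.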
