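I'm trying to show a progress bar while using iterrows on a Dask DataFrame. However, it only shows the progress bar for what I assume is the first loop.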
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from tqdm.auto import tqdm
# Create a Dask array with many rows of data
data = da.ones((10000, 1))
# Convert the Dask array to a Dask DataFrame with a single column
df = dd.from_dask_array(data, columns=['value'])
with ProgressBar():
    for i, row in df.iterrows():
        process(row)
I would like it to show a progress bar for the whole loop.
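Update
ProgressBar serves a different purpose than tqdm: it reports the progress of Dask computations (the tasks the scheduler runs), whereas tqdm reports progress through a Python iterator. Try this to see the difference: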
import dask as ds
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import time
data = da.random.randint(1, 100, (10000, 2))
df = dd.from_dask_array(data, columns=['x', 'y'])
@ds.delayed
def taskA(df):
    time.sleep(1)
    return df['x'] * df['y']
@ds.delayed
def taskB(sr):
    time.sleep(1)
    return sr / 100
with ProgressBar():
    ds.compute(taskB(taskA(df)))
You can try:
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from tqdm.auto import tqdm
# Create a Dask array with many rows of data
data = da.ones((10000, 1))
# Convert the Dask array to a Dask DataFrame with a single column
df = dd.from_dask_array(data, columns=['value'])
for i, row in tqdm(df.iterrows(), total=len(df)):
    process(row)
Output:
100%|███████████████████████████████████████████████| 10000/10000 [00:00<00:00, 34659.83it/s]
To show a per-iteration progress bar when using iterrows on a Dask DataFrame, you can use the tqdm package to wrap the iterator returned by iterrows. Here is an example:
import time
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from tqdm.auto import tqdm
def process(row):
    time.sleep(0.001)
# Create a Dask array with many rows of data
data = da.ones((10000, 1))
# Convert the Dask array to a Dask DataFrame with a single column
df = dd.from_dask_array(data, columns=['value'])
# Use tqdm to wrap the iterator returned by iterrows
for i, row in tqdm(df.iterrows(), total=len(df)):
    process(row)
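In this example, tqdm wraps the iterator returned by df.iterrows(). The total argument is set to the length of the DataFrame so that tqdm knows how many iterations to expect (note that len(df) itself triggers a Dask computation). The progress bar updates on each iteration of the loop.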
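To make the wrapping idea concrete, here is a rough, standard-library-only sketch of what an iterator wrapper of this kind does. The `progress` helper and its output format are hypothetical and only illustrate the mechanism; this is not how tqdm is actually implemented, and the generator of (index, row) pairs merely mimics what iterrows yields.

```python
import sys


def progress(iterable, total, out=sys.stderr):
    """Yield items from `iterable` and write a simple counter to `out`.

    Hypothetical stand-in for tqdm, for illustration only.
    """
    for done, item in enumerate(iterable, start=1):
        yield item
        # Runs when the consumer asks for the next item, i.e. after the
        # loop body has finished with the current one.
        percent = 100 * done // total
        out.write(f"\r{percent:3d}% {done}/{total}")
        out.flush()
    out.write("\n")


# Mimic the (index, row) pairs produced by iterrows() with plain Python data.
rows = ((i, {"value": 1.0}) for i in range(5))
for i, row in progress(rows, total=5):
    pass  # process(row) would go here
```

Because the wrapper only sees the iterator, it counts loop iterations and knows nothing about the Dask computations behind them, which is why it behaves differently from ProgressBar.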