How to show a progress bar on dask DataFrame.iterrows

Question  Votes: 0  Answers: 2

I'm trying to show a progress bar while using iterrows on a dask DataFrame. However, it only shows a progress bar for what I assume is the first loop.

 import dask.array as da
 import dask.dataframe as dd
 from dask.diagnostics import ProgressBar
 from tqdm.auto import tqdm
 
 # Create a Dask array with many rows of data
 data = da.ones((10000, 1))
 
 # Convert the Dask array to a Dask DataFrame with a single column
 df = dd.from_dask_array(data, columns=['value'])
 
 with ProgressBar():
     for i, row in df.iterrows():
         process(row)
 

I want it to show a progress bar for the whole loop.

python pandas dask tqdm
2 Answers
1
vote

Update

ProgressBar and tqdm have different goals. Try this to see the difference:

import dask as ds
import dask.array as da
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import time

data = da.random.randint(1, 100, (10000, 2))
df = dd.from_dask_array(data, columns=['x', 'y'])

@ds.delayed
def taskA(df):
    time.sleep(1)
    return df['x'] * df['y']

@ds.delayed
def taskB(sr):
    time.sleep(1)
    return sr / 100
        
with ProgressBar():
    ds.compute(taskB(taskA(df)))
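
To make the difference concrete, here is a minimal sketch (not part of the original answer) that counts what the Dask diagnostics actually see. It assumes the dask.diagnostics.Callback hook and its _posttask method behave as documented for the local schedulers; TaskCounter, square and add_up are names made up for this illustration. The counter goes up once per finished graph task, which is the unit ProgressBar reports on, whereas tqdm counts iterations of a Python loop.

```python
import dask
from dask.diagnostics import Callback


class TaskCounter(Callback):
    """Hypothetical helper: counts finished Dask graph tasks."""

    def __init__(self):
        super().__init__()
        self.finished = 0

    def _posttask(self, key, result, dsk, state, worker_id):
        # Called by the local scheduler each time one graph task completes.
        self.finished += 1


@dask.delayed
def square(x):
    return x * x


@dask.delayed
def add_up(values):
    return sum(values)


counter = TaskCounter()
with counter:
    # One compute() call; the graph behind it holds several tasks.
    result = add_up([square(i) for i in range(5)]).compute()

print("result:", result)
print("graph tasks finished:", counter.finished)
```

The number printed for finished tasks depends on how Dask builds and optimizes the graph, so it is better read as "work units inside one compute call" than as a row count.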

You can try:

import dask.array as da
import dask.dataframe as dd
from tqdm.auto import tqdm

# Placeholder for the per-row work; replace with your own function
def process(row):
    pass

# Create a Dask array with many rows of data
data = da.ones((10000, 1))

# Convert the Dask array to a Dask DataFrame with a single column
df = dd.from_dask_array(data, columns=['value'])

# len(df) triggers a computation to count the rows
for i, row in tqdm(df.iterrows(), total=len(df)):
    process(row)

Output:

100%|███████████████████████████████████████████████| 10000/10000 [00:00<00:00, 34659.83it/s]

0
votes

To show a progress bar for each iteration when using iterrows on a Dask DataFrame, you can use the tqdm package to wrap the iterator returned by iterrows. Here is an example:

import time
import dask.array as da
import dask.dataframe as dd
from tqdm.auto import tqdm

# Simulate a small amount of work per row
def process(row):
    time.sleep(0.001)

# Create a Dask array with many rows of data
data = da.ones((10000, 1))

# Convert the Dask array to a Dask DataFrame with a single column
df = dd.from_dask_array(data, columns=['value'])

# Use tqdm to wrap the iterator returned by iterrows
for i, row in tqdm(df.iterrows(), total=len(df)):
    process(row)

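In this example, tqdm wraps the iterator returned by df.iterrows(). The total parameter is set to the length of the DataFrame so that tqdm knows how many iterations to expect. The progress bar updates on every iteration of the loop. Note that len(df) itself triggers a computation on a Dask DataFrame.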
