Dask - How do I use apply to combine Series into a DataFrame?

Question (votes: 1, answers: 1)

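How do I return multiple values from a function applied to a Dask Series? I am trying to return a Series from each call made by dask.Series.apply, so that the final result is a dask.DataFrame.

The code below tells me that meta is wrong. However, the pure pandas version works. What is wrong here?

Update: I think I am not specifying the meta/schema correctly. How should I do that? For now, it works when I remove the meta argument, but it raises a warning. I would like to use dask "correctly".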

import dask.dataframe as dd
import pandas as pd
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

def transformMyCol(x):
    #Minimal Example Function
    return(pd.Series(['Tom - ' + str(x),'Deskflip - ' + str(x / 8),'']))

#
## Pandas Version - Works as expected.
#
pandas_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
pandas_df.target.apply(transformMyCol,1)

#
## Dask Version (second attempt) - Raises a warning
#
df = dd.from_pandas(pandas_df, npartitions=10)

unpacked = df.target.apply(transformMyCol)
unpacked.head()

#
## Dask Version (first attempt) - Raises an exception 
#
df = dd.from_pandas(pandas_df, npartitions=10)

unpacked_dask_schema = {"name" : str, "action" : str, "comments" : str}

unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()

This is the error I get:

  File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
    raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata

I also tried the following, and it did not work either.

meta_df = pd.DataFrame(dtype='str',columns=list(unpacked_dask_schema.keys()))


unpacked = df.target.apply(transformMyCol, meta=meta_df)
unpacked.head()

Same error:

  File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
    raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
python pandas dataframe dask dask-distributed
1 Answer
3 votes

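You are right, the problem is that you did not specify meta correctly. More specifically, as the error message says, the metadata columns ("name", "action", "comments") do not match the columns in the computed data (0, 1, 2). You should either:

  1. Change the metadata columns to 0, 1, 2:

    unpacked_dask_schema = dict.fromkeys(range(3), str)
    df.target.apply(transformMyCol, meta=unpacked_dask_schema)

or

  2. Change transformMyCol to return named columns, so that they match the original metadata:

    def transformMyCol(x):
        return pd.Series({
            'name': 'Tom - ' + str(x),
            'action': 'Deskflip - ' + str(x / 8),
            'comments': '',
        })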