How do I specify the correct dtype for a column of lists when creating a dask DataFrame with from_pandas?

Question

When I create a dask DataFrame with the from_pandas method, the previously correct dtype object becomes string[pyarrow].

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(
    {
        "lists": [["a", "b"], ["c", "b", "a"], ["b"]],
        "a": [1, 1, 0],
        "b": [1, 0, 1],
        "c": [0, 1, 0],
    }
)

print(df)

       lists  a  b  c
0     [a, b]  1  1  0
1  [c, b, a]  1  0  1
2        [b]  0  1  0

print(df.dtypes)

lists    object
a         int64
b         int64
c         int64
dtype: object

I can access the first element of each list, and pandas works as expected:

print(df["lists"].str[0])

0    a
1    c
2    b
Name: lists, dtype: object

With dask, however, the dtype is wrong:

df = dd.from_pandas(df, npartitions=2)

print(df.dtypes)

lists    string[pyarrow]
a                  int64
b                  int64
c                  int64

Setting the dtype explicitly does not help either:

df["lists"] = df["lists"].astype(object)

print(df.dtypes)

lists    string[pyarrow]
a                  int64
b                  int64
c                  int64

How can I tell dask that this column does not contain strings?

Answer 1

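This appears to be a known issue.

For anyone storing complex data in a column, the recommendation is to set the dataframe.convert-string option to False and, for the columns that really do contain strings, to pass the string[pyarrow] dtype explicitly when reading the data.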

So, for the example above:

import dask.dataframe as dd
import pandas as pd
from dask import config

df = pd.DataFrame(
    {
        "lists": [["a", "b"], ["c", "b", "a"], ["b"]],
        "a": [1, 1, 0],
        "b": [1, 0, 1],
        "c": [0, 1, 0],
    }
)

config.set({"dataframe.convert-string": False})
df = dd.from_pandas(df, npartitions=2)
print(df.dtypes)

lists    object
a         int64
b         int64
c         int64
dtype: object

Alternatively, set the value in the dask configuration some other way.
