How do I remove all categorical columns from a pandas DataFrame?

Question

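Categorical columns are a great way to save memory in pandas, but sometimes they just slow things down, especially once you are past the stage of holding a large DataFrame and are working on subsets of it. For example, they do not seem to play well with printing in Jupyter or with libraries such as qgrid.

I basically want to remove all categorical columns from the DataFrame (that is, convert them back to their underlying dtypes) to speed up simple operations.

Here is an example: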

import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"],
                   "B": ["a", "b", "c", "a"],
                   "C": [0,3,0,3],
                   "D": [0.2,0.2,0.3,0.3],
                   "F": [0,1,2,3]
                  }
                 )
df["B"] = df["B"].astype('category')
df["C"] = df["C"].astype('category')
df["D"] = df["D"].astype('category')

This leaves some columns categorical (with different underlying types: int, float, str).

df.dtypes
A      object
B    category
C    category
D    category
F       int64
dtype: object

Ideally I would like something like this:

df = df.remove_all_categorical_columns();

which would return the original underlying types:

df.dtypes
A     object
B     object
C      int64
D    float64
F      int64
dtype: object

Answer 1

Similar to toto's answer, but without df.apply().

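def recover_dtypes(df):
    # Cast every categorical column back to the dtype of its categories.
    # Note: this modifies the DataFrame that is passed in.
    for col in df.columns:
        if df[col].dtype == 'category':
            df[col] = df[col].astype(df[col].cat.categories.to_numpy().dtype)
    return df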

df1 = recover_dtypes(df)
print(df1.dtypes)
>>>
A     object
B     object
C      int64
D    float64
F      int64
dtype: object
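
The loop above assigns back into the DataFrame it receives. If the original should stay untouched, a minimal sketch along the following lines may help. The function name decategorize is hypothetical, and the sketch assumes that no integer categorical column contains missing values (int64 cannot represent them).

```python
import pandas as pd

def decategorize(df):
    """Return a new DataFrame with categorical columns cast to their categories' dtype."""
    # Names of all columns whose dtype is 'category'
    cat_cols = df.select_dtypes(include="category").columns
    # Map each categorical column to the dtype of its categories
    target_dtypes = {col: df[col].cat.categories.dtype for col in cat_cols}
    # DataFrame.astype with a dict converts only the listed columns and returns a new object
    return df.astype(target_dtypes)

df = pd.DataFrame({"A": ["a", "b", "c", "a"],
                   "B": ["a", "b", "c", "a"],
                   "C": [0, 3, 0, 3],
                   "D": [0.2, 0.2, 0.3, 0.3],
                   "F": [0, 1, 2, 3]})
for name in ["B", "C", "D"]:
    df[name] = df[name].astype("category")

restored = decategorize(df)
print(restored.dtypes)
```

Because DataFrame.astype returns a new object, df keeps its categorical columns and only restored carries the underlying dtypes.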

Answer 2

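You can use df['column'].cat.categories.dtype to recover the original dtype. All that remains is to go through every column and apply df['column'].astype(df['column'].cat.categories.dtype).

The following works for your example (and is hopefully general enough for other cases):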

def uncategorize(col):
    # Non-categorical columns are returned unchanged.
    if col.dtype.name != 'category':
        return col
    try:
        return col.astype(col.cat.categories.dtype)
    except (TypeError, ValueError):
        # In case there is pd.NA (pandas >= 1.0), Int64 should be used instead of int64
        return col.astype(col.cat.categories.dtype.name.title())

df = df.apply(uncategorize, axis=0)

After that, the original dtypes are restored.

df.dtypes
A     object
B     object
C      int64
D    float64
F      int64
dtype: object
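
The fallback branch matters when an integer categorical column contains a missing value. The following is a minimal sketch of that case, not part of the original answer: the variable names are illustrative, and the column is built with Categorical.from_codes (where code -1 marks a missing entry) so that the categories are guaranteed to be integers.

```python
import pandas as pd

# Integer categories [0, 3]; the code -1 marks a missing value.
col = pd.Series(pd.Categorical.from_codes([0, 1, -1, 1], categories=[0, 3]))

try:
    # Plain int64 cannot hold a missing value, so this cast is expected to fail.
    col.astype(col.cat.categories.dtype)
    plain_cast_failed = False
except (TypeError, ValueError):
    plain_cast_failed = True

# "int64".title() gives "Int64", the name of the nullable integer dtype.
nullable = col.astype(col.cat.categories.dtype.name.title())
print(plain_cast_failed, nullable.dtype)
```

The missing entry survives as pd.NA in the nullable column instead of forcing the whole column to float.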
