A Pythonic way to reduce factor levels in a large dataframe

Question

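I am trying to reduce the number of factor levels in a column of a pandas dataframe, so that any level whose share of all rows in the column falls below a defined threshold (1% by default) is binned into a new level labeled "Other". Here is the function I am using to do this: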

def condenseMe(df, column_name, threshold = 0.01, newLabel = "Other"):

    valDict = dict(df[column_name].value_counts() / len(df[column_name]))
    toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
    if 'Missing' in toCondense:
        toCondense.remove('Missing')
    df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)

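The problem is that I am working with a large dataset (about 18 million rows) and am trying to use this function on a column with more than 10,000 levels. As a result, running it on this column takes a very long time. Is there a more Pythonic way to reduce the number of factor levels that runs faster? Any help would be much appreciated!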

Answer 1

You can combine groupby, transform, and count:

def condenseMe(df, col, threshold = 0.01, newLabel="Other"):
    # Create a new Series with the normalized value counts
    counts = df[[col]].groupby(col)[col].transform('count') / len(df)
    # Create a 1D mask based on threshold (ignoring "Missing")
    mask = (counts < threshold) & (df[col] != 'Missing')
    # Assign these masked values a new label
    # (.loc avoids chained assignment, which may write to a copy)
    df.loc[mask, col] = newLabel
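As a possible alternative, the sketch below keeps the asker's value_counts approach but replaces the row-by-row apply with a vectorized isin test. The apply version checks each of the 18 million rows against a Python list, which is likely where most of the time goes. The name condense_levels and the keep parameter are made up for illustration, and the sketch assumes the column has object (string) dtype rather than categorical dtype.

```python
import pandas as pd

def condense_levels(df, col, threshold=0.01, new_label="Other", keep=("Missing",)):
    # Hypothetical helper; `keep` lists labels that are never relabeled.
    # Share of all rows taken by each level, as in the original function.
    freqs = df[col].value_counts() / len(df)
    # Levels whose share is below the threshold, minus the protected labels.
    rare = set(freqs[freqs < threshold].index) - set(keep)
    # isin performs a vectorized, hash-based membership test over the column.
    mask = df[col].isin(rare)
    df.loc[mask, col] = new_label
    return df

# Small illustrative frame: "c" and "d" are rare, "Missing" is protected.
demo = pd.DataFrame({"f": ["a"] * 50 + ["b"] * 45 + ["c"] * 2 + ["d"] * 2 + ["Missing"]})
condense_levels(demo, "f", threshold=0.05)
print(demo["f"].value_counts())
```

Like the functions above, this modifies the dataframe in place. Whether it is faster than the groupby and transform version on a given dataset is something to measure rather than assume.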
