Creating a function to standardize labels for a given ID in Python

Question

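I am trying to create a function that standardizes a label column for a given ID based on a condition.

I want to standardize the labels to the most frequently used label for that ID. If there is no common/majority label, the first observation should be used as the default.

The function I have so far is below: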


def standardize_labels(df, id_col, label_col):
    # Function to find the most common label or the first one if there's a tie
    def most_common_label(group):
        labels = group[label_col].value_counts()
        # Check if the top two labels have the same count
        if len(labels) > 1 and labels.iloc[0] == labels.iloc[1]:
            return group[label_col].iloc[0]
        return labels.idxmax()

    # Group by the ID column and apply the most_common_label function
    common_labels = df.groupby(id_col).apply(most_common_label)

    # Map the IDs in the original DataFrame to their common labels
    df['standardized_label'] = df[id_col].map(common_labels)

    return df

It mostly works, but I have noticed some odd cases where the trend in the labels shifts and the standardized label then changes within a given ID, like so:

ID	raw_label	standardized_label
222	洛杉矶地铁	洛杉矶地铁
222	洛杉矶地铁	洛杉矶地铁
222	洛杉矶地铁	洛杉矶地铁
222	洛杉矶地铁	洛杉矶地铁
222	洛杉矶地铁	洛杉矶地铁

Instead, the output I want is for every standardized_label to be LA Metro, since that is the majority label for the ID.
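To see which label should win for an ID, it can help to count the labels per ID directly. The following is a minimal sketch, assuming columns named ID and raw_label. The sample rows, including the "Los Angeles Metro" variant, are made up for illustration and are not the original data.

```python
import pandas as pd

# Hypothetical sample data: the label variants are invented for illustration.
df = pd.DataFrame({
    "ID": [222, 222, 222, 222, 222],
    "raw_label": [
        "LA Metro", "LA Metro", "LA Metro",
        "Los Angeles Metro", "Los Angeles Metro",
    ],
})

# Count how often each distinct label occurs within each ID.
# Labels that differ in spelling, case, or whitespace are counted separately.
counts = df.groupby("ID")["raw_label"].value_counts()
print(counts)
```

If the counts for an ID show a clear majority but the standardized column still varies within that ID, the ID values themselves are worth inspecting as well.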

Answer 1

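The code works as expected for me. However, you can use mode to make it easier to read. You can also use transform on the groupby to assign directly to the column, which turns the whole operation into a one-liner.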

df['standardized_label'] = df.groupby('ID')['raw_label'].transform(lambda x: x.mode()[0])
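As a quick check of the one-liner, here is a small sketch on made-up data. The label values are hypothetical; the column names follow the line above. With three rows of one label and two of another, every row of the ID should receive the majority label.

```python
import pandas as pd

# Hypothetical sample data for a single ID with a clear majority label.
df = pd.DataFrame({
    "ID": [222, 222, 222, 222, 222],
    "raw_label": [
        "LA Metro", "LA Metro", "LA Metro",
        "Los Angeles Metro", "Los Angeles Metro",
    ],
})

# transform broadcasts the per-group scalar back onto every row of the group.
df["standardized_label"] = df.groupby("ID")["raw_label"].transform(
    lambda x: x.mode()[0]
)
print(df)
```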

Alternatively, you can use groupby.apply and map the result. Either way, the function would look like:

def standardize_labels(df, id_col, label_col):
    # Most common label; on a tie, mode() returns the tied labels in sorted
    # order, so [0] picks the first of those, not the first observation
    def most_common_label(group):
        return group.mode()[0]

    # Group by the ID column and apply the most_common_label function
    common_labels = df.groupby(id_col)[label_col].apply(most_common_label)

    # Map the IDs in the original DataFrame to their common labels
    df['standardized_label'] = df[id_col].map(common_labels)

    return df
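One difference from the question's rule is worth noting: when several labels share the top count, mode() returns all of them in sorted order, so mode()[0] is the first in sorted order rather than the first observation. A minimal sketch of one way to keep the "first observation on a tie" behavior is below. The helper name most_common_or_first and the sample data are hypothetical.

```python
import pandas as pd

def most_common_or_first(labels):
    """Return the single most common label, or the first observation on a tie."""
    modes = labels.mode()  # every label that shares the highest count
    if len(modes) == 1:
        return modes.iloc[0]
    # No unique majority (or no non-missing labels): fall back to the first row.
    return labels.iloc[0]

# Hypothetical sample data: ID 1 has a two-way tie, ID 2 has a clear majority.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "raw_label": ["Metro B", "Metro A", "Metro B", "Metro A", "X", "Y", "Y"],
})

df["standardized_label"] = df.groupby("ID")["raw_label"].transform(most_common_or_first)

# For comparison: the plain mode()[0] version resolves ties by sort order.
df["mode_only"] = df.groupby("ID")["raw_label"].transform(lambda x: x.mode()[0])
print(df)
```

For ID 1 the two columns differ: the helper keeps the first observed label, while mode()[0] picks the tied label that sorts first.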
