How to transform the first n values in a dataframe column based on a condition?

Question description  Votes: 0  Answers: 1


import pandas as pd

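df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3, 3],
    "review": ["The second review", "The second review", "The second review",
               "This is the first review", "This is the first review",
               "This is the first review", "This is the first review",
               "This is the first review", "Not Noo", "Not Noo"],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "Not", "Noo"],
    # NOTE: the "score" column is missing from this copy of the post. The values
    # below are placeholders, chosen only so that the outputs shown are reproduced.
    "score": [0.5, 0.1, 0.9, 0.1, 0.6, 0.8, 0.2, 0.3, 0.4, 0.7],
})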

# Identify the row with the max score for each review
token_max_score = df.groupby("review_num", sort=False)["score"].idxmax()

# Keep only the rows with the max score for each review
Modified_df = df.loc[token_max_score, ["review_num", "review"]]

def modify_word(w):
    return w + "E"  # just to simplify the example

# Add the new column
Modified_df = Modified_df.join(
    pd.DataFrame(
        {
            "Modified_review": [
                txt.replace(w, modify_word(w))
                for w, txt in zip(
                    df.loc[token_max_score, "token"], df.loc[token_max_score, "review"]
                )
            ]
        },
        index=Modified_df.index,  # align the new column with rows 2, 5, 9
    )
)

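I need to apply the transformation function n times per review (to the n highest-scoring tokens), not just once as my code does.

The current modified dataframe is:
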
   review_num                    review            Modified_review
2           2         The second review         The second reviewE
5           1  This is the first review  This is theE first review
9           3                   Not Noo                   Not NooE

The expected modified dataframe for n=2 is:

   review_num                    review              Modified_review
2           2         The second review          TheE second reviewE
5           1  This is the first review   This isE theE first review
9           3                   Not Noo                    NotE NooE
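
The difference between the two tables comes down to selecting the n highest-scoring rows per review rather than the single row that idxmax returns. A minimal sketch of that selection step alone, assuming made-up score values (the `score` numbers and the names `top_n` and `top_tokens` are illustrative and not taken from the original post):

```python
import pandas as pd

# Illustrative data: the score values are assumptions, not the asker's real data.
df = pd.DataFrame({
    "review_num": [2, 2, 2, 3, 3],
    "token": ["The", "second", "review", "Not", "Noo"],
    "score": [0.5, 0.1, 0.9, 0.4, 0.7],
})

n = 2

# Sort by score (highest first), then keep the first n rows of each review.
top_n = (
    df.sort_values("score", ascending=False)
      .groupby("review_num", sort=False)
      .head(n)
)

# Collect the selected tokens per review, highest score first.
top_tokens = top_n.groupby("review_num")["token"].agg(list).to_dict()
print(top_tokens)
```

With n=1 this picks the same rows as idxmax, so it can be read as a generalization of the code above rather than a different method.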


python pandas dataframe group-by

Here is one way to do it with Pandas apply:

# Group tokens and scores by review, then sort them by score in descending order
df = df.groupby(["review_num", "review"]).agg(list)[["token", "score"]]
df["token_and_score"] = df.apply(
    lambda x: {t: s for t, s in zip(x["token"], x["score"])}, axis=1
)
df["token_and_score"] = df["token_and_score"].apply(
    lambda x: sorted(x.items(), key=lambda y: y[1], reverse=True)
)

# Iterate on new column "modified_review" and apply 'modify_word' function
df = df.reset_index()
df["modified_review"] = df["review"]
N = 2
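for i in range(N):
    # On pass i, modify the word matching the i-th highest-scoring token
    df["modified_review"] = df.apply(
        lambda x: " ".join(
            [
                modify_word(word)
                if (
                    i < len(x["token_and_score"]) and word == x["token_and_score"][i][0]
                )
                else word
                for word in x["modified_review"].split(" ")
            ]
        ),
        axis=1,
    )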

# Cleanup
df = df[["review_num", "review", "modified_review"]]


# Output
   review_num                    review             modified_review
0           1  This is the first review  This isE theE first review
1           2         The second review         TheE second reviewE
2           3                   Not Noo                   NotE NooE
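
One design point worth noting: the code above splits each review into words and compares whole words, whereas the code in the question uses str.replace, which matches substrings. A small sketch of the difference, using a made-up sentence (the example text and the helper name `replace_whole_words` are assumptions for illustration only):

```python
def modify_word(w):
    return w + "E"  # same toy transformation as in the question


def replace_whole_words(text, targets):
    """Apply modify_word to each space-separated word that is in targets."""
    return " ".join(
        modify_word(word) if word in targets else word
        for word in text.split(" ")
    )


text = "the other theme"  # hypothetical review text

# str.replace matches substrings, so "other" and "theme" are altered too.
substring_result = text.replace("the", modify_word("the"))
print(substring_result)

# Word-by-word comparison only touches the exact token.
whole_word_result = replace_whole_words(text, {"the"})
print(whole_word_result)
```

A word with punctuation attached (for example "review,") would not match under this whitespace split, so whether the approach fits depends on how the tokens were produced.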