I have a pandas DataFrame containing reviews. For each review, I have its individual words (tokens) and a score for each one, as follows:
import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3,3],
    "review": ["The second review", "The second review", "The second review",
               "This is the first review", "This is the first review",
               "This is the first review", "This is the first review",
               "This is the first review", "Not Noo", "Not Noo"],
    "token_num": [1,2,3,1,2,3,4,5,1,2],
    "token": ["The", "second", "review", "This", "is", "the", "first", "review", "Not", "Noo"],
    "score": [0.3,-0.6,0.4,0.5,0.6,0.7,-0.6,0.4,0.5,0.6]
})
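With the following code, I can modify each review by applying a transformation function to its highest-scoring word, and build a new DataFrame containing both the old and the new reviews.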
# Identify the line with the max score for each review
token_max_score = df.groupby("review_num", sort=False)["score"].idxmax()
# keep only lines with max score by review
Modified_df = df.loc[token_max_score, ["review_num", "review"]]
def modify_word(w):
    return w + "E"  # just to simplify the example
# Add the new column
Modified_df = Modified_df.join(
    pd.DataFrame(
        {
            "Modified_review": [
                txt.replace(w, modify_word(w))
                for w, txt in zip(
                    df.loc[token_max_score, "token"], df.loc[token_max_score, "review"]
                )
            ]
        },
        index=token_max_score,
    )
)
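I need to apply the transformation function n times (to the n highest-scoring words) rather than just once, as my code does.
The current modified DataFrame is: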
   review_num                    review            Modified_review
2           2         The second review         The second reviewE
5           1  This is the first review  This is theE first review
9           3                   Not Noo                   Not NooE
The expected modified DataFrame for n=2 is:
   review_num                    review             Modified_review
2           2         The second review         TheE second reviewE
5           1  This is the first review  This isE theE first review
9           3                   Not Noo                   NotE NooE
Thanks for your help.
Here is one way to do it using pandas apply:
# Group and sort in descending order tokens and scores
df = df.groupby(["review_num", "review"]).agg(list)[["token", "score"]]
df["token_and_score"] = df.apply(
    lambda x: {t: s for t, s in zip(x["token"], x["score"])}, axis=1
)
df["token_and_score"] = df["token_and_score"].apply(
    lambda x: sorted(x.items(), key=lambda y: y[1], reverse=True)
)
# Iterate on new column "modified_review" and apply 'modify_word' function
df = df.reset_index()
df["modified_review"] = df["review"]
N = 2
for i in range(N):
    df["modified_review"] = df.apply(
        lambda x: " ".join(
            [
                modify_word(word)
                if (
                    i < len(x["token_and_score"]) and word == x["token_and_score"][i][0]
                )
                else word
                for word in x["modified_review"].split(" ")
            ]
        ),
        axis=1,
    )
# Cleanup
df = df[["review_num", "review", "modified_review"]]
Then:
print(df)
# Output
   review_num                    review             modified_review
0           1  This is the first review  This isE theE first review
1           2         The second review         TheE second reviewE
2           3                   Not Noo                   NotE NooE
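
Note that the approach above matches words by their text, so a word that occurs more than once in a review would be modified at every occurrence, and the token-to-score dictionary keeps only one score per distinct word. If that matters for your data, a possible alternative is to select tokens by position instead. The following is only a sketch, not a drop-in replacement: `modify_top_n` and the `new_token` column are names made up for this example, it starts again from the original DataFrame in the question, and it assumes each review is exactly its tokens joined by single spaces (true for the sample data, not necessarily for real tokenizers).

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3, 3],
    "review": ["The second review"] * 3
              + ["This is the first review"] * 5
              + ["Not Noo"] * 2,
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1, 2],
    "token": ["The", "second", "review", "This", "is", "the",
              "first", "review", "Not", "Noo"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.6, 0.7, -0.6, 0.4, 0.5, 0.6],
})


def modify_word(w):
    return w + "E"  # placeholder transformation, as in the question


def modify_top_n(frame, n):
    """Hypothetical helper: modify the n highest-scoring tokens of each review."""
    frame = frame.copy()
    # Rank tokens within each review; rank 1 is the highest score.
    # method="first" breaks ties by order of appearance.
    rank = frame.groupby("review_num")["score"].rank(method="first", ascending=False)
    top = rank <= n
    frame["new_token"] = frame["token"]
    frame.loc[top, "new_token"] = frame.loc[top, "token"].map(modify_word)
    # Rebuild each review from its tokens, in token order
    # (assumption: the review is the tokens joined by single spaces)
    return (
        frame.sort_values(["review_num", "token_num"])
        .groupby("review_num", as_index=False)
        .agg(
            review=("review", "first"),
            modified_review=("new_token", lambda s: " ".join(s)),
        )
    )


result = modify_top_n(df, 2)
print(result)
```

For the sample data and n=2 this gives the same three modified reviews as above. Because the selection is done per row rather than per word string, only the ranked token positions are changed, even if the same word appears elsewhere in the review.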