Dynamically updating a column in a Python DataFrame using substrings

Question

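Suppose I have a DataFrame, let's call it A:

Key_Word	Code
Market	A1
Theater	A2

I also have another DataFrame, let's call it B:

Sentence	Components_from_A
John went to the theater
Mary went to the Market and then the theater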

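I want to find a way to dynamically update DataFrame B so that it loops through each B.Sentence and, if it finds an instance of A.Key_Word, records the corresponding A.Code. If a keyword from A appears more than once in a sentence, I want its code to appear only once in B.Components_from_A. If several keywords appear, I want them recorded as Code1+Code2+Code3...

Sentence	Components_from_A
John went to the theater	A2
Mary went to the Market and then the theater	A1+A2
Jack decided to go to the theater now as opposed to going to the theater later	A2

I think I could build something like this: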

    import pandas as pd
    import numpy as np

    A=pd.read_csv('A.csv')
    B=pd.read_csv('B.csv')
    Dict=dict(zip(A.Key_Word,A.Code))

What I'm struggling with is dynamically updating the column in B, and the logic needed to do it the way I described.

Answer 1

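You can do this by building a dictionary from DataFrame A and iterating over each sentence in B: check whether each keyword occurs in the sentence and collect the matching codes.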

import pandas as pd

data_A = {'Key_Word': ['Market', 'Theater'], 'Code': ['A1', 'A2']}
data_B = {'Sentence': [
    'John went to the theater',
    'Mary went to the Market and then the theater',
    'Jack decided to go to the theater now as opposed to going to the theater later'
]}

A = pd.DataFrame(data_A)
B = pd.DataFrame(data_B)

keyword_to_code = dict(zip(A.Key_Word, A.Code))

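def update_components(sentence, keyword_to_code):
    # A set keeps each code only once, even if its keyword repeats in the sentence
    codes = set()
    for keyword, code in keyword_to_code.items():
        # Case-insensitive substring check
        if keyword.lower() in sentence.lower():
            codes.add(code)
    # Sort for a stable order, then join as Code1+Code2+...
    return '+'.join(sorted(codes))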

B['Components_from_A'] = B['Sentence'].apply(lambda x: update_components(x, keyword_to_code))

print(B)

This gives you:

                                            Sentence Components_from_A
0                           John went to the theater                A2
1       Mary went to the Market and then the theater             A1+A2
2  Jack decided to go to the theater now as oppos...                A2
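
One caveat with the `in` check above: it matches substrings, so a keyword such as Market would also be found inside "supermarket". If whole-word matching is wanted, a minimal sketch using a regular expression with word boundaries might look like the following. The function name `find_codes` and the "supermarket" sentence are assumptions made for illustration; they are not part of the original question or answer.

```python
import re

import pandas as pd

A = pd.DataFrame({'Key_Word': ['Market', 'Theater'], 'Code': ['A1', 'A2']})
B = pd.DataFrame({'Sentence': [
    'John went to the theater',
    'Mary went to the supermarket and then the theater',  # hypothetical sentence
    'Mary went to the Market and then the theater',
]})

keyword_to_code = dict(zip(A.Key_Word, A.Code))

def find_codes(sentence, keyword_to_code):
    """Return the codes of all keywords that occur as whole words, joined by '+'."""
    codes = set()
    for keyword, code in keyword_to_code.items():
        # \b anchors the match at word boundaries; re.escape guards against
        # keywords that contain regex metacharacters.
        word = r'\b' + re.escape(keyword) + r'\b'
        if re.search(word, sentence, flags=re.IGNORECASE):
            codes.add(code)
    return '+'.join(sorted(codes))

B['Components_from_A'] = B['Sentence'].apply(lambda s: find_codes(s, keyword_to_code))
print(B)
```

With this variant the "supermarket" sentence should only pick up the code for Theater, since "market" is not a separate word there.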

Answer 2

You can craft a regex, extractall the keywords, merge with A, then drop_duplicates and groupby.agg:

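import re

# Casefold the keywords so they can be matched against casefolded sentences
A['Key_Word'] = A['Key_Word'].str.casefold()

# One capture group containing all keywords as alternatives
pattern = '(%s)' % '|'.join(map(re.escape, A['Key_Word']))
# '(market|theater)'

B['Components_from_A'] = (B['Sentence'].str.casefold().str.extractall(pattern)
 .droplevel('match').reset_index(0)          # one row per match, keyed by B's row label
 .merge(A, left_on=0, right_on='Key_Word')   # look up the code of each matched keyword
 .drop_duplicates(['index', 'Code'])         # keep each code once per sentence
 .groupby('index')['Code'].agg('+'.join)     # join the codes of each sentence with '+'
)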

Output:

                                            Sentence Components_from_A
0                           John went to the theater                A2
1       Mary went to the Market and then the theater             A1+A2
2  Jack decided to go to the theater now as oppos...                A2
