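Suppose I have a DataFrame, call it A:
Key_Word | Code |
---|---|
Market | A1 |
Theater | A2 |
I also have another DataFrame, call it B:
Sentence | Components_from_A |
---|---|
John went to the theater | |
Mary went to the Market and then the theater | |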
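I want a way to update DataFrame B dynamically: loop over each B.Sentence and, whenever an instance of A.Key_Word is found, record the matching A.Code. If a keyword from A appears more than once in a sentence, its code should appear only once in B.Components_from_A. If several different keywords appear, they should be recorded as Code1+Code2+Code3 and so on, so that B ends up like this:
Sentence | Components_from_A |
---|---|
John went to the theater | A2 |
Mary went to the Market and then the theater | A1+A2 |
Jack decided to go to the theater now as opposed to going to the theater later | A2 |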
I think I could start from something like this:
import pandas as pd
import numpy as np
A=pd.read_csv('A.csv')
B=pd.read_csv('B.csv')
Dict=dict(zip(A.Key_Word,A.Code))
What I am struggling with is updating the column in B dynamically, and the logic needed to do it the way I described.
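You can do this by building a dictionary from DataFrame A and iterating over each sentence in B. For every sentence, check which keywords occur in it and collect their codes: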
import pandas as pd
data_A = {'Key_Word': ['Market', 'Theater'], 'Code': ['A1', 'A2']}
data_B = {'Sentence': [
    'John went to the theater',
    'Mary went to the Market and then the theater',
    'Jack decided to go to the theater now as opposed to going to the theater later'
]}
A = pd.DataFrame(data_A)
B = pd.DataFrame(data_B)
# Map each keyword to its code, e.g. {'Market': 'A1', 'Theater': 'A2'}
keyword_to_code = dict(zip(A.Key_Word, A.Code))
def update_components(sentence, keyword_to_code):
    # A set keeps each code at most once, even if the keyword repeats
    codes = set()
    for keyword, code in keyword_to_code.items():
        # Case-insensitive substring test
        if keyword.lower() in sentence.lower():
            codes.add(code)
    # Sort so the joined result is deterministic, e.g. 'A1+A2'
    return '+'.join(sorted(codes))
B['Components_from_A'] = B['Sentence'].apply(lambda x: update_components(x, keyword_to_code))
print(B)
This sets Components_from_A to A2, A1+A2, and A2 for the three sentences.
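One caveat with the substring test in this answer: `in` also matches inside longer words, so a sentence containing "supermarket" would pick up the code for "Market". If whole-word matching is what you want, a minimal sketch could use a word-boundary regex instead. It assumes the same Key_Word, Code, and Sentence columns as above; the helper name codes_for_sentence and the "supermarket" sentence are made up for illustration.

```python
import re
import pandas as pd

A = pd.DataFrame({'Key_Word': ['Market', 'Theater'], 'Code': ['A1', 'A2']})
B = pd.DataFrame({'Sentence': [
    'John went to the theater',
    'Mary went to the Market and then the theater',
    'Sam went to the supermarket',  # hypothetical: should NOT match 'Market'
]})

# Pre-compile one whole-word, case-insensitive pattern per keyword.
patterns = [
    (re.compile(r'\b' + re.escape(keyword) + r'\b', flags=re.IGNORECASE), code)
    for keyword, code in zip(A['Key_Word'], A['Code'])
]

def codes_for_sentence(sentence):
    # The set keeps each code once; sorting makes the output deterministic.
    found = {code for pattern, code in patterns if pattern.search(sentence)}
    return '+'.join(sorted(found))

B['Components_from_A'] = B['Sentence'].apply(codes_for_sentence)
print(B)
```

Note that `\b` only behaves as a word boundary when the keyword starts and ends with a letter, digit, or underscore, so keywords with leading or trailing punctuation would need a different delimiter rule.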
Alternatively, use extractall to pull out the keywords, then merge with A, drop_duplicates, and groupby.agg:
import re
# Note: this overwrites A['Key_Word'] with its case-folded form
A['Key_Word'] = A['Key_Word'].str.casefold()
pattern = '(%s)' % '|'.join(map(re.escape, A['Key_Word']))
# '(market|theater)'
B['Components_from_A'] = (B['Sentence'].str.casefold().str.extractall(pattern)
                          .droplevel('match').reset_index(0)
                          .merge(A, left_on=0, right_on='Key_Word')
                          .drop_duplicates(['index', 'Code'])
                          .groupby('index')['Code'].agg('+'.join)
                          )
Output:
                                            Sentence Components_from_A
0                           John went to the theater                A2
1       Mary went to the Market and then the theater             A1+A2
2  Jack decided to go to the theater now as oppos...                A2
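With the extractall approach, a sentence that contains no keyword produces no rows in the extracted frame, so it ends up as NaN once the result is assigned back to B. If an empty string is preferable, one possible variant uses Series.str.findall instead and leaves A unmodified. This is only a sketch: the helper join_codes, the mapping name code_by_keyword, and the no-keyword sentence are illustrative assumptions, not part of the original data.

```python
import re
import pandas as pd

A = pd.DataFrame({'Key_Word': ['Market', 'Theater'], 'Code': ['A1', 'A2']})
B = pd.DataFrame({'Sentence': [
    'John went to the theater',
    'Mary went to the Market and then the theater',
    'Nobody went anywhere',  # hypothetical sentence with no keyword
]})

# Map case-folded keywords to codes without modifying A itself.
code_by_keyword = dict(zip(A['Key_Word'].str.casefold(), A['Code']))
pattern = '|'.join(map(re.escape, code_by_keyword))

def join_codes(matches):
    # dict.fromkeys drops repeated codes while keeping first-appearance order.
    return '+'.join(dict.fromkeys(code_by_keyword[m] for m in matches))

B['Components_from_A'] = (
    B['Sentence'].str.casefold().str.findall(pattern).apply(join_codes)
)
print(B)
```

Here the codes are joined in the order their keywords first appear in the sentence rather than sorted, which matches the behaviour of the merge-based pipeline for the sample data.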