Extract the words common to two columns in different tables (Python)

Question  Votes: 3  Answers: 2

I want to extract every word in df1 that matches a word in df2.

df1 = pd.DataFrame(['Dog has 4 legs.It has 2 eyes.','Fish has fins','Cat has paws.It eats fish','Monkey has tail'],columns=['Description'])

df2 = pd.DataFrame(['Fish','Legs','Eyes'],columns=['Parts'])


 Df1                                             Df2
|---------------------------------|             |---------------------------------|
|         **Description**         |             |          Parts                  |     
|---------------------------------|             |---------------------------------|
|  Dog has 4 legs.It has 2 eyes.  |             | Fish                            |
|---------------------------------|             |---------------------------------|
|  Fish has fins                  |             | Legs                            | 
|---------------------------------|             |---------------------------------|
|  Cat has paws.It eats fish      |             | Eyes                            |
|---------------------------------|             |---------------------------------|
|  Monkey has tail                |
|---------------------------------|

Desired output:

|---------------------------------|-----------|
|         **Description**         |Parts      |
|---------------------------------|-----------|
|  Dog has 4 legs.It has 2 eyes.  |Legs,Eyes  |
|---------------------------------|-----------|
|  Fish has fins                  |Fish       |   
|---------------------------------|-----------|
|  Cat has paws.It eats fish.     |Fish       | 
|---------------------------------|-----------|
|  Monkey has tail                |           |   
|---------------------------------|-----------|
python pandas dataframe text match
2 answers
2 votes

IIUC, use str.extractall to collect all the matches, then groupby on the index to build a list or aggregate them.

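import re

# Build an alternation pattern from the lookup column: 'Fish|Legs|Eyes'
pat = '|'.join(df2['Parts'].tolist())

# extractall returns one row per match, indexed by (original row, match number);
# grouping on level 0 joins the matches of each description into one string.
df1['Parts'] = (df1['Description']
                .str.extractall(f"({pat})", flags=re.IGNORECASE)
                .groupby(level=0)[0]
                .agg(','.join))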

print(df1)
                     Description      Parts
0  Dog has 4 legs.It has 2 eyes.  legs,eyes
1                  Fish has fins       Fish
2      Cat has paws.It eats fish       fish
3                Monkey has tail        NaN
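
Two things to watch with the pattern above: the values from df2 go into the regex unescaped and without word boundaries, so a part would also match inside a longer word, and the result keeps the spelling found in the description rather than the one in df2. A minimal sketch of one way to handle both, assuming whole-word matching and df2's spelling are what is wanted (the name canonical is made up for this illustration and is not part of the answer):

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    ['Dog has 4 legs.It has 2 eyes.', 'Fish has fins',
     'Cat has paws.It eats fish', 'Monkey has tail'],
    columns=['Description'])
df2 = pd.DataFrame(['Fish', 'Legs', 'Eyes'], columns=['Parts'])

# Escape each part so regex metacharacters are taken literally,
# and wrap the alternation in \b so only whole words match.
pat = r'\b(' + '|'.join(re.escape(p) for p in df2['Parts']) + r')\b'

# Hypothetical helper mapping: lowercase match -> spelling used in df2.
canonical = {p.lower(): p for p in df2['Parts']}

# One row per match; column 0 holds the text captured by the single group.
matches = df1['Description'].str.extractall(pat, flags=re.IGNORECASE)[0]

# Normalise the spelling, then join the matches of each original row.
df1['Parts'] = (matches.str.lower()
                       .map(canonical)
                       .groupby(level=0)
                       .agg(','.join))
print(df1)
```

Rows without any match (here the Monkey row) still end up as NaN, because the grouped result has no entry for that index.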

1 vote

@Datanovice's solution is nicer because everything stays inside pandas. Here is an alternative that is also faster (string operations in pandas are not that fast).

from itertools import product
from collections import defaultdict
res = df2.Parts.str.lower().array
d = defaultdict(list)
for description, word in product(df1.Description, res):
    if word in description.lower():
        d[description].append(word)

d

defaultdict(list,
            {'Dog has 4 legs.It has 2 eyes.': ['legs', 'eyes'],
             'Fish has fins': ['fish'],
             'Cat has paws.It eats fish': ['fish']})

df1['parts'] = df1.Description.map(d).str.join(',')
       Description                    parts
0   Dog has 4 legs.It has 2 eyes.   legs,eyes
1   Fish has fins                   fish
2   Cat has paws.It eats fish       fish
3   Monkey has tail 
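
The loop above keys the dictionary by the description text, so repeated descriptions would share one entry. A possible variant of the same substring check, sketched here with a per-row helper instead of the dictionary (find_parts is a hypothetical name, not from the answer):

```python
import pandas as pd

df1 = pd.DataFrame(
    ['Dog has 4 legs.It has 2 eyes.', 'Fish has fins',
     'Cat has paws.It eats fish', 'Monkey has tail'],
    columns=['Description'])
df2 = pd.DataFrame(['Fish', 'Legs', 'Eyes'], columns=['Parts'])

# Lowercase the lookup words once, outside the per-row function.
words = df2['Parts'].str.lower().tolist()

def find_parts(description):
    """Return the comma-joined lookup words found as substrings (hypothetical helper)."""
    text = description.lower()
    return ','.join(w for w in words if w in text)

df1['parts'] = df1['Description'].map(find_parts)
print(df1)
```

As in the loop version, this is plain substring matching, and the words come out in df2 order rather than in the order they appear in the description.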