Compare a string Series against a list of strings and get the substring matches


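I want to check whether a column in one DataFrame contains any of the strings from a column in another DataFrame as a substring.

In my example, for DF2['Column y'] I want:

  • 'manager' matched against 'Software Developer Manager'
  • 'executive' matched against 'Online Bidding Executive', and so on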

DF1:

      unique_values  counts  Rank  Stop_Word
0       manager    9322   1.0      False
1           for    8463   2.0       True
2     developer    7323   3.0      False
3     executive    5864   4.0      False
4      engineer    5669   5.0      False
5         sales    4492   6.0      False

DF2:

                                           ColumnX   Column y
0                           Digital Media Planner.        NaN
1                        Online Bidding Executive.  Executive
2                       Software Developer Manager    Manager
3                               Technical Support.        NaN
4               Software Test Engineer -hyderabad.   engineer
5          Opening For Adobe Analytics Specialist.        NaN
6  Sales- Fresher-for Leading Property Consultant.        NaN
7           Opportunity For Azure Devops Architect        NaN
8                                             BDE.        NaN
9              Technical Support/ Product Support.        NaN

* DF2['Column y'] is the output I want.

Also, if more than one substring is present, the one with the lowest rank must be chosen, as at index 2 of DF2, where 'manager' is chosen over 'developer'.

python pandas string dataframe comparison
1 Answer

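I would use apply. It is essentially a map: it applies a function to every row or every column. The result can be stored in a column of its own, as the code further down does.

Building the DataFrames...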
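As a brief aside on how apply treats rows versus columns, here is a minimal sketch on a toy frame. The frame and its column names 'a' and 'b' are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Toy frame; the column names 'a' and 'b' are illustrative only.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
col_sums = toy.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series.
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```

With axis=1 the function sees one row at a time, which is the mode needed when each title has to be checked against the whole keyword table.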

import pandas as pd
import re

df1 = {'unique_values': ['manager', 'for', 'developer', 'executive', 'engineer', 'sales'],
       'counts': [9322, 8463, 7323, 5864, 5669, 4492],
       'Rank': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
       'Stop_word': [False, True, False, False, False, False]}
df1 = pd.DataFrame.from_dict(df1)

df2 = {'X': ['Digital Media Planner',
            'Online Bidding Executive',
            'Software Developer Manager',
            'Technical Support',
            'Software Test Engineer -hyderabad.Software Test Engineer -hyderabad',
            'Opening For Adobe Analytics Specialist.',
            'Sales- Fresher-for Leading Property Consultant.',
            'Opportunity For Azure Devops Architect',
            'BDE',
            'Technical Support/ Product Support.']}
df2 = pd.DataFrame.from_dict(df2)

The solution...

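def method(df1, df2_value):
    # Walk df1 in ascending Rank order so the lowest-ranked match wins.
    for row in df1.sort_values('Rank').itertuples(index=False):
        # re.escape makes the keyword match as literal text, not as a regex.
        if re.search(re.escape(row.unique_values), df2_value, re.IGNORECASE):
            # If the best match is a stop word, the row gets no value.
            return None if row.Stop_word else row.unique_values
    # No keyword occurs in this value.
    return None

# Apply to the X column only; each call receives one title string.
df2['Y'] = df2['X'].apply(lambda value: method(df1, value))
print(df2)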

Output:

                                                X          Y
0                              Digital Media Planner       None
1                           Online Bidding Executive  executive
2                         Software Developer Manager    manager
3                                  Technical Support       None
4  Software Test Engineer -hyderabad.Software Tes...   engineer
5            Opening For Adobe Analytics Specialist.       None
6    Sales- Fresher-for Leading Property Consultant.       None
7             Opportunity For Azure Devops Architect       None
8                                                BDE       None
9                Technical Support/ Product Support.       None
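
As a possible alternative to the row-by-row loop (a sketch only, not part of the answer above), all keywords can be combined into one regular expression and the lowest-ranked hit picked afterwards. The names `titles`, `rank`, `stop`, and `best_match` are hypothetical, and the data is a trimmed copy of the example. One caveat: findall returns non-overlapping matches, so this assumes no keyword overlaps another inside the text.

```python
import re
import pandas as pd

# Trimmed copy of the example data; 'counts' is omitted because it is unused.
df1 = pd.DataFrame({
    'unique_values': ['manager', 'for', 'developer', 'executive', 'engineer', 'sales'],
    'Rank': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'Stop_word': [False, True, False, False, False, False],
})
titles = pd.Series(['Digital Media Planner',
                    'Online Bidding Executive',
                    'Software Developer Manager',
                    'Sales- Fresher-for Leading Property Consultant.'])

# Lookup tables keyed by keyword (hypothetical helper names).
rank = dict(zip(df1['unique_values'], df1['Rank']))
stop = dict(zip(df1['unique_values'], df1['Stop_word']))

# One alternation pattern covering every keyword, escaped as literal text.
pattern = '|'.join(re.escape(word) for word in df1['unique_values'])

def best_match(hits):
    """Return the lowest-ranked keyword found, or None if it is a stop word."""
    if not hits:
        return None
    best = min(hits, key=rank.get)
    return None if stop[best] else best

# Lower-case the titles so every hit matches a key in the lookup tables.
hits = titles.str.lower().str.findall(pattern)
result = [best_match(h) for h in hits]
print(result)
```

This keeps the same rule as the loop version: the lowest-ranked keyword wins, and a stop word as the winner leaves the row empty.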