How to keep the word case from the text when pattern matching in Python

Question  Votes: 1  Answers: 1

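I have a dataframe with two columns, Stg and Txt. The task is to check every word in the Stg column against each row of Txt and output the matched words into a new column, keeping each word's case exactly as it appears in Txt.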

Example Code:

import re

from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = DataFrame(new, columns=['Stg','Txt'])

my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    # Yield each listed word found in a_string, at most once per word (case-sensitive)
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set:
                return  # stop early once every word has been found

# Collect the matches for every Txt row into the new column
df['new'] = [list(words_in_string(my_list, values)) for values in df['Txt']]

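The code above returns the case from the Stg column. Is there a way to get the case from Txt instead? Also, I want to match whole strings rather than substrings: for the text "two-way", the current code returns the word way.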
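To make the two symptoms concrete, here is a minimal sketch that uses only the standard re module. The sample strings come from the question; the variable names are made up for illustration. It shows that a match is returned with the text's own spelling once re.IGNORECASE is set, and that \b treats the hyphen in two-way as a word boundary.

```python
import re

text_a = "New Phone feature that allowed"
text_b = "two-way allowed"

# Case-sensitive search: "phone" in the pattern does not match "Phone" in the text.
case_sensitive = re.findall(r"\b(?:phone|allowed)\b", text_a)

# With re.IGNORECASE the match is returned exactly as it is spelled in the text.
case_insensitive = re.findall(r"\b(?:phone|allowed)\b", text_a, flags=re.IGNORECASE)

# \b sits between "-" and "w", so "way" is found inside "two-way".
hyphen_hit = re.findall(r"\bway\b", text_b)

print(case_sensitive)    # ['allowed']
print(case_insensitive)  # ['Phone', 'allowed']
print(hyphen_hit)        # ['way']
```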

Current Output:

    Stg            Txt                                   new
0   way           An early term                           []
1   Early         two-way allowed                         [way, allowed]
2   phone         New Phone feature that allowed          [allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]


Expected Output:

    Stg            Txt                                   new
0   way           An early term                           [early]
1   Early         two-way allowed                         [allowed]
2   phone         New Phone feature that allowed          [Phone, allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]
python regex pandas case-insensitive
1 Answer

1 vote

You should use Series.str.findall with a negative lookbehind.
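As a rough illustration of that idea, the sketch below runs Series.str.findall on a two-row Series. The sample data and the exact lookbehind class are assumptions chosen for the demo, not the answerer's pattern.

```python
import re

import pandas as pd

# Hypothetical sample rows, one with a hyphenated word and one without
texts = pd.Series(["two-way allowed", "One Way street"])

# (?<![\w-]) rejects a match that directly follows a word character or a hyphen.
pattern = r"(?<![\w-])way\b"

found = texts.str.findall(pattern, flags=re.IGNORECASE)
print(found.tolist())  # [[], ['Way']]
```

The first row yields nothing because way follows a hyphen; the second row returns Way with the casing used in the text.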

0 votes

I think you are copying variables more than necessary. You can do it more simply, like this:

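import re

import pandas as pd

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = pd.DataFrame(new, columns=['Stg','Txt'])

# The lookbehind rejects a term that directly follows a letter or one of - ; : , / |
# so "way" inside "two-way" is skipped; the raw f-string keeps \w and \b intact.
pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])

# IGNORECASE matches regardless of case and returns the spelling found in Txt
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)

print(df)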

Output:
          Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]
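If the terms may contain regex metacharacters, or one term is a prefix of another, a slightly more defensive pattern builder can help. The sketch below is one possible variant, not the pattern from the answer: build_pattern is a hypothetical helper, the term list is invented, and the lookarounds on both sides are an assumption about what "whole string" should mean.

```python
import re

def build_pattern(terms):
    # Hypothetical helper: try longer terms first so "brand name" wins over "brand".
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters such as "+" from being read as regex operators.
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds reject matches glued to a word character or hyphen on either side.
    return re.compile(rf"(?<![\w-])(?:{alternation})(?![\w-])", re.IGNORECASE)

matcher = build_pattern(["brand", "brand name", "way", "c++"])

print(matcher.findall("The Brand Name is stage"))  # ['Brand Name']
print(matcher.findall("two-way or way-point"))     # []
print(matcher.findall("Written in C++ today"))     # ['C++']
```

A compiled pattern like this can be passed to re-based loops directly, or its .pattern string can be handed to Series.str.findall together with flags=re.IGNORECASE.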

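Alternatively, compile a single case-insensitive pattern and collect the matches row by row. Note that plain \b still treats the hyphen as a boundary, so way is matched inside two-way: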

import re

from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
        'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
        }

df = DataFrame(new, columns=['Stg','Txt'])

my_list = df["Stg"].tolist()

# Build \bway\b|\bEarly\b|... so every term is bounded on both sides
mystring = r"\b|\b".join(my_list)
pattern = r'\b{0}\b'.format(mystring)
print(pattern)
match_pattern = re.compile(pattern, re.IGNORECASE)

# Collect the matches for every row and assign the column in one step
df['new'] = [match_pattern.findall(values) for values in df['Txt']]
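One behavioural difference is worth noting: the generator in the question reports each word at most once, whereas findall returns every occurrence. If the once-per-word behaviour matters, something like the following sketch could be applied per row. unique_matches is a hypothetical helper and the sample sentence is invented.

```python
import re

def unique_matches(pattern, text):
    # Hypothetical helper: keep the first spelling of each match, compared case-insensitively.
    seen = set()
    result = []
    for match in pattern.finditer(text):
        key = match.group(0).lower()
        if key not in seen:
            seen.add(key)
            result.append(match.group(0))
    return result

pattern = re.compile(r"\b(?:phone|allowed)\b", re.IGNORECASE)
deduplicated = unique_matches(pattern, "New Phone, old phone, both allowed")
print(deduplicated)  # ['Phone', 'allowed']
```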