How can I optimize this iteration over a pandas DataFrame?


I have the following DataFrame:

import re
import pandas as pd

d = {'I am a sentence of words.': 'words', 'I am not a sentence of words.': 'words', 'I have no sentence with words or punctuation': 'letter', 'I am not a sentence with a letter or punctuation': 'letter'}

df = pd.Series(d).rename_axis('sentence').reset_index(name='mention')

df:
                                           sentence mention
0                         I am a sentence of words.   words
1                     I am not a sentence of words.   words
2      I have no sentence with words or punctuation  letter
3  I am not a sentence with a letter or punctuation  letter

I apply the following function to match various regex patterns:

def get_negated(row):
    negated = False
    
    # missed negation
    terms = ['neg',
             'negative',
             'no',
             'free of',
             'not',
             'without',
             'denies',
             'ruled out']
    
    for term in terms:
        regex_str=r"(?:\s+\S+)*\b{0}(?:\s+\S+)*\s+{1}\b".format(term, row.mention)
        if re.search(regex_str, row.sentence): #or (re.search(regex_str2, row.sentence)):
            negated = True
            break
            
    return int(negated)

I call it by iterating over the rows:

negated_terms=[]
for row in df.itertuples():
    negated_terms.append(get_negated(row))

and then add the result to the DataFrame as a new column:

df['negated'] = negated_terms

The output looks like this:

df:

                                           sentence mention  negated
0                         I am a sentence of words.   words        0
1                     I am not a sentence of words.   words        1
2      I have no sentence with words or punctuation  letter        0
3  I am not a sentence with a letter or punctuation  letter        1

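This works, but the DataFrame has millions of rows, and several other functions return additional lists that become further new columns based on other regex patterns. As it stands, the whole thing takes hours to run. I considered using the apply method to speed things up, but given how many functions there are, I suspect that would actually be slower than my current implementation. Is there a more efficient (e.g. vectorized) way to do this? I can't for the life of me find such a beast.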

python pandas vectorization
1 Answer

You could try this:

terms = [
    'neg', 'negative', 'no', 'free of',
    'not', 'without', 'denies','ruled out'
]

pat = "(?:%s).*{mention}" % "|".join(map(re.escape, terms))

df["negated"] = [
    int(bool(re.search(pat.format(mention=m), s)))
    for s,m in df[["sentence", "mention"]].to_numpy()
]

Output:

print(df)

                                           sentence mention  negated
0                         I am a sentence of words.   words        0
1                     I am not a sentence of words.   words        1
2      I have no sentence with words or punctuation  letter        0
3  I am not a sentence with a letter or punctuation  letter        1
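
If the number of distinct mentions is small compared with the number of rows, another option is to build one pattern per mention and let pandas scan each group of sentences with Series.str.contains. The following is only a sketch under some assumptions: the helper name add_negated is made up for illustration, and the pattern puts word boundaries around both the negation term and the mention, which is stricter than the pattern above (for example, "no" will not match inside "know"). Whether it is actually faster depends on your data, so it is worth timing on a sample first.

```python
import re

import numpy as np
import pandas as pd

TERMS = ["neg", "negative", "no", "free of", "not", "without", "denies", "ruled out"]
NEGATION = "|".join(map(re.escape, TERMS))


def add_negated(df):
    """Hypothetical helper: return a copy of df with a 0/1 'negated' column."""
    negated = np.zeros(len(df), dtype="int64")
    # GroupBy.indices maps each distinct mention to the integer positions of its rows.
    for mention, pos in df.groupby("mention").indices.items():
        # Assumption: a whole-word negation term, followed later by the whole-word mention.
        pattern = r"\b(?:%s)\b.*\b%s\b" % (NEGATION, re.escape(mention))
        # One regex scan per group; na=False treats missing sentences as "not negated".
        hits = df["sentence"].iloc[pos].str.contains(pattern, regex=True, na=False)
        negated[pos] = hits.to_numpy()
    return df.assign(negated=negated)


d = {
    "I am a sentence of words.": "words",
    "I am not a sentence of words.": "words",
    "I have no sentence with words or punctuation": "letter",
    "I am not a sentence with a letter or punctuation": "letter",
}
df = pd.Series(d).rename_axis("sentence").reset_index(name="mention")
result = add_negated(df)
print(result["negated"].tolist())  # [0, 1, 0, 1]
```

Rows whose mention is missing are skipped by groupby and keep the default value 0. The same loop can be repeated for the other regex-based columns, each with its own pattern template.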