I have the following dataframe:
import re
import pandas as pd

d = {'I am a sentence of words.': 'words',
     'I am not a sentence of words.': 'words',
     'I have no sentence with words or punctuation': 'letter',
     'I am not a sentence with a letter or punctuation': 'letter'}
df = pd.Series(d).rename_axis('sentence').reset_index(name='mention')
df:
                                           sentence  mention
0                         I am a sentence of words.    words
1                     I am not a sentence of words.    words
2      I have no sentence with words or punctuation   letter
3  I am not a sentence with a letter or punctuation   letter
I apply the following method to match various regex patterns:
def get_negated(row):
    negated = False
    # missed negation
    terms = ['neg',
             'negative',
             'no',
             'free of',
             'not',
             'without',
             'denies',
             'ruled out']
    for term in terms:
        regex_str = r"(?:\s+\S+)*\b{0}(?:\s+\S+)*\s+{1}\b".format(term, row.mention)
        if re.search(regex_str, row.sentence):
            negated = True
            break
    return int(negated)
which I call by iterating:
negated_terms = []
for row in df.itertuples():
    negated_terms.append(get_negated(row))
and then add the new column to the dataframe with:
df['negated'] = negated_terms
The output is:
df:
                                           sentence  mention  negated
0                         I am a sentence of words.    words        0
1                     I am not a sentence of words.    words        1
2      I have no sentence with words or punctuation   letter        0
3  I am not a sentence with a letter or punctuation   letter        1
This works fine, but the dataframe has millions of rows, and several other methods return further lists that are used to create more new columns based on other regex patterns. In practice this takes hours to run. I was considering using the apply method in the hope of speeding things up, but given that there are multiple such methods I suspect it would actually be slower than my current implementation. I'm wondering whether there is a more efficient (e.g. vectorized) way to speed this up. I can't for the life of me find such a beast.
You can try this:
terms = [
'neg', 'negative', 'no', 'free of',
    'not', 'without', 'denies', 'ruled out'
]
pat = "(?:%s).*{mention}" % "|".join(map(re.escape, terms))
df["negated"] = [
int(bool(re.search(pat.format(mention=m), s)))
for s,m in df[["sentence", "mention"]].to_numpy()
]
Output:
print(df)
                                           sentence  mention  negated
0                         I am a sentence of words.    words        0
1                     I am not a sentence of words.    words        1
2      I have no sentence with words or punctuation   letter        0
3  I am not a sentence with a letter or punctuation   letter        1
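If the number of distinct `mention` values is small relative to the row count, you can go one step further: compile the pattern once per unique mention and apply it to each group with the vectorized `str.contains`, so the regex is parsed once per mention rather than once per row. A sketch under that assumption, not a drop-in replacement for the code above:

```python
import re
import pandas as pd

d = {'I am a sentence of words.': 'words',
     'I am not a sentence of words.': 'words',
     'I have no sentence with words or punctuation': 'letter',
     'I am not a sentence with a letter or punctuation': 'letter'}
df = pd.Series(d).rename_axis('sentence').reset_index(name='mention')

terms = [
    'neg', 'negative', 'no', 'free of',
    'not', 'without', 'denies', 'ruled out'
]
pat = "(?:%s).*{mention}" % "|".join(map(re.escape, terms))

df["negated"] = 0
for mention, grp in df.groupby("mention"):
    # one compiled regex per unique mention, applied to the whole group at once
    rx = re.compile(pat.format(mention=re.escape(mention)))
    df.loc[grp.index, "negated"] = grp["sentence"].str.contains(rx).astype(int)
```

This trades the per-row Python loop for one vectorized pass per unique mention, which should help most when mentions repeat heavily across the millions of rows.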