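I am looking for a way to check whether one string can be found inside another string. str.contains
only accepts a fixed string pattern as its argument; I would rather do an element-wise comparison between two string columns.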
import pandas as pd
df = pd.DataFrame({'long': ['sometext', 'someothertext', 'evenmoretext'],
'short': ['some', 'other', 'stuff']})
# This fails:
df['short_in_long'] = df['long'].str.contains(df['short'])
Expected output:
[True, True, False]
Use a list comprehension with zip:
df['short_in_long'] = [b in a for a, b in zip(df['long'], df['short'])]
print (df)
long short short_in_long
0 sometext some True
1 someothertext other True
2 evenmoretext stuff False
This is a prime use case for a list comprehension:
# df['short_in_long'] = [y in x for x, y in df[['long', 'short']].values.tolist()]
df['short_in_long'] = [y in x for x, y in df[['long', 'short']].values]
df
long short short_in_long
0 sometext some True
1 someothertext other True
2 evenmoretext stuff False
List comprehensions are often faster than string methods because they carry less overhead. See "For loops with pandas - When should I care?".
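One way to check the overhead claim on your own data is a rough timing sketch like the one below. The enlarged frame, the helper names `with_comprehension` and `with_apply`, and the choice of a row-wise `apply` as the baseline are all assumptions made for illustration; timings will vary by machine and pandas version, so no numbers are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical benchmark frame: the question's rows repeated to get a measurable size.
df = pd.DataFrame({'long': ['sometext', 'someothertext', 'evenmoretext'] * 1000,
                   'short': ['some', 'other', 'stuff'] * 1000})


def with_comprehension(frame):
    # Plain Python loop over the two columns in parallel.
    return [b in a for a, b in zip(frame['long'], frame['short'])]


def with_apply(frame):
    # Row-wise apply, used here only as a baseline for comparison.
    return frame.apply(lambda row: row['short'] in row['long'], axis=1).tolist()


comprehension_result = with_comprehension(df)
apply_result = with_apply(df)

t_comp = timeit.timeit(lambda: with_comprehension(df), number=5)
t_apply = timeit.timeit(lambda: with_apply(df), number=5)
print(f'list comprehension: {t_comp:.4f}s, apply: {t_apply:.4f}s')
```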
If your data contains NaNs, you can call a function that handles the error:
def try_check(haystack, needle):
    try:
        return needle in haystack
    except TypeError:
        return False
df['short_in_long'] = [try_check(x, y) for x, y in df[['long', 'short']].values]
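To see the error handling at work, here is a small sketch on a frame with a missing value in each column. The frame `df_nan` is made up for illustration; `try_check` is the function from the answer above.

```python
import numpy as np
import pandas as pd


def try_check(haystack, needle):
    try:
        return needle in haystack
    except TypeError:
        # Raised when either side is NaN (a float) rather than a string.
        return False


# Hypothetical data: one NaN in 'long', one NaN in 'short'.
df_nan = pd.DataFrame({'long': ['sometext', np.nan, 'evenmoretext'],
                       'short': ['some', 'other', np.nan]})

result = [try_check(x, y) for x, y in df_nan[['long', 'short']].values]
print(result)  # [True, False, False]
```

Rows with a NaN on either side come out as False instead of raising.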
Check numpy, it works row-wise :-).
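import numpy as np
# np.char is the public name for this module; np.core is a private namespace.
np.char.find(df['long'].values.astype(str), df['short'].values.astype(str)) != -1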
Out[302]: array([ True, True, False])
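One caveat with the numpy route: `astype(str)` turns a missing value into the literal text 'nan', which can then match by accident. The sketch below shows this on a made-up frame and masks such rows out; the frame and the masking step are assumptions added for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second 'short' value is missing.
df_nan = pd.DataFrame({'long': ['sometext', 'banana'],
                       'short': ['some', np.nan]})

long_arr = df_nan['long'].to_numpy().astype(str)
short_arr = df_nan['short'].to_numpy().astype(str)  # NaN becomes the string 'nan'

# Element-wise substring search; 'nan' is found inside 'banana'.
found = np.char.find(long_arr, short_arr) != -1

# Keep only rows where both original values were present.
both_present = df_nan[['long', 'short']].notna().all(axis=1).to_numpy()
masked = found & both_present

print(found.tolist())   # [True, True]
print(masked.tolist())  # [True, False]
```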
Also,
df['short_in_long'] = df['long'].str.contains('|'.join(df['short'].values))
Update: I misunderstood the question. Here is the corrected version:
# apply with axis=1 must be called on the DataFrame so each row carries both columns
df['short_in_long'] = df.apply(lambda x: x['short'] in x['long'], axis=1)
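The difference between a joined pattern and a true element-wise check shows up as soon as a short string occurs in a different row's long string. The following is a minimal sketch on a made-up two-row frame; the `re.escape` call is an addition here, so that the short strings are treated as literal text rather than as regex syntax.

```python
import re

import pandas as pd

# Hypothetical data: each 'short' value appears only in the OTHER row's 'long' value.
df = pd.DataFrame({'long': ['sometext', 'othertext'],
                   'short': ['other', 'some']})

# Joined pattern: asks whether ANY of the short strings occurs in each long string.
pattern = '|'.join(re.escape(s) for s in df['short'])
any_short = df['long'].str.contains(pattern).tolist()

# Row-wise check: asks whether the short string of the SAME row occurs.
row_wise = df.apply(lambda row: row['short'] in row['long'], axis=1).tolist()

print(any_short)  # [True, True]
print(row_wise)   # [False, False]
```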