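I have two columns and I want to check whether they share 4 or more characters, regardless of where in the string those characters appear. If they do, a new column should contain OK; otherwise it should contain KO.
How can I do this in Python or SQLite?
Example:
Dataset: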
Street 1;Street 2
ASENSIO Y TOLEDO 15;AVILA 9
AVILA 9;AVILA 9
FISTERRA S/N;FINISTERRE S/N - SAN ROQUE
PASEO DEL PUER;PASEO DEL PUERTO SN
PASEO DEL PUER;PASEO DEL PUERTO SN
LA UNION 2;LA UNION 2
ALEGRIA 14;LA UNION 2
Thanks.
https://i.stack.imgur.com/gYLcg.png
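Code:
import pandas as pd
import pandasql as pdsql  # assumption: pdsql refers to the pandasql package

def dataet():
    df_dataset = pd.read_csv("C:/Users/Documents/DATASET2.CSV", sep=';')
    print(df_dataset.columns.values)
    # Column names that contain spaces must be double-quoted in SQLite
    query = """
    SELECT INSTR("Street 1", "Street 2")
    FROM df_dataset
    """
    # Pass locals() so that sqldf can find df_dataset inside the function
    result = pdsql.sqldf(query, locals())
    print(result)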
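In Python you can use sets to get the unique characters in each string, then apply & to the sets built from Street 1 and Street 2 to get their intersection. I also remove spaces from the set of matches; you don't want to count them, right?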
df['count'] = ['OK' if len(set(x) & set(y) - set(' ')) >= 4 else 'KO' for x, y in zip(df['Street 1'].fillna(''), df['Street 2'].fillna(''))]
print(df)
Output:
              Street 1                    Street 2  count
0  ASENSIO Y TOLEDO 15                     AVILA 9     KO
1              AVILA 9                     AVILA 9     OK
2         FISTERRA S/N  FINISTERRE S/N - SAN ROQUE     OK
3       PASEO DEL PUER         PASEO DEL PUERTO SN     OK
4       PASEO DEL PUER         PASEO DEL PUERTO SN     OK
5           LA UNION 2                  LA UNION 2     OK
6           ALEGRIA 14                  LA UNION 2     KO
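The same check can also be written as a small helper, which makes the operator precedence explicit and is easier to try on single pairs. This is only a sketch: the names `shared_chars` and `flag` and the `min_common` parameter are made up for illustration and are not part of the answer above. Keep in mind that it counts distinct shared characters regardless of order or adjacency, so two unrelated streets that happen to use enough of the same letters would still come out as OK.

```python
def shared_chars(a, b):
    """Return the distinct non-space characters that occur in both strings."""
    return (set(a) & set(b)) - {' '}

def flag(a, b, min_common=4):
    """Return 'OK' when at least min_common distinct characters are shared."""
    return 'OK' if len(shared_chars(a, b)) >= min_common else 'KO'

# Sample pairs taken from the question
pairs = [
    ('ASENSIO Y TOLEDO 15', 'AVILA 9'),
    ('AVILA 9', 'AVILA 9'),
    ('ALEGRIA 14', 'LA UNION 2'),
]
for a, b in pairs:
    print(a, '|', b, '|', sorted(shared_chars(a, b)), flag(a, b))
```

On the DataFrame, `flag` could replace the inline expression: `[flag(x, y) for x, y in zip(df['Street 1'].fillna(''), df['Street 2'].fillna(''))]`.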
Update: if you are looking for the length of the longest common substring between Street 1 and Street 2:
from difflib import SequenceMatcher
z = df.fillna('')
# For each row, take the longest block common to both strings and count its non-space characters
z['count'] = [len(x[m.a:m.a+m.size].replace(' ', '')) for x, m in
              [(x, SequenceMatcher(None, x, y).find_longest_match(0, len(x), 0, len(y)))
               for x, y in zip(z['Street 1'], z['Street 2'])]]
z['match'] = ['OK' if x >= 4 else 'KO' for x in z['count']]
print(z)
Output:
Street 1 Street 2 count match
0 ASENSIO Y TOLEDO 15 AVILA 9 1 KO
1 AVILA 9 AVILA 9 6 OK
2 FISTERRA S/N FINISTERRE S/N - SAN ROQUE 6 OK
3 PASEO DEL PUER PASEO DEL PUERTO SN 12 OK
4 PASEO DEL PUER PASEO DEL PUERTO SN 12 OK
5 LA UNION 2 LA UNION 2 8 OK
6 ALEGRIA 14 LA UNION 2 1 KO
7 JARILLO 7 BO IZD SAN AMBROSIO 1 KO
8 STREET AVE PARRA PARRA STREET 4 6 OK
9 PARRA 4 0 KO
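To see which substring is actually being counted, the nested comprehension can be unpacked into two small functions. This is a hedged sketch rather than a replacement for the code above: `longest_common_substring` and `match_length` are illustrative names, and passing `autojunk=False` is an extra precaution that should make no difference for strings as short as these addresses.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Return the longest contiguous block of characters shared by a and b."""
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

def match_length(a, b):
    """Length of the longest common substring, not counting spaces."""
    return len(longest_common_substring(a, b).replace(' ', ''))

# Sample pairs taken from the question
pairs = [
    ('FISTERRA S/N', 'FINISTERRE S/N - SAN ROQUE'),
    ('PASEO DEL PUER', 'PASEO DEL PUERTO SN'),
    ('ALEGRIA 14', 'LA UNION 2'),
]
for a, b in pairs:
    n = match_length(a, b)
    print(repr(longest_common_substring(a, b)), n, 'OK' if n >= 4 else 'KO')
```

Unlike the set-based check, this requires the shared characters to be adjacent and in the same order, which is why 'ALEGRIA 14' and 'LA UNION 2' score only 1.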
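Alternatively, with numpy.where(), again counting the distinct characters the two strings have in common:
import numpy as np
df['res'] = np.where([len(set(x) & set(y)) >= 4 for x, y in zip(df['Street 1'].fillna(''), df['Street 2'].fillna(''))], 'OK', 'KO')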
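Since the question also mentions SQLite: SQLite has no built-in function for this kind of comparison, but Python's sqlite3 module can register a Python function and make it callable from SQL. The following is a minimal sketch under assumptions: the table name `streets`, the function name `street_match`, and the in-memory database are all invented for illustration.

```python
import sqlite3

def street_match(a, b):
    """Return 'OK' if a and b share at least 4 distinct non-space characters."""
    # SQL NULL arrives as None, so treat it as an empty string
    common = (set(a or '') & set(b or '')) - {' '}
    return 'OK' if len(common) >= 4 else 'KO'

conn = sqlite3.connect(':memory:')
# Register the Python function so SQL can call street_match(x, y)
conn.create_function('street_match', 2, street_match)

conn.execute('CREATE TABLE streets ("Street 1" TEXT, "Street 2" TEXT)')
conn.executemany(
    'INSERT INTO streets VALUES (?, ?)',
    [('AVILA 9', 'AVILA 9'), ('ALEGRIA 14', 'LA UNION 2'), ('PARRA 4', None)],
)
rows = conn.execute(
    'SELECT "Street 1", "Street 2", street_match("Street 1", "Street 2") '
    'FROM streets ORDER BY rowid'
).fetchall()
for row in rows:
    print(row)
conn.close()
```

The function body could be swapped for the longest-common-substring variant if adjacency matters.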