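I have the following function, which is meant to split a string into words. The split should separate letters, digits, and certain special characters such as / or -.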
import pandas as pd
import re

def split_string_with_letters_and_non_letters(input_string):
    split_strings = re.split(r"([a-zA-Z] )+", input_string)
    result = []
    count = 0
    for x in split_strings:
        if x not in result and x != ' ' and x != '':
            result.append(x.lower().strip())
            count = count + 1
    delim = "|"
    result_string = delim.join([str(ele) for ele in result])
    return result_string

teststring = "SPRINTER2500 2WD C E-150"
print(split_string_with_letters_and_non_letters(teststring))
The result I expect is:
"SPRINTER|2500|2|WD|C|E|-|150"
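The problem is in your regular expression: ([a-zA-Z] )+ only matches a letter followed by a space, so it never splits between letters and digits. Rather than splitting, use re.findall to extract each run of letters, run of digits, or special character directly. Here is the revised code: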
import re

def split_string_with_letters_and_non_letters(input_string):
    # New regex: a run of letters, a run of digits, or a single '/' or '-'.
    # Inside a character class '|' is a literal character, so it is left out.
    split_strings = re.findall(r'[A-Za-z]+|\d+|[/-]', input_string)
    # Join with a delimiter (this pattern never produces empty matches)
    result_string = "|".join(split_strings)
    return result_string

teststring = "SPRINTER2500 2WD C E-150"
print(split_string_with_letters_and_non_letters(teststring))
Result:
SPRINTER|2500|2|WD|C|E|-|150
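As a minimal sketch of how this kind of alternation behaves on other inputs (the helper name tokenize and the extra sample strings are assumptions for illustration, not part of the question): re.findall returns only what the pattern matches, so whitespace and any character outside the listed classes, such as an underscore, are silently dropped.

```python
import re

# Hypothetical helper: one alternative per token type
# (run of ASCII letters, run of digits, or a single '/' or '-').
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[/-]")

def tokenize(text):
    """Return the matched tokens in order of appearance."""
    return TOKEN_RE.findall(text)

print("|".join(tokenize("SPRINTER2500 2WD C E-150")))
print(tokenize("F-150/XL 4x4"))
print(tokenize("A_B"))  # the underscore matches no alternative and is dropped
```

If unmatched characters should be kept rather than dropped, the pattern needs an extra alternative for them.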
A non-regex approach using itertools.groupby and a custom character classifier:
from itertools import groupby

def char_type(c):
    if c.isalpha(): return 'alpha'
    if c.isdigit(): return 'digit'
    if c.isspace(): return None  # falsy key, so whitespace groups are dropped below
    if c in {'-', '/'}: return 'special'  # this is just an example
    return 'other'

input_string = "SPRINTER2500 2WD C E-150"  # test string from the question
out = '|'.join(''.join(g) for k, g in groupby(input_string, char_type) if k)
Output:
'SPRINTER|2500|2|WD|C|E|-|150'
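A minimal sketch of the groupby approach wrapped in a function, to show how it treats inputs beyond the test string (the wrapper name split_by_type and the extra sample strings are assumptions for illustration). Because groupby merges consecutive characters that share a key, repeated special characters come out as one token, and unclassified characters are kept as 'other' tokens instead of being discarded.

```python
from itertools import groupby

def char_type(c):
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    if c.isspace():
        return None  # falsy key: whitespace groups are filtered out
    if c in {"-", "/"}:
        return "special"
    return "other"

def split_by_type(text):
    # Hypothetical wrapper: consecutive characters with the same key form one token.
    return "|".join("".join(g) for k, g in groupby(text, char_type) if k)

print(split_by_type("SPRINTER2500 2WD C E-150"))
print(split_by_type("E--150"))  # the two hyphens share a key and stay together
print(split_by_type("A_#B"))    # '_' and '#' are both 'other' and form one token
```

To emit each special character as its own token, one option is to return the character itself as the key for that class; runs of the same character would still be merged.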
Alternatively, using unicodedata.category and excluding Zs (the space separator category):
from itertools import groupby
import unicodedata

out = '|'.join(''.join(g) for k, g in
               groupby(input_string, unicodedata.category)
               if k != 'Zs')
Note: this puts lowercase and uppercase characters into different groups, but you can avoid that by using lambda x: unicodedata.category(x.casefold()) as the key instead of unicodedata.category.
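A minimal sketch comparing the two keys on a mixed-case string (the function names and the sample input are assumptions for illustration). One caveat: casefold() can expand a single character into several (for example 'ß' becomes 'ss'), and unicodedata.category accepts exactly one character, so this sketch falls back to the original character in that case.

```python
from itertools import groupby
import unicodedata

def case_insensitive_category(c):
    folded = c.casefold()
    # casefold() may return more than one character (e.g. 'ß' -> 'ss');
    # unicodedata.category needs exactly one, so keep the original then.
    return unicodedata.category(folded if len(folded) == 1 else c)

def split_by_category(text, key=unicodedata.category):
    # Hypothetical wrapper: group by Unicode category, dropping space separators.
    return "|".join("".join(g) for k, g in groupby(text, key) if k != "Zs")

sample = "Sprinter2500 e-150"
print(split_by_category(sample))                             # 'S' is split from 'printer'
print(split_by_category(sample, case_insensitive_category))  # 'Sprinter' stays whole
```

The fallback is a design choice for this sketch. Mapping the letter categories Lu, Ll, and Lt to a single key would be another way to ignore case.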