Is there a way to split a string without splitting on escaped characters? For example, I have a string that I want to split on ':' but not on '\:'.
http\://www.example.url:ftp\://www.example.url
The result should look like this:
['http\://www.example.url' , 'ftp\://www.example.url']
There is a simpler way, using a regular expression with a negative lookbehind assertion:
re.split(r'(?<!\\):', str)
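A minimal sketch of the lookbehind approach applied to the sample string from the question. The helper name split_unescaped is hypothetical, and re.escape is used on the assumption that the delimiter might have a special meaning in a regex. Keep in mind that the lookbehind only inspects the single character before the delimiter, so it cannot tell an escaped backslash from an escaping one.

```python
import re


def split_unescaped(text, delim=':'):
    """Split text on delim unless the delimiter is preceded by a backslash."""
    # (?<!\\) is a negative lookbehind: "not immediately preceded by a backslash".
    pattern = r'(?<!\\)' + re.escape(delim)
    return re.split(pattern, text)


parts = split_unescaped(r'http\://www.example.url:ftp\://www.example.url')
print(parts)

# A doubled backslash still blocks the split, which may or may not be wanted.
print(split_unescaped(r'a\\:b'))
```

The first call gives the two URLs with their backslashes left in place. The second call returns the input as a single item, because the colon is still directly preceded by a backslash.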
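As Ignacio said, yes, but not in one go. The problem is that you need to look back to determine whether you are at an escaped delimiter, and the basic string.split does not provide that capability.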
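If this isn't inside a tight loop, so performance isn't a significant issue, you can do it by first splitting on the escaped delimiter, then performing the real split, and then merging the pieces. Ugly demo code follows: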
# Bear in mind this is not rigorously tested!
def escaped_split(s, delim):
    # split by escaped, then by not-escaped
    escaped_delim = '\\' + delim
    sections = [p.split(delim) for p in s.split(escaped_delim)]
    ret = []
    pending = None  # item still being built; an escaped delimiter may extend it
    for parts in sections:  # for each list of "real" splits
        if pending is None:
            pending = parts[0]
        else:
            # Rejoin the previous last item with the first item of this section
            pending = escaped_delim.join([pending, parts[0]])
        for part in parts[1:]:
            # A real delimiter was crossed: the pending item is complete
            ret.append(pending)
            pending = part
    ret.append(pending)
    return ret

s = r'http\://www.example.url:ftp\://www.example.url'
print(escaped_split(s, ':'))
# >>> ['http\\://www.example.url', 'ftp\\://www.example.url']
Alternatively, the logic may be easier to follow if you just split the string by hand.
def escaped_split(s, delim):
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == '\\':
            try:
                # skip the next character; it has been escaped!
                current.append('\\')
                current.append(next(itr))
            except StopIteration:
                pass
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
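Note that this second version behaves slightly differently when it encounters a double escape followed by a delimiter: this function allows the escape character itself to be escaped, so escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b'], because the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. So that's something to watch out for.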
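To make the difference concrete, here is a small sketch that runs the lookbehind regex and a character-by-character scanner on the same double-escape input. The name scan_split is hypothetical; it is only a condensed restatement of the manual loop above, with the escape character made a parameter as an assumption.

```python
import re


def scan_split(s, delim, escape='\\'):
    """Split s on delim; an escape character protects the character after it."""
    parts, current = [], []
    chars = iter(s)
    for ch in chars:
        if ch == escape:
            # Keep the escape and whatever follows it (if anything) verbatim.
            current.append(ch)
            current.append(next(chars, ''))
        elif ch == delim:
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts


text = r'a\\:b'  # the characters: a, backslash, backslash, colon, b
by_regex = re.split(r'(?<!\\):', text)  # sees a backslash before ':' and does not split
by_scan = scan_split(text, ':')         # pairs up the backslashes, then splits on ':'
print(by_regex)
print(by_scan)
```

Which of the two results is "right" depends on whether the input format lets a backslash escape another backslash.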
An edited version of Henry's answer, with Python 3 compatibility, tests, and fixes for some issues:
def split_unescape(s, delim, escape='\\', unescape=True):
    """
    >>> split_unescape('foo,bar', ',')
    ['foo', 'bar']
    >>> split_unescape('foo$,bar', ',', '$')
    ['foo,bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=True)
    ['foo$', 'bar']
    >>> split_unescape('foo$$,bar', ',', '$', unescape=False)
    ['foo$$', 'bar']
    >>> split_unescape('foo$', ',', '$', unescape=True)
    ['foo$']
    """
    ret = []
    current = []
    itr = iter(s)
    for ch in itr:
        if ch == escape:
            try:
                # skip the next character; it has been escaped!
                if not unescape:
                    current.append(escape)
                current.append(next(itr))
            except StopIteration:
                if unescape:
                    current.append(escape)
        elif ch == delim:
            # split! (add current to the list and reset it)
            ret.append(''.join(current))
            current = []
        else:
            current.append(ch)
    ret.append(''.join(current))
    return ret
Based on @user629923's suggestion, but much simpler than the other answers:
import re
DBL_ESC = "!double escape!"
s = r"Hello:World\:Goodbye\\:Cruel\\\:World"
# map() is lazy in Python 3, so wrap it in list() to see the result
print(list(map(lambda x: x.replace(DBL_ESC, r'\\'), re.split(r'(?<!\\):', s.replace(r'\\', DBL_ESC)))))
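Here is an efficient solution that handles double escapes correctly, i.e. a delimiter that follows a double escape is not treated as escaped. It ignores an incorrect single escape that is the last character of the string.
It is very efficient because it iterates over the input string only once and manipulates indices instead of copying strings. Rather than building a list, it returns a generator.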
def split_esc(string, delimiter):
    if len(delimiter) != 1:
        raise ValueError('Invalid delimiter: ' + delimiter)
    ln = len(string)
    i = 0
    j = 0
    while j < ln:
        if string[j] == '\\':
            if j + 1 >= ln:
                yield string[i:j]
                return
            j += 1
        elif string[j] == delimiter:
            yield string[i:j]
            i = j + 1
        j += 1
    yield string[i:j]
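To allow delimiters longer than a single character, simply advance i and j by the length of the delimiter in the "elif" case. This assumes that a single escape character escapes the entire delimiter rather than a single character.
Tested with Python 3.5.1.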
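One way the multi-character extension described above might look. This is a sketch, not the answer author's code: the name split_esc_multi and the escape parameter are assumptions, and str.startswith with a start offset is used to test for the delimiter at the current position. It assumes the delimiter does not itself begin with the escape character.

```python
def split_esc_multi(string, delimiter, escape='\\'):
    """Yield the fields of string, split on an unescaped delimiter of any length."""
    if not delimiter:
        raise ValueError('Invalid delimiter: ' + repr(delimiter))
    ln = len(string)
    step = len(delimiter)
    i = 0  # start of the current field
    j = 0  # scan position
    while j < ln:
        if string[j] == escape:
            if j + 1 >= ln:
                # Dangling escape at the very end: drop it, like split_esc does.
                yield string[i:j]
                return
            if string.startswith(delimiter, j + 1):
                # A single escape protects the whole delimiter.
                j += 1 + step
            else:
                # Otherwise it protects just the next character.
                j += 2
        elif string.startswith(delimiter, j):
            yield string[i:j]
            j += step
            i = j
        else:
            j += 1
    yield string[i:]


fields = list(split_esc_multi(r'a::b\::c::d', '::'))
print(fields)
```

As with split_esc, the escape characters are left in the yielded fields; removing them would be a separate step.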
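There is no built-in function for this. Here is an efficient, general, and tested function that even supports delimiters of arbitrary length: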
def escape_split(s, delim):
    i, res, buf = 0, [], ''
    while True:
        j, e = s.find(delim, i), 0
        if j < 0:  # end reached
            return res + [buf + s[i:]]  # add remainder
        while j - e and s[j - e - 1] == '\\':
            e += 1  # number of escapes
        d = e // 2  # number of double escapes
        if e != d * 2:  # odd number of escapes
            buf += s[i:j - d - 1] + s[j]  # add the escaped char
            i = j + 1  # and skip it
            continue  # add more to buf
        res.append(buf + s[i:j - d])
        i, buf = j + len(delim), ''  # start after delim
I think a simple C-style parse would be simpler and more robust.
def escaped_split(str, ch):
    if len(ch) > 1:
        raise ValueError('Expected split character. Found string!')
    out = []
    part = ''
    escape = False
    for i in range(len(str)):
        if not escape and str[i] == ch:
            out.append(part)
            part = ''
        else:
            part += str[i]
        escape = not escape and str[i] == '\\'
    if len(part):
        out.append(part)
    return out
I created this method, which is inspired by Henry Keiter's answer but has the following advantages:
Here is the code:
def _split_string(self, string: str, delimiter: str, escape: str) -> list[str]:
    result = []
    current_element = []
    iterator = iter(string)
    for character in iterator:
        if character == escape:
            try:
                next_character = next(iterator)
                if next_character != delimiter and next_character != escape:
                    # Do not copy the escape character if it is intended to escape either the delimiter or the
                    # escape character itself. Copy the escape character if it is not in use to escape one of these
                    # characters.
                    current_element.append(escape)
                current_element.append(next_character)
            except StopIteration:
                current_element.append(escape)
        elif character == delimiter:
            # split! (add current to the list and reset it)
            result.append(''.join(current_element))
            current_element = []
        else:
            current_element.append(character)
    result.append(''.join(current_element))
    return result
Here is test code that demonstrates the behavior:
def test_split_string(self):
    # Verify normal behavior
    self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))
    # Verify that escape character escapes the delimiter
    self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))
    # Verify that the escape character escapes the escape character
    self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))
    # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
    self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))
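I know this is an old question, but I recently needed a function like this and could not find one that met my requirements.
Rules:
An escape character in front of anything other than the delimiter or the escape character is kept as-is. For example, if the delimiter is / and the escape character is \, then \a\b\c/abc becomes ['\a\b\c', 'abc'].
A doubled escape character is unescaped (\\ becomes \).
So, for the record, in case anyone is looking for something similar, here is my proposed function: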
def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
    """Splits a string using delimiter and escape chars

    Args:
        str_to_escape (str): The text to be split
        delimiter (str, optional): Delimiter used. Defaults to ','.
        escape (str, optional): The escape char. Defaults to '\\'.

    Yields:
        str: the next unescaped field of the input
    """
    if len(delimiter) > 1 or len(escape) > 1:
        raise ValueError("Both delimiter and escape must be single-character values")
    token = ''
    escaped = False
    for c in str_to_escape:
        if c == escape:
            if escaped:
                token += escape
                escaped = False
            else:
                escaped = True
            continue
        if c == delimiter:
            if not escaped:
                yield token
                token = ''
            else:
                token += c
                escaped = False
        else:
            if escaped:
                token += escape
                escaped = False
            token += c
    yield token
As a sanity check, I wrote some tests:
# The structure is:
# 'string_be_split_escaped', [list_with_result_expected]
tests_slash_escape = [
('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
('r/\\/teste/g', ['r', '/teste', 'g']),
('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
('r/\\.$//g', ['r', '\\.$', '', 'g']),
('u///g', ['u', '', '', 'g']),
('s/(/[/g', ['s', '(', '[', 'g']),
('s/)/]/g', ['s', ')', ']', 'g']),
('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
]
tests_bar_escape = [
('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
]
def test(test_array, escape):
    """From input data, test escape functions

    Args:
        test_array (list): pairs of (input string, expected list of fields)
        escape (str): the escape character to use
    """
    for t in test_array:
        resg = str_escape_split(t[0], '/', escape)
        res = list(resg)
        if res == t[1]:
            print(f"Test {t[0]}: {res} - Pass!")
        else:
            print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")

def test_all():
    test(tests_slash_escape, '\\')
    test(tests_bar_escape, '|')

if __name__ == "__main__":
    test_all()
A correct and simpler version:
import itertools
import re

def split(s: str) -> list[str]:
    parts = re.split(r'(\\.)|:', s)
    parts = [p and p.removeprefix('\\') for p in parts]
    segments = itertools.groupby(parts, key=lambda p: p is None)
    return [''.join(segment) for is_delimiter, segment in segments if not is_delimiter]
>>> split(r'http\://www.example.url:ftp\://www.example.url')
['http://www.example.url', 'ftp://www.example.url']
>>> split('')
['']
>>> split('::')
['', '', '']
>>> split('a:')
['a', '']
>>> split(r':\:\\:\\\:\\\\:\\\\\:\\\\\\:')
['', ':\\', '\\:\\\\', '\\\\:\\\\\\', '']
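Explanation:
re.split cuts the string at escape sequences and at delimiters (which show up as None) into parts.
The escape character is stripped from each of the parts.
itertools.groupby then splits the list at each None into segments.
Note that : does not appear to be a character that needs escaping.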
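The three steps can be watched one at a time on a short input. This is only an illustrative walk-through under the same assumptions as the function above (Python 3.9+ for str.removeprefix); the variable names unescaped and fields are made up for the example.

```python
import itertools
import re

s = r'a\:b:c'

# Step 1: the capture group keeps each escape sequence as its own part,
# while a bare ':' matches without the group and therefore shows up as None.
parts = re.split(r'(\\.)|:', s)
print(parts)

# Step 2: drop the leading backslash from every escape sequence.
# `p and ...` leaves None (and empty strings) untouched.
unescaped = [p and p.removeprefix('\\') for p in parts]
print(unescaped)

# Step 3: group the runs between the None markers and join each run.
fields = [
    ''.join(group)
    for is_delimiter, group in itertools.groupby(unescaped, key=lambda p: p is None)
    if not is_delimiter
]
print(fields)
```

For this input the parts are 'a', the escape sequence, 'b', a None marker, and 'c', which collapse into the two fields 'a:b' and 'c'.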
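The simplest way I can think of to do this is to split on the character and then add it back wherever it was escaped.
Sample code (badly in need of some tidying):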
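def splitNoEscapes(string, char):
    sections = string.split(char)
    # endswith() instead of [-1] so empty sections do not raise IndexError
    sections = [i + (char if i.endswith("\\") else "") for i in sections]
    result = ["" for i in sections]
    j = 0
    for s in sections:
        result[j] += s
        # only move to the next slot when this section ended at a real delimiter
        j += (0 if s.endswith(char) else 1)
    return [i for i in result if i != ""]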