Suppose this is the string:
The   fox jumped   over    the log.
It should turn into:
The fox jumped over the log.
What is the simplest way (1-2 lines) to do this, without splitting and going into lists?
>>> import re
>>> re.sub(' +', ' ', 'The     quick brown    fox')
'The quick brown fox'
>>> import re
>>> s = "The   fox jumped   over    the log."
>>> print(re.sub(r'\s+', ' ', s))
The fox jumped over the log.
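The two patterns used above, ' +' and '\s+', behave differently once the input contains tabs or newlines. The following is a minimal sketch of that difference; the sample string is made up for illustration and is not from the original posts.

```python
import re

# Illustrative input (assumption): spaces mixed with a tab and a newline.
text = "The   fox\tjumped \n over    the log."

# ' +' collapses only runs of the space character; tabs and newlines survive.
spaces_only = re.sub(r" +", " ", text)

# r'\s+' collapses any run of whitespace (spaces, tabs, newlines) into one space.
all_whitespace = re.sub(r"\s+", " ", text)

print(repr(spaces_only))
print(repr(all_whitespace))
```

Which one is appropriate depends on whether line breaks and tabs in the input carry meaning.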
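For a pandas column of strings:
df['message'] = (df['message'].str.split()).str.join(' ')

Another option is a character class that lists the whitespace characters to collapse:
import re
string = re.sub('[ \t\n]+', ' ', 'The  quick brown \n\n \t fox')

With split and join:
str1 = '    I   live    on    earth           '
' '.join(str1.split())
But if you prefer a regular expression, it can be done like this:
re.sub(r'\s+', ' ', str1)
although some extra processing is then needed to remove the leading and trailing space.

Sometimes you want to replace each run of consecutive occurrences of a whitespace character with a single instance of that character. You can use a regular expression with backreferences to do that.
The pattern (\s)\1{1,} matches any whitespace character followed by one or more occurrences of that same character. Now all you need to do is specify the first group (\1) as the replacement for the match. Wrapping this in a function:
import re
def normalize_whitespace(string):
    return re.sub(r'(\s)\1{1,}', r'\1', string)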
>>> normalize_whitespace('The   fox jumped   over    the log.')
'The fox jumped over the log.'
>>> normalize_whitespace('First line\t\t\t \n\n\nSecond line')
'First line\t \nSecond line'

To collapse only spaces and tabs:
>>> import re
>>> text = 'this is a string with    multiple spaces and\t\ttabs'
>>> text = re.sub('[ \t]+', ' ', text)
>>> print(text)
this is a string with multiple spaces and tabs

sentence = "  The   fox jumped   over    the log.  "
sentence = ' '.join(filter(None, sentence.split(' ')))
Explanation:
- Split the entire string into a list.
- Filter empty elements from the list.
- Rejoin the remaining elements with a single space.

while "  " in s:
    s = s.replace("  ", " ")
where the variable s represents your string.
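The loop is needed because a single replace pass does not finish the job: each pass only merges non-overlapping pairs of spaces. A small sketch of this, assuming a hypothetical helper name collapse_spaces and a made-up sample string:

```python
def collapse_spaces(s):
    """Repeat the replacement until no double space is left."""
    while "  " in s:
        s = s.replace("  ", " ")
    return s

# Illustrative input (assumption): two letters separated by five spaces.
sample = "a" + " " * 5 + "b"

# One pass replaces non-overlapping pairs, so 5 spaces shrink to 3, not 1.
one_pass = sample.replace("  ", " ")

print(repr(one_pass))
print(repr(collapse_spaces(sample)))
```

Each pass roughly halves the length of a run, so the loop terminates after only a few iterations even for long runs.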
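if '  ' in text:
    while '  ' in text:
        text = text.replace('  ', ' ')
The short-circuiting makes it slightly faster. Go for this if you are after efficiency and are strictly looking to weed out extra whitespace of the single-space variety.

def unPretty(S):
    # Given a dictionary, JSON, list, float, int, or even a string...
    # return a string stripped of CR, LF replaced by space, with multiple spaces reduced to one.
    return ' '.join(str(S).replace('\n', ' ').replace('\r', '').split())

To remove whitespace, taking into account leading, trailing, and extra whitespace between words, use:
(?<=\s) +|^ +(?=\s)| (?= +[\n\0])
The first alternative handles leading whitespace, the second handles leading whitespace at the start of the string, and the last one handles trailing whitespace. The pattern is meant to be used with a regular-expression function from the re module.

I haven't read much into the other examples, but I have just created this method for consolidating multiple consecutive space characters. It does not use any libraries, and although it is relatively long in terms of script length, it is not a complex implementation:
def spaceMatcher(command):
    """
    Function defined to consolidate multiple whitespace characters in
    strings to a single space
    """
    new_command = ""
    # Flag that records whether the previous character was a space
    previous_was_space = False
    for char in command:
        if char == " ":
            # Keep only the first space of each run
            if not previous_was_space:
                new_command += char
            previous_was_space = True
        else:
            new_command += char
            previous_was_space = False
    return new_command

command = input("Please enter a command ->")
print(spaceMatcher(command))
print(list(spaceMatcher(command)))

I have a simple method, without splitting, which I used in college:
line = "I     have  a       nice    day."
end = 1000
while end != 0:
    # str.replace returns a new string, so the result must be assigned back
    line = line.replace("  ", " ")
    end -= 1
This replaces every double space with a single space and does so 1000 times. Since each pass roughly halves every run of spaces, that is far more passes than any realistic input needs, so it will still work with thousands of extra spaces. :)

string = 'This is a   string full of spaces          and taps'
string = string.split(' ')
while '' in string:
    string.remove('')
string = ' '.join(string)
print(string)
Result:
This is a string full of spaces and taps

If you are dealing with whitespace, splitting on None (see 5.6.1. String Methods, str.split()) will not include empty strings in the returned value.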
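To illustrate the remark about splitting on None, here is a small sketch comparing split(' ') with split() called without a separator. The sample sentence is only an illustration.

```python
# Illustrative input (assumption): leading, trailing, and repeated spaces.
text = "  The   fox jumped   over    the log.  "

# Splitting on a literal space yields an empty string for every extra space.
with_empties = text.split(" ")

# Splitting with no separator (sep=None) treats any run of whitespace as one
# separator and ignores leading and trailing whitespace.
without_empties = text.split()

print(with_empties)
print(without_empties)
print(" ".join(without_empties))
```

This is why the split(' ') variants need an extra filtering step, while the plain split() variants do not.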
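foo is your string:
" ".join(foo.split())
Be warned, though, that this removes "all whitespace characters (space, tab, newline, return, formfeed)" (thanks to hhsaffar, see comments). That is, "this is \t a test\n" will effectively end up as "this is a test".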
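If line breaks should survive while whitespace inside each line is still collapsed, one possible workaround is to apply split/join per line. This is only a sketch; the helper name collapse_per_line is hypothetical and does not come from the original answers.

```python
def collapse_per_line(text):
    """Collapse whitespace runs inside each line but keep the line breaks."""
    # splitlines() drops the line terminators, so a trailing newline is lost.
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

# Illustrative input (assumption): two lines with tabs and repeated spaces.
sample = "this is  \t a test\nsecond   line\n"

flattened = " ".join(sample.split())
per_line = collapse_per_line(sample)

print(repr(flattened))
print(repr(per_line))
```

Note that tabs inside a line are still turned into spaces here; only the newlines are preserved.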
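import re
s = "The   fox jumped   over    the log."
re.sub(r"\s\s+" , " ", s)
or
re.sub(r"\s\s+", " ", s)
since the space before the comma is listed as a pet peeve in PEP 8, as mentioned by user Martin Thoma in the comments.

Using regexes with "\s" and doing a simple string.split() will also remove other whitespace, such as newlines, carriage returns, and tabs. Unless that is what you want, to collapse only multiple spaces, I present these examples.
I used 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum to get realistic timing tests, with random-length extra spaces throughout:
original_string = ''.join(word + (' ' * random.randint(1, 10)) for word in lorem_ipsum.split(' '))
The one-liner essentially strips any leading/trailing spaces, whereas proper_join preserves a leading/trailing space (but only ONE ;-).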
# setup = '''

import re

def while_replace(string):
    while '  ' in string:
        string = string.replace('  ', ' ')

    return string

def re_replace(string):
    return re.sub(r' {2,}' , ' ', string)

def proper_join(string):
    split_string = string.split(' ')

    # To account for leading/trailing spaces that would simply be removed
    beg = ' ' if not split_string[ 0] else ''
    end = ' ' if not split_string[-1] else ''

    # versus simply ' '.join(item for item in string.split(' ') if item)
    return beg + ' '.join(item for item in split_string if item) + end

original_string = """Lorem    ipsum        ... no, really, it kept going...          malesuada enim feugiat.         Integer imperdiet    erat."""

assert while_replace(original_string) == re_replace(original_string) == proper_join(original_string)

#'''

# while_replace_test
new_string = original_string[:]

new_string = while_replace(new_string)

assert new_string != original_string

# re_replace_test
new_string = original_string[:]

new_string = re_replace(new_string)

assert new_string != original_string

# proper_join_test
new_string = original_string[:]

new_string = proper_join(new_string)

assert new_string != original_string

NOTE: The "while version" made a copy of original_string, because I believe that once it is modified on the first run, successive runs would be faster (if only by a bit). As this adds time, I added this string copy to the other two so that the times show the difference only in the logic. Keep in mind that on timeit instances the setup is executed only once per timing run while the main stmt is executed repeatedly; the original way I did this, the while loop worked on the same label, original_string, so on the second run there would be nothing to do. The way it is set up now, calling a function and using two different labels, that is not a problem. I have added assert statements to all the workers to verify that we change something on every iteration (for those who may be dubious). E.g., change to this and it breaks:

# while_replace_test
new_string = original_string[:]

new_string = while_replace(new_string)

assert new_string != original_string # will break the 2nd iteration

while '  ' in original_string:
    original_string = original_string.replace('  ', ' ')

Tests run on a laptop with an i5 processor running Windows 7 (64-bit).

timeit.Timer(stmt = test, setup = setup).repeat(7, 1000)

test_string = 'The   fox jumped   over\n\t    the log.' # trivial

Python 2.7.3, 32-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001066   | 0.001260   | 0.001128   | 0.001092
re_replace_test      | 0.003074   | 0.003941   | 0.003357   | 0.003349
proper_join_test     | 0.002783   | 0.004829   | 0.003554   | 0.003035

Python 2.7.3, 64-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001025   | 0.001079   | 0.001052   | 0.001051
re_replace_test      | 0.003213   | 0.004512   | 0.003656   | 0.003504
proper_join_test     | 0.002760   | 0.006361   | 0.004626   | 0.004600

Python 3.2.3, 32-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001350   | 0.002302   | 0.001639   | 0.001357
re_replace_test      | 0.006797   | 0.008107   | 0.007319   | 0.007440
proper_join_test     | 0.002863   | 0.003356   | 0.003026   | 0.002975

Python 3.3.3, 64-bit, Windows
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.001444   | 0.001490   | 0.001460   | 0.001459
re_replace_test      | 0.011771   | 0.012598   | 0.012082   | 0.011910
proper_join_test     | 0.003741   | 0.005933   | 0.004341   | 0.004009

test_string = lorem_ipsum
# Thanks to http://www.lipsum.com/
# "Generated 11 paragraphs, 1000 words, 6665 bytes of Lorem Ipsum"

Python 2.7.3, 32-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.342602   | 0.387803   | 0.359319   | 0.356284
re_replace_test      | 0.337571   | 0.359821   | 0.348876   | 0.348006
proper_join_test     | 0.381654   | 0.395349   | 0.388304   | 0.388193

Python 2.7.3, 64-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.227471   | 0.268340   | 0.240884   | 0.236776
re_replace_test      | 0.301516   | 0.325730   | 0.308626   | 0.307852
proper_join_test     | 0.358766   | 0.383736   | 0.370958   | 0.371866

Python 3.2.3, 32-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.438480   | 0.463380   | 0.447953   | 0.446646
re_replace_test      | 0.463729   | 0.490947   | 0.472496   | 0.468778
proper_join_test     | 0.397022   | 0.427817   | 0.406612   | 0.402053

Python 3.3.3, 64-bit
test                 | minimum    | maximum    | average    | median
---------------------+------------+------------+------------+-----------
while_replace_test   | 0.284495   | 0.294025   | 0.288735   | 0.289153
re_replace_test      | 0.501351   | 0.525673   | 0.511347   | 0.508467
proper_join_test     | 0.422011   | 0.448736   | 0.436196   | 0.440318

For the trivial string, it would seem that a while loop is the fastest, followed by the Pythonic string split/join, with the regex pulling up the rear.
For non-trivial strings, there seems to be a bit more to consider. 32-bit 2.7? It's regex to the rescue! 2.7 64-bit? A while loop is best, by a decent margin. 32-bit 3.2, go with the "proper" join. 64-bit 3.3, go for a while loop. Again.
In the end, one can improve performance if/where/when needed.
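For readers who want to repeat the measurement on their own machine, here is a rough, self-contained sketch of a similar harness. It is not the original benchmark: the input string is a short stand-in rather than the Lorem Ipsum text, split_join is the plain split/join rather than proper_join, and the functions are timed through a callable instead of stmt/setup strings.

```python
import re
import statistics
import timeit

def while_replace(string):
    while "  " in string:
        string = string.replace("  ", " ")
    return string

def re_replace(string):
    return re.sub(r" {2,}", " ", string)

def split_join(string):
    # Plain split/join; unlike proper_join it also strips the ends.
    return " ".join(string.split())

# Stand-in input (assumption): a short sentence with extra spaces, repeated.
test_string = "The   fox jumped   over    the log. " * 50

results = {}
for func in (while_replace, re_replace, split_join):
    # Timer accepts a callable; repeat() returns one total time per repetition.
    timer = timeit.Timer(lambda: func(test_string))
    times = timer.repeat(repeat=7, number=1000)
    results[func.__name__] = (
        min(times), max(times), statistics.mean(times), statistics.median(times)
    )

for name, (lo, hi, avg, med) in results.items():
    print(f"{name:15s} | {lo:.6f} | {hi:.6f} | {avg:.6f} | {med:.6f}")
```

The absolute numbers depend heavily on the Python version, platform, and input, so they should not be compared directly with the tables above.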
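Using ' '.join(the_string.split()) is vastly preferable to reaching for a regex. My measurements (Linux and Python 2.5) show split-then-join to be almost five times faster than doing the "re.sub(...)", and still three times faster if you precompile the regex once and do the operation multiple times. It is also, by any measure, easier to understand and much more Pythonic.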
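The "precompile once" variant mentioned in that comparison can be sketched as follows. The names WHITESPACE_RUN, collapse_regex, and collapse_split are hypothetical, and the strip() call is added only so that both helpers return the same result on the sample inputs.

```python
import re

# Compiled once at import time and reused for every call (assumed name).
WHITESPACE_RUN = re.compile(r"\s+")

def collapse_regex(s):
    # sub() leaves one space at each end if the input had any; strip() removes it.
    return WHITESPACE_RUN.sub(" ", s).strip()

def collapse_split(s):
    return " ".join(s.split())

# Illustrative inputs (assumption).
samples = [
    "The   fox jumped   over    the log.",
    "  leading and trailing  ",
    "tabs\tand\nnewlines",
]

for sample in samples:
    assert collapse_regex(sample) == collapse_split(sample)
    print(repr(collapse_split(sample)))
```

Timing the two helpers on your own data is the only reliable way to see how large the difference is on a current Python version.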
>>> import re
>>> s = "The   fox jumped   over    the log."
>>> re.sub(r'\s{2,}', ' ', s)
'The fox jumped over the log.'
This replaces every run of two or more whitespace characters (tabs, newlines, and spaces) with a single space.