I have a text that contains words and numbers. Here is a representative sample:
string = "This is a 1example of the text. But, it only is 2.5 percent of all data"
I want to convert it to:
"This is a 1 example of the text But it only is 2.5 percent of all data"
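So: remove the punctuation (it can be ".", "," or any other character in string.punctuation) and put a space between a number and a word when they are joined together, but keep floats such as the 2.5 in my example intact.
I used the following code: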
import re

item = "This is a 1example of the text. But, it only is 2.5 percent of all data"
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# This is a start, but not there yet!
# item = ' '.join([x.strip(string.punctuation) for x in item.split() if x not in string.digits])
item = ' '.join(re.split(r'(\d+)', item))
print(item)
The result is:
>> "This is a 1 example of the text. But, it only is 2 . 5 percent of all data"
I'm almost there, but I can't figure out the last piece.
You can use a regex with lookarounds, like this:
(?<!\d)[.,;:](?!\d)
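The idea is to have a character class collect the punctuation you want to remove, and to use lookarounds so that it only matches punctuation that is neither preceded nor followed by a digit.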
import re

regex = r"(?<!\d)[.,;:](?!\d)"
test_str = "This is a 1example of the text. But, it only is 2.5 percent of all data"
result = re.sub(regex, "", test_str)
print(result)
The result is:
This is a 1example of the text But it only is 2.5 percent of all data
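The result above still contains "1example". As a minimal sketch of one way to add the digit/letter spacing on top of the lookaround pattern: the helper name clean and the second pattern are assumptions added here for illustration, not part of the answer.

```python
import re

# Punctuation with no digit directly before or after it (pattern from the answer).
PUNCT = re.compile(r"(?<!\d)[.,;:](?!\d)")
# Assumed addition: a zero-width boundary between a digit and a letter, in either order.
DIGIT_LETTER = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")


def clean(text):
    """Hypothetical helper: drop punctuation, then space out digit/letter joins."""
    text = PUNCT.sub("", text)
    return DIGIT_LETTER.sub(" ", text)


print(clean("This is a 1example of the text. But, it only is 2.5 percent of all data"))
```

Note that removing punctuation with an empty string would glue together words that were separated only by a mark (for example "text.But"), so replacing with a space and collapsing whitespace afterwards may be preferable for such input.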
OK folks, here is an answer (the best? I don't know, but it seems to work):
import re

item = "This is a 1example 2Ex of the text.But, it only is 2.5 percent of all data?"
# If two strings are concatenated and the second starts with a capital letter
item = ' '.join(re.sub(r"([A-Z])", r" \1", item).split())
# If a word starts with a digit, like "1example"
item = ' '.join(re.split(r'(\d+)([A-Za-z]+)', item))
# Magical line that removes punctuation except inside floats
item = re.sub(r'\S+', lambda m: re.match(r'^\W*(.*\w)\W*$', m.group()).group(1), item)
# Collapse the double spaces left over by the join above
item = item.replace("  ", " ")
print(item)
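I don't know Python well, but I know a bit about regular expressions. May I suggest using an OR? I would use this regex:
"(\d+)([a-zA-Z])|([a-zA-Z])(\d+)"
and then use this as the replacement string: "\1 \2"
I tried this and it works very well.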
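In Python's re module, the second alternative of that pattern fills groups 3 and 4 rather than 1 and 2, so a single "\1 \2" replacement only covers the digit-first case. A hedged sketch that uses a callback to handle both branches (the function name space_out is hypothetical):

```python
import re

PATTERN = re.compile(r"(\d+)([a-zA-Z])|([a-zA-Z])(\d+)")


def space_out(match):
    # Exactly one alternative matched; use whichever pair of groups is set.
    if match.group(1) is not None:
        return match.group(1) + " " + match.group(2)
    return match.group(3) + " " + match.group(4)


a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
print(PATTERN.sub(space_out, a))
```

This only inserts the spaces; the punctuation is left in place and would still need one of the other approaches in this thread.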
a = "This is a 1example of the text. But, it only is 2.5 percent of all data"
# str.replace returns a new string, so keep the result
a = a.replace(". ", " ").replace(", ", " ")
print(a)
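Note that in the replace calls the punctuation is followed by a space. I simply replaced each punctuation mark plus the space after it with a single space.
Code: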
from itertools import groupby
s1 = "This is a 1example of the text. But, it only is 2.5 percent of all data"
s2 = [''.join(g) for _, g in groupby(s1, str.isalpha)]
s3 = ' '.join(s2).replace("  ", " ").replace("  ", " ")
# you can keep adding a replace for each punctuation mark
s4 = s3.replace(". ", " ").replace(", "," ").replace("; "," ").replace(", "," ").replace("- "," ").replace("? "," ").replace("! "," ").replace(" ("," ").replace(") "," ").replace('" '," ").replace(' "'," ").replace('... '," ").replace('/ '," ").replace(' “'," ").replace('” '," ").replace('] '," ").replace(' ['," ")
s5 = s4.replace("  ", " ")
print(s5)
Output:
'This is a 1 example of the text But it only is 2.5 percent of all data'
P.S.: You can look through the punctuation marks and keep adding them to the .replace() chain.
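Rather than extending the chain by hand, the same idea can be sketched as a loop over string.punctuation. This assumes that dropping every ASCII punctuation mark that is followed by a space is acceptable, and it covers only that form (not marks preceded by a space or sitting at the very end of the string):

```python
import string

text = "This is a 1example of the text. But, it only is 2.5 percent of all data"

for mark in string.punctuation:
    # Only a mark followed by a space is dropped, so the dot in "2.5" survives.
    text = text.replace(mark + " ", " ")

print(text)
```

This sketch does not add the space between digits and letters; the groupby step from the answer is still needed for that.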
Here is a regex approach:
([^ ]?)(?:[^\P{punct}.]|(?<!\d)\.(?!\d))([^ ]?)
Replace in a callback:
if length of $1 > 0 and length of $2 > 0
    replace with $1 + space + $2
else
    replace with $1$2
Expanded:
( [^ ]? ) # (1)
(?:
[^\P{punct}.]
|
(?<! \d )
\.
(?! \d )
)
( [^ ]? ) # (2)
If you don't want any logic for the characters adjacent to the punctuation, use
(?:[^\P{punct}.]|(?<!\d)\.(?!\d))
and replace with nothing.
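Python's built-in re module does not support \p{...} property classes, so the pattern above cannot be passed to it unchanged. A rough standard-library approximation of the callback version, assuming string.punctuation (ASCII only) is an acceptable stand-in for the punct class; the names OTHER_PUNCT and join_neighbours are illustrative:

```python
import re
import string

# Every ASCII punctuation character except the dot, escaped for use inside [...].
OTHER_PUNCT = re.escape(string.punctuation.replace(".", ""))

PATTERN = re.compile(
    r"([^ ]?)(?:[" + OTHER_PUNCT + r"]|(?<!\d)\.(?!\d))([^ ]?)"
)


def join_neighbours(match):
    left, right = match.group(1), match.group(2)
    # Insert a space only when the mark sat between two non-space characters.
    if left and right:
        return left + " " + right
    return left + right


text = "This is a 1example of the text.But, it only is 2.5 percent of all data?"
print(PATTERN.sub(join_neighbours, text))
```

The callback is what keeps "text.But" from collapsing into one word: the dot there has a non-space character on both sides, so it is replaced by a space rather than by nothing.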
To complement the accepted answer, you can use an OR in the regex so that punctuation is kept only when it sits between two digits:
import re

test_str = "This is a .12 23. 1example of the text. But, it only is 2.5 percent of all data"
punctuation = ".,()/"
# Remove a mark unless it has a digit on both sides
regex_expr = r'[{0}](?!\d)|(?<!\d)[{0}]'.format(re.escape(punctuation))
result = re.sub(regex_expr, "", test_str)
print(result)
Output:
'This is a 12 23 1example of the text But it only is 2.5 percent of all data'