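Python regex to remove newlines that are not preceded by certain punctuation marks and are not followed by a line starting with a form feed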

Question Votes: 0 Answers: 1

I need to process text produced by scanning and OCR, like the following:

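they were looked after lavishly. There was fierce competition among
the senior and best performers who all sought excellence in their
performance. Usually, the best shooters would reach a distance of 800
yards or more depending on the wind direction. If one was lucky, the
wind direction might add several yards. Here, the Phenpo archers were
very much involved in supporting their employers. Their closely
guarded technique was in balancing the bow and arrow by matching the
strength of the bow to the weight of the arrow.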

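I need to remove every newline that does not come right after . or " or ! or ? and whose next line does not start with a form feed (\f).

So I wrote the following code: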

processed_text = regex.sub(r'(?<![."!?])-*\n(?!\f)', ' ', processed_text)

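It worked well until I found that some lines beginning with a form feed should be excluded. I tried several approaches, but they all failed. Now I am stuck and don't know what to do.

Any help would be appreciated.

David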

python regex-lookarounds regexp-replace
1 Answer
0 votes

You can modify the regex pattern so that it excludes lines starting with a form feed (\f).

import re

text = """
they were looked after lavishly. There was fierce competition among the senior and best performers who all sought excellence in their performance. Usually, the best shooters would reach a distance of 800 yards or more depending on the wind direction. If one was lucky, the wind direction might add several yards. Here, the Phenpo archers were very much involved in supporting their employers. Their closely guarded technique was in balancing the bow and arrow by matching the strength of the bow to the weight of the arrow.

\fThis line starts with formfeed. It should not be removed.
they were looked after lavishly. There was fierce competition among the senior and best performers who all sought excellence in their performance. Usually, the best shooters would reach a distance of 800 yards or more depending on the wind direction. If one was lucky, the wind direction might add several yards. Here, the Phenpo archers were very much involved in supporting their employers. Their closely guarded technique was in balancing the bow and arrow by matching the strength of the bow to the weight of the arrow.
"""

# Replace a newline with a space unless it follows . " ! or ?, or is followed by a form feed.
processed_text = re.sub(r'(?<![\."!?])\n(?!\f)', ' ', text)
print(processed_text)
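
The sample text above has no line breaks in the middle of a sentence, so it exercises the pattern very little. The following is a minimal sketch on hard-wrapped input, assuming the goal is exactly what the question states. The names `SOFT_BREAK` and `join_wrapped_lines` and the sample string are made up for illustration and do not come from the original post.

```python
import re

# A newline is treated as a "soft" break when it is NOT preceded by
# . " ! or ? and NOT immediately followed by a form feed (\f).
SOFT_BREAK = re.compile(r'(?<![."!?])\n(?!\f)')


def join_wrapped_lines(text: str) -> str:
    """Replace every soft line break with a single space (hypothetical helper)."""
    return SOFT_BREAK.sub(' ', text)


# Illustrative input: two wrapped sentences and a page that begins with \f.
sample = (
    "There was fierce competition among\n"
    "the senior and best performers.\n"
    "Usually, the best shooters would reach\n"
    "\fPage two starts here\n"
    "and continues on this line.\n"
)

result = join_wrapped_lines(sample)
print(repr(result))
```

In this sketch the newline before the form feed is kept, and so are the newlines after a full stop. Note that the lookahead only protects the newline that comes before a form feed: a line that itself begins with \f is still joined to the line after it if it does not end in one of the listed punctuation marks.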