I want to split a string in Python.
Example string:
Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more
...into this list:
['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1', 'and', 'SCENE 2', 'and more']
Can someone help me build the regex? The one I came up with is
(ACT [A-Z]+.\sSCENE\s[0-9]+)]?(.*)(SCENE [0-9]+)
but it does not work properly.
Here is a working script, albeit a bit hacky:
import re
inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = re.findall(r'[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*', inp)
print(parts)
This prints:
['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1',
'and', 'SCENE 2', 'and more']
Here is an explanation of the regex logic. It uses an alternation to match one of two cases:
[A-Z]{2,} match TWO or more capital letters
(?: [A-Z0-9.]+)* followed by zero or more words, consisting only of
capital letters, numbers, or period
| OR
(?![A-Z]{2})\w+ match a word which does NOT start with two capital letters
(?: (?![A-Z]{2})\w+)* then match zero or more similar terms
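The breakdown above can also be kept next to the pattern itself. The following is a minimal sketch of the same regex written with re.VERBOSE, so each piece carries its own comment. The names SCENE_PATTERN and split_scenes are hypothetical and chosen only for illustration. Because re.VERBOSE ignores unescaped whitespace outside character classes, each literal space in the pattern is written as [ ].

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE.
# SCENE_PATTERN and split_scenes are illustrative names, not part of any library.
SCENE_PATTERN = re.compile(
    r"""
    [A-Z]{2,}                 # a word that starts with two or more capital letters
    (?:[ ][A-Z0-9.]+)*        # then zero or more tokens of capitals, digits, or periods
    |                         # OR
    (?![A-Z]{2})\w+           # a word that does NOT start with two capital letters
    (?:[ ](?![A-Z]{2})\w+)*   # then zero or more similar words
    """,
    re.VERBOSE,
)


def split_scenes(text):
    """Return the alternating plain-text and ACT/SCENE chunks found in text."""
    # The pattern has no capturing groups, so findall returns the full matches.
    return SCENE_PATTERN.findall(text)


inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
print(split_scenes(inp))
```

Like the original one-liner, this relies on the headings being the only words that start with two capital letters, so other all-caps words in the text would also be treated as headings.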