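I have been trying to write a regex pattern that captures the text enclosed in single or double quotes. The closing quote must be the same character as the opening quote, and any quotes of the other kind nested inside must be kept as part of the captured text.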
'\'this text must be captured\' "this one too" \'and this "nested" too\' \'this should not be captured"'
['this text must be captured', 'this one too', 'and this "nested" too']
I have tried a few patterns, but each one has a problem:
pattern = r'"(.*?)"|\'(.*?)\''
pattern = r'"([^"]*)"|\'([^\']*)\''
Result:
[('', 'this text must be captured'), ('this one too', ''), ('', 'and this "nested" too')]
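Here, one of the two alternatives captures the text correctly for each match, but the group belonging to the other alternative comes back as an empty string.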
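One way to work around the empty groups, sketched here under the assumption that a flat list of strings is the goal, is to keep whichever group of each tuple is non-empty. The variable names are illustrative and not from the original post.

```python
import re

# Input string from the question (the last segment opens with ' but ends with ").
text = '\'this text must be captured\' "this one too" \'and this "nested" too\' \'this should not be captured"'

pattern = r'"(.*?)"|\'(.*?)\''

# re.findall returns one (double_quoted, single_quoted) tuple per match.
# At most one side is non-empty, so `double or single` picks the captured text.
result = [double or single for double, single in re.findall(pattern, text)]
print(result)
```

For the sample input this yields the three expected strings. An empty quoted string such as `""` would come back as `''`, which is probably what is wanted.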
pattern = r'(?P<unquoted>(?:"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'))'
Result:
["'this text must be captured'", '"this one too"', '\'and this "nested" too\'']
Here a single group is captured, but it includes the surrounding quotes, which should not be part of the result.
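A possible variant of the escape-aware pattern that leaves the quotes out is to remember the opening quote in one group and require the same character at the end with a backreference. This is only a sketch: the group layout and the `(?!\1)` lookahead are my own choice, not something taken from the patterns above.

```python
import re

# Input string from the question (the last segment opens with ' but ends with ").
text = '\'this text must be captured\' "this one too" \'and this "nested" too\' \'this should not be captured"'

# Group 1 remembers the opening quote; group 2 is the content without quotes.
# Inside the content, `\\.` consumes a backslash-escaped character, and
# `(?!\1)[^\\]` consumes any non-backslash character that is not the opening quote.
# The final `\1` requires the closing quote to match the opening one.
pattern = re.compile(r'''(["'])((?:\\.|(?!\1)[^\\])*)\1''')

result = [m.group(2) for m in pattern.finditer(text)]
print(result)
```

Because the other kind of quote is allowed inside the content, `'and this "nested" too'` is captured whole, while the last segment never finds a matching closing quote and is skipped. Escaped quotes stay in the captured text together with their backslash.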
The closest I could get is this:
Input:
'this text must be captured' "this one too" 'and this "nested" too' 'this should not be captured"'
Regex pattern:
/['](.*?)[']|"(.*?)"/gm
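The drawback of this regex is that it does not handle strings containing a lone apostrophe, and it also returns the string that carries the extra double quote. You can filter that one out of the result set by checking whether a match contains exactly one single or double quote character.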
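The filtering step could look roughly like this in Python. This is a sketch, not the answer's own code: the helper name `has_unbalanced_quote` is hypothetical, and it generalizes "exactly one quote character" to "an odd number of either quote character", which is an assumption on my part.

```python
import re

# Input as pasted in this answer; note the extra trailing ' after the last ".
text = (
    "'this text must be captured' "
    '"this one too" '
    "'and this \"nested\" too' "
    "'this should not be captured\"'"
)

# Same pattern as above, without the JavaScript-style delimiters and flags.
pattern = r'''['](.*?)[']|"(.*?)"'''


def has_unbalanced_quote(s):
    """Hypothetical helper: True if s holds an odd number of ' or " characters."""
    return s.count('"') % 2 == 1 or s.count("'") % 2 == 1


# Keep the non-empty group of each match, then drop the unbalanced ones.
captured = [single or double for single, double in re.findall(pattern, text)]
result = [s for s in captured if not has_unbalanced_quote(s)]
print(result)
```

Before filtering, `captured` holds four strings, the last one ending in a stray `"`. The filter removes it, but it would also discard a legitimate match such as `it's`, which is the apostrophe limitation mentioned above.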
For escaped quotes, see "Regex for quoted string with escaping quotes".