Capturing text inside quotes, including nested quotes

Question

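I have been trying to write a regex pattern that captures the text enclosed in single or double quotes, where the closing quote must be the same as the opening quote and any nested quotes must be included in the capture.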

'\'this text must be captured\' "this one too" \'and this "nested" too\' \'this should not be captured"'

['this text must be captured', 'this one too', 'and this "nested" too']

I have tried a few patterns, but each has a problem:

pattern = r'"(.*?)"|\'(.*?)\''
pattern = r'"([^"]*)"|\'([^\']*)\''

Result:

[('', 'this text must be captured'), ('this one too', ''), ('', 'and this "nested" too')]

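Here, whichever of the two alternatives matches captures correctly, but the group for the other alternative comes back as an empty string.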

pattern = r'(?P<unquoted>(?:"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'))'

Result:

["'this text must be captured'", '"this one too"', '\'and this "nested" too\'']

Here a single group is captured, but it includes the surrounding quotes, which it should not.

Answer 1

The closest I could get is this:

Input:

'this text must be captured' "this one too" 'and this "nested" too' 'this should not be captured"'

Regex pattern:

/['](.*?)[']|"(.*?)"/gm

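The downside of this regex is that it does not handle strings containing a lone apostrophe, and it returns strings with an extra double quote. You can filter those out of the result set by checking whether a match contains exactly one single or double quote character.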
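The filtering idea can be sketched in Python, the language used in the question. This is only a sketch under stated assumptions: the helper name `quoted_texts` is made up for illustration, and instead of two alternatives it uses a single group with a backreference (`\1`) so the closing quote must equal the opening one, which is a swapped-in technique rather than the pattern above. The "unbalanced quote" check (an odd count of either quote character inside the capture) is one possible reading of the filter described here.

```python
import re

# One group for the opening quote, one for the content; \1 forces the
# closing quote to be the same character as the opening quote.
QUOTED = re.compile(r"""(["'])(.*?)\1""")

def quoted_texts(text):
    """Hypothetical helper: return quoted contents without the outer quotes."""
    results = []
    for match in QUOTED.finditer(text):
        inner = match.group(2)
        # Assumption: an odd number of either quote character inside the
        # capture means a stray, unbalanced quote, so the match is dropped.
        if inner.count('"') % 2 or inner.count("'") % 2:
            continue
        results.append(inner)
    return results

# The string literal from the question (last segment opens with ' and ends with ").
question_input = '\'this text must be captured\' "this one too" \'and this "nested" too\' \'this should not be captured"'
# The input as written in this answer, with one more trailing single quote.
answer_input = question_input + "'"

print(quoted_texts(question_input))
print(quoted_texts(answer_input))
# Both print: ['this text must be captured', 'this one too', 'and this "nested" too']
```

The limitation mentioned above still applies: with this filter, a legitimate capture such as "don't" contains a lone apostrophe and would be discarded as well.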