How do I parse parameters from text?

Question

I have a piece of text that looks like this:

ENGINE = CollapsingMergeTree (
    first_param
    ,(
        second_a
        ,second_b, second_c,
        ,second d), third, fourth)

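The engine can vary (instead of CollapsingMergeTree there may be a different word, such as ReplacingMergeTree, SummingMergeTree, ...), but the text always has the form ENGINE = word(...). There may be spaces around the "=" sign, but they are not mandatory. The parameters inside the parentheses are usually single words separated by commas, but some parameters are themselves parenthesized, like the second one in the example above. Line breaks can occur anywhere. A line can end with a comma, a parenthesis, or anything else.

I need to extract the n parameters (I don't know in advance how many there are). In the example above there are 4 parameters: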

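first = first_param
second = (second_a, second_b, second_c, second_d) [extracted together with the parentheses]
third = third
fourth = fourth

How can I do this in Python (with regular expressions or anything else)?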

Answer 1

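In any language you would probably want to use a proper parser for this (look up how to hand-write a parser for a simple language). However, since what you show here happens to be syntactically compatible with Python, you could parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.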
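As a rough sketch of that idea, the following treats everything after the "=" as a Python call expression and walks the resulting tree. The function names `parse_engine_call` and `node_to_value` are made up for this illustration, and the approach assumes the text is valid Python syntax, so the stray double comma and the "second d" in the question's sample would raise a SyntaxError and must be fixed first.

```python
import ast


def node_to_value(node):
    """Convert a bare name or a (possibly nested) tuple of names to plain Python data."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Tuple):
        return tuple(node_to_value(element) for element in node.elts)
    raise ValueError(f"unsupported argument type: {type(node).__name__}")


def parse_engine_call(text):
    """Return (engine_name, [arguments]) for text shaped like 'ENGINE = Name(...)'."""
    # Everything after the first "=" is parsed as a single Python expression.
    _, _, call_source = text.partition("=")
    call = ast.parse(call_source.strip(), mode="eval").body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError("expected text of the form ENGINE = name(...)")
    return call.func.id, [node_to_value(arg) for arg in call.args]


sample = '''ENGINE = CollapsingMergeTree (
    first_param
    ,(
        second_a
        ,second_b, second_c
        ,second_d), third, fourth)'''

engine, args = parse_engine_call(sample)
print(engine)
for arg in args:
    print(arg)
```

Parenthesized parameters come back as tuples, so the number of parameters does not need to be known in advance.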

Answer 2

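I came up with a regular-expression solution for your problem. I tried to keep the pattern "generic", because I don't know whether the text will always contain newlines and spaces. As a result the pattern captures a lot of whitespace, which is then removed.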

# Import the module for regular expressions
import re

# Text to search. I corrected it a bit: your example said "second d", and second_c was followed by two commas. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
    first_param
    ,(
        second_a
        ,second_b, second_c
        ,second_d), third, fourth)'''

# Regex search pattern. re.S means that ".", which represents ANY character, also matches \n (newlines).
# A raw string (r'...') is used so that backslashes such as \( reach the regex engine unchanged.
pattern = re.compile(r'ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S)

# Apply the pattern to the text and save the match in the variable 'result'. result[0] would return the whole matched text.
# The items you want are the sub-expressions enclosed in parentheses (); they can be accessed with result[1] and above.
result = pattern.match(text)

# result[1] gets everything after the parenthesis that follows CollapsingMergeTree, up to the first ",(" sequence, including whitespace and newlines.
# re.sub is used to replace all whitespace, including newlines, with nothing.
first = re.sub(r'\s', '', result[1])

# result[2] gets second_a to second_d, again with whitespace and newlines, which re.sub removes.
second = re.sub(r'\s', '', result[2])

third = re.sub(r'\s', '', result[3])

fourth = re.sub(r'\s', '', result[4])

print(first)
print(second)
print(third)
print(fourth)

OUTPUT:

first_param
second_a,second_b,second_c,second_d
third
fourth

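Regex explanation:

\ = escapes a control character, that is, a character the regex engine would otherwise interpret as special.

\( = an escaped (literal) parenthesis

() = marks the expression inside the parentheses as a sub-group. See result[1] etc.

. = matches any character (including newlines, because of re.S)

* = matches 0 or more occurrences of the preceding expression.

? = matches 0 or 1 occurrences of the preceding expression.

Note: *? combined is called non-greedy (lazy) matching, meaning the preceding expression matches as few characters as possible instead of as many as possible.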
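To see what the `*?` combination changes in practice, here is a small illustrative comparison. The sample string is made up for this demonstration and is not part of the original problem.

```python
import re

sample = "f(a), g(b)"

# Plain ".*" is greedy: it runs to the LAST closing parenthesis it can reach.
greedy = re.search(r"\((.*)\)", sample).group(1)

# ".*?" is non-greedy: it stops at the FIRST closing parenthesis.
lazy = re.search(r"\((.*?)\)", sample).group(1)

print(greedy)  # a), g(b
print(lazy)    # a
```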

I am not an expert, but I hope I got the explanation right.

I hope this helps.
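Since the question says that the engine name varies and that the number of parameters is not known in advance, a possible alternative is to find the header with a regex and then split the parameter list on commas that sit at the top nesting level. The sketch below is one way to do that. The function name `split_engine_params` is hypothetical, and it assumes the parentheses are balanced and that whitespace inside a parameter can be discarded.

```python
import re


def split_engine_params(text):
    """Return (engine_name, [params]) for text shaped like 'ENGINE = Name(...)'."""
    header = re.search(r"ENGINE\s*=\s*(\w+)\s*\(", text)
    if header is None:
        raise ValueError("no 'ENGINE = name(' header found")

    params, current, depth = [], [], 1  # depth 1 = inside the outer parentheses
    for ch in text[header.end():]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:  # the outer parenthesis has closed
                break
        if ch == "," and depth == 1:
            # A top-level comma ends the current parameter.
            params.append("".join(current))
            current = []
        else:
            current.append(ch)
    else:
        raise ValueError("unbalanced parentheses")
    params.append("".join(current))

    # Remove all whitespace, including newlines, from each parameter.
    return header.group(1), [re.sub(r"\s+", "", p) for p in params]


sample = '''ENGINE = CollapsingMergeTree (
    first_param
    ,(
        second_a
        ,second_b, second_c
        ,second_d), third, fourth)'''

engine, params = split_engine_params(sample)
print(engine)
for p in params:
    print(p)
```

Nested parameters keep their parentheses, as the question asks, and any number of parameters is handled. An empty or doubled comma at the top level produces an empty string in the result.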
