使用嵌套引号解析字符串

问题描述 投票:1回答:5

我需要解析一个看起来像这样的字符串:

"prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"

我希望得到如下列表:

['field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']

我已经尝试使用正则表达式,但它在伪SQL语句的子字符串中不起作用。

我怎样才能获得该列表?

python quotes string-parsing
5个回答
2
投票

如果您知道SQL字符串应该是什么样子,那么这是一种脆弱的方法。

我们匹配SQL字符串,并将其余部分拆分为开始和结束字符串。

然后我们匹配更简单的字段模式,并从该模式的开始建立一个列表,在SQL匹配中添加回来,然后在结束字符串中添加字段。

sqlmatch = 'select .* LIMIT 0'
fieldmatch = "'(|\w+)'"
match = re.search(sqlmatch, mystring)
startstring = mystring[:match.start()]
sql = mystring[match.start():match.end()]
endstring = mystring[match.end():]
result = []
for found in re.findall(fieldmatch, startstring):
    result.append(found)

result.append(sql)
for found in re.findall(fieldmatch, endstring):
    result.append(found)

然后结果列表如下所示:

['field1',
 '',
 'field2',
 'field3',
 'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') 
OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\')  LIMIT 0',
 'field5',
 'field6',
 'field7',
 'field8',
 'field9',
 '',
 'field10']

1
投票

由于字段数是固定的,而非sql字段没有嵌入式引号,因此有一个简单的三行解决方案:

prefix, other = string.partition(' ')[::2]
fields = other.strip('\'').split('\', \'')
fields[4:-7] = [''.join(fields[4:-7])]

print(fields)

输出:

['field1', '', 'field2', 'field3', "select ... where (column1 = '2017') and ((('literal1literal2literal3literal4literal5literal6literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ", 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']

0
投票

有人指出你的字符串格式不正确,我使用了这个:

mystr = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"

found = [a.replace("'", '').replace(',', '') for a in mystr.split(' ') if "'" in a]

哪个回报:

['field1',
 '',
 'field2',
 'field3',
 'select',
 '2017)',
 '(((literal1',
 'literal2',
 'literal3',
 'literal4',
 'literal5',
 'literal6',
 'literal7)',
 '(literal8)',
 'literal9)',
 '',
 'field5',
 'field6',
 'field7',
 'field8',
 'field9',
 '',
 'field10']

0
投票

如果字段数不变,您可以执行以下操作:

def splitter(string):
    strip_chars = "\"' "
    string = string[len('prefix '):] # remove the prefix
    left_parts = string.split(',', 4) # only split up to 4 times
    for i in left_parts[:-1]:
        yield i.strip(strip_chars) # return what we've found so far
    right_parts = left_parts[-1].rsplit(',', 7) # only split up to 7 times starting from the right
    for i in right_parts:
        yield i.strip(strip_chars) # return the rest

mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
result = list(splitter(mystr))
print(repr(result))


# result:
[
    'field1',
    '',
    'field2',
    'field3',
    'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\')  LIMIT 0',
    'field5',
    'field6',
    'field7',
    'field8',
    'field9',
    '',
    'field10'
]

0
投票

实际位于字段之间的逗号分隔符将处于偶数引用级别。因此,通过将这些逗号更改为\ n字符,您可以在字符串上应用简单的.split(“\ n”)来获取字段值。然后,您只需要清理字段值以删除前导/尾随空格和引号。

from itertools import accumulate

string      = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
prefix,data = string.split(" ",1)                   # remove prefix
quoteLevels = accumulate( c == "'" for c in data )  # compute quote levels for each character
fieldData   = "".join([ "\n" if c=="," and q%2 == 0 else c for c,q in zip(data,quoteLevels) ]) # comma to /n at even quote levels
fields      = [ f.strip().strip("'") for f in fieldData.split("'\n '") ] # split and clean content

for i,field in enumerate(fields): print(i,field)

这将打印:

0 field1
1 
2 field2
3 field3
4 select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 
5 field5
6 field6
7 field7
8 field8
9 field9
10 
11 field10
© www.soinside.com 2019 - 2024. All rights reserved.