How can I turn this regex into a Megaparsec parser without making a mess?


Consider this regex:

^foo/[^=]+/baz=(.*),[^,]*$

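If I run it on foo/bar/baz=one,two, it matches and the subgroup captures one. If I run it on foo/bar/baz/bar/baz=three,four,five, it matches and the subgroup captures three,four.

I know how to turn this into a regex-applicative parser or a ReadP parser: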

import Text.Regex.Applicative
match (string "foo/" *> some (psym (/= '=')) *> string "/baz=" *> many anySym <* sym ',' <* many (psym (/= ','))) <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Just "one",Just "three,four"]

import Text.ParserCombinators.ReadP
readP_to_S (string "foo/" *> many1 (satisfy (/= '=')) *> string "/baz=" *> many get <* char ',' <* many (satisfy (/= ',')) <* eof) <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [[("one","")],[("three,four","")]]

Both of these work the way I want them to. But when I try to transliterate the regex directly into Megaparsec, it goes badly:

import Text.Megaparsec
parse (chunk "foo/" *> some (anySingleBut '=') *> chunk "/baz=" *> many anySingle <* single ',' <* many (anySingleBut ',') <* eof) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Left (ParseErrorBundle {bundleErrors = TrivialError 11 (Just (Tokens ('=' :| "one,"))) (fromList [Tokens ('/' :| "baz=")]) :| [], bundlePosState = PosState {pstateInput = "foo/bar/baz=one,two", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}}),Left (ParseErrorBundle {bundleErrors = TrivialError 19 (Just (Tokens ('=' :| "thre"))) (fromList [Tokens ('/' :| "baz=")]) :| [], bundlePosState = PosState {pstateInput = "foo/bar/baz/bar/baz=three,four,five", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})]
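The error at offset 11 can be reduced to a much smaller case. The following is a minimal sketch, not part of the original attempt; the names Parser, greedy, and greedyWithTry are made up for illustration. It shows that the repetition consumes "/baz" before chunk "/baz=" ever runs, and that wrapping the repetition in try changes nothing.

```haskell
import Data.Void (Void)
import Text.Megaparsec

-- Hypothetical alias for illustration: no custom errors, String input.
type Parser = Parsec Void String

-- `some` takes every non-'=' character, including "/baz", so the
-- following chunk only ever sees "=" and fails.
greedy :: Parser String
greedy = some (anySingleBut '=') *> chunk "/baz="

-- `try` only rewinds the input when its own argument fails. The
-- repetition succeeds, so there is nothing for `try` to undo.
greedyWithTry :: Parser String
greedyWithTry = try (some (anySingleBut '=')) *> chunk "/baz="
```

Both parsers reject "bar/baz=", whereas the regex fragment [^=]+/baz= would accept it by giving characters back.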

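I know this is because Megaparsec doesn't backtrack by default. I tried to fix it by just sticking try in a number of different places, but I couldn't get that to work. Eventually I got it working with notFollowedBy: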

import Text.Megaparsec
parse (chunk "foo/" *> some (noneOf "=/" <|> try (single '/' <* notFollowedBy (chunk "baz="))) *> chunk "/baz=" *> many (try (anySingle <* notFollowedBy (many (anySingleBut ',') <* eof))) <* single ',' <* many (anySingleBut ',') <* eof) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Right "one",Right "three,four"]

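But that looks awful! In particular, I don't like that I effectively have to specify much of the pattern twice. And technically, isn't this equivalent to the regex ^foo/(?:[^=/]|/(?!baz=))+/baz=((?:.(?![^,]*$))*),[^,]*$ rather than my original regex? There must be a better way to write this parser. How should I do it?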


Edit: I also tried it this way, and it works too:

import Text.Megaparsec
parse (chunk "foo/" *> (some . try $ many (noneOf "=/") *> single '/') *> chunk "baz=" *> ((++) <$> many (anySingleBut ',') <*> (concat <$> manyTill ((:) <$> single ',' <*> many (anySingleBut ',')) (try $ single ',' *> many (anySingleBut ',') *> eof)))) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Right "one",Right "three,four"]

That still looks messy, though, and the manyTill means it no longer really maps onto any regex.

regex haskell backtracking parser-combinators megaparsec
1 Answer

Without having read this too closely, I think this is the part that's giving you trouble:

(.*),[^,]*

If so, consider using

sepBy (many (noneOf ",")) (string ",")

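That parses a comma-separated list. Then, in pure code, re-insert the commas between all but the last element of that list (for example with an fmap in the appropriate place).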
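One way to read that suggestion, as a hedged sketch rather than the answerer's exact code: split on commas, drop the last field, and join the rest back together. The names Parser and captureTail are invented here. Using sepBy1 instead of sepBy, and rejecting input with no comma, are added assumptions meant to mirror the regex, which requires at least one comma.

```haskell
import Data.List (intercalate)
import Data.Void (Void)
import Text.Megaparsec

-- Hypothetical alias for illustration: no custom errors, String input.
type Parser = Parsec Void String

-- Stand-in for the "(.*),[^,]*$" tail of the regex.
captureTail :: Parser String
captureTail = do
  -- Fields may be empty; the separator always consumes a ',' so the
  -- repetition inside sepBy1 cannot loop forever.
  fields <- many (anySingleBut ',') `sepBy1` single ','
  eof
  case fields of
    [_] -> fail "expected at least one comma"
    -- Drop the last field and restore the commas between the others.
    _   -> pure (intercalate "," (init fields))
```

Since no field parser ever has to give characters back, this version needs neither try nor notFollowedBy.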

From the comments, it seems you're also having trouble with this part:

/[^=]+/baz=

You might consider something like this as a translation:

slashPath = string "/" <++> path
path = string "baz=" <|> (many (noneOf "=/") <++> slashPath)
(<++>) = liftA2 (++)
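For completeness, the three definitions above can be made loadable roughly as follows. This is a sketch under stated assumptions: the Parser alias, the type signatures, and the prefix parser are additions for illustration and are not part of the answer.

```haskell
import Control.Applicative (liftA2)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

-- Hypothetical alias for illustration: no custom errors, String input.
type Parser = Parsec Void String

-- Run two parsers in sequence and concatenate their results.
(<++>) :: Parser String -> Parser String -> Parser String
(<++>) = liftA2 (++)

-- A '/' followed by the rest of the path.
slashPath :: Parser String
slashPath = string "/" <++> path

-- Either the terminating "baz=", or one more segment and another '/'.
-- `string` rewinds on a mismatch, so the alternative is still tried
-- when a segment merely starts like "baz=".
path :: Parser String
path = string "baz=" <|> (many (noneOf "=/") <++> slashPath)

-- Assumed wrapper: the literal "foo" followed by the path up to "baz=".
prefix :: Parser String
prefix = string "foo" <++> slashPath
```

The decision between "segment" and "baz=" is made at each slash, so no backtracking over the repetition is needed. Note that this translation is slightly more permissive than the regex: it accepts foo/baz=, which ^foo/[^=]+/baz= rejects because it needs at least one character between the two slashes.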