I'm writing a parser for a Markdown-like document format. I'd like the grammar to match footnotes such as `^[some *formatted* text]`. Here is a minimal example:
{- cabal:
build-depends: base, text, megaparsec, parser-combinators, hspec, hspec-megaparsec
-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Test.Hspec
import Test.Hspec.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Text.Megaparsec.Char.Lexer qualified as L

type Parser = Parsec Void Text

data Words
  = PlainText Text
  | BoldText Text
  | MagicText [Words]
  deriving (Show, Eq)

text_ :: Parser Words
text_ =
  choice
    [
      MagicText <$> between (string "^[") (char ']') (manyTill (text_ <* optional space) (char ']')),
      BoldText <$> between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*')),
      PlainText <$> takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')
    ]

main :: IO ()
main = hspec $ do
  context "for basic one-word-at-a-time input" $ do
    it "parses plain text" $ parse text_ "" "hello" `shouldParse` PlainText "hello"
    it "parses bold text" $ parse text_ "" "*hello*" `shouldParse` BoldText "hello"
  context "parses nested \"MagicText\"" $ do
    it "on it's own with just one word inside" $
      parse text_ "" "^[hello]" `shouldParse` MagicText [PlainText "hello"]
    it "on it's own with bold text inside" $
      parse text_ "" "^[*hello*]" `shouldParse` MagicText [BoldText "hello"]
The last two test cases fail with the following errors:
~/sandbox > cabal run ParseBetween.hs

for basic one-word-at-a-time input
  parses plain text [✔]
  parses bold text [✔]
parses nested "MagicText"
  on it's own with just one word inside [✘]
  on it's own with bold text inside [✘]

Failures:

  /home/gideon/sandbox/ParseBetween.hs:43:33:
  1) parses nested "MagicText" on it's own with just one word inside
       expected: MagicText [PlainText "hello"]
       but parsing failed with error:
         1:9:
           |
         1 | ^[hello]
           |         ^
         unexpected end of input
         expecting "^[", '*', ']', plain text, or white space

  To rerun use: --match "/parses nested \"MagicText\"/on it's own with just one word inside/" --seed 100639639

  /home/gideon/sandbox/ParseBetween.hs:46:35:
  2) parses nested "MagicText" on it's own with bold text inside
       expected: MagicText [BoldText "hello"]
       but parsing failed with error:
         1:11:
           |
         1 | ^[*hello*]
           |           ^
         unexpected end of input
         expecting ']'

  To rerun use: --match "/parses nested \"MagicText\"/on it's own with bold text inside/" --seed 100639639
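From the definition of `manyTill_`, I expected it to match the closing `]` first and so never run into this "unexpected end of input" error, but I don't know how to get this kind of nested parsing to behave correctly.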
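From inspection alone I can't see what is wrong with your bold text example. But the problem with `"^[hello]"` is simple. You start parsing a `MagicText`, which consumes the opening `^[` and delegates back to `text_`, planning to consume the `]` afterwards. But the `PlainText` parser inside doesn't know that it is supposed to leave the `]` character alone. It happily consumes all the way to the end of the string, because it never meets one of its stop characters, `' '` or `'\n'`. Then it is done, and the `MagicText` above it is left unable to find its closing `]`.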
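The diagnosis above can be checked in isolation. What follows is a minimal sketch, not part of the original program: `plainGreedy`, `bold`, `twoClosers`, and `oneCloser` are hypothetical names introduced only for this experiment. It relies on megaparsec's documented behaviour that `manyTill p end` consumes `end` when it stops. That would also explain the bold failure at 1:11: `manyTill` takes the only `]`, and the closing parser passed to `between` then finds end of input.

```haskell
{- cabal:
build-depends: base, text, megaparsec
-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void Text

-- Same predicate as the question's PlainText branch: only ' ' and '\n' stop it.
plainGreedy :: Parser Text
plainGreedy = takeWhile1P (Just "plain text") (\c -> c /= ' ' && c /= '\n')

-- Bold word, as in the question's BoldText branch.
bold :: Parser Text
bold = between (char '*') (char '*') (takeWhile1P (Just "bold text") (/= '*'))

-- Hypothetical reduction of the MagicText branch: manyTill already consumes
-- the ']' it stops on, so the closing parser of `between` needs a second one.
twoClosers :: Parser [Text]
twoClosers = between (string "^[") (char ']') (manyTill bold (char ']'))

-- Same shape, but the terminator is consumed exactly once.
oneCloser :: Parser [Text]
oneCloser = string "^[" *> manyTill bold (char ']')

main :: IO ()
main = do
  parseTest plainGreedy "hello]"     -- the ']' ends up inside the result
  parseTest twoClosers "^[*hello*]"  -- fails: a second ']' is expected at end of input
  parseTest oneCloser "^[*hello*]"   -- succeeds with one bold word
```

If that reading is right, consuming the terminator only once is enough for the bold case, but the plain-text case still needs `PlainText` to stop at `]`.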
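The usual way to handle this kind of problem is to use a grammar that separates its concepts more explicitly and encodes them as a hierarchy. A `MagicText` does not contain "any text, whether magic, bold, or plain": it contains "bold or plain text". A `BoldText` does not contain "any text, whether magic, bold, or plain": it contains only plain text. And `PlainText` explicitly rejects the characters that act as delimiters or metacharacters for the levels above it. Roughly like this: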
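text_ :: Parser Words
text_ =
  choice
    [
      MagicText <$> between (string "^[") (char ']') (nonMagicText `sepBy1` space),
      nonMagicText
    ]

-- Bold or plain text: everything a MagicText may contain.
nonMagicText :: Parser Words
nonMagicText =
  choice
    [
      BoldText <$> between (char '*') (char '*') plainText,
      PlainText <$> plainText
    ]

-- Stops at every delimiter used by the levels above it.
-- The String annotation keeps the literal unambiguous under OverloadedStrings.
plainText :: Parser Text
plainText =
  takeWhile1P (Just "plaintext") (`notElem` ("*^[] \n" :: String))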
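
To try the layered grammar end to end, here is a self-contained sketch that wraps the definitions above in a runnable script and feeds them the inputs from the question. The `main` function and its sample inputs are additions for illustration. The `:: String` annotation is there on the assumption that `OverloadedStrings` stays enabled, as in the question.

```haskell
{- cabal:
build-depends: base, text, megaparsec
-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void Text

data Words
  = PlainText Text
  | BoldText Text
  | MagicText [Words]
  deriving (Show, Eq)

-- Top level: a footnote, or anything a footnote may contain.
text_ :: Parser Words
text_ =
  choice
    [ MagicText <$> between (string "^[") (char ']') (nonMagicText `sepBy1` space)
    , nonMagicText
    ]

-- Middle level: bold or plain, never another footnote.
nonMagicText :: Parser Words
nonMagicText =
  choice
    [ BoldText <$> between (char '*') (char '*') plainText
    , PlainText <$> plainText
    ]

-- Bottom level: refuses every character that delimits a level above it.
plainText :: Parser Text
plainText = takeWhile1P (Just "plain text") (`notElem` ("*^[] \n" :: String))

-- Illustrative driver: parseMaybe requires the whole input to be consumed.
main :: IO ()
main =
  mapM_
    (print . parseMaybe text_)
    [ "hello"                     -- Just (PlainText "hello")
    , "*hello*"                   -- Just (BoldText "hello")
    , "^[hello]"                  -- Just (MagicText [PlainText "hello"])
    , "^[*hello*]"                -- Just (MagicText [BoldText "hello"])
    , "^[some *formatted* text]"  -- three items: plain, bold, plain
    ]
```

One consequence of this split, as sketched, is that a bold span cannot contain spaces, because `plainText` rejects `' '`. Whether that is acceptable depends on the format being parsed.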