Parsing a string containing a list of tuples with Parsec


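I'm trying to parse (using parsec) a string that represents a data type I have defined, so the string needs to be parsed into that data type. An example of such a string is,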

[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]

which should parse to the following,

[(Reg 1, [Cmp (Jlt, Intv (0, 0)), Op (Mod, Intv (-4,4))]), (Reg 2, [Cmp (Jge, (4,4))])]

This uses a few custom data types,

newtype Reg = Reg Int deriving (Eq, Show, Ord)
data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
    deriving (Eq, Show, Ord)
data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)
data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)
data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)

So I want to parse the string into the following type

[(Reg, [LF])]

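Right now I have no real idea how to do this. I think I have an approach in mind, but I'm finding it hard to implement.

My idea is to start with between (symbol "[") (symbol "]"), which will hopefully give me whatever sits between [ and ]. Then I need to do something similar for the parentheses, but repeatedly. And then, of course, parse what is inside the parentheses.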

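I'm basically looking for any advice on how to set up this parser, and on how to structure a parser like this in general.

Any help is greatly appreciated!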

parsing haskell functional-programming parsec
1 Answer

The following should get you started. We need a few imports:

module TupleParser where
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String

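To handle whitespace properly, you should first write a few combinators for dealing with "lexemes", that is, parsers that expect to start on a non-whitespace character, parse something, and discard trailing whitespace. Parsec has some lexeme support in Text.Parsec.Token, but it is over-engineered and hard to use. Here is a simplified alternative based on Megaparsec's approach: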

-- a lexeme starts on non-whitespace, parses something,
-- and discards trailing whitespace
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- a symbol is a verbatim string, treated as a lexeme
symbol :: String -> Parser String
symbol s = lexeme (string s)
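To see the lexeme convention in action, here is a small sketch. The names runAll, pairOfBrackets, okSpaced and failsLeading are made up for this illustration; lexeme and symbol are the same as above.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- same definitions as above
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

-- hypothetical helper: run a parser that must consume the whole input
runAll :: Parser a -> String -> Either ParseError a
runAll p = parse (p <* eof) ""

-- two symbols in a row; each one skips the whitespace that follows it
pairOfBrackets :: Parser (String, String)
pairOfBrackets = (,) <$> symbol "[" <*> symbol "]"

-- whitespace after a token is consumed by the token itself
okSpaced :: Either ParseError (String, String)
okSpaced = runAll pairOfBrackets "[   ]  "

-- leading whitespace is not handled by any lexeme, so this fails
failsLeading :: Either ParseError (String, String)
failsLeading = runAll pairOfBrackets "  []"
```

In GHCi, okSpaced evaluates to Right ("[","]") while failsLeading is a Left, which is why only the top-most parser needs to skip leading whitespace.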

Here are some pretty standard lexemes for parsing numbers:

-- an unsigned decimal number, treated as a lexeme
decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)

-- combinator for signed numbers; replace "string" with
-- "symbol" if you want to allow space between dash and
-- first digit
signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p
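As a quick check of how signed and decimal combine, here is a sketch. The helper number is hypothetical, and the copy of decimal below wraps the digits in lexeme so that it skips trailing whitespace, as its comment describes.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- unsigned decimal number that also skips trailing whitespace
decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)

-- optional leading dash, with no space allowed after it
signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p

-- hypothetical helper: parse a whole string as one signed Int
number :: String -> Either ParseError Int
number = parse (signed decimal <* eof) ""
```

Here number "-42" gives Right (-42) and number "7  " gives Right 7, while number "- 7" is a Left because signed uses string rather than symbol for the dash.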

And a few other pretty standard lexemes/combinators:

-- some standard names
comma :: Parser String
comma = symbol ","

parens :: Parser p -> Parser p
parens = between (symbol "(") (symbol ")")

brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")

Here is a list helper, since you will use it in a couple of places.

-- a list is a bracket-delimited, comma-separated list
listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)
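Because listOf takes the element parser as an argument, it nests naturally. The sketch below assumes a hypothetical element parser int and two hypothetical helpers, flat and nested, purely for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

comma :: Parser String
comma = symbol ","

brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")

listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)

-- hypothetical element parser: an unsigned Int as a lexeme
int :: Parser Int
int = lexeme (read <$> many1 digit)

-- hypothetical helpers: a flat list and a list of lists
flat :: String -> Either ParseError [Int]
flat = parse (listOf int <* eof) ""

nested :: String -> Either ParseError [[Int]]
nested = parse (listOf (listOf int) <* eof) ""
```

With these definitions, flat "[1, 2 ,3 ]" gives Right [1,2,3], flat "[]" gives Right [], and nested "[[1,2],[]]" gives Right [[1,2],[]]. A trailing comma as in flat "[1,]" is rejected, because sepBy expects another element after each separator.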

Now we should define the lowest-level "atoms" of the grammar:

-- (insert your data types here)

reg :: Parser Reg
reg = Reg <$> decimal

lf :: Parser LF
lf = parens
  $   Op <$> ((,) <$> binalu <* comma <*> interval)
  <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
  <|> Invalid <$ symbol "???"

-- I don't really understand your interval syntax, so
-- I'm just parsing any number "n" into "Intv (n,n)"
interval :: Parser Interval
interval = (\x -> Intv (x,x)) <$> signed decimal

For binalu and jcmp, a simple first attempt might look like this:

binalu :: Parser BinAlu
binalu
  =   Mod <$ symbol "%"
  -- etc.

jcmp :: Parser Jcmp
jcmp
  =   Jlt <$ symbol "<"
  <|> Jge <$ symbol ">="
  -- etc.

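This is enough to parse your example input. However, problems show up once you flesh these parsers out with all the operators you need. For example, the parser symbol "<" will happily parse the first character of "<=", and the "=" it leaves behind causes an error when you next try to parse the comma. If you order the alternatives so that "<=" is tried first: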

jcmp :: Parser Jcmp
jcmp
  =   Jle <$ symbol "<="
  <|> Jlt <$ symbol "<"
  -- etc.

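this still isn't enough, because symbol "<=" will happily start parsing a "<" that is not followed by "=" and then "fail after consuming input", which prevents any later alternatives from being tried. You can press on anyway using the try combinator: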

jcmp :: Parser Jcmp
jcmp
  =   try (Jle <$ symbol "<=")
  <|> Jlt <$ symbol "<"
  -- etc.
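The three stages can be compared side by side. This is only a sketch: the Cmp type and the parsers naive, reordered and withTry are hypothetical stand-ins for the jcmp variants, each followed by the comma that comes next in the real grammar.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- hypothetical two-operator stand-in for Jcmp
data Cmp = Lt | Le deriving (Eq, Show)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

comma :: Parser String
comma = symbol ","

-- "<" first: on "<=," it stops after "<" and the comma parser sees "="
naive :: Parser Cmp
naive = (Lt <$ symbol "<" <|> Le <$ symbol "<=") <* comma

-- "<=" first, no try: on "<," it consumes "<" and then fails for good
reordered :: Parser Cmp
reordered = (Le <$ symbol "<=" <|> Lt <$ symbol "<") <* comma

-- "<=" first, wrapped in try: a failed "<=" backtracks to the start
withTry :: Parser Cmp
withTry = (try (Le <$ symbol "<=") <|> Lt <$ symbol "<") <* comma

-- hypothetical helper to run any of the three variants
run :: Parser Cmp -> String -> Either ParseError Cmp
run p = parse p ""
```

Here run naive "<=," and run reordered "<," are both Left, whereas run withTry accepts both inputs, giving Right Le and Right Lt respectively.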

But this gets tedious. The usual solution is to define a list of "operator characters":

-- include every character that appears in one of your operators
opChars :: String
opChars = "+-*/|&<=>%^!"

and define an operator combinator (note: Parsec calls this combinator reservedOp), which parses an operator that is followed by something other than an operator character:

operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))
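A small sketch of what notFollowedBy buys you: an operator only matches when it is the whole run of operator characters. The Cmp type, the cmp parser and the run helper are hypothetical and exist only for this illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

opChars :: String
opChars = "+-*/|&<=>%^!"

operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))

-- hypothetical four-operator stand-in for Jcmp
data Cmp = Lt | Le | Gt | Ge deriving (Eq, Show)

-- shorter operators are deliberately listed before longer ones
cmp :: Parser Cmp
cmp =   Lt <$ operator "<"
    <|> Le <$ operator "<="
    <|> Gt <$ operator ">"
    <|> Ge <$ operator ">="

-- hypothetical helper: the operator must be the entire input
run :: String -> Either ParseError Cmp
run = parse (cmp <* eof) ""
```

With this, run "<" gives Right Lt and run "<= " gives Right Le even though "<" is tried first, while run "<==" is a Left because no listed operator matches the full run of operator characters.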

Now you can list your operators in any order and they will work properly:

jcmp :: Parser Jcmp
jcmp
  =   Jle <$ operator "<="
  <|> Jlt <$ operator "<"
  <|> Jgt <$ operator ">"
  <|> Jge <$ operator ">="
  -- etc.
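If you would rather use Parsec's own reservedOp, a rough equivalent can be built with Text.Parsec.Token. This is a sketch under assumptions: the names lexer, Cmp, cmp and run are made up here, and the language definition is just emptyDef with opLetter set to the same operator characters.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Text.Parsec.Language (emptyDef)
import qualified Text.Parsec.Token as P

-- token parser whose operator letters match opChars from above
lexer :: P.TokenParser ()
lexer = P.makeTokenParser emptyDef { P.opLetter = oneOf "+-*/|&<=>%^!" }

-- hypothetical two-operator stand-in for Jcmp
data Cmp = Lt | Le deriving (Eq, Show)

-- reservedOp returns (), so <$ supplies the constructor
cmp :: Parser Cmp
cmp =   Lt <$ P.reservedOp lexer "<"
    <|> Le <$ P.reservedOp lexer "<="

-- hypothetical helper: the operator must be the entire input
run :: String -> Either ParseError Cmp
run = parse (cmp <* eof) ""
```

As with the hand-written operator, run "<" gives Right Lt and run "<=" gives Right Le regardless of the order of the alternatives.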

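Finally, we can define the grammar for your higher-level structures. Note that the top-most parser should skip leading whitespace, since all lexeme parsers expect to start on non-whitespace, and should check for end of input.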

type Program = [Statement]
type Statement = (Reg, [LF])

program :: Parser Program
program = spaces *> listOf statement <* eof

statement :: Parser Statement
statement = parens $ (,) <$> reg <* comma <*> listOf lf

Here is a test on your proposed input:

main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"

which should produce the output:

[(Reg 1,[Cmp (Jlt,Intv (0,0)),Op (Mod,Intv (4,4))]),(Reg 2,[Cmp (Jge,Intv (4,4))])]
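parseTest only prints the result. To use the result in code, parse returns an Either instead. The sketch below shows the same top-level pattern (skip leading whitespace, then require end of input) on a deliberately tiny grammar; numbers, topLevel and parseNumbers are hypothetical names, and "<input>" is just a label that appears in error messages.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

-- hypothetical stand-in for the full grammar: a bracketed list of naturals
numbers :: Parser [Int]
numbers = between (symbol "[") (symbol "]")
                  (lexeme (read <$> many1 digit) `sepBy` symbol ",")

-- skip leading whitespace, run the parser, insist on end of input
topLevel :: Parser a -> Parser a
topLevel p = spaces *> p <* eof

-- turn the ParseError into a plain String for the caller
parseNumbers :: String -> Either String [Int]
parseNumbers input =
  case parse (topLevel numbers) "<input>" input of
    Left err -> Left (show err)
    Right xs -> Right xs
```

Here parseNumbers "  [1, 2]" gives Right [1,2], while parseNumbers "[1] junk" is a Left whose message points at the unexpected trailing input.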

Full code:

module TupleParser where

import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

-- characters appearing in operators
opChars :: String
opChars = "+-*/|&<=>%^!"

-- parse an operator
operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))

decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)

signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p

comma :: Parser String
comma = symbol ","

parens :: Parser p -> Parser p
parens = between (symbol "(") (symbol ")")

brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")

listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)

newtype Reg = Reg Int deriving (Eq, Show, Ord)

data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
    deriving (Eq, Show, Ord)

data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)

data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)

data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)

reg :: Parser Reg
reg = Reg <$> decimal

lf :: Parser LF
lf = parens
  $   Op <$> ((,) <$> binalu <* comma <*> interval)
  <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
  <|> Invalid <$ symbol "???"

binalu :: Parser BinAlu
binalu
  =   Mod <$ operator "%"
  -- etc.

jcmp :: Parser Jcmp
jcmp
  =   Jlt <$ operator "<"
  <|> Jge <$ operator ">="
  -- etc.

-- I don't really understand your interval syntax, so
-- I'm just parsing any number "n" into "Intv (n,n)"
interval :: Parser Interval
interval = (\x -> Intv (x,x)) <$> signed decimal

type Program = [Statement]
type Statement = (Reg, [LF])

program :: Parser Program
program = spaces *> listOf statement <* eof

statement :: Parser Statement
statement = parens $ (,) <$> reg <* comma <*> listOf lf

main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"