如何在OCaml lexer中预先形成'lookahead'/如何回滚lexeme？

Question

好吧，我正在编写我的第一个解析器，在OCaml中，我立刻以某种方式设法制作一个无限循环。

特别值得注意的是，我正在尝试根据Scheme规范的规则来识别标识符（我不知道我在做什么，显然） - 并且那里有一些关于标识符的语言要求它们后面跟着一个分隔符。我的方法，现在，是有一个delimited_identifier正则表达式，其中包括一个delimiter字符，不应该由主词法分子消耗...然后一旦匹配，那个词的读数被Sedlexing.rollback（嗯，my wrapper thereof）还原），在传递给只吃实际标识符的sublexer之前，希望将分隔符留在缓冲区中，由父词法分子作为不同的词汇进行处理。

我正在使用Menhir和Sedlex，主要是从@smolkaj的ocaml-parsing example-repo和RWO's parsing chapter合成的例子;这是我目前的parser和lexer最简单的减少：

%token LPAR RPAR LVEC APOS TICK COMMA COMMA_AT DQUO SEMI EOF
%token <string> IDENTIFIER
(* %token <bool> BOOL *)
(* %token <int> NUM10 *)
(* %token <string> STREL *)

%start <Parser.AST.t> program

%%

program:
  | p = list(expression); EOF { p }
  ;

expression:
  | i = IDENTIFIER { Parser.AST.Atom i }

%%

......而且......

(** Regular expressions *)
let newline = [%sedlex.regexp? '\r' | '\n' | "\r\n" ]
let whitespace = [%sedlex.regexp? ' ' | newline ]
let delimiter = [%sedlex.regexp? eof | whitespace | '(' | ')' | '"' | ';' ]

let digit = [%sedlex.regexp? '0'..'9']
let letter = [%sedlex.regexp? 'A'..'Z' | 'a'..'z']

let special_initial = [%sedlex.regexp?
   '!' | '$' | '%' | '&' | '*' | '/' | ':' | '<' | '=' | '>' | '?' | '^' | '_' | '~' ]
let initial = [%sedlex.regexp? letter | special_initial ]

let special_subsequent = [%sedlex.regexp? '+' | '-' | '.' | '@' ]
let subsequent = [%sedlex.regexp? initial | digit | special_subsequent ]

let peculiar_identifier = [%sedlex.regexp? '+' | '-' | "..." ]
let identifier = [%sedlex.regexp? initial, Star subsequent | peculiar_identifier ]
let delimited_identifier = [%sedlex.regexp? identifier, delimiter ]


(** Swallow whitespace and comments. *)
let rec swallow_atmosphere buf =
   match%sedlex buf with
   | Plus whitespace -> swallow_atmosphere buf
   | ";" -> swallow_comment buf
   | _ -> ()

and swallow_comment buf =
   match%sedlex buf with
   | newline -> swallow_atmosphere buf
   | any -> swallow_comment buf
   | _ -> assert false

(** Return the next token. *)
let rec token buf =
   swallow_atmosphere buf;
   match%sedlex buf with
   | eof -> EOF

   | delimited_identifier ->
     Sedlexing.rollback buf;
     identifier buf

   | '(' -> LPAR
   | ')' -> RPAR

   | _ -> illegal buf (Char.chr (next buf))

and identifier buf =
   match%sedlex buf with
   | _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)

（是的，这基本上是一个无操作/最简单的事情。我正在努力学习！:x）

不幸的是，这种组合导致解析自动机中的无限循环：

State 0:
Lookahead token is now IDENTIFIER (1-1)
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER 
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
Lookahead token is now IDENTIFIER (1-1)
Reducing production expression -> IDENTIFIER 
State 5:
Shifting (IDENTIFIER) to state 1
State 1:
...

我刚接触解析和lexing以及所有这些;任何的建议都受欢迎。我确定这只是一个新手的错误，但......

谢谢！

Answer 1

如前所述，在词法分析器中实现太多逻辑是一个坏主意。但是，无限循环不是来自回滚，而是来自identifier的定义：

 identifier buf =
   match%sedlex buf with
   | _ -> IDENTIFIER (Sedlexing.Utf8.lexeme buf)

在这个定义中，_匹配由所有可能的字符组成的语言中最短的单词。换句话说，_总是匹配空字μ而不消耗其输入的任何部分，将解析器发送到无限循环。

如何在OCaml lexer中预先形成'lookahead'/如何回滚lexeme？

问题描述投票：1回答：1

1个回答

最新问题

如何在OCaml lexer中预先形成'lookahead'/如何回滚lexeme？

问题描述 投票：1回答：1

1个回答

最新问题

问题描述投票：1回答：1