Why does my lexer seem to ignore newlines?


I wrote the following lexer in Java.

Token.java looks like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public enum Token {

    TK_MINUS ("-"), 
    TK_PLUS ("\\+"), 
    TK_MUL ("\\*"), 
    TK_DIV ("/"), 
    TK_NOT ("~"), 
    TK_AND ("&"),  
    TK_OR ("\\|"),  
    TK_LESS ("<"),
    TK_LEG ("<="),
    TK_GT (">"),
    TK_GEQ (">="), 
    TK_EQ ("=="),
    TK_ASSIGN ("="),
    TK_OPEN ("\\("),
    TK_CLOSE ("\\)"), 
    TK_SEMI (";"), 
    TK_COMMA (","), 
    TK_KEY_DEFINE ("define"), 
    TK_KEY_AS ("as"),
    TK_KEY_IS ("is"),
    TK_KEY_IF ("if"), 
    TK_KEY_THEN ("then"), 
    TK_KEY_ELSE ("else"), 
    TK_KEY_ENDIF ("endif"),
    OPEN_BRACKET ("\\{"),
    CLOSE_BRACKET ("\\}"),
    

    STRING ("\"[^\"]+\""),
    TK_FLOAT ("[+-]?([0-9]*[.])?[0-9]+"),
    TK_DECIMAL("(?:0|[1-9](?:_*[0-9])*)[lL]?"),
    TK_OCTAL("0[0-7](?:_*[0-7])*[lL]?"),
    TK_HEXADECIMAL("0x[a-fA-F0-9](?:_*[a-fA-F0-9])*[lL]?"),
    TK_BINARY("0[bB][01](?:_*[01])*[lL]?"),
    IDENTIFIER ("\\w+");
   
    private final Pattern pattern;

    Token(String regex) {
        pattern = Pattern.compile("^" + regex);
    }

    int endOfMatch(String s) {
        Matcher m = pattern.matcher(s);

        if (m.find()) {
            return m.end();
        }
        return -1;
    }
}

The Lexer class, in Lexer.java, looks like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class Lexer {
    private StringBuilder input = new StringBuilder();
    private Token token;
    private String lexema;
    private boolean exausthed = false;
    private String errorMessage = "";
    private Set<Character> blankChars = new HashSet<Character>();

    public Lexer(String filePath) {
        try (Stream<String> st = Files.lines(Paths.get(filePath))) {
            st.forEach(input::append);
        } catch (IOException ex) {
            exausthed = true;
            errorMessage = "Could not read file: " + filePath;
            return;
        }

        blankChars.add('\r');
        blankChars.add('\n');
        blankChars.add((char) 8);
        blankChars.add((char) 9);
        blankChars.add((char) 11);
        blankChars.add((char) 12);
        blankChars.add((char) 32);

        moveAhead();
    }

    public void moveAhead() {
        if (exausthed) {
            return;
        }

        if (input.length() == 0) {
            exausthed = true;
            return;
        }

        ignoreWhiteSpaces();

        if (findNextToken()) {
            return;
        }

        exausthed = true;

        if (input.length() > 0) {
            errorMessage = "Unexpected symbol: '" + input.charAt(0) + "'";
        }
    }

    private void ignoreWhiteSpaces() {
        int charsToDelete = 0;

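        // Stop at the end of the input so trailing whitespace cannot index past it.
        while (charsToDelete < input.length()
                && blankChars.contains(input.charAt(charsToDelete))) {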
            charsToDelete++;
        }

        if (charsToDelete > 0) {
            input.delete(0, charsToDelete);
        }
    }

    private boolean findNextToken() {
        for (Token t : Token.values()) {
            int end = t.endOfMatch(input.toString());

            if (end != -1) {
                token = t;
                lexema = input.substring(0, end);
                input.delete(0, end);
                return true;
            }
        }

        return false;
    }

    public Token currentToken() {
        return token;
    }

    public String currentLexema() {
        return lexema;
    }

    public boolean isSuccessful() {
        return errorMessage.isEmpty();
    }

    public String errorMessage() {
        return errorMessage;
    }

    public boolean isExausthed() {
        return exausthed;
    }
}

I created a class, Try.java, that can be used to test the lexer:

package draft;

public class Try {

    public static void main(String[] args) {

        Lexer lexer = new Lexer("C:/Users/eimom/Documents/Input.txt");

        System.out.println("Lexical Analysis");
        System.out.println("-----------------");
        while (!lexer.isExausthed()) {
            System.out.printf("%-18s :  %s \n",lexer.currentLexema() , lexer.currentToken());
            lexer.moveAhead();
        }

        if (lexer.isSuccessful()) {
            System.out.println("Ok! :D");
        } else {
            System.out.println(lexer.errorMessage());
        }
    }
}

Suppose the file Input.txt contains:

>= 
 0x10
 ()
11001100
 -433
 0125
 0x3B

Then the output I expect is:

>=  TK_GEQ
 0x10  TK_HEXADECIMAL
 ( TK_OPEN ,
  )  TK_CLOSE 
11001100 TK_BINARY
 -433 TK_DECIMAL
 0125 TK_OCTAL
 0x3B TK_BINARY

Instead I get:

Lexical Analysis
------------------

>                   :TK_GT
=                   :TK_ASSIGN
0                   :TK_FLOAT 
x10                 :IDENTIFIER
(                   :TK_OPEN
)                   :TK_CLOSE
11001100            :TK_FLOAT
-                   :TK_MINUS
43301250            :TK_FLOAT
x3B                 :IDENTIFIER

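What can I do to correct this? It looks as if the code does not stop at the end of a line, but carries on and consumes the next characters from the following line.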

java lexical-analysis
1 Answer

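The problem is that Files.lines returns a stream with one string per line, without the line terminators. So when you append all of them to input, the string content you actually end up with is

>= 0x10 ()11001100 -433 0125 0x3B

(that is, with no line breaks).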
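The effect is easy to reproduce in isolation. The following is a minimal sketch, assuming Java 11 or later (for Files.writeString); the class name FilesLinesDemo, the helper joinLines, and the three-line sample input are made up for illustration. It joins the lines the same way the question's constructor does and shows that the digits from separate lines run together, which matches the 43301250 lexeme in the question's output.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class FilesLinesDemo {

    // Same approach as the question's Lexer constructor: append each line as-is.
    static String joinLines(Path file) throws IOException {
        StringBuilder input = new StringBuilder();
        try (Stream<String> lines = Files.lines(file)) {
            lines.forEach(input::append);
        }
        return input.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical three-line input, written to a temporary file for the demo.
        Path file = Files.createTempFile("lexer-input", ".txt");
        Files.writeString(file, "-433\n0125\n0x3B\n");

        String joined = joinLines(file);
        System.out.println(joined);                // prints -43301250x3B
        System.out.println(joined.contains("\n")); // prints false

        Files.delete(file);
    }
}
```

Because no separator is left between the lines, the lexer has no way to tell where -433 ends and 0125 begins.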

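Instead, use Files.readString to read the whole file at once, or use a Reader, so that you can read character by character (useful if you need to read very large files without holding the whole file in memory, or to read from sources other than a file).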
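Both alternatives could look roughly like this. This is a sketch rather than a drop-in replacement for the question's constructor: it assumes Java 11 or later (Files.readString and Files.writeString), and the class name KeepNewlinesDemo and the helpers readWhole and countLineFeeds are invented for the example. Counting line feeds stands in for whatever per-character processing a lexer would do.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;

final class KeepNewlinesDemo {

    // Option 1: read the whole file in one call; line terminators are preserved.
    static String readWhole(Path file) throws IOException {
        return Files.readString(file);
    }

    // Option 2: consume one character at a time from any Reader,
    // without ever holding the whole input in memory.
    static int countLineFeeds(Reader reader) throws IOException {
        int count = 0;
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical three-line input, written to a temporary file for the demo.
        Path file = Files.createTempFile("lexer-input", ".txt");
        Files.writeString(file, "-433\n0125\n0x3B\n");

        System.out.println(readWhole(file).contains("\n")); // prints true

        try (Reader reader = Files.newBufferedReader(file)) {
            System.out.println(countLineFeeds(reader));     // prints 3
        }

        Files.delete(file);
    }
}
```

With the line terminators kept, the '\r' and '\n' entries in blankChars finally have something to match, so tokens on different lines stay separate. Once the input can end in whitespace (a trailing newline, for instance), the whitespace-skipping loop also has to stop at the end of the input.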
