Scanner.findAll() and Matcher.results() behave differently for the same input text and pattern


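I came across this interesting behavior while splitting a properties string with a regex, and I cannot find the root cause.

I have a string containing text in the form of properties key=value pairs. I have a regex that splits the string into keys and values based on the position of =. It treats the first = as the split point. The value may also contain =.

I tried two different approaches in Java.

  1. Using the Scanner.findAll() method

    This does not behave as expected. It should extract and print all keys according to the pattern, but it behaves strangely. I have a key/value pair like this:

    SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important .....}

The key that should be extracted is SectionError.ErrorMessage=, but errorlevel= is also treated as a key.

The interesting point is that if I remove a single character from the properties String that is passed in, it behaves correctly and extracts only the SectionError.ErrorMessage= key.

  2. Using the Matcher.results() method

    This works fine, no matter what we put into the properties string.

The sample code I tried: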

import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

import static java.util.regex.Pattern.MULTILINE;

public class MessageSplitTest {

    static final Pattern pattern = Pattern.compile("^[a-zA-Z0-9._]+=", MULTILINE);

    public static void main(String[] args) {
        final String properties =
                "SectionOne.KeyOne=first value\n" + // removing one char from here would make the scanner method print expected keys
                        "SectionOne.KeyTwo=second value\n" +
                        "SectionTwo.UUIDOne=379d827d-cf54-4a41-a3f7-1ca71568a0fa\n" +
                        "SectionTwo.UUIDTwo=384eef1f-b579-4913-a40c-2ba22c96edf0\n" +
                        "SectionTwo.UUIDThree=c10f1bb7-d984-422f-81ef-254023e32e5c\n" +
                        "SectionTwo.KeyFive=hello-world-sample\n" +
                        "SectionThree.KeyOne=first value\n" +
                        "SectionThree.KeyTwo=second value additional text just to increase the length of the text in this value still not enough adding more strings here n there\n" +
                        "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important message This is very important messageThis is very important message This is very important message This is very important message This is very important message This is very important message}\n" +
                        "SectionFour.KeyOne=sixth value\n" +
                        "SectionLast.KeyOne=Country";

        printKeyValuesFromPropertiesUsingScanner(properties);
        System.out.println();
        printKeyValuesFromPropertiesUsingMatcher(properties);
    }

    private static void printKeyValuesFromPropertiesUsingScanner(String properties) {
        System.out.println("===Using Scanner===");
        try (Scanner scanner = new Scanner(properties)) {
            scanner
                    .findAll(pattern)
                    .map(MatchResult::group)
                    .forEach(System.out::println);
        }
    }

    private static void printKeyValuesFromPropertiesUsingMatcher(String properties) {
        System.out.println("===Using Matcher===");
        pattern.matcher(properties).results()
                .map(MatchResult::group)
                .forEach(System.out::println);

    }
}

Output:

===Using Scanner===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
errorlevel=
SectionFour.KeyOne=
SectionLast.KeyOne=

===Using Matcher===
SectionOne.KeyOne=
SectionOne.KeyTwo=
SectionTwo.UUIDOne=
SectionTwo.UUIDTwo=
SectionTwo.UUIDThree=
SectionTwo.KeyFive=
SectionThree.KeyOne=
SectionThree.KeyTwo=
SectionError.ErrorMessage=
SectionFour.KeyOne=
SectionLast.KeyOne=

What could be the root cause? Does Scanner's findAll work differently from Matcher?

Please let me know if more information is needed.

java regex pattern-matching java.util.scanner java-9
2 answers
2 votes

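Scanner's documentation mentions the word "buffer" a lot. This suggests that Scanner does not know the entire string it is reading from and only holds a small part of it in a buffer at a time. That makes sense: Scanner is also designed to read from streams, and reading everything from a stream could take a long time (or forever!) and use a lot of memory.

In Scanner's source code there is indeed a CharBuffer: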

// Internal buffer used to hold input
private CharBuffer buf;

Because of the length and content of the string, the Scanner decided to load everything up to...

SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:This is very...
                          ^
                    somewhere here
(It could be anywhere in the word "errorlevel")

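...into the buffer. Then, after that part of the string has been read, the rest of the string starts like this:

errorlevel=Warning {HelpMessage:This is very...

errorlevel= is now at the start of the string, which causes the pattern to match.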

Related Bug?

Matcher does not use a buffer. It stores the entire string it is matching against in a field:

/**
 * The original string being matched.
 */
CharSequence text;

Hence this behavior is not observed with Matcher.
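
The effect of losing the preceding text can be reproduced without Scanner by handing Matcher only the tail of the input, which is roughly what the pattern sees if a buffer boundary falls right before errorlevel. This is a minimal sketch: the split offset is picked by hand for illustration and is not the offset Scanner actually uses, and the class and method names are made up for this example.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FragmentAnchorDemo {
    // Same pattern as in the question.
    static final Pattern PATTERN =
            Pattern.compile("^[a-zA-Z0-9._]+=", Pattern.MULTILINE);

    // Collects every key the pattern finds in the given text.
    static List<String> keys(CharSequence text) {
        return PATTERN.matcher(text).results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String full = "SectionError.ErrorMessage=errorlevel=Warning {HelpMessage:...}\n"
                + "SectionFour.KeyOne=sixth value";

        // Hand-picked split point (an assumption for illustration only).
        String tail = full.substring(full.indexOf("errorlevel"));

        // With the whole text, ^ matches only at the input start and after '\n'.
        System.out.println(keys(full)); // [SectionError.ErrorMessage=, SectionFour.KeyOne=]

        // With only the tail, "errorlevel=" sits at the input start, so ^ matches it.
        System.out.println(keys(tail)); // [errorlevel=, SectionFour.KeyOne=]
    }
}
```

The pattern itself is the same in both calls; only the amount of preceding text visible to the regex engine differs.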


2 votes

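Sweeper's answer is correct: the problem is that the Scanner's buffer does not contain the entire string. We can simplify the example to trigger the issue specifically: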

import java.util.Scanner;
import java.util.StringJoiner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// The class name is arbitrary; it is only here so the snippet compiles on its own.
public class ScannerBufferTest {
    static final Pattern pattern = Pattern.compile("^ABC.", Pattern.MULTILINE);

    public static void main(String[] args) {
        String testString = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";
        // Pad so that the first "ABC4" (the false positive) starts at index 1024.
        String properties = "X".repeat(1024 - testString.indexOf("ABC4")) + testString;

        String s1 = usingScanner(properties);
        System.out.println("Using Scanner: "+s1);
        String m = usingMatcher(properties);
        System.out.println("Using Matcher: "+m);

        if(!s1.equals(m)) System.out.println("mismatch");
        if(s1.equals(usingScannerNoStream(properties)))
            System.out.println("Not a stream issue");
    }

    // Stream-based variant, as in the question.
    private static String usingScanner(String source) {
        return new Scanner(source)
            .findAll(pattern)
            .map(MatchResult::group)
            .collect(Collectors.joining(" + "));
    }

    // Same search as a manual loop, without the Stream API.
    private static String usingScannerNoStream(String source) {
        Scanner s = new Scanner(source);
        StringJoiner sj = new StringJoiner(" + ");
        for(;;) {
            String match = s.findWithinHorizon(pattern, 0);
            if(match == null) return sj.toString();
            sj.add(match);
        }
    }

    // Reference result: Matcher sees the whole string at once.
    private static String usingMatcher(String source) {
        return pattern.matcher(source).results()
            .map(MatchResult::group)
            .collect(Collectors.joining(" + "));
    }
}

Prints:

Using Scanner: ABC1 + ABC3 + ABC4 + ABC4
Using Matcher: ABC1 + ABC3 + ABC4
mismatch
Not a stream issue

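This example prefixes the input with X characters so that the start of the false-positive match is aligned with the buffer size. The Scanner's initial buffer size is 1024, though it may be enlarged when needed.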
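
The padding arithmetic can be checked in isolation. The sketch below only verifies that the first ABC4 ends up at index 1024. The value 1024 is taken from the statement above about the initial buffer size and is treated as an assumption here; the code does not inspect the JDK. The class and method names are hypothetical.

```java
class PaddingAlignmentDemo {
    // Assumed initial Scanner buffer size, as stated in the answer; not read from the JDK.
    static final int ASSUMED_BUFFER_SIZE = 1024;

    // Prefixes the text with 'X' so that the first occurrence of marker
    // starts exactly at index ASSUMED_BUFFER_SIZE.
    static String padToBoundary(String testString, String marker) {
        int offset = testString.indexOf(marker);
        return "X".repeat(ASSUMED_BUFFER_SIZE - offset) + testString;
    }

    public static void main(String[] args) {
        String testString = "\nABC1\nXYZ ABC2\nABC3ABC4\nABC4";
        String padded = padToBoundary(testString, "ABC4");

        // The first "ABC4" (the one directly after "ABC3") now starts at 1024.
        System.out.println(padded.indexOf("ABC4")); // prints 1024
    }
}
```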

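Since findAll ignores the Scanner's delimiters, just like findWithinHorizon, this code also shows that a manual loop with findWithinHorizon exhibits the same behavior. In other words, this is not an issue of the Stream API used.

Since Scanner will enlarge its buffer when needed, we can work around the problem by first performing a match operation that forces the entire content to be read into the buffer before the intended match operation, e.g.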

private static String usingScanner(String source) {
    Scanner s = new Scanner(source);
    s.useDelimiter("(?s).*").hasNext();
    return s
        .findAll(pattern)
        .map(MatchResult::group)
        .collect(Collectors.joining(" + "));
}

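This particular hasNext(), with a delimiter that consumes the whole string, forces the string to be buffered completely without advancing the position. The subsequent findAll() operation ignores both the delimiter and the result of the hasNext() check, but no longer suffers from the problem because the buffer is now completely filled.

Of course, this defeats Scanner's advantage when parsing an actual stream.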
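
If the input really is a stream and every key starts a line, one possible alternative is to read line by line and anchor the pattern per line, so that a match can never start at an arbitrary buffer boundary. This is a sketch of a different technique from the workaround above, not something the answer proposes. It assumes line-oriented input where each line fits in memory, and the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class LineBasedKeyReader {
    // No MULTILINE flag and no ^ needed: each line is matched on its own.
    static final Pattern KEY = Pattern.compile("[a-zA-Z0-9._]+=");

    // Reads the source line by line and returns the key found at the start
    // of each line, if any. The caller remains responsible for closing source.
    static List<String> readKeys(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
                .map(KEY::matcher)
                // lookingAt() only succeeds if the match begins at the line start.
                .filter(Matcher::lookingAt)
                .map(Matcher::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String properties = "SectionError.ErrorMessage=errorlevel=Warning\n"
                + "SectionFour.KeyOne=sixth value";
        System.out.println(readKeys(new StringReader(properties)));
        // [SectionError.ErrorMessage=, SectionFour.KeyOne=]
    }
}
```

Only one line is held at a time, so this keeps the streaming benefit, at the cost of handling only patterns that never need to look across a line break.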
