Stanford NLP pipeline.annotate causes OutOfMemoryError in Java

Question

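We are using Stanford NLP to annotate input texts, and those input texts are ridiculously small. Here is one such example.

"Can you please give me the details and identifiers for Mohammad Siva John? 6745-3876-1354-8790 and 313-31-333"

Below is the Java code snippet that does the annotation.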

    final Properties properties = new Properties();
    properties.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    final StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    final Annotation document = new Annotation(text);
    pipeline.annotate(document);

Below is the Maven dependency.

    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>4.5.4</version>
    </dependency>

This works fine, but after a few days the JVM crashes with a core dump. Analysis of the core dump shows that the following line causes the OutOfMemoryError:

    pipeline.annotate(document);

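Any ideas on how to solve this? The class has no field-level variables; all variables are method-level, so they should be "released" once execution completes. An OutOfMemoryError should therefore not occur in the first place.

Quite baffling. Any thoughts?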

Answer 1

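I found this in the StanfordCoreNLP Javadoc, for void edu.stanford.nlp.pipeline.StanfordCoreNLP.clearAnnotatorPool():

"Call this if you are no longer using StanfordCoreNLP and want to release the memory associated with the annotators."

Calling this method by itself had little effect; the real game changer was calling System.gc(). The fix is below: simply call clearAnnotatorPool and gc after annotate.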

    final Properties properties = new Properties();
    properties.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    final StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);
    final Annotation document = new Annotation(text);
    pipeline.annotate(document);

    // Below two calls would fix memory issue.
    StanfordCoreNLP.clearAnnotatorPool();
    System.gc();

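This is the memory usage with only the StanfordCoreNLP.clearAnnotatorPool() call, that is, without the System.gc() call. For 1000 annotate calls in a loop, used memory climbs above 2500 MB and then drops below 500 MB. This is what happens when we leave it to the JVM to invoke gc.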

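However, when I call both StanfordCoreNLP.clearAnnotatorPool(); and System.gc(); the results are drastically different. Used memory stays in the range of 132 to 133 MB regardless of whether the annotator pool has been cleared.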

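After about 150 hits to pipeline.annotate, one starts to see the real effect of StanfordCoreNLP.clearAnnotatorPool(). Below is the memory utilization observed over more than 1000 hits to pipeline.annotate. With StanfordCoreNLP.clearAnnotatorPool() and System.gc(), memory utilization hovers just above 40 MB.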
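For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of how used heap can be sampled around a loop, with and without an explicit System.gc(). It uses only the JDK. The class name HeapUsageProbe and the workload are hypothetical stand-ins: the workload just allocates short-lived garbage and would have to be replaced by the real pipeline.annotate call. It is not the exact harness behind the numbers above.

```java
import java.util.ArrayList;
import java.util.List;

/** Samples used heap around a repeated workload; the workload is a stand-in, not CoreNLP. */
final class HeapUsageProbe {

    /** Used heap in MB as the JVM reports it right now (includes garbage not yet collected). */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    /**
     * Runs the workload the given number of times and records used heap after each run.
     * When forceGc is true, System.gc() is requested before sampling; the JVM may ignore
     * the request (for example under -XX:+DisableExplicitGC).
     */
    static long[] sample(Runnable workload, int iterations, boolean forceGc) {
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            workload.run();
            if (forceGc) {
                System.gc();
            }
            samples[i] = usedHeapMb();
        }
        return samples;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for pipeline.annotate(document): allocate short-lived garbage.
        Runnable workload = () -> {
            List<byte[]> scratch = new ArrayList<>();
            for (int i = 0; i < 16; i++) {
                scratch.add(new byte[256 * 1024]);
            }
        };
        long[] lazy = sample(workload, 200, false);
        long[] forced = sample(workload, 200, true);
        System.out.println("last sample without System.gc(): " + lazy[lazy.length - 1] + " MB");
        System.out.println("last sample with System.gc():    " + forced[forced.length - 1] + " MB");
    }
}
```

Note that the figure without System.gc() includes objects that are already unreachable but not yet collected, so a high reading there does not by itself prove a leak.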

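If I understand this correctly, it just means that the JVM's default gc runs somehow do not free as much memory as an explicit call does. I know there will be timing differences and so on; it only means further study is needed into the default gc and the JDK in use (I am on JDK 21 with the IDE enforcing JDK 17 compliance level; the same happens when running directly on JDK 17), and into how this changes with the various gc strategies.

But I am happy with this and consider the matter closed! Hope this helps someone.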
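As a starting point for that further study, a small sketch like the following can show which collectors the running JVM actually registered and how many collections have happened so far. It relies only on the standard java.lang.management API; the class name GcInspector is hypothetical, and the collector names printed depend on the JDK and on any -XX GC flags in effect.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcInspector {

    /** Names of the collectors the running JVM registered; values vary by JDK and flags. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** Total collections so far across all collectors; beans reporting -1 (undefined) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("collectors: " + collectorNames());
        long before = totalCollections();
        System.gc(); // only a request; the JVM may ignore it
        long after = totalCollections();
        System.out.println("collections recorded around System.gc(): " + (after - before));
    }
}
```

Running this under different GC settings on the same JDK would be one way to check whether the difference between default and explicit collection depends on the collector in use.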
