Apache Tika - NoSuchMethodError TarArchiveInputStream.getNextEntry()


Versions I am using:

  • Spring Boot: 3.2.4
  • Java: JDK 17

The POM, set up as in the docs, with comments based on my dependency tree:

  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.2</version>
  </dependency>
  <!-- Uses commons-compress lib with version 1.26.1 -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.2</version>
  </dependency>
  <!-- Uses commons-compress lib with version 1.24.0 -->
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <scope>test</scope>
    <version>4.13.2</version>
  </dependency>
  <!-- Uses commons-compress lib with version 1.24.0 -->
  <dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>testcontainers</artifactId>
    <scope>test</scope>
    <version>1.19.7</version>
  </dependency>
 

Code used:

String htmlFile = "<!DOCTYPE html><html><head><!-- HTML Codes by Quackit.com --><title></title><meta name=\"viewport\" content=\"width=device-width, initial-scale=1\"><style>body {background-color:#ffffff;background-repeat:no-repeat;background-position:top left;background-attachment:fixed;}h1{font-family:Arial, sans-serif;color:#000000;background-color:#ffffff;}p {font-family:Georgia, serif;font-size:14px;font-style:normal;font-weight:normal;color:#000000;background-color:#ffffff;}</style></head><body><h1>test</h1><p>test2</p></body></html>";

Tika tika = new Tika();
String text;

// This throws the exception
text = tika.parseToString(TikaInputStream.get(htmlFile.getBytes()));

// This does not work either (same exception)
text = tika.parseToString(new ByteArrayInputStream(htmlFile.getBytes()));

Exception (relevant part):

java.lang.NoSuchMethodError: 'org.apache.commons.compress.archivers.tar.TarArchiveEntry org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry()'
at org.apache.tika.detect.zip.TikaArchiveStreamFactory.detect(TikaArchiveStreamFactory.java:293) ~[tika-parser-zip-commons-2.9.2.jar:2.9.2]
at org.apache.tika.detect.zip.DefaultZipContainerDetector.detectArchiveFormat(DefaultZipContainerDetector.java:124) ~[tika-parser-zip-commons-2.9.2.jar:2.9.2]
at org.apache.tika.detect.zip.DefaultZipContainerDetector.detect(DefaultZipContainerDetector.java:175) ~[tika-parser-zip-commons-2.9.2.jar:2.9.2]
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:177) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.Tika.parseToString(Tika.java:525) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.Tika.parseToString(Tika.java:495) ~[tika-core-2.9.2.jar:2.9.2]
at org.apache.tika.Tika.parseToString(Tika.java:557) ~[tika-core-2.9.2.jar:2.9.2]
at ... -> tika.parseToString(...

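I also tried a different way of parsing with Tika, but I get the same exception: while debugging I saw that the first approach calls the second one inside the Tika library. This is the second approach I tried: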

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(TikaInputStream.get(htmlFile.getBytes()), handler, metadata);
text = handler.toString();

With Tika version 2.9.1 everything works fine, just as it did a few months ago.

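But I want the latest version, without known vulnerabilities. So I tried removing tika-parsers-standard-package from the POM. When I debugged the Tika library, it got further but returned only empty text, which makes sense because the parser for HTML-like files was then missing (as I saw while debugging). So is this a bug in Tika, or am I doing something wrong?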

java apache-tika spring-boot-3
1 answer

I'm posting the answer for anyone else who runs into the same problem. This is what helped me:

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>2.9.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
      <version>2.9.2</version>
    </dependency>

    <!-- By adding this the problem is solved -->
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-compress</artifactId>
      <version>1.26.1</version>
    </dependency>

I also investigated why this happens, using the following command:

mvn dependency:tree -Dverbose | grep 'omitted for conflict' | grep 'commons-compress'

Output of the command in my case:

[INFO] |  |  |  +- (org.apache.commons:commons-compress:jar:1.25.0:compile - omitted for conflict with 1.26.1)
[INFO] |  |  |  \- (org.apache.commons:commons-compress:jar:1.26.1:compile - omitted for conflict with 1.24.0)
[INFO] |  |  |  +- (org.apache.commons:commons-compress:jar:1.25.0:compile - omitted for conflict with 1.24.0)
[INFO] |  |  \- (org.apache.commons:commons-compress:jar:1.26.1:compile - omitted for conflict with 1.24.0)

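This means Tika was running against commons-compress 1.24.0: Maven resolves only one version of each artifact, and the transitive dependencies of tika-parsers-standard-package conflicted with those of junit / testcontainers. By pinning commons-compress to 1.26.1 explicitly, every dependency now uses commons-compress 1.26.1, so Tika gets the version it expects and works as intended.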
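
To double-check at runtime which commons-compress jar is actually loaded, a small reflection check along these lines may help. This is only a sketch: the class name ClasspathCheck and its helper methods are made up for illustration, and it assumes you run it with the same classpath as the application (inside a Spring Boot fat jar the reported location will look different). The stack trace above shows that Tika 2.9.2 expects getNextEntry() to return TarArchiveEntry, so an older jar would be expected to report a different return type.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical helper, not part of Tika or commons-compress.
public class ClasspathCheck {

    /** Where the class was loaded from (jar or directory URL), if the JVM exposes it. */
    public static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source, e.g. a JDK class)";
        }
        return source.getLocation().toString();
    }

    /** Return type of a public no-argument method, or null if no such method exists. */
    public static String returnTypeOf(Class<?> type, String methodName) {
        try {
            Method method = type.getMethod(methodName);
            return method.getReturnType().getName();
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String className = "org.apache.commons.compress.archivers.tar.TarArchiveInputStream";
        try {
            // initialize=false: only load the class, do not run its static initializers
            Class<?> type = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            System.out.println("Loaded from: " + locationOf(type));
            System.out.println("getNextEntry() returns: " + returnTypeOf(type, "getNextEntry"));
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }
}
```

If the printed location points at a commons-compress jar other than 1.26.1, the version conflict is still there and the explicit dependency above is not taking effect.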
