How do I read multi-line content between two words from a PDF file using Java?

Question (votes: 2, answers: 1)

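I have a requirement to extract data from a PDF file: the text that comes after the word "IN:" and before the word "OUT:". There are many such occurrences in the file.

The difficulty is that the data can span multiple lines and its format is not defined.

I even tried setting conditions, such as a line starting or ending with a particular character, but that way I would have to write too many conditions, and lines in the same format also appear after the word "OUT:".

Please tell me how to solve this.

Here are sample data formats: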

Format 1:

IN: {
"abc": "valueabc",
"def": "valuedef",
"ghi":
[
{"jkl": valuejkl, "mno": valuemno, "pqr":
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"}
],
"id": "1"
}
OUT: {"abc": "valueabc", "id": "1", "def": {}}

Format 2:

IN: {"abc": "valueabc", "def": "valuedef", "id": "1"}
OUT: {"abc": "valueabc", "id": "1", "ghi": "valueghi"}

Format 3:

IN: {"abc": "valueabc", "def": "valuedef", "jkl":
["valuejkl"], "id": "1"}
OUT: {"abc": "valueabc", "id": "1", "ghi": {}}

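Below is the core logic of the solution I tried. The first if branch extracts some separate data that I also need; the branches after it are the logic for getting the data after "IN:" and before "OUT:".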

// Lines that start with three dot-separated numbers (compiled once, outside the loop)
Pattern numberedLine = Pattern.compile("^[0-9]+[\\.][0-9]+[\\.][0-9]+[\\.].*");
for (String line : lines) {
    if (numberedLine.matcher(line).matches()) {
        // Separate data: keep the 4th and 5th dot-separated fields
        String[] parts = line.split("\\.");
        String finalString = parts[3].trim() + "." + parts[4].trim() + ",";
        System.out.println();
        System.out.print(finalString);
    } else if (line.startsWith("IN:")) {
        // First line of an IN block: drop the "IN:" marker
        System.out.print(line.substring(3).trim());
    } else if (!line.startsWith("OUT:") && line.trim().length() > 1 && line.endsWith("}")) {
        // Continuation line that ends with "}"
        System.out.print(line.trim());
    } else if (!line.startsWith("OUT:") && line.trim().length() > 1 && line.startsWith("\"")) {
        // Continuation line that starts with a double quote
        System.out.print(line.trim());
    }
    // Every other line is skipped
}
java multiline data-extraction
1 answer
2 votes

How about this? If you want the value between IN: and OUT:, could you try this code?

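StringBuilder sb = new StringBuilder();
boolean insideBlock = false; // true while between "IN:" and "OUT:"
for (String line : lines) {
    if (line.startsWith("IN:")) {
        // Start a new block and keep whatever follows the "IN:" marker
        insideBlock = true;
        sb.setLength(0);
        sb.append(line.substring(3).trim());
    } else if (line.startsWith("OUT:")) {
        // The block ends here: print it once, then stop collecting
        if (insideBlock) {
            System.out.println(sb.toString());
            sb.setLength(0);
        }
        insideBlock = false;
    } else if (insideBlock) {
        // Continuation line of a multi-line block
        sb.append(line.trim());
    }
}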

Input text:

IN: {
"abc": "valueabc",
"def": "valuedef",
"ghi":
[
"valuepqr"},
{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":
"valuepqr"}
],
"id": "1"
}
OUT: {"abc": "valueabc", "~"}

Result:

{"abc": "valueabc","def": "valuedef","ghi":["valuepqr"},{"jkl": valuejkl, "mno": valuemno, "stu": "valuestu", "pqr":"valuepqr"}],"id": "1"}
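Since the question mentions many IN:/OUT: pairs in one file, here is a minimal self-contained sketch of the same line-based idea that collects every IN block into a list instead of printing it. The class name InBlockExtractor and the method name extractInBlocks are made up for illustration. It assumes the PDF text has already been extracted and split into lines (that step is not shown) and that both markers always sit at the start of a line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InBlockExtractor {

    /** Returns one string per "IN:" ... "OUT:" block, with line breaks removed. */
    public static List<String> extractInBlocks(List<String> lines) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideBlock = false; // true while between "IN:" and "OUT:"

        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.startsWith("IN:")) {
                // A new block starts; keep the text after the marker.
                insideBlock = true;
                current.setLength(0);
                current.append(line.substring(3).trim());
            } else if (line.startsWith("OUT:")) {
                // The block is complete; the OUT: content itself is ignored.
                if (insideBlock) {
                    blocks.add(current.toString());
                }
                insideBlock = false;
            } else if (insideBlock) {
                // Continuation line of a multi-line IN block.
                current.append(line);
            }
            // Lines outside any block are skipped.
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Format 2 and Format 3 from the question, one after the other.
        List<String> lines = Arrays.asList(
            "IN: {\"abc\": \"valueabc\", \"def\": \"valuedef\", \"id\": \"1\"}",
            "OUT: {\"abc\": \"valueabc\", \"id\": \"1\", \"ghi\": \"valueghi\"}",
            "IN: {\"abc\": \"valueabc\", \"def\": \"valuedef\", \"jkl\":",
            "[\"valuejkl\"], \"id\": \"1\"}",
            "OUT: {\"abc\": \"valueabc\", \"id\": \"1\", \"ghi\": {}}"
        );
        for (String block : extractInBlocks(lines)) {
            System.out.println(block);
        }
    }
}
```

With Format 2 and Format 3 as input, each IN block comes back as a single string and nothing from the OUT: lines is included. Line breaks inside a block are dropped without inserting a space, the same way the loop above joins trimmed lines.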