How do I parse a table written in Jira markup if cells contain line breaks?

Question

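I need to parse tables written in Jira markup. So far I have managed to parse tables with a simple arrangement of headers and cells (cells containing text, images, or a combination of both).

I am a bit stuck when it comes to cells that contain line breaks. My method does the following: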

    String JIRA_TABLE_REGEX = "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+";
    Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
    Matcher matcher = TABLE_PATTERN.matcher(input);

        // TODO - fix rows with new lines
        while (matcher.find()) {
            String jiraTable = matcher.group();
            // Split the input string into rows
            String[] rowArray = jiraTable.split("(\\n|\\r\\n|\\r)");
            int size = 0;
            List<String> headers = new ArrayList<>();
            List<String> cells = new ArrayList<>();

            for (String row : rowArray) {
                // If the row starts with "||" it's a header row
                if (row.startsWith("||")) {
                    headers.addAll(List.of(row.substring(2, row.length() - 2).split("\\|\\|")));
                    size = headers.size();
                } else if (row.startsWith("|")) {
                    cells.addAll(List.of(row.substring(1, row.length() - 1).split("\\|")));
                    if (size == 0) {
                        size = cells.size();
                    }
                }
            }
        }

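The regex breaks down as follows:

- `\\|\\|.*\\|\\|` matches a table header row.
- `(\\n|\\r\\n|\\r)` matches a line break.
- `)*` allows zero or more header rows.
- `\\|.*\\|` matches a table content row.
- `?` makes the trailing line break optional (zero or one).
- `)+` requires one or more content rows.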

The code works for the following table:

||Heading 1 Table 1||Heading 2 Table 1||
|_*BOLD AND ITALIC*_|_Italic_|
|Normal key|Normal Value|
|Third row, *just half BOLD*| |

The result contains every cell in the table.

But it fails miserably on:

||Heading 1 Table 2||Heading 2 Table 2||
|Col A1|Col A2|
|SECOND ROW|Second row too|
|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|
|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|
|Row with text and new lines|Text
  
 New Line
  
 More text
 Much more text|

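The result is missing the last cell.

Update 1

I tried the suggested approach ("split the input into rows on `\|` instead") by changing both the split regex and the regex that matches the table: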

    private static final String JIRA_TABLE_REGEX_ALTERNATIVE = "(\\|\\|.*\\|\\|(\\n))*(\\|.*\\|(\\n)?)+";

...

rowArray = jiraTable.split("\\|\\n");

It still misses the last cell, the one that contains all the line breaks.

Answer 1

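I would simply read the markup into an array of strings. The first line of the array is the header markup; the remaining lines are row markup. I would process the header markup first, then tokenize each row.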

public class Main
{
    public static void main(String[] args) {
       String markup = "||heading 1||heading 2||heading 3||\n|col A1|col A2|col A3|\n|col B1|col B2|col B3|";
        String[] lines = markup.split("\n");
        String[] headers = lines[0].split("[|]+");
        String[][] tbl = new String[lines.length][headers.length-1];
        for(int i=0; i<headers.length-1; i++)
           tbl[0][i] = headers[i+1];
        for(int i=1; i<lines.length; i++){
           String[] rows = lines[i].split("[|]");
           for(int j=1; j<rows.length; j++)
              tbl[i][j-1] = rows[j];
        }
        for(int i=0; i<tbl.length; i++){
           for(int j=0; j<tbl[i].length; j++) // iterate over columns, not rows
              System.out.printf("|%-10s",tbl[i][j]);
           System.out.println("|");
        }
    }
}

Output

|heading 1 |heading 2 |heading 3 |
|col A1    |col A2    |col A3    |
|col B1    |col B2    |col B3    |
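
The code above assumes that every table row occupies exactly one physical line, so it does not handle cells with line breaks. A minimal sketch of one way to relax that, assuming every logical row ends with `|` at the end of a line and no line inside a cell does; the names `JiraRowJoiner` and `joinRows` are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class JiraRowJoiner {
    // Merges physical lines into logical rows. A row is treated as complete
    // only when a line ends with '|' (an assumption about the input).
    static List<String> joinRows(String markup) {
        List<String> rows = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : markup.split("\\R")) {
            if (current.length() > 0) {
                current.append('\n'); // keep the line break inside the cell
            }
            current.append(line);
            if (line.endsWith("|")) {
                rows.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            rows.add(current.toString()); // unterminated trailing row, kept as-is
        }
        return rows;
    }

    public static void main(String[] args) {
        String markup = "||h1||h2||\n|a|Text\n New Line\n More text|\n|b|c|";
        for (String row : joinRows(markup)) {
            System.out.println("[" + row + "]");
        }
    }
}
```

Each element of the returned list can then be split into cells in the same way as the single-line rows above.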

Answer 2

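Sorry, in my original answer I did not realize that you are of course processing a full Jira page, not just the text fragments you gave as examples.

I am not sure it works in all cases, but I tested your code and, after the regexes drove me crazy, I think I came up with a possible solution: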

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +
    "|_*BOLD AND ITALIC*_|_Italic_|\n" +
    "|Normal key|Normal Value|\n" +
    "|Third row, *just half BOLD*| |";

String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
    "|Col A1|Col A2|\n" +
    "|SECOND ROW|Second row too|\n" +
    "|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
    "|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
    "|Row with text and new lines|Text\n" +
    "  \n" +
    " New Line\n" +
    "  \n" +
    " More text\n" +
    " Much more text|\n";

String input = input1 + "\nA bunch of text.\n" + input2;

String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input);

// TODO - fix rows with new lines
int nt = 0;
while (matcher.find()) {
  System.out.printf("Table #%d\n", ++nt);
  String jiraTable = matcher.group();
  // Split the input string into rows
  String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
  int size = 0;
  List<String> headers = new ArrayList<>();
  List<String> cells = new ArrayList<>();

  for (String row : rowArray) {
    // If the row starts with "||" it's a header row
    if (row.startsWith("||")) {
      headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
      size = headers.size();
    } else if (row.startsWith("|")) {
      cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
      if (size == 0) {
        size = cells.size();
      }
    }
  }

  System.out.println("Headers:" + headers);
  System.out.println("Cells:" + cells);
}

This produces:

Table #1
Headers:[Heading 1 Table 1, Heading 2 Table 1]
Cells:[_*BOLD AND ITALIC*_, _Italic_, Normal key, Normal Value, Third row, *just half BOLD*,  ]
Table #2
Headers:[Heading 1 Table 2, Heading 2 Table 2]
Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol  !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
  
 New Line
  
 More text
 Much more text]
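
Note that in the output above each image cell ends up as two entries, because the `|` inside `!image.png|width=...!` is treated as a cell separator. A minimal sketch of one way to avoid that, assuming `!` appears in a row only as an image delimiter (this will not hold for cells containing ordinary exclamation marks); `JiraCellSplitter` and `splitCells` are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

class JiraCellSplitter {
    // Splits one content row into cells, ignoring '|' between a pair of '!'
    // characters (assumed here to delimit an image with attributes).
    static List<String> splitCells(String row) {
        // Drop the leading and trailing '|' that delimit the row itself.
        String body = row;
        if (body.startsWith("|")) {
            body = body.substring(1);
        }
        if (body.endsWith("|")) {
            body = body.substring(0, body.length() - 1);
        }

        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inImage = false;
        for (char c : body.toCharArray()) {
            if (c == '!') {
                inImage = !inImage; // entering or leaving an image reference
            }
            if (c == '|' && !inImage) {
                cells.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        cells.add(current.toString()); // last cell of the row
        return cells;
    }

    public static void main(String[] args) {
        String row = "|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|";
        System.out.println(splitCells(row));
    }
}
```

With this sketch the example row yields two cells, and the image reference stays in one piece together with its attributes.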

Original answer:

The following code should work:

String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +
    "|_*BOLD AND ITALIC*_|_Italic_|\n" +
    "|Normal key|Normal Value|\n" +
    "|Third row, *just half BOLD*| |";

String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
    "|Col A1|Col A2|\n" +
    "|SECOND ROW|Second row too|\n" +
    "|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
    "|Row with image and text |smol  !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
    "|Row with text and new lines|Text\n" +
    "  \n" +
    " New Line\n" +
    "  \n" +
    " More text\n" +
    " Much more text|";

String JIRA_TABLE_REGEX = "^(\\|\\|)(.|\\R)*(\\|)$";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input2);

// TODO - fix rows with new lines
if (matcher.find()) {
  String jiraTable = matcher.group();
  // Split the input string into rows
  String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
  int size = 0;
  List<String> headers = new ArrayList<>();
  List<String> cells = new ArrayList<>();

  for (String row : rowArray) {
    // If the row starts with "||" it's a header row
    if (row.startsWith("||")) {
      headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
      size = headers.size();
    } else if (row.startsWith("|")) {
      cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
      if (size == 0) {
        size = cells.size();
      }
    }
  }

  System.out.println("Headers:" + headers);
  System.out.println("Cells:" + cells);
}

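Basically this is your original code with two small changes:

First, the regex that matches the whole table has been simplified as follows: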

String JIRA_TABLE_REGEX = "^(\\|\\|)(.|\\R)*(\\|)$";

We search for the start of the table header and the end of the last row, regardless of the text in between.
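
The `(.|\\R)` alternation matters because, by default, `.` in a Java regex does not match line terminators, so a plain `.*` cannot span a multi-line cell. A small sketch of the difference on a made-up two-line row; the class and method names are hypothetical, and `Pattern.DOTALL` is shown only as an alternative that the code in this answer does not use:

```java
import java.util.regex.Pattern;

class DotVersusLineBreak {
    // Returns true if the regex, compiled with the given flags, matches the entire text.
    static boolean matchesWhole(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String row = "|a|Text\n New Line|";

        // '.' stops at the line break, so the row is not matched as a whole.
        System.out.println(matchesWhole("\\|.*\\|", 0, row));                // false

        // Allowing either any character or a line break lets the match span lines.
        System.out.println(matchesWhole("\\|(.|\\R)*\\|", 0, row));          // true

        // DOTALL makes '.' match line terminators as well.
        System.out.println(matchesWhole("\\|.*\\|", Pattern.DOTALL, row));   // true
    }
}
```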

Second, we split the rows by searching for one (content row) or two (header row) `|` characters followed by any kind of line terminator (represented by the regex `\R` since Java 8):

String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
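
To make the effect of this split concrete, here is a small sketch on a made-up table whose second row contains a line break; `RowSplitDemo` and `splitRows` are hypothetical names wrapping the same `split` call:

```java
class RowSplitDemo {
    // Splits a matched table into rows on "|" or "||" followed by a line terminator.
    static String[] splitRows(String table) {
        return table.split("(\\|(\\|)?)\\R");
    }

    public static void main(String[] args) {
        String table = "||h1||h2||\n|a|Text\n more|\n|b|c|";
        for (String row : splitRows(table)) {
            System.out.println("[" + row + "]");
        }
    }
}
```

The line break inside the cell is not preceded by `|`, so it stays within its row. The final row keeps its trailing `|` when no line terminator follows it, which the later `split("\\|")` tolerates because Java's `split` discards trailing empty strings.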

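Please note that I only tested the two examples you provided: you may need to tweak the regexes a bit to handle tables without headers, etc.

Running the example with `input2` produces the following output: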

Headers:[Heading 1 Table 2, Heading 2 Table 2]
Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol  !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
  
 New Line
  
 More text
 Much more text]

I also modified the `substring` calls used when processing each row, to avoid losing some trailing characters.
