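I need to parse tables written in Jira markup. So far I have managed to parse tables with a simple arrangement of headers and cells (cells containing text, images, or a combination of both).
I am a bit stuck when it comes to cells that contain line breaks. My method does the following: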
String JIRA_TABLE_REGEX = "(\\|\\|.*\\|\\|(\\n|\\r\\n|\\r))*(\\|.*\\|(\\n|\\r\\n|\\r)?)+";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input);
// TODO - fix rows with new lines
while (matcher.find()) {
    String jiraTable = matcher.group();
    // Split the input string into rows
    String[] rowArray = jiraTable.split("(\\n|\\r\\n|\\r)");
    int size = 0;
    List<String> headers = new ArrayList<>();
    List<String> cells = new ArrayList<>();
    for (String row : rowArray) {
        // If the row starts with "||" it's a header row
        if (row.startsWith("||")) {
            headers.addAll(List.of(row.substring(2, row.length() - 2).split("\\|\\|")));
            size = headers.size();
        } else if (row.startsWith("|")) {
            cells.addAll(List.of(row.substring(1, row.length() - 1).split("\\|")));
            if (size == 0) {
                size = cells.size();
            }
        }
    }
}
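The regular expression breaks down as follows:
\\|\\|.*\\|\\| - a table header row.
(\\n|\\r\\n|\\r) - a line break.
)* - zero or more header rows.
\\|.*\\| - a table content row.
? - zero or one line break.
)+ - one or more content rows.
The code works for the following table: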
||Heading 1 Table 1||Heading 2 Table 1||
|_*BOLD AND ITALIC*_|_Italic_|
|Normal key|Normal Value|
|Third row, *just half BOLD*| |
But it fails miserably on this one:
||Heading 1 Table 2||Heading 2 Table 2||
|Col A1|Col A2|
|SECOND ROW|Second row too|
|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|
|Row with image and text |smol !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|
|Row with text and new lines|Text
New Line
More text
Much more text|
I tried the suggested approach ("split the input into rows on \| instead") by changing both the split regex and the regex that matches the table:
private static final String JIRA_TABLE_REGEX_ALTERNATIVE = "(\\|\\|.*\\|\\|(\\n))*(\\|.*\\|(\\n)?)+";
...
rowArray = jiraTable.split("\\|\\n");
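It still misses the last cell, the one that contains all the line breaks.
I simply read the markup into an array of strings. The first element of the array is the header markup and the remaining elements are the row markup. I process the header markup first and then tokenize each row.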
public class Main
{
    public static void main(String[] args) {
        String markup = "||heading 1||heading 2||heading 3||\n|col A1|col A2|col A3|\n|col B1|col B2|col B3|";
        // One physical line per table row; the first line holds the headers
        String[] lines = markup.split("\n");
        String[] headers = lines[0].split("[|]+");
        // headers[0] is the empty string before the leading "||", hence length-1
        String[][] tbl = new String[lines.length][headers.length-1];
        for(int i=0; i<headers.length-1; i++)
            tbl[0][i] = headers[i+1];
        for(int i=1; i<lines.length; i++){
            String[] rows = lines[i].split("[|]");
            for(int j=1; j<rows.length; j++)
                tbl[i][j-1] = rows[j];
        }
        for(int i=0; i<tbl.length; i++){
            // Iterate over the columns of row i, not over the number of rows
            for(int j=0; j<tbl[i].length; j++)
                System.out.printf("|%-10s",tbl[i][j]);
            System.out.println("|");
        }
    }
}
Output:
|heading 1 |heading 2 |heading 3 |
|col A1    |col A2    |col A3    |
|col B1    |col B2    |col B3    |
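This line-by-line approach relies on every table row fitting on one physical line. One possible way to carry it over to cells that contain line breaks is to glue physical lines together until a line ends with |. The sketch below assumes that every row ends with | at the end of a line and that no line inside a cell ends with |; the class name LogicalRows and the method joinRows are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LogicalRows {
    // Joins physical lines into logical table rows.
    // Assumption: a row is complete once a physical line ends with '|'.
    static List<String> joinRows(String markup) {
        List<String> rows = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String line : markup.split("\\R")) {
            if (pending.length() > 0) {
                // Restore the line break that split() removed inside the cell.
                pending.append('\n');
            }
            pending.append(line);
            if (line.endsWith("|")) {
                rows.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            // Keep an unterminated trailing row instead of silently dropping it.
            rows.add(pending.toString());
        }
        return rows;
    }

    public static void main(String[] args) {
        String markup = "||h1||h2||\n|a|Text\nNew Line\nMore text|\n|b|c|";
        List<String> rows = joinRows(markup);
        System.out.println(rows.size()); // 3
    }
}
```

Each element returned by joinRows can then be tokenized the same way as a single line is in the code above.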
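Sorry: in my original answer I did not realize that you are of course processing a whole Jira page, not just the text fragment you gave as an example.
I am not sure it works in every case, but I tested your code and, after the regexes drove me crazy, I think I came up with a possible solution: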
String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +
"|_*BOLD AND ITALIC*_|_Italic_|\n" +
"|Normal key|Normal Value|\n" +
"|Third row, *just half BOLD*| |";
String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
"|Col A1|Col A2|\n" +
"|SECOND ROW|Second row too|\n" +
"|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
"|Row with image and text |smol !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
"|Row with text and new lines|Text\n" +
" \n" +
" New Line\n" +
" \n" +
" More text\n" +
" Much more text|\n";
String input = input1 + "\nA bunch of text.\n" + input2;
String JIRA_TABLE_REGEX = "\\|\\|(.|\\R[^\\|])+\\|\\|\\R((\\|(?!\\|))(.|\\R[^\\|])+(\\|(?!\\|)\\R?))+";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input);
// TODO - fix rows with new lines
int nt = 0;
while (matcher.find()) {
    System.out.printf("Table #%d\n", ++nt);
    String jiraTable = matcher.group();
    // Split the input string into rows
    String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
    int size = 0;
    List<String> headers = new ArrayList<>();
    List<String> cells = new ArrayList<>();
    for (String row : rowArray) {
        // If the row starts with "||" it's a header row
        if (row.startsWith("||")) {
            headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
            size = headers.size();
        } else if (row.startsWith("|")) {
            cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
            if (size == 0) {
                size = cells.size();
            }
        }
    }
    System.out.println("Headers:" + headers);
    System.out.println("Cells:" + cells);
}
Which produces:
Table #1
Headers:[Heading 1 Table 1, Heading 2 Table 1]
Cells:[_*BOLD AND ITALIC*_, _Italic_, Normal key, Normal Value, Third row, *just half BOLD*, ]
Table #2
Headers:[Heading 1 Table 2, Heading 2 Table 2]
Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
New Line
More text
Much more text]
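Note that in the output for Table #2 the image cells are cut in two at the | inside the image markup (for example, width=359,height=253! ends up as a cell of its own). A minimal sketch of one way to keep such cells together is to scan the row character by character and ignore any | that sits between two ! characters. It assumes that ! appears only as an image delimiter and that the row starts with |; ImageAwareCellSplitter and splitCells are hypothetical names, not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;

class ImageAwareCellSplitter {
    // Splits one data row into cells on '|', except for a '|' that lies
    // between the two '!' of an image such as !pic.png|width=10!.
    // Assumption: '!' is used only to delimit images.
    static List<String> splitCells(String row) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideImage = false;
        // Start at 1 to skip the '|' that opens the row.
        for (int i = 1; i < row.length(); i++) {
            char c = row.charAt(i);
            if (c == '!') {
                insideImage = !insideImage;
            }
            if (c == '|' && !insideImage) {
                cells.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            // Row did not end with '|': keep what was collected.
            cells.add(current.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        String row = "|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|";
        System.out.println(splitCells(row));
        // [Row with image, !image-2023-04-24-17-51-07-167.png|width=359,height=253!]
    }
}
```

Because the scan never looks at line breaks, a multi-line cell comes through as a single cell as well.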
Original answer:
The following code should work:
String input1 = "||Heading 1 Table 1||Heading 2 Table 1||\n" +
"|_*BOLD AND ITALIC*_|_Italic_|\n" +
"|Normal key|Normal Value|\n" +
"|Third row, *just half BOLD*| |";
String input2 = "||Heading 1 Table 2||Heading 2 Table 2||\n" +
"|Col A1|Col A2|\n" +
"|SECOND ROW|Second row too|\n" +
"|Row with image|!image-2023-04-24-17-51-07-167.png|width=359,height=253!|\n" +
"|Row with image and text |smol !image-2023-05-02-12-42-16-942.png|width=347,height=231! kitten|\n" +
"|Row with text and new lines|Text\n" +
" \n" +
" New Line\n" +
" \n" +
" More text\n" +
" Much more text|";
String JIRA_TABLE_REGEX = "^(\\|\\|)(.|\\R)*(\\|)$";
Pattern TABLE_PATTERN = Pattern.compile(JIRA_TABLE_REGEX);
Matcher matcher = TABLE_PATTERN.matcher(input2);
// TODO - fix rows with new lines
if (matcher.find()) {
    String jiraTable = matcher.group();
    // Split the input string into rows
    String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");
    int size = 0;
    List<String> headers = new ArrayList<>();
    List<String> cells = new ArrayList<>();
    for (String row : rowArray) {
        // If the row starts with "||" it's a header row
        if (row.startsWith("||")) {
            headers.addAll(Arrays.asList(row.substring(2).split("\\|\\|")));
            size = headers.size();
        } else if (row.startsWith("|")) {
            cells.addAll(Arrays.asList(row.substring(1).split("\\|")));
            if (size == 0) {
                size = cells.size();
            }
        }
    }
    System.out.println("Headers:" + headers);
    System.out.println("Cells:" + cells);
}
This is basically your original code with two small changes:
String JIRA_TABLE_REGEX = "^(\\|\\|)(.|\\R)*(\\|)$";
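We search for the start of the table header and the end of the last row, regardless of the text in between.
The rows are then split at a | character followed by any kind of line terminator (represented by the regex \R since Java 8):
String[] rowArray = jiraTable.split("(\\|(\\|)?)\\R");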
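To illustrate why the table regex uses (.|\R) rather than a bare dot: in java.util.regex, the dot does not match line terminators unless Pattern.DOTALL is set, so .* can never span a multi-line cell. The sketch below compares the three variants on a single row; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class DotVersusLineBreak {
    // Plain '.' stops at line terminators, so a multi-line cell does not match.
    static boolean plainDot(String row) {
        return Pattern.matches("\\|.*\\|", row);
    }

    // (.|\R) lets each repetition consume either one character or one line break.
    static boolean dotOrLineBreak(String row) {
        return Pattern.matches("\\|(.|\\R)*\\|", row);
    }

    // DOTALL is an alternative: it makes '.' itself match line terminators.
    static boolean dotAll(String row) {
        return Pattern.compile("\\|.*\\|", Pattern.DOTALL).matcher(row).matches();
    }

    public static void main(String[] args) {
        String row = "|Row with text and new lines|Text\nNew Line|";
        System.out.println(plainDot(row));       // false
        System.out.println(dotOrLineBreak(row)); // true
        System.out.println(dotAll(row));         // true
    }
}
```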
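Note that I only tested the two examples you provided; you may need to tweak the regex a little to handle tables without headers, and so on.
Running the example with input2 gives the following output: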
Headers:[Heading 1 Table 2, Heading 2 Table 2]
Cells:[Col A1, Col A2, SECOND ROW, Second row too, Row with image, !image-2023-04-24-17-51-07-167.png, width=359,height=253!, Row with image and text , smol !image-2023-05-02-12-42-16-942.png, width=347,height=231! kitten, Row with text and new lines, Text
New Line
More text
Much more text]
I also modified the substring calls used when processing each row, to avoid losing some trailing characters.
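As a small illustration of the substring change: once the table is split on | followed by a line terminator, every row except possibly the last has already lost its closing |, so trimming one more character from the end cuts into the last cell. The sketch below shows the difference on a two-row table; SubstringAfterSplit, trimBothEnds and trimStartOnly are hypothetical names for the two variants.

```java
import java.util.Arrays;
import java.util.List;

class SubstringAfterSplit {
    // Variant from the question: assumes the row still ends with '|'.
    static List<String> trimBothEnds(String row) {
        return Arrays.asList(row.substring(1, row.length() - 1).split("\\|"));
    }

    // Variant from this answer: only drops the leading '|'.
    static List<String> trimStartOnly(String row) {
        return Arrays.asList(row.substring(1).split("\\|"));
    }

    public static void main(String[] args) {
        String table = "|Col A1|Col A2|\n|SECOND ROW|Second row too|";
        // The split consumes the closing '|' of every row that is followed by a line break.
        String[] rows = table.split("(\\|(\\|)?)\\R");
        System.out.println(trimBothEnds(rows[0]));  // [Col A1, Col A]
        System.out.println(trimStartOnly(rows[0])); // [Col A1, Col A2]
        // The last row keeps its closing '|'; split() drops the trailing empty string.
        System.out.println(trimStartOnly(rows[1])); // [SECOND ROW, Second row too]
    }
}
```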