字幕中 SRT 和 VTT 语法的正则表达式匹配

Question

我有 srt 和 vtt 格式的字幕，我需要匹配和删除特定于格式的语法，并获得干净的文本行。

我想出了这个正则表达式：

/\n?\d*?\n?^.* --> [012345]{2}:.*$/m

示例内容（混合 srt 和 vtt）：

1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2

这与https://regex101.com/r/zRsRMR/2/

中模拟的预期字幕编号和时间相匹配

但是当在代码本身中使用时（即使直接使用从https://regex101.com生成的代码片段），它只会匹配时间，而不匹配字幕编号。

查看输出：

array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)

可以测试：http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242

目标是匹配字幕编号，例如第一个预期匹配应该是：

1
00:00:04,019 --> 00:00:07,299

Answer 1

我不太确定这是否是您想要捕捉的。但是，原因是您可能希望用捕获组包裹字符串，以便于轻松获取。例如，这个表达式示例了捕获组如何围绕您所需的字符工作：

^([0-9]+\n|)([0-9:,->\s]+)

这可能不是这样做的方式，也不是最好的表达方式。然而，它可能会给你一个以不同的方式解决问题的想法。

我猜您可能想捕获日期时间行和之前的行，其中可能有也可能没有数字。

图表

此图显示了表达式的工作原理，您可以在此链接中可视化其他表达式：

在将数据发送到正则表达式引擎之前，您可能需要编写一个脚本来清理数据，这样您就可以得到一个简单的表达式。

使用 JavaScript 进行测试示例

const regex = /^([0-9]+\n|)([0-9:,->\s]+)/mg;
const str = `1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

PHP 测试

这可能不会生成您想要的输出，这只是一个示例：

$re = '/^([0-9]+\n|)([0-9:,->\s]+)/m';
$str = '1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches[0] as $key => $value) {
    if ($value == "") {
        unset($matches[0][$key]);
    } else {
        $matches[0][$key] = trim($value);
    }

}

var_dump($matches[0]);

性能测试

此 JavaScript 片段使用简单的 100 万次

for

循环显示了该表达式的性能。

repeat = 1000000;
start = Date.now();

for (var i = repeat; i >= 0; i--) {
	var string = '2  \n00:00:07,414 --> 00:00:09,155';
	var regex = /(.*)([0-9:,->\s]+)/gm;
	var match = string.replace(regex, "$2");
}

end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

如果您希望在一个变量中捕获所有所需的输出，您只需在整个表达式周围添加一个捕获组，然后使用

$1

调用它。

如果需要，您还可以添加或减少边界，例如这个。

^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$

使用 JavaScript 测试第二个表达式的示例

const regex = /^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$/gm;
const str = `1
00:00:04,019 --> 00:00:07,299
- cdcdc
- cddcd

2
00:00:07,414 --> 00:00:09,155
54564

00:00:09,276 --> 00:00:11,429
- 445454 - ccd
- cdscdcdcd

00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Answer 2

您可以将表达式的这一部分设为

\n?\d*?\n?

一个可选组，以匹配 1 个以上数字，后跟换行符。字符类

[012345]

也可以写为

[0-5]

您可以将表达式更新为：

^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$

```
^
```
字符串开头
```
(?:\d+\n)?
```
可选 1+ 位数字和换行符

.*\h+-->\h+ Match 0+ times any char except newline, 1+ horizontal whitespace chars,

-->` 和 1+ 水平空白字符

```
[0-5]{2}:
```
匹配 2 次 0-5
```
.*
```
匹配除换行符之外的任何字符 0+ 次
```
$
```
字符串结束

正则表达式演示 | Php 演示

Answer 3

Vtt 格式可以有样式。此外，人们手动编辑这些文件，通常会犯不同的格式错误（例如错误的时间戳格式、额外的新行、空格...）。这使得编写正则表达式几乎不可能。

如果您想正确解析字幕，最好的选择之一是使用库：

$srt = '
   1
   00:00:04,019 --> 00:00:07,299
   line1
   line2

';
echo Subtitles::loadFromString($srt)->content('txt'); 
// Output: 
// line1
// line2

您可以通过这种方式解析 .srt 和 .vtt 文件。

https://github.com/mantas-done/subtitles

字幕中 SRT 和 VTT 语法的正则表达式匹配

问题描述投票：0回答：3

3个回答

图表

使用 JavaScript 进行测试示例

PHP 测试

性能测试

使用 JavaScript 测试第二个表达式的示例

最新问题

字幕中 SRT 和 VTT 语法的正则表达式匹配

问题描述 投票：0回答：3

3个回答

图表

使用 JavaScript 进行测试示例

PHP 测试

性能测试

使用 JavaScript 测试第二个表达式的示例

最新问题

问题描述投票：0回答：3