PHP preg_match pattern to remove timings from an srt subtitle file

Question

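I need a preg_match expression to remove all the timings from a .srt subtitle file (imported as a string), but I have never quite managed to get my head around regex patterns. For example, it would change: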

5
00:05:50,141 --> 00:05:54,771
This is what was said

to

This is what was said

Answer 1

Not sure where you got stuck; it is basically just \d+ plus colons and commas.

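// Matches the cue number, the timing line and the single character (line break) after it.
// The s flag lets "." match the newline between the number and the timing line.
$re = '/\d+.\d+:\d+:\d+,\d+\s-->\s\d+:\d+:\d+,\d+./s';
//$re = '/\d+.[0-9:,]+\s-->\s[\d:,]+./s'; // slightly more compact version of the regex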
$str = '5
00:05:50,141 --> 00:05:54,771
This is what was said';
$subst = '';

$result = preg_replace($re, $subst, $str);

echo $result;

Working demo here.
With the more compact pattern it looks like this: https://regex101.com/r/QY9QXG/2
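
When the string holds many cues, or comes from a file with Windows line endings, the same idea can be anchored to line starts. This is a minimal sketch, assuming every cue number and every timing line sits on its own line; the function name stripSrtTimings is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: removes the cue number line and the timing line of every cue.
function stripSrtTimings(string $srt): string
{
    // ^ with the m flag anchors at each line start; \R matches \n, \r\n or \r.
    // \h* tolerates trailing spaces or tabs before the line break.
    $pattern = '/^\d+\h*\R\d+:\d+:\d+,\d+\h+-->\h+\d+:\d+:\d+,\d+\h*(?:\R|\z)/m';

    // preg_replace returns null on a PCRE error; fall back to the input in that case.
    return preg_replace($pattern, '', $srt) ?? $srt;
}

$srt = "1\r\n00:05:50,141 --> 00:05:54,771\r\nThis is what was said\r\n\r\n"
     . "2\r\n00:05:55,000 --> 00:05:58,250\r\nAnd this came next\r\n";

echo stripSrtTimings($srt);
```

Because the pattern requires a number line directly followed by a timing line, a subtitle whose text is just a number is left alone.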

Just for fun and for the challenge, here is a non-regex answer: https://3v4l.org/r7hbO

$str = "1
00:05:50,141 --> 00:05:54,771
This is what was said1

2
00:05:50,141 --> 00:05:54,771
This is what was said2

3
00:05:50,141 --> 00:05:54,771
This is what was said3

4
00:05:50,141 --> 00:05:54,771
This is what was said4
LLLL

5
00:05:50,141 --> 00:05:54,771
This is what was said5";


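// One array item per subtitle group.
$groups = explode(PHP_EOL.PHP_EOL, $str);

foreach($groups as &$group){
    // Drop the number and timing lines, keep the text lines.
    $group = implode(PHP_EOL, array_slice(explode(PHP_EOL, $group), 2));
}
unset($group); // break the reference left behind by foreach

echo implode(PHP_EOL.PHP_EOL, $groups);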

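The non-regex version first splits on double newlines, so each subtitle group becomes one item in the array.
It then loops over the groups and explodes each one again on a single newline.
The first two lines are not needed, so array_slice cuts them off.
If a subtitle spans more than one line, the lines have to be joined again; imploding on a newline does that.

As a last step, the string is rebuilt by imploding on double newlines.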

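As Casimir wrote in a comment below, I used PHP_EOL as the newline, and that works in the example.
But on a real srt file the newline sequence may be different.
If the code does not work as expected, try replacing PHP_EOL with another newline sequence.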
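
One way to sidestep the PHP_EOL question is to normalize the line endings before splitting. The following is only a sketch of that idea, assuming cues are separated by exactly one blank line; stripCueHeaders is a hypothetical name.

```php
<?php
// Hypothetical helper: same split/slice/join idea, independent of the file's line endings.
function stripCueHeaders(string $srt): string
{
    // Normalize \r\n and lone \r to \n so the explode calls below are predictable.
    $srt = str_replace(["\r\n", "\r"], "\n", trim($srt));

    $texts = [];
    // Assumption: exactly one blank line between cues.
    foreach (explode("\n\n", $srt) as $cue) {
        // Drop the number line and the timing line, keep the text lines.
        $texts[] = implode("\n", array_slice(explode("\n", $cue), 2));
    }

    return implode("\n\n", $texts);
}

$input = "1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\nworld\r\n\r\n"
       . "2\r\n00:00:03,000 --> 00:00:04,000\r\nBye\r\n";

echo stripCueHeaders($input);
```

The output always uses "\n", whatever the input used, which keeps later processing simple.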

Answer 2

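Since srt files always have the same format, you can skip the first two lines of each block of lines and return the result once you reach an empty line. To do that without loading the whole file into memory, you can read the file line by line and use a generator: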

function getSubtitleLine($handle) {
    $flag = 0; // number of header lines (index, timing) skipped in the current block
    $subtitle = '';
    while ( false !== $line = stream_get_line($handle, 1024, "\n") ) {
        $line = rtrim($line);
        // Strict comparison: empty() would also treat a line containing only "0" as blank.
        if ( $line === '' ) {
            yield $subtitle;
            $subtitle = '';
            $flag = 0;
        } elseif ( $flag == 2 ) {
            $subtitle .= $subtitle === '' ? $line : "\n$line";
        } else {
            $flag++;
        }
    }

    // Last block when the file does not end with a blank line.
    if ( $subtitle !== '' )
        yield $subtitle;
}

if ( false !== $handle = fopen('./test.srt', 'r') ) {
    foreach (getSubtitleLine($handle) as $line) {
        echo $line, PHP_EOL;
    }
    fclose($handle);
}
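
The skip-two-lines-then-collect logic of the generator can be tried without a file on disk by feeding it from a php://memory stream. This is a sketch rather than the answer's exact code: subtitleTexts is a hypothetical variant that uses fgets and collects lines in an array.

```php
<?php
// Hypothetical variant of the generator above: skip two header lines per cue, yield the text.
function subtitleTexts($handle): Generator
{
    $skipped = 0;   // header lines skipped so far in the current cue
    $text = [];     // text lines collected for the current cue

    while (false !== ($line = fgets($handle))) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            // A blank line closes the cue; consecutive blank lines yield nothing.
            if ($text !== []) {
                yield implode("\n", $text);
            }
            $skipped = 0;
            $text = [];
        } elseif ($skipped < 2) {
            $skipped++;            // cue number line, then timing line
        } else {
            $text[] = $line;
        }
    }

    if ($text !== []) {
        yield implode("\n", $text); // last cue when there is no trailing blank line
    }
}

// Feed the generator from memory instead of a file on disk.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\nworld\r\n\r\n"
              . "2\r\n00:00:03,000 --> 00:00:04,000\r\n0\r\n");
rewind($handle);

$result = iterator_to_array(subtitleTexts($handle), false);
fclose($handle);

print_r($result);
```

The second cue's text is the single character "0", which is kept because blank lines are detected with a strict comparison against the empty string.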

Answer 3

PHP code:

$str = '5
00:05:50,141 --> 00:05:54,771
This is what was said';
$reg = '/(.{0,}[0,1]{0,}\s{0,}[0-9]{0,}.{0,}[0-9]+[0-9]+:[0-9]{0,}.{0,})/';
echo(trim(preg_replace($reg, '', $str)));

Answer 4

So, considering that

This is what was said

starts with a capital letter and can be text with punctuation, I suggest the following:

$re = '/.*([A-Z]{1}[A-Za-z0-9 _.,?!"\/\'$]*)/';

$str = '5
00:05:50,141 --> 00:05:54,771
This is what was said.';

preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE, 0);

// Print the entire match result
var_dump($matches);
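
With PREG_OFFSET_CAPTURE every entry in $matches is a pair of matched text and byte offset, so the subtitle text still has to be picked out of the first capture group. A short sketch of that step, using the pattern from this answer and therefore its assumption that the text starts with a capital letter:

```php
<?php
$re = '/.*([A-Z]{1}[A-Za-z0-9 _.,?!"\/\'$]*)/';
$str = "5\n00:05:50,141 --> 00:05:54,771\nThis is what was said.";

preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);

// $matches[1] holds capture group 1; each entry is [matched text, byte offset].
// array_column with key 0 keeps only the matched text.
$texts = array_column($matches[1], 0);

print_r($texts);
```

The number and timing lines contain no capital letter, so only the text line produces a match.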

Answer 5

If your .srt files come from different sources and are not well formed, you can use a library that parses them properly and extracts the text:

$srt = '
   5
   00:05:50,141 --> 00:05:54,771
   This is what was said
';
echo Subtitles::loadFromString($srt)->content('txt'); // Output: This is what was said

https://github.com/mantas-done/subtitles
