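(Sorry in advance for the long question. The problem is actually simple, but explaining it may not be.)
My novice PHP skills are being challenged:
The input is 2 TXT files with the following structure: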
$rowidentifier //number,letter,string etc..
$some semi-fixed-string $somedelimiter $semi-fixed-string
$content //with unknown length or strings or lines number.
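In the above, by "semi-fixed string" I mean a string with a known structure but unknown content.
As a practical example, let's look at an SRT file (I am only using it as a guinea pig because its structure is very similar to what I need):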
1
00:00:12,759 --> 00:00:17,458
"some content here "
that continues here
2
00:00:18,298 --> 00:00:20,926
here we go again...
3
00:00:21,368 --> 00:00:24,565
...and this can go forever...
4
.
.
.
What I want to do is take the $content part from one file and put it in the right place in the second file.
Going back to the SRT example, having:
//file1
1
00:00:12,759 --> 00:00:17,458
"this is the italian content "
which continues in italian here
2
00:00:18,298 --> 00:00:20,926
here we go talking italian again ...
and
//file2
1
00:00:12,756 --> 00:00:17,433
"this is the spanish, chinese, or any content "
which continues in spanish, or chinese here
2
00:00:16,293 --> 00:00:20,96
here we go talking spanish, chinese or german again ...
would result in
//file3
1
00:00:12,756 --> 00:00:17,433
"this is the italian content "
which continues in italian here
"this is the spanish, chinese, or any content "
which continues in spanish, or chinese here
2
00:00:16,293 --> 00:00:20,96
here we go talking italian again ...
here we go talking spanish, chinese or german again ...
or, in more PHP-like terms:
$rowidentifier //unchanged
$some semi-fixed-string $somedelimiter $semi-fixed-string //unchanged, except maybe an option to choose if to keep file1 or file2 ...
$content //from file 1
$content //from file 2
So, after all this introduction, this is what I have (which is practically nothing):
$first_file = file('file1.txt'); // no need to comment right ?
$second_file = file('file2.txt'); // see above comment
$result_array = array(); // construct array
foreach($first_file as $key=>$value) //loop array and....
$result_array[]= trim($value)."\r\n".trim($second_file[$key]); //..here is my problem ...
// $Value is $content - but LINE BY LINE , and in our case, it could be 2-3- or even 4 lines
// should i go by delimiters /n/r ?? (not a good idea - how can i know they are there ?? )
// or should i go for regex to lookup for string patterns ? that is insane , no ?
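$fp = fopen('merge.txt', 'w+'); fwrite($fp, join("\r\n", $result_array)); fclose($fp);
This works line by line, which is not what I need. I need conditions. Also, I am sure this is not smart code and that there are many better ways to do it, so any help would be appreciated.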
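What you actually want to do is iterate over both files in parallel and then combine the parts that belong together.
But you can not use line numbers, because they can differ between the files. So you need to use the number of the entry (block) instead. That means you need a way to take the entries out of each file one after the other, each with its own number.
So you need an iterator over the data in question that is able to turn lines into blocks.
So instead of: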
foreach($first_file as $number => $line)
it is:
foreach($first_file_blocks as $number => $block)
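This can be done by writing your own iterator that takes the lines of a file as input and converts them into blocks on the fly. For that you need to parse the data. Here is a small example of a state-based parser that converts lines into blocks: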
$state = 0;
$blocks = array();
foreach($lines as $line)
{
switch($state)
{
case 0:
unset($block);
$block = array();
$blocks[] = &$block;
$block['number'] = $line;
$state = 1;
break;
case 1:
$block['range'] = $line;
$state = 2;
break;
case 2:
$block['text'] = '';
$state = 3;
# fall-through intended
case 3:
if ($line === '') {
$state = 0;
break;
}
$block['text'] .= ($block['text'] ? "\n" : '') . $line;
break;
default:
throw new Exception(sprintf('Unhandled %d.', $state));
}
}
unset($block);
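It just runs along the lines and changes its state. Based on that state, each line is processed as part of its block. When a new block starts, it is created. It works for the SRT file you outlined in your question (demo).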
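The parser above compares each line against '' to find the end of a block, so it expects $lines without trailing line endings. A minimal sketch of one way to prepare such an array follows. The sample data and the temporary file are assumptions made only to keep the example self-contained; in practice you would point file() at your own file.

```php
<?php
// Hypothetical sample data, written to a temporary file so the sketch runs on its own.
$path = tempnam(sys_get_temp_dir(), 'srt');
file_put_contents(
    $path,
    "1\n00:00:12,759 --> 00:00:17,458\nfirst line\nsecond line\n\n"
    . "2\n00:00:18,298 --> 00:00:20,926\nhere we go again...\n"
);

// file() keeps the line ending on every element, so strip "\r" and "\n".
// Blank separator lines then arrive as '' (the value the parser checks for).
$lines = array_map(
    function ($line) {
        return rtrim($line, "\r\n");
    },
    file($path)
);

unlink($path);
```

Stripping only "\r" and "\n" (rather than calling trim()) leaves leading and trailing spaces inside the subtitle text untouched.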
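To make this more flexible to use, turn it into an iterator that takes $lines in its constructor and offers the blocks while iterating. This needs a little adaptation of how the parser gets at the lines, but it generally works the same way.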
class SRTBlocks implements Iterator
{
private $lines;
private $current;
private $key;
public function __construct($lines)
{
if (is_array($lines))
{
$lines = new ArrayIterator($lines);
}
$this->lines = $lines;
}
public function rewind()
{
$this->lines->rewind();
$this->current = NULL;
$this->key = 0;
}
public function valid()
{
return $this->lines->valid();
}
public function current()
{
if (NULL !== $this->current)
{
return $this->current;
}
$state = 0;
$block = NULL;
while ($this->lines->valid() && $line = $this->lines->current())
{
switch($state)
{
case 0:
$block = array();
$block['number'] = $line;
$state = 1;
break;
case 1:
$block['range'] = $line;
$state = 2;
break;
case 2:
$block['text'] = '';
$state = 3;
# fall-through intended
case 3:
if ($line === '') {
$state = 0;
break 2;
}
$block['text'] .= ($block['text'] ? "\n" : '') . $line;
break;
default:
throw new Exception(sprintf('Unhandled %d.', $state));
}
$this->lines->next();
}
if (NULL === $block)
{
throw new Exception('Parser invalid (empty).');
}
$this->current = $block;
$this->key++;
return $block;
}
public function key()
{
return $this->key;
}
public function next()
{
$this->lines->next();
$this->current = NULL;
}
}
Basic usage is as follows; the output can be seen in the demo:
$blocks = new SRTBlocks($lines);
foreach($blocks as $index => $block)
{
printf("Block #%d:\n", $index);
print_r($block);
}
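Now all blocks in an SRT file can be iterated over. The only thing left is to iterate over both SRT files in parallel. Since PHP 5.3, the SPL ships with MultipleIterator to do this. It is now pretty straightforward; for the example I use the same lines twice: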
$multi = new MultipleIterator();
$multi->attachIterator(new SRTBlocks($lines));
$multi->attachIterator(new SRTBlocks($lines));
foreach($multi as $blockPair)
{
list($block1, $block2) = $blockPair;
echo $block1['number'], "\n", $block1['range'], "\n",
$block1['text'], "\n", $block2['text'], "\n\n";
}
Storing the string (instead of outputting it) into a file is rather trivial, so I left it out of the answer.
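For completeness, here is a minimal sketch of that last step. It assumes each pair has the same shape as in the loop above (two arrays with the keys number, range and text). The $pairs sample array stands in for the MultipleIterator, and the output path in $target is a hypothetical choice.

```php
<?php
// Sample block pairs standing in for what the MultipleIterator loop yields.
$pairs = array(
    array(
        array('number' => '1', 'range' => '00:00:12,756 --> 00:00:17,433',
              'text' => "italian line 1\nitalian line 2"),
        array('number' => '1', 'range' => '00:00:12,759 --> 00:00:17,458',
              'text' => "spanish line 1\nspanish line 2"),
    ),
    array(
        array('number' => '2', 'range' => '00:00:18,298 --> 00:00:20,926',
              'text' => 'italian again'),
        array('number' => '2', 'range' => '00:00:18,298 --> 00:00:20,926',
              'text' => 'spanish again'),
    ),
);

$merged = '';
foreach ($pairs as $blockPair) {
    list($block1, $block2) = $blockPair;
    // Number and range come from the first file, followed by both texts.
    $merged .= $block1['number'] . "\n" . $block1['range'] . "\n"
        . $block1['text'] . "\n" . $block2['text'] . "\n\n";
}

// Hypothetical output location; replace with wherever the result should go.
$target = sys_get_temp_dir() . '/merge.txt';
file_put_contents($target, $merged);
```

Building the whole string first keeps the sketch short. For very large files, appending each block with fwrite() inside the loop would avoid holding everything in memory.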
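So what to take away? First, sequential data (such as lines in a file) can easily be parsed in a loop with some state. This works not only for lines in a file but also for strings.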
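To illustrate the point about strings, here is a small sketch of the same idea applied to SRT data held in a string. It is a simplified variant and not the parser from above: the state names are made up for readability, the sample data is hypothetical, and the text is collected as an array of lines instead of one joined string.

```php
<?php
// Hypothetical in-memory SRT data; no file involved.
$srt = "1\n00:00:12,759 --> 00:00:17,458\nfirst line\nsecond line\n\n"
     . "2\n00:00:18,298 --> 00:00:20,926\nhere we go again...";

$blocks = array();
$block  = null;
$state  = 'number';

// \R matches any line ending (\n, \r\n or \r).
foreach (preg_split('/\R/', $srt) as $line) {
    if ($state === 'number') {
        if ($line === '') {
            continue; // tolerate extra blank lines between blocks
        }
        $block = array('number' => $line, 'range' => '', 'text' => array());
        $state = 'range';
    } elseif ($state === 'range') {
        $block['range'] = $line;
        $state = 'text';
    } elseif ($line === '') {
        // A blank line closes the current block.
        $blocks[] = $block;
        $block = null;
        $state = 'number';
    } else {
        $block['text'][] = $line;
    }
}
if ($block !== null) {
    $blocks[] = $block; // the last block has no trailing blank line
}
```

The loop itself does not care where the lines come from; only the way the input is split into lines changes.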
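Second, why do I suggest an iterator here? First of all, it is easy to use. It is only a small step from handling one file to handling two files in parallel. Next to that, an iterator can also operate on another iterator. Take the SplFileObject class, for example. It provides an iterator over all lines in a file. If you have large files, you can use SplFileObject (instead of arrays) and do not need to load both files into arrays first. This only needs one small addition to SRTBlocks that removes the trailing EOL characters from each line (at the top of the while loop in current()):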
$line = rtrim($line, "\n\r");
And it just works:
$multi = new MultipleIterator();
$multi->attachIterator(new SRTBlocks(new SplFileObject($file1)));
$multi->attachIterator(new SRTBlocks(new SplFileObject($file2)));
foreach($multi as $blockPair)
{
list($block1, $block2) = $blockPair;
echo $block1['number'], "\n", $block1['range'], "\n",
$block1['text'], "\n", $block2['text'], "\n\n";
}
With that done, you can even process really large files with (nearly) the same code. Flexible, isn't it? Full demo.
A shorter solution:
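// Assumption: loadFromString() expects the subtitle text itself, not a file name, so the files are read first.
$subtitles1 = \Done\Subtitles\Subtitles::loadFromString(file_get_contents('file1.srt'));
$subtitles2 = \Done\Subtitles\Subtitles::loadFromString(file_get_contents('file2.srt'));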
foreach ($subtitles2->getInternalFormat() as $block) {
$subtitles1->add($block['start'], $block['end'], $block['lines']);
}
echo $subtitles1->content('srt'); // merged and sorted srt file
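The advantage of this solution is that parsing .srt files comes with edge cases: different timestamp formats, extra new lines, and so on. All of these are handled automatically.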