PHP - Merging two TXT files based on a condition


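(Apologies in advance for the long question. The problem itself is simple, but explaining it may not be.)

My novice PHP skills are being challenged:

The input is 2 TXT files with the following structure: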

$rowidentifier //number,letter,string etc..
$some semi-fixed-string $somedelimiter $semi-fixed-string
$content //with unknown length or strings or lines number.

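By "semi-fixed string" above I mean a string with a known structure but unknown content.

As a practical example, take an SRT file (I am only using it as a guinea pig because its structure is very similar to what I need):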

1
00:00:12,759 --> 00:00:17,458
"some content here "
that continues here

2
00:00:18,298 --> 00:00:20,926
here we go again...

3
00:00:21,368 --> 00:00:24,565
...and this can go forever...

4
.
.
.

What I want to do is take the $content part from one file and put it in the right place in the second file.

Going back to the SRT example, having:

//file1 

    1
    00:00:12,759 --> 00:00:17,458
    "this is the italian content "
    which continues in italian here

    2
    00:00:18,298 --> 00:00:20,926
    here we go talking italian again ...

//file2 

    1
    00:00:12,756 --> 00:00:17,433
    "this is the spanish, chinese, or any content "
    which continues in spanish, or chinese here

    2
    00:00:16,293 --> 00:00:20,96
    here we go talking spanish, chinese or german again ...

would result in

//file3 

        1
        00:00:12,756 --> 00:00:17,433
        "this is the italian content "
        which continues in italian here
        "this is the spanish, chinese, or any content "
        which continues in spanish, or chinese here

        2
        00:00:16,293 --> 00:00:20,96
        here we go talking italian again ...
        here we go talking spanish, chinese or german again ...

or, in more PHP-like terms:

$rowidentifier //unchanged
$some semi-fixed-string $somedelimiter $semi-fixed-string //unchanged, except maybe an option to choose if to keep file1 or file2 ...
$content //from file 1
$content //from file 2

So, after all this introduction, here is what I have (which is practically nothing):

$first_file = file('file1.txt'); // no need to comment right ?
$second_file = file('file2.txt'); // see above comment
$result_array = array(); // construct array
foreach($first_file as $key=>$value) // loop array and....
    $result_array[] = trim($value)."\r".trim($second_file[$key]); //..here is my problem ...

// $value is $content - but LINE BY LINE, and in our case it could be 2, 3 or even 4 lines
// should i go by delimiters \n\r ?? (not a good idea - how can i know they are there ?? )
// or should i go for regex to look up string patterns ? that is insane, no ?

$fp = fopen('merge.txt', 'w+');
fwrite($fp, join("\r\n", $result_array));
fclose($fp);

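This goes line by line, which is not what I need. I need conditions. Also, I am sure this is not smart code and that there are much better ways to do it, so any help would be appreciated.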

php arrays text-files
2 Answers
3
votes

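What you actually want to do is iterate over both files in parallel and then combine the parts that belong together.

But you cannot use the line number, because it may differ between the files. You need to use the number of the entry (block) instead. So you need to give each entry a "number", or more precisely, take the entries out of a file one after the other.

So you need an iterator for the data in question that is able to turn lines into blocks.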

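So instead of:

foreach($first_file as $number => $line)

you iterate over blocks:

foreach($first_file_blocks as $number => $block)

This can be done by writing your own iterator that takes the lines of a file as input and converts them into blocks on the fly. For that you need to parse the data. Here is a small example of a state-based parser that turns lines into blocks: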

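$state = 0; // 0: expect number, 1: expect range, 2: first text line, 3: further text lines
$blocks = array();
foreach($lines as $line)
{
    switch($state)
    {
        case 0:
            // start a new block; unset() breaks the reference to the previous one
            unset($block);
            $block = array();
            $blocks[] = &$block;
            $block['number'] = $line;
            $state = 1;
            break;
        case 1:
            $block['range'] = $line;
            $state = 2;
            break;
        case 2:
            $block['text'] = '';
            $state = 3;
            # fall-through intended
        case 3:
            // an empty line ends the block
            if ($line === '') {
                $state = 0;
                break;
            }
            // compare against '' so that a text line of "0" still gets its separator
            $block['text'] .= ($block['text'] !== '' ? "\n" : '') . $line;
            break;
        default:
            throw new Exception(sprintf('Unhandled %d.', $state));
    }
}
unset($block);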

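It simply runs along the lines and changes its state. Depending on the state, each line is processed as part of its block. When a new block starts, it is created. It works with the SRT files you outlined in your question (demo).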
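The parser above compares each line against the empty string, so $lines has to arrive without trailing EOL characters. A minimal sketch of one way to produce such an array follows; the helper name readLines is hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper, not part of the original answer:
// read a file into an array of lines without EOL characters.
function readLines(string $path): array
{
    // FILE_IGNORE_NEW_LINES drops the newline at the end of each element.
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Also drop a "\r" that may be left over from Windows-style line endings.
    return array_map(function ($line) {
        return rtrim($line, "\r");
    }, $lines);
}
```

With that, $lines = readLines('file1.txt'); can be passed to the loop above, and a blank separator line shows up as ''.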

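To make this more flexible to use, turn it into an iterator that takes $lines in its constructor and provides the blocks while iterating. This requires a small change in how the parser gets at the lines, but it generally works the same way.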

class SRTBlocks implements Iterator
{
    private $lines;
    private $current;
    private $key;
    public function __construct($lines)
    {
        if (is_array($lines))
        {
            $lines = new ArrayIterator($lines);
        }
        $this->lines = $lines;
    }
    public function rewind()
    {
        $this->lines->rewind();
        $this->current = NULL;
        $this->key = 0;
    }
    public function valid()
    {
        return $this->lines->valid();
    }
    public function current()
    {
        if (NULL !== $this->current)
        {
            return $this->current;
        }
        $state = 0;
        $block = NULL;
        while ($this->lines->valid() && $line = $this->lines->current())
        {
            switch($state)
            {
                case 0:
                    $block = array();
                    $block['number'] = $line;
                    $state = 1;
                    break;
                case 1:
                    $block['range'] = $line;
                    $state = 2;
                    break;
                case 2:
                    $block['text'] = '';
                    $state = 3;
                    # fall-through intended
                case 3:
                    if ($line === '') {
                        $state = 0;
                        break 2;
                    }
                    $block['text'] .= ($block['text'] ? "\n" : '') . $line;
                    break;
                default:
                    throw new Exception(sprintf('Unhandled %d.', $state));
            }
            $this->lines->next();
        }
        if (NULL === $block)
        {
            throw new Exception('Parser invalid (empty).');
        }
        $this->current = $block;
        $this->key++;
        return $block;
    }
    public function key()
    {
        return $this->key;
    }
    public function next()
    {
        $this->lines->next();
        $this->current = NULL;
    }
}

Basic usage looks like this; the output can be seen in the demo:

$blocks = new SRTBlocks($lines); 
foreach($blocks as $index => $block)
{
    printf("Block #%d:\n", $index);
    print_r($block);
}

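Now all blocks in an SRT file can be iterated. The only thing left is to iterate over two SRT files in parallel. Since PHP 5.3, the SPL ships with MultipleIterator to do this. It is now quite straightforward; for the example I use the same lines twice: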

$multi = new MultipleIterator();
$multi->attachIterator(new SRTBlocks($lines));
$multi->attachIterator(new SRTBlocks($lines));

foreach($multi as $blockPair)
{
    list($block1, $block2) = $blockPair;
    echo $block1['number'], "\n", $block1['range'], "\n", 
        $block1['text'], "\n", $block2['text'], "\n\n";
}
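One thing to watch for when the two files do not contain the same number of blocks: by default, MultipleIterator stops as soon as one attached iterator is exhausted. The sketch below uses plain ArrayIterator objects with made-up values as stand-ins for the two SRTBlocks iterators. It shows the MIT_NEED_ANY flag, which keeps iterating and yields NULL for the side that has run out.

```php
<?php
// Stand-ins for two block iterators of different length (values are made up).
$first  = new ArrayIterator(['a1', 'a2', 'a3']);
$second = new ArrayIterator(['b1', 'b2']);

// MIT_NEED_ANY: keep going while at least one iterator is still valid.
$multi = new MultipleIterator(
    MultipleIterator::MIT_NEED_ANY | MultipleIterator::MIT_KEYS_NUMERIC
);
$multi->attachIterator($first);
$multi->attachIterator($second);

$pairs = [];
foreach ($multi as $pair) {
    // The exhausted side is NULL, so check before using it.
    $pairs[] = $pair;
}
```

Whether a missing counterpart should be skipped or written out on its own depends on the data, so the loop body needs an explicit NULL check either way.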

Storing the string in a file (instead of outputting it) is fairly trivial, so I leave that out of the answer.
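For completeness, here is a minimal sketch of that last step. The function names formatMergedBlock and writeMergedBlocks are hypothetical; the block layout ('number', 'range', 'text') is the one used by the parser above.

```php
<?php
// Hypothetical helper: render one merged entry in SRT layout.
function formatMergedBlock(array $block1, array $block2): string
{
    return $block1['number'] . "\n"
         . $block1['range'] . "\n"
         . $block1['text'] . "\n"
         . $block2['text'] . "\n\n";
}

// Hypothetical helper: write all merged entries to $path.
// $blockPairs is anything iterable that yields [block1, block2] pairs,
// for example the MultipleIterator from above (requires PHP 7.1+).
function writeMergedBlocks(iterable $blockPairs, string $path): void
{
    $out = '';
    foreach ($blockPairs as $pair) {
        list($block1, $block2) = $pair;
        $out .= formatMergedBlock($block1, $block2);
    }
    file_put_contents($path, $out);
}
```

It would be called as writeMergedBlocks($multi, 'merge.txt');. To keep the timing of the second file instead, as in the question's file3, use $block2['range'] in formatMergedBlock.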

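So what is the takeaway? First, sequential data (such as lines in a file) can easily be parsed in a loop with some state. This works not only for lines in a file but for strings as well.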
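As an illustration of the "loop with some state" idea applied to a string, here is a sketch that swaps the hand-written iterator for a generator. This is not the answer's implementation: the function name srtBlocks is hypothetical, it needs PHP 7.1+, and unlike the code above it skips extra blank lines between entries.

```php
<?php
// Hypothetical generator version of the state-based parser (PHP 7.1+).
// The state is implicit: which parts of $block have been filled in so far.
function srtBlocks(iterable $lines): Generator
{
    $block = null;
    foreach ($lines as $line) {
        $line = rtrim((string) $line, "\r\n");
        if ($block === null) {
            if ($line === '') {
                continue; // skip extra blank lines between entries
            }
            $block = ['number' => $line, 'range' => null, 'text' => ''];
        } elseif ($block['range'] === null) {
            $block['range'] = $line;
        } elseif ($line === '') {
            yield $block; // a blank line closes the entry
            $block = null;
        } else {
            $block['text'] .= ($block['text'] !== '' ? "\n" : '') . $line;
        }
    }
    if ($block !== null) {
        yield $block; // last entry without a trailing blank line
    }
}

// Parse a string by splitting it into lines first (sample data from the question).
$srt = "1\n00:00:12,759 --> 00:00:17,458\nfirst line\nsecond line\n\n"
     . "2\n00:00:18,298 --> 00:00:20,926\nhere we go again...\n";
$blocks = iterator_to_array(srtBlocks(explode("\n", $srt)), false);
```

Since generators implement Iterator, such a generator should be usable wherever an SRTBlocks object is used in this answer, with the limitation that a generator cannot be rewound once it has advanced.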

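Second, why do I suggest an iterator here? First, it is easy to use: it is only a small step from processing one file to processing two files in parallel. Besides that, an iterator can operate on another iterator as well, for example the SplFileObject class, which offers an iterator over all lines in a file. If you have large files, you can use SplFileObject directly (instead of arrays) and do not need to load both files into arrays first. This only takes a small addition to SRTBlocks that removes the trailing EOL characters from the end of each line: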

$line = rtrim($line, "\n\r");

With that, it works:

$multi = new MultipleIterator();
$multi->attachIterator(new SRTBlocks(new SplFileObject($file1)));
$multi->attachIterator(new SRTBlocks(new SplFileObject($file2)));

foreach($multi as $blockPair)
{
    list($block1, $block2) = $blockPair;
    echo $block1['number'], "\n", $block1['range'], "\n", 
        $block1['text'], "\n", $block2['text'], "\n\n";
}

With that in place, you can process even very large files with (nearly) the same code. Flexible, isn't it? (Full demo.)
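As an alternative to editing SRTBlocks for the EOL trimming mentioned above, the trimming could live in a small wrapper iterator. This is only a sketch: the class name TrimmedLines is hypothetical, and an ArrayIterator with made-up lines stands in for the SplFileObject.

```php
<?php
// Hypothetical wrapper: strips trailing EOL characters from every line
// produced by the inner iterator (for example an SplFileObject).
class TrimmedLines extends IteratorIterator
{
    // The attribute avoids a deprecation notice on PHP 8.1+;
    // older PHP versions read the line as a comment.
    #[\ReturnTypeWillChange]
    public function current()
    {
        return rtrim((string) parent::current(), "\r\n");
    }
}

// Stand-in for a file: lines that still carry their EOL characters.
$raw = new ArrayIterator(["1\r\n", "some text\n", "\n"]);
$trimmed = iterator_to_array(new TrimmedLines($raw), false);
```

It would be used as new SRTBlocks(new TrimmedLines(new SplFileObject($file1))), which leaves SRTBlocks itself unchanged.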


0
votes

A shorter solution:

// The arguments are file names, so load from file rather than from a string.
$subtitles1 = \Done\Subtitles\Subtitles::loadFromFile('file1.srt');
$subtitles2 = \Done\Subtitles\Subtitles::loadFromFile('file2.srt');

foreach ($subtitles2->getInternalFormat() as $block) {
    $subtitles1->add($block['start'], $block['end'], $block['lines']);
}
    
echo $subtitles1->content('srt'); // merged and sorted srt file

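The advantage of this solution is that parsing .srt files comes with edge cases: different timestamp formats, extra blank lines and so on. All of these are handled automatically.

Library used in the answer: https://github.com/mantas-done/subtitles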
