Algorithm for fixing a subtitle.srt file using the true paragraphs


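I have a subtitle.srt file, but its content is inaccurate. I also have a set of paragraphs that are accurate but not synchronized with the timestamps.

The inaccuracies have several causes, including:

  • mismatched capitalization,
  • extra words or characters,
  • missing words or characters,
  • missing punctuation,
  • and so on.

What method can I use to fix the srt file with the true text? Any algorithm suggestion would be welcome, independent of programming language.

I would really appreciate any help.

Example:

subtitle.srt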

1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few

2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating

3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission.

True text:

The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions.

Expected result: subtitle_corrected.srt

1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few

2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate

3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.

1 Answer

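This task is called alignment, and it is common in fields such as biology (comparing two DNA sequences) and natural language processing (for example, the present case of two parallel subtitle sources).

The task has been studied for many years (going back to the 1970s), and many algorithms have been developed for it. These algorithms have been implemented in all major programming languages.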
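
To give a feel for what such an algorithm does, here is a minimal sketch of Needleman-Wunsch global alignment in plain Python. The function name `needleman_wunsch` and the scoring values (match 1, mismatch -1, gap -1) are my own assumptions for illustration, not taken from any particular library, and the quadratic table makes it suitable only for short inputs.

```python
def needleman_wunsch(query, target, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences (strings or lists of words).

    Returns a list of (query_idx, target_idx) pairs; None marks a gap,
    i.e. an element that exists in only one of the two sequences.
    The scoring values are illustrative assumptions.
    """
    n, m = len(query), len(target)

    def sub(i, j):
        # score for pairing query[i-1] with target[j-1]
        return match if query[i - 1] == target[j - 1] else mismatch

    # score[i][j] = best score for aligning query[:i] with target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two elements
                score[i - 1][j] + gap,            # gap in target
                score[i][j - 1] + gap,            # gap in query
            )

    # walk back from the bottom-right corner to recover one optimal alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs
```

Calling `needleman_wunsch("climate", "climating")` returns index pairs that can be read like a character-level alignment listing: each pair says which character of the correct text lines up with which character of the inaccurate text. Smith-Waterman differs mainly in that it aligns the best-matching local region instead of the full sequences.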

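One such implementation is the Python library text_alignment_tool, which implements the dynamic programming algorithms Smith-Waterman and Needleman-Wunsch. The code below shows how to use the Smith-Waterman algorithm (called LocalAlignmentAlgorithm in the library) on the subtitles. With the correct text as the query and the inaccurate text as the target, the algorithm produces an alignment of the following kind: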

syntax: position, character in query > position, character in target
119 T > 114 t
120 h > 115 h
121 e > 116 e
122   > 117  
123 h > 118 h
124 e > 119 e
125 a > 120 a
126 t > 121 t
127   > 122  
128 w > 123 w
129 a > 124 a
130 v > 125 v
131 e > 126 e
[...]
162 o > 157 o
163 f > 158 f
164   > 159  
165 c > 160 c
166 l > 161 l
167 i > 162 i
168 m > 163 m
169 a > 164 a
170 t > 165 t
171 e > 168 g

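Most of the code is bookkeeping: it builds a plain-text version of the inaccurate subtitles while tracking character positions and timestamps, and then rebuilds the subtitle format.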

# Import the tool and necessary classes
from text_alignment_tool import (
    TextAlignmentTool,
    StringTextLoader,
    LocalAlignmentAlgorithm,
)

subtitle_srt = """1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few

2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating

3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission."""

correct_text = """The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions."""

# list with information about each subtitle fragment
fragments_info = []
# keep track of character positions for each sentence in the original subtitles 
current_pos = 0
# collect a list of just the text without the number and timestamp
all_lines = list()

# split on two newlines to get each block of nr+timestamp_sentence
for fragment in subtitle_srt.split("\n\n"):
    # split each block into number, timestamp and sentence
    (fragment_nr, timestamp, fragment_txt) = fragment.splitlines()
    # add the sentence to the list of sentences
    all_lines.append(fragment_txt)
    # keep track of new position: old position plus length of current sentence
    newpos = current_pos + len(fragment_txt)
    # add number, timestamp and position to the list with information about fragments
    fragments_info.append({"number": fragment_nr, "timestamp": timestamp, "end_position": newpos})
    # update position variable to use in next iteration
    current_pos = newpos + 1

# create a multi-line string with only the sentences to use for alignment
target_text = "\n".join(all_lines)

print(target_text)
print("---------------------------")
print(fragments_info)
print("---------------------------")

# load the two text strings for use in the alignment library
query_1 = StringTextLoader(correct_text)
target_1 = StringTextLoader(target_text)
# initialize the alignment for the two texts
aligner_1 = TextAlignmentTool(query_1, target_1)
# select an alignment algorithm
local_alignment_algorithm = LocalAlignmentAlgorithm()
# perform the actual alignment
aligner_1.align_text(local_alignment_algorithm)

# extract character-level alignment positions
alm = aligner_1.collect_all_alignments()
alm_idxs = alm[0][0]

# reconstruct the subtitles using the alignment

# keep track of the fragment number and the position in the correct text
fragment_nr = 0
start_pos = 0
# loop over each aligned character pair
for x in alm_idxs.query_to_target_mapping.alignments:
    # stop once every fragment has been written; without this check, any
    # aligned pair after the last fragment boundary would raise an IndexError
    if fragment_nr >= len(fragments_info):
        break
    # if the position in the original subtitle (=target) is the end of a fragment
    # then write a subtitle line using the position in the correct text (=query)
    if x.target_idx >= fragments_info[fragment_nr]["end_position"] - 1:
        print(fragments_info[fragment_nr]["number"])
        print(fragments_info[fragment_nr]["timestamp"])
        print(correct_text[start_pos:x.query_idx + 1])
        # blank line that separates fragments in the SRT format
        print()
        # update the start position and fragment number for the next fragment;
        # the +2 skips the space that follows the fragment in the correct text
        start_pos = x.query_idx + 2
        fragment_nr += 1

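The output of the code, showing the plain-text version of the inaccurate subtitles, the list with information about each fragment, and the reconstructed subtitles: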

Heat wave is expect to continue for the next a few
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating
change, the need to take action to reduce greenhouse gas emission.
---------------------------
[{'number': '1', 'timestamp': '00:00:00,000 --> 00:00:04,320', 'end_position': 50}, {'number': '2', 'timestamp': '00:00:04,320 --> 00:00:07,920', 'end_position': 169}, {'number': '3', 'timestamp': '00:00:07,920 --> 00:00:13,760', 'end_position': 236}]
---------------------------
1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few

2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate

3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.

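This code is written in Python, and the text_alignment_tool library has some quirks and is not well documented (disclaimer: I am not affiliated with the library in any way). The code works as a proof of concept, but it may not be the best solution in every case.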

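However, as mentioned above, these algorithms are widely available in libraries for many different programming languages, so with the right search terms (alignment, Needleman-Wunsch) you should be able to write something similar that fits your needs.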
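
As one illustration of "something similar", here is a rough sketch that uses only the Python standard library. Note that it swaps in a different technique: it aligns whole words with `difflib.SequenceMatcher` (a longest-matching-block heuristic, not Smith-Waterman or Needleman-Wunsch). The helper names `normalize` and `correct_srt` are hypothetical, and the sketch assumes every SRT block has a number line, a timestamp line, and at least one text line.

```python
import difflib
import re


def normalize(word):
    """Lower-case a word and drop punctuation so near-identical words compare equal."""
    return re.sub(r"[^\w]", "", word.lower())


def correct_srt(srt_text, correct_text):
    """Return SRT text with the original timestamps and words from correct_text."""
    # parse each block into [number, timestamp, text]; text may span several lines
    blocks = [b.split("\n", 2) for b in srt_text.strip().split("\n\n")]

    bad_words, boundaries = [], []
    for _number, _timestamp, text in blocks:
        bad_words.extend(text.split())
        boundaries.append(len(bad_words))  # word index where this fragment ends

    good_words = correct_text.split()
    matcher = difflib.SequenceMatcher(
        None,
        [normalize(w) for w in bad_words],
        [normalize(w) for w in good_words],
        autojunk=False,
    )
    opcodes = matcher.get_opcodes()

    def map_boundary(b):
        # translate a word boundary in the inaccurate text into the correct text
        for _tag, i1, i2, j1, j2 in opcodes:
            if i1 <= b <= i2:
                if b == i2:
                    return j2
                return j1 + min(b - i1, j2 - j1)
        return len(good_words)

    out = []
    start = 0
    for idx, (number, timestamp, _text) in enumerate(blocks):
        # the last fragment always takes whatever correct words remain
        if idx == len(blocks) - 1:
            end = len(good_words)
        else:
            end = map_boundary(boundaries[idx])
        out.append(f"{number}\n{timestamp}\n{' '.join(good_words[start:end])}")
        start = end
    return "\n\n".join(out)
```

Calling `correct_srt(subtitle_srt, correct_text)` with the two strings defined in the code above returns the corrected subtitles as a single string. Working on words instead of characters keeps the bookkeeping small, at the cost of never splitting a fragment in the middle of a word.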
