Python: split long TTS input text into string chunks under a given character limit

Question · 0 votes · 4 answers

Google Text-to-Speech (TTS) has a 5000-character limit, and my text is about 50k characters. I need to split the string into chunks under a given limit without cutting off words.

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

How do I split the string above into a list of strings of no more than 20 characters each, without cutting off words?

I looked at the chunking section of the NLTK library but didn't see anything that fits.

Tags: python, string, list
4 Answers
7 votes

This is similar to Green Cloak Guy's idea, but it uses a generator instead of building a list, which should be more memory-friendly for large texts and lets you iterate over the chunks lazily. You can convert it to a list with list(), or use it anywhere an iterator is expected:

s = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

def get_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        # find the last space within the next maxlength characters
        end = s.rfind(" ", start, start + maxlength + 1)
        yield s[start:end]
        start = end + 1
    yield s[start:]

chunks = get_chunks(s, 25)

# Make a list of (chunk, length) pairs:
[(n, len(n)) for n in chunks]

Result:

[('Well, Prince, so Genoa', 22),
 ('and Lucca are now just', 22),
 ('family estates of the', 21),
 ('Buonapartes. But I warn', 23),
 ('you, if you don’t tell me', 25),
 ('that this means war, if', 23),
 ('you still try to defend', 23),
 ('the infamies and horrors', 24),
 ('perpetrated by that', 19),
 ('Antichrist—I really', 19),
 ('believe he is', 13),
 ('Antichrist—I will have', 22),
 ('nothing more to do with', 23),
 ('you and you are no longer', 25),
 ('my friend, no longer my', 23),
 ('‘faithful slave,’ as you', 24),
 ('call yourself! But how do', 25),
 ('you do? I see I have', 20),
 ('frightened you—sit down', 23),
 ('and tell me all the news.', 25)]
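Applied to the original problem, the same generator can be driven with the real 5000-character limit. A minimal sketch: long_text and synthesize are stand-ins of mine, not part of the answer or any real TTS API.

long_text = " ".join([s] * 100)  # stand-in for the ~50k-character input

for chunk in get_chunks(long_text, 5000):
    # each chunk stays within the limit as long as no single word exceeds it;
    # synthesize() is a hypothetical placeholder for the TTS request
    synthesize(chunk)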

5 votes

The Pythonic approach is to look ahead 20 characters, find the last possible whitespace, and cut the line there. It's not a very elegant implementation, but it should get the job done:

orig_string = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    # cut at the last space within the first max_length characters
    line_length = orig_string[:max_length].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)
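A quick sanity check of the result (added here for illustration, not part of the original answer):

print(list_of_lines[:3])
# ['Well, Prince, so', 'Genoa and Lucca are', 'now just family']
assert all(len(line) <= max_length for line in list_of_lines)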

0 votes

Building on Mark's answer: there seems to be a small bug in how the code handles the end of the search, so something like this should work:

def text_to_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        if end == -1:
            break  # no space found in the window: emit the rest as-is
        yield s[start:end]
        start = end + 1
    yield s[start:]
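The difference shows up when a single word is longer than maxlength, so rfind finds no space in the window. A small illustration with a made-up input:

s2 = "supercalifragilisticexpialidocious and more"
print(list(get_chunks(s2, 10)))
# ['supercalifragilisticexpialidocious and mor',
#  'supercalifragilisticexpialidocious and more']   <- truncated chunk plus duplicate
print(list(text_to_chunks(s2, 10)))
# ['supercalifragilisticexpialidocious and more']

Neither version can actually split a word longer than the limit, but the fixed version at least returns the remaining text intact instead of a corrupted chunk.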

-2 votes

You can use the nltk.tokenize methods:

import nltk

# the tokenizers may require: nltk.download('punkt')

corpus = '''
Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.
'''

tokens = nltk.tokenize.word_tokenize(corpus)

sent_tokens = nltk.tokenize.sent_tokenize(corpus)
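On its own this only tokenizes the text; to respect the character limit you would still need to pack the sentences into chunks. A minimal sketch of that step (pack_sentences is my name for it), assuming every individual sentence fits under the limit:

def pack_sentences(sentences, maxlength):
    # greedily join sentences into chunks of at most maxlength characters
    chunk = ""
    for sent in sentences:
        candidate = f"{chunk} {sent}".strip()
        if chunk and len(candidate) > maxlength:
            yield chunk
            chunk = sent
        else:
            chunk = candidate
    if chunk:
        yield chunk

chunks = list(pack_sentences(sent_tokens, 5000))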