Efficiently splitting a long document into shorter documents on whitespace and a specified character length

Question

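I need to write a function that splits a long document into shorter documents at whitespace characters (\s), based on a pre-specified character length.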

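For example, I have a text document containing 175,000,000 characters (including all punctuation and whitespace). I want to split it into shorter documents of roughly 100,000 characters each.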

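Of course, the splits will not fall exactly at the 100,000th / 200,000th / 300,000th ... characters, because whitespace may not occur exactly at those positions. If there is no whitespace character at the desired split point (for example, if the 100,000th character is not whitespace), the function should look for the nearest whitespace character to the left and split there. Below is my attempt at the function, but it appears to be very slow.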

import re

whitespace_regex = re.compile(r"\s")

def foo(text):
    # If a document is 100000 character long or shorter
    # no splitting is needed
    if len(text) <= 100000:
        yield text
    # Splitting if a document is longer than 100000 characters
    elif len(text) > 100000:
        # A while loop until there is nothing left to be split
        while len(text) > 100000:
            # Split a document into two segments: 
            #     left: 100000 character long
            #     text: the rest of the document
            left, text = text[:100000], text[100000:]

            # Look for the rightmost whitespace character in the 'left'
            # segment by first reversing the string so that the whitespace
            # returned by the regex search is the rightmost whitespace
            whitespace = whitespace_regex.search(left[::-1])

            # Get the start index of the returned whitespace in the reversed
            # string and negate it. If -index is 0, the last character of
            # 'left' is whitespace and 'left' is yielded unchanged
            index = whitespace.start()
            index = -index
            # if the whitespace is not exactly at the desired position,
            # yield the part to the left of the whitespace character, and
            # combine the part of the left segment to the right of the
            # whitespace character with the rest of the remaining text
            if index < 0:
                text = left[index:] + text
                left = left[:index]
            yield left
        if text:
            yield text        

I tested the speed on a document with 175,000,000 characters, and it took almost 6 minutes to split the document:

import time

a = "John did what others told him to do" * 5000000
print(f"Document's length is {len(a)}")
#Document's length is 175000000

start_time = time.time()
segs = [x for x in foo(a)]
print(time.time() - start_time)
#344.3530957698822

I would like to know whether there is a way to write a more efficient function for this.

python performance
1 Answer

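The running time scales badly with the size of the document. For example, a document 10 times smaller ran in 0.3 seconds on my machine, while one of your size took 144.8 seconds. I think the problem is that every time you cut a piece off the left, the rest of the text gets moved (copied into a new string). So one solution might be to start cutting from the back. Another solution is to cut the string into smaller blocks and run your function on each of them: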

from time import time

a = "John did what others told him to do" * 5000000
print(f"Document's length is {len(a)}")
#Document's length is 175000000

start_time = time()
segs = []
block_size = 17500000
start_ind = 0
while start_ind < len(a):
    block = a[start_ind : start_ind + block_size]
    block_segs = list(foo(block))
    is_last_block = start_ind + block_size >= len(a)
    if is_last_block or len(block_segs) == 1:
        # Keep every segment of this block
        segs += block_segs
        start_ind += len(block)
    else:
        # The block's final segment was cut at the block boundary rather than
        # at whitespace, so drop it and start the next block where it began
        segs += block_segs[:-1]
        start_ind += len(block) - len(block_segs[-1])

print(time() - start_time)

This takes about 3.0 seconds. Quick and dirty, I guess...
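If the repeated copying described above really is the bottleneck, another option is to keep the text untouched and only move a start index forward, so that each step copies just the segment it yields. Below is a minimal sketch of that idea, not a benchmarked solution. The names split_on_whitespace and max_len are made up for illustration. It uses str.isspace() instead of the \s regex, and it assumes that a window containing no whitespace at all should simply be cut at max_len (the original foo would raise an error in that case). As in foo, the whitespace character at the split point stays at the end of the left-hand segment.

```python
from typing import Iterator


def split_on_whitespace(text: str, max_len: int = 100_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_len` characters.

    Hypothetical helper: pieces end just after the rightmost whitespace
    character inside each window of `max_len` characters.
    """
    start = 0
    n = len(text)
    while n - start > max_len:
        end = start + max_len
        # Walk left from the end of the window until the character just
        # before the cut position is whitespace.
        cut = end
        while cut > start and not text[cut - 1].isspace():
            cut -= 1
        if cut == start:
            # Assumption: no whitespace in this window, so cut at max_len.
            cut = end
        # Only the yielded piece is copied; the remainder is never re-sliced.
        yield text[start:cut]
        start = cut
    if start < n:
        yield text[start:]
```

Usage would look like segs = list(split_on_whitespace(a)). Joining the segments gives back the original text, so nothing is lost at the split points. The backward scan is a plain Python loop, so it is only cheap when whitespace is reasonably frequent, as in ordinary prose.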
