Efficiently splitting a long document into multiple short documents based on whitespace and a specified character length

Question

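I need to write a function that splits a long document into shorter documents at whitespace characters (\s), based on a pre-specified character length.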

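For example, I have a text document containing 175,000,000 characters (including all punctuation and whitespace). I want to split it into shorter documents of about 100,000 characters each.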

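Of course, the splits will not fall exactly at the 100,000th / 200,000th / 300,000th ... characters, because a whitespace character may not be located exactly at those positions. If there is no whitespace at the desired split point (for example, if the 100,000th character is not whitespace), the function looks for the nearest whitespace character to the left and splits there. Below is my attempt at this function, but it turns out to be very slow.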

import re

whitespace_regex = re.compile(r"\s")

def foo(text):
    # If a document is 100000 character long or shorter
    # no splitting is needed
    if len(text) <= 100000:
        yield text
    # Splitting if a document is longer than 100000 characters
    elif len(text) > 100000:
        # A while loop until there is nothing left to be split
        while len(text) > 100000:
            # Split a document into two segments: 
            #     left: 100000 character long
            #     text: the rest of the document
            left, text = text[:100000], text[100000:]

            # Look for the rightmost whitespace character in the 'left'
            # segment by first reversing the string so that the whitespace
            # returned by the regex search is the rightmost whitespace
            whitespace = whitespace_regex.search(left[::-1])

            # Get the start index of the returned whitespace. If -index
            # is 0, then that means pro
            index = whitespace.start()
            index = -index
            # if the whitespace is not exactly at the desired position,
            # yield the part to the left of the whitespace character, and
            # combine the part of the left segment to the right of the
            # whitespace character with the rest of the remaining text
            if index < 0:
                text = left[index:] + text
                left = left[:index]
            yield left
        if text:
            yield text

I tested the speed on a document of 175,000,000 characters, and it took nearly 6 minutes to split it:

import time

a = "John did what others told him to do" * 5000000
print(f"Document's length is {len(a)}")
#Document's length is 175000000

start_time = time.time()
segs = [x for x in foo(a)]
print(time.time() - start_time)
#344.3530957698822

I'd like to know whether there is a way to write a more efficient function for this.

Answer 1

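The run time scales badly with the size of the document. For example, on my machine a document 10 times smaller ran in 0.3 seconds, whereas a document of your size took 144.8 seconds. I think the problem is that every time you cut a piece off the left, the rest of the text gets moved: Python strings are immutable, so text[100000:] copies the entire remainder on every iteration. One solution might therefore be to start cutting from the back. Another solution is to cut the text into smaller blocks and then run your function on each of them: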

from time import time

a = "John did what others told him to do" * 5000000
print(f"Document's length is {len(a)}")
#Document's length is 175000000

start_time = time()
segs = []
block_size = 17500000
start_ind = 0
# Run foo() on one block at a time so it never re-copies the whole document
while start_ind < len(a):
    block = a[start_ind : start_ind + block_size]
    block_segs = [x for x in foo(block)]

    if start_ind + block_size >= len(a):
        # Final block: keep every segment, including the last one
        segs += block_segs
        break

    # The last segment was cut short by the block boundary, so do not emit
    # it; start the next block where that segment begins instead
    segs += block_segs[:-1]
    start_ind += block_size - len(block_segs[-1])

print(time() - start_time)

This takes about 3.0 seconds. Quick and dirty, I suppose...
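
The other idea, avoiding the repeated copying altogether, can be sketched by keeping an index into the original string instead of re-slicing the remainder. This is only a minimal sketch, not a benchmarked implementation: the name split_on_whitespace and its max_len parameter are made up for illustration, whitespace is detected with str.isspace() rather than the compiled regex, and the hard cut when a window contains no whitespace at all is my own assumption (the function in the question would raise an error in that case). Like the original, each piece ends just after the whitespace character it was split on.

```python
def split_on_whitespace(text, max_len=100_000):
    """Yield consecutive pieces of `text`, each at most `max_len` characters,
    cutting just after the rightmost whitespace inside each window."""
    start, n = 0, len(text)
    while n - start > max_len:
        end = start + max_len
        # Walk backwards from the end of the window until the character
        # just before `cut` is whitespace (or the window is exhausted).
        cut = end
        while cut > start and not text[cut - 1].isspace():
            cut -= 1
        if cut == start:
            # Assumption: no whitespace in this window, so cut at max_len.
            cut = end
        yield text[start:cut]
        start = cut
    # Whatever is left is at most max_len characters long.
    yield text[start:]


print(list(split_on_whitespace("aa bb cc dd", 5)))
# ['aa ', 'bb ', 'cc dd']
```

Because only the yielded pieces are sliced out, the remainder of the document is never copied, so for ordinary text the work should grow roughly linearly with the document length.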
