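I need to write a function that splits a long document into shorter documents on whitespace characters (\s), based on a pre-specified character length.
For example, I have a text document containing 175,000,000 characters (including all punctuation and whitespace). I want to split it into shorter documents of roughly 100,000 characters each.
Of course, the split points will not fall exactly at the 100,000th / 200,000th / 300,000th ... characters, because a whitespace character may not be located exactly at those positions. If there is no whitespace at the desired split point (for example, if the 100,000th character is not whitespace), the function looks for the nearest whitespace character to the left and splits there. Below is my attempt at the function, but it seems to be very slow.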
import re

whitespace_regex = re.compile(r"\s")

def foo(text):
    # If a document is 100000 characters long or shorter,
    # no splitting is needed
    if len(text) <= 100000:
        yield text
    # Split if a document is longer than 100000 characters
    elif len(text) > 100000:
        # Loop until there is nothing left to be split
        while len(text) > 100000:
            # Split the document into two segments:
            # left: 100000 characters long
            # text: the rest of the document
            left, text = text[:100000], text[100000:]
            # Look for the rightmost whitespace character in the 'left'
            # segment by first reversing the string, so that the whitespace
            # returned by the regex search is the rightmost whitespace.
            # Note: this assumes 'left' contains whitespace; otherwise
            # search() returns None and .start() raises AttributeError
            whitespace = whitespace_regex.search(left[::-1])
            # Get the start index of the returned whitespace and negate it.
            # If -index is 0, the whitespace is the last character of 'left'
            index = whitespace.start()
            index = -index
            # If the whitespace is not exactly at the desired position,
            # yield the part up to and including the whitespace character,
            # and combine the part of the left segment to the right of the
            # whitespace character with the rest of the remaining text
            if index < 0:
                text = left[index:] + text
                left = left[:index]
            yield left
        if text:
            yield text
I timed it on a document with 175,000,000 characters, and it took nearly 6 minutes to finish splitting the document:
import time

a = "John did what others told him to do" * 5000000
print(f"Document's length is {len(a)}")
# Document's length is 175000000
start_time = time.time()
segs = [x for x in foo(a)]
print(time.time() - start_time)
# 344.3530957698822
I would like to know whether there is a way to write a more efficient function for this.
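The problem scales badly with the size of the document. For example, a document 10 times smaller ran in 0.3 seconds on my machine, while one of your size took 144.8 seconds. I think the problem is that every time you cut a piece off the left, the rest of the text gets moved: each slice text[100000:] builds a new string from the whole remainder, so the total amount of copying grows quadratically with the document length. So one solution might be to start cutting from the back. Another solution is to cut the text into smaller blocks and run your function on each of them: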
from time import time

a = "John did what others told him to do" * 5000000
print(f"Document's length is {len(a)}")
# Document's length is 175000000
start_time = time()
segs = []
block_size = 17500000
start_ind = 0
# Loop until the text is consumed; a fixed block count could stop early,
# because each block's trailing remainder is carried into the next block
while start_ind < len(a):
    block = a[start_ind : start_ind + block_size]
    block_segs = [x for x in foo(block)]
    if start_ind + block_size >= len(a):
        # Final block: keep every segment, including the last one
        segs += block_segs
        break
    # The last segment may be a short remainder, so do not keep it;
    # the next block starts where that remainder begins
    start_ind += len(block) - len(block_segs[-1])
    segs += block_segs[:-1]
print(time() - start_time)
This takes about 3.0 seconds. Quick and dirty, I guess...
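The other idea mentioned above, not moving the rest of the text at all, can be sketched by tracking a start index into the original string instead of re-slicing the remainder. This is only a minimal sketch and has not been benchmarked here. The names `split_on_whitespace` and `limit` are made up for illustration. It uses `str.isspace()` rather than the `\s` regex, which is assumed to be close enough for this purpose, and it falls back to a hard cut when a window contains no whitespace, which is a choice the original function does not make.

```python
def split_on_whitespace(text, limit=100000):
    """Yield pieces of at most `limit` characters, cut after whitespace.

    Hypothetical helper: only the yielded pieces are copied, the
    remainder of `text` is never re-sliced.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    start = 0
    n = len(text)
    while n - start > limit:
        end = start + limit
        cut = end
        # Walk left from the end of the window to the nearest whitespace;
        # the whitespace character stays at the end of the yielded piece
        while cut > start and not text[cut - 1].isspace():
            cut -= 1
        if cut == start:
            # Assumption: no whitespace in this window, so cut at the limit
            cut = end
        yield text[start:cut]
        start = cut
    if start < n:
        # Whatever is left is at most `limit` characters long
        yield text[start:]


print(list(split_on_whitespace("aa bb cc dd", 4)))
# ['aa ', 'bb ', 'cc ', 'dd']
```

Because each character is copied only once into its own piece, the total work should grow roughly linearly with the document length, as long as whitespace occurs reasonably often. Unlike `foo`, this sketch yields nothing for an empty input string.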