Python optimization of count() for n-grams

Question

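I am trying to use count() to count the items in a list of strings and sort the results from largest to smallest. The function performs well on small lists but does not scale at all, as the small experiment below shows: it ran for only 5 rounds of increasing the input length, because the wait for a 6th round was too long. Is there a way to optimize the first list comprehension, or is there an alternative to count() that scales better?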

import nltk
from operator import itemgetter
import time

t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."

unigrams = nltk.word_tokenize(t.lower())

for size in range(1, 6):

    unigrams = unigrams*size  # cumulative growth: 1x, 2x, 6x, 24x, 120x the original length

    start = time.time()

    unigram_freqs = [unigrams.count(word) for word in unigrams]    
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]

    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

# Runtime: 0.001s for 1x the size
# Runtime: 0.003s for 2x the size
# Runtime: 0.022s for 3x the size
# Runtime: 0.33s for 4x the size 
# Runtime: 8.065s for 5x the size

Tags: python, sorting, optimization, counting, n-gram

1 Answer

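Using Counter from collections and sorting with its most_common() method, I get runtimes of about 0 seconds (rounded to three decimal places) regardless of the input length: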

import nltk
nltk.download('punkt')


from operator import itemgetter
from collections import Counter
import time
t = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Curabitur pretium tincidunt lacus. Nulla gravida orci a odio. Nullam varius, turpis et commodo pharetra, est eros bibendum elit, nec luctus magna felis sollicitudin mauris. Integer in mauris eu nibh euismod gravida. Duis ac tellus et risus vulputate vehicula. Donec lobortis risus a elit. Etiam tempor. Ut ullamcorper, ligula eu tempor congue, eros est euismod turpis, id tincidunt sapien risus a quam. Maecenas fermentum consequat mi. Donec fermentum. Pellentesque malesuada nulla a mi. Duis sapien sem, aliquet nec, commodo eget, consequat quis, neque. Aliquam faucibus, elit ut dictum aliquet, felis nisl adipiscing sapien, sed malesuada diam lacus eget erat. Cras mollis scelerisque nunc. Nullam arcu. Aliquam consequat. Curabitur augue lorem, dapibus quis, laoreet et, pretium ac, nisi. Aenean magna nisl, mollis quis, molestie eu, feugiat in, orci. In hac habitasse platea dictumst."

unigrams = nltk.word_tokenize(t.lower())

for size in range(1, 5):

    unigrams = unigrams*size  # cumulative growth: 1x, 2x, 6x, 24x the original length

    start = time.time()

    unigram_freqs = [unigrams.count(word) for word in unigrams]    
    freq_pairs = set((zip(unigrams, unigram_freqs)))
    freq_pairs = sorted(freq_pairs, key=itemgetter(1))[::-1]

    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Slow Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

    start = time.time()
    a = Counter(unigrams).most_common()
    #print(a)
    end = time.time()

    time_elapsed = round(end-start, 3)

    print("Fast Runtime: " + str(time_elapsed) + "s for " + str(size) + "x the size")

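Slow Runtime: 0.003s for 1x the size
Fast Runtime: 0.0s for 1x the size
Slow Runtime: 0.006s for 2x the size
Fast Runtime: 0.0s for 2x the size
Slow Runtime: 0.157s for 3x the size
Fast Runtime: 0.0s for 3x the size
Slow Runtime: 1.891s for 4x the size
Fast Runtime: 0.001s for 4x the size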
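
The gap comes from how the two approaches scan the data. list.count() walks the whole list on every call, so calling it once per token costs time quadratic in the list length, while Counter builds a hash map in a single pass. The sketch below is only an illustration under some assumptions: it tokenizes a toy sentence with str.split() instead of nltk, and the helper names count_slow, count_fast, and ngram_counts are made up for this example. It checks that both approaches produce the same counts and shows one possible way to extend the idea from unigrams to n-grams.

```python
from collections import Counter


def count_slow(tokens):
    # One full scan of the list per token: quadratic in len(tokens).
    pairs = set(zip(tokens, (tokens.count(tok) for tok in tokens)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def count_fast(tokens):
    # Single pass to build the counts, then a sort over the distinct tokens only.
    return Counter(tokens).most_common()


def ngram_counts(tokens, n):
    # Slide a window of width n over the tokens and count the resulting tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


# Toy input (an assumption; the thread uses nltk.word_tokenize on a longer text).
tokens = "the cat sat on the mat and the cat slept".split()

# Both approaches agree on the counts; only the order of ties may differ.
assert dict(count_slow(tokens)) == dict(count_fast(tokens))

print(count_fast(tokens)[:2])       # [('the', 3), ('cat', 2)]
print(ngram_counts(tokens, 2)[:1])  # [(('the', 'cat'), 2)]
```

Because most_common() already returns the pairs sorted from the highest count to the lowest, the separate set(), sorted(), and [::-1] steps from the question are not needed.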
