Optimizing a dictionary and the zip() function

Question: 0 votes, 3 answers

I have the following function:

import os
import json
import re

def filetxt():
    word_freq = {}
    lvl1      = []
    lvl2      = []
    total_t   = 0
    users     = 0
    text      = []

    for l in range(0,500):
        # Open File
        if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                for i in range(len(text_f)):
                    text.append(text_f[str(i)]['text'])
                    total_t = total_t + 1
        else:
            pass

    # Filter
    occ = 0
    import string
    for i in range(len(text)):
        s = text[i] # Sample string
        a = re.findall(r'(RT)',s)
        b = re.findall(r'(@)',s)
        occ = len(a) + len(b) + occ
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("",""), string.punctuation)


        # Create Wordlist/Dictionary
        word_list = text[i].lower().split(None)

        for word in word_list:
            word_freq[word] = word_freq.get(word, 0) + 1

        keys = word_freq.keys()

        numbo = range(1,len(keys)+1)
        WList = ', '.join(keys)
        NList = str(numbo).strip('[]')
        WList = WList.split(", ")
        NList = NList.split(", ")
        W2N = dict(zip(WList, NList))

        for k in range (0,len(word_list)):
            word_list[k] = W2N[word_list[k]]
        for i in range (0,len(word_list)-1):
            lvl1.append(word_list[i])
            lvl2.append(word_list[i+1])

I used a profiler and found that the largest share of CPU time seems to be spent in the zip() function and in the join and split parts of the code. I'd like to know whether there is anything I've overlooked that would let me clean up the code and make it more efficient, since the biggest slowdown appears to be in how I'm using the dictionary and zip(). Any help would be greatly appreciated, thanks!
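
For reference, a minimal sketch of this kind of measurement using the standard library's cProfile module (assuming filetxt is importable in the current session):

import cProfile
import pstats

# Profile one run of filetxt() and show the ten most expensive calls
# sorted by cumulative time.
cProfile.run('filetxt()', 'filetxt.prof')
pstats.Stats('filetxt.prof').sort_stats('cumulative').print_stats(10)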

P.S. The basic purpose of this function is to load files that each contain about 20 tweets, so I will most likely end up pushing roughly 20k - 50k files through it. The output is a list of all the distinct words that occur in the tweets, followed by which word links to which, e.g.:

1 "love"
2 "pasa"
3 "mirar"
4 "ants"
5 "kers"
6 "morir"
7 "dreaming"
8 "tan"
9 "rapido"
10 "one"
11 "much"
12 "la"
...
10 1
13 12
1 7
12 2
7 3
2 4
3 11
4 8
11 6
8 9
6 5
9 20
5 8
20 25
8 18
25 9
18 17
9 2
...
python optimization dictionary profiling zip
3 Answers
2 votes

I think you want something like this:

import json
import string
from collections import defaultdict

try:
    rng = xrange   # Python 2
except NameError:
    rng = range    # Python 3

def filetxt():
    users     = 0
    total_t   = 0
    occ       = 0

    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json",'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)

                for txt in (r['text'] for r in tweets.values()):   # each value is one tweet record
                    occ += txt.count('RT') + txt.count('@')
                    prev = None
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            pass
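
A quick sketch of how those two defaultdicts might be used afterwards; this assumes filetxt is changed to return wordcount and wordpairs, which the snippet above does not do yet:

wordcount, wordpairs = filetxt()   # hypothetical: assumes the function returns both dicts

# The ten most frequent words overall.
top_words = sorted(wordcount.items(), key=lambda kv: kv[1], reverse=True)[:10]

# For each word, the word that most often follows it
# (the None key collects the first word of every tweet).
most_common_follower = dict(
    (prev, max(followers, key=followers.get))
    for prev, followers in wordpairs.items()
    if prev is not None
)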

1 vote

I hope you don't mind that I took the liberty of rewriting your code into what I would be more likely to write myself.

import json
import re
from itertools import izip
def filetxt():
    # keeps track of word count for each word.
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map  = {}
    lvl1      = []
    lvl2      = []
    total_t   = 0
    users     = 0
    text      = []

    ####### You should replace this with a glob (see: glob module)
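    # For example, an untested sketch of the glob suggestion above (glob.glob returns
    # the matching paths, so the numeric loop and the missing-file handling go away):
    #
    #   import glob
    #   for path in glob.glob("C:/Twitter/json/user_*.json"):
    #       with open(path, "r") as f:
    #           text_f = json.load(f)
    #           ...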
    for l in range(0,500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one.
                for t in text_f.itervalues():
                    text.append(t['text'])   # each value is a tweet record with a 'text' field
        except IOError:
            pass

    total_t = len(text)
    # Filter
    occ = 0
    import string
    for s in text:
        a = re.findall(r'(RT)',s)
        b = re.findall(r'(@)',s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("",""), string.punctuation)


        # make a list of words from the punctuation-stripped text
        words = out.lower().split()

        for word in words:
            # try/except is quicker when we expect not to miss
            # and it will be rare for us not to have
            # a word in our list already.
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)


        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])

What this gives you is the following: lvl1 is a list of numbers corresponding to words in word_list, so word_list[lvl1[0]] will be the first word of the first tweet you processed. lvl2[0] is the index of the word that follows lvl1[0], so you can say that word_list[lvl2[0]] is the word that follows word_list[lvl1[0]]. The code basically maintains word_map, word_list and word_freq as it builds this.
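
For example (assuming the lists are made accessible after the function runs, e.g. by returning them from filetxt, which the code above does not do yet):

first_word     = word_list[lvl1[0]]   # the first word of the first processed tweet
following_word = word_list[lvl2[0]]   # the word that immediately followed it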

Note that the way you were doing this before, in particular the way you created W2N, will not work correctly. Dictionaries do not maintain order. Ordered dictionaries are coming in 3.1, but just forget about that for now. Basically, when you call word_freq.keys(), the result changes every time you add a new word, so there is no consistency. Look at this example,

>>> x = dict()
>>> x[5] = 2
>>> x
{5: 2}
>>> x[1] = 24
>>> x
{1: 24, 5: 2}
>>> x[10] = 14
>>> x
{1: 24, 10: 14, 5: 2}
>>>

So 5 used to be the second key, but now it is the third.

I also updated it to use 0-indexing rather than 1-indexing. I don't know why you were using range(1, len(...)+1) rather than just range(len(...)).

Regardless, you should get away from thinking about for loops in the traditional C/C++/Java sense, that is, as loops over numbers. Assume that unless you actually need the index number, you don't need to loop over indices at all.

Rule of thumb: if you need the index, you probably also need the element at that index, and you should be using enumerate anyway (link).
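
A small illustration of that rule of thumb, using a few of the words from the sample output above:

words = ['love', 'pasa', 'mirar']

# Index-based loop, C/C++/Java style:
pairs = []
for i in range(len(words)):
    pairs.append((i, words[i]))

# The same result with enumerate (index and element come together):
pairs = list(enumerate(words))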

Hope this helps...


0 votes

A couple of things. These lines together look strange to me:

WList = ', '.join(keys)
<snip>
WList = WList.split(", ")

That should just be

WList = list(keys)
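
The join followed by split simply rebuilds the same list (as long as none of the words themselves contain ", "), so the string round trip is wasted work:

>>> keys = ['love', 'pasa', 'mirar']
>>> ', '.join(keys).split(", ")
['love', 'pasa', 'mirar']
>>> list(keys)
['love', 'pasa', 'mirar']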

Are you sure you even need to optimize this? I mean, is it really so slow that it's worth your time? Lastly, it would help if you described what the script is supposed to do, rather than leaving us to decipher it from the code :)
