I have the following function:
def filetxt():
    word_freq = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    for l in range(0, 500):
        # Open File
        if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                for i in range(len(text_f)):
                    text.append(text_f[str(i)]['text'])
                    total_t = total_t + 1
        else:
            pass
    # Filter
    occ = 0
    import string
    for i in range(len(text)):
        s = text[i]  # Sample string
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ = len(a) + len(b) + occ
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)
        # Create Wordlist/Dictionary
        word_list = text[i].lower().split(None)
        for word in word_list:
            word_freq[word] = word_freq.get(word, 0) + 1
        keys = word_freq.keys()
        numbo = range(1, len(keys) + 1)
        WList = ', '.join(keys)
        NList = str(numbo).strip('[]')
        WList = WList.split(", ")
        NList = NList.split(", ")
        W2N = dict(zip(WList, NList))
        for k in range(0, len(word_list)):
            word_list[k] = W2N[word_list[k]]
        for i in range(0, len(word_list) - 1):
            lvl1.append(word_list[i])
            lvl2.append(word_list[i + 1])
Using a profiler, I found that most of the CPU time seems to be spent in the zip() function and in the join and split parts of the code. I'd like to see whether there is anything I've overlooked that would let me clean this code up and make it more optimized, since the biggest lag appears to be in how I use the dictionary and the zip() function. Any help would be much appreciated, thanks!
P.S. The basic purpose of this function is that I load files containing 20 or so tweets each, so I will most likely end up pushing around 20k - 50k files through it. The output is a list of all the distinct words in the tweets, followed by which words are linked to which, e.g.:
1 "love"
2 "pasa"
3 "mirar"
4 "ants"
5 "kers"
6 "morir"
7 "dreaming"
8 "tan"
9 "rapido"
10 "one"
11 "much"
12 "la"
...
10 1
13 12
1 7
12 2
7 3
2 4
3 11
4 8
11 6
8 9
6 5
9 20
5 8
20 25
8 18
25 9
18 17
9 2
...
I think you want something like this:
import json
import string
from collections import defaultdict

try:
    rng = xrange   # Python 2
except NameError:
    rng = range    # Python 3

def filetxt():
    users = 0
    total_t = 0
    occ = 0
    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json", 'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)
                # the file maps "0", "1", ... to tweet dicts, so walk the values
                for txt in (r['text'] for r in tweets.itervalues()):
                    occ += txt.count('RT') + txt.count('@')
                    prev = None
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            pass
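To get from those two defaultdicts to the numbered word list and the follower pairs shown in the question, something along these lines should work (a rough, 0-indexed sketch, not part of the original answer; each word gets its number the first time it is printed):

def dump(wordcount, wordpairs):
    # number each word once, in whatever order the dict happens to iterate
    word_index = {}
    for i, word in enumerate(wordcount):
        word_index[word] = i
        print i, '"%s"' % word
    # emit one "index-of-word  index-of-follower" line per observed pair
    for prev, followers in wordpairs.iteritems():
        if prev is None:
            continue  # skip the start-of-tweet marker
        for word in followers:
            print word_index[prev], word_index[word]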
I hope you don't mind that I took the liberty of changing your code into what I would be more likely to write.
import json
import re
import string
from itertools import izip

def filetxt():
    # keeps track of word count for each word.
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    ####### You should replace this with a glob (see: glob module)
    for l in range(0, 500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one.
                for t in text_f.itervalues():
                    text.append(t['text'])  ## CHECK THIS
        except IOError:
            pass
    total_t = len(text)
    # Filter
    occ = 0
    for s in text:
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)
        # make a list of words from the punctuation-stripped text
        words = out.lower().split(None)
        for word in words:
            # try/except is quicker when we expect not to miss
            # and it will be rare for us not to have
            # a word in our list already.
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)
        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])
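Regarding the glob comment in the code above, a minimal sketch of what that replacement could look like (the file pattern is assumed from the names in the question):

import glob
import json

def iter_tweet_texts():
    # visit whichever user_*.json files actually exist instead of
    # probing 500 fixed filenames
    for path in glob.glob("C:/Twitter/json/user_*.json"):
        with open(path, "r") as f:
            text_f = json.load(f)
        for t in text_f.itervalues():
            yield t['text']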
The next thing this gives you is the following: lvl1 is a list of numbers that correspond to words in word_list. So word_list[lvl1[0]] will be the first word in the first tweet you processed. lvl2[0] is the index of the word that follows lvl1[0], so you can say that word_list[lvl2[0]] is the word that follows word_list[lvl1[0]]. The code essentially maintains word_map, word_list and word_freq as it builds these lists.
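As a toy illustration of how those pieces fit together (made-up text, not real data):

words = "i love pasa love pasa".split()

word_list, word_map = [], {}
lvl1, lvl2 = [], []
for w in words:
    if w not in word_map:
        word_map[w] = len(word_list)
        word_list.append(w)
for cur, nxt in zip(words, words[1:]):
    lvl1.append(word_map[cur])
    lvl2.append(word_map[nxt])

print word_list            # ['i', 'love', 'pasa']
print lvl1, lvl2           # [0, 1, 2, 1] [1, 2, 1, 2]
print word_list[lvl2[0]]   # 'love' -- the word following word_list[lvl1[0]]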
Note that the way you were doing this before, and in particular the way you were creating W2N, will not work correctly. Dictionaries do not maintain order. Ordered dictionaries are coming in 3.1, but just forget about that for now. Basically, when you do word_freq.keys(), it changes every time you add a new word, so there is no consistency. See this example,
>>> x = dict()
>>> x[5] = 2
>>> x
{5: 2}
>>> x[1] = 24
>>> x
{1: 24, 5: 2}
>>> x[10] = 14
>>> x
{1: 24, 10: 14, 5: 2}
>>>
So 5 used to be the second one, but now it is the third.
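(For comparison, once you are on a version that has it -- 2.7 or 3.1 -- collections.OrderedDict does keep insertion order:)

>>> from collections import OrderedDict
>>> x = OrderedDict()
>>> x[5] = 2
>>> x[1] = 24
>>> x[10] = 14
>>> x
OrderedDict([(5, 2), (1, 24), (10, 14)])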
I also updated it to use 0-indexing instead of 1-indexing. I don't know why you were using range(1, len(...)+1) rather than just range(len(...)).
In any case, you should get away from thinking of for loops in the traditional C/C++/Java sense, where you loop over numbers. You should realise that unless you actually need the index number, you don't need it at all. Rule of thumb: if you need the index, you probably also want the element at that index, and either way you should be using enumerate. Link
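For example (a made-up snippet just to show the idiom):

words = ['love', 'pasa', 'mirar']

# C/Java style -- avoid this
for i in range(len(words)):
    print i, words[i]

# Pythonic: enumerate gives you the index and the element together
for i, word in enumerate(words):
    print i, word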
Hope this helps...
A couple of things. These lines taken together look odd to me:
WList = ', '.join(keys)
<snip>
WList = WList.split(", ")
That should just be WList = list(keys).
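i.e. the join followed by the split just rebuilds the list you started from (assuming no key itself contains ", "):

>>> keys = ['love', 'pasa', 'mirar']
>>> ', '.join(keys).split(", ")
['love', 'pasa', 'mirar']
>>> list(keys)
['love', 'pasa', 'mirar']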
Are you sure you want to optimize this? I mean, is it really so slow that it's worth your time? Finally, it would be nice to have a description of what the script is supposed to do rather than having us decipher it from the code :)