I have a list of 2 million tuples, where the first element is text and the second is an integer. For example:
list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
I want to tokenize the first item in each tuple, and also append every word list to one flattened list, so that I get the desired outputs below.
list_of_tokenized_tuples = [(['here', 'is', 'some', 'text'], 1), (['this', 'is', 'more', 'text'], 5), (['a', 'final', 'tuple'], 12)]
list_of_all_words = ['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
So far I believe I have found a way to do this, but given the length of the list it is quite time-intensive. Is there any way to tokenize the first item in the tuples and/or flatten the list of all words that doesn't involve loops?
list_of_tokenized_tuples = []
list_of_all_words = []

for text, num in list_of_tuples:
    tokenized_text = list(word_tokenize(text))
    tokenized_tuple = (tokenized_text, num)
    list_of_all_words.append(tokenized_text)
    list_of_tokenized_tuples.append(tokenized_tuple)

list_of_all_words = [val for sublist in list_of_all_words for val in sublist]
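For reference, the loop above can be written more compactly with two comprehensions (a sketch that stubs `word_tokenize` with `str.split`; swap in your real tokenizer):

```python
# Sketch: the question's two outputs built with comprehensions.
# word_tokenize is stubbed with str.split here; replace it with
# your actual tokenizer (e.g. NLTK's) for real use.
def word_tokenize(text):
    return text.split()

list_of_tuples = [('here is some text', 1), ('this is more text', 5),
                  ('a final tuple', 12)]

# Tokenize the first element of each tuple, keeping the int.
list_of_tokenized_tuples = [(word_tokenize(text), num)
                            for text, num in list_of_tuples]

# Flatten all the token lists into one list of words.
list_of_all_words = [word for tokens, _ in list_of_tokenized_tuples
                     for word in tokens]
```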
Using itertools

You could write it as:

from itertools import chain
chain.from_iterable(map(lambda pair: word_tokenize(pair[0]), list_of_tuples))
Testing this:

from itertools import chain

def word_tokenize(text):
    return text.split()  # insert your tokenizer here

ts = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
print(list(chain.from_iterable(map(lambda pair: word_tokenize(pair[0]), ts))))
yields
['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
I'm not sure how much this buys you, though, since the itertools functions themselves use for loops in their implementations.
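One way to check is to time both versions side by side. A sketch (the names `with_loop` and `with_itertools` are made up for illustration, and the tokenizer is again stubbed with `str.split`):

```python
# Sketch: timing an explicit loop against the itertools version
# on a small repeated dataset. Timings will vary by machine.
import timeit
from itertools import chain

def word_tokenize(text):
    return text.split()  # stand-in tokenizer

ts = [('here is some text', 1), ('this is more text', 5),
      ('a final tuple', 12)] * 1000

def with_loop():
    words = []
    for text, _ in ts:
        words.extend(word_tokenize(text))
    return words

def with_itertools():
    return list(chain.from_iterable(word_tokenize(t) for t, _ in ts))

# Both approaches must produce the same flattened word list.
assert with_loop() == with_itertools()

print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(with_itertools, number=100))
```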
>>> from itertools import chain
>>> list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
# Separate the texts (str) from the ints
>>> texts, nums = zip(*list_of_tuples)
# Go into each string and split by whitespaces,
# Then flatten the list of list of str to list of str
>>> list_of_all_words = list(chain(*map(str.split, texts)))
>>> list_of_all_words
['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
If you need to use word_tokenize, then:
list_of_all_words = list(chain(*map(word_tokenize, texts)))
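The same `texts`/`nums` split can also rebuild the list of tokenized tuples from the question (a sketch, once more stubbing `word_tokenize` with `str.split`):

```python
# Sketch: both of the question's outputs from one zip(*...) split.
# word_tokenize is stubbed with str.split; use your real tokenizer.
from itertools import chain

def word_tokenize(text):
    return text.split()

list_of_tuples = [('here is some text', 1), ('this is more text', 5),
                  ('a final tuple', 12)]

texts, nums = zip(*list_of_tuples)

tokenized = list(map(word_tokenize, texts))
# Pair each token list back up with its int.
list_of_tokenized_tuples = list(zip(tokenized, nums))
# Flatten the token lists into one word list.
list_of_all_words = list(chain.from_iterable(tokenized))
```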
I wrote this generator for you. If you want to create a list, there is little else you can do (other than a list comprehension). With that in mind, see below: it gives you the desired output, but as two separate lists joined in a tuple. I doubt that matters much, and I'm sure you can always change it to fit your needs or preferences.
import timeit, random

list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
big_list = [random.choice(list_of_tuples) for x in range(1000)]

def gen(lot=big_list, m='tokenize'):
    list_all_words = []
    tokenised_words = []
    i1 = 0
    i2 = 0
    i3 = 0
    lol1 = len(lot)
    while i1 < lol1:
        # yield lot[i1]
        lol2 = len(lot[i1])
        while i2 < lol2:
            if type(lot[i1][i2]) == str:
                list_all_words.append((lot[i1][i2].split(), i1 + 1))
            i2 += 1
        i1 += 1
        i2 = 0
    # print(list_all_words)
    lol3 = len(list_all_words)
    while i3 < lol3:
        tokenised_words += list_all_words[i3][0]
        i3 += 1
    if m == 'list':
        yield list_all_words
    if m == 'tokenize':
        yield tokenised_words

for x in gen():
    print(x)

print(timeit.timeit(gen))
# Output of timeit: 0.2610903770813007
# This should be unnoticeable on system resources, I would have thought.