I have a list of 2 million tuples, where the first element is text and the second is an integer. For example:
list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
I want to tokenize the first item in each tuple, and also append every word list to one flattened list, so that I get the desired outputs below.
list_of_tokenized_tuples = [(['here', 'is', 'some', 'text'], 1), (['this', 'is', 'more', 'text'], 5), (['a', 'final', 'tuple'], 12)]
list_of_all_words = ['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
So far I believe I have found a way to do this, but given the length of the list it is quite time-intensive. Is there any way to tokenize the first item in the tuples and/or flatten the list of all words that doesn't involve loops?
list_of_tokenized_tuples = []
list_of_all_words = []

for text, num in list_of_tuples:
    tokenized_text = list(word_tokenize(text))
    tokenized_tuple = (tokenized_text, num)
    list_of_all_words.append(tokenized_text)
    list_of_tokenized_tuples.append(tokenized_tuple)

list_of_all_words = [val for sublist in list_of_all_words for val in sublist]
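For reference, the loop above can be written more compactly with two comprehensions (a sketch that stubs `word_tokenize` with `str.split`; swap in your real tokenizer):

```python
# Sketch: the question's two outputs built with comprehensions.
# word_tokenize is stubbed with str.split here; replace it with
# your actual tokenizer (e.g. NLTK's) for real use.
def word_tokenize(text):
    return text.split()

list_of_tuples = [('here is some text', 1), ('this is more text', 5),
                  ('a final tuple', 12)]

# Tokenize the first element of each tuple, keeping the int.
list_of_tokenized_tuples = [(word_tokenize(text), num)
                            for text, num in list_of_tuples]

# Flatten all the token lists into one list of words.
list_of_all_words = [word for tokens, _ in list_of_tokenized_tuples
                     for word in tokens]
```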
Using itertools

You could write it as:

from itertools import chain
chain.from_iterable(map(lambda pair: word_tokenize(pair[0]), list_of_tuples))
Testing this:

from itertools import chain

def word_tokenize(text):
    return text.split()  # insert your tokenizer here

ts = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
print(list(chain.from_iterable(map(lambda pair: word_tokenize(pair[0]), ts))))
yields
['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
I'm not sure how much this buys you, though, since the itertools functions themselves use for loops in their implementations.
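One way to check is to time both versions side by side. A sketch (the names `with_loop` and `with_itertools` are made up for illustration, and the tokenizer is again stubbed with `str.split`):

```python
# Sketch: timing an explicit loop against the itertools version
# on a small repeated dataset. Timings will vary by machine.
import timeit
from itertools import chain

def word_tokenize(text):
    return text.split()  # stand-in tokenizer

ts = [('here is some text', 1), ('this is more text', 5),
      ('a final tuple', 12)] * 1000

def with_loop():
    words = []
    for text, _ in ts:
        words.extend(word_tokenize(text))
    return words

def with_itertools():
    return list(chain.from_iterable(word_tokenize(t) for t, _ in ts))

# Both approaches must produce the same flattened word list.
assert with_loop() == with_itertools()

print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(with_itertools, number=100))
```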
>>> from itertools import chain
>>> list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
# Separate the texts (str) from the ints
>>> texts, nums = zip(*list_of_tuples)
# Go into each string and split by whitespaces,
# Then flatten the list of list of str to list of str
>>> list_of_all_words = list(chain(*map(str.split, texts)))
>>> list_of_all_words
['here', 'is', 'some', 'text', 'this', 'is', 'more', 'text', 'a', 'final', 'tuple']
If you need to use word_tokenize, then:
list_of_all_words = list(chain(*map(word_tokenize, texts)))
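The same `texts`/`nums` split can also rebuild the list of tokenized tuples from the question (a sketch, once more stubbing `word_tokenize` with `str.split`):

```python
# Sketch: both of the question's outputs from one zip(*...) split.
# word_tokenize is stubbed with str.split; use your real tokenizer.
from itertools import chain

def word_tokenize(text):
    return text.split()

list_of_tuples = [('here is some text', 1), ('this is more text', 5),
                  ('a final tuple', 12)]

texts, nums = zip(*list_of_tuples)

tokenized = list(map(word_tokenize, texts))
# Pair each token list back up with its int.
list_of_tokenized_tuples = list(zip(tokenized, nums))
# Flatten the token lists into one word list.
list_of_all_words = list(chain.from_iterable(tokenized))
```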
I wrote this generator for you. If you want to create a list, there is little else you can do (other than a list comprehension). With that in mind, see below: it gives you the desired output, but as two separate lists joined in a tuple. I doubt that matters much, and I'm sure you can always change it to fit your needs or preferences.
import timeit, random

list_of_tuples = [('here is some text', 1), ('this is more text', 5), ('a final tuple', 12)]
big_list = [random.choice(list_of_tuples) for x in range(1000)]

def gen(lot=big_list, m='tokenize'):
    list_all_words = []
    tokenised_words = []
    i1 = 0
    i2 = 0
    i3 = 0
    lol1 = len(lot)
    while i1 < lol1:
        # yield lot[i1]
        lol2 = len(lot[i1])
        while i2 < lol2:
            if type(lot[i1][i2]) == str:
                list_all_words.append((lot[i1][i2].split(), i1 + 1))
            i2 += 1
        i1 += 1
        i2 = 0
    # print(list_all_words)
    lol3 = len(list_all_words)
    while i3 < lol3:
        tokenised_words += list_all_words[i3][0]
        i3 += 1
    if m == 'list':
        yield list_all_words
    if m == 'tokenize':
        yield tokenised_words

for x in gen():
    print(x)

print(timeit.timeit(gen))
# Output of timeit: 0.2610903770813007
# This should be unnoticeable on system resources, I would have thought.