How to create a document matrix in Python whose (i, j) entry is a term index

Question (votes: 0, answers: 1)

I have run into the following problem while working with a matrix of text data.

I also have the original text documents, stored in a list. Below is an example element of the text data list.

text_data[1]
u"\n The Bechtel Group Inc. offered in 1985 to sell oil to Israel at a 
discount of at least 650 million for 10 years if it promised not to 
bomb a proposed Iraqi pipeline, a Foreign Ministry official said 
Wednesday. But then-Prime Minister Shimon Peres said the offer from 
Bruce Rappaport, a partner in the San Francisco-based construction and 
engineering company, was ``unimportant,'' the senior official told The 
Associated Press. Peres, now foreign minister, never discussed the 
offer with other government ministers, said the official, who spoke on 
condition of anonymity.\n"

I would like to obtain a matrix in which x_{ij} is the term index of the word at position j in document i. Here is an example:

 W = np.array([0, 1, 2, 3, 4])  # word indices for a dictionary of words

 # X := document-word matrix
 X = np.array([
    [0, 0, 1, 2, 2], # e.g., this row means the 1st and 2nd positions hold the first term in the dictionary, etc.
    [0, 0, 1, 1, 1],
    [0, 1, 2, 2, 2],
    [4, 4, 4, 4, 4],
    [3, 3, 4, 4, 4],
    [3, 4, 4, 4, 4]
    ])

What I can think of is to first build a dictionary of the terms in the corpus and assign each term a corresponding index, then loop over every document and every position within it, recording the term index of the word found at position j of document i. But this seems long-winded and inefficient.
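For what it's worth, the dictionary-based approach described above is only a single pass over the corpus, so it is not as inefficient as it sounds. A minimal sketch (function and variable names are my own; short documents are padded with -1 so the rows line up):

```python
import numpy as np

def documents_to_index_matrix(documents):
    """Build a matrix whose (i, j) entry is the dictionary index of the
    j-th word of the i-th document. Positions past the end of a short
    document are padded with -1."""
    vocab = {}
    for doc in documents:
        for word in doc:
            if word not in vocab:
                vocab[word] = len(vocab)  # index by order of first appearance
    max_len = max(len(doc) for doc in documents)
    X = np.full((len(documents), max_len), -1, dtype=int)
    for i, doc in enumerate(documents):
        for j, word in enumerate(doc):
            X[i, j] = vocab[word]
    return X, vocab

docs = [['a', 'a', 'b', 'c', 'c'],
        ['a', 'a', 'b', 'b', 'b']]
X, vocab = documents_to_index_matrix(docs)
# X[0] is [0, 0, 1, 2, 2] and X[1] is [0, 0, 1, 1, 1], matching the example above
```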

python numpy text data-manipulation lda
1 Answer

0 votes

I ran into a similar challenge a few months ago. I'm fairly sure there is a way to do this with Python's NLTK. Googling "corpus to term count vector" should give you a good start.

However, as you suggested in your question, I ended up just implementing my own.

def document_to_term_counts(document, vocab):
    # Count how many times each vocab word occurs in a single document.
    term_count = [0] * len(vocab)
    for word in document:
        if word in vocab:
            term_count[vocab.index(word)] += 1
    return term_count

def count_words_in_documents(documents):
    # For each word, track its total occurrences and the number of
    # distinct documents it appears in.
    word_counts = {}
    for document in documents:
        words_found_in_document = set()
        for word in document:
            if word not in word_counts:
                word_counts[word] = {'all_appearances': 1, 'document_appearances': 1}
            else:
                word_counts[word]['all_appearances'] += 1
                if word not in words_found_in_document:
                    word_counts[word]['document_appearances'] += 1
            words_found_in_document.add(word)
    return word_counts

def word_counts_to_vocab(word_counts, min_document_appearances, max_document_appearances):
    # Keep only words whose document frequency falls within [min, max].
    vocab = []
    for word in word_counts:
        document_appearances = word_counts[word]['document_appearances']
        if min_document_appearances <= document_appearances <= max_document_appearances:
            vocab.append(word)
    return vocab

def documents_to_vocab(documents, min_document_appearances, max_document_appearances):
    word_counts = count_words_in_documents(documents)
    vocab = word_counts_to_vocab(word_counts, min_document_appearances, max_document_appearances)

    return vocab

documents = [
    ['the', 'quick', 'brown', 'fox', 'jumped'],
    ['foxes', 'are', 'quick']
]

vocab = documents_to_vocab(documents, 1, 100)
print('vocabulary:')
print(vocab)

for document in documents:
    term_counts = document_to_term_counts(document, vocab)
    print('-'*50)
    print(document)
    print(term_counts)
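The per-document count rows can also be stacked into a single NumPy document-term matrix. A minimal sketch (the `vocab` list here is a hand-written stand-in for the output of the vocabulary-building step; note this is a bag-of-words count matrix, not the positional index matrix from the question):

```python
import numpy as np

def documents_to_count_matrix(documents, vocab):
    # Row i holds, for each vocab word j, the number of times that
    # word occurs in document i.
    mat = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, doc in enumerate(documents):
        for word in doc:
            if word in vocab:
                mat[i, vocab.index(word)] += 1
    return mat

documents = [
    ['the', 'quick', 'brown', 'fox', 'jumped'],
    ['foxes', 'are', 'quick']
]
vocab = ['the', 'quick', 'brown', 'fox', 'jumped', 'foxes', 'are']
print(documents_to_count_matrix(documents, vocab))
```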

my full project
