Calculating TF-IDF for variable n-grams in Python using sklearn

Question

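Problem: use scikit-learn to find the number of hits of variable n-grams for a specific vocabulary.

Explanation. I got the examples from here.

Imagine I have a corpus and I want to find how many hits (counts) there are for a vocabulary like the following: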

myvocabulary = [
    {'window': 4, 'words': ['tin', 'tan']},
    {'window': 3, 'words': ['electrical', 'car']},
    {'window': 3, 'words': ['elephant', 'banana']},
]

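What I call the window here is the length of the span of words within which the two words may appear. For example:

'tin tan' is a hit (within 4 words)

'tin dog tan' is a hit (within 4 words)

'tin dog cat tan' is a hit (within 4 words)

'tin car sun eclipse tan' is NOT a hit. tin and tan are more than 4 words apart.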
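One way to read the window rule, sketched in plain Python. The function name is_hit is hypothetical, and interpreting "within 4 words" as "both words fall inside a span of at most 4 consecutive tokens" is an assumption based on the four examples:

```python
def is_hit(text, word1, word2, window):
    """Return True if word1 and word2 occur inside a span of `window` tokens."""
    tokens = text.lower().split()
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    # A span of `window` tokens means the two indices differ by at most window - 1.
    return any(abs(i - j) <= window - 1 for i in pos1 for j in pos2)


examples = ['tin tan', 'tin dog tan', 'tin dog cat tan', 'tin car sun eclipse tan']
results = [is_hit(e, 'tin', 'tan', window=4) for e in examples]
print(results)  # [True, True, True, False]
```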

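I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to compute TF-IDF. I could only find something like this: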

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

where vocabulary is a simple list of strings, each being either a single word or several words.

Also, from the scikit-learn documentation:

class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)

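The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

This does not help either.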
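A likely reason ngram_range does not help: word n-grams are contiguous runs of tokens, so two words separated by other words never appear together as a 2-gram. The plain-Python sketch below illustrates this. word_ngrams is a hypothetical helper, not a scikit-learn function, and it only approximates what the vectorizer does (no stop-word removal or token filtering):

```python
def word_ngrams(text, min_n, max_n):
    """List all contiguous word n-grams with min_n <= n <= max_n."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


grams = word_ngrams('tin dog tan', 1, 3)
print(grams)
# ['tin', 'dog', 'tan', 'tin dog', 'dog tan', 'tin dog tan']
print('tin tan' in grams)  # False: the gapped pair is not a contiguous n-gram
```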

Any ideas? Thanks.

Answer 1

I am not sure whether this can be done using CountVectorizer or TfidfVectorizer. I have written my own function to do this, as follows:

import string

import pandas as pd

def contained_within_window(token, word1, word2, threshold):
    # Count the (word1, word2) pairs in `token` whose positions are
    # at most `threshold` words apart.
    word1 = word1.lower()
    word2 = word2.lower()
    # Strip punctuation and lowercase so that "document." matches "document".
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    word_list = token.split()
    word1_index = [i for i, x in enumerate(word_list) if x == word1]
    word2_index = [i for i, x in enumerate(word_list) if x == word2]
    count = 0
    for i in word1_index:
        for j in word2_index:
            if abs(i - j) <= threshold:
                count += 1
    return count

Sample:

corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]

df = pd.DataFrame(corpus, columns=["Test"])

Your df will look like this:

    Test
0   This is the first document. And this is what I...
1   This document is the second document.
2   And this is the third one.
3   Is this the first document?
4   I like coding in sklearn
5   This is a very good question

Now you can apply contained_within_window as follows:

sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))

You get:

2

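You can simply run a for loop to check the different vocabulary entries. Then you can build your pandas df from the counts and apply TF-IDF on it, which is straightforward.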
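A minimal sketch of that loop, using only the standard library so it runs anywhere. Everything named here is illustrative: count_pairs is a hypothetical stand-in for the counting function, the vocabulary and docs are made up, and mapping a window of w words to a maximum index distance of w - 1 is an assumption taken from the question's examples. The weighting uses a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which is assumed here to mirror scikit-learn's default TfidfTransformer settings; in practice the count matrix could be passed to TfidfTransformer instead.

```python
import math
import string


def count_pairs(text, word1, word2, threshold):
    # Count (word1, word2) position pairs at most `threshold` tokens apart.
    table = str.maketrans('', '', string.punctuation)
    tokens = text.translate(table).lower().split()
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) <= threshold)


# Hypothetical vocabulary entries: (window, word1, word2).
vocabulary = [(4, 'tin', 'tan'), (3, 'electrical', 'car')]
docs = [
    'A tin dog tan.',
    'An electrical car, tin and tan.',
    'An electrical sports car with a tan seat.',
]

# counts[d][v] = hits of vocabulary entry v in document d.
counts = [
    [count_pairs(doc, w1, w2, window - 1) for window, w1, w2 in vocabulary]
    for doc in docs
]
print(counts)  # [[1, 0], [1, 1], [0, 1]]

# Smoothed idf per vocabulary entry.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[v] > 0) for v in range(len(vocabulary))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# tf * idf, then L2-normalize each document row (all-zero rows stay zero).
tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm if norm else 0.0 for x in weighted])
```

Each row of tfidf corresponds to one document and each column to one vocabulary entry, so it can be loaded into a pandas DataFrame directly if that is the desired container.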

Hope this helps!
