Why doesn't feature extraction from text return all possible feature names?

Question

Here is a code snippet from the book Natural Language Processing with PyTorch:

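import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']
one_hot_vectorizer = CountVectorizer()
# The vectorizer has to be fitted before its vocabulary can be read back.
one_hot_vectorizer.fit(corpus)
# On scikit-learn >= 1.2, use get_feature_names_out() instead.
vocab = one_hot_vectorizer.get_feature_names()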

The value of vocab:

vocab = ['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']

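Why is 'a' missing from the extracted feature names? If it was excluded automatically as too common a word, why wasn't 'an' excluded for the same reason? And how can I make .get_feature_names() filter out other words as well?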

Answer 1

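Very good question! Though this is not a PyTorch problem but a sklearn one =)

I encourage you to first go through https://www.kaggle.com/alvations/basic-nlp-with-nltk, especially the "Vectorization with sklearn" section.

TL;DR

If we use the CountVectorizer,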

from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Create the vectorizer
    count_vect = CountVectorizer()
    count_vect.fit_transform(fin)

# We can check the vocabulary in our vectorizer
# It's a dictionary where the words are the keys and 
# The values are the IDs given to each word. 
print(count_vect.vocabulary_)

[OUT]:

{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jumps': 3,
 'lazy': 4,
 'mr': 5,
 'over': 6,
 'quick': 7,
 'the': 8}

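We didn't tell the vectorizer to remove punctuation, tokenize, or lowercase the text, so how did it do all that?

Also, 'the' is in the vocabulary. It's a stopword and we want it gone... And 'jumps' isn't stemmed or lemmatized!

If we look at the documentation of CountVectorizer in sklearn, we see: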

CountVectorizer(
    input='content', encoding='utf-8', 
    decode_error='strict', strip_accents=None, 
    lowercase=True, preprocessor=None, 
    tokenizer=None, stop_words=None, 
    token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), 
    analyzer='word', max_df=1.0, min_df=1, 
    max_features=None, vocabulary=None, 
    binary=False, dtype=<class 'numpy.int64'>)

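And more specifically:

analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. If a callable is passed, it is used to extract the sequence of features out of the raw, unprocessed input.

preprocessor : callable or None (default)

Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.

tokenizer : callable or None (default)

Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

stop_words : string {'english'}, list, or None (default)

If 'english', a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used.

lowercase : boolean, True by default

Convert all characters to lowercase before tokenizing.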
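To see those default steps (lowercasing, tokenizing, punctuation stripping) applied to a single string, one option is to ask the vectorizer for its analyzer function. This is only a minimal sketch: it reuses one of the sample sentences from above and assumes a default CountVectorizer with no extra arguments.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that the vectorizer applies to each
# raw document: preprocessing (lowercasing) followed by tokenization.
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("Mr brown jumps over the lazy fox.")
print(tokens)
# ['mr', 'brown', 'jumps', 'over', 'the', 'lazy', 'fox']
```

The text is lowercased and the trailing period is dropped, but 'the' survives and 'jumps' is left as is, because no stopword list or stemmer is involved by default.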

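But in the case of the example from http://shop.oreilly.com/product/0636920063445.do, it's not really the stopwords that are causing the problem.

If we explicitly use the English stopwords from https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py: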

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> one_hot_vectorizer = CountVectorizer(stop_words='english')

>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

>>> one_hot_vectorizer.get_feature_names()
['arrow', 'banana', 'flies', 'fruit', 'like', 'time']

So what really happened when the stop_words argument was left as None?

Let's try an experiment where I add some single characters to the input:

>>> corpus = ['Time flies flies like an arrow 1 2 3.', 'Fruit flies like a banana x y z.']

>>> one_hot_vectorizer = CountVectorizer()

>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()                                         
['an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time']

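They're all gone again!!!

Now if we dig into the documentation, https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L738

token_pattern : string. Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

Ah ha, that's why all the single-character tokens got removed!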
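The effect of that pattern can be checked without sklearn at all. The sketch below simply runs the documented default regex over the lowercased corpus from the question with Python's re module; it imitates the tokenization step only and is not the vectorizer's actual code path.

```python
import re

# The default token_pattern quoted in the documentation above:
# a word boundary, then at least two word characters, then a word boundary.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

# Lowercase first, as the vectorizer does with lowercase=True.
tokens = [token_pattern.findall(doc.lower()) for doc in corpus]
for doc_tokens in tokens:
    print(doc_tokens)
# ['time', 'flies', 'flies', 'like', 'an', 'arrow']
# ['fruit', 'flies', 'like', 'banana']
```

'an' has two characters and matches, while 'a' has one and never becomes a token, which is exactly the difference asked about in the question.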

The default pattern for CountVectorizer is token_pattern=r"(?u)\b\w\w+\b". To make it accept single characters, you can try:

>>> one_hot_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")           
>>> one_hot_vectorizer.fit(corpus)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
        vocabulary=None)
>>> one_hot_vectorizer.get_feature_names()
['1', '2', '3', 'a', 'an', 'arrow', 'banana', 'flies', 'fruit', 'like', 'time', 'x', 'y', 'z']
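As for filtering out other words of your own choosing, the stop_words parameter described above also accepts a plain list. Here is a rough sketch that combines it with the relaxed token pattern; the list custom_stop_words is made up purely for illustration and is not part of sklearn.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Time flies flies like an arrow.', 'Fruit flies like a banana.']

# Hypothetical stopword list, chosen only for this example.
custom_stop_words = ['a', 'an', 'like']

vectorizer = CountVectorizer(
    token_pattern=r"(?u)\b\w+\b",  # keep one-character tokens
    stop_words=custom_stop_words,  # then drop exactly the listed words
)
vectorizer.fit(corpus)

# vocabulary_ maps each kept token to its column index; sorting the keys
# gives the same order as the feature names.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
# ['arrow', 'banana', 'flies', 'fruit', 'time']
```

Reading the vocabulary from vocabulary_ sidesteps the get_feature_names() versus get_feature_names_out() difference between sklearn versions.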
