Computing N-Grams Using Python

Problem description (Votes: 20, Answers: 8)

I need to compute Unigrams, BiGrams and Trigrams for a text file containing text like the following:

"Cystic fibrosis affects 30,000 children and young adults in the US alone. Inhaling the mists of salt water can reduce the pus and infection that fills the airways of cystic fibrosis sufferers, although side effects include a nasty coughing fit and a harsh taste. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I started with Python and used the following code:

#!/usr/bin/env python
# File: n-gram.py
def N_Gram(N, text):
    NList = []                      # start with an empty list
    if N > 1:
        space = " " * (N-1)         # add N - 1 spaces
        text = space + text + space # add both in front and back
    # append the slices [i:i+N] to NList
    for i in range(len(text) - (N - 1)):
        NList.append(text[i:i+N])
    return NList                    # return the list
# test code
for i in range(5):
    print N_Gram(i+1, "text")
# more test code
nList = N_Gram(7, "Here is a lot of text to print")
for ngram in iter(nList):
    print '"' + ngram + '"'

http://www.daniweb.com/software-development/python/threads/39109/generating-n-grams-from-a-word

But this works on all the n-grams within a single word, whereas I want n-grams between words, as in CYSTIC and FIBROSIS or CYSTIC FIBROSIS. Can someone help me with this?

python nlp nltk n-gram
8 Answers
28 votes

Assuming the input is a string containing space-separated words, like x = "a b c d", you can use the following function (edit: see the last function for a possibly more complete solution):

def ngrams(input, n):
  input = input.split(' ')          # split on single spaces into words
  output = []
  for i in range(len(input)-n+1):   # one n-gram per starting index
    output.append(input[i:i+n])
  return output

ngrams('a b c d', 2) # [['a', 'b'], ['b', 'c'], ['c', 'd']]

If you want to join those back into strings, you can call something like:

[' '.join(x) for x in ngrams('a b c d', 2)] # ['a b', 'b c', 'c d']

Finally, that doesn't summarize things into totals, so if your input were 'a a a a', you would need to count them up into a dict:

grams = {}                      # counts keyed by the n-gram string
for g in (' '.join(x) for x in ngrams(input, 2)):
   grams.setdefault(g, 0)
   grams[g] += 1

Combining all of that into one final function gives:

def ngrams(input, n):
  input = input.split(' ')
  output = {}
  for i in range(len(input)-n+1):
    g = ' '.join(input[i:i+n])    # the n-gram as a single string
    output.setdefault(g, 0)       # initialize the count on first sight
    output[g] += 1
  return output

ngrams('a a a a', 2) # {'a a': 3}
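
As a design note, the standard library's collections.Counter handles the tallying more idiomatically than setdefault. A minimal sketch of the same final function, assuming the same space-separated input format (the helper name ngram_counts is my own):

from collections import Counter

def ngram_counts(input, n):
    words = input.split(' ')
    # Counter tallies each joined n-gram string directly
    return Counter(' '.join(words[i:i+n]) for i in range(len(words)-n+1))

ngram_counts('a a a a', 2) # Counter({'a a': 3})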

32 votes

A short, Pythonic solution, from this blog:

def find_ngrams(input_list, n):
  # zip n progressively shifted copies of the list together
  return zip(*[input_list[i:] for i in range(n)])

Usage:

>>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
>>> find_ngrams(input_list, 1)
[('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
>>> find_ngrams(input_list, 2)
[('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
>>> find_ngrams(input_list, 3)
[('all', 'this', 'happened'), ('this', 'happened', 'more'), ('happened', 'more', 'or'), ('more', 'or', 'less')]

25 votes

Use NLTK (the Natural Language Toolkit) and use its functions to tokenize (split) your text into a list, and then find the bigrams and trigrams.

import nltk
words = nltk.word_tokenize(my_text)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)
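
To then get the counts the question asks for, NLTK's FreqDist can tally the results; a small sketch, where my_text stands in for the file's contents:

import nltk

my_text = "CYSTIC FIBROSIS AFFECTS 30,000 CHILDREN AND YOUNG ADULTS IN THE US ALONE ..."
words = nltk.word_tokenize(my_text)

# FreqDist is a counting dict: item -> number of occurrences
unigram_counts = nltk.FreqDist(words)
bigram_counts = nltk.FreqDist(nltk.bigrams(words))
trigram_counts = nltk.FreqDist(nltk.trigrams(words))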

8 votes

There is one more interesting module in Python called Scikit. Here is the code that will help you get all the n-grams given in a particular range:

from sklearn.feature_extraction.text import CountVectorizer
text = "this is a foo bar sentences and i want to ngramize it"
vectorizer = CountVectorizer(ngram_range=(1,6))   # unigrams up to 6-grams
analyzer = vectorizer.build_analyzer()            # preprocessing + tokenization + n-gram generation
print analyzer(text)

The output is

[u'this', u'is', u'foo', u'bar', u'sentences', u'and', u'want', u'to', u'ngramize', u'it', u'this is', u'is foo', u'foo bar', u'bar sentences', u'sentences and', u'and want', u'want to', u'to ngramize', u'ngramize it', u'this is foo', u'is foo bar', u'foo bar sentences', u'bar sentences and', u'sentences and want', u'and want to', u'want to ngramize', u'to ngramize it', u'this is foo bar', u'is foo bar sentences', u'foo bar sentences and', u'bar sentences and want', u'sentences and want to', u'and want to ngramize', u'want to ngramize it', u'this is foo bar sentences', u'is foo bar sentences and', u'foo bar sentences and want', u'bar sentences and want to', u'sentences and want to ngramize', u'and want to ngramize it', u'this is foo bar sentences and', u'is foo bar sentences and want', u'foo bar sentences and want to', u'bar sentences and want to ngramize', u'sentences and want to ngramize it']

This gives all the n-grams in the given range of 1 to 6. It uses the method called CountVectorizer. Here is the link.
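
If you also want the counts rather than just the token stream, the same vectorizer can produce them. A minimal sketch, assuming a single document:

# fit_transform builds the vocabulary and the count matrix in one step
X = vectorizer.fit_transform([text])

# vocabulary_ maps each n-gram to its column index in the sparse matrix X
counts = {gram: X[0, idx] for gram, idx in vectorizer.vocabulary_.items()}
print counts['foo bar'] # 1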


3 votes

Using collections.deque:

from collections import deque
from itertools import islice

def ngrams(message, n=1):
    it = iter(message.split())
    # prime a sliding window with the first n tokens
    window = deque(islice(it, n), maxlen=n)
    yield tuple(window)
    # then slide the window forward one token at a time
    for item in it:
        window.append(item)
        yield tuple(window)

...or you can do it in one line as a list comprehension:

n = 2
message = "Hello, how are you?".split()
myNgrams = [message[i:i+n] for i in range(len(message) - n + 1)]
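
For example, both versions applied to the same sentence give:

>>> list(ngrams("Hello, how are you?", 2))
[('Hello,', 'how'), ('how', 'are'), ('are', 'you?')]
>>> myNgrams
[['Hello,', 'how'], ['how', 'are'], ['are', 'you?']]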

1 vote

nltk has native support for ngrams.

'n' is the n-gram size, e.g. n=3 gives trigrams.

from nltk import ngrams

def ngramize(texts, n):
    output = []
    for text in texts:
        # ngrams takes any sequence of tokens plus the n-gram size
        output += ngrams(text, n)
    return output
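
A usage note: ngrams iterates over any sequence, so passing raw strings would yield character n-grams; tokenize each text first. For example:

>>> ngramize([['all', 'this', 'happened'], ['more', 'or', 'less']], 2)
[('all', 'this'), ('this', 'happened'), ('more', 'or'), ('or', 'less')]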

1 vote

Though the post is old, I would like to mention my answer here so that most of the n-gram creation logic can be found in one post.

There is something in Python called TextBlob. It creates n-grams very easily, similar to NLTK.

Below is the code snippet with its output, for easier understanding.

from textblob import TextBlob

sent = """This is to show the usage of Text Blob in Python"""
blob = TextBlob(sent)
unigrams = blob.ngrams(n=1)
bigrams = blob.ngrams(n=2)
trigrams = blob.ngrams(n=3)

The output is:

unigrams
[WordList(['This']),
 WordList(['is']),
 WordList(['to']),
 WordList(['show']),
 WordList(['the']),
 WordList(['usage']),
 WordList(['of']),
 WordList(['Text']),
 WordList(['Blob']),
 WordList(['in']),
 WordList(['Python'])]

bigrams
[WordList(['This', 'is']),
 WordList(['is', 'to']),
 WordList(['to', 'show']),
 WordList(['show', 'the']),
 WordList(['the', 'usage']),
 WordList(['usage', 'of']),
 WordList(['of', 'Text']),
 WordList(['Text', 'Blob']),
 WordList(['Blob', 'in']),
 WordList(['in', 'Python'])]

trigrams
[WordList(['This', 'is', 'to']),
 WordList(['is', 'to', 'show']),
 WordList(['to', 'show', 'the']),
 WordList(['show', 'the', 'usage']),
 WordList(['the', 'usage', 'of']),
 WordList(['usage', 'of', 'Text']),
 WordList(['of', 'Text', 'Blob']),
 WordList(['Text', 'Blob', 'in']),
 WordList(['Blob', 'in', 'Python'])]

As simple as that.
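
If you need plain strings rather than WordList objects, each n-gram is just a list of tokens, so joining works; a small follow-up sketch (not part of the original answer):

bigram_strings = [' '.join(bg) for bg in bigrams]
# ['This is', 'is to', 'to show', 'show the', ...]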

TextBlob is doing a lot more. Please go through this doc for more details - https://textblob.readthedocs.io/en/dev/


1 vote

If efficiency is an issue and you have to build multiple different n-grams, I would consider using the following code (building on Franck's excellent answer):

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~ Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)