How do I remove punctuation using the NLTK tokenizer?

Question

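I have just started using NLTK and don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words. How can I get rid of the punctuation? Also, word_tokenize does not work with multiple sentences: periods are attached to the last word.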

Answer 1

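Take a look at the other tokenization options that NLTK provides in the nltk.tokenize package. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else: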

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
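
If splitting Eighty-seven or can't into pieces is not what you want, the pattern can be widened. The following is a minimal sketch that uses the standard-library re module as a stand-in for RegexpTokenizer, so it runs without NLTK. The pattern and the helper name tokenize_keep_inner are my own assumptions, not part of the answer above.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by
# more runs joined by a single apostrophe or hyphen (can't, Eighty-seven).
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def tokenize_keep_inner(text):
    """Return word tokens, keeping word-internal apostrophes and hyphens."""
    return WORD_RE.findall(text)

print(tokenize_keep_inner('Eighty-seven miles to go, yet.  Onward!'))
# ['Eighty-seven', 'miles', 'to', 'go', 'yet', 'Onward']
print(tokenize_keep_inner("I can't do this now."))
# ['I', "can't", 'do', 'this', 'now']
```

Because the group is non-capturing, the same pattern string should also be usable with RegexpTokenizer, though I have only sketched the plain re version here.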

Answer 2

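You do not really need NLTK to remove punctuation. Plain Python can do it. For byte strings (Python 2 str):

import string
s = '... some string with punctuation ...'
# Python 2 only: str.translate(table, deletechars); on Python 3 this raises TypeError
s = s.translate(None, string.punctuation)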

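Or for unicode strings (the same approach works for str on Python 3):

import string
# map each punctuation code point to None so translate() deletes it
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)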

Then use this string in your tokenizer.

P.S. The string module has some other sets of characters that can be removed (such as digits).
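
To illustrate the P.S., here is a small Python 3 sketch that deletes digits along with punctuation in one pass. The helper name strip_chars and the sample sentence are made up for illustration.

```python
import string

def strip_chars(text, chars=string.punctuation + string.digits):
    """Delete every character of `chars` from `text` (Python 3 str)."""
    # Map each code point to None, which tells str.translate to drop it.
    table = {ord(ch): None for ch in chars}
    return text.translate(table)

print(strip_chars('Room 42, floor 3: ready!'))
# Room  floor  ready
```

Note the doubled spaces left behind where the digits used to be; splitting on whitespace afterwards absorbs them.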

Answer 3

The code below removes all punctuation marks as well as non-alphabetic characters. Copied from the NLTK book.

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words = [word.lower() for word in words if word.isalpha()]

print(words)

Output:

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
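
Note what the isalpha() filter discards: the n't half of can't, the 'm of I'm, and every number. A quick way to see the effect without running the tokenizer is to apply the filters to a hand-written token list. The list below is typed in by hand to resemble word_tokenize output; it is an assumption, not something produced by NLTK here.

```python
# Hand-written tokens, resembling what word_tokenize might return for
# "I can't pay 4 dollars, sd232." (an assumption, not real NLTK output).
tokens = ['I', 'ca', "n't", 'pay', '4', 'dollars', ',', 'sd232', '.']

alpha_only = [t.lower() for t in tokens if t.isalpha()]
alnum_only = [t.lower() for t in tokens if t.isalnum()]

print(alpha_only)  # ['i', 'ca', 'pay', 'dollars']
print(alnum_only)  # ['i', 'ca', 'pay', '4', 'dollars', 'sd232']
```

If numbers or mixed tokens matter to you, isalnum() is the gentler filter; neither keeps the n't fragment.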

Answer 4

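As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). If you have unicode strings, make sure each one is a unicode object (not a 'str' encoded with some encoding such as 'utf-8').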

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
# Python 3: filter() returns an iterator, so wrap it in list() before printing
print(list(filter(lambda word: word not in ',-', tokens)))

Answer 5

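I just used the following code, which removed all the punctuation:

import nltk

# raw is assumed to hold the input text as a string
tokens = nltk.wordpunct_tokenize(raw)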

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

Answer 6

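Sincere question: what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong, because words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenization, which is very likely to affect your program negatively.

Hence the solution is to tokenize first and then remove the punctuation tokens.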

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

...and then, if you wish, you can replace certain tokens such as 'm with am.
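
For the replacement step, a plain dictionary lookup is probably enough. This is only a sketch: the mapping below is a hypothetical, incomplete table I made up, and it deliberately leaves out 's, which can mean is, has, or a possessive.

```python
# Hypothetical, incomplete mapping from clitic tokens to full words.
CLITICS = {"'m": 'am', "'re": 'are', "'ll": 'will', "'ve": 'have', "n't": 'not'}

def expand_clitics(tokens):
    """Replace known clitic tokens; leave everything else unchanged."""
    return [CLITICS.get(tok, tok) for tok in tokens]

tokens = ['I', "'m", 'a', 'southern', 'salesman']
print(expand_clitics(tokens))
# ['I', 'am', 'a', 'southern', 'salesman']
```

Irregular splits such as ca + n't would need extra handling, since mapping n't to not still leaves ca behind.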

Answer 7

I think you need some sort of regular expression matching (the following code is for Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
# re.escape keeps characters such as ], \ and ^ literal inside the character class
ll = [x for x in l if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which cannot be obtained from regex tokenizers such as wordpunct_tokenize.

Answer 8

Without NLTK (Python 3.x), you can do this in one line.

import string
string_text = string_text.translate(str.maketrans('', '', string.punctuation))
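
One thing to watch when deleting punctuation outright is that Eighty-seven becomes Eightyseven. If that matters, a variant is to translate each punctuation character to a space and split afterwards. A minimal sketch, with the helper name words_without_punct being my own:

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it.
TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def words_without_punct(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(TO_SPACE).split()

print(words_without_punct('Eighty-seven miles to go, yet.  Onward!'))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```

The trade-off is that can't turns into can and t, so pick whichever behavior hurts less for your data.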

Answer 9

I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

If you want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:

import enchant
d = enchant.Dict("en_US")
d.check("Hello")
d.check("Helo")
d.suggest("Helo")

Answer 10

Remove punctuation (the code below removes . as well as the other punctuation characters it handles):

import sys
import unicodedata
from nltk.tokenize import word_tokenize

tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer has punctuation
w = word_tokenize(text_string)  # now tokenize the string

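Sample input and output (the output also appears to have had stop words such as "in" and "of" removed, a step the snippet above does not show):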

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
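
The point of going through unicodedata rather than string.punctuation is that the latter is ASCII only. Below is a self-contained sketch (no NLTK needed) showing the difference; the sample text and the helper name strip_unicode_punct are made up. One caveat: characters such as + and $ belong to Unicode symbol categories (S*), not punctuation (P*), so this table leaves them alone.

```python
import sys
import unicodedata

# Every code point whose Unicode general category starts with 'P'.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def strip_unicode_punct(text):
    """Delete all Unicode punctuation characters from text."""
    return text.translate(PUNCT_TABLE)

# Curly quotes, an ellipsis and full-width CJK punctuation are removed.
sample = '\u201cHello\u201d world\u2026 \u4f60\u597d\uff0c\u4e16\u754c\uff01'
print(strip_unicode_punct(sample).split())
# ['Hello', 'world', '你好世界']
print(strip_unicode_punct('cost 3.89 + tax'))
# cost 389 + tax
```

The second call shows the same 3.89 to 389 effect visible in the sample output above, which is the price of stripping punctuation before tokenizing.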

Answer 11

Just adding to the solution by @rmalouf: this version will not include any numbers, because \w is equivalent to [a-zA-Z0-9_] (for ASCII text):

from nltk.tokenize import RegexpTokenizer
# the + is needed; without it every letter becomes its own token
tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')
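
To see how these character classes differ, here is a quick comparison using re.findall from the standard library (RegexpTokenizer is not needed for the illustration). The sample string is made up, and [^\W\d_]+ is a common idiom I am adding for Unicode-aware letter runs, not something from the answers here. In Python 3, \w on str also matches non-ASCII letters, which is why the ASCII-only class splits the accented word.

```python
import re

text = 'Eighty-seven miles to go, 42 cafés'

print(re.findall(r'\w+', text))
# ['Eighty', 'seven', 'miles', 'to', 'go', '42', 'cafés']
print(re.findall(r'[a-zA-Z]+', text))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'caf', 's']
print(re.findall(r'[^\W\d_]+', text))
# ['Eighty', 'seven', 'miles', 'to', 'go', 'cafés']
```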

Answer 12

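from string import punctuation only provides a string variable, punctuation, containing the special characters:

!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

It can be customized, for example by removing the single quote so that apostrophes stay where they are in words like it's. You can assign your own. I renamed punctuation to punctuations, with an added 's', and it can be plugged into some of the other answers.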

punctuations = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'  # \' removed
text = " It'll be ok-ish!?? "
# strip the customized set from both ends of each whitespace-separated word
text = ' '.join(filter(None, (word.strip(punctuations) for word in text.split())))

... where text becomes:

"It'll be ok-ish"

Answer 13

Just filter out any resulting words that happen to be in string.punctuation:

from nltk import word_tokenize
import string

text = "Hello, Tom; here is your umbrella - do you like it?"
words = [w for w in word_tokenize(text) if w not in string.punctuation]
words

Output:

['Hello', 'Tom', 'here', 'is', 'your', 'umbrella', 'do', 'you', 'like', 'it']
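
A caveat with w not in string.punctuation: on strings, in is a substring test, so multi-character punctuation tokens such as ... or -- are not substrings of string.punctuation and slip through. A stricter check is to drop a token only when every character in it is punctuation. The token list below is written by hand for illustration (it is not actual word_tokenize output), and is_punct is a name I made up.

```python
import string

PUNCT = set(string.punctuation)

def is_punct(token):
    """True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Hand-written tokens for illustration.
tokens = ['Hello', ',', '``', 'Tom', "''", '...', "n't", '--']

print([t for t in tokens if t not in string.punctuation])
# ['Hello', '``', 'Tom', "''", '...', "n't", '--']
print([t for t in tokens if not is_punct(t)])
# ['Hello', 'Tom', "n't"]
```

The per-character check still keeps n't, because that token contains letters.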
