How do I tokenize words and write them to another file?


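I can only get the stop-word step working: I read the document, create a new file, and write the text out with the stop words removed. I cannot get word tokenization, the Porter stemmer, or sentence tokenization to work.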

 import io
 from nltk.corpus import stopwords
 from nltk.tokenize import word_tokenize
 from nltk.stem import PorterStemmer
 from nltk.tokenize import sent_tokenize, word_tokenize
 ps = PorterStemmer()
 stop_words = set(stopwords.words('english'))
 file1 = open("data/hw1datasets/100554newsML.txt")

This is the part whose results I cannot get into the new txt file.

 text = fileObj.read()
 stokens = nltk.sent_tokenize(text)
 wtokens = nltk.word_tokenize(text)

This part creates the new file:

 line = file1.read()
 words = line.split()
 for r in words:
     if not r in stop_words:
        appendFile = open('h1doc1.txt','a')
        appendFile.write(" "+r)
python nltk tokenize
2 Answers
1
vote

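I'm not entirely sure what your problem is. I think your code is close, but the file input/output may be what is tripping you up. Don't call open() inside the loop, because that reopens the file on every iteration. Open it once, and make sure you close() the file at the end.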

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # created but not applied in this snippet
stop_words = set(stopwords.words('english'))

# Read the whole source document once, then close it.
file1 = open(r"./100554newsML.txt")
text = file1.read()
file1.close()

stokens = nltk.sent_tokenize(text)  # list of sentences
wtokens = nltk.word_tokenize(text)  # list of word and punctuation tokens

# Open the output file once, outside the loop, and close it when done.
appendFile = open(r'h1doc1.txt', 'w+')
for r in wtokens:
    if r not in stop_words:
        appendFile.write(" " + r)
appendFile.close()

Calling print on stokens and wtokens works fine.

Output of print(stokens)

['Channel tunnel operator Eurotunnel on Monday announced details of a deal giving bank creditors 45.5 percent of the company in return for wiping out 1.0 billion pounds ($1.6 billion) of its massive debts.', 'The long-awaited but highly complex restructuring of nearly nearly nine billion pounds of debt and unpaid interest throws the company a lifeline which could secure what is still likely to be a difficult future.', 'The deal, announced simultaneously in Paris and London, brings the company back from the brink of bankruptcy but leaves current shareholders, who have already seen their investment dwindle, owning only 54.5 percent of the company.', '"We have fixed and capped the interest payments and arranged only to pay what is available in cash," Eurotunnel co-chairman Alastair Morton told reporters at a news conference.', '"Avoiding having to do this again is the name of the game."', 'Morton said the plan provides the Anglo-French company with the medium term financial stability to consolidate its commercial position and develop its operations, adding that the firm was now making a profit before interest.', "Although shareholders will see their holdings diluted, they were offered the prospect of a brighter future and urged to be patient after months of uncertainty while Eurotunnel wrestled to reduce the crippling interest payments negotiated during the tunnel's construction.", 'Eurotunnel, which has taken around half of the market in the busiest cross-Channel route from the European ferry companies, said a strong operating performance could allow it to pay its first dividend within the next 10 years.', 'French co-chairman Patrick Ponsolle told reporters at a Paris news conference that the dividend could come as early as 2004 if the company performed "very well".', 'Eurotunnel and the banks have come up with an ingenious formula to help the company get over the early years of the deal when, despite the swaps of debt for equity and bonds, it will still not be able 
to afford the annual interest bill of 400 million pounds.', 'If its revenue, after costs and depreciation, is less than 400 million pounds, then the company will issue "Stabilisation notes" to a maximum of 1.85 billion pounds to the banks.', 'Eurotunnel would not pay interest on these notes (which would constitute a debt issue) for ten years.', "Analysts said that under the deal, Eurotunnel's ability to finance its debt would become sustainable, at least for a few years.", '"If you look at the current cash flow of between 150 and 200 million pounds a year, what they can\'t find (to meet the bill) they will roll forward into the stabilisation notes, and they can keep that going for seven, eight, nine years," said an analyst at one major investment bank.', '"So they are here for that time," he added.', 'The company said in a statement there was still considerable work to be done to finalise and agree the details of the plan before it can be submitted to shareholders and the bank group for approval, probably early in the Spring of 1997.', 'Eurotunnel said the debt-for-equity swap would be at 130 pence, or 10.40 francs, per share -- considerably below the level of 160 pence widely reported in the run up to the deal\nThe company said a further 3.7 billion pounds of debt would be converted into new financial instruments and existing shareholders would be able to participate in this issue.', "If they choose not to take up free warrants entitling them to subscribe to this, Eurotunnel said shareholders' interests may be reduced further to just over 39 percent of the company by the end of December 2003.", "Eurotunnel's shares, which were suspended last week at 113.5 pence ahead of Monday's announcement, will resume trading on Tuesday.", 'Shareholders and all 225 creditor banks have to agree the deal.', '"I\'m hopeful but I\'m not taking it (approval) for granted," Morton admitted, "Shareholders are pretty angry in France."', 'Asked what would happen if the banks reject the 
deal, Morton said, "Nobody wants a collapse, nobody wants a doomsday scenario."', '($1=.6393 Pound)']

Output of print(wtokens)

['ï', '»', '¿Channel', 'tunnel', 'operator', 'Eurotunnel', 'on', 'Monday', 'announced', 'details', 'of', 'a', 'deal', 'giving', 'bank', 'creditors', '45.5', 'percent', 'of', 'the', 'company', 'in', 'return', 'for', 'wiping', 'out', '1.0', 'billion', 'pounds', '(', '$', '1.6', 'billion', ')', 'of', 'its', 'massive', 'debts', '.', 'The', 'long-awaited', 'but', 'highly', 'complex', 'restructuring', 'of', 'nearly', 'nearly', 'nine', 'billion', 'pounds', 'of', 'debt', 'and', 'unpaid', 'interest', 'throws', 'the', 'company', 'a', 'lifeline', 'which', 'could', 'secure', 'what', 'is', 'still', 'likely', 'to', 'be', 'a', 'difficult', 'future', '.', 'The', 'deal', ',', 'announced', 'simultaneously', 'in', 'Paris', 'and', 'London', ',', 'brings', 'the', 'company', 'back', 'from', 'the', 'brink', 'of', 'bankruptcy', 'but', 'leaves', 'current', 'shareholders', ',', 'who', 'have', 'already', 'seen', 'their', 'investment', 'dwindle', ',', 'owning', 'only', '54.5', 'percent', 'of', 'the', 'company', '.', '``', 'We', 'have', 'fixed', 'and', 'capped', 'the', 'interest', 'payments', 'and', 'arranged', 'only', 'to', 'pay', 'what', 'is', 'available', 'in', 'cash', ',', "''", 'Eurotunnel', 'co-chairman', 'Alastair', 'Morton', 'told', 'reporters', 'at', 'a', 'news', 'conference', '.', '``', 'Avoiding', 'having', 'to', 'do', 'this', 'again', 'is', 'the', 'name', 'of', 'the', 'game', '.', "''", 'Morton', 'said', 'the', 'plan', 'provides', 'the', 'Anglo-French', 'company', 'with', 'the', 'medium', 'term', 'financial', 'stability', 'to', 'consolidate', 'its', 'commercial', 'position', 'and', 'develop', 'its', 'operations', ',', 'adding', 'that', 'the', 'firm', 'was', 'now', 'making', 'a', 'profit', 'before', 'interest', '.', 'Although', 'shareholders', 'will', 'see', 'their', 'holdings', 'diluted', ',', 'they', 'were', 'offered', 'the', 'prospect', 'of', 'a', 'brighter', 'future', 'and', 'urged', 'to', 'be', 'patient', 'after', 'months', 'of', 'uncertainty', 'while', 'Eurotunnel', 
'wrestled', 'to', 'reduce', 'the', 'crippling', 'interest', 'payments', 'negotiated', 'during', 'the', 'tunnel', "'s", 'construction', '.', 'Eurotunnel', ',', 'which', 'has', 'taken', 'around', 'half', 'of', 'the', 'market', 'in', 'the', 'busiest', 'cross-Channel', 'route', 'from', 'the', 'European', 'ferry', 'companies', ',', 'said', 'a', 'strong', 'operating', 'performance', 'could', 'allow', 'it', 'to', 'pay', 'its', 'first', 'dividend', 'within', 'the', 'next', '10', 'years', '.', 'French', 'co-chairman', 'Patrick', 'Ponsolle', 'told', 'reporters', 'at', 'a', 'Paris', 'news', 'conference', 'that', 'the', 'dividend', 'could', 'come', 'as', 'early', 'as', '2004', 'if', 'the', 'company', 'performed', '``', 'very', 'well', "''", '.', 'Eurotunnel', 'and', 'the', 'banks', 'have', 'come', 'up', 'with', 'an', 'ingenious', 'formula', 'to', 'help', 'the', 'company', 'get', 'over', 'the', 'early', 'years', 'of', 'the', 'deal', 'when', ',', 'despite', 'the', 'swaps', 'of', 'debt', 'for', 'equity', 'and', 'bonds', ',', 'it', 'will', 'still', 'not', 'be', 'able', 'to', 'afford', 'the', 'annual', 'interest', 'bill', 'of', '400', 'million', 'pounds', '.', 'If', 'its', 'revenue', ',', 'after', 'costs', 'and', 'depreciation', ',', 'is', 'less', 'than', '400', 'million', 'pounds', ',', 'then', 'the', 'company', 'will', 'issue', '``', 'Stabilisation', 'notes', "''", 'to', 'a', 'maximum', 'of', '1.85', 'billion', 'pounds', 'to', 'the', 'banks', '.', 'Eurotunnel', 'would', 'not', 'pay', 'interest', 'on', 'these', 'notes', '(', 'which', 'would', 'constitute', 'a', 'debt', 'issue', ')', 'for', 'ten', 'years', '.', 'Analysts', 'said', 'that', 'under', 'the', 'deal', ',', 'Eurotunnel', "'s", 'ability', 'to', 'finance', 'its', 'debt', 'would', 'become', 'sustainable', ',', 'at', 'least', 'for', 'a', 'few', 'years', '.', '``', 'If', 'you', 'look', 'at', 'the', 'current', 'cash', 'flow', 'of', 'between', '150', 'and', '200', 'million', 'pounds', 'a', 'year', ',', 'what', 'they', 'ca', 
"n't", 'find', '(', 'to', 'meet', 'the', 'bill', ')', 'they', 'will', 'roll', 'forward', 'into', 'the', 'stabilisation', 'notes', ',', 'and', 'they', 'can', 'keep', 'that', 'going', 'for', 'seven', ',', 'eight', ',', 'nine', 'years', ',', "''", 'said', 'an', 'analyst', 'at', 'one', 'major', 'investment', 'bank', '.', '``', 'So', 'they', 'are', 'here', 'for', 'that', 'time', ',', "''", 'he', 'added', '.', 'The', 'company', 'said', 'in', 'a', 'statement', 'there', 'was', 'still', 'considerable', 'work', 'to', 'be', 'done', 'to', 'finalise', 'and', 'agree', 'the', 'details', 'of', 'the', 'plan', 'before', 'it', 'can', 'be', 'submitted', 'to', 'shareholders', 'and', 'the', 'bank', 'group', 'for', 'approval', ',', 'probably', 'early', 'in', 'the', 'Spring', 'of', '1997', '.', 'Eurotunnel', 'said', 'the', 'debt-for-equity', 'swap', 'would', 'be', 'at', '130', 'pence', ',', 'or', '10.40', 'francs', ',', 'per', 'share', '--', 'considerably', 'below', 'the', 'level', 'of', '160', 'pence', 'widely', 'reported', 'in', 'the', 'run', 'up', 'to', 'the', 'deal', 'The', 'company', 'said', 'a', 'further', '3.7', 'billion', 'pounds', 'of', 'debt', 'would', 'be', 'converted', 'into', 'new', 'financial', 'instruments', 'and', 'existing', 'shareholders', 'would', 'be', 'able', 'to', 'participate', 'in', 'this', 'issue', '.', 'If', 'they', 'choose', 'not', 'to', 'take', 'up', 'free', 'warrants', 'entitling', 'them', 'to', 'subscribe', 'to', 'this', ',', 'Eurotunnel', 'said', 'shareholders', "'", 'interests', 'may', 'be', 'reduced', 'further', 'to', 'just', 'over', '39', 'percent', 'of', 'the', 'company', 'by', 'the', 'end', 'of', 'December', '2003', '.', 'Eurotunnel', "'s", 'shares', ',', 'which', 'were', 'suspended', 'last', 'week', 'at', '113.5', 'pence', 'ahead', 'of', 'Monday', "'s", 'announcement', ',', 'will', 'resume', 'trading', 'on', 'Tuesday', '.', 'Shareholders', 'and', 'all', '225', 'creditor', 'banks', 'have', 'to', 'agree', 'the', 'deal', '.', '``', 'I', "'m", 'hopeful', 
'but', 'I', "'m", 'not', 'taking', 'it', '(', 'approval', ')', 'for', 'granted', ',', "''", 'Morton', 'admitted', ',', '``', 'Shareholders', 'are', 'pretty', 'angry', 'in', 'France', '.', "''", 'Asked', 'what', 'would', 'happen', 'if', 'the', 'banks', 'reject', 'the', 'deal', ',', 'Morton', 'said', ',', '``', 'Nobody', 'wants', 'a', 'collapse', ',', 'nobody', 'wants', 'a', 'doomsday', 'scenario', '.', "''", '(', '$', '1=.6393', 'Pound', ')']
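
The first three tokens above ('ï', '»', '¿Channel') look like a UTF-8 byte-order mark (bytes EF BB BF) that was decoded with a single-byte encoding such as Latin-1 or cp1252. Below is a minimal sketch of that effect and one possible remedy, assuming the source file really is UTF-8 with a BOM. The file name sample_bom.txt and its contents are made up for the demonstration.

```python
import os
import tempfile

# Hypothetical sample file: UTF-8 text preceded by a byte-order mark.
path = os.path.join(tempfile.mkdtemp(), "sample_bom.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfChannel tunnel operator")

# Decoding with a single-byte codec turns the BOM into three visible characters.
with open(path, encoding="latin-1") as f:
    garbled = f.read()

# The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    clean = f.read()

print(repr(garbled[:10]))
print(repr(clean[:10]))
```

If that assumption holds, passing encoding="utf-8-sig" to the open() call that reads 100554newsML.txt should keep those stray characters out of wtokens and h1doc1.txt.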

Contents of h1doc1.txt

 ï » ¿Channel tunnel operator Eurotunnel Monday announced details deal giving bank creditors 45.5 percent company return wiping 1.0 billion pounds ( $ 1.6 billion ) massive debts . The long-awaited highly complex restructuring nearly nearly nine billion pounds debt unpaid interest throws company lifeline could secure still likely difficult future . The deal , announced simultaneously Paris London , brings company back brink bankruptcy leaves current shareholders , already seen investment dwindle , owning 54.5 percent company . `` We fixed capped interest payments arranged pay available cash , '' Eurotunnel co-chairman Alastair Morton told reporters news conference . `` Avoiding name game . '' Morton said plan provides Anglo-French company medium term financial stability consolidate commercial position develop operations , adding firm making profit interest . Although shareholders see holdings diluted , offered prospect brighter future urged patient months uncertainty Eurotunnel wrestled reduce crippling interest payments negotiated tunnel 's construction . Eurotunnel , taken around half market busiest cross-Channel route European ferry companies , said strong operating performance could allow pay first dividend within next 10 years . French co-chairman Patrick Ponsolle told reporters Paris news conference dividend could come early 2004 company performed `` well '' . Eurotunnel banks come ingenious formula help company get early years deal , despite swaps debt equity bonds , still able afford annual interest bill 400 million pounds . If revenue , costs depreciation , less 400 million pounds , company issue `` Stabilisation notes '' maximum 1.85 billion pounds banks . Eurotunnel would pay interest notes ( would constitute debt issue ) ten years . Analysts said deal , Eurotunnel 's ability finance debt would become sustainable , least years . 
`` If look current cash flow 150 200 million pounds year , ca n't find ( meet bill ) roll forward stabilisation notes , keep going seven , eight , nine years , '' said analyst one major investment bank . `` So time , '' added . The company said statement still considerable work done finalise agree details plan submitted shareholders bank group approval , probably early Spring 1997 . Eurotunnel said debt-for-equity swap would 130 pence , 10.40 francs , per share -- considerably level 160 pence widely reported run deal The company said 3.7 billion pounds debt would converted new financial instruments existing shareholders would able participate issue . If choose take free warrants entitling subscribe , Eurotunnel said shareholders ' interests may reduced 39 percent company end December 2003 . Eurotunnel 's shares , suspended last week 113.5 pence ahead Monday 's announcement , resume trading Tuesday . Shareholders 225 creditor banks agree deal . `` I 'm hopeful I 'm taking ( approval ) granted , '' Morton admitted , `` Shareholders pretty angry France . '' Asked would happen banks reject deal , Morton said , `` Nobody wants collapse , nobody wants doomsday scenario . '' ( $ 1=.6393 Pound )
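
In the file contents above, lowercase stop words such as "on" and "of" are gone, but capitalized ones like "The", "If" and "We" remain. NLTK's English stop-word list is lowercase and the comparison is case-sensitive. The Porter stemmer is also never applied. Here is a minimal sketch of one way to handle both. The helper write_filtered, the tiny stop-word set, the sample tokens, and the output name demo_out.txt are hypothetical stand-ins. In the real script the tokens would come from nltk.word_tokenize(text), the stop words from stopwords.words('english'), and stem could be PorterStemmer().stem.

```python
import os
import tempfile


def write_filtered(tokens, stop_words, out_path, stem=None):
    """Write tokens to out_path, skipping stop words case-insensitively.

    stem is an optional callable applied to each token that is kept.
    """
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue
        kept.append(stem(tok) if stem else tok)
    # The with-block opens the file once and closes it automatically.
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(" ".join(kept))
    return kept


# Stand-in data for the demonstration only.
demo_stop_words = {"the", "of", "a", "on"}
demo_tokens = ["The", "deal", "announced", "on", "Monday"]
out_path = os.path.join(tempfile.mkdtemp(), "demo_out.txt")

kept = write_filtered(demo_tokens, demo_stop_words, out_path)
print(kept)
```

Using a with-block instead of explicit open() and close() calls means the file is closed even if an exception is raised while writing.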

0
votes

There is a nice command-line tool at https://github.com/nltk/nltk/blob/develop/nltk/cli.py

Install NLTK with the CLI:

pip install -U nltk[cli]

To use it, call nltk tokenize from a terminal or command prompt:

$ nltk tokenize --help    

Usage: nltk tokenize [OPTIONS]

  This command tokenizes text stream using nltk.word_tokenize

Options:
  -l, --language TEXT      The language for the Punkt sentence tokenization.
  -l, --preserve-line      An option to keep the preserve the sentence and not
                           sentence tokenize it.
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -d, --delimiter TEXT     Specify delimiter to join the tokens.
  -h, --help               Show this message and exit.

Example usage:

nltk tokenize -l en -j 4 --preserve-line -d " " -e utf8 < 100554newsML.txt > h1doc1.txt