Find a specific concordance index using nltk

Question (votes: 0, answers: 2)

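I use the code below to get a concordance from nltk and then show the index of each match. I get the results shown below. So far so good.

How do I look up the index of just one specific concordance line? In this small example it is easy enough to match each concordance line to its index by eye, but if I have 300 concordance lines, I want to find the index for one of them. .index doesn't take multiple items in a list as an argument.

Can someone point me to the command/structure I should use to display the indices together with the concordance lines? I've included an example below of a more useful result; it goes outside nltk to get a separate list of indices. I'd like to combine the two into one result, but how do I get there?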

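import nltk
from nltk.tokenize import word_tokenize
from nltk.text import Text

nltk.download('punkt')  # tokenizer data required by word_tokenize

# Read the whole book; the with-block closes the file afterwards.
with open('mobydick.txt', 'r') as moby:
    moby_read = moby.read()

moby_text = Text(word_tokenize(moby_read))

# Prints the concordance lines for the query word.
moby_text.concordance("monstrous")

# Separate list of token positions where the token equals "monstrous".
moby_indices = [index for (index, item) in enumerate(moby_text) if item == "monstrous"]

print(moby_indices)

Output: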
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u

[858, 1124, 9359, 9417, 32173, 94151, 122253, 122269, 162203, 205095]

Ideally I would like something like this.

Displaying 11 of 11 matches:
[858] ong the former , one was of a most monstrous size . ... This came towards us , 
[1124] N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
[9359] ll over with a heathenish array of monstrous clubs and spears . Some were thick
[9417] d as you gazed , and wondered what monstrous cannibal and savage could ever hav
[32173] that has survived the flood ; most monstrous and most mountainous ! That Himmal
[94151] they might scout at Moby Dick as a monstrous fable , or still worse and more de
[122253] of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
[122269] ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
[162203] ere to enter upon those still more monstrous stories of them which are to be fo
[162203] ght have been rummaged out of this monstrous cabinet there is no telling . But 
[205095] e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u
python nltk
2 Answers

0 votes

Each object in the concordance list has an offset:

from itertools import zip_longest
from nltk.book import text1

def pad_int(number, length=8, pad_with=" "):
    return "".join(reversed([ch if ch else pad_with for i, ch in zip_longest(range(length), reversed(str(number)))]))

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
  left, right = " ".join(con_line.left).strip()[-width:], " ".join(con_line.right).strip()[:width]
  offset = pad_int(con_line.offset)
  print(f"[{offset}]\t{left}  {con_line.query}  {right}")

[out]:

[     899]  , appeared . Among the former , one was of a most  monstrous  size . ... This came towards us , open - mouthed
[    1176]   BACON ' S VERSION OF THE PSALMS . " Touching that  monstrous  bulk of the whale or ork we have received nothing 
[    9530]  entry was hung all over with a heathenish array of  monstrous  clubs and spears . Some were thickly set with glit
[    9594]  r . You shuddered as you gazed , and wondered what  monstrous  cannibal and savage could ever have gone a death -
[   32717]  t animated mass that has survived the flood ; most  monstrous  and most mountainous ! That Himmalehan , salt - se
[   96103]  f the fishery , they might scout at Moby Dick as a  monstrous  fable , or still worse and more detestable , a hid
[  122521]  lt since the death of Radney .'" CHAPTER 55 Of the  Monstrous  Pictures of Whales . I shall ere long paint to you
[  124761]  Pictures of Whaling Scenes . In connexion with the  monstrous  pictures of whales , I am strongly tempted here to
[  124777]  rongly tempted here to enter upon those still more  monstrous  stories of them which are to be found in certain b
[  165681]  other marvels might have been rummaged out of this  monstrous  cabinet there is no telling . But a sudden stop wa
[  209645]  which are made of Whale - Bones ; for Whales of a  monstrous  size are oftentimes cast up dead upon that shore .
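
As a side note, the pad_int helper above right-aligns the offset in an 8-character field. Python's built-in format specification can do the same padding directly. A minimal sketch (format_offset is a name made up for this illustration, not part of nltk):

```python
def format_offset(offset, length=8):
    """Right-align an integer offset in a field of `length` characters."""
    # ">" means right-align; the nested {length} sets the field width.
    return f"{offset:>{length}}"


# Sample offsets taken from the output above.
for offset in (899, 1176, 209645):
    print(f"[{format_offset(offset)}]")
```

Numbers longer than the field are printed in full rather than truncated, which matches what pad_int does.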


What does offset refer to?

It is the index of the query token's position in the token list.

from nltk.book import text1

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
  print(text1.tokens[con_line.offset - 2], text1.tokens[con_line.offset - 1], text1.tokens[con_line.offset])
  

[out]:

a most monstrous
Touching that monstrous
array of monstrous
wondered what monstrous
; most monstrous
as a monstrous
Of the Monstrous
with the monstrous
still more monstrous
of this monstrous
of a monstrous
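
This also sheds light on a detail in the question: concordance matching ignores case, so it picks up the capitalised "Monstrous" as well, whereas the comparison item == "monstrous" is case-sensitive. That would account for 10 indices against 11 concordance lines. A minimal sketch of a case-insensitive lookup, using a short made-up token list in place of the Moby Dick text (find_token_indices is a hypothetical helper name):

```python
def find_token_indices(tokens, word):
    """Return every position where `word` occurs in `tokens`, ignoring case."""
    target = word.lower()
    return [i for i, tok in enumerate(tokens) if tok.lower() == target]


# Made-up sample tokens; in the question this would be moby_text.tokens.
tokens = ["Of", "the", "Monstrous", "Pictures", "of", "Whales", ".",
          "A", "monstrous", "size"]

# Case-insensitive lookup finds both spellings.
print(find_token_indices(tokens, "monstrous"))

# The case-sensitive comparison from the question misses "Monstrous".
print([i for i, tok in enumerate(tokens) if tok == "monstrous"])
```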

0 votes

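We can use the concordance_list function (https://www.nltk.org/api/nltk.text.html), which lets us specify the width and the number of lines. We then iterate over each line, take its offset (the token position of the match), wrap it in brackets '[' ']', and place the roi (i.e. 'monstrous') between the left and right words of each line: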

import nltk

roi = 'monstrous'

# Read the text; the with-block closes the file afterwards.
with open('/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/TEXT/mobydick.txt', 'r') as some_text:
    moby_read = some_text.read()

moby_text = nltk.Text(nltk.word_tokenize(moby_read))
# Note: moby_text is rebound here from the Text object to a list of ConcordanceLine objects.
moby_text = moby_text.concordance_list(roi, width=22, lines=1000)
for line in moby_text:
    print('[' + str(line.offset) + '] ' + ' '.join(line.left) + ' ' + roi + ' ' + ' '.join(line.right))

Or, if you find it more readable (with import numpy as np):

for line in moby_text:
    print('[' + str(line.offset) + '] ', np.append(' '.join(np.append(np.array(line.left), roi)), np.array(' '.join(line.right))))

Output (my offsets do not match yours because I used this source, https://gist.github.com/StevenClontz/4445774, which has different spacing and line numbering):

[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
[1652] the Psalms. ' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears .
[9933] gazed , and wondered what monstrous cannibal and savage could
[32736] survived the Flood ; most monstrous and most mountainous !
[95115] scout at Moby-Dick as a monstrous fable , or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field , Desmarest , monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales ,
[123541] enter upon those still more monstrous stories of them which

If we want to take punctuation and the like into account, we can do something like this:

PUNCTUATION = {'.', ',', ';', '!'}

for line in moby_text:
    return_text = '[' + str(line.offset) + '] '
    # Walk the left context, the query word, then the right context.
    for word in list(line.left) + [roi] + list(line.right):
        if word in PUNCTUATION:
            # Attach punctuation to the previous word without a space.
            return_text += word
        elif return_text[-1] == ' ':
            return_text += word
        else:
            return_text += ' ' + word
    print(return_text)

Output:

[494] 306 LV. OF THE monstrous PICTURES OF WHALES.
[1385] one was of a most monstrous size. * *
[1652] the Psalms.' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears.
[9933] gazed, and wondered what monstrous cannibal and savage could
[32736] survived the Flood; most monstrous and most mountainous!
[95115] scout at Moby-Dick as a monstrous fable, or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field, Desmarest, monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales,
[123541] enter upon those still more monstrous stories of them which

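But you may need to adjust it, since I have not given much thought to the different contexts that can come up (e.g. '*', numbers, all-caps chapter titles, Roman numerals, etc.). How the output text should look is really up to you; I am just providing an example.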
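
One way to keep that adjustable is to pull the joining rule into a small function. The sketch below is only an assumption about what a reasonable rule looks like: join_tokens and the NO_SPACE_BEFORE set are names invented here, and the token list is a hand-written stand-in for line.left + [roi] + line.right.

```python
# Tokens that should attach to the preceding word without a space.
# This set is an illustrative assumption; extend it for your own text.
NO_SPACE_BEFORE = {".", ",", ";", "!", "?", ":"}


def join_tokens(tokens):
    """Join tokens with spaces, except before closing punctuation."""
    pieces = []
    for tok in tokens:
        if pieces and tok in NO_SPACE_BEFORE:
            pieces[-1] += tok  # glue punctuation onto the previous token
        else:
            pieces.append(tok)
    return " ".join(pieces)


# Hand-written stand-in for line.left + [roi] + line.right.
sample = ["survived", "the", "Flood", ";", "most", "monstrous",
          "and", "most", "mountainous", "!"]
print(join_tokens(sample))
```

Handling for quotes, asterisks or chapter headings could then be added in one place without touching the printing loop.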

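Note: width in the concordance_list function controls how much context is kept on each side of the match. In nltk's implementation the number of context tokens is derived as roughly width // 4, so if we set it to 4, the first line will print:

[494] THE monstrous

because 4 // 4 leaves a single token, 'THE', on the left. Setting it to 3 gives 3 // 4 = 0 context tokens, which drops 'THE', the word immediately to the left of 'monstrous':

[494] monstrous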

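The lines parameter of the concordance_list function, on the other hand, is the maximum number of lines returned, so if we only want the first two lines containing 'monstrous' (lines=2):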

[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
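
To try these parameters without the Moby Dick file, the sketch below builds a Text from a short made-up sentence split on whitespace. It assumes nltk is installed and that the objects returned by concordance_list expose left, query, right and offset, as used in the answers above. The token list is invented for illustration.

```python
from nltk.text import Text

# Made-up tokens; str.split() avoids needing the punkt tokenizer data.
tokens = ("The monstrous whale rose . A Monstrous wave followed "
          "the monstrous whale .").split()
text = Text(tokens)

# lines caps how many matches are returned; matching ignores case.
all_lines = text.concordance_list("monstrous", width=22, lines=1000)
first_two = text.concordance_list("monstrous", width=22, lines=2)

for line in all_lines:
    # line.offset is the position of the match in text.tokens.
    left = " ".join(line.left)
    right = " ".join(line.right)
    print(f"[{line.offset}] {left} {line.query} {right}")
```

Printing line.query instead of the search string keeps the original capitalisation of each match.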