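I am using the code below to get a concordance from nltk and then show the index of each concordance line. I get the results shown below. So far so good.
How do I look up the index for just one specific concordance line? In this small example it is easy to match the concordance lines to the indices by eye, but if I had 300 concordance lines and wanted the index of one of them, that would not be practical.
.index does not take multiple items from a list as an argument.
Can someone point me to the command or structure I should use to get the indices displayed together with the concordance lines? Below I have attached an example of a more useful result; at the moment I get a separate list of indices outside of nltk. I would like to combine the two into one result, but how do I get there?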
import nltk
nltk.download('punkt')

# Read the whole book; the context manager closes the file afterwards.
with open('mobydick.txt', 'r') as moby:
    moby_read = moby.read()

moby_text = nltk.Text(nltk.word_tokenize(moby_read))
moby_text.concordance("monstrous")

# Token positions of the query word. Note that == is case-sensitive,
# whereas concordance() ignores case.
moby_indices = [index for (index, item) in enumerate(moby_text) if item == "monstrous"]
print(moby_indices)
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u
[858, 1124, 9359, 9417, 32173, 94151, 122253, 122269, 162203, 205095]
Ideally I would like something like this.
Displaying 11 of 11 matches:
[858] ong the former , one was of a most monstrous size . ... This came towards us ,
[1124] N OF THE PSALMS . `` Touching that monstrous bulk of the whale or ork we have r
[9359] ll over with a heathenish array of monstrous clubs and spears . Some were thick
[9417] d as you gazed , and wondered what monstrous cannibal and savage could ever hav
[32173] that has survived the flood ; most monstrous and most mountainous ! That Himmal
[94151] they might scout at Moby Dick as a monstrous fable , or still worse and more de
[122253] of Radney . ' '' CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
[122269] ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
[162203] ere to enter upon those still more monstrous stories of them which are to be fo
[162203] ght have been rummaged out of this monstrous cabinet there is no telling . But
[205095] e of Whale-Bones ; for Whales of a monstrous size are oftentimes cast up dead u
Each concordance line object carries an offset:
from itertools import zip_longest
from nltk.book import text1

def pad_int(number, length=8, pad_with=" "):
    # Left-pad the number with pad_with up to the given length.
    return "".join(reversed([ch if ch else pad_with for i, ch in zip_longest(range(length), reversed(str(number)))]))

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
    left, right = " ".join(con_line.left).strip()[-width:], " ".join(con_line.right).strip()[:width]
    offset = pad_int(con_line.offset)
    print(f"[{offset}]\t{left} {con_line.query} {right}")
[out]:
[ 899] , appeared . Among the former , one was of a most monstrous size . ... This came towards us , open - mouthed
[ 1176] BACON ' S VERSION OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have received nothing
[ 9530] entry was hung all over with a heathenish array of monstrous clubs and spears . Some were thickly set with glit
[ 9594] r . You shuddered as you gazed , and wondered what monstrous cannibal and savage could ever have gone a death -
[ 32717] t animated mass that has survived the flood ; most monstrous and most mountainous ! That Himmalehan , salt - se
[ 96103] f the fishery , they might scout at Moby Dick as a monstrous fable , or still worse and more detestable , a hid
[ 122521] lt since the death of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere long paint to you
[ 124761] Pictures of Whaling Scenes . In connexion with the monstrous pictures of whales , I am strongly tempted here to
[ 124777] rongly tempted here to enter upon those still more monstrous stories of them which are to be found in certain b
[ 165681] other marvels might have been rummaged out of this monstrous cabinet there is no telling . But a sudden stop wa
[ 209645] which are made of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead upon that shore .
The offset is the index of the position where the query token is located:
from nltk.book import text1

width = 50
for con_line in text1.concordance_list("monstrous", width=width):
    print(text1.tokens[con_line.offset - 2], text1.tokens[con_line.offset - 1], text1.tokens[con_line.offset])
[out]:
a most monstrous
Touching that monstrous
array of monstrous
wondered what monstrous
; most monstrous
as a monstrous
Of the Monstrous
with the monstrous
still more monstrous
of this monstrous
of a monstrous
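Since the offset is simply an index into the token list, the same numbers can be reproduced without nltk. The sketch below is only an illustration on a made-up token list; find_offsets is a hypothetical helper name, not part of nltk. The lowercase comparison is there because concordance matching ignores case, which is presumably why a plain item == "monstrous" comparison finds fewer hits than concordance() displays.

```python
def find_offsets(tokens, query):
    """Return the index of every token equal to `query`, ignoring case."""
    query = query.lower()
    return [i for i, tok in enumerate(tokens) if tok.lower() == query]


# Made-up sample text, split on whitespace purely for illustration.
tokens = "Of the Monstrous Pictures of Whales and the monstrous stories".split()

offsets = find_offsets(tokens, "monstrous")
for off in offsets:
    # Show up to two tokens of left context plus the match itself.
    print(f"[{off}]", " ".join(tokens[max(0, off - 2):off + 1]))
```

Both the capitalised and the lowercase occurrence are found, so the list lines up one-to-one with what a case-insensitive concordance would show.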
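We can use the concordance_list function (https://www.nltk.org/api/nltk.text.html), which lets us specify the width and the number of lines. We then iterate over the lines, take each line's 'offset' (i.e. the token position of the match), wrap it in brackets '[' ']', and put the roi (i.e. 'monstrous') between the left and right words of each line: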
import nltk

roi = 'monstrous'  # the query word
with open('/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/TEXT/mobydick.txt', 'r') as some_text:
    moby_read = some_text.read()

moby_text = nltk.Text(nltk.word_tokenize(moby_read))
# Note: moby_text is rebound here from the Text object to its list of concordance lines.
moby_text = moby_text.concordance_list(roi, width=22, lines=1000)
for line in moby_text:
    print('[' + str(line.offset) + '] ' + ' '.join(line.left) + ' ' + roi + ' ' + ' '.join(line.right))
Or, if you find it more readable (this needs import numpy as np):
for line in moby_text:
    print('[' + str(line.offset) + '] ', np.append(' '.join(np.append(np.array(line.left), roi)), np.array(' '.join(line.right))))
Output (my numbers don't match yours because I used this source: https://gist.github.com/StevenClontz/4445774, which just has different spacing and line numbering):
[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
[1652] the Psalms. ' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears .
[9933] gazed , and wondered what monstrous cannibal and savage could
[32736] survived the Flood ; most monstrous and most mountainous !
[95115] scout at Moby-Dick as a monstrous fable , or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field , Desmarest , monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales ,
[123541] enter upon those still more monstrous stories of them which
If we want to take punctuation and so on into account, we can do something like this:
punctuation = {'.', ',', ';', '!'}
for line in moby_text:
    return_text = '[' + str(line.offset) + '] '
    for word in line.left:
        if word in punctuation:
            return_text += word  # attach punctuation to the previous word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    # Add a space before the query word unless one is already there.
    return_text += ('' if return_text[-1] == ' ' else ' ') + roi + ' '
    for word in line.right:
        if word in punctuation:
            return_text += word
        else:
            return_text += ' ' + word if return_text[-1] != ' ' else word
    print(return_text)
Output:
[494] 306 LV. OF THE monstrous PICTURES OF WHALES.
[1385] one was of a most monstrous size. * *
[1652] the Psalms.' Touching that monstrous bulk of the whale
[9874] with a heathenish array of monstrous clubs and spears.
[9933] gazed, and wondered what monstrous cannibal and savage could
[32736] survived the Flood; most monstrous and most mountainous!
[95115] scout at Moby-Dick as a monstrous fable, or still
[121328] '' CHAPTER LV OF THE monstrous PICTURES OF WHALES I
[121991] this bookbinder 's fish an monstrous PICTURES OF WHALES 333
[122749] same field, Desmarest, monstrous PICTURES OF WHALES 335
[123525] SCENES IN connection with the monstrous pictures of whales,
[123541] enter upon those still more monstrous stories of them which
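But you may need to adjust it, since I didn't put much thought into the different contexts that might come up (e.g. '*', numbers, all-caps chapter titles, Roman numerals, etc.). How the output text should look is really up to you; I'm just providing an example.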
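One way to make such adjustments easier is to pull the joining logic into a small helper, so the punctuation set lives in one place. This is only a sketch: join_tokens and format_line are hypothetical names, the extra '?' and ':' entries are my own assumption about what you might want attached, and the sample tokens are copied from one of the output lines above rather than read from the book.

```python
# Tokens that should be glued to the preceding word (an assumption; extend as needed).
PUNCTUATION = {".", ",", ";", "!", "?", ":"}


def join_tokens(tokens):
    """Join tokens with single spaces, attaching punctuation to the previous token."""
    text = ""
    for tok in tokens:
        if tok in PUNCTUATION or not text:
            text += tok
        else:
            text += " " + tok
    return text


def format_line(offset, left, query, right):
    """Build one '[offset] left query right' line from token lists."""
    return f"[{offset}] {join_tokens(list(left) + [query] + list(right))}"


print(format_line(9874,
                  ["with", "a", "heathenish", "array", "of"],
                  "monstrous",
                  ["clubs", "and", "spears", "."]))
```

With real data you would call format_line(line.offset, line.left, roi, line.right) inside the loop over the concordance lines.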
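Note: the width argument of concordance_list controls how much context is kept on each side of the query. As far as I can tell from the nltk source, the line.left and line.right token lists used here hold roughly width // 4 tokens each (only the pre-formatted line.line is trimmed by characters), so if we set width to 4, the first line would print:
[494] THE monstrous
because a single token of left context is kept, while setting it to 3 keeps none and drops 'THE', the word immediately to the left of 'monstrous':
[494] monstrous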
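The lines argument of concordance_list, in turn, is the maximum number of lines returned, so if we only want the first two lines containing 'monstrous' (that is, lines=2):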
[494] 306 LV . OF THE monstrous PICTURES OF WHALES .
[1385] one was of a most monstrous size . * *
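To experiment with width and lines without loading the whole book, something like the following can be used. It is a minimal sketch under a few assumptions: the token list is made up, nltk is installed in a version that provides Text.concordance_list, and the f-string layout is just one possible format.

```python
import nltk

# Made-up token list so that no corpus or tokenizer download is needed.
tokens = ("one was of a most monstrous size . Touching that monstrous bulk "
          "of the whale ; most Monstrous and most mountainous !").split()
text = nltk.Text(tokens)

# lines=2 keeps only the first two hits, although the word occurs three times.
hits = text.concordance_list("monstrous", width=22, lines=2)
for hit in hits:
    print(f"[{hit.offset}] {' '.join(hit.left)} {hit.query} {' '.join(hit.right)}")
```

Changing width and re-running shows how many tokens end up in hit.left and hit.right, which is a quick way to check the behaviour on your own nltk version.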