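I am building a named entity recognizer, and I am struggling to get my data into the right format using Python. What I have is a string and a list of all the named entities in that text, along with their tags. For example:
text = "Hidden Figures is a 2016 American biographical drama film directed by Theodore Melfi and written by Melfi and Allison Schroeder."
The string could also be "[[Hidden Figures]] is a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]." if that makes it easier.
listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER', 'Melfi PER', 'Allison Schroeder PER']
The output I want is: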
Hidden PRO
Figures PRO
is O
a O
2016 O
American LOC
biographical O
drama O
film O
directed O
by O
Theodore PER
Melfi PER
and O
written O
by O
Melfi PER
and O
Allison PER
Schroeder PER
. O
So far I only have the following function:
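import re

def wordPerLine(text, neplustags):
    # Put spaces around punctuation so it becomes its own token
    text = re.sub(r"([?!,.]+)", r" \1 ", text)
    wpl = text.split()
    output = []
    for line in wpl:
        # neplustags is not used yet: every token gets the default tag O
        output.append(line + " O")
    return output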
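This gives every line the default tag O (the tag for tokens that are not named entities). How can I make the named entities in the text get the correct tag?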
This might work. Replace the prints with something else; the regular expression needs improving, but it's a good start.
import re

text = "[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]]."
tags = {"Hidden test Figures": "PRO", "American": "LOC", 'Theodore Melfi': "PER", 'Melfi': "PER", 'Allison Schroeder': "PER"}
# Put a space before punctuation so it becomes its own token
text = re.sub(r"([?!,.]+)", r" \1", text)
search = ""     # entity text collected so far
inTag = False   # True while inside a [[...]] span
for w in text.split(" "):
    outTag = False
    rest = w
    if rest[:2] == "[[":
        rest = rest[2:]
        inTag = True
    if rest[-2:] == "]]":
        rest = rest[:-2]
        outTag = True
    if inTag:
        search += rest
        if outTag:
            # End of the entity: look up its tag and print each word with it
            val = tags[search]
            for word in search.split():
                print(word + ": " + val)
            inTag = False
            search = ""
        else:
            search += " "
    else:
        print(rest + ": O")
Input:
[[Hidden test Figures]] is, a 2016 [[American]] biographical drama film directed by [[Theodore Melfi]] and written by [[Melfi]] and [[Allison Schroeder]].
Output:
Hidden: PRO
test: PRO
Figures: PRO
is: O
,: O
a: O
2016: O
American: LOC
biographical: O
drama: O
film: O
directed: O
by: O
Theodore: PER
Melfi: PER
and: O
written: O
by: O
Melfi: PER
and: O
Allison: PER
Schroeder: PER
.: O
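If the text has no [[...]] markup and only the plain string plus listOfNEsAndTags from the question are available, something along these lines might also work. It is only a sketch under a few assumptions: the function names tokenize and tag_tokens are made up for illustration, the tag is taken to be the last space-separated word of each list entry, and matching is exact and case-sensitive. Longer entities are tried first so that "Theodore Melfi" is matched before "Melfi". It returns a list of "word TAG" strings instead of printing.

```python
import re

def tokenize(s):
    # Hypothetical helper: words, or runs of punctuation, as separate tokens.
    # Note that a hyphenated word would be split into several tokens.
    return re.findall(r"\w+|[^\w\s]+", s)

def tag_tokens(text, nes_and_tags):
    # Hypothetical helper: returns one "token TAG" string per token.
    entities = []
    for item in nes_and_tags:
        # Assumption: the tag is the last word, e.g. 'Hidden Figures PRO'.
        name, tag = item.rsplit(" ", 1)
        ent_tokens = tokenize(name)
        if ent_tokens:  # skip entries with an empty entity name
            entities.append((ent_tokens, tag))
    # Try longer entities first so a full name wins over a surname alone.
    entities.sort(key=lambda e: len(e[0]), reverse=True)

    tokens = tokenize(text)
    output = []
    i = 0
    while i < len(tokens):
        for ent_tokens, tag in entities:
            n = len(ent_tokens)
            if tokens[i:i + n] == ent_tokens:
                output.extend(f"{tok} {tag}" for tok in ent_tokens)
                i += n
                break
        else:
            # No entity starts at this position: default tag O.
            output.append(f"{tokens[i]} O")
            i += 1
    return output

text = ("Hidden Figures is a 2016 American biographical drama film directed "
        "by Theodore Melfi and written by Melfi and Allison Schroeder.")
listOfNEsAndTags = ['Hidden Figures PRO', 'American LOC', 'Theodore Melfi PER',
                    'Melfi PER', 'Allison Schroeder PER']
for line in tag_tokens(text, listOfNEsAndTags):
    print(line)
```

Keep in mind that this tags every occurrence of an entity string, so a word that is an entity in one place and an ordinary word in another would be tagged in both places. The [[...]] markup avoids that ambiguity.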