将 json 文件格式更改为 .spacy 以进行自定义 NER 标记

问题描述 投票:0回答:1

我想为我的项目创建一个自定义标签。为了获得有关此主题的帮助,我浏览了 使用 spaCy 3.0 构建自定义 NER 模型 本教程。 JSON 文件的内容

[{"text": "The F15 aircraft uses a lot of fuel", "entities": [4, 7, "aircraft"]},
 {"text":"did you see the F16 landing?", "entities": [16, 19, "aircraft"]},
 {"text":"how many missiles can a F35 carry", "entities": [24, 27, "aircraft"]},
 {"text":"is the F15 outdated", "entities": [7, 10, "aircraft"]},
 {"text":"does the US still train pilots to dog fight?","entities": [0, 0, "aircraft"]},
 {"text":"how long does it take to train a F16 pilot","entities": [33, 36, "aircraft"]}]

但是,这是一种旧的文件格式。我们需要在 Spacy v3.1 中构建自定义 NER 模型。所以教程中有一个代码块可以将.json格式更改为.spacy格式。

以下是代码块

import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

当我运行此代码时,我收到错误。

如何克服该错误?

python nlp google-colaboratory spacy named-entity-recognition
1个回答
0
投票

在您提供的代码块中,TRAIN_DATA 应该是一个元组列表,其中每个元组包含一个文本及其相应的注释(包括实体跨度)。但是,您提供的 JSON 数据的格式不同。

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

# Your JSON data
json_data = [
    {"text": "The F15 aircraft uses a lot of fuel", "entities": [4, 7, "aircraft"]},
    {"text": "did you see the F16 landing?", "entities": [16, 19, "aircraft"]},
    # ... (other data)
]

# Convert JSON data to spaCy format
TRAIN_DATA = []
for entry in json_data:
    text = entry["text"]
    entities = []
    for start, end, label in [entry["entities"]]:  # Note: Assuming each entry has only one entity
        entities.append((start, end, label))
    TRAIN_DATA.append((text, {"entities": entities}))

# spaCy setup
nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

# Process each text
for text, annot in tqdm(TRAIN_DATA):
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)

© www.soinside.com 2019 - 2024. All rights reserved.