Named entity recognition for search engine queries with Python


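I am trying to perform named entity recognition (NER) on search engine queries with Python.

The biggest problem with search engine queries is that they are often incomplete or entirely lowercase.

For this task, spaCy, NLTK, Stanford NLP, Flair, and Hugging Face Transformers have been recommended to me as possible approaches.

I would like to know whether anyone in the SO community knows the best way to handle NER on search engine queries, because so far I have run into problems.

For example, with spaCy: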

import spacy

# Load the pre-trained small English model
nlp = spacy.load("en_core_web_sm")

# The two queries under test: a longer sentence and a short lowercase query
queries = [
    "google and apple are looking at buying u.k. startup for $1 billion",
    "who is barack obama",
]

for text in queries:
    doc = nlp(text)
    # Print the text and label of every entity found in this query
    for ent in doc.ents:
        print(ent.text, ent.label_)

For the first query I get:

google ORG
u.k. GPE
$1 billion MONEY

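That is a good result. However, for the lowercase search query "who is barack obama", it returns no entities at all.

I am sure I am not the first person to run NER on search engine queries in Python, so I am hoping someone can point me in the right direction.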

python nlp search-engine named-entity-recognition
1 Answer

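Problem

Most NER models rely on token casing as a primary feature, so a fully lowercase query removes one of the strongest signals they have.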
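To make the casing point concrete, here is a minimal sketch of a "word shape" feature of the kind many NER taggers feed into their models. The helper name `word_shape` is hypothetical and the mapping is simplified; it is not the exact implementation of any particular library.

```python
def word_shape(token):
    """Map a token to a coarse shape: X = uppercase, x = lowercase, d = digit."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            # Keep punctuation and symbols unchanged
            shape.append(ch)
    return "".join(shape)


# Compare the shape of a properly cased name with its lowercase form
for token in ["Barack", "barack", "Obama", "obama"]:
    print(token, word_shape(token))
```

With proper casing, "Barack" gets a shape that starts with `X`, which hints at a proper noun. In a lowercase query the same token has the same shape as any ordinary word, so a model has to fall back on context alone.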

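Solution

I would try GPT models: they are pretrained on large amounts of text to predict tokens from their context, so they should be able to recognize entities from context rather than from casing.

I ran a quick experiment with ChatGPT.

Prompt: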

Named entity recognition (NER) is a natural language processing (NLP) method that extracts information from text. NER involves detecting and categorizing important information in text known as named entities. Named entities refer to the key subjects of a piece of text, such as names, locations, companies, events and products, as well as themes, topics, times, monetary values and percentages. You are an expert on recognizing named entities.

I will provide you with short sentences and you will respond with all the entities you find.

Return the entities classified into four types:

PER for persons such as Bill Clinton, Gauss, Jennifer Lopez
LOC for locations such as California, Europe, 9th Avenue
ORG for organizations such as Apple, Google, UNO
MISC for any other type of entity that does not fit the aforementioned cases.

Respond in JSON format. 

For example:

"google and apple are looking at buying u.k. startup for $1 billion"

response:

{"entities": [
{"name": "google", "type": "ORG"},
{"name": "apple", "type": "ORG"},
{"name": "u.k.", "type": "MISC"}
]}

It responded well for your use case (try it in the ChatGPT app!).

Code

The following code and dependencies should work as a first approach with OpenAI models:

!pip install openai==1.2.0 pyautogen==0.2.0b2

(It was hard to find a working version combination. OpenAI recently migrated to a new API, so the tutorials are a mess right now...)
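Since the `OpenAI` client class imported in the code below only exists in version 1.0 and later of the `openai` package, it can help to check what is installed before running it. This is a minimal sketch using only the standard library; the helper names `uses_v1_client` and `installed_openai_version` are hypothetical, and the check assumes a plain `major.minor.patch` version string.

```python
from importlib.metadata import PackageNotFoundError, version


def uses_v1_client(version_string):
    """Return True if the version's major number is 1 or higher."""
    major = int(version_string.split(".")[0])
    return major >= 1


def installed_openai_version():
    """Return the installed openai version string, or None if it is missing."""
    try:
        return version("openai")
    except PackageNotFoundError:
        return None


found = installed_openai_version()
if found is None:
    print("openai is not installed")
elif uses_v1_client(found):
    print(f"openai {found}: `from openai import OpenAI` should work")
else:
    print(f"openai {found}: pre-1.0 API, the client import will fail")
```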

from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your OpenAI API key>")

# Function to perform Named Entity Recognition (NER)
def perform_ner(text):
    # Define the prompt for NER task
    prompt = """
    
    You are an expert on recognising named entities. I will provide you with short sentences and you will respond with all the entities you find. Return the entities classified into four types:
    PER for persons such as Bill Clinton, Gauss, Jennifer Lopez
    LOC for locations such as California, Europe, 9th Avenue
    ORG for organizations such as Apple, Google, UNO
    MISC for any other type of entity that does not fit the aforementioned cases.

    Respond in JSON format. 

    For example:

    "google and apple are looking at buying u.k. startup for $1 billion"

    response:

    {"entities": [
    {"name": "google", "type": "ORG"},
    {"name": "apple", "type": "ORG"},
    {"name": "u.k.", "type": "MISC"}
    ]}
    
    """

    # Generate completion using OpenAI API
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text}
        ],
        max_tokens=100,
        n=1,
        stop=None,
        temperature=0
    )

    # Extract the reply text and parse it as JSON
    # (json.loads raises json.JSONDecodeError if the reply is not valid JSON)
    entities = response.choices[0].message.content.strip()
    return json.loads(entities)

# Function to receive new text and return NER JSON
def get_ner_json(new_text):
    # Perform NER on the new text
    entities = perform_ner(new_text)
    return entities

# Example new text
new_text = "I went to Paris last summer and visited the Eiffel Tower."

# Get NER JSON for the new text
ner_json = get_ner_json(new_text)
print(json.dumps(ner_json, indent=2))

Output:

{
  "entities": [
    {
      "name": "paris",
      "type": "LOC"
    },
    {
      "name": "eiffel tower",
      "type": "LOC"
    }
  ]
}
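One caveat: `perform_ner` passes the reply straight to `json.loads`, which raises an error whenever the model returns anything other than clean JSON. Below is a minimal sketch of a more defensive parser, assuming the reply follows the `{"entities": [...]}` layout requested in the prompt. The names `parse_entities` and `ALLOWED_TYPES` are hypothetical, and the sample reply is hand-written, not real model output.

```python
import json

# The four labels requested in the prompt
ALLOWED_TYPES = {"PER", "LOC", "ORG", "MISC"}
FENCE = "`" * 3  # Markdown code fence marker


def parse_entities(raw_reply):
    """Return the valid entities in a model reply, or [] if it is unusable."""
    text = raw_reply.strip()
    # Models sometimes wrap JSON in a Markdown code fence; drop the fence lines
    if text.startswith(FENCE):
        lines = text.splitlines()
        text = "\n".join(line for line in lines if not line.startswith(FENCE))
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    entities = data.get("entities", [])
    if not isinstance(entities, list):
        return []
    valid = []
    for item in entities:
        if not isinstance(item, dict):
            continue
        name, label = item.get("name"), item.get("type")
        # Keep only well-formed entries with one of the requested labels
        if isinstance(name, str) and isinstance(label, str) and label in ALLOWED_TYPES:
            valid.append({"name": name, "type": label})
    return valid


# Hand-written sample reply, used only to exercise the parser
sample_reply = '{"entities": [{"name": "barack obama", "type": "PER"}]}'
print(parse_entities(sample_reply))
```

Returning an empty list on a malformed reply keeps a batch of queries running instead of crashing on the first bad one. Whether to log or retry such replies is a separate design choice.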