Named entity recognition with a dictionary in text


I need to extract keywords from text. I have a dictionary of keywords, for example:

apache-spark
java
pathon
amazon-web-services
apache-kafka

For example, I have this job posting:

Design, develop and maintain ETL processing pipelines for data ingestion and sharing
Contribute to system architecture design discussions and improvements
Communicating with different teams regarding data quality, consistency and availability

Our technology stack:

GCP (BigQuery, GCS, PubSub, DataProc etc)
Spark, Kafka, Kudu
Airflow, dbt
Tableau
4+ years experience as a Data Engineer
Proven track record of working with SQL, Python, Airflow and Docker
Experience in large scale data processing (Apache Spark or similar) and Scala/Java is a big plus
Strong expertise in cloud-based data warehouses like Google BigQuery
Fluency in English verbal and written skills.

The text contains the keyword "Apache Spark", but my dictionary has a slightly different keyword, "apache-spark". The same goes for "Kafka": my dictionary has "apache-kafka".

Is it possible to use Stanford NER to extract such keywords from text? Is this a task for Stanford NER at all, or am I on the wrong track?

nlp stanford-nlp named-entity-recognition
1 Answer

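Having some kind of alias and synonym list for NER definitely helps. Without one, the model can hardly guess that "Spark" means "apache-spark", unless a pre-trained model with alias detection already exists.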

But when it comes to computing terms, StackOverflow can help! Every tag has a synonyms page that you can scrape:

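import requests
from collections import defaultdict
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Dictionary keywords, written as StackOverflow tag names
terms = """apache-spark
java
pathon
amazon-web-services
apache-kafka""".split('\n')

alias = defaultdict(set)    # tag -> synonyms listed for that tag
related = defaultdict(set)  # tag -> other tags linked from the same page

for t in terms:
  # Fetch the synonyms page of each tag
  url = f"https://stackoverflow.com/tags/{t}/synonyms"
  response = requests.get(url)
  bsoup = BeautifulSoup(response.content.decode('utf8'), 'html.parser')
  for a in bsoup.find_all('a', attrs={'rel': 'tag'}):
    # Tag links whose aria-label says "show questions tagged" are treated as aliases
    if "show questions tagged" in a.attrs.get('aria-label', ''):
      alias[t].add(a.text)
    else:
      related[t].add(a.text)

print(alias)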

[out]:

defaultdict(set,
            {'apache-spark': {'apache-spark',
              'spark',
              'spark-cluster-framework'},
             'java': {'.java',
              'core-java',
              'j2se',
              'java',
              'java-api',
              'java-libraries',
              'java-se',
              'javax',
              'jdk',
              'jre',
              'openjdk',
              'oraclejdk'},
             'amazon-web-services': {'amazon-web-services', 'aws'},
             'apache-kafka': {'apache-kafka',
              'kafka',
              'kafka-cluster',
              'kafka-partition',
              'kafka-topic'}})