Python: Masking named entities in email text

Question (votes: 1, answers: 1)

I created a Python script to extract named entities, as follows:

import os

import pandas as pd
from nltk.tag.stanford import StanfordNERTagger

# set java path
java_path = r'C:/Program Files/Java/jre1.8.0_161/bin/java.exe'

os.environ['JAVAHOME'] = java_path

# initialize NER tagger
sn = StanfordNERTagger('C:/Users/Parag/Documents/stanford-ner-2018-10-16/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       path_to_jar='C:/Users/Parag/Documents/stanford-ner-2018-10-16/stanford-ner-2018-10-16/stanford-ner.jar')

# tag named entities
ner_tagged_sentences = [sn.tag(sent.split()) for sent in dataset_unseen['Text']]

# extract all named entities
named_entities = []

for sentence in ner_tagged_sentences:
    temp_entity_name = ''
    temp_named_entity = None

    for term, tag in sentence:
        if tag != 'O':
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)

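        else:
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None

    # flush an entity that runs to the end of the sentence
    if temp_named_entity:
        named_entities.append(temp_named_entity)

entity_frame = pd.DataFrame(named_entities, columns=['Entity Name', 'Entity Type'])
entity_frame.head()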

**Output**

 Entity Name         Entity Type     Frequency
 ABC Farms           ORGANIZATION    5
 Freddy Hill Lane    ORGANIZATION    3
 North Lane Thames   ORGANIZATION    2

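Now I want to mask these named entities with a pattern such as "######", so that sensitive customer information is hidden and the data complies with GDPR.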

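I have tried the following options:

  1. Loop over the original DataFrame, check each text for the named entities listed in the entity DataFrame, and mask every match with '#####'.

  2. Define a function that masks the named entities in the text: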

def Detectner(row):
    ner_tagged_sentences = [sn.tag(sent.split()) for sent in row['Text']]
    results = ner_tagged_sentences.sub('**********',row['Text'])
    return results

dataset_unseen['Text'] = dataset_unseen.apply(Detectner,axis=1)

But I get the following error:

AttributeError: ("'list' object has no attribute 'sub'", 'occurred at index 0')

How can I extract and mask the named entities in the text? Any improvements to this code are highly appreciated!

python stanford-nlp text-mining named-entity-recognition data-masking
1 Answer (0 votes)

When you tag the sentences, you are creating a list on this line:

ner_tagged_sentences = [sn.tag(sent.split()) for sent in row['Text']]

The type of ner_tagged_sentences is list, and a list has no sub method.

There are several approaches you can try to anonymize the document:

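  1. Replace every token that has a non-O tag with a placeholder (token level)
  2. Replace the named-entity text directly in the document (string level)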
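
A minimal sketch of the token-level approach (1), assuming the tagger output has the shape returned by sn.tag(...), a list of (term, tag) pairs. The function name mask_tokens and the hand-written example tags are illustrative only and do not come from the question.

```python
def mask_tokens(tagged_tokens, mask="######"):
    """Replace every token whose tag is not 'O' with `mask`.

    `tagged_tokens` is a list of (term, tag) pairs for one tokenized text.
    """
    return " ".join(mask if tag != "O" else term for term, tag in tagged_tokens)


# Hand-written tags standing in for real tagger output (illustrative only).
example = [
    ("Please", "O"),
    ("contact", "O"),
    ("ABC", "ORGANIZATION"),
    ("Farms", "ORGANIZATION"),
    ("today", "O"),
]
masked_example = mask_tokens(example)
print(masked_example)  # Please contact ###### ###### today
```

With the real tagger this could probably be applied as dataset_unseen['Text'].apply(lambda text: mask_tokens(sn.tag(text.split()))). Note that the text is split once and tagged as a whole: in Detectner, `for sent in row['Text']` iterates over the characters of the string rather than over sentences. Rejoining with single spaces also normalizes the original whitespace.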

It looks like you are trying to do number (2).
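
A rough sketch of the string-level approach (2), again assuming (term, tag) pairs as tagger output. The helper names collect_entities and mask_entities are hypothetical, and the example tags are written by hand rather than produced by Stanford NER. It uses re.sub on the text itself, which is what the sub call in the question seems to have been aiming for.

```python
import re


def collect_entities(tagged_tokens):
    """Group consecutive non-'O' tokens into entity strings."""
    entities, current = [], []
    for term, tag in tagged_tokens:
        if tag != "O":
            current.append(term)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:  # entity that runs to the end of the text
        entities.append(" ".join(current))
    return entities


def mask_entities(text, entities, mask="######"):
    """Replace each entity string found in `text` with `mask`."""
    if not entities:
        return text
    # Longest first, so a longer entity wins over a shorter one it contains.
    ordered = sorted(set(entities), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(entity) for entity in ordered))
    # A function replacement keeps `mask` literal (no backslash expansion).
    return pattern.sub(lambda _match: mask, text)


# Hand-written tags standing in for real tagger output (illustrative only).
text = "Please contact ABC Farms at Freddy Hill Lane today"
tagged = [
    ("Please", "O"), ("contact", "O"),
    ("ABC", "ORGANIZATION"), ("Farms", "ORGANIZATION"),
    ("at", "O"),
    ("Freddy", "ORGANIZATION"), ("Hill", "ORGANIZATION"), ("Lane", "ORGANIZATION"),
    ("today", "O"),
]
entities = collect_entities(tagged)
masked = mask_entities(text, entities)
print(masked)  # Please contact ###### at ###### today
```

Applied to the DataFrame, this might look like dataset_unseen['Text'].apply(lambda t: mask_entities(t, collect_entities(sn.tag(t.split())))). One caveat: because the tokens come from a plain whitespace split, an entity is only found again if its words are separated by single spaces in the original text, and punctuation attached to a token becomes part of the entity string.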
