  • hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
  • enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
  • bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
  • perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
  • climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']


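More specifically, I want to classify my concepts as medical or non-medical. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are in the medical domain, their categories are very different from each other.

Therefore, I would like to know if there is a way to get the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).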



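As suggested by @IlmariKaronen, I also used the categories of categories, and the results I got are as follows (the small font near each category shows the categories of that category). [image: categories with the categories of each category]


In addition, as @IlmariKaronen pointed out, using details about WikiProjects looks promising. However, the Medicine WikiProject does not seem to cover all medical terms, so we would also need to check other WikiProjects.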


  1. Using the library pymediawiki

     import mediawiki as pw

     wikipedia = pw.MediaWiki()
     p = wikipedia.page('enzyme inhibitor')
     print(p.categories)

  2. Using the library pywikibot

     import pywikibot as pw

     site = pw.Site('en', 'wikipedia')
     print([
         cat.title()
         for cat in pw.Page(site, 'support-vector machine').categories()
         if 'hidden' not in cat.categoryinfo
     ])


My concept list:

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

For a long list, please check the following link: https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing



Solution Overview



  • Active learning (example approach provided below)
  • MediaWiki backlinks, provided as an answer by @TavoGC
  • SPARQL ancestral categories, provided as a comment to your question by @Stanislav Kralin, and/or parent categories, provided by @Meena Nagarajan (those two could be an ensemble on their own, based on their differences, but for that you would have to contact both authors and compare their results).


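While we're at it, I would argue against the approach presented by @ananand_v.singh in this answer, because:

  • The distance metric should not be Euclidean. Cosine similarity is a much better metric (used by, e.g., spaCy), as it does not take the magnitude of the vectors into account (and it shouldn't; that is how word2vec and GloVe were trained).
  • If I understood correctly, many artificial clusters would be created, while we only need two: a medical one and a non-medical one. Furthermore, the centroid of the medical cluster is not centered on medicine itself. This poses additional problems: say the centroid moves far away from medicine, then other words (any that, in your opinion, do not fit into medicine) might get into the cluster.
  • It is hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally nonsensical results [yes, I have tried it; PCA explains around 5% of the variance for your longer dataset, which is really, really low]).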
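To illustrate the first point, here is a minimal sketch in plain Python. The vectors are made up for illustration and are not real word embeddings, and the helper names are hypothetical. Scaling a vector changes its Euclidean distance to the original but leaves the cosine similarity untouched.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (ignores their magnitudes)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b (sensitive to magnitude)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-dimensional "word vectors"; real embeddings have hundreds of dimensions.
v = [1.0, 2.0, 3.0]
scaled = [3.0 * x for x in v]  # same direction, three times the length

print(cosine_similarity(v, scaled))   # the direction is identical
print(euclidean_distance(v, scaled))  # yet the two points are far apart
```

This is why two vectors pointing the same way can look "distant" under a Euclidean metric even though they encode the same direction in the embedding space.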


Active Learning approach



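As @ananand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and IMO in a much cleaner and easier fashion).


  • Do not use contextualized word embeddings, the currently available state of the art (e.g. BERT).
  • Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). This should be checked (and it is checked in my code; there will be further discussion when the time comes), and you may use the embedding that has most of them present.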
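The coverage check from the second point can be sketched as follows. This is a library-independent illustration: `missing_concepts` and the tiny `vectors` dictionary are hypothetical, and the numbers are made up. With spaCy, the equivalent per-document check is the `has_vector` attribute used in the code further down.

```python
def missing_concepts(vectors):
    """Return the concepts whose vector is empty or consists only of zeros."""
    return [
        concept
        for concept, vector in vectors.items()
        if not any(component != 0 for component in vector)
    ]


# Hypothetical lookup result: concept -> embedding (all zeros means "not in vocabulary").
vectors = {
    "menthol": [0.12, -0.40, 0.33],
    "vpb": [0.0, 0.0, 0.0],
    "climate": [0.05, 0.21, -0.17],
}

missing = missing_concepts(vectors)
coverage = 1 - len(missing) / len(vectors)
print(missing, f"coverage: {coverage:.0%}")
```

Running the same count for each candidate embedding makes it easy to pick the one that represents most of your concepts.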

Measuring similarity using spaCy





class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (English), which will be used to return
        # the similarity of each concept to the centroid
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

It can be called like this:

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

You may substitute your own data in place of concepts_new.txt.

Look at spacy.load and notice that I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more information is provided in the links above.







The class has an sklearn-like interface:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

  • The samples argument describes how many examples will be shown to the expert during each iteration (it is the maximum; fewer are returned if samples have already been asked about or there are not enough of them to show).
  • step represents the drop of the threshold in each iteration (we start at 1, meaning perfect similarity).
  • change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, as multiple of them are returned), step is multiplied by this floating point number. It is used to pinpoint the exact threshold between step changes at each iteration.
  • Concepts are sorted based on their similarity (the more similar a concept is, the higher it is).







Next, the method of ActiveLearner that asks the expert for an opinion; the answers are what it uses to find the optimal similarity threshold according to the expert:

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"
    return "y"






Are those concepts related to medicine?                                                      

0. anesthetic drug                                                                                                                                                                         
1. child and adolescent psychiatry                                                                                                                                                         
2. tertiary care center                                                     
3. sex therapy                           
4. drug design                                                                                                                                                                             
5. pain disorder                                                      
6. psychiatric rehabilitation                                                                                                                                                              
7. combined oral contraceptive                                
8. family practitioner committee                           
9. cancer family syndrome                          
10. social psychology                                                                                                                                                                      
11. drug sale                                                                                                           
12. blood system                                                                        

[y]es / [n]o / [any]quit y


# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as current threshold is not related to medicine already
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        # Stop only if raising the threshold would pass the lowest accepted one
        # (the comparison was "<" originally, which stopped at the first "n")
        if self.threshold_ + self.step > self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False
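To see how `step` and `change_multiplier` interact without typing answers by hand, the search can be simulated with a function standing in for the expert. This is a simplified variant of the idea, not the ActiveLearner code itself: `find_threshold`, `oracle` and the "true" boundary of 0.3 are assumptions made for illustration.

```python
def find_threshold(oracle, step=0.05, change_multiplier=0.7, max_steps=50):
    """Lower the threshold while the oracle approves; on a rejection, shrink the step and retry."""
    threshold = 1.0  # start from perfect similarity
    accepted = 1.0   # lowest threshold approved so far
    for _ in range(max_steps):
        if oracle(threshold):
            accepted = threshold
            threshold -= step  # try to be more permissive
        else:
            step *= change_multiplier    # overshoot: use a finer step
            threshold = accepted - step  # retry just below the last approved value
    return accepted


def oracle(threshold):
    # Hypothetical expert: everything with similarity of at least 0.3 is medical.
    return threshold >= 0.3


found = find_threshold(oracle)
print(f"Found threshold {found:.3f}")
```

Each rejection narrows the step, so the accepted threshold approaches the boundary from above and never drops below a value the oracle rejected.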

Here is the complete source, including the whole ActiveLearner class and the Classifier:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (English), which will be used to return
        # the similarity of each concept to the centroid
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            # Stop only if raising the threshold would pass the lowest accepted one
            # (the comparison was "<" originally, which stopped at the first "n")
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

After answering some questions, with a threshold of 0.1 (everything in [-1, 0.1) is considered non-medical, while everything in [0.1, 1] is considered medical), I got the following results:

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True


Possible improvements

As mentioned at the beginning, mixing my approach with the other answers would probably filter out ideas like sport shoe belonging to medicine, and the active learning approach would be more of a deciding vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use several of them (either increasing or decreasing); let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this:

True True False False False


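Taking a majority vote, we would mark it non-medical by 3 votes to 2. Furthermore, a threshold that is too strict would be mitigated as well if the thresholds below it out-vote it (the case where the True/False values look like this: True True True False False).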
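The voting scheme can be sketched in a few lines. The function name `ensemble_vote` and the similarity value 0.25 for sport shoe are assumptions chosen so that the votes match the example; only the five thresholds come from the text.

```python
def ensemble_vote(similarity, thresholds):
    """Return the per-threshold votes and whether the majority says 'medical'."""
    votes = [similarity >= threshold for threshold in thresholds]
    return votes, sum(votes) > len(votes) / 2


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]

# Hypothetical similarity of "sport shoe" to the medicine centroid.
votes, is_medical = ensemble_vote(0.25, thresholds)
print(votes, "medical" if is_medical else "non-medical")
```

An odd number of thresholds avoids ties; with an even number you would need an explicit tie-breaking rule.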

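The final possible improvement I came up with: in the code above I am using the Doc vector, which is the mean of the word vectors that make up the concept. Say one word is missing (a vector consisting of zeros); in that case, the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors that are different from zero.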
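A minimal sketch of that last idea, independent of spaCy: `mean_of_known_vectors` is a hypothetical helper and the two-dimensional vectors are made up. It averages only the word vectors that are not all zeros.

```python
def mean_of_known_vectors(word_vectors):
    """Average word vectors, skipping all-zero ones (words without a representation)."""
    known = [v for v in word_vectors if any(c != 0 for c in v)]
    if not known:
        return None  # no word of the concept has a representation
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]


# Hypothetical vectors for a two-word concept; the second word is out of vocabulary.
concept = [[0.2, 0.4], [0.0, 0.0]]

print(mean_of_known_vectors(concept))  # mean over the known word only
```

A plain mean over both vectors would halve every component here, which is the "pushed away from the centroid" effect described above.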



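“Therefore, I would like to know if there is a way to get the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category)”

MediaWiki categories are themselves wiki pages. A "parent category" is just a category that the "child" category page belongs to. So you can get the parent categories of a category in exactly the same way as you would get the categories of any other wiki page.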
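Because parents are obtained the same way as ordinary categories, walking several levels up is just a repeated lookup. Below is a sketch under stated assumptions: `ancestor_categories` is a hypothetical helper, `get_categories` is any callable that returns the category titles of a page, and the tiny graph is made up so the example runs offline. With a real client, the callable would wrap the library's category lookup.

```python
from collections import deque


def ancestor_categories(title, get_categories, max_depth=2):
    """Breadth-first walk up the category graph, at most max_depth levels."""
    seen = set()
    queue = deque([(title, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look above the requested level
        for category in get_categories(page):
            if category not in seen:  # the category graph can contain cycles
                seen.add(category)
                queue.append((category, depth + 1))
    return seen


# Stand-in for a real lookup: a tiny made-up category graph for illustration.
fake_graph = {
    "enzyme inhibitor": ["Category:Enzyme inhibitors"],
    "Category:Enzyme inhibitors": ["Category:Pharmacology"],
    "Category:Pharmacology": ["Category:Medicine"],
}


def lookup(page_title):
    return fake_graph.get(page_title, [])


print(ancestor_categories("enzyme inhibitor", lookup, max_depth=3))
```

Keeping `max_depth` small matters in practice, since a few levels up almost every page reaches very broad categories.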

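For example, using pymediawiki:

p = wikipedia.page('Category:Enzyme inhibitors')
parents = p.categories




Classifying terms by their MediaWiki links and backlinks:

import re
from mediawiki import MediaWiki

# TermFind searches a list for a given term
def TermFind(term, termList):
    response = False
    for val in termList:
        if re.match('(.*)' + re.escape(term) + '(.*)', val):
            response = True
            break
    return response

# Find out whether both the links and the backlinks lists contain a given term
def BoundedTerm(wikiPage, term):
    aList = wikiPage.links
    bList = wikiPage.backlinks
    response = False
    if TermFind(term, aList) and TermFind(term, bList):
        response = True
    return response

container = []
wikipedia = MediaWiki()
# termlist is your list of concepts; 'term' stands for the term to look for
for val in termlist:
    cpage = wikipedia.page(val)
    if BoundedTerm(cpage, 'term'):
        container.append('medical')
    else:
        container.append('nonmedical')




Now, these would be the best suited for this task, since they contain most of the words from Wikipedia's corpus, but in case they are not suited for you, or are removed in the future, you can use one of those I will list below. That said, there is a better way to do this: pass them to TensorFlow's universal language model embed module, which does most of the heavy lifting for you; you can read more about that here. The reason I put it after the Wikipedia text dump is that I have heard people say they are a bit hard to work with when dealing with medical samples. This paper does propose a solution to tackle that, but I have never tried it, so I cannot be sure of its accuracy.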






Using the embeddings from TensorFlow looks like this:

import tensorflow as tf
import tensorflow_hub as hub

# TensorFlow 1.x style API; session is an open tf.Session
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as"," List of strings"])
session.run(embeddings)

Facebook's fastText

Or this: http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz


File content.py

def AllTopics():
    topics = []  # list all your topics, not added here for space restrictions
    for topic in topics:
        yield topic

File SummaryGenerator.py

import pickle

import wikipedia
from content import AllTopics

summary = []
failed = []
for topic in AllTopics():
    try:
        # Store (topic, summary text) pairs; the clustering script reads both fields
        summary.append((topic, str(wikipedia.summary(topic))))
    except Exception as e:
        failed.append((topic, str(e)))

with open("summary.txt", "wb") as fp:
    pickle.dump(summary, fp)
with open("failed.txt", "wb") as fp:
    pickle.dump(failed, fp)

The similarityGenerator script:

import pickle
import sys

import tensorflow as tf
import tensorflow_hub as hub
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster import hierarchy

try:
    with open("summary.txt", "rb") as fp:  # Unpickling
        summary = pickle.load(fp)
except Exception as e:
    print('Cannot load the summary file. Please make sure that it exists; if not, run Summary Generator first', e)
    sys.exit('Read the error message')

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

tf.logging.set_verbosity(tf.logging.ERROR)
messages = [x[1] for x in summary]
labels = [x[0] for x in summary]
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # message_embeddings is a numpy.ndarray of shape (number_of_elements, 512)
    message_embeddings = session.run(embed(messages))

X = message_embeddings
# Ward linkage always uses Euclidean distances
agl = AgglomerativeClustering(n_clusters=5, linkage='ward')
agl.fit(X)
# linkage expects the observations (or a condensed distance matrix), not a square one
Z = hierarchy.linkage(X, 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_

This is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and other files. In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (directly clone the repository and run SummaryGenerator.py), and to upload similarity.txt via a pull request in case you don't get the expected results. If possible, also upload the message_embeddings in a CSV file, as topics and their embeddings.

Changes after edit 2: switched the similarityGenerator to hierarchy-based (agglomerative) clustering. I would suggest keeping the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I verified this by viewing some samples, and the results look quite good. You can change the n_clusters value to fine-tune your model. Note: this requires you to run the summary generator again. I think you should be able to take it from here: what you have to do is try a few values of n_clusters, see in which of them all the medical terms are grouped together, then find the cluster_label for that cluster, and you are done. Since here we group by summary, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.
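The last step, finding the cluster label of the medical cluster, could be sketched like this. Everything here is illustrative: `medical_cluster` is a hypothetical helper, the labels and cluster ids are made up, and the seed terms are concepts you already know to be medical.

```python
from collections import Counter


def medical_cluster(labels, cluster_labels, seed_terms):
    """Return the id of the cluster that contains most of the seed terms."""
    counts = Counter(
        cluster
        for label, cluster in zip(labels, cluster_labels)
        if label in seed_terms
    )
    if not counts:
        return None  # none of the seed terms were clustered
    return counts.most_common(1)[0][0]


# Made-up clustering output: topic names and the cluster id assigned to each.
labels = ["menthol", "climate", "bisoprolol", "monday", "spongiosis"]
cluster_labels = [2, 0, 2, 1, 2]

# Hypothetical seeds: terms already known to be medical.
cluster_id = medical_cluster(labels, cluster_labels, {"bisoprolol", "spongiosis"})
medical_terms = [l for l, c in zip(labels, cluster_labels) if c == cluster_id]
print(cluster_id, medical_terms)
```

If the seeds end up spread over several clusters, that is a sign the chosen n_clusters value splits the medical terms and a smaller value may be worth trying.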


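The wikipedia library is also a good choice for extracting the categories from a given page, as wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search multiple pages should they all have the same title.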









