How to group Wikipedia categories in Python?

Question

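For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.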

hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

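As you can see, the first three concepts belong to the medical domain (whereas the remaining two are not medical terms).

More precisely, I want to divide my concepts into medical and non-medical. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are in the medical domain, their categories are very different from each other.

Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions using other libraries as well.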

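EDIT

As suggested by @IlmariKaronen, I am also using the categories of categories, and the results I got are as follows (the small font near each category shows the categories of that category).

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be promising. However, the Medicine WikiProject does not seem to cover all medical terms, so we would need to check other WikiProjects as well.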

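EDIT: My current code for extracting categories from Wikipedia concepts is as follows. This can be done using either pymediawiki or pywikibot.

Using the library pymediawiki

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('enzyme inhibitor')
print(p.categories)

Using the library pywikibot

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

The categories of categories can be obtained in the same way, as shown in the answer by @IlmariKaronen.

If you are looking for a longer list of concepts for testing, I have listed more examples below.

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

For a very long list, please check the link below.

https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

NOTE: I am not expecting the solution to work 100% (if the proposed algorithm is able to detect many of the medical concepts, that is enough for me).

I am happy to provide more details if needed.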

Answer 1

Solution Overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label agreed upon by more than 50% of the classifiers in your binary case).

I'm thinking about the following approaches:

Active learning (example approach provided by me below)
MediaWiki backlinks, provided as an answer by @TavoGC
SPARQL ancestral categories, provided as a comment to your question by @Stanislav Kralin, and/or parent categories, provided by @Meena Nagarajan (those two could be an ensemble on their own based on their differences, but for that you would have to contact both creators and compare their results).

This way 2 out of 3 would have to agree that a certain concept is a medical one, which further reduces the chance of an error.
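The voting rule described above can be sketched in a few lines. This is only an illustration: the function name majority_vote and the three example votes are assumptions standing in for the outputs of the three approaches, not code from any of the answers.

```python
def majority_vote(votes):
    """Return True when more than half of the boolean votes are True."""
    votes = list(votes)
    return sum(votes) > len(votes) / 2


# Hypothetical votes for one concept from three independent approaches
votes_for_concept = {
    "active_learning": True,
    "backlinks": True,
    "parent_categories": False,
}
is_medical = majority_vote(votes_for_concept.values())
print(is_medical)  # 2 of 3 agree, so the concept is labelled medical
```

With an odd number of voters a draw cannot happen, which is one reason to combine three approaches rather than two.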

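While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:

The distance metric should not be Euclidean; cosine similarity is a much better metric (used by, e.g., spaCy), as it does not take the magnitude of the vectors into account (and it shouldn't; that is how word2vec and GloVe were trained).
Many artificial clusters would be created, if I understood correctly, while we only need two: a medicine one and a non-medicine one. Furthermore, the centroid of the medical cluster is not centered on medicine itself. This poses additional problems: say the centroid is moved far away from medicine, then other words such as computer or human (or any other that does not fit into medicine in your opinion) might get into the cluster.
It is hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give totally nonsensical results [yes, I tried it: PCA explains around 5% of the variance for your longer dataset, which is really, really low]).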
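To make the first point concrete, here is a small sketch comparing the two metrics on a vector and a scaled copy of it. The helper names and the toy vectors are assumptions chosen for illustration; real word vectors have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v; sensitive to magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


v = [1.0, 2.0, 3.0]
scaled = [10 * x for x in v]  # same direction, ten times the magnitude

print(cosine_similarity(v, scaled))   # close to 1: the direction is identical
print(euclidean_distance(v, scaled))  # large: dominated by the magnitude gap
```

Two vectors pointing the same way count as identical under cosine similarity, while the Euclidean distance reports them as far apart.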

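Based on the problems highlighted above, I have come up with a solution using active learning, which is a rather forgotten approach to such problems.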

Active Learning approach

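In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what it means for a term to be part of the medical category), we ask a human "expert" (who does not actually have to be an expert) to provide some answers.

Knowledge encoding

As anand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and IMO in a much cleaner and easier fashion).

I'm not going to repeat his points in my answer, so I will add my two cents:

Do not use contextualized word embeddings, the currently available state of the art (e.g. BERT).
Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). This should be checked (and it is checked in my code; there will be further discussion when the time comes), and you may use the embedding that has most of them present.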
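The coverage check from the second point could look roughly like this. It is a sketch on toy data: missing_representation is a hypothetical helper, and the three-dimensional lists stand in for real embedding vectors.

```python
def missing_representation(vectors):
    """Return indices of vectors that are all zeros (no embedding available)."""
    return [i for i, vec in enumerate(vectors) if not any(vec)]


# Toy 3-dimensional "embeddings"; real ones would come from the embedding model
concept_vectors = [
    [0.1, -0.3, 0.7],  # has a representation
    [0.0, 0.0, 0.0],   # out of vocabulary
    [0.4, 0.0, 0.0],   # has a representation
]
missing = missing_representation(concept_vectors)
coverage = 1 - len(missing) / len(concept_vectors)
print(missing, f"coverage: {coverage:.0%}")
```

Running the same check against several embedding sets gives a simple way to pick the one that covers most of the concepts.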

Measuring similarity using spaCy

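This class measures the similarity between medicine, encoded as spaCy's GloVe word vector, and every other concept.

import typing

import numpy as np


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        # n_threads is accepted by spaCy 2.0; later releases deprecated and then removed it
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

This code returns a number for each concept measuring how similar it is to the centroid. Furthermore, it records the indices of concepts missing their representation. It might be called like this:

import json

import spacy

nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")

with open("concepts_new.txt") as f:
    concepts = json.load(f)
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

You may substitute your data in place of concepts_new.txt.

Look at spacy.load and notice that I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; spaCy's documentation has more information.

Additionally, you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I'm not sure whether that would positively affect your case, though.

Another possibility is to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in such a case; this is likely to remove some false positives, but may miss some terms that one could consider similar to medicine. Furthermore, it would complicate things much more, but if your results are unsatisfactory you should consider the two options above (and only then; don't jump into this approach without previous thought).

Now we have a rough measure of a concept's similarity. But what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe that's too far away already?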

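Asking expert

To get a threshold (terms below it will be considered non-medical), it is easiest to ask a human to classify some of the concepts for us (and that is what active learning is about). Yes, I know it is a really simple form of active learning, but I would consider it such anyway.

I have written a class with an sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

The samples argument describes how many examples will be shown to the expert during each iteration (it is a maximum; fewer will be shown if samples were already asked about or there are not enough of them to show).
step represents the drop of the threshold in each iteration (we start at 1, meaning perfect similarity).
change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, as several of them are shown), step is multiplied by this floating point number. It is used to pinpoint the exact threshold between step changes at each iteration.
Concepts are sorted by their similarity (the more similar a concept is, the higher it ranks).

The function below asks the expert for an opinion; the optimal threshold is then found based on their answers.

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

An example question looks like this:

Are those concepts related to medicine?

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y

... parsing an answer from the expert:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as current threshold is not related to medicine already
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False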

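And finally the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self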

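All in all, you would have to answer some questions manually, but this approach is way more accurate in my opinion.

Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (if 40 medical samples and 10 non-medical samples are shown, should the batch still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to still be valid.

Once again: this approach should be mixed with others in order to minimize the chance of wrong classification.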

Classifier

Once we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for classification:

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

The final script consists of the Similarity, ActiveLearner and Classifier classes shown in this answer, plus the following imports and entry point:

import json
import typing

import numpy as np
import spacy

# The Similarity, ActiveLearner and Classifier class definitions go here

if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    with open("concepts_new.txt") as f:
        concepts = json.load(f)
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

After answering some questions, with the threshold at 0.1 (everything in [-1, 0.1) is considered non-medical, while [0.1, 1] is considered medical), I got the following results:

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned in the beginning, mixing my approach with the other answers would probably rule out ideas like sport shoe belonging to medicine, and the active learning approach would be more of a decisive vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use several of them (either increasing or decreasing); let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this:

True True False False False

Taking a majority vote, we would mark it non-medical by 3 votes to 2. Furthermore, a too-strict threshold would be mitigated as well if the thresholds below it out-vote it (the case where the True/False values look like this: True True True False False).
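A minimal sketch of this threshold ensemble, assuming one similarity score per concept. The function names and the similarity value 0.25 are made up for illustration; only the five thresholds come from the text above.

```python
def threshold_votes(similarity, thresholds):
    """One vote per threshold: True when the similarity reaches that threshold."""
    return [similarity >= t for t in thresholds]


def is_medical(similarity, thresholds):
    """Majority vote over the per-threshold decisions."""
    votes = threshold_votes(similarity, thresholds)
    return sum(votes) > len(votes) / 2


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
similarity = 0.25  # hypothetical similarity of "sport shoe" to the centroid

print(threshold_votes(similarity, thresholds))  # [True, True, False, False, False]
print(is_medical(similarity, thresholds))       # False
```

In this form the vote is equivalent to comparing against the median threshold; it becomes more interesting when each threshold comes from a separate expert session.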

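A final possible improvement I came up with: in the code above I am using the Doc vector, which is the mean of the word vectors making up the concept. Say one word is missing (its vector consists of zeros); in that case, the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors that are different from zero.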
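Averaging only the non-zero vectors could be sketched as follows. The helper mean_of_nonzero and the two-dimensional toy vectors are assumptions for illustration; with spaCy the per-word vectors would come from the tokens of the Doc.

```python
def mean_of_nonzero(vectors):
    """Average only the vectors that have at least one non-zero component."""
    present = [vec for vec in vectors if any(vec)]
    if not present:
        return None  # no word of the concept has a representation
    dim = len(present[0])
    return [sum(vec[i] for vec in present) / len(present) for i in range(dim)]


# Toy word vectors for a two-word concept whose second word is out of vocabulary
word_vectors = [[0.2, 0.4], [0.0, 0.0]]
print(mean_of_nonzero(word_vectors))  # [0.2, 0.4] instead of the plain mean [0.1, 0.2]
```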

I know this post is quite lengthy, so if you have any questions, post them below.

Answer 2

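"Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category)"

MediaWiki categories are themselves wiki pages. A "parent category" is just a category that the "child" category page belongs to. So you can get the parent categories of a category in exactly the same way as you would get the categories of any other wiki page.

For example, using pymediawiki:

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('Category:Enzyme inhibitors')
parents = p.categories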
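Because parent categories are obtained the same way at every level, the lookup can be repeated to climb several levels. The sketch below assumes a get_parents callable that returns the parent category titles of a page (for instance a thin wrapper around the library call shown in this answer); the toy graph and its edges are made up so that the example runs offline.

```python
from collections import deque


def ancestor_categories(category, get_parents, max_depth=3):
    """Collect parent categories up to max_depth levels above `category`."""
    seen = set()
    queue = deque([(category, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue
        for parent in get_parents(current):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


# Toy category graph used instead of live Wikipedia lookups (made-up edges)
toy_parents = {
    "Category:Enzyme inhibitors": ["Category:Enzymes", "Category:Pharmacology"],
    "Category:Pharmacology": ["Category:Medicine"],
    "Category:Enzymes": ["Category:Proteins"],
}
ancestors = ancestor_categories(
    "Category:Enzyme inhibitors", lambda c: toy_parents.get(c, [])
)
print("Category:Medicine" in ancestors)  # True for this toy graph
```

Keeping max_depth small matters on the real category graph, since a few levels up almost everything is connected to almost everything else.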

Answer 3

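You could try to classify the Wikipedia categories by the MediaWiki links and backlinks returned for each category:

import re
from mediawiki import MediaWiki

# TermFind searches a list of strings for a given term
def TermFind(term, termList):
    for val in termList:
        if re.search(re.escape(term), val):
            return True
    return False

# Check whether both the links and the backlinks of a page contain a given term
def BoundedTerm(wikiPage, term):
    aList = wikiPage.links
    bList = wikiPage.backlinks
    return TermFind(term, aList) and TermFind(term, bList)

# Page titles to classify (example values; replace with your own)
termlist = ['Category:Enzyme inhibitors', 'Category:Climate']

container = []
wikipedia = MediaWiki()
for val in termlist:
    cpage = wikipedia.page(val)
    # 'medicine' is one of the terms mentioned below; 'biology' or 'disease' also work
    if BoundedTerm(cpage, 'medicine'):
        container.append('medical')
    else:
        container.append('nonmedical')

The idea is to guess a term that is shared by most of the categories; I tried biology, medicine and disease with good results. Perhaps you can use multiple calls of BoundedTerm to make the classification, or a single call for multiple terms, and combine the results. Hope it helps.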
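The "single call for multiple terms" variant might look like the sketch below. It is an assumption about how the results could be combined, not the answer's own code: bounded_terms, min_hits and the case-insensitive matching are illustrative choices, and the link lists are made up in place of real page links and backlinks.

```python
def contains_term(term, titles):
    """True when any title contains the term (case-insensitive)."""
    term = term.lower()
    return any(term in title.lower() for title in titles)


def bounded_terms(links, backlinks, terms, min_hits=1):
    """Find the terms present in both lists; medical when at least min_hits match."""
    hits = [
        term
        for term in terms
        if contains_term(term, links) and contains_term(term, backlinks)
    ]
    return len(hits) >= min_hits, hits


# Made-up link lists standing in for wikiPage.links and wikiPage.backlinks
links = ["Medicine", "Enzyme", "Drug discovery"]
backlinks = ["Outline of medicine", "Pharmacology"]
label, hits = bounded_terms(links, backlinks, ["biology", "medicine", "disease"])
print(label, hits)  # True ['medicine']
```

Raising min_hits makes the rule stricter, which trades recall for fewer false positives.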

Answer 4

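There is a concept of word vectors in NLP. Basically, by looking through massive volumes of text, it tries to convert words into multi-dimensional vectors; the smaller the distance between those vectors, the greater the similarity between the words. The good thing is that many people have already generated these word vectors and made them available under very permissive licenses. In your case you are working with Wikipedia, whose text dump (from which such vectors can be built) is available here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

These would be the most suited to this task, since they contain most words from Wikipedia's corpora, but in case they do not suit you or are removed in the future, you can use one of the alternatives I list below. That said, there is a better way to do this: pass the text to TensorFlow's universal language model embed module, which does most of the heavy lifting for you; you can read more about that here. The reason I put it after the Wikipedia text dump is that I have heard people say these models are a bit hard to work with on medical samples. This paper does propose a solution to tackle that, but I have never tried it, so I cannot be sure of its accuracy.

Using the embeddings from TensorFlow is simple, just do:

import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as", " List of strings"])
with tf.Session() as session:  # TensorFlow 1.x API
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    session.run(embeddings)

Since you might not be familiar with TensorFlow, you might run into trouble trying to run just this piece of code. Follow this link, where they describe completely how to use it; from there you should be able to easily modify it to your needs.

With that said, I would recommend first checking out TensorFlow's embed module and its pre-trained embeddings; if they don't work for you, check out the Wikimedia link; if that also doesn't work, then proceed to the concepts of the paper I have linked. Since this answer describes an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed.

GloVe vectors: https://nlp.stanford.edu/projects/glove/

Facebook's fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Or this: http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz

If you run into problems implementing this after following the Colab tutorial, add your problem to the question and comment below; from there we can proceed further.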

EDIT: Added code to cluster topics

In brief, rather than using word vectors, I am encoding the summary sentences of the topics.

File content.py

def AllTopics():
    topics = []  # list all your topics here (not added for space restrictions)
    for topic in topics:
        yield topic

File summaryGenerator.py

import pickle

import wikipedia

from content import AllTopics

summary = []
failed = []
for topic in AllTopics():
    try:
        # Store (topic, summary text) pairs; the similarity script reads them by index
        summary.append((topic, wikipedia.summary(topic)))
    except Exception as e:
        failed.append((topic, str(e)))

with open("summary.txt", "wb") as fp:
    pickle.dump(summary, fp)
with open("failed.txt", "wb") as fp:
    pickle.dump(failed, fp)

File SimilartiyCalculator.py

import pickle
import sys

import tensorflow as tf
import tensorflow_hub as hub
from scipy.cluster import hierarchy
from sklearn.cluster import AgglomerativeClustering

try:
    with open("summary.txt", "rb") as fp:  # Unpickling
        summary = pickle.load(fp)
except Exception as e:
    print('Cannot load the summary file. Please make sure that it exists; if not, run Summary Generator first', e)
    sys.exit('Read the error message')

# TensorFlow 1.x / tensorflow_hub Module API
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

tf.logging.set_verbosity(tf.logging.ERROR)
messages = [x[1] for x in summary]
labels = [x[0] for x in summary]
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages))

# message_embeddings is a numpy.ndarray of shape (number of messages, 512):
# one 512-dimensional vector per summary
X = message_embeddings
# Ward linkage works on Euclidean distances, the scikit-learn default
agl = AgglomerativeClustering(n_clusters=5, linkage='ward')
agl.fit(X)
# linkage expects the observations themselves, not a square distance matrix
Z = hierarchy.linkage(X, 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_

This is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and the other files. In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (directly clone the repository and run SummaryGenerator.py) and upload similarity.txt via a pull request in case you don't get the expected result. If possible, also upload the message_embeddings in a CSV file as topics and their embeddings.

Changes after edit 2: switched the similarityGenerator to hierarchy-based (agglomerative) clustering. I would suggest keeping the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I verified by viewing some samples, and the results look quite good. You can change the n_clusters value to fine-tune your model. Note: this requires you to run the summary generator again. I think you should be able to take it from here: try a few values of n_clusters and see with which one all medical terms are grouped together, then find the cluster_label for that cluster, and you are done. Since we group by summary here, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.
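Finding the cluster that holds the medical terms can be automated once a few terms are known to be medical. The sketch below is an assumption about how that last step might be done: medical_cluster and the seed set are hypothetical, and the toy lists stand in for the labels and agl.labels_ arrays produced by SimilartiyCalculator.py.

```python
from collections import Counter


def medical_cluster(labels, cluster_labels, seed_terms):
    """Return the cluster id that contains most of the known medical seed terms."""
    counts = Counter(
        cluster
        for label, cluster in zip(labels, cluster_labels)
        if label in seed_terms
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy data standing in for the topic labels and their assigned clusters
labels = ["hypertriglyceridemia", "enzyme inhibitor", "bypass surgery", "perth", "climate"]
cluster_labels = [2, 2, 2, 0, 1]
seeds = {"enzyme inhibitor", "bypass surgery"}  # terms already known to be medical

target = medical_cluster(labels, cluster_labels, seeds)
medical_terms = [l for l, c in zip(labels, cluster_labels) if c == target]
print(target, medical_terms)
```

Repeating this for several n_clusters values and checking how many seed terms fall outside the chosen cluster gives a rough way to compare the settings.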

Answer 5

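The wikipedia library is also a good bet for extracting the categories from a given page, as a page's categories attribute returns a simple list. The library also lets you search multiple pages should they all have the same title.

In medicine there seem to be a lot of key roots and suffixes, so looking for key words may be a good approach to finding medical terms.

[code block not preserved in this copy]

The code actually just compares lists of key words and suffixes to the titles of each page, as well as their categories, to determine whether a page is medically related. It also looks at related pages/sub-pages for the bigger topics and determines whether those are related as well. I am not well versed in medicine, so forgive the categories, but here is an example to tag onto the bottom:

[example list not preserved in this copy]

This example list gets ~70% of what should be on the list, at least as far as I can tell.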
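A rough sketch of the keyword-and-suffix idea described in this answer. It is not the answer's original code: the suffix and keyword tuples are short illustrative assumptions, and looks_medical is a hypothetical helper that checks a title together with its categories.

```python
MEDICAL_SUFFIXES = ("itis", "emia", "ectomy", "osis", "pathy")  # illustrative, not exhaustive
MEDICAL_KEYWORDS = ("surgery", "surgical", "medical", "disease", "disorder")  # illustrative


def looks_medical(title, categories=()):
    """Heuristic: a medical suffix on any word, or a keyword in the title or categories."""
    texts = [title.lower()] + [c.lower() for c in categories]
    for text in texts:
        if any(keyword in text for keyword in MEDICAL_KEYWORDS):
            return True
        if any(word.endswith(MEDICAL_SUFFIXES) for word in text.split()):
            return True
    return False


print(looks_medical("hypertriglyceridemia"))  # True (suffix "emia")
print(looks_medical("bypass surgery"))        # True (keyword "surgery")
print(looks_medical("climate", ["Category:Climate", "Category:Climatology"]))  # False
```

Short suffixes also match everyday words (for example "osmosis"), so a list like this needs tuning against real data before it can be trusted.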

Answer 6

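The question appears a little unclear to me; it does not seem like a straightforward problem to solve and may require some NLP model. Also, the words concept and category are used interchangeably. What I understand is that concepts such as enzyme inhibitor, bypass surgery and hypertriglyceridemia need to be binned together as medical and the rest as non-medical. This problem will require more data than just the category names. A corpus is required to train an LDA model (for instance), where the entire text information is fed to the algorithm and it returns the most likely topics for each of the concepts.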
