How to stabilize KMeans clustering in Python


I am analyzing a given text in order to identify its "functional requirements". To do this, I extract the nouns and verbs from the text and apply KMeans clustering to group semantically similar words together. I then identify functional requirements by focusing on clusters that contain many verbs and proper nouns, since these typically indicate actionable tasks and specific entities, respectively. However, the clusters come out differently every time I run the code, so I would like to know whether there is a way to make the output identical on every run, so that I can apply further logic to it. I would also appreciate any feedback, comments, or suggestions for improving this approach to generating the functional requirements of an SRS.

Here is the code:

    import spacy
    import numpy as np
    from sklearn.cluster import KMeans
    from spacy.lang.en.stop_words import STOP_WORDS
    import re
    import nltk
    # nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer
    wordnet_lemmatizer = WordNetLemmatizer()
    from textblob import TextBlob

    nlp = spacy.load("en_core_web_sm")

    nouns = set()
    verbs = set()

    sentence = "A hotel has a certain number of rooms. Each room can be either single bed or double bed type and may be AC or Non-AC type. Guests can reserve rooms in advance or can reserve rooms on the spot depending upon availability of rooms. The receptionist would enter data pertaining to guests such as their arrival time, advance paid, approximate duration of stay, and the type of the room required. Depending on this data and subject to the availability of a suitable room, the computer would allot a room number to the guest and assign a unique token number to each guest. If the guest cannot be accommodated, the computer generates an apology message. The hotel catering services manager would input the quantity and type of food items as and when consumed by the guest, the token number of the guest, and the corresponding date and time. When a customer prepares to check-out, the hotel automation software should generate the entire bill for the customer and also print the balance amount payable by him. During check-out, guests can opt to register themselves for a frequent guests program."

    clean_text = re.sub(r'[^A-Za-z\s]', '', sentence)

    # Tokenize the sentence and remove stopwords
    tokens = nlp(clean_text)
    filtered_words = [token.text for token in tokens if token.text.lower() not in STOP_WORDS]
    filtered_sentence = ' '.join(filtered_words)

    blob = TextBlob(filtered_sentence)
    tags = blob.tags
    # print(tags)

    for word, tag in tags:
        if tag.startswith('NN'):  # Check if the tag indicates a noun
            nouns.add(word.lower())
        elif tag.startswith('VB'):  # Check if the tag indicates a verb
            verbs.add(word.lower())

    # print("Nouns => ", nouns)
    # print("Verbs => ", verbs)

    combined_sentence = list(nouns) + list(verbs)
    newstr = ' '.join(combined_sentence)
    filtered_tokens = nlp(newstr)
    word_vectors = np.array([token.vector for token in filtered_tokens])

    # Perform K-means clustering on the word vectors
    num_clusters = 15
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(word_vectors)

    # Retrieve the cluster labels for each filtered token
    cluster_labels = kmeans.labels_

    word_clusters = {}
    for i, token in enumerate(filtered_tokens):
        if token.text == ".":
            continue
        if cluster_labels[i] in word_clusters:
            word_clusters[cluster_labels[i]].append(token.text)
        else:
            word_clusters[cluster_labels[i]] = [token.text]

    for cluster_id, words in word_clusters.items():
        print(f"Cluster {cluster_id + 1}: {', '.join(words)}")

    # Calculate verb density threshold based on total words in clusters
    threshold = len(filtered_tokens) // num_clusters

    print("\nFunctional Requirements:")
    for cluster_id, words in word_clusters.items():
        verb_count = sum(1 for word in words if nlp(word)[0].pos_ == 'VERB' or nlp(word)[0].pos_ == 'PROPN')
        if verb_count >= 1:
            print(f"Cluster {cluster_id + 1}: {', '.join(words)}")

1 Answer

What you need is the random_state parameter of the KMeans constructor (see the KMeans documentation).

As a simpler example, run the following code in a Jupyter notebook cell:

    import numpy as np
    from sklearn.cluster import KMeans

    np.random.seed(seed=20)
    test = np.random.random(size=(100, 6))

    kmeans = KMeans(n_clusters=4, random_state=159)
    kmeans.fit(test)
    kmeans.predict(test)

Then note that running the following code in a different cell always gives the same result:

    new_kmeans = KMeans(n_clusters=4, random_state=159)
    new_kmeans.fit(test)
    new_kmeans.predict(test)

(Even after restarting the kernel, you still get the same result.)
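Applied to the code in the question, the only change needed is to pass random_state when constructing the estimator. A minimal sketch (the seed value 42 and the explicit n_init=10 are arbitrary choices, not taken from the original code):

    from sklearn.cluster import KMeans

    # Fixing random_state makes the centroid initialization deterministic, so
    # repeated runs on the same word_vectors produce the same cluster labels.
    num_clusters = 15
    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    kmeans.fit(word_vectors)  # word_vectors built as in the question's pipeline
    cluster_labels = kmeans.labels_

This only pins down the randomness of the initialization; with the same input vectors and the same seed, every run then yields identical labels.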
