IndexError when running CategoricalNB() with SKLearn


I am currently getting an index error while trying to run CategoricalNB with SKLearn. I am doing this for my data class homework.

For context, here are the instructions I was given. I am currently stuck on Step 4.1.

*For your third reflection, you will implement Naive Bayes through sklearn's library. Using the same dataset as your previous homework, Homework 3, you will implement sklearn's Naive Bayes classifier for training and testing.

This is an individual assignment.

Step 1: Add the post to your WordPress page or site: https://umw.domains/ (Links to an external site.)

Step 2: Load this dataset from Kaggle.

Step 3: Select 7 categorical features/columns to use as evidence. (Since they are already ordinally encoded, you do not need to re-encode them.)

- Provide rationale for selecting these 7.

Step 4: Train, test, and report on (reflect on) what you observe using the "forward selection algorithm".

Step 4.1: Train p different single-feature classifiers, where p is the number of features (in your case, p = 7), and see how each one performs. Use the f1-measure as the sole criterion for selecting the best feature. (You will still compute other metrics, e.g. accuracy, precision, and recall.) Select the feature that works best on its own as the winner. Print the confusion matrix for the winning feature. Provide a summary.

Step 4.2: Now, for each of the remaining p - 1 features, train a classifier using your original winner plus the new candidate. Select the "best" two-feature classifier as the winner. Again, use the f1-score as the criterion. Print the confusion matrix for the winning feature set. Provide a summary.

Step 4.3: Repeat this for each of the remaining p − 2, p − 3, p − 4, ... features, finding the "best" three-feature, four-feature, five-feature, ... classifier. Print the confusion matrix for the winning feature set. Provide a summary.

Step 4.4: Your final answer is the best of the "bests". (From the "best" single-feature, two-feature, etc. classifiers, pick the one that performs best. That is the feature set you will declare the winner.)*
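
To check my own understanding before writing the actual script, here is a rough sketch of the forward selection that steps 4.1 through 4.4 describe (evaluate_feature_set is a hypothetical stand-in for the cross-validated F1 scoring; it is not defined by the assignment or by my code below):

# Rough sketch of steps 4.1-4.4, not my actual script.
# evaluate_feature_set(feats) is a hypothetical helper that would return the
# cross-validated F1-score of a CategoricalNB trained on df[feats].
def forward_selection(columns, evaluate_feature_set):
    selected = []                 # winners chosen so far
    best_set, best_f1 = [], 0.0   # best of the "bests" (step 4.4)
    remaining = list(columns)
    while remaining:
        # Steps 4.1-4.3: try adding each remaining feature to the current winners
        scored = [(feat, evaluate_feature_set(selected + [feat])) for feat in remaining]
        winner, f1 = max(scored, key=lambda pair: pair[1])
        selected.append(winner)
        remaining.remove(winner)
        if f1 > best_f1:
            best_set, best_f1 = list(selected), f1
    return best_set, best_f1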

Now here is my Python code. The error is raised when calling clf.predict():

# -*- coding: utf-8 -*-
"""
Created on Tue Mar 21 15:39:24 2023

@author: nico
"""

import pandas as pd
import numpy as np

from sklearn.model_selection import KFold
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import precision_score, recall_score, f1_score

#read dataset
df = pd.read_csv('./Datasets/dataset.csv')
#print(len(df))

#make sure there are only graduates and dropouts (no other target is allowed)
df = df[(df['Target']=='Graduate') | (df['Target'] == 'Dropout')]
#print(len(df))
#Subsample to balance the dataset
grad = df[df['Target']=='Graduate']
drop = df[df['Target']=='Dropout']
grad = grad.sample(n=len(drop), random_state=101)
df = pd.concat([grad, drop],axis=0)
print(len(df[df['Target']=='Graduate']))
print(len(df[df['Target']=='Dropout']))

#splits data into Y which contains the Target and X which contains all seven evidence columns
Y = df['Target']

#Rename columns so the names do not contain spaces or apostrophes
df.rename(
    columns={"Mother's qualification" : "Mom_Qualification", "Mother's occupation" : "Mom_Occupation", "Father's qualification" : "Dad_Qualification", "Father's occupation" : "Dad_Occupation", "Age at enrollment" : "Age_Enrolled"},
    inplace=True,
    )
X = df[['Course','Mom_Qualification','Mom_Occupation','Dad_Qualification','Dad_Occupation','Displaced','Age_Enrolled']]


#print(df.info())
'''
The Seven Columns I selected as evidence are:
    
1.    Course -                  The reason I chose Course is because some courses are much harder than others, having higher dropout rates in comparison to other courses.
                                Thus if a student is taking one of the harder courses they are probably more likely to drop out.

2.    Mom's Qualification -     The reason I chose Mother's Qualification is because I think it could have a substantial influence on whether or not their child graduates.
                                If a mother has high/outstanding qualifications then their child would have more support, be influenced to follow in their parent's footsteps, and
                                be more likely to graduate.

3.    Mom's Occupation -        The reason I chose Mother's Occupation is because I think it could have a similar influence to a mother's qualification, where a higher-paying occupation
                                will provide more support and opportunities to the child, thus leading to a higher chance of graduating.

4.    Dad's Qualification -     Same reasons as Mom's Qualification.

5.    Dad's Occupation -        Same reasons as Mom's Occupation.

6.    Displaced -               The reason I chose Displaced is because I think students who are displaced will be under more stress and thus be more likely to drop out. So it could be
                                a telling factor as to whether or not a student is likely to graduate.

7.    Age Enrolled -            The reason I chose Age Enrolled is mostly because I found it interesting and wondered whether certain ages have a higher rate of dropping out.
'''

#array containing all the evidence columns
columns = ['Course','Mom_Qualification','Mom_Occupation','Dad_Qualification','Dad_Occupation','Displaced','Age_Enrolled']
k = 10

print('Number of Graduates: ',len(df[df['Target'] == 'Graduate']))
print('Number of Dropouts: ',len(df[df['Target'] == 'Dropout']))

totalf1s = []
bestF1 = 0
bestFeat = ''
bestConfusion = []
i = 0

#Notes from talking to the professor about the error:

#give one column at a time, then two, then three, until you reach all seven
#you're giving it seven columns when it's only predicting one, so it's getting an error
#also define min_categories for each column. Look at NB_Sklearn.py

#have it iterate through with one feature the first time and add one each time after that
#just make sure to iterate through each

#work on it over the weekend. Due date extended because of the error

#meeting with the professor next week for a longer discussion


#Step 4.1   
alreadyUsed = []

#iterate through each candidate feature (column) of the DF
for feat in columns: 
    #number of unique categories in this column, passed to CategoricalNB as min_categories
    minCats = len(df[feat].unique())
    
    #Subset the evidence data to only include one column
    singleX = df[[feat]]
    #KFold cross-validation that shuffles the data and has 10 folds
    kf = KFold(n_splits=k, shuffle=True)
    
    
    #Initialize lists to keep track of scores
    f1s = []
    accuracy = []
    precision = []
    recall = []
    predicted = []
    actual = []
    # print(feat)
    # print(minCats)

    #loop through each fold in the cross-validation
    for train_index, test_index in kf.split(X):
        #Keep track of which fold the system is on
        print("Fold: ", i+1)
        i+=1
        
        #Initialize classifier (with the appropriate min_categories)
        clf = CategoricalNB(min_categories=minCats)
        
        #get training data. For the evidence get the data from the single column subset of X
        trainX = singleX.iloc[train_index]
        trainY = Y.iloc[train_index]
        
        #get test data
        testX = singleX.iloc[test_index]
        testY = Y.iloc[test_index]
        
        #Train classifier on training set
        clf.fit(trainX, trainY)
        
        
        #make predictions on the test set
        predictions = clf.predict(testX) #INDEXERROR GIVEN HERE
       
        #Record predictions and update the lists that keep track of all scores for all columns
        f1s.append(f1_score(testY, predictions, pos_label="Graduate"))
        accuracy.append(accuracy_score(testY,predictions))
        precision.append(precision_score(testY, predictions, pos_label='Graduate'))
        recall.append(recall_score(testY, predictions, pos_label='Graduate'))
        
        #end of KFold cross-validation
    
    #calculates average performance metrics across the K folds
    avgF1 = sum(f1s) / len(f1s)
    avgAccuracy = sum(accuracy) / len(accuracy)
    avgPrecision = sum(precision) / len(precision)
    avgRecall = sum(recall) / len(recall)
    
    #Print score info out for a single column
    print(feat, 'Averages Summary:')
    print('F1-Score: ',avgF1)        
    print('Accuracy: ',avgAccuracy)
    print('Precision: ',avgPrecision)
    print('Recall: ', avgRecall)
    #Reset the counter
    i = 0
    
    #I append to alreadyUsed[] but haven't used it yet; keeping it just in case
    alreadyUsed.append(feat)
    
    #If the current column's F1 score is better than the current best, it becomes the new best
    if avgF1 > bestF1:
        bestF1 = avgF1 #record the f1 score
        bestFeat = feat #record the feature with the best f1 score
        bestConfusion = confusion_matrix(testY, predictions) #record the confusion matrix (from the last fold) for the best feature
print(bestFeat, "is the best one-feature classifier.")
print("Confusion Matrix: ", bestConfusion)        
 


#IGNORE THIS: implementing it once I get the above step working, before I try to continue this part further
#Step 4.2 (haven't tested this yet)
best_feat_set = [bestFeat]
best_feat_set_score = bestF1
for i in range(2, len(columns)+1):
    best_i_feat = ''
    best_i_score = 0
    for feat in columns:
        if feat not in best_feat_set:
            feat_set = best_feat_set + [feat]
            clf = CategoricalNB()
            kf = KFold(n_splits=5, shuffle=True, random_state=42)
           #Need to finish this part once I get 4.1 working

**Here is the error I am receiving:**

Traceback (most recent call last):

  File "D:\Comp Sci\219\ReflectionSKNB.py", line 139, in <module>
    predictions = clf.predict(testX)

  File "D:\Comp Sci\MyAnacondaDont\lib\site-packages\sklearn\naive_bayes.py", line 83, in predict
    jll = self._joint_log_likelihood(X)

  File "D:\Comp Sci\MyAnacondaDont\lib\site-packages\sklearn\naive_bayes.py", line 1461, in _joint_log_likelihood
    jll += self.feature_log_prob_[i][:, indices].T

IndexError: index 29 is out of bounds for axis 1 with size 29

The code successfully completes the KFold cross-validation for the first column. However, on the second iteration of the outer for loop I get the index error. So it runs successfully for the first column, but the second column is where the error pops up.
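
One thing I have been thinking about checking (a sketch only, not part of the script above) is whether a test fold ever contains category values that the training fold never saw, since, judging from the traceback, CategoricalNB indexes its fitted feature_log_prob_ tables directly by the integer values in each test column:

import pandas as pd

# Hypothetical helper (not in my script): list category values that appear in the
# test fold of a column but never appear in the training fold of that column.
def unseen_test_values(trainX: pd.DataFrame, testX: pd.DataFrame, col: str):
    return sorted(set(testX[col].unique()) - set(trainX[col].unique()))

# Inside the fold loop above it could be called as: unseen_test_values(trainX, testX, feat)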

When I first got this error, I went to my professor for help. I met with him yesterday (March 25) and have been trying to implement his fixes, to no avail. During our discussion we determined that the cause of the index error was that I was not using one column as my evidence (the X array), but all seven. So predict was trying to predict with one column but was being fed seven.

To fix this, I created the singleX dataframe. In the code this reads: singleX = df[[feat]]

The other reason the professor thought I might be getting this error is that I originally did not set the min_categories parameter when initializing the classifier.

To fix this, I created a minCats variable that holds the number of categories in that particular column of the dataframe. So, when calling CategoricalNB(), I set the min_categories parameter equal to minCats.

After implementing his fixes, I am still receiving the IndexError.
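
For reference, here is a tiny standalone snippet (with made-up values, not my real data) that produces the same kind of IndexError from predict(), in case it helps show the shape of what I seem to be running into:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Made-up single-column data: the training rows only contain the codes 0, 1 and 2,
# but the test row contains the code 5, which the fitted model has no column for.
trainX = np.array([[0], [1], [2], [1]])
trainY = np.array(['Graduate', 'Dropout', 'Graduate', 'Dropout'])
testX = np.array([[5]])

clf = CategoricalNB(min_categories=3)  # 3 = number of unique values in the training column
clf.fit(trainX, trainY)
clf.predict(testX)  # raises IndexError: index 5 is out of bounds for axis 1 with size 3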

I do not know what else I should or could do to resolve this. I have read similar Stack Overflow posts that report the same error, but have not found a solution that fits my particular problem.

If anyone could let me know why this error is occurring and how to fix it, I would greatly appreciate it, so that I can finally move on to the remaining steps of the assignment. If more information is needed, please let me know! Thank you in advance!

python scikit-learn data-analysis prediction naivebayes