How can I increase the accuracy score of MultinomialNB() with sklearn, and display the results in a graph with matplotlib?

Problem description (votes: 0, answers: 1)

I am working on a dataset that looks like the one shown here: this

In the attached screenshot you can see 16 rows and 12 columns of my dataset, but in reality it contains 521 rows and 12 columns.

  • Column 1: "Menarche started early"
  • Column 2: "Oral contraceptives"
  • Column 3: "Diet maintained"
  • Column 4: "Affected by breast cancer"
  • Column 5: "Affected by cervical cancer?"
  • Column 6: "History of cancer in the family?"
  • Column 7: "Education?"
  • Column 8: "Age of husband"
  • Column 9: "Menopause end age?"
  • Column 10: "Food contains high fat?"
  • Column 11: "Abortion?"
  • Column 12: "Affected by ovarian cancer?"

All of the columns contain categorical variables, so I preprocessed the dataset with LabelEncoder and OneHotEncoder, and to avoid the dummy variable trap I dropped the first dummy column of every feature that produced more than two dummy columns.
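For reference, the same preprocessing can be done in one step with pandas; a minimal sketch, assuming dataset is the DataFrame loaded in the code below and every feature column holds categorical strings:

import pandas as pd

# One-hot encode all feature columns and drop the first level of each,
# which avoids the dummy variable trap in a single call
X = pd.get_dummies(dataset.iloc[:, :-1], drop_first=True).values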

Then I split the dataset into two parts with test_size = 0.25 and random_state = 18, fitted X_train and y_train to MultinomialNB(), and got an accuracy score of 0.7938931297709924.
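One standard way to try to push that score higher is to tune the alpha smoothing parameter of MultinomialNB with cross-validation; a minimal sketch, reusing X_train and y_train from the code below (the grid of alpha values is an assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Search a small grid of smoothing strengths with 5-fold cross-validation
param_grid = {'alpha': [0.01, 0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)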

Then I built a few learning curves that look like this and this one, but most importantly my model gives an R-squared value of 0.557 and an Adj. R-squared of 0.543, which I assume is bad.
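Those R-squared values come from the separate OLS fit further down in the code, not from the classifier itself; for a classifier, a cross-validated accuracy is usually the more meaningful yardstick. A minimal sketch, assuming X and y as prepared in the code below:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 10-fold cross-validated accuracy gives a steadier estimate than one split
scores = cross_val_score(MultinomialNB(), X, y, cv=10, scoring='accuracy')
print(scores.mean(), scores.std())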

Here is my confusion matrix: confusion matrix. I want both the R-squared and adjusted R-squared values to be around 1, but I don't understand how to do that effectively, because I am new to this field and have never worked with a dataset in which every variable is categorical and there are no numeric values. Please help me build a better model with the Naive Bayes algorithm, let me know if you find any mistakes in my model, and help me build data visualization graphs for it by pointing me to resources, tutorials and code samples. Here is my code:


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

#Importing the dataset
dataset = pd.read_csv('RiskFactor.csv')
X =  dataset.iloc[:, :-1].values
y = dataset.iloc[:, 11].values
#dummy_x = dataset.iloc[:, [0,6,7,8]].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label-encode all 11 feature columns (column 0 = "Menarche started early",
# column 6 = "Education", column 7 = "Age of husband",
# column 8 = "Menopause end age?"); each column gets its own encoder
for col in range(11):
    X[:, col] = LabelEncoder().fit_transform(X[:, col])

# One-hot encode every column (in current sklearn the removed
# categorical_features="all" argument is simply the default behaviour)
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(X).toarray()


#avoiding dummy variable trap by removing extra columns 

X = X[: ,[1,2,3,4,5,6,7,8,9,10,11,12,14,15,17,18,20,21,22,23,24,25,26]]


# Encoding the Dependent Variable

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=18)

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

print(classifier)

y_expect = y_test



#predicting the test set result

y_pred = classifier.predict(X_test)

#Making the Confusion Matrix

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)


print(accuracy_score(y_expect,y_pred))


# finding the p-values from statsmodels

import statsmodels.api as sm

regressor_OLS = sm.OLS(endog=y, exog=X).fit()

print(regressor_OLS.summary())
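# Aside (illustrative): OLS on a 0/1 target is a linear probability model,
# so its R-squared does not measure classification quality; the
# accuracy_score and confusion matrix above are the relevant metrics here.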

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds,
          - an object to be used as a cross-validation generator,
          - an iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    train_sizes : array-like, optional
        Fractions of the training set used to generate the curve.
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



estimator = MultinomialNB()


title = "Learning Curves (Naive Bayes classifier ALGORITHM)"
# Cross-validation with 100 iterations to get smoother mean test and train
# score curves, each time with 25% of the data randomly selected as a
# validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.25, random_state=17)

#cv = None
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=1)

plt.show()
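For the visualization part of the question, a minimal sketch of displaying the confusion matrix cm computed above as a matplotlib heatmap (the 0/1 classes are the labels produced by the LabelEncoder above):

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap='Blues')        # darker cells = more samples
for (i, j), v in np.ndenumerate(cm):    # annotate each cell with its count
    ax.text(j, i, str(v), ha='center', va='center')
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
ax.set_title('Confusion matrix')
fig.colorbar(im)
plt.show()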
Tags: matplotlib, machine-learning, scikit-learn, naivebayes, multinomial
1 Answer

0 votes
I've solved this problem by using PCA. Here is the code:


# -*- coding: utf-8 -*-
"""
Created on Tue Jul 31 22:38:32 2018

@author: MOBASSIR
"""


# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd




#Importing the dataset
dataset = pd.read_csv('ovarian.csv')
X =  dataset.iloc[:, :-1].values
y = dataset.iloc[:, 11].values
#dummy_x = dataset.iloc[:, [0,6,7,8]].values

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Label-encode all 11 feature columns (column 0 = "Menarche started early",
# column 6 = "Education", column 7 = "Age of husband",
# column 8 = "Menopause end age?"); each column gets its own encoder
for col in range(11):
    X[:, col] = LabelEncoder().fit_transform(X[:, col])




# One-hot encode only the multi-level columns 0, 6, 7 and 8; in current
# sklearn the removed categorical_features=[0,6,7,8] argument is expressed
# with a ColumnTransformer, which likewise places the encoded columns first
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0, 6, 7, 8])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)


# Avoiding the Dummy Variable Trap

"""

idx_to_delete = [0, 13, 16, 19]
X = [i for i in range(X.shape[-1]) if i not in idx_to_delete]

X = X[:, 1:]


df = pd.DataFrame(X, dtype='float64')


df = pd.to_numeric(X)

"""


#avoiding dummy variable trap by removing extra columns

#X = X[: ,[1,2,3,4,5,6,7,8,9,10,11,12,14,15,17,18,20,21,22,23,24,25,26]]

"""
#4,8,10,12,18,21,22,23 for dropped columns
#5,9,11,13,19,22,23,24 for dropped columns
#1,4,5,6 == 2,5,6,7
X = X[: ,[9,11,23,24]]
"""

#24,21,19,18,17,14,12,10,8,7,6,4,3,2,1 for undropped columns
#25,22,20,19,18,15,13,11,9,8,7,5,4,3,2
#2,5,6,8,12,15
X = X[:, [9, 13, 16, 18, 19]]


# Encoding the Dependent Variable

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
"""
onehotencoder = OneHotEncoder()
y= onehotencoder.fit_transform(y).toarray()
"""





# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)




# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)




# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
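# Illustrative check: how much of the variance the two retained principal
# components explain (ratios per component; their sum is the fraction kept)
print(explained_variance)
print(explained_variance.sum())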






#Applying naive bayes classifier

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# After StandardScaler and PCA the features can be negative, which
# MultinomialNB rejects, so Bernoulli Naive Bayes is used here instead
classifier = BernoulliNB()
classifier.fit(X_train, y_train)

print(classifier)

y_expect = y_test



#predicting the test set result

y_pred = classifier.predict(X_test)

#Making the Confusion Matrix

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)


print(accuracy_score(y_expect,y_pred))





# finding the p-values from statsmodels

import statsmodels.api as sm

regressor_OLS = sm.OLS(endog=y, exog=X).fit()

print(regressor_OLS.summary())




from sklearn.model_selection import cross_val_score

ck = BernoulliNB()
scores = cross_val_score(ck, X, y, cv=10, scoring='accuracy')
print(scores)

print(scores.mean())




from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):



    '''Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds,
          - an object to be used as a cross-validation generator,
          - an iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    train_sizes : array-like, optional
        Fractions of the training set used to generate the curve.'''


    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



estimator = BernoulliNB()


title = "Learning Curves (Naive Bayes classifier ALGORITHM)"
# Cross-validation with 100 iterations to get smoother mean test and train
# score curves, each time with 25% of the data randomly selected as a
# validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.25, random_state=0)

plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=1)

plt.show()

# End of the Naive Bayes classifier section


plt.rcParams['font.size'] = 14

# y_pred holds hard 0/1 class labels, not probabilities, so this histogram
# shows the distribution of predicted classes
plt.hist(y_pred, bins=8)
plt.xlim(0, 1)

plt.title('Predicted class labels')
plt.xlabel('Affected by ovarian cancer? (predicted)')
plt.ylabel('frequency')
plt.show()
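# A histogram of actual predicted probabilities (what the original title
# suggested) could instead be drawn from predict_proba; illustrative sketch:
probs = classifier.predict_proba(X_test)[:, 1]   # P(class 1) per test sample
plt.figure()
plt.hist(probs, bins=8)
plt.xlim(0, 1)
plt.title('Predicted probabilities')
plt.xlabel('P(affected by ovarian cancer)')
plt.ylabel('frequency')
plt.show()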




from sklearn.metrics import recall_score, precision_score

print(recall_score(y_test, y_pred, average='macro'))

print(precision_score(y_test, y_pred, average='micro'))
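# Optionally, a per-class summary of precision, recall and F1 in one call:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))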








# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()