Data types in Julia and MLJ


I'm new to Julia and am trying to fit a simple classification tree.

Package imports and environment activation:

using Pkg 
Pkg.activate(".")

using CSV
using DataFrames
using Random
using Downloads
using ARFFFiles
using ScientificTypes
using DataFramesMeta
using DynamicPipe
using MLJ
using MLJDecisionTreeInterface

Data:

titanic_reader  = CSV.File("/home/andrea/dev/julia/titanic.csv"; header = 1);
titanic = DataFrame(titanic_reader);

# remove missing values
titanic =  dropmissing(titanic);


titanic = @transform(titanic, 
    :class=categorical(:class), 
    :sex=categorical(:sex),  
    :survived=categorical(:survived)
    );

A look at the data:

first(titanic, 3)

3×4 DataFrame
 Row │ class  sex     age      survived 
     │ Cat…   Cat…    Float64  Cat…     
─────┼──────────────────────────────────
   1 │ 3      male       22.0  N
   2 │ 1      female     38.0  Y
   3 │ 3      female     26.0  Y

Checking the data schema:

schema(titanic)


┌──────────┬───────────────┬───────────────────────────────────┐
│ names    │ scitypes      │ types                             │
├──────────┼───────────────┼───────────────────────────────────┤
│ class    │ Multiclass{3} │ CategoricalValue{Int64, UInt32}   │
│ sex      │ Multiclass{2} │ CategoricalValue{String7, UInt32} │
│ age      │ Continuous    │ Float64                           │
│ survived │ Multiclass{2} │ CategoricalValue{String1, UInt32} │
└──────────┴───────────────┴───────────────────────────────────┘

The schema looks fine to me.

Preparing the data for modeling:

# target and features
y, X = unpack(titanic, ==(:survived), rng = 123);

# partition into training & test sets
(X_trn, X_tst), (y_trn, y_tst)  = partition((X, y), 0.75, multi=true,  rng=123);

Fitting the model:

# model
mod = @load DecisionTreeClassifier pkg = "DecisionTree" ;
fm = mod() ;
fm_mach = machine(fm, X_trn, y_trn);

This is where the problem appears:

Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc DecisionTree.DecisionTreeClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{3}}, AbstractVector{Multiclass{2}}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Tuple{Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}, AbstractVector{<:Finite}}
└ @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:231

And, unsurprisingly, when fitting the model:

fit!(fm_mach)

I get the error:

[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above. 
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
Stacktrace:

I'm almost sure the error comes from how the data types are specified, but I can't find a solution.

machine-learning julia
1 Answer

I can reproduce your problem using the Titanic dataset from OpenML, loaded through MLJ:

using MLJ
import DataFrames as DF
import DataFramesMeta as DFM

table = OpenML.load(42638)
titanic = DF.DataFrame(table)  # convert the loaded table so the DataFrames calls below work

Then clean it up a bit to get exactly the same dataset you are using:

titanic =  DF.dropmissing(titanic);
DF.rename!(titanic, "pclass"=>"class")
titanic = titanic[:,[:class,:sex,:survived,:age]] # select only the fields you are using

titanic = DFM.@transform(titanic, 
    :class=categorical(:class), 
    :sex=categorical(:sex),  
    :survived=categorical(:survived)
    );
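Before fitting, it can help to see which columns a given model will actually accept. The sketch below is only an illustration: it uses a small made-up table (`toy`) in place of the real Titanic data, assumes MLJDecisionTreeInterface is installed in the active environment, and hard-codes the accepted element scitypes from the `fit_data_scitype(model)` line in the warning shown in the question.

```julia
using MLJ
import DataFrames as DF

# Toy stand-in for the Titanic table (values invented for illustration)
toy = DF.DataFrame(
    class    = [3, 1, 3, 2, 1, 2],
    sex      = ["male", "female", "female", "male", "female", "male"],
    age      = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    survived = ["N", "Y", "Y", "N", "Y", "N"],
)

# Same situation as in the question: unordered categorical columns
toy = coerce(toy, :class => Multiclass, :sex => Multiclass, :survived => Multiclass)
y, X = unpack(toy, ==(:survived))

Tree = @load DecisionTreeClassifier pkg=DecisionTree
model = Tree()

# The check MLJ performs on the features when the machine is built
compatible = scitype(X) <: input_scitype(model)

# Element scitypes accepted by the model, copied from the warning text
accepted = Union{Continuous, Count, OrderedFactor}

# Names of the columns whose scitype the model does not accept
sch = schema(X)
bad = [name for (name, st) in zip(sch.names, sch.scitypes) if !(st <: accepted)]
```

Here `compatible` tells you whether the warning will be raised, and `bad` lists the columns that need to be coerced or dropped.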

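The problem is that the DecisionTreeClassifier model from the DecisionTree package is very efficient (fast!), but it only accepts features that are Continuous, Count, or OrderedFactor. Your class and sex columns are unordered (Multiclass), hence the warning and the error.

In this case, you may be able to coerce class into an ordered field. Another approach is to use the DecisionTreeClassifier model from BetaML, which, at the cost of being a little slower, accepts any type of input, including missing values (so there is no need to drop them or to restrict yourself to a few fields; the original Titanic dataset has more fields):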

mod = @load DecisionTreeClassifier pkg = "BetaML" ;
fm = mod() ;
fm_mach = machine(fm, X_trn, y_trn);
fit!(fm_mach)

yhat_trn = mode.(predict(fm_mach , X_trn))
accuracy(y_trn,yhat_trn) # 0.91

yhat_tst = mode.(predict(fm_mach , X_tst))
accuracy(y_tst,yhat_tst) # 0.78
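The first option mentioned above, coercing the unordered features to ordered factors so that the DecisionTree model accepts them, might look like the following sketch. It is not taken from the question's session: the table `toy` is made up for illustration, MLJDecisionTreeInterface is assumed to be installed, and the levels get their default sort order (1 < 2 < 3 for class, alphabetical for sex).

```julia
using MLJ
import DataFrames as DF

# Toy stand-in for the Titanic table (values invented for illustration)
toy = DF.DataFrame(
    class    = [3, 1, 3, 2, 1, 2],
    sex      = ["male", "female", "female", "male", "female", "male"],
    age      = [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    survived = ["N", "Y", "Y", "N", "Y", "N"],
)

# Features become OrderedFactor; the target can stay an unordered Multiclass
toy = coerce(toy,
    :class    => OrderedFactor,
    :sex      => OrderedFactor,
    :survived => Multiclass)

y, X = unpack(toy, ==(:survived))

Tree = @load DecisionTreeClassifier pkg=DecisionTree
model = Tree()

# The features now satisfy the model's declared input scitype
features_ok = scitype(X) <: input_scitype(model)

mach = machine(model, X, y)
fit!(mach, verbosity = 0)
yhat = predict_mode(mach, X)   # point predictions on the training rows
```

Treating a column as ordered means the tree splits on thresholds along that order. That is harmless for a two-level column such as sex and natural for passenger class, but it is a modeling choice for categories with no meaningful order.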

Note that there is a good tutorial on fitting the Titanic dataset with decision trees and MLJ here: https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8
