Data types in Julia and MLJ

Question

I am new to Julia and am trying to fit a simple classification tree.

Package imports and environment activation:

using Pkg 
Pkg.activate(".")

using CSV
using DataFrames
using Random
using Downloads
using ARFFFiles
using ScientificTypes
using DataFramesMeta
using DynamicPipe
using MLJ
using MLJDecisionTreeInterface

The data:

titanic_reader  = CSV.File("/home/andrea/dev/julia/titanic.csv"; header = 1);
titanic = DataFrame(titanic_reader);

# remove missing values
titanic =  dropmissing(titanic);


titanic = @transform(titanic, 
    :class=categorical(:class), 
    :sex=categorical(:sex),  
    :survived=categorical(:survived)
    );

Inspecting the data:

first(titanic, 3)

3×4 DataFrame
 Row │ class  sex     age      survived 
     │ Cat…   Cat…    Float64  Cat…     
─────┼──────────────────────────────────
   1 │ 3      male       22.0  N
   2 │ 1      female     38.0  Y
   3 │ 3      female     26.0  Y

Checking the schema:

schema(titanic)


┌──────────┬───────────────┬───────────────────────────────────┐
│ names    │ scitypes      │ types                             │
├──────────┼───────────────┼───────────────────────────────────┤
│ class    │ Multiclass{3} │ CategoricalValue{Int64, UInt32}   │
│ sex      │ Multiclass{2} │ CategoricalValue{String7, UInt32} │
│ age      │ Continuous    │ Float64                           │
│ survived │ Multiclass{2} │ CategoricalValue{String1, UInt32} │
└──────────┴───────────────┴───────────────────────────────────┘

The schema looks fine to me.

Preparing the data for modeling:

# target and features
y, X = unpack(titanic, ==(:survived), rng = 123);

# partition into training & test sets
(X_trn, X_tst), (y_trn, y_tst)  = partition((X, y), 0.75, multi=true,  rng=123);

Fitting the model:

# model
mod = @load DecisionTreeClassifier pkg = "DecisionTree" ;
fm = mod() ;
fm_mach = machine(fm, X_trn, y_trn);

This is where the problem shows up:

Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc DecisionTree.DecisionTreeClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{3}}, AbstractVector{Multiclass{2}}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Tuple{Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}, AbstractVector{<:Finite}}
└ @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:231

And, predictably, when fitting the model:

fit!(fm_mach)

I get this error:

[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above. 
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
Stacktrace:

I am almost certain the error comes from how the data types are specified, but I have not been able to find a solution.

Answer 1

I can reproduce your problem using the Titanic dataset from OpenML, loaded through MLJ:

using MLJ
import DataFrames as DF
import DataFramesMeta as DFM

table = OpenML.load(42638)
titanic = DF.DataFrame(table) # convert the table to a DataFrame

Then clean it up a little to obtain the same dataset you are using:

titanic =  DF.dropmissing(titanic);
DF.rename!(titanic, "pclass"=>"class")
titanic = titanic[:,[:class,:sex,:survived,:age]] # select only the fields you are using

titanic = DFM.@transform(titanic, 
    :class=categorical(:class), 
    :sex=categorical(:sex),  
    :survived=categorical(:survived)
    );

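The problem is that the DecisionTreeClassifier model from the DecisionTree package is very efficient (fast!), but it only accepts ordered data.

In this case you could perhaps coerce class to an ordered field. Another approach is to use the DecisionTreeClassifier model from the BetaML package, which is a bit slower but accepts any kind of input, including missing values (so there is no need to drop them or to keep only a few fields; the original Titanic dataset has many more fields):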

mod = @load DecisionTreeClassifier pkg = "BetaML" ;
fm = mod() ;
fm_mach = machine(fm, X_trn, y_trn);
fit!(fm_mach)

yhat_trn = mode.(predict(fm_mach , X_trn))
accuracy(y_trn,yhat_trn) # 0.91

yhat_tst = mode.(predict(fm_mach , X_tst))
accuracy(y_tst,yhat_tst) # 0.78
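
For the first option (making the categorical features ordered), here is a minimal sketch using MLJ's coerce. It runs on a small made-up table with the same column names, not the real data, so the values are purely illustrative. Treating sex as ordered is my own assumption: the warning in the question lists only Continuous, Count and OrderedFactor as accepted feature scitypes, so sex would presumably need the same treatment as class, and for a two-level feature the ordering does not change which splits are possible.

```julia
using MLJ
import DataFrames as DF

# Toy stand-in for the cleaned Titanic table (made-up rows, illustration only).
titanic = DF.DataFrame(
    class    = [3, 1, 3, 1, 2, 2],
    sex      = ["male", "female", "female", "male", "female", "male"],
    age      = [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    survived = ["N", "Y", "Y", "N", "Y", "N"],
)

# Declare the categorical features as ordered factors;
# the target can stay an unordered Multiclass.
titanic = coerce(titanic,
    :class    => OrderedFactor,
    :sex      => OrderedFactor,
    :survived => Multiclass,
)

# class and sex should now be reported as OrderedFactor{...}.
schema(titanic)

# Split into target and features as in the question.
y, X = unpack(titanic, ==(:survived))
```

With the features coerced this way, building the machine with the DecisionTree model as in the question should no longer trigger the scitype warning, since every feature column is then either Continuous or OrderedFactor.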

Note that there is a nice tutorial on fitting the Titanic dataset with decision trees and MLJ here: https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8
