I'm new to Julia and I'm trying to fit a simple classification tree.
Package imports and environment activation:
using Pkg
Pkg.activate(".")
using CSV
using DataFrames
using Random
using Downloads
using ARFFFiles
using ScientificTypes
using DataFramesMeta
using DynamicPipe
using MLJ
using MLJDecisionTreeInterface
Data:
titanic_reader = CSV.File("/home/andrea/dev/julia/titanic.csv"; header = 1);
titanic = DataFrame(titanic_reader);
# remove missing values
titanic = dropmissing(titanic);
titanic = @transform(titanic,
:class=categorical(:class),
:sex=categorical(:sex),
:survived=categorical(:survived)
);
A look at the data:
first(titanic, 3)
3×4 DataFrame
Row │ class sex age survived
│ Cat… Cat… Float64 Cat…
─────┼──────────────────────────────────
1 │ 3 male 22.0 N
2 │ 1 female 38.0 Y
3 │ 3 female 26.0 Y
Checking the data schema:
schema(titanic)
┌──────────┬───────────────┬───────────────────────────────────┐
│ names │ scitypes │ types │
├──────────┼───────────────┼───────────────────────────────────┤
│ class │ Multiclass{3} │ CategoricalValue{Int64, UInt32} │
│ sex │ Multiclass{2} │ CategoricalValue{String7, UInt32} │
│ age │ Continuous │ Float64 │
│ survived │ Multiclass{2} │ CategoricalValue{String1, UInt32} │
└──────────┴───────────────┴───────────────────────────────────┘
The schema looks fine to me.
Preparing the data for modelling:
# target and features
y, X = unpack(titanic, ==(:survived), rng = 123);
# partition into training & test sets
(X_trn, X_tst), (y_trn, y_tst) = partition((X, y), 0.75, multi=true, rng=123);
Setting up the model:
# model
mod = @load DecisionTreeClassifier pkg = "DecisionTree";
fm = mod();
fm_mach = machine(fm, X_trn, y_trn);
Here is where the problem shows up:
Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│
│ Run `@doc DecisionTree.DecisionTreeClassifier` to learn more about your model's requirements.
│
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`. Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│
│ In general, data in `machine(model, data...)` is expected to satisfy
│
│ scitype(data) <: MLJ.fit_data_scitype(model)
│
│ In the present case:
│
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{3}}, AbstractVector{Multiclass{2}}}}, AbstractVector{Multiclass{2}}}
│
│ fit_data_scitype(model) = Tuple{Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}, AbstractVector{<:Finite}}
└ @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:231
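In other words, the warning says the scitypes of my features (Multiclass for class and sex) are not among those the model accepts. The compatibility check the warning describes can be run directly; a minimal sketch, assuming the X_trn, y_trn and fm defined above are in scope:

```julia
using MLJ

# Reproduce the check from the warning: the scitype of the (features, target)
# tuple must be a subtype of what the model declares it can fit.
scitype((X_trn, y_trn)) <: MLJ.fit_data_scitype(fm)  # false -> incompatible
```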
Then, when actually fitting the model:
fit!(fm_mach)
I get the error:
[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above.
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
Stacktrace:
I'm almost sure the error depends on the data type specification, but I haven't been able to find a solution.
I can reproduce your problem using the Titanic dataset from OpenML via MLJ:
using MLJ
import DataFrames as DF
import DataFramesMeta as DFM
table = OpenML.load(42638)
Then a bit of cleaning to get exactly the same dataset you are using:
titanic = DF.DataFrame(table);
titanic = DF.dropmissing(titanic);
DF.rename!(titanic, "pclass"=>"class")
titanic = titanic[:,[:class,:sex,:survived,:age]] # select only the fields you are using
titanic = DFM.@transform(titanic,
:class=categorical(:class),
:sex=categorical(:sex),
:survived=categorical(:survived)
);
The problem now is that the DecisionTreeClassifier model from the DecisionTree package is very efficient (fast!), but it accepts only ordered data: its features must be Continuous, Count, or OrderedFactor, while class and sex here are unordered Multiclass. In this case you could perhaps coerce class (and the other categorical features) to ordered factors. An alternative is to use the DecisionTreeClassifier model from the BetaML package which, at the cost of being a bit slower, accepts any kind of input, including missing values (so there is no need to drop them or to keep only a few fields; the original Titanic dataset has many more). After unpacking and partitioning as you did:
mod = @load DecisionTreeClassifier pkg = "BetaML";
fm = mod();
fm_mach = machine(fm, X_trn, y_trn);
fit!(fm_mach)
yhat_trn = mode.(predict(fm_mach, X_trn))
accuracy(y_trn, yhat_trn) # 0.91
yhat_tst = mode.(predict(fm_mach, X_tst))
accuracy(y_tst, yhat_tst) # 0.78
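Alternatively, if you prefer to stay with the faster DecisionTree model, you can coerce the unordered features with MLJ's coerce before unpacking. A sketch, assuming the same titanic DataFrame and column names as above:

```julia
using MLJ  # coerce comes from ScientificTypes, re-exported by MLJ

# Make the unordered categorical features acceptable to DecisionTree by
# coercing them to OrderedFactor (the Multiclass target is fine as-is):
titanic = coerce(titanic, :class => OrderedFactor, :sex => OrderedFactor)

# Unpack and partition as before:
y, X = unpack(titanic, ==(:survived); rng = 123)
(X_trn, X_tst), (y_trn, y_tst) = partition((X, y), 0.75, multi = true, rng = 123)

Tree = @load DecisionTreeClassifier pkg = "DecisionTree"
mach = machine(Tree(), X_trn, y_trn)
fit!(mach)
```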
Note that there is a nice tutorial on fitting the Titanic dataset with decision trees and MLJ here: https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8