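My question is about the H2O gains/lift table. I understand that the response rate is the proportion of events among the observations that fall into a group/bin. How do I get the data that falls into bin 1, bin 2, and so on? I would like to see how the key variables look in each group/bin alongside the response rate.
How are the metrics in the gains/lift table calculated (formulas), in detail?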
The formulas for the gains and lift table can be found in this file: https://github.com/h2oai/h2o-3/blob/master/h2o-core/src/main/java/hex/GainsLift.java
It shows the following:
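E = total number of events
N = number of observations
G = number of groups (10 for deciles, 20 for ventiles)
P = overall proportion of observations that are events (P = E / N)
e_i = number of events in group i, i = 1, 2, ..., G
n_i = number of observations in group i
p_i = proportion of observations in group i that are events (p_i = e_i / n_i)
groups: hard-coded to 16; if there are fewer than 16 unique probability values, the number of groups is reduced to the number of unique quantile thresholds.
cumulative_data_fraction = sum_n_i / N, where sum_n_i is the number of observations in groups 1 through i
lower_threshold = set by the quantile bins
lift = p_i / P
cumulative_lift = (sum_e_i / sum_n_i) / P, where sum_e_i is the number of events in groups 1 through i
response_rate = 100 * p_i
cumulative_response_rate = 100 * sum_e_i / sum_n_i
capture_rate = 100 * e_i / E
cumulative_capture_rate = 100 * sum_e_i / E
gain = 100 * (lift - 1)
cumulative_gain = 100 * (sum_lift - 1), where sum_lift is the cumulative lift
average_response_rate = E / N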
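To make the formulas concrete, here is a minimal pandas sketch that applies them to a toy set of labels and predicted probabilities. The function name gains_lift_table, the toy data, and the cut points are made up for illustration, and the binning is simplified (rows are sorted by probability and cut at fixed cumulative fractions), so it should not be read as a reproduction of H2O's quantile logic.

```python
import pandas as pd

def gains_lift_table(y_true, p_pred, fractions=(0.2, 0.5, 1.0)):
    """Apply the gains/lift formulas to groups cut at the given cumulative
    data fractions. Hypothetical helper, not part of the H2O API."""
    df = pd.DataFrame({"y": y_true, "p": p_pred})
    df = df.sort_values("p", ascending=False, kind="mergesort")
    N = len(df)            # number of observations
    E = df["y"].sum()      # total number of events
    P = E / N              # overall event proportion
    ends = [int(round(f * N)) for f in fractions]  # row where each group ends
    starts = [0] + ends[:-1]
    rows, sum_e, sum_n = [], 0, 0
    for group, (start, end) in enumerate(zip(starts, ends), start=1):
        chunk = df.iloc[start:end]
        n_i, e_i = len(chunk), chunk["y"].sum()
        p_i = e_i / n_i    # assumes every group is non-empty
        sum_e += e_i
        sum_n += n_i
        lift = p_i / P
        sum_lift = (sum_e / sum_n) / P
        rows.append({
            "group": group,
            "cumulative_data_fraction": sum_n / N,
            "lower_threshold": chunk["p"].min(),
            "lift": lift,
            "cumulative_lift": sum_lift,
            "response_rate": 100 * p_i,
            "cumulative_response_rate": 100 * sum_e / sum_n,
            "capture_rate": 100 * e_i / E,
            "cumulative_capture_rate": 100 * sum_e / E,
            "gain": 100 * (lift - 1),
            "cumulative_gain": 100 * (sum_lift - 1),
        })
    return pd.DataFrame(rows)

# Toy data: 10 observations, 4 of them events, already scored by some model
y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
p = [0.95, 0.90, 0.85, 0.80, 0.75, 0.60, 0.50, 0.40, 0.30, 0.20]
table = gains_lift_table(y, p)
print(table)
```

With this toy input the first group holds 2 of the 4 events in 2 rows, so its response rate is 100 and its lift is 1.0 / 0.4 = 2.5; by the last group the cumulative lift falls back to 1.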
Here is an example walkthrough using the H2O-3 Python API:
import h2o
import pandas as pd
import numpy as np
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import and split the dataset
cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
# convert response column to a factor
cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
# set the predictor names and the response column name
predictors = ["displacement","power","weight","acceleration","year"]
response = "economy_20mpg"
# split dataset
train, valid = cars.split_frame(ratios=[.7],seed=1234)
# Initialize and train a GBM
cars_gbm = H2OGradientBoostingEstimator(seed = 1234)
cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame=valid)
# Generate Gains and Lift Table
# documentation on this parameter can be found here:
# http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html?#h2o.model.H2OBinomialModel.gains_lift
gainslift = cars_gbm.gains_lift(train=False, valid=True, xval=False)
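As expected, we have 16 groups, because that is the hard-coded default behavior.
By default, the gains and lift table gives you more than just deciles or ventiles, which means you have more flexibility to pick out the percentiles you are interested in.
Let's take getting the deciles as an example. Here we can start at the 6th row, skip the 7th row, and then take the remaining rows to get our deciles (zero-based positions 5 and 7 through 15).
Since the gains and lift table is returned as a TwoDimTable, we can use the group number as the selection index.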
# show gains and lift table data type
print('H2O Gains Lift Table is of type: ', type(gainslift))
H2O Gains Lift Table is of type: <class 'h2o.two_dim_table.H2OTwoDimTable'>
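# since this table is small, convert it to a pandas DataFrame for ease of use
pandas_gl = gainslift.as_data_frame()
# set_index returns a new DataFrame, so assign the result
pandas_gl = pandas_gl.set_index('group')
# pd.np is no longer available in recent pandas; use numpy's r_ directly
gainslift_deciles = pandas_gl.iloc[np.r_[5, 7:16], :]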
gainslift_deciles
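Hard-coded row positions only work when the table has the full 16 groups. As an alternative sketch, the rows can be picked by their cumulative data fraction instead. The helper rows_nearest and the toy table below are made up for illustration; the column name cumulative_data_fraction and the 16 fractions are assumptions chosen to match the row positions used above, so check them against your own table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the converted gains/lift table (assumed fractions)
fractions = [0.01, 0.02, 0.03, 0.04, 0.05, 0.10, 0.15, 0.20,
             0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
toy_gl = pd.DataFrame({"group": range(1, 17),
                       "cumulative_data_fraction": fractions})

def rows_nearest(table, targets, col="cumulative_data_fraction"):
    """Return the positions of the rows whose cumulative fraction is
    closest to each target, without duplicates and in target order."""
    values = table[col].to_numpy()
    positions = [int(np.abs(values - t).argmin()) for t in targets]
    return list(dict.fromkeys(positions))

# Targets 0.1, 0.2, ..., 1.0 select the decile rows
decile_positions = rows_nearest(toy_gl, np.arange(1, 11) / 10)
toy_deciles = toy_gl.iloc[decile_positions]
print(decile_positions)
```

Picking the nearest row rather than testing for exact equality is a deliberate choice: on a small validation frame the reported fractions may not land exactly on 0.1, 0.2, and so on.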
What if I Want Just the Ventiles?
Those are available too, so let's do that next.
gainslift_ventiles = pandas_gl.iloc[np.r_[7, 9, 11, 13, 15], :]
gainslift_ventiles
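To see which rows fall into each group, and how key variables look per group, one possible approach is to compare each row's predicted probability with the lower_threshold column of the table. The sketch below uses toy data only: scored stands in for the validation features joined to the model's predictions (the column name p1 is an assumption about the prediction frame), and lower_thresholds stands in for the table's lower_threshold values in descending order. Ties at a threshold may be handled differently by H2O, so group counts may not match the table exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for predictions joined to features and the actual label
scored = pd.DataFrame({
    "p1": [0.95, 0.80, 0.65, 0.40, 0.30, 0.10],
    "weight": [2100, 2300, 2600, 3100, 3500, 4200],
    "actual": [1, 1, 0, 1, 0, 0],
})
# Hypothetical lower_threshold column: one value per group, descending
lower_thresholds = np.array([0.80, 0.40, 0.10])

# Group i holds rows with lower_threshold[i] <= p1 < lower_threshold[i-1].
# np.searchsorted needs ascending input, so search on the negated values.
scored["group"] = np.searchsorted(
    -lower_thresholds, -scored["p1"].to_numpy(), side="left") + 1

# Rows that fall into bin 1
bin_1 = scored[scored["group"] == 1]

# Per-group response rate next to a key variable
summary = scored.groupby("group").agg(
    n=("actual", "size"),
    response_rate=("actual", "mean"),
    mean_weight=("weight", "mean"),
)
print(summary)
```

A row scored below the smallest threshold would be assigned to group G + 1 here; with a real table that should not happen if the last lower_threshold is the minimum score.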