XGBoost AFT survival model with an external-memory iterator

Question

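Background: I have written an XGBoost iterator for batched training, as shown in the linked example. Now I want to train an AFT model from the xgboost library. The problem is that, with an XGBoost DMatrix, we need to call set_float_info to set the survival censoring intervals. For example: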

dtrain.set_float_info('label_lower_bound', y_lower_bound[train_index])
dtrain.set_float_info('label_upper_bound', y_upper_bound[train_index])

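Below is my edited code (I can't attach everything, but this is the gist of the problem). I have the censoring time data in df, but I don't know how to "attach" it to Xy_train.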

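import os
from typing import Callable, Tuple

import pandas as pd
import xgboost


class BatchedParquetIterator(xgboost.DataIter):
  def __init__(self, file_paths):
    # ...
    self._file_paths = file_paths  # one Parquet file per batch
    self._it = 0  # index of the next file to load
    super().__init__(cache_prefix=os.path.join(".", "cache"))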

  def next(self, input_data: Callable):
    """Advance the iterator by 1 step and pass the data to XGBoost.  This function is
    called by XGBoost during the construction of ``DMatrix``
    """
    
    if self._it == len(self._file_paths):
      return 0  # return 0 to let XGBoost know this is the end of iteration
    
    df = pd.read_parquet(self._file_paths[self._it])
    X, y = self._preprocess(df)
    
    input_data(data=X, label=y)
    self._it += 1

    return 1  # Return 1 to let XGBoost know we haven't seen all the files yet.

  def reset(self):
    """Reset the iterator to its beginning"""
    self._it = 0

  def _preprocess(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # ...
    return X, y


parquet_iterator_train = BatchedParquetIterator(batches)
Xy_train = xgboost.DMatrix(parquet_iterator_train)

Answer 1

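It turns out this is easy. The documentation states:

input_data is a function passed in by XGBoost that has exactly the same signature as DMatrix.

Interestingly, the lower and upper bounds can be passed not only via set_float_info (as shown in the AFT tutorial) but also via the DMatrix constructor (see the documentation).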

In short, only one line of the code above needs to change:

class BatchedParquetIterator(xgboost.DataIter):
  # ...
  def next(self, input_data: Callable):
    # ...
    input_data(data=X, label=y, label_lower_bound=llb, label_upper_bound=lub)
    # ...

where llb and lub are arrays defining the interval under consideration.
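As a minimal sketch of how llb and lub might be derived from each batch, assuming right-censored data: the column names time and event and the helper censoring_bounds are hypothetical, not part of the original code. Following the interval convention in the AFT tutorial, an observed event has equal lower and upper bounds, while a right-censored row has an upper bound of +inf.

```python
import numpy as np
import pandas as pd


def censoring_bounds(df, time_col="time", event_col="event"):
    """Return (lower, upper) label bounds for right-censored rows.

    The column names are assumptions for illustration; adapt them
    to the actual schema of the Parquet files.
    """
    times = df[time_col].to_numpy(dtype=np.float32)
    observed = df[event_col].to_numpy(dtype=bool)
    lower = times.copy()
    # Observed event: upper == lower. Right-censored: upper is +inf.
    upper = np.where(observed, times, np.inf).astype(np.float32)
    return lower, upper


# Toy batch standing in for one Parquet file.
df = pd.DataFrame({"time": [5.0, 12.0, 3.5], "event": [1, 0, 1]})
llb, lub = censoring_bounds(df)
```

Inside next, the two arrays would then be handed to input_data as label_lower_bound and label_upper_bound, as in the one-line change above.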
