Streaming LightGBM dataset construction freezes during training


I've been trying to build a LightGBM Dataset in Python in streaming fashion, using a reference dataset (call it ref_dataset). I'm not sure of the intended way to do this; it seems to involve calling non-public methods on the Dataset class.

Here is what I've tried:

import lightgbm
import numpy as np
import pyarrow

from lightgbm.basic import _LIB, _safe_call

label_column = "label"
weight_column = "weight"
ref_dataset = lightgbm.Dataset(
    sample_df.drop(columns=[label_column, weight_column]),  # features only
    label=sample_df[label_column],
    weight=sample_df[weight_column],
    params=config,
    **(ref_dataset_kwargs or {}),
)
ref_dataset.construct()
temp_dataset = lightgbm.Dataset(None, reference=ref_dataset, params=ref_dataset.get_params())
# train_filenames_and_part_infos is just a list of tuple[filename, part_info_dict]
estimated_num_rows = sum(
    part_info["num_rows"] for _, part_info in train_filenames_and_part_infos
)
temp_dataset._init_from_ref_dataset(estimated_num_rows, ref_dataset._handle)

weights_list = []
labels_list = []
# Simplified: my actual code is more complicated, but this is essentially what it does
for filename, _ in train_filenames_and_part_infos:
    tbl: pyarrow.Table = load_from_file(filename)
    labels = tbl[label_column].to_pandas().to_numpy()
    weights = tbl[weight_column].to_pandas().to_numpy()

    labels_list.append(labels)
    weights_list.append(weights)
    tbl = tbl.drop_columns([label_column, weight_column])
    np_array: np.ndarray = tbl.to_pandas().to_numpy()
    if temp_dataset._start_row + np_array.shape[0] > temp_dataset.num_data():
        raise RuntimeError("Dataset is too small to fit the data")
    temp_dataset._push_rows(np_array)

all_weights = np.concatenate(weights_list)
all_labels = np.concatenate(labels_list)
actual_length = all_weights.shape[0]
# Unfortunately, the estimate is not exact for various reasons
extra_zeros_features = np.zeros(
    (estimated_num_rows - actual_length, temp_dataset.num_feature()), dtype=np.float32
)
temp_dataset._push_rows(extra_zeros_features)
_safe_call(_LIB.LGBM_DatasetMarkFinished(temp_dataset._handle))
extra_zeros = np.zeros(estimated_num_rows - actual_length, dtype=np.float32)
temp_dataset.set_weight(np.concatenate([all_weights, extra_zeros]))
temp_dataset.set_label(np.concatenate([all_labels, extra_zeros]))

lightgbm.train(
    params=config, # includes network parameters for distributed voting parallel training
    train_set=temp_dataset,
    num_boost_round=100,
    valid_sets=valid_sets, # initialized somewhere else
    valid_names=valid_names, # initialized somewhere else
    init_model=starting_model, # not really necessary
    **lightgbm_train_kwargs, # empty
)
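For context, the "network parameters" mentioned in the comment are LightGBM's distributed-training parameters. A minimal sketch of what config might contain for the voting-parallel run shown in the logs below (the host list is a placeholder; only tree_learner, num_machines, local_listen_port, and machines are the relevant keys here):

config = {
    # ... model hyperparameters elided ...
    "tree_learner": "voting",                # voting parallel tree learner
    "num_machines": 9,                       # matches "total number of machines: 9" below
    "local_listen_port": 50627,              # matches "Trying to bind port 50627" below
    "machines": "host0:50627,host1:50627",   # placeholder peer list, one entry per worker
}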

Unfortunately, when I run this code I get the following console output (some lines may be out of order, since I'm running this distributed and the logs are aggregated; I've lightly edited out noisy lines):

[LightGBM] [Info] Total Bins 137618
[LightGBM] [Info] Trying to bind port 50627...
[LightGBM] [Info] Binding port 50627 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Info] Number of data points in the train set: 3934363, number of used features: 1382
[LightGBM] [Info] Connected to rank 0
[LightGBM] [Info] Connected to rank 1
[LightGBM] [Info] Connected to rank 2
[LightGBM] [Info] Connected to rank 3
[LightGBM] [Info] Connected to rank 4
[LightGBM] [Info] Connected to rank 5
[LightGBM] [Info] Connected to rank 6
[LightGBM] [Info] Connected to rank 8
[LightGBM] [Info] Local rank: 7, total number of machines: 9
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 5.318313 seconds.
[LightGBM] [Info] Start training from score -0.000000

Then it just sits there, with both CPU and network idle. It made no progress over several hours, and I've checked all of the ranks. Am I doing something wrong? How else can I construct the dataset from the given sample?

More information: inspecting a stack trace of the idle Python process shows the code is stuck at:

update (lightgbm/basic.py:3891)
train (lightgbm/engine.py:276)
... my code ...
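One way to obtain such a dump from a live process (not necessarily how it was captured here) is Python's standard faulthandler module, registered before training starts:

import faulthandler
import signal

# Register SIGUSR1 so that `kill -USR1 <pid>` prints every thread's stack
# to stderr without terminating the (possibly hung) process. POSIX-only:
# faulthandler.register() is not available on Windows.
faulthandler.register(signal.SIGUSR1, all_threads=True)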

For the LightGBM version I'm using (4.3.0), this corresponds to the code:

_safe_call(_LIB.LGBM_BoosterUpdateOneIter(
    self._handle,
    ctypes.byref(is_finished)))

Another update: the number of bins appears to differ across workers; some report 137608, 137612, 137616. Is that a problem?
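One way to see where the binning diverges (a sketch that leans on the non-public Dataset._dump_text helper, in the same spirit as the private calls above; rank is a hypothetical variable identifying the worker):

# Dump each worker's constructed dataset, including its bin mappers, to a
# text file; diffing the files across machines shows which features binned
# differently. `rank` is hypothetical and stands for this worker's id.
temp_dataset._dump_text(f"dataset_rank_{rank}.txt")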

python machine-learning lightgbm
1 Answer

It turns out I was accidentally passing a different sample_df to each worker. The reference dataset determines the feature binning, so if it doesn't match across workers (which is what the differing "Total Bins" counts were hinting at), the workers' histograms never line up during network synchronization and training freezes.
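A cheap guard against this failure mode (a sketch, assuming sample_df is a pandas DataFrame as in the question; sample_fingerprint is a made-up helper) is to hash the sample on every worker before constructing the reference dataset and compare the digests:

import hashlib

import pandas as pd

def sample_fingerprint(df: pd.DataFrame) -> str:
    # Sort columns so the digest is robust to column-order differences,
    # then hash the row-wise content.
    canonical = df.sort_index(axis=1)
    row_hashes = pd.util.hash_pandas_object(canonical, index=False)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()

# Log this on every worker; any mismatch means the reference datasets
# (and therefore the bin mappers) will disagree.
print(sample_fingerprint(sample_df))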
