How do I tune hyperparameters of the dataset generation process?

Question

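I have a dataset that is the result of many "hyperparameters", in the sense that the dataset itself depends on particular parameters being set to particular values. I would like to optimize these parameters as part of the ML model (for example with GridSearch or something similar) and expose these hidden hyperparameters as values that can be searched over. In this case I am using sklearn pipelines and param_grid.

Most tools focus on hyperparameter tuning only after the dataset has been generated, and only change things related to how the pipeline is defined. Ideally, though, I would like to change something like the low-pass cutoff frequency of one stage of the data generation and see whether that yields a better-performing model.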

Is there any tool that treats dataset generation as part of the pipeline, so that it can be used to actually set these hidden hyperparameters?

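The only thing I have come across that might help is sklearn's FunctionTransformer. It seems to allow arbitrary functions, but I am worried about repeated work: if hyperparameter A belongs to the regressor, does dataset generation run again every time A changes? Or does sklearn know how to cache common steps?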

I am just learning about all of this, so any insight is appreciated!

Answer 1

You can use a custom transformer and take advantage of the memory parameter of scikit-learn's Pipeline to cache transformer outputs.

Custom dataset generator transformer

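from sklearn.base import BaseEstimator, TransformerMixin

class CustomDatasetGenerator(BaseEstimator, TransformerMixin):
    def __init__(self, lowpass_cutoff=1.0, another_param=2.0):
        # Add further generation parameters as keyword arguments here;
        # store each one under an attribute of the same name so that
        # get_params/set_params (and therefore GridSearchCV) can see it
        self.lowpass_cutoff = lowpass_cutoff
        self.another_param = another_param

    def fit(self, X, y=None):
        # Nothing to learn here
        return self

    def transform(self, X=None, y=None):
        # Generate your dataset based on self.lowpass_cutoff, self.another_param, ...
        # your_data_generation_function is a placeholder for your own code.
        # It must return one row per row of X, because GridSearchCV passes
        # fold subsets and the output has to stay aligned with y
        X_new = your_data_generation_function(self.lowpass_cutoff, self.another_param)
        return X_new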
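
To make the skeleton above concrete, here is a minimal sketch of a transformer whose searchable parameter is a low-pass cutoff. It is not the asker's actual generation code. It assumes the raw signals are passed in as X (one signal per row) so that rows stay aligned with y across CV folds, and it swaps in a Butterworth filter from scipy.signal as the low-pass stage. The class name LowpassFilter and the fs and order parameters are hypothetical.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.base import BaseEstimator, TransformerMixin


class LowpassFilter(BaseEstimator, TransformerMixin):
    """Hypothetical example: low-pass filter each row of X (one raw signal per row)."""

    def __init__(self, lowpass_cutoff=10.0, fs=100.0, order=4):
        self.lowpass_cutoff = lowpass_cutoff  # cutoff frequency in Hz (searchable)
        self.fs = fs                          # sampling rate in Hz (assumed known)
        self.order = order                    # Butterworth filter order

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        b, a = butter(self.order, self.lowpass_cutoff, btype="low", fs=self.fs)
        # Zero-phase filtering along the time axis; one output row per input row
        return filtfilt(b, a, np.asarray(X, dtype=float), axis=1)


# Toy signals: a 2 Hz component plus 40 Hz interference, sampled at 100 Hz
t = np.arange(0, 4, 1 / 100.0)
slow = np.sin(2 * np.pi * 2 * t)
raw = np.tile(slow + 0.5 * np.sin(2 * np.pi * 40 * t), (5, 1))
filtered = LowpassFilter(lowpass_cutoff=10.0).fit_transform(raw)
```

Placed in a pipeline under a step name such as 'dataset_gen', the cutoff becomes searchable as 'dataset_gen__lowpass_cutoff', the same naming scheme used in the param_grid further down.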

Using memory in the pipeline

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Directory where cached transformer outputs are stored
location = 'cachedir'

pipeline = Pipeline([
    ('dataset_gen', CustomDatasetGenerator()),
    ('classifier', SVC())
], memory=location)

Tuning hyperparameters with GridSearchCV

import shutil

import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = {
    'dataset_gen__lowpass_cutoff': [0.5, 1.0, 1.5, 2.0],
    'dataset_gen__another_param': [1.0, 2.0, 3.0],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

# Placeholder dataset
X_dummy = np.empty((100, 1))  # 100 is a dummy number; replace it with however many samples you need
y_dummy = your_target_generation_function_or_placeholder()  # Depends on how you get/generate your targets

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_dummy, y_dummy)

shutil.rmtree(location)  # Clear the cache

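When you use GridSearchCV (or a similar tool) this way, it searches over the dataset generation parameters as well as the model's hyperparameters. The cache provided by the memory parameter is keyed on the transformer's parameters and its input data, so when the data generation parameters (or any other transformer parameters) do not change between grid search candidates, the cached transformed data for that training fold is reused instead of being recomputed, which saves time.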
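
The effect of the cache can be checked by counting how often the generation step actually runs. The following is a minimal sketch under stated assumptions: CountingGenerator, its scale parameter, the FIT_CALLS counter and count_generation_runs are hypothetical stand-ins for an expensive generation step, and the data is random. It runs the same small grid search once without and once with memory.

```python
import shutil
import tempfile

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

FIT_CALLS = {"count": 0}  # counts how often the generation step really runs


class CountingGenerator(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an expensive dataset generation step."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, X, y=None):
        FIT_CALLS["count"] += 1  # skipped whenever the pipeline cache is hit
        return self

    def transform(self, X):
        return np.tanh(self.scale * np.asarray(X))


def count_generation_runs(X, y, memory=None):
    """Run one grid search and return the number of generator fits."""
    FIT_CALLS["count"] = 0
    pipeline = Pipeline(
        [("dataset_gen", CountingGenerator()), ("classifier", SVC())],
        memory=memory,
    )
    param_grid = {
        "dataset_gen__scale": [0.5, 2.0],
        "classifier__C": [0.1, 1.0, 10.0],
    }
    GridSearchCV(pipeline, param_grid, cv=3).fit(X, y)
    return FIT_CALLS["count"]


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

uncached_runs = count_generation_runs(X, y)

cache_dir = tempfile.mkdtemp()
try:
    cached_runs = count_generation_runs(X, y, memory=cache_dir)
finally:
    shutil.rmtree(cache_dir)  # clear the cache

print(uncached_runs, cached_runs)
```

With this grid, the uncached search fits the generator 19 times (6 candidates × 3 folds, plus 1 refit on the full data), while the cached search should need only 7 (2 scale values × 3 folds, plus the refit). Note that Pipeline caches only the fit-time transformation of the training data; transforming the validation folds during scoring is still recomputed for every candidate.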
