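I am trying to rewrite the source code of scikit-learn's permutation importance so that it works with Polars DataFrames and with clusters of features: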
import polars as pl
import polars.selectors as cs
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train_polars = pl.DataFrame(X_train, schema=feature_names)
X_test_polars = pl.DataFrame(X_test, schema=feature_names)
# pl.Series takes (name, values); it has no `schema` argument
y_train_polars = pl.Series("target", y_train)
y_test_polars = pl.Series("target", y_test)
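To get the feature importance of a cluster of features, we need to permute all features in the cluster simultaneously and then pass the result to the scorer to compare against the baseline score.
However, when handling feature clusters, I am struggling to replace multiple Polars DataFrame columns at once: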
from sklearn.utils import check_random_state
random_state = check_random_state(42)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
X_train_permuted = X_train_polars.clone()
shuffle_arr = np.array(X_train_permuted[:, ["feature_0", "feature_1"]])
random_state.shuffle(shuffle_arr)
X_train_permuted.replace_column(  # This operation is in place
    0,
    pl.Series(name="feature_0", values=shuffle_arr),
)
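Normally shuffle_arr has shape (n_samples,), so it can easily replace the relevant column of the Polars DataFrame via polars.DataFrame.replace_column(). In this case, shuffle_arr is two-dimensional, with shape (n_samples, n_features in a cluster). What is an efficient way to replace the relevant columns?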
TL;DR: pass one pl.Series per shuffled column to with_columns:
pl_series = [
    pl.Series(name, values)
    for name, values in zip(features, shuffle_arr.T)
]
X_train_permuted = X_train_permuted.with_columns(pl_series)
Let's walk through a simple example, starting from a small X_train_permuted:
import polars as pl
import numpy as np
np.random.seed(0)
data = {f'feature_{i}': np.random.rand(4) for i in range(0,3)}
X_train_permuted = pl.DataFrame(data)
X_train_permuted
shape: (4, 3)
┌───────────┬───────────┬───────────┐
│ feature_0 ┆ feature_1 ┆ feature_2 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╡
│ 0.548814 ┆ 0.423655 ┆ 0.963663 │
│ 0.715189 ┆ 0.645894 ┆ 0.383442 │
│ 0.602763 ┆ 0.437587 ┆ 0.791725 │
│ 0.544883 ┆ 0.891773 ┆ 0.528895 │
└───────────┴───────────┴───────────┘
Shuffle feature_0 and feature_1
Use a list to keep track of the features you are shuffling:
features = ["feature_0", "feature_1"]
shuffle_arr = np.array(X_train_permuted[:, features])
from sklearn.utils import check_random_state
random_state = check_random_state(42)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
random_state.shuffle(shuffle_arr)
shuffle_arr
array([[0.71518937, 0.64589411],
[0.60276338, 0.43758721],
[0.5488135 , 0.4236548 ],
[0.54488318, 0.891773 ]])
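One detail worth checking at this step: shuffling a 2-D array in place reorders whole rows along the first axis, so the values of the cluster's features stay paired within each row. The snippet below is a small sketch of that behaviour on made-up data (the array `arr` and the `10 * x` relationship are illustrative only, not part of the question's dataset).

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy cluster of two features where the second column is 10x the first.
arr = np.array([[1, 10], [2, 20], [3, 30], [4, 40]])

shuffled = arr.copy()
rng.shuffle(shuffled)  # in place, reorders rows (axis 0) only

# Each row still pairs x with 10 * x, so rows were moved as a unit.
rows_intact = bool(np.all(shuffled[:, 1] == 10 * shuffled[:, 0]))

# The set of rows is unchanged; only their order differs.
same_rows = sorted(map(tuple, shuffled.tolist())) == sorted(map(tuple, arr.tolist()))
```

This is what makes the approach a joint permutation of the cluster rather than an independent shuffle of each column.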
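Replace the associated columns in X_train_permuted with the shuffle_arr values
Pass a list of pl.Series (here: pl_series) to pl.DataFrame.with_columns, building one pl.Series per shuffled feature with zip. Make sure to transpose shuffle_arr (see .T) so that iterating over it yields columns rather than rows.
pl_series = [
    pl.Series(name, values)
    for name, values in zip(features, shuffle_arr.T)
]
X_train_permuted = X_train_permuted.with_columns(pl_series)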
X_train_permuted
shape: (4, 3)
┌───────────┬───────────┬───────────┐
│ feature_0 ┆ feature_1 ┆ feature_2 │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═══════════╡
│ 0.715189 ┆ 0.645894 ┆ 0.963663 │
│ 0.602763 ┆ 0.437587 ┆ 0.383442 │
│ 0.548814 ┆ 0.423655 ┆ 0.791725 │
│ 0.544883 ┆ 0.891773 ┆ 0.528895 │
└───────────┴───────────┴───────────┘