Scikit-Learn permutation and updating a Polars DataFrame

Question

I am trying to rewrite the source code of scikit-learn's permutation importance to achieve:

Compatibility with Polars
Compatibility with feature clusters

import polars as pl
import polars.selectors as cs
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    random_state=42,
    shuffle=False,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train_polars = pl.DataFrame(X_train, schema=feature_names)
X_test_polars = pl.DataFrame(X_test, schema=feature_names)
y_train_polars = pl.Series("target", y_train)  # pl.Series takes a name, not a schema
y_test_polars = pl.Series("target", y_test)

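To get the feature importance of a cluster of features, we need to permute all features in the cluster simultaneously and then pass the result to the scorer to compare against the baseline score.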
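The idea in the paragraph above can be sketched on plain NumPy arrays. This is only an illustration under stated assumptions: `grouped_permutation_importance` and `toy_score` are hypothetical names, not scikit-learn API, and `toy_score` stands in for a fitted model's scorer.

```python
import numpy as np

def grouped_permutation_importance(score_fn, X, y, group, n_repeats=5, seed=0):
    """Mean drop in score when the columns in `group` are permuted together."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # One shared row order for the whole group keeps the relationships
        # inside the cluster intact while breaking the link to the target.
        order = rng.permutation(X.shape[0])
        X_perm[:, group] = X[order][:, group]
        drops.append(baseline - score_fn(X_perm, y))
    return float(np.mean(drops))

# Toy data: the target depends only on columns 0 and 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def toy_score(X, y):
    # Hypothetical stand-in for a model scorer: accuracy of a fixed rule.
    return float(np.mean((X[:, 0] + X[:, 1] > 0).astype(int) == y))

informative = grouped_permutation_importance(toy_score, X, y, group=[0, 1])
noise = grouped_permutation_importance(toy_score, X, y, group=[2, 3])
```

Permuting the informative cluster lowers the score, while permuting the unused columns leaves it unchanged, which is the comparison against the baseline that the paragraph describes.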

When handling feature clusters, however, I am struggling to replace multiple Polars DataFrame columns:

from sklearn.utils import check_random_state
random_state = check_random_state(42)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)

X_train_permuted = X_train_polars.clone()
shuffle_arr = np.array(X_train_permuted[:, ["feature_0", "feature_1"]])

random_state.shuffle(shuffle_arr)
X_train_permuted.replace_column( # This operation is in place
                0, 
                pl.Series(name="feature_0", values=shuffle_arr))

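Normally `shuffle_arr` has shape (n_samples,), so the relevant column in the Polars DataFrame can easily be replaced with `polars.DataFrame.replace_column()`. In this case, `shuffle_arr` is two-dimensional, with shape (n_samples, n_features in a cluster). What would be an efficient way to replace the relevant columns?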

Answer 1

TL;DR

pl_series = [pl.Series(name, values) 
             for name, values in zip(features, shuffle_arr.T)]

X_train_permuted = (
    X_train_permuted.with_columns(
        pl_series
    )
)

Let's walk through a simple example, starting with a small `X_train_permuted`:

import polars as pl
import numpy as np

np.random.seed(0)

data = {f'feature_{i}': np.random.rand(4) for i in range(0,3)}

X_train_permuted = pl.DataFrame(data)

X_train_permuted

shape: (4, 3)
┌───────────┬───────────┬───────────┐
│ feature_0 ┆ feature_1 ┆ feature_2 │
│ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╡
│ 0.548814  ┆ 0.423655  ┆ 0.963663  │
│ 0.715189  ┆ 0.645894  ┆ 0.383442  │
│ 0.602763  ┆ 0.437587  ┆ 0.791725  │
│ 0.544883  ┆ 0.891773  ┆ 0.528895  │
└───────────┴───────────┴───────────┘

Shuffle `feature_0` and `feature_1`

Use a list to keep track of the features you are shuffling:

features = ["feature_0", "feature_1"]

shuffle_arr = np.array(X_train_permuted[:, features])

from sklearn.utils import check_random_state

random_state = check_random_state(42)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)

random_state.shuffle(shuffle_arr)

shuffle_arr

array([[0.71518937, 0.64589411],
       [0.60276338, 0.43758721],
       [0.5488135 , 0.4236548 ],
       [0.54488318, 0.891773  ]])

Replace the associated columns in `X_train_permuted` with the values from `shuffle_arr`

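Use `pl.DataFrame.with_columns`. Pass it a list (here: `pl_series`) containing one `pl.Series` for each shuffled feature, built with a list comprehension and `zip`. Make sure to transpose `shuffle_arr` (see `.T`) so that you iterate over its columns rather than its rows.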

pl_series = [pl.Series(name, values) 
             for name, values in zip(features, shuffle_arr.T)]

X_train_permuted = (
    X_train_permuted.with_columns(
        pl_series
    )
)

X_train_permuted

shape: (4, 3)
┌───────────┬───────────┬───────────┐
│ feature_0 ┆ feature_1 ┆ feature_2 │
│ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╡
│ 0.715189  ┆ 0.645894  ┆ 0.963663  │
│ 0.602763  ┆ 0.437587  ┆ 0.383442  │
│ 0.548814  ┆ 0.423655  ┆ 0.791725  │
│ 0.544883  ┆ 0.891773  ┆ 0.528895  │
└───────────┴───────────┴───────────┘
