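I have a column whose prefix and suffix can change depending on some function arguments, but part of the column name is always the same. I need to rename that column to something that is easy to reference across different workflows. I'm looking for the fastest way to find the column and rename it.
I'm currently using a for loop to check whether the fixed part of the string appears in each column name, but I don't think this is the most efficient way to do a regex-filter-based rename.
This is what I came up with: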
import polars as pl

data = pl.DataFrame({
    "foo": [1, 2, 3, 4, 5],
    "bar": [5, 4, 3, 2, 1],
    "std_volatility_pct_21D": [0.1, 0.2, 0.15, 0.18, 0.16]
})

# Check every column name for the fixed fragment and rename the match
for col in data.columns:
    if "volatility_pct" in col:
        new_data = data.rename({col: "realized_volatility"})
import polars as pl
import polars.selectors as cs

data = pl.DataFrame(
    {
        "foo": [1, 2, 3, 4, 5],
        "bar": [5, 4, 3, 2, 1],
        "std_volatility_pct_21D": [0.1, 0.2, 0.15, 0.18, 0.16],
    }
)

# 1: plain Python loop, rename the first matching column
def rename_volatility_column(data):
    for col in data.columns:
        if "volatility_pct" in col:
            return data.rename({col: "realized_volatility"})
    return data

%timeit rename_volatility_column(data)

# 2: column selectors with select and alias
def adjust_volatility_column(data):
    return data.select(
        ~cs.contains("volatility_pct"),
        cs.contains("volatility_pct").alias("realized_volatility"),
    )

%timeit adjust_volatility_column(data)

# 3: rename with a callable applied to every column name
%timeit data.rename(lambda col: "realized_volatility" if "volatility_pct" in col else col)
#1
18.8 µs ± 636 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
#2
330 µs ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
#3
133 µs ± 7.71 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
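Approach #1 boils down to building a one-entry mapping from the column names and handing it to rename. As a rough sketch of that lookup on its own (the helper name `build_rename_mapping` and the at-most-one-match check are assumptions added for illustration, not part of the original code), it could look like this:

```python
def build_rename_mapping(columns, fragment, new_name):
    """Return a {old: new} mapping for the single column containing `fragment`."""
    matches = [col for col in columns if fragment in col]
    if len(matches) > 1:
        # Renaming several columns to one name would produce duplicate names.
        raise ValueError(f"expected at most one match, found {matches}")
    # Empty when nothing matched, otherwise exactly one entry.
    return {col: new_name for col in matches}


columns = ["foo", "bar", "std_volatility_pct_21D"]
mapping = build_rename_mapping(columns, "volatility_pct", "realized_volatility")
print(mapping)  # {'std_volatility_pct_21D': 'realized_volatility'}
# With a Polars DataFrame this would be used as: data.rename(mapping)
```

The explicit check is a design choice: unlike the loop in #1, which silently stops at the first match, this version fails loudly if more than one column carries the fragment.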
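You can use Polars' column selectors.
~cs.contains("volatility_pct") selects all columns whose name does not contain volatility_pct.
cs.contains("volatility_pct").alias("realized_volatility") selects all columns whose name contains volatility_pct and renames them to realized_volatility.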
import polars.selectors as cs

(
    data
    .select(
        ~cs.contains("volatility_pct"),
        cs.contains("volatility_pct").alias("realized_volatility"),
    )
)
.rename() also accepts a Callable, which may be nicer to write.
data.rename(lambda col:
    "realized_volatility" if "volatility_pct" in col else col
)
shape: (5, 3)
┌─────┬─────┬─────────────────────┐
│ foo ┆ bar ┆ realized_volatility │
│ --- ┆ --- ┆ ---                 │
│ i64 ┆ i64 ┆ f64                 │
╞═════╪═════╪═════════════════════╡
│ 1   ┆ 5   ┆ 0.1                 │
│ 2   ┆ 4   ┆ 0.2                 │
│ 3   ┆ 3   ┆ 0.15                │
│ 4   ┆ 2   ┆ 0.18                │
│ 5   ┆ 1   ┆ 0.16                │
└─────┴─────┴─────────────────────┘
In terms of performance, there doesn't seem to be much difference between the two approaches.
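Since the question mentions regex-based matching, the callable passed to .rename() could also use Python's re module instead of a substring test. The following is only a sketch: the pattern and the names `VOLATILITY_PATTERN` and `rename_column` are hypothetical and would need adjusting to the real prefix and suffix rules.

```python
import re

# Hypothetical pattern: the fixed fragment followed by a window such as "21D"
# at the end of the name; any prefix is allowed because search() is used.
VOLATILITY_PATTERN = re.compile(r"volatility_pct_\d+D$")


def rename_column(col):
    """Map a matching column name to the stable name; leave others unchanged."""
    return "realized_volatility" if VOLATILITY_PATTERN.search(col) else col


columns = ["foo", "bar", "std_volatility_pct_21D"]
print([rename_column(col) for col in columns])  # ['foo', 'bar', 'realized_volatility']
# With a Polars DataFrame this would be used as: data.rename(rename_column)
```

Compiling the pattern once outside the function avoids rebuilding it for every column name.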