How should I perform operations on vectors/matrices in Polars

Question

I'm looking for a Polars way to perform operations on vectors (list/array) and matrices (list(list)/array(array)).

polars-0.19.9

A small df:

import polars as pl
df = pl.DataFrame({
   "a": [[1,2], [3,4]],
   "b": [[10, 20], [30, 40]]
})

It seems Polars does not support operations on `lists/arrays`:

df.with_columns(pl.col("a") + pl.col("b"))
PanicException: `add` operation not supported for dtype `list[i64]`

But it does support operations on `structs` (which was very surprising):

df.with_columns((pl.col("a").list.to_struct() + pl.col("b").list.to_struct()).alias("sum"))

For vectors we could probably use explode + group_by + join, but the downside is having to perform a join:

df = df.with_row_count('i')

c = (
  df
    .select(["a", "b", "i"])  # required to not explode other cols in frame
    .explode(['a','b'])
    .group_by('i')
    .agg(c=pl.col('a')+pl.col('b'))
    .select(['i','c'])
)

df = df.join(c, on="i")  # but now we need to join the resulting col back to the frame

Another way to do this for vectors is explode + group_by(maintain_order) + hstack, which removes the need for a join:

df = df.with_row_count('i')

c = (
  df
    .select(["a", "b", "i"])  # required to not explode other cols in frame
    .explode(['a','b'])
    .group_by('i', maintain_order=True)  # allows using hstack
    .agg(c=pl.col('a')+pl.col('b'))
    .select(['c'])
)

df = df.hstack(c)

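Apply/map_elements with conversion of the lists to numpy arrays doesn't seem to be an option at all: Polars uses only one core when executing apply (although VAEX claims to be able to parallelize apply across multiple threads), and as far as I understand no zero-copy happens: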

import numpy as np

df.with_columns(
   pl.struct(["a", "b"])
   .apply(lambda x: (np.array(x["a"]) + np.array(x["b"])).tolist())
   .alias("c")
)

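What is the recommendation here: which way is considered more idiomatic in Polars? How should I handle the case where I have nested lists/arrays in a row/col? Doing multiple rounds of explode + group_by seems hard to manage.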

Thank you.

Answer 1

Your number 3 can be made a bit more concise, like this:

(
  df.with_row_count('i')
    .explode(['a','b'])
    .group_by('i')
    .agg('a','b',c=pl.col('a')+pl.col('b'))
    .drop('i')
)
shape: (2, 3)
┌───────────┬───────────┬───────────┐
│ a         ┆ b         ┆ c         │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1, 2]    ┆ [10, 20]  ┆ [11, 22]  │
│ [3, 4]    ┆ [30, 40]  ┆ [33, 44]  │
└───────────┴───────────┴───────────┘

This way or the struct way is probably best. You'd have to benchmark them to be sure.

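You could try using numba and its guvectorize decorator to create a ufunc, but I'm not sure whether numba supports taking a list dtype, so this may be a wild goose chase. Here is a numba example from a different application as a starting point. If you get it working, please post it as an answer; I'd like to see it.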

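Using numba here doesn't work. Polars doesn't support converting lists to C types for the ufunc to ingest. Even passing `Series.to_arrow()` into the ufunc raises an error. I think this is a limitation of ufuncs not being 100% interoperable with all Polars types, rather than a Polars shortcoming that could be fixed on the Polars side, but I could easily be wrong here.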

Another idea that does work is a round trip through numpy:

(
    df
    .with_columns(
        c=pl.Series([
            pl.Series(x[0] + x[1]) 
            # Even though we index the columns by position above,
            # the select below fixes their order, so this
            # works no matter how many other columns
            # the frame has
            for x in df.select('a','b').to_numpy()
            ])
        )
    )

Another potential goose chase:

I had forgotten that this exists. You can write an expression in Rust, compile it, and then have a custom vectorized expression.
