Consecutive sequences in an array within a dataframe - Polars

Question

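Since asking this, I have also started looking into using Polars to process my data. Judging from the documentation for map_rows, there are plenty of warnings against using Python functions instead of Polars' own expression system, because they are much slower. Is it possible to use Polars expressions to find consecutive sequences in the rows of a dataframe?

A bonus question: does the @njit decorator on my existing UDF (from the previous question) have any effect when it is run through Polars' map_rows? In case it is relevant, I am currently using the Polars Array type rather than List (but I can change that if needed).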

Edit:

My dataframe (in CSV format) is shaped like this:

Myname,"1,2,3,4,5,5,5,5",AnotherName

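The length of the middle array is fixed, but I don't necessarily know how long the run will be (in this example it is 4 long).

As mentioned, I am currently using the UDF from my previous question, which I apply to many rows in order to filter them.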

Answer 1

It sounds like you have:

import polars as pl

df = pl.read_csv(b"""
name,array,other
a,"1,2,3,4,5,5,5,5",e
b,"1,2,3,4,6,6,6",f
c,"1,2,2,2,2,3,4,6,6,6",g
d,"1,1,1,1,1",h
""").with_columns(
   array = pl.format("[{}]", "array").str.json_decode()
).with_row_index()

shape: (4, 4)
┌───────┬──────┬────────────────────────────────┬───────┐
│ index ┆ name ┆ array                          ┆ other │
│ ---   ┆ ---  ┆ ---                            ┆ ---   │
│ u32   ┆ str  ┆ list[i64]                      ┆ str   │
╞═══════╪══════╪════════════════════════════════╪═══════╡
│ 0     ┆ a    ┆ [1, 2, 3, 4, 5, 5, 5, 5]       ┆ e     │
│ 1     ┆ b    ┆ [1, 2, 3, 4, 6, 6, 6]          ┆ f     │
│ 2     ┆ c    ┆ [1, 2, 2, 2, 2, 3, 4, 6, 6, 6] ┆ g     │
│ 3     ┆ d    ┆ [1, 1, 1, 1, 1]                ┆ h     │
└───────┴──────┴────────────────────────────────┴───────┘

and you want to test whether the array column contains any run (in the run-length encoding sense) of length >= N.
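To make the target concrete, here is a plain-Python reference for "longest run of equal consecutive values", applied to the four example rows. It is only a sketch for checking results, not the approach recommended here; max_run_length and rows are names made up for this illustration.

```python
from itertools import groupby


def max_run_length(values):
    """Return the length of the longest run of equal consecutive values."""
    # groupby() yields one group per run of equal adjacent items.
    return max((sum(1 for _ in group) for _, group in groupby(values)), default=0)


# Hypothetical copy of the example data, keyed by the "name" column.
rows = {
    "a": [1, 2, 3, 4, 5, 5, 5, 5],
    "b": [1, 2, 3, 4, 6, 6, 6],
    "c": [1, 2, 2, 2, 2, 3, 4, 6, 6, 6],
    "d": [1, 1, 1, 1, 1],
}

longest = {name: max_run_length(arr) for name, arr in rows.items()}
print(longest)
```

A row passes the ">= N" test when its entry in longest is at least N.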

Polars provides the rle and rle_id expressions.

Using .rle(), the combination .flatten() + .over("index") is one way to "emulate" a row-wise list operation.

df.with_columns(
   pl.col("array").flatten().rle().struct["lengths"].max().over("index")
     .alias("rle_max_len")
)

shape: (4, 5)
┌───────┬──────┬────────────────────────────────┬───────┬─────────────┐
│ index ┆ name ┆ array                          ┆ other ┆ rle_max_len │
│ ---   ┆ ---  ┆ ---                            ┆ ---   ┆ ---         │
│ u32   ┆ str  ┆ list[i64]                      ┆ str   ┆ i32         │
╞═══════╪══════╪════════════════════════════════╪═══════╪═════════════╡
│ 0     ┆ a    ┆ [1, 2, 3, 4, 5, 5, 5, 5]       ┆ e     ┆ 4           │
│ 1     ┆ b    ┆ [1, 2, 3, 4, 6, 6, 6]          ┆ f     ┆ 3           │
│ 2     ┆ c    ┆ [1, 2, 2, 2, 2, 3, 4, 6, 6, 6] ┆ g     ┆ 4           │
│ 3     ┆ d    ┆ [1, 1, 1, 1, 1]                ┆ h     ┆ 5           │
└───────┴──────┴────────────────────────────────┴───────┴─────────────┘

This is essentially the same as doing:

(df.group_by("index")
   .agg(
      pl.all().first(),
      rle_max_len = pl.col("array").flatten().rle().struct["lengths"].max()
   )
)

If you don't need the length itself, you can pass the expression directly to .filter().

df.filter(
   pl.col("array").flatten().rle().struct["lengths"].max().over("index") > 3
)

shape: (3, 4)
┌───────┬──────┬────────────────────────────────┬───────┐
│ index ┆ name ┆ array                          ┆ other │
│ ---   ┆ ---  ┆ ---                            ┆ ---   │
│ u32   ┆ str  ┆ list[i64]                      ┆ str   │
╞═══════╪══════╪════════════════════════════════╪═══════╡
│ 0     ┆ a    ┆ [1, 2, 3, 4, 5, 5, 5, 5]       ┆ e     │
│ 2     ┆ c    ┆ [1, 2, 2, 2, 2, 3, 4, 6, 6, 6] ┆ g     │
│ 3     ┆ d    ┆ [1, 1, 1, 1, 1]                ┆ h     │
└───────┴──────┴────────────────────────────────┴───────┘
