Python Polars: matching a list of patterns against every row


I have a polars.DataFrame named Strings:

| Strings                     |
-------------------------------
| "Apples are red"            |
| "Apples are not Bananas"    |
| "Bananas are not red"       |
| "I like Apples and Bananas" |
| "Oranges are kinda red"     |
-------------------------------

and another DataFrame named Patterns:

| Patterns              |
-------------------------
| ["Bananas", "red"]    |
| ["Apples"]            |
| ["Apples", "Bananas"] |
-------------------------

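I want to match every pattern in Patterns against every string in Strings. Each pattern is given as a list, which means that all words in the list must be found in a string for that list to count as a match. The result should therefore look like this: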

| Strings                     | matches_index | matches                            |
|-----------------------------|---------------|------------------------------------|
| "Apples are red"            | [1]           | [["Apples"]]                       |
| "Apples are not Bananas"    | [1, 2]        | [["Apples"], ["Apples", "Bananas"]]|
| "Bananas are not red"       | [0]           | [["Bananas", "red"]]               |
| "I like Apples and Bananas" | [1, 2]        | [["Apples"], ["Apples", "Bananas"]]|
| "Oranges are kinda red"     | []            | []                                 |  
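To pin down the intended semantics, here is a minimal pure-Python sketch of the rule described above. It assumes plain, case-sensitive substring containment, and the helper name match_patterns is hypothetical (it is not part of Polars); it is only meant as a reference to compare a Polars solution against.

```python
def match_patterns(strings, patterns):
    """For each string, collect the indices and contents of every
    pattern list whose words ALL occur in that string."""
    results = []
    for s in strings:
        # a pattern list matches only if every one of its words is a substring of s
        idx = [i for i, words in enumerate(patterns) if all(w in s for w in words)]
        results.append((idx, [patterns[i] for i in idx]))
    return results


strings = [
    "Apples are red",
    "Apples are not Bananas",
    "Bananas are not red",
    "I like Apples and Bananas",
    "Oranges are kinda red",
]
patterns = [["Bananas", "red"], ["Apples"], ["Apples", "Bananas"]]

for s, (idx, matched) in zip(strings, match_patterns(strings, patterns)):
    print(s, idx, matched)
```

The last string yields empty lists because no pattern list is fully contained in it.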

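My current solution iterates over the patterns, joins each list with |, runs extract_all with the joined pattern, and checks whether the number of unique matches equals the length of the original pattern list. This works, but maybe someone knows how to do this in native Polars (without iter_rows() over the patterns).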

The current implementation looks like this:

    from itertools import compress

    import polars as pl
    import polars.selectors as cs

    # `patterns` has a list[str] column "Patterns", `strings` a str column "Strings".
    # Written against recent Polars names (str.strip_chars, list.len, map_elements).
    patterns = (
        patterns.with_columns(
            pl.col("Patterns")
            # ignore case and surrounding whitespace when matching
            .list.eval(pl.element().str.strip_chars().str.to_lowercase())
            # deduplicate after lowercasing so n_kw counts distinct keywords
            .list.unique()
            .alias("kw_lower")
        )
        # get the number of keywords in each list
        .with_columns(pl.col("kw_lower").list.len().alias("n_kw"))
        # create a single alternation regex for each list
        .with_columns(pl.col("kw_lower").list.join("|").alias("kw_regex"))
    )
    # turn strings to lowercase
    strings = strings.with_columns(pl.col("Strings").str.to_lowercase())

    for i, (n_kw, kw_regex) in enumerate(
        patterns.select("n_kw", "kw_regex").iter_rows()
    ):
        strings = strings.with_columns(
            # column i for pattern i: True if all words from
            # pattern i's list are contained in the string
            (
                pl.col("Strings").str.extract_all(kw_regex).list.unique().list.len()
                == n_kw
            ).alias(f"matches_{i}")
        )

    pattern_lists = patterns["Patterns"].to_list()
    strings = (
        # concatenate all bool values from each column
        strings.with_columns(
            pl.concat_list(pl.col(r"^matches_[0-9]+$")).alias("matches_list")
        )
        .drop(cs.matches(r"^matches_[0-9]+$"))
        # get the indices of the matches by compressing range(len(x))
        .with_columns(
            pl.col("matches_list")
            .map_elements(
                lambda x: list(compress(range(len(x)), x)),
                return_dtype=pl.List(pl.Int64),
            )
            .alias("matches_indices")
        )
        # get the matches themselves by compressing the pattern lists
        .with_columns(
            pl.col("matches_list")
            .map_elements(
                lambda x: list(compress(pattern_lists, x)),
                return_dtype=pl.List(pl.List(pl.String)),
            )
            .alias("matches")
        )
    )

Thanks :)

python dataframe pattern-matching python-polars
1 Answer

You can use a cross join to compare every string with every pattern list.

You can then .explode() the pattern lists, run .str.contains(), and use .group_by() to build the match_index and pattern lists.

import polars as pl

df_a = pl.DataFrame({
   "Strings": [
      "Apples are red",
      "Apples are not Bananas",
      "Bananas are not red",
      "I like Apples and Bananas",
      "Oranges are kinda red",
   ]
})

df_b = pl.DataFrame(
    {"Patterns": [["Bananas", "red"], ["Apples"], ["Apples", "Bananas"]]}
)
(
    # number the rows on both sides so matches can be traced back
    df_a.with_row_index("row_nr")
    .join(
        df_b.with_row_index("match_index"),
        how="cross"
    )
    # one row per (string, pattern list, word)
    .explode("Patterns")
    .with_columns(
        # could be put directly in .filter
        # patterns are treated as regexes here; literal=True would match plain substrings
        match = pl.col("Strings").str.contains(pl.col("Patterns"))
    )
    # keep a (string, pattern list) pair only if every word matched
    .filter(pl.col("match").all().over("row_nr", "match_index"))
    .group_by("row_nr", "match_index", maintain_order=True)
    .agg(pl.first("Strings"), "Patterns")
    .group_by("row_nr", maintain_order=True)
    .agg(pl.first("Strings"), "match_index", "Patterns")
)

shape: (4, 4)
┌────────┬───────────────────────────┬─────────────┬───────────────────────────────────┐
│ row_nr ┆ Strings                   ┆ match_index ┆ Patterns                          │
│ ---    ┆ ---                       ┆ ---         ┆ ---                               │
│ u32    ┆ str                       ┆ list[u32]   ┆ list[list[str]]                   │
╞════════╪═══════════════════════════╪═════════════╪═══════════════════════════════════╡
│ 0      ┆ Apples are red            ┆ [1]         ┆ [["Apples"]]                      │
│ 1      ┆ Apples are not Bananas    ┆ [1, 2]      ┆ [["Apples"], ["Apples", "Bananas… │
│ 2      ┆ Bananas are not red       ┆ [0]         ┆ [["Bananas", "red"]]              │
│ 3      ┆ I like Apples and Bananas ┆ [1, 2]      ┆ [["Apples"], ["Apples", "Bananas… │
└────────┴───────────────────────────┴─────────────┴───────────────────────────────────┘

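This drops the rows that match no pattern, but you can left-join the result back onto the original frame with df_a.join(..., how="left") to get the desired result.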
