I have a polars.DataFrame Strings:
| Strings |
-------------------------------
| "Apples are red" |
| "Apples are not Bananas" |
| "Bananas are not red" |
| "I like Apples and Bananas" |
| "Oranges are kinda red" |
-------------------------------
and another DataFrame Patterns:
| Patterns |
-------------------------
| ["Bananas", "red"] |
| ["Apples"] |
| ["Apples", "Bananas"] |
-------------------------
I want to match each pattern from Patterns against each string in Strings. A pattern is given as a list, meaning that all words in the list must be found for the pattern to count as matched. So the result should look like this:
| Strings | matches_index | matches |
|-----------------------------|---------------|------------------------------------|
| "Apples are red" | [1] | [["Apples"]] |
| "Apples are not Bananas" | [1, 2] | [["Apples"], ["Apples", "Bananas"]]|
| "Bananas are not red" | [0] | [["Bananas", "red"]] |
| "I like Apples and Bananas" | [1, 2] | [["Apples"], ["Apples", "Bananas"]]|
| "Oranges are kinda red" | [] | [] |
My current solution is to iterate over the patterns, join each list with |, use extract_all with the joined pattern, and check whether the number of unique matches equals the length of the original pattern list. This works, but maybe someone knows how to do this in native polars (without iter_rows() over the patterns). The current implementation looks like this:
from itertools import compress

import polars as pl

patterns = (
    keys.with_columns(
        pl.col("patterns")
        .list.unique()
        # ignore case when matching
        .list.eval(pl.element().str.strip_chars().str.to_lowercase())
        .alias("kw_lower")
    )
    # get the number of keywords in each list
    .with_columns(pl.col("kw_lower").list.len().alias("n_kw"))
    # create a single regex for each list
    .with_columns(pl.col("kw_lower").list.join("|").alias("kw_regex"))
)
# turn strings to lowercase
strings = strings.with_columns(pl.col(aff_column).str.to_lowercase())

for i, (n_kw, kw_regex) in enumerate(
    patterns.select("n_kw", "kw_regex").iter_rows()
):
    strings = strings.with_columns(
        # column i for pattern i: True if all words
        # from pattern i's list are contained
        (
            pl.col("Strings")
            .str.extract_all(kw_regex)
            .list.unique()
            .list.len()
            == n_kw
        ).alias(f"matches_{i}")
    )
strings = (
    # concatenate all bool values from each column
    strings.with_columns(
        pl.concat_list(pl.col(r"^matches_[0-9]+$")).alias("matches_list")
    )
    .drop(pl.selectors.matches(r"^matches_[0-9]+$"))
    # get the indices of the matches by compressing range(len(x))
    .with_columns(
        pl.col("matches_list")
        .map_elements(lambda x: list(compress(range(len(x)), x)))
        .alias("matches_indices")
    )
    # get the matches themselves by compressing the pattern
    # column itself
    .with_columns(
        pl.col("matches_list")
        .map_elements(
            lambda x: list(compress(patterns.get_column("patterns").to_list(), x))
        )
        .alias("matches")
    )
)
Thanks :)
You can use a cross join to compare all rows. You can then .explode() the Patterns lists, run .str.contains(), and use .group_by() to build the match_index and Patterns lists.
df_a = pl.DataFrame({
"Strings": [
"Apples are red",
"Apples are not Bananas",
"Bananas are not red",
"I like Apples and Bananas",
"Oranges are kinda red",
]
})
df_b = pl.DataFrame(
{"Patterns": [["Bananas", "red"], ["Apples"], ["Apples", "Bananas"]]}
)
(
    df_a.with_row_index("row_nr")
    .join(
        df_b.with_row_index("match_index"),
        how="cross"
    )
    .explode("Patterns")
    .with_columns(
        # could be put directly in .filter
        match = pl.col("Strings").str.contains(pl.col("Patterns")) # literal=True?
    )
    .filter(pl.col("match").all().over("row_nr", "match_index"))
    .group_by("row_nr", "match_index", maintain_order=True)
    .agg(pl.first("Strings"), "Patterns")
    .group_by("row_nr")
    .agg(pl.first("Strings"), "match_index", "Patterns")
)
shape: (4, 4)
┌────────┬───────────────────────────┬─────────────┬───────────────────────────────────┐
│ row_nr ┆ Strings ┆ match_index ┆ Patterns │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ list[u32] ┆ list[list[str]] │
╞════════╪═══════════════════════════╪═════════════╪═══════════════════════════════════╡
│ 0 ┆ Apples are red ┆ [1] ┆ [["Apples"]] │
│ 1 ┆ Apples are not Bananas ┆ [1, 2] ┆ [["Apples"], ["Apples", "Bananas… │
│ 2 ┆ Bananas are not red ┆ [0] ┆ [["Bananas", "red"]] │
│ 3 ┆ I like Apples and Bananas ┆ [1, 2] ┆ [["Apples"], ["Apples", "Bananas… │
└────────┴───────────────────────────┴─────────────┴───────────────────────────────────┘
This filters out the non-matching rows - but you can do a left join:
df_a.join(..., how="left")
to get the desired result.