Return all matching substrings from a list found in a Polars string column as a column

Question

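How can I return a column containing all of the matching terms or substrings found in a string? I suspect there is a way to do this with pl.any_horizontal(), as suggested in these comments, but I can't piece it together.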

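import re

import polars as pl

terms = ['a', 'This', 'e']

(pl.DataFrame({'col': 'This is a sentence'})
   # Python-level fallback: collect the distinct regex matches for each row
   .with_columns(matched_terms = pl.col('col').map_elements(lambda x: list(set(re.findall('|'.join(terms), x))), return_dtype=pl.List(pl.String)))
)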

The column should return: ['a', 'This', 'e']

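Edit: the winning solution here,

.str.extract_all('|'.join(terms)).list.unique()

differs from the winning solution to this closely related question,

pl.col('col').str.split(' ').list.set_intersection(terms)

because .set_intersection() only matches whole list elements. It does not pick up substrings of those elements (e.g. partial or incomplete words).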

Answer 1

The each_term column below,

pl.col('a').str.extract_all('|'.join(terms))

was the best solution for me.

import polars as pl

# Show up to 4 list elements per table cell in the printed output
pl.Config.set_fmt_table_cell_list_len(4)

terms = ['A', 'u', 'bug', 'g']

(pl.DataFrame({'a': 'A bug in a rug.'})
 .select(has_term = pl.col('a').str.contains_any(terms),
         has_term2 = pl.col('a').str.contains('|'.join(terms)),
         has_term3 = pl.any_horizontal(pl.col("a").str.contains(t) for t in terms),
         
         each_term = pl.col('a').str.extract_all('|'.join(terms)),
         
         whole_terms = pl.col('a').str.split(' ').list.set_intersection(terms),
         n_matched_terms = pl.col('a').str.count_matches('|'.join(terms)),
        )
)

shape: (1, 6)
┌──────────┬───────────┬───────────┬────────────────────────┬──────────────┬─────────────────┐
│ has_term ┆ has_term2 ┆ has_term3 ┆ each_term              ┆ whole_terms  ┆ n_matched_terms │
│ ---      ┆ ---       ┆ ---       ┆ ---                    ┆ ---          ┆ ---             │
│ bool     ┆ bool      ┆ bool      ┆ list[str]              ┆ list[str]    ┆ u32             │
╞══════════╪═══════════╪═══════════╪════════════════════════╪══════════════╪═════════════════╡
│ true     ┆ true      ┆ true      ┆ ["A", "bug", "u", "g"] ┆ ["A", "bug"] ┆ 4               │
└──────────┴───────────┴───────────┴────────────────────────┴──────────────┴─────────────────┘
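One caveat with '|'.join(terms): the joined string is treated as a regular expression, so a term containing a metacharacter such as '.' matches more than its literal text, and a term that is a prefix of another term (for example 'bu' and 'bug') shadows the longer one when it comes first. The sketch below shows one way to build a safer pattern. The helper build_pattern and the sample terms and text are hypothetical, and the behaviour is demonstrated with Python's re module rather than Polars; it is an assumption that the escaped pattern is also accepted by the regex engine Polars uses, so check it against your terms before relying on it.

```python
import re


def build_pattern(terms):
    """Hypothetical helper: escape each term and try longer terms first."""
    # Sort by descending length (then alphabetically, for a stable result)
    ordered = sorted(set(terms), key=lambda t: (-len(t), t))
    # re.escape makes metacharacters such as '.' match literally
    return '|'.join(re.escape(t) for t in ordered)


terms = ['bu', 'bug', 'a.']
text = 'a bug, a.k.a. pest'

naive = '|'.join(terms)
pattern = build_pattern(terms)

naive_matches = re.findall(naive, text)
safe_matches = re.findall(pattern, text)
print(naive_matches)  # ['a ', 'bu', 'a.', 'a.']
print(safe_matches)   # ['bug', 'a.', 'a.']
```

With the naive pattern, 'a.' also matches "a " and 'bu' wins over 'bug'; with the escaped, length-ordered pattern only the literal terms are returned.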
