Polars: an efficient way to apply a function that filters a string column

Question

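I have a column of long strings (such as sentences) on which I want to do the following:

  1. Replace certain characters
  2. Create a list of the remaining strings
  3. If a string is all text, check whether it is in a dictionary and, if so, keep it
  4. If a string is all digits, keep it
  5. If a string is a mix of digits and text, compute the ratio of digits to letters and keep it if the ratio is above a threshold

My current approach is as follows: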

            for memo_field in self.memo_columns:
                data = data.with_columns(
                    pl.col(memo_field).map_elements(
                        lambda x: self.filter_field(text=x, word_dict=word_dict))
                    )

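The filter_field method uses pure Python, so:

  • text_sani = re.sub(r'[^a-zA-Z0-9\s\_\-\%]', ' ', text) replaces the unwanted characters
  • text_sani = text_sani.split(' ') splits the string into words
  • len(re.findall(r'[A-Za-z]', x)) counts the letters in each element of the text_sani list (the digit count is obtained similarly); the ratio is the difference of the two counts divided by their sum
  • a list comprehension with an if filters the list of words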

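This is actually not bad: 128M rows take about 10 minutes. Unfortunately, future files will be much larger. On a file of roughly 300M rows, this approach gradually increases memory consumption until the OS (Ubuntu) kills the process. In addition, all processing appears to happen on a single core.

I have started experimenting with Polars string expressions; the code and a toy example are provided below.

At this point it looks like my only option is to call a function to do the rest. My questions are: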

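  1. Is the growth in memory consumption normal for my original approach? Does map_elements create a copy of the original Series and therefore consume more memory?
  2. Is my original approach the right one, or is there a better way? For example, I have just started reading about struct in Polars.
  3. Is it possible to do what I want using only Polars expressions?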

Update:

Toy data:

import re

import polars as pl

temp = pl.DataFrame({"foo": ['COOOPS.autom.SAPF124',
                            'OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023',
                            'ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO',
                            'OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023']
                    })

Code Functions:

def num_dec(x):
    return len(re.findall(r'[0-9_\/]', x))

def num_letter(x):
    return len(re.findall(r'[A-Za-z]', x))

def letter_dec_ratio(x):
    if len(x) == 0:
        return None
    nl = num_letter(x)
    nd = num_dec(x)
    if (nl + nd) == 0:       
        return None
    ratio = (nl - nd)/(nl + nd)
    return ratio

def filter_field(text=None, word_dict=None):

    if type(text) is not str or word_dict is None:
        return 'no memo and/or dictionary'

    if len(text) > 100:
        text = text[0:101]
    print("TEXT: ",text)
    text_sani = re.sub(r'[^a-zA-Z0-9\s\_\-\%]', ' ', text) # parse by replacing most artifacts and symbols with space 

    words = text_sani.split(' ') # create words separated by spaces
    print("WORDS: ",words)

    ratios = [letter_dec_ratio(w) for w in words]
    # keep digit-only words (ratio == -1), mixed words with a ratio in [-0.7, 0],
    # and letter-only words (ratio == 1) that appear in the dictionary
    kept = [
        w.lower()
        for w, r in zip(words, ratios)
        if r is not None
        and (r == -1 or -0.7 <= r <= 0 or (r == 1 and w.lower() in word_dict))
    ]
    print("FINAL: ",' '.join(kept))

    return ' '.join(kept)
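
To make the thresholds in filter_field concrete, the sketch below recomputes the ratio for a few tokens. It is only a condensed restatement of letter_dec_ratio above; the name `ratio` and the token `ab123` are introduced here purely for illustration and do not appear in the toy data.

```python
import re


def ratio(word):
    """(letters - digit-like chars) / (letters + digit-like chars), or None."""
    n_let = len(re.findall(r"[A-Za-z]", word))
    n_dec = len(re.findall(r"[0-9_/]", word))  # digits, underscore and slash
    total = n_let + n_dec
    return None if total == 0 else (n_let - n_dec) / total


# "ab123" is a hypothetical mixed token; the others come from the toy data
for token in ["atm", "6079", "04-04-2023", "SAPF124", "ab123", ""]:
    print(repr(token), ratio(token))
```

Here `atm` scores 1 and survives only because it is in the dictionary, `6079` and `04-04-2023` score -1 and are always kept (the hyphens are not counted), `SAPF124` scores 1/7 and is dropped because that lies outside [-0.7, 0], and `ab123` scores -0.2 and is kept. The empty string yields None and is dropped.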

Code Current Implementation:

temp.with_columns(
                pl.col("foo").map_elements(
                    lambda x: filter_field(text=x, word_dict=['cost','atm'])).alias('clean_foo') # baseline
                )

Code Partial Attempt w/Polars:

This gets me the correct WORDS (see the next code block):

temp.with_columns(
    (
        pl.col("foo")
        .str.replace_all(r'[^a-zA-Z0-9\s\_\-\%]',' ')
        .str.split(' ')
    )
)

Expected Result (at each step, see the print statements above):

TEXT:  COOOPS.autom.SAPF124
WORDS:  ['COOOPS', 'autom', 'SAPF124']
FINAL:  
TEXT:  OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023
WORDS:  ['OSS', 'REEE', 'PAAA', 'comp', '', 'BEEE', '', 'atm', '6079', '19000000070', '04-04-2023']
FINAL:  atm 6079 19000000070 04-04-2023
TEXT:  ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO
WORDS:  ['ABCD', '600000000397', '7667896-6', 'REG', 'REF', 'REE', 'PREPREO', 'HMO']
FINAL:  600000000397 7667896-6
TEXT:  OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023
WORDS:  ['OSS', 'REFF', 'pagopago', 'cost', '', 'Becf', '', 'atm', '9682', '50012345726', '10-04-2023']
FINAL:  cost atm 9682 50012345726 10-04-2023

Tags: python, string, string-matching, python-polars

1 Answer

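The filtering can be implemented with Polars' native expression API as follows. I took the regular expressions from the naive implementation in the question.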

word_list = ["cost", "atm"]

# to avoid long expressions in ``pl.Expr.list.eval``
num_dec_expr = pl.element().str.count_matches(r'[0-9_\/]').cast(pl.Int32)
num_letter_expr = pl.element().str.count_matches(r'[A-Za-z]').cast(pl.Int32)
ratio_expr = (num_letter_expr - num_dec_expr) / (num_letter_expr + num_dec_expr)

(
    df  # the toy DataFrame from the question (named ``temp`` there)
    .with_columns(
        pl.col("foo")
        # convert to lowercase
        .str.to_lowercase()
        # replace special characters with space
        .str.replace_all(r"[^a-z0-9\s\_\-\%]", " ")
        # split string at spaces into list of words
        .str.split(" ")
        # filter list of words
        .list.eval(
            pl.element().filter(
                # only keep non-empty string...
                pl.element().str.len_chars() > 0,
                # ...that either 
                # - are in the list of words,
                # - consist only of characters related to numbers,
                # - have a ratio between -0.7 and 0
                pl.element().is_in(word_list) | num_letter_expr.eq(0) | ratio_expr.is_between(-0.7, 0)
            )
        )
        # join list of words into string
        .list.join(" ")
        .alias("foo_clean")
    )
)
shape: (4, 2)
┌───────────────────────────────────────────────────────────────┬──────────────────────────────────────┐
│ foo                                                           ┆ foo_clean                            │
│ ---                                                           ┆ ---                                  │
│ str                                                           ┆ str                                  │
╞═══════════════════════════════════════════════════════════════╪══════════════════════════════════════╡
│ COOOPS.autom.SAPF124                                          ┆                                      │
│ OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023     ┆ atm 6079 19000000070 04-04-2023      │
│ ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO           ┆ 600000000397 7667896-6               │
│ OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023 ┆ cost atm 9682 50012345726 10-04-2023 │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────────┘