Modifying a regex capture group in a column

Question

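How can I modify a capture group in pandas df.replace()? I am trying to add thousands separators to the numbers inside each cell's string. This should happen within a method chain. Here is my code so far: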

import pandas as pd

df = pd.DataFrame({'a_column': ['1000 text', 'text', '25000 more text', '1234567', 'more text'],
        "b_column": [1, 2, 3, 4, 5]})

df = (df.reset_index()
      .replace({"a_column": {r"(\d+)": r"\1"}}, regex=True))

The problem is that I don't know how to do anything with r"\1". For example, str(float(r"\1")) does not work.
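To illustrate why this fails, here is a minimal sketch using only the standard re module (the variable names are made up for illustration). The string r"\1" is a plain two-character template that the regex engine expands later, so any Python function applied to it never sees the captured digits. A callable replacement, by contrast, receives the match object itself.

```python
import re

text = "25000 more text"

# A backreference template can only re-insert the captured text unchanged.
unchanged = re.sub(r"(\d+)", r"\1", text)

# r"\1" is just a backslash followed by "1" at this point, not a number,
# so converting it raises ValueError before any regex work happens.
try:
    float(r"\1")
    template_is_number = True
except ValueError:
    template_is_number = False

# A callable replacement gets the match object and can transform the capture.
formatted = re.sub(r"(\d+)", lambda m: f"{int(m.group(1)):,}", text)

print(unchanged)
print(template_is_number)
print(formatted)
```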

Expected output:

   index          a_column  b_column
0      0        1,000 text         1
1      1              text         2
2      2  25,000 more text         3
3      3         1,234,567         4
4      4         more text         5

Answer 1

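You can use replace in your pipeline with the following regex, which matches every position that is preceded by a digit and followed by a multiple of three digits: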

(?<=\d)(?=(?:\d{3})+\b)

Each such position can then be replaced with a comma (,).

df = (df
    .reset_index()
    .replace({ 'a_column' : { r'(?<=\d)(?=(?:\d{3})+\b)' : ',' } }, regex=True)
)

Output:

   index               a_column  b_column
0      0             1,000 text         1
1      1                   text         2
2      2       25,000 more text         3
3      3              1,234,567         4
4      4              more text         5
5      5  563 and 45 and 9 text         6

Note that I added an extra row to df ('563 and 45 and 9 text', 6) to show that no commas are inserted where they do not belong.
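The lookaround pattern can also be tried on plain strings outside pandas. The following is a small sketch with the standard re module. The helper name insert_commas and the sample list are assumptions for illustration. The last sample shows a limitation worth knowing: digits after a decimal point are grouped as well, so this pattern suits integers only.

```python
import re

# Zero-width match: a position with a digit behind it and a whole number
# of three-digit groups ahead of it, ending at a word boundary.
SEPARATOR_POSITIONS = re.compile(r"(?<=\d)(?=(?:\d{3})+\b)")

def insert_commas(text):
    """Insert a comma at every matched position (hypothetical helper)."""
    return SEPARATOR_POSITIONS.sub(",", text)

samples = ["1000 text", "1234567", "563 and 45 and 9 text", "1234.5678"]
results = [insert_commas(s) for s in samples]

for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

Because the match is zero-width, nothing is consumed and no capture group needs to be modified, which is why this approach works with DataFrame.replace.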

Answer 2

You can use a regular expression together with a replacement function that formats each match:

import re
import pandas as pd

df = pd.DataFrame({'a_column': ['1000 text', 'text', '25000 more text', '1234567', 'more text'],
        "b_column": [1, 2, 3, 4, 5]})

def add_commas(text):
    def format_number(match):
        return "{:,}".format(int(match.group()))
    return re.sub(r'\b\d+\b', format_number, text)

df.a_column.apply(add_commas)

Output:

0          1,000 text
1                text
2    25,000 more text
3           1,234,567
4           more text
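Two edge cases of this approach may be worth checking before relying on it. The sketch below repeats the add_commas function from above so that it runs on its own. The sample inputs are made up for illustration. Converting through int() drops leading zeros, and the \b anchors skip digits that touch letters or underscores.

```python
import re

def add_commas(text):
    def format_number(match):
        # int() discards leading zeros before the comma formatting is applied
        return "{:,}".format(int(match.group()))
    return re.sub(r'\b\d+\b', format_number, text)

leading_zeros = add_commas("id 0012345")     # zeros are lost
attached_to_word = add_commas("abc1000")     # no word boundary, so no match
plain_number = add_commas("1234567")

print(leading_zeros)
print(attached_to_word)
print(plain_number)
```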

Answer 3

Another approach:

df["a_column"] = df["a_column"].str.replace(
    r"\b\d+\b", lambda g: f"{int(g.group(0)):,}", regex=True
)
print(df)

Prints:

           a_column  b_column
0        1,000 text         1
1              text         2
2  25,000 more text         3
3         1,234,567         4
4         more text         5

Answer 4

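Capturing a group is not enough; we also need to manipulate the captured value, which means we need a function as the replacement. However, DataFrame.replace does not accept a function, whereas str.replace does. Since you want to use this in a method chain, we can use assign here. Basically, add the thousands separators and assign the result back to the same column label. So you can try the following (which is essentially the same as the other answers):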

(
    df
    .reset_index()
    .assign(a_column=df['a_column'].str.replace(r"(\d+)", lambda g: f"{int(g[1]):,}", regex=True))
)

Output:

   index          a_column  b_column
0      0        1,000 text         1
1      1              text         2
2      2  25,000 more text         3
3      3         1,234,567         4
4      4         more text         5
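One possible refinement, offered as a sketch and not as the only way to do it: assign also accepts a callable, which receives the intermediate DataFrame at that point in the chain. That avoids referring back to the outer df, which matters if earlier steps filter or reorder rows. The helper name add_thousands is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a_column": ["1000 text", "text", "25000 more text", "1234567", "more text"],
    "b_column": [1, 2, 3, 4, 5],
})

def add_thousands(frame):
    # frame is the DataFrame produced by the preceding steps of the chain
    return frame["a_column"].str.replace(
        r"\d+", lambda m: f"{int(m.group(0)):,}", regex=True
    )

result = (
    df
    .reset_index()
    .assign(a_column=add_thousands)  # callable is evaluated on the chained frame
)
print(result)
```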
