How to replace a substring in one column with the values of another column, vectorized - pandas

Question  Votes: 0  Answers: 1

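I have a DataFrame in which one column (XML) holds xml strings and another column (ICCID) holds the values that should replace a substring in the xml column of each row. I would like to vectorize this if possible, so I tried the following code: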

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.expand_frame_repr', False)

columns = ['MSISDN', 'XML', 'ICCID', 'IMSI']
data = [['0123456789', '<subscriberInfo><msisdn>0123456789</msisdn><iccId>12345678998765432100</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410123456789123456, 112233445566778],
['9876543210', '<subscriberInfo><msisdn>9876543210</msisdn><iccId>98765432112365478900</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410987654321987456, 228024357302211],
['0123987456', '<subscriberInfo><msisdn>0123987456</msisdn><iccId>98765432198765432100</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>', 89410987654321098765, 228024357302212]]

df = pd.DataFrame(data=data, columns=columns)
df['NEW_XML'] = df['XML'].replace(to_replace=[r'<iccId>\d{20}</iccId>'], value=[fr'<iccId>{df["ICCID"]}</iccId>'], regex=True)

This approach does not work: the target substring in every xml string is replaced with the pandas text representation of the entire ICCID Series, as shown below:

0    89410123456789123456\n1    89410987654321987456\n2    89410987654321098765\nName: ICCID, dtype: object
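
The cause can be reproduced in a few lines. An f-string is evaluated once, before replace is ever called, so interpolating a Series formats the whole Series as text instead of one value per row. The two-row Series below is made up for illustration and is not the data from the question:

```python
import pandas as pd

# Hypothetical two-row Series standing in for df["ICCID"].
iccid = pd.Series(["89410123456789123456", "89410987654321987456"], name="ICCID")

# The f-string is built once, up front: {iccid} becomes str(iccid),
# i.e. the printed form of the whole Series (index, values, footer).
replacement = f"<iccId>{iccid}</iccId>"

print(type(replacement))   # a plain Python str, not a Series
print(replacement)
```

Because replace receives that single string, every row gets the same text, which is why a per-row value has to be supplied some other way.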

The expected result for the NEW_XML column is as follows:

NEW_XML
'<subscriberInfo><msisdn>0123456789</msisdn><iccId>89410123456789123456</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
'<subscriberInfo><msisdn>9876543210</msisdn><iccId>89410987654321987456</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
'<subscriberInfo><msisdn>0123987456</msisdn><iccId>89410987654321098765</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>'
pandas dataframe replace substring
1 answer
0 votes

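You can use str.extract to split each xml string into the part before the ICCID digits, the digits themselves, and the part after them, then overwrite the digits with the ICCID column and concatenate the pieces: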

pat = r'(?P<before>.*<iccId>)(?P<iccid>\d{20})(?P<after></iccId>.*)'
xml = df['XML'].str.extract(pat).assign(iccid=df['ICCID'].astype(str))
df['NEW_XML'] = xml['before'] + xml['iccid'] + xml['after']

Output:

>>> df['NEW_XML']
0    <subscriberInfo><msisdn>0123456789</msisdn><iccId>89410123456789123456</iccId><imsi>112233445566778</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
1    <subscriberInfo><msisdn>9876543210</msisdn><iccId>89410987654321987456</iccId><imsi>998877665544332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
2    <subscriberInfo><msisdn>0123987456</msisdn><iccId>89410987654321098765</iccId><imsi>665544998877332</imsi><serviceProviderId>x</serviceProviderId><paymentType>POSTPAID</paymentType></subscriberInfo>
Name: NEW_XML, dtype: object
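
As a possible alternative, and not part of the answer above, the same result can be sketched with a plain row-wise re.sub in a list comprehension. This is not truly vectorized, but string methods in pandas loop in Python anyway, so it is often comparable in speed. The shortened xml strings and the name ICCID_TAG are assumptions made for this example:

```python
import re
import pandas as pd

# Shortened, made-up rows in the same shape as the question's data.
df = pd.DataFrame({
    "XML": [
        "<subscriberInfo><msisdn>0123456789</msisdn><iccId>12345678998765432100</iccId></subscriberInfo>",
        "<subscriberInfo><msisdn>9876543210</msisdn><iccId>98765432112365478900</iccId></subscriberInfo>",
    ],
    "ICCID": ["89410123456789123456", "89410987654321987456"],
})

# Compile once; matches the whole <iccId> element with its 20 digits.
ICCID_TAG = re.compile(r"<iccId>\d{20}</iccId>")

# Pair each xml string with its own ICCID and substitute row by row.
# A function is used as the replacement so the new value is inserted
# literally, without re.sub interpreting backslashes or group references.
df["NEW_XML"] = [
    ICCID_TAG.sub(lambda m, value=value: f"<iccId>{value}</iccId>", xml, count=1)
    for xml, value in zip(df["XML"], df["ICCID"].astype(str))
]

print(df["NEW_XML"].tolist())
```

Unlike the str.extract version, rows whose xml does not match the pattern are left unchanged here rather than becoming NaN.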