I have a df containing 3.5 million scraped data entries (rows). One website produces clean data, the other does not. I want to replace the brand, category, and series of every company-M article whenever the same article also exists for company R. But the code takes far too long to run. What is the pandas solution with the lowest time complexity?
import pandas as pd
d = {
    'company': ['R', 'R', 'M', 'M'],
    'article': ['a', 'a', 'a', 'a'],
    'brand': ['brand1', 'brand1', 'brand1x', 'brand1x'],
    'category': ['cat', 'cat', 'catx', 'catx'],
    'series': ['series1', 'series1', 'series1x', 'series1x'],
    'price': [2, 2.2, 2.25, 2.25],
    'date': ['2023-11-05', '2023-11-11', '2023-11-05', '2023-11-11'],
}
df = pd.DataFrame(data=d)
#create a new df containing only company R-data without duplicates
selected_columns = df[df['company'] == 'R'][['article', 'brand', 'category', 'series']].drop_duplicates()
#Function to update brand, category and series if the article exists in company R's data
def update_values(row):
    article = row['article']
    if article in selected_columns['article'].values:
        selected_row = selected_columns[selected_columns['article'] == article]
        row['brand'] = selected_row['brand'].values[0]
        row['category'] = selected_row['category'].values[0]
        row['series'] = selected_row['series'].values[0]
    return row
#Apply the function to the company-M part
new = df[df['company'] == 'M'].apply(update_values, axis=1)
#Keep the company-R part
df = df[df['company'] == 'R']
#Concatenate both parts
df = pd.concat([new, df])
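For reference on why this is slow: the row-wise `apply` rescans `selected_columns['article'].values` for every M row, which is roughly O(n·m) over 3.5 million rows. A minimal sketch (using the toy frame above, without the price/date columns) of the same replacement with hashed per-column lookups via `Series.map`, which is O(n) overall:

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['R', 'R', 'M', 'M'],
    'article': ['a', 'a', 'a', 'a'],
    'brand': ['brand1', 'brand1', 'brand1x', 'brand1x'],
    'category': ['cat', 'cat', 'catx', 'catx'],
    'series': ['series1', 'series1', 'series1x', 'series1x'],
})

# One clean row per article from company R, indexed by article
ref = (df[df['company'] == 'R']
       .drop_duplicates('article')
       .set_index('article'))

m = df['company'] == 'M'
for col in ['brand', 'category', 'series']:
    # map() does a hashed lookup per article; fillna keeps rows
    # whose article has no R counterpart unchanged
    df.loc[m, col] = df.loc[m, 'article'].map(ref[col]).fillna(df.loc[m, col])
```

This keeps the original row order and only touches the three columns to be overwritten.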
Use merge and fillna:
out = (df[['company', 'article']]
       .merge(selected_columns.assign(company='M'), how='left')
       .reindex_like(df)
       .fillna(df)
)

Since merge with how='left' joins on all shared column names (here company and article), each M row is matched to its R lookup row per article; the missing columns (price, date, and the R rows' brand/category/series) are then restored from the original df by fillna.
Output:
company article brand category series price date
0 R a brand1 cat series1 2.0 2023-11-05
1 R a brand1 cat series1 2.2 2023-11-11
2 M a brand1 cat series1 2.25 2023-11-05
3 M a brand1 cat series1 2.25 2023-11-11
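One detail worth checking: fillna also guarantees that an M article with no R counterpart keeps its original values. A self-contained sketch of that edge case (using a hypothetical second article 'b' that only company M sells, and a single brand column to keep it short):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['R', 'R', 'M', 'M'],
    'article': ['a', 'a', 'b', 'b'],   # 'b' has no company-R counterpart
    'brand': ['brand1', 'brand1', 'brand1x', 'brand1x'],
})

# Clean lookup rows from company R, without duplicates
ref = df[df['company'] == 'R'][['article', 'brand']].drop_duplicates()

out = (df[['company', 'article']]
       .merge(ref.assign(company='M'), how='left')   # joins on company AND article
       .reindex_like(df)
       .fillna(df))                                  # unmatched rows fall back to df
```

Here the M rows for article 'b' find no match in the lookup, get NaN from the left merge, and are filled back with their original 'brand1x' by fillna.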