I have a df containing 3.5 million scraped data entries (rows). One website produces clean data, the other does not. I want to replace the brand, category, and series of every company-M article whenever the same article also exists for company R. But the code takes far too long to run. What is the pandas solution with the lowest time complexity?
import pandas as pd
d = {
    'company': ['R', 'R', 'M', 'M'],
    'article': ['a', 'a', 'a', 'a'],
    'brand': ['brand1', 'brand1', 'brand1x', 'brand1x'],
    'category': ['cat', 'cat', 'catx', 'catx'],
    'series': ['series1', 'series1', 'series1x', 'series1x'],
    'price': [2, 2.2, 2.25, 2.25],
    'date': ['2023-11-05', '2023-11-11', '2023-11-05', '2023-11-11'],
}
df = pd.DataFrame(data=d)
#create a new df containing only company R-data without duplicates
selected_columns = df[df['company'] == 'R'][['article', 'brand', 'category', 'series']].drop_duplicates()
#Function to update brand, category and series if the article exists in company R's data
def update_values(row):
    article = row['article']
    if article in selected_columns['article'].values:
        selected_row = selected_columns[selected_columns['article'] == article]
        row['brand'] = selected_row['brand'].values[0]
        row['category'] = selected_row['category'].values[0]
        row['series'] = selected_row['series'].values[0]
    return row
#Apply the function to the company-M part
new = df[df['company'] == 'M'].apply(update_values, axis=1)
#Keep the company-R part
df = df[df['company'] == 'R']
#Concatenate both parts
df = pd.concat([new, df])
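For reference on why this is slow: the row-wise `apply` rescans `selected_columns['article'].values` for every M row, which is roughly O(n·m) over 3.5 million rows. A minimal sketch (using the toy frame above, without the price/date columns) of the same replacement with hashed per-column lookups via `Series.map`, which is O(n) overall:

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['R', 'R', 'M', 'M'],
    'article': ['a', 'a', 'a', 'a'],
    'brand': ['brand1', 'brand1', 'brand1x', 'brand1x'],
    'category': ['cat', 'cat', 'catx', 'catx'],
    'series': ['series1', 'series1', 'series1x', 'series1x'],
})

# One clean row per article from company R, indexed by article
ref = (df[df['company'] == 'R']
       .drop_duplicates('article')
       .set_index('article'))

m = df['company'] == 'M'
for col in ['brand', 'category', 'series']:
    # map() does a hashed lookup per article; fillna keeps rows
    # whose article has no R counterpart unchanged
    df.loc[m, col] = df.loc[m, 'article'].map(ref[col]).fillna(df.loc[m, col])
```

This keeps the original row order and only touches the three columns to be overwritten.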
Use merge and fillna:
out = (df[['company', 'article']]
       .merge(selected_columns.assign(company='M'), how='left')
       .reindex_like(df)
       .fillna(df)
)

Since merge with how='left' joins on all shared column names (here company and article), each M row is matched to its R lookup row per article; the missing columns (price, date, and the R rows' brand/category/series) are then restored from the original df by fillna.
Output:
company article brand category series price date
0 R a brand1 cat series1 2.0 2023-11-05
1 R a brand1 cat series1 2.2 2023-11-11
2 M a brand1 cat series1 2.25 2023-11-05
3 M a brand1 cat series1 2.25 2023-11-11
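One detail worth checking: fillna also guarantees that an M article with no R counterpart keeps its original values. A self-contained sketch of that edge case (using a hypothetical second article 'b' that only company M sells, and a single brand column to keep it short):

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['R', 'R', 'M', 'M'],
    'article': ['a', 'a', 'b', 'b'],   # 'b' has no company-R counterpart
    'brand': ['brand1', 'brand1', 'brand1x', 'brand1x'],
})

# Clean lookup rows from company R, without duplicates
ref = df[df['company'] == 'R'][['article', 'brand']].drop_duplicates()

out = (df[['company', 'article']]
       .merge(ref.assign(company='M'), how='left')   # joins on company AND article
       .reindex_like(df)
       .fillna(df))                                  # unmatched rows fall back to df
```

Here the M rows for article 'b' find no match in the lookup, get NaN from the left merge, and are filled back with their original 'brand1x' by fillna.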