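I'm really struggling to come up with the code needed to do the following. This question is similar, but I couldn't figure out how to adapt its code to my specific needs.
I have a pandas DataFrame with more than 100,000 rows. This is what the current address and apartment number format looks like:
Current DF: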
import pandas as pd
temp = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
        'col2': ['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan']}
data = pd.DataFrame(temp)
data
                         col1   col2
0      220 CENTRAL STREET, 50     50
1      165 EAST 66TH ST, RESI   RESI
2        106 SPRUCE STREET, 1    nan
3         14 EAST 67TH STREET    nan
4            1131 OGEN AVENUE    nan
5     200 EAST 1ST STREET, RU     2A
6               520 PARK LANE  DPH60
7               520 PARK LANE  DPH56
8   80 BAY STREET LANDING, 1A     1A
9       520 PARK SOUTH, DPH54  DPH54
10              520 PARK LANE  DPH52
11             62 VEST STREET    21F
12          256 FLARIN AVENUE    nan
Desired DF (data1), which adds 3 new columns so that different levels of granularity can be used later:
temp1 = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'building_address':['220 CENTRAL STREET', '165 EAST 66TH ST', '106 SPRUCE STREET', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING', '520 PARK SOUTH', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'apt_no': ['50', 'RESI', '1', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'full_address':['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, 2A', '520 PARK LANE, DPH60', '520 PARK LANE, DPH56', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE, DPH52', '62 VEST STREET, 21F', '256 FLARIN AVENUE']}
data1 = pd.DataFrame(temp1)
data1
                         col1   col2       building_address  apt_no  \
0      220 CENTRAL STREET, 50     50     220 CENTRAL STREET      50
1      165 EAST 66TH ST, RESI   RESI       165 EAST 66TH ST    RESI
2        106 SPRUCE STREET, 1    nan      106 SPRUCE STREET       1
3         14 EAST 67TH STREET    nan    14 EAST 67TH STREET     nan
4            1131 OGEN AVENUE    nan       1131 OGEN AVENUE     nan
5     200 EAST 1ST STREET, RU     2A    200 EAST 1ST STREET      2A
6               520 PARK LANE  DPH60          520 PARK LANE   DPH60
7               520 PARK LANE  DPH56          520 PARK LANE   DPH56
8   80 BAY STREET LANDING, 1A     1A  80 BAY STREET LANDING      1A
9       520 PARK SOUTH, DPH54  DPH54         520 PARK SOUTH   DPH54
10              520 PARK LANE  DPH52          520 PARK LANE   DPH52
11             62 VEST STREET    21F         62 VEST STREET     21F
12          256 FLARIN AVENUE    nan      256 FLARIN AVENUE     nan

                 full_address
0      220 CENTRAL STREET, 50
1      165 EAST 66TH ST, RESI
2        106 SPRUCE STREET, 1
3         14 EAST 67TH STREET
4            1131 OGEN AVENUE
5     200 EAST 1ST STREET, 2A
6        520 PARK LANE, DPH60
7        520 PARK LANE, DPH56
8   80 BAY STREET LANDING, 1A
9       520 PARK SOUTH, DPH54
10       520 PARK LANE, DPH52
11        62 VEST STREET, 21F
12          256 FLARIN AVENUE
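In the existing DF (data), col1 is the street address, which may or may not include an apartment number. For simplicity, I'm assuming that if there is a comma, the value in col1 includes an apartment number, and the part after the comma can be treated as that apartment number.
col2 contains only the apartment number, and it has nan values. In some cases, such as row 5, the apartment number in col2 ('2A') does not match the part after the comma in col1 ('RU'). In other cases, such as row 2, col2 is nan while col1 does have an apartment number after the comma.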
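To make the two awkward cases concrete, here is a minimal sketch that flags them on a few of the rows above. The names `sample`, `suffix`, `conflict` and `only_col1` are made up for illustration, and the sketch assumes missing values in col2 are stored as the string 'nan', as in the sample data.

```python
import pandas as pd

# A few rows taken from the sample data above
sample = pd.DataFrame({
    'col1': ['220 CENTRAL STREET, 50', '106 SPRUCE STREET, 1',
             '200 EAST 1ST STREET, RU', '520 PARK LANE'],
    'col2': ['50', 'nan', '2A', 'DPH60'],
})

# Text after the first comma in col1 (NaN when col1 has no comma)
suffix = sample['col1'].str.split(',', n=1).str[1].str.strip()

# Both sources give an apartment number, but the two values differ
conflict = suffix.notna() & (sample['col2'] != 'nan') & (suffix != sample['col2'])

# Only col1 carries an apartment number; col2 is the string 'nan'
only_col1 = suffix.notna() & (sample['col2'] == 'nan')

print(sample[conflict])
print(sample[only_col1])
```

On this sample, only the '200 EAST 1ST STREET, RU' row is flagged as a conflict, and only the '106 SPRUCE STREET, 1' row falls into the second case.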
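What I want to do is add 3 new columns (as shown in the desired DF, data1):
['building_address'] will essentially contain everything before the comma, so it will say '220 CENTRAL STREET' where col1 says '220 CENTRAL STREET, 50'.
['apt_no'] will check whether col2 is nan. If it is, it will look for a value after the comma in col1 and, if one is found, use that value. For example, in row 2 of data1, apt_no takes the value '1', obtained from the part after the comma in col1. It will also check whether there is a part after the comma in col1 and a value in col2, and if the two differ, it will take the value from col2. For example, in row 5, even though col1 shows 'RU' after the comma, apt_no takes the value '2A' from col2. Finally, if there is no comma in col1 and col2 is nan, 'apt_no' stays nan.
['full_address'] Finally, 'full_address' will concatenate ['building_address'] and ['apt_no'] into a single string in the format 'building_address, apt_no' (as shown above). If 'apt_no' is nan, 'full_address' will be the same as 'col1'.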
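Written out for a single row, the three rules above amount to roughly the following plain-Python sketch. `derive_address_parts` is a hypothetical helper name; it uses `None` for a missing apartment number and assumes col2 marks missing values with the string 'nan'. It is not vectorized, so it only pins down the expected behavior and is not a practical approach for 100,000+ rows.

```python
def derive_address_parts(col1, col2):
    """Return (building_address, apt_no, full_address) for one row.

    Hypothetical helper for illustration; apt_no is None when no
    apartment number can be found in either column.
    """
    # Everything before the first comma is the building address
    building, _, suffix = col1.partition(',')
    building = building.strip()
    suffix = suffix.strip()

    if col2 != 'nan':
        apt_no = col2        # col2 wins whenever it holds a value
    elif suffix:
        apt_no = suffix      # otherwise fall back to the part after the comma
    else:
        apt_no = None        # no apartment number anywhere

    # Without an apartment number, full_address is just col1
    full = f'{building}, {apt_no}' if apt_no is not None else col1
    return building, apt_no, full


print(derive_address_parts('200 EAST 1ST STREET, RU', '2A'))
print(derive_address_parts('106 SPRUCE STREET, 1', 'nan'))
print(derive_address_parts('14 EAST 67TH STREET', 'nan'))
```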
I've been struggling with this for hours and still haven't found a way to do it. Thanks for taking a look.
The code below should give you the result you want. Missing apartment numbers end up as null in apt_no, to match your desired output.
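import numpy as np
# Split col1 at the first comma: building address before it, apartment suffix (if any) after it
parts = data['col1'].str.split(',', n=1)
data['building_address'] = parts.str[0].str.strip()
suffix = parts.str[1].str.strip()  # NaN where col1 has no comma
# col2 stores missing values as the string 'nan'; turn those into real nulls
col2 = data['col2'].where(data['col2'] != 'nan')
# col2 takes priority; fall back to the suffix from col1 only when col2 is missing
data['apt_no'] = col2.fillna(suffix)
# Append ', apt_no' only where an apartment number exists
data['full_address'] = data['building_address'] + (', ' + data['apt_no']).fillna('')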
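As an alternative sketch, the building address and the optional suffix can also be pulled out in one regular-expression pass with `Series.str.extract`. The frame `rows` and the group names `building` and `suffix` are made up for illustration, and the pattern assumes the first comma always separates the address from the apartment number, as stated in the question.

```python
import pandas as pd

# A few rows from the question, enough to cover every case
rows = pd.DataFrame({
    'col1': ['220 CENTRAL STREET, 50', '106 SPRUCE STREET, 1',
             '200 EAST 1ST STREET, RU', '520 PARK LANE', '256 FLARIN AVENUE'],
    'col2': ['50', 'nan', '2A', 'DPH60', 'nan'],
})

# "building" is everything before the first comma; "suffix" is the optional
# remainder and comes back as NaN when col1 has no comma
parts = rows['col1'].str.extract(r'^(?P<building>[^,]+)(?:,(?P<suffix>.*))?$')
building = parts['building'].str.strip()
suffix = parts['suffix'].str.strip()

# Treat the string 'nan' in col2 as missing, then prefer col2 over the suffix
apt_no = rows['col2'].where(rows['col2'] != 'nan').fillna(suffix)

# Append ', apt_no' only where an apartment number exists
full_address = building + (', ' + apt_no).fillna('')
print(full_address.tolist())
```

Whether this is faster than splitting on the comma at 100,000+ rows is something to measure rather than assume. The main appeal is that the parsing rule lives in a single pattern.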