Extract and combine street address and apartment number in pandas


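I'm finding it really challenging to come up with the code needed to do the following. This is a similar question, but I couldn't figure out how to adapt its code to my specific needs.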

I have a pandas DataFrame with more than 100,000 rows. This is what the address and apartment number columns currently look like:

Current DF:

import pandas as pd

temp = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'], 'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan']}
data = pd.DataFrame(temp)
data

               col1                   col2
0      220 CENTRAL STREET, 50            50
1      165 EAST 66TH ST, RESI          RESI
2        106 SPRUCE STREET, 1           nan
3         14 EAST 67TH STREET           nan
4            1131 OGEN AVENUE           nan
5     200 EAST 1ST STREET, RU            2A
6               520 PARK LANE         DPH60
7               520 PARK LANE         DPH56
8   80 BAY STREET LANDING, 1A            1A
9       520 PARK SOUTH, DPH54         DPH54
10              520 PARK LANE         DPH52
11             62 VEST STREET           21F
12          256 FLARIN AVENUE           nan


Desired DF (data1), which adds 3 new columns so that I can work at different levels of granularity later:

temp1 = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
         'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'], 
         'building_address':['220 CENTRAL STREET', '165 EAST 66TH ST', '106 SPRUCE STREET', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING', '520 PARK SOUTH', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
         'apt_no': ['50', 'RESI', '1', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
         'full_address':['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, 2A', '520 PARK LANE, DPH60', '520 PARK LANE, DPH56', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE, DPH52', '62 VEST STREET, 21F', '256 FLARIN AVENUE']}

data1 = pd.DataFrame(temp1)
data1


                col1                   col2       building_address   apt_no  \
0      220 CENTRAL STREET, 50            50     220 CENTRAL STREET     50   
1      165 EAST 66TH ST, RESI          RESI       165 EAST 66TH ST   RESI   
2        106 SPRUCE STREET, 1           nan      106 SPRUCE STREET      1   
3         14 EAST 67TH STREET           nan    14 EAST 67TH STREET    nan   
4            1131 OGEN AVENUE           nan       1131 OGEN AVENUE    nan   
5     200 EAST 1ST STREET, RU            2A    200 EAST 1ST STREET     2A   
6               520 PARK LANE         DPH60          520 PARK LANE  DPH60   
7               520 PARK LANE         DPH56          520 PARK LANE  DPH56   
8   80 BAY STREET LANDING, 1A            1A  80 BAY STREET LANDING     1A   
9       520 PARK SOUTH, DPH54         DPH54         520 PARK SOUTH  DPH54   
10              520 PARK LANE         DPH52          520 PARK LANE  DPH52   
11             62 VEST STREET           21F         62 VEST STREET    21F   
12          256 FLARIN AVENUE           nan      256 FLARIN AVENUE    nan   

                 full_address  
0      220 CENTRAL STREET, 50  
1      165 EAST 66TH ST, RESI  
2        106 SPRUCE STREET, 1  
3         14 EAST 67TH STREET  
4            1131 OGEN AVENUE  
5     200 EAST 1ST STREET, 2A  
6        520 PARK LANE, DPH60  
7        520 PARK LANE, DPH56  
8   80 BAY STREET LANDING, 1A  
9       520 PARK SOUTH, DPH54  
10       520 PARK LANE, DPH52  
11        62 VEST STREET, 21F  
12          256 FLARIN AVENUE  


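In the existing DF (data), col1 is a street address that may or may not include an apartment number. For simplicity, I assume that a value in col1 has an apartment number if it contains a comma; the part after the comma can be treated as the apartment number.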

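col2 contains only the apartment number, and some of its values are nan. In some cases, such as row 5, the apartment number in col2 ('2A') does not match the part after the comma in col1 ('RU'). In other cases, such as row 2, col2 is nan but col1 has an apartment number after the comma.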
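The two kinds of discrepancy described above can be surfaced with a quick check before deciding how to resolve them. This is only a minimal sketch on a three-row sample: the names `sample`, `after_comma`, `col2_missing`, `mismatch`, and `only_in_col1` are illustrative, and it assumes missing values in col2 are stored as the literal string 'nan', as in the sample data.

```python
import pandas as pd

sample = pd.DataFrame({
    'col1': ['200 EAST 1ST STREET, RU', '106 SPRUCE STREET, 1', '520 PARK LANE'],
    'col2': ['2A', 'nan', 'DPH60'],
})

# Part of col1 after the first comma (missing when there is no comma).
after_comma = sample['col1'].str.split(',', n=1).str[1].str.strip()
# Assumption: col2 uses the string 'nan' to mark missing values.
col2_missing = sample['col2'].eq('nan')

# Both sources are present but disagree (e.g. 'RU' vs '2A').
mismatch = after_comma.notna() & ~col2_missing & after_comma.ne(sample['col2'])
# col2 is missing but col1 carries an apartment number.
only_in_col1 = after_comma.notna() & col2_missing

print(sample[mismatch])
print(sample[only_in_col1])
```

On the full DataFrame, the same two masks would show how many rows fall into each case.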

What I want to do is add 3 new columns (as shown in the desired DF, data1):

['building_address'] will contain only what comes before the comma, so it will say '220 CENTRAL STREET' where col1 says '220 CENTRAL STREET, 50'.
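One possible way to express this rule, assuming the first comma is always the separator (the simplification stated above), is to split on that comma and keep the left-hand piece. The two-row Series here is only an illustration.

```python
import pandas as pd

col1 = pd.Series(['220 CENTRAL STREET, 50', '14 EAST 67TH STREET'])

# Split on the first comma only and keep everything before it.
# Rows without a comma are returned unchanged.
building_address = col1.str.split(',', n=1).str[0].str.strip()

print(building_address.tolist())
# ['220 CENTRAL STREET', '14 EAST 67TH STREET']
```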

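['apt_no'] will check whether col2 is nan. If it is, it will look for a value after the comma in col1 and, if one exists, use it. For example, in row 2 of data1, apt_no takes the value '1', which comes from the part after the comma in col1. If col1 has a part after the comma and col2 also has a value, and the two differ, it takes the value from col2. For example, in row 5, apt_no takes '2A' from col2 even though col1 shows 'RU' after the comma. Finally, if col1 has no comma and col2 is nan, apt_no stays nan.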
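Read as a priority rule, this amounts to "use col2 when it has a value, otherwise fall back to the part after the comma in col1". A minimal sketch of that reading follows; it assumes missing values in col2 are the string 'nan', and the names `sample` and `after_comma` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'col1': ['106 SPRUCE STREET, 1', '200 EAST 1ST STREET, RU', '14 EAST 67TH STREET'],
    'col2': ['nan', '2A', 'nan'],
})

# Part of col1 after the first comma (missing when there is no comma).
after_comma = sample['col1'].str.split(',', n=1).str[1].str.strip()

# Turn the 'nan' strings in col2 into real missing values,
# then fill those gaps from col1. col2 wins whenever it has a value.
apt_no = sample['col2'].mask(sample['col2'].eq('nan')).fillna(after_comma)

print(apt_no.tolist())
```

The three rows cover the three cases: the value comes from col1, col2 overrides col1, and neither source has an apartment number.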

['full_address'] will concatenate ['building_address'] and ['apt_no'] into one string in the format 'building_address, apt_no' (as shown above). If apt_no is nan, full_address will be the same as col1.
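The concatenation with its fallback could look something like the sketch below. It assumes apt_no holds real missing values (not the string 'nan') and relies on `Series.str.cat` returning a missing value for any row where one of the inputs is missing.

```python
import pandas as pd

building_address = pd.Series(['520 PARK LANE', '256 FLARIN AVENUE'])
apt_no = pd.Series(['DPH60', None])

# str.cat yields a missing value wherever apt_no is missing...
full_address = building_address.str.cat(apt_no, sep=', ')
# ...so those rows fall back to the building address alone.
full_address = full_address.fillna(building_address)

print(full_address.tolist())
# ['520 PARK LANE, DPH60', '256 FLARIN AVENUE']
```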

I've been struggling with this for several hours and haven't found a way to do it. Thanks for your attention.

python regex pandas street-address
1 Answer

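The code below should give you the result you want. It prefers col2 for apt_no and falls back to the part after the comma in col1, so row 5 gets '2A' rather than 'RU'. Missing apartment numbers end up as real NaN values rather than the string 'nan'.

import numpy as np  # only needed if you want np.nan elsewhere; the steps below use pandas only

# Everything before the first comma is the building address.
data['building_address'] = data['col1'].str.split(',').str[0].str.strip()
# Part of col1 after the comma (NaN when col1 has no comma).
after_comma = data['col1'].str.split(',').str[1].str.strip()
# Treat the string 'nan' in col2 as missing, then fall back to col1.
data['apt_no'] = data['col2'].mask(data['col2'].eq('nan')).fillna(after_comma)
# Join with ', ' where an apartment number exists; otherwise keep the address alone.
data['full_address'] = data['building_address'].str.cat(data['apt_no'], sep=', ')
data['full_address'] = data['full_address'].fillna(data['building_address'])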