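I have two DataFrames: df_geo and df_event. I want to create two new columns in df_event. The DataFrames look like the following, although other columns have been removed for simplicity: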
import pandas as pd

data_geo = [['010','00','000','00000','00000','00000','United States'],
['040','01','000','00000','00000','00000','Alabama'],
['050','01','001','00000','00000','00000','Autauga County'],
['040','02','000','00000','00000','00000','Alaska'],
['050','02','090','00000','00000','00000','Fairbanks North Star Borough'],
['162','02','000','00000','24230','00000','Fairbanks city'],
['040','09','000','00000','00000','00000','Connecticut'],
['050','09','001','00000','00000','00000','Fairfield County'],
['061','09','001','04720','00000','00000','Bethel town'],
['040','17','000','00000','00000','00000','Illinois'],
['061','17','109','05638','00000','00000','Bethel township']]
df_geo = pd.DataFrame(data_geo, columns = ['summary_level', 'state_fips','county_fips','subdivision_code_fips','place_code_fips','city_code_fips','area_name'])
df_geo.info()
RangeIndex: 43847 entries, 0 to 43846
Data columns (total 7 columns):
summary_level 43847 non-null object
state_fips 43847 non-null object
county_fips 43847 non-null object
subdivision_code_fips 43847 non-null object
place_code_fips 43847 non-null object
city_code_fips 43847 non-null object
area_name 43847 non-null object
data_event = [['Event Id','_','Alabama'],
['Event Id','_','Connecticut'],
['Event Id','Autauga County','Alabama'],
['Event Id','Fairfield County','Connecticut'],
['Event Id','Fairbanks North Star Borough','Alaska']]
df_event = pd.DataFrame(data_event, columns = ['unique_str','county','state'])
df_event.info()
RangeIndex: 1261 entries, 0 to 1260
Data columns (total 3 columns):
unique_str 1261 non-null object
county 999 non-null object
state 1261 non-null object
dtypes: object(3)
In df_event, whenever an event occurred at the state level, "_" replaces the NaN value in county.
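GOAL: Create a function that takes the state and county values from df_event as input and creates two new columns in the same DataFrame. The new columns are based on the values of state_fips and county_fips in df_geo. An example is shown below: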
inputA: map_new_col('df_geo','Connecticut','Fairfield County')
resultA = ['Event Id','Connecticut','Fairfield County','09','001']
                                                       ^ New columns
inputB: map_new_col('df_geo','Alaska','Fairbanks North Star Borough')
resultB = ['Event Id','Alaska','Fairbanks North Star Borough','02','090']
                                                              ^ New columns
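Because I also need to use this function on a list of 1,200 (and growing) events, it has to work in a lambda function or in something else that can be mapped across the whole DataFrame.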
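Ultimately, my goal is to be able to carry the same search all the way down to city_code_fips, but I can't even wrap my head around the initial search! If I get that far, I know that when searching for "Bethel town" the search terms will have to match exactly, so that "Bethel township" does not come up.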
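As a small illustration of that concern (a sketch on a made-up Series called names, not my real data), an equality comparison returns only the exact name, while a substring search also picks up the township:

```python
import pandas as pd

# Hypothetical sample of area names, for illustration only.
names = pd.Series(['Bethel town', 'Bethel township', 'Fairbanks city'])

# Exact comparison: only the identical string matches.
exact = names[names == 'Bethel town']

# Substring search: 'Bethel township' also contains 'Bethel town'.
partial = names[names.str.contains('Bethel town', regex=False)]

print(exact.tolist())
print(partial.tolist())
```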
I know this is a multi-step process, but any help is appreciated. Thank you.
Use melt and map:
df = df_geo.melt(id_vars=['state_fips','county_fips'], value_vars='area_name')
print(df)
    state_fips  county_fips   variable                         value
0           00          000  area_name                 United States
1           01          000  area_name                       Alabama
2           01          001  area_name                Autauga County
3           02          000  area_name                        Alaska
4           02          090  area_name  Fairbanks North Star Borough
5           02          000  area_name                Fairbanks city
6           09          000  area_name                   Connecticut
7           09          001  area_name              Fairfield County
8           09          001  area_name                   Bethel town
9           17          000  area_name                      Illinois
10          17          109  area_name               Bethel township
df_event['state_fips'] = df_event['state'].map(df.set_index('value')['state_fips'])
df_event['county_fips'] = df_event['county'].map(df.set_index('value')['county_fips'])
print (df_event)
   unique_str                        county        state  state_fips  county_fips
0    Event Id                             _      Alabama          01          NaN
1    Event Id                             _  Connecticut          09          NaN
2    Event Id                Autauga County      Alabama          01          001
3    Event Id              Fairfield County  Connecticut          09          001
4    Event Id  Fairbanks North Star Borough       Alaska          02          090
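Note that mapping by name alone relies on every area name being unique, which holds for this small sample but may not hold for the full 43,847-row table, where the same county name could appear under more than one state. As a hedged alternative, here is a minimal sketch that looks up the county together with its state using merge. It assumes that summary_level '040' marks state rows and '050' marks county rows, as in the sample data; the names states, counties and out are just illustrative.

```python
import pandas as pd

# Reduced sample mirroring the question's data (only the columns used here).
df_geo = pd.DataFrame(
    [['040', '01', '000', 'Alabama'],
     ['050', '01', '001', 'Autauga County'],
     ['040', '02', '000', 'Alaska'],
     ['050', '02', '090', 'Fairbanks North Star Borough'],
     ['040', '09', '000', 'Connecticut'],
     ['050', '09', '001', 'Fairfield County']],
    columns=['summary_level', 'state_fips', 'county_fips', 'area_name'])

df_event = pd.DataFrame(
    [['Event Id', '_', 'Alabama'],
     ['Event Id', 'Fairfield County', 'Connecticut'],
     ['Event Id', 'Fairbanks North Star Borough', 'Alaska']],
    columns=['unique_str', 'county', 'state'])

# Assumption: '040' rows are states and '050' rows are counties.
states = (df_geo.loc[df_geo['summary_level'] == '040',
                     ['state_fips', 'area_name']]
          .rename(columns={'area_name': 'state'}))
counties = (df_geo.loc[df_geo['summary_level'] == '050',
                       ['state_fips', 'county_fips', 'area_name']]
            .rename(columns={'area_name': 'county'}))

# Step 1: state name -> state_fips.
out = df_event.merge(states, on='state', how='left')

# Step 2: (state_fips, county name) -> county_fips.
# State-level events ('_') find no county match and get NaN.
out = out.merge(counties, on=['state_fips', 'county'], how='left')

print(out)
```

Because both steps are left merges, every event row is kept, and no lambda is needed to process the whole DataFrame. The same pattern could presumably be extended to the subdivision or city level by adding another filtered lookup table and merge key.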