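I posted this question before, but I have added some new comments. I have a large dataframe, and I would like to know how to concatenate cells with different values into one cell. The dataframe is given below. DF1, with Data and Name as headers: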
Data, Name
Address State1, Name1
Household = 1, Name1
1012 Address 123 City, Name1
1013 Address Zip 12345, Name1
1012 Address 234 City, Name1
1013 Address Zip 23456, Name1
Address State2, Name2
Household = 2, Name2
1012 Address 345 City, Name2
1013 Address Zip 34567, Name2
1012 Address 456 City, Name2
1013 Address Zip 45678, Name2
.......... dataframe repeats with different values for 10,000+ lines
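The 1012 and 1013 lines form a sequence that repeats a varying number of times. I can't just use a groupby function, because the values in the 1012 and 1013 cells change. I am trying to combine Address, Household, 1012..., 1013... into one cell. DFOut: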
Data, Name
Address State1 Household = 1 1012 Address 123 City 1013 Address Zip 12345, Name1
Address State1 Household = 1 1012 Address 234 City 1013 Address Zip 23456, Name1
Address State2 Household = 2 1012 Address 345 City 1013 Address Zip 34567, Name2
Address State2 Household = 2 1012 Address 456 City 1013 Address Zip 45678, Name2
..... repeats for entire dataframe 10,000+ lines in DF1
Alternatively, the Data column in DFOut could also be split into separate columns:
Data, Number, Seq, Seq1, Name
Address State1, Household = 1, 1012 Address 123 City, 1013 Address Zip 12345, Name1
Address State1, Household = 1, 1012 Address 234 City, 1013 Address Zip 23456, Name1
Address State2, Household = 2, 1012 Address 345 City, 1013 Address Zip 34567, Name2
Address State2, Household = 2, 1012 Address 456 City, 1013 Address Zip 45678, Name2
..... repeats for entire dataframe 10,000+ lines in DF1
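I tried using some for loops to search the Data column and then concatenate the different values into one column, but after doing so I lose the Name column values that go with them, and I don't know why. I am fairly new to Python, so any help would be much appreciated. Thanks in advance.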
I tried using str.match to locate the address text.
import pandas as pd

# generating mock data:
col_data = ['Address State1','Household = 1','1012 Address 123 City',
            '1013 Address Zip 12345','1012 Address 234 City','1013 Address Zip 23456',
            'Address State2','Household = 2', '1012 Address 345 City',
            '1013 Address Zip 34567','1012 Address 456 City','1013 Address Zip 45678']
col_name = ['Name1','Name1','Name1','Name1','Name1','Name1',
            'Name2','Name2','Name2','Name2','Name2','Name2']
df = pd.DataFrame({'Data': col_data, 'Name': col_name})

# solution:
# copy each row's text into the column matching its prefix
# (str.match anchors at the start of the string)
df.loc[df['Data'].str.match('Address'), 'Address'] = df['Data']
df.loc[df['Data'].str.match('Household'), 'Household'] = df['Data']
df.loc[df['Data'].str.match('1012 Address'), '1012 Address'] = df['Data']
df.loc[df['Data'].str.match('1013 Address'), '1013 Address'] = df['Data']
# forward-fill the values that apply to the rows below them
df['Address'] = df['Address'].ffill()
df['Household'] = df['Household'].ffill()
df['1012 Address'] = df['1012 Address'].ffill()
# '1013 Address' is not filled, so dropna() keeps only the 1013 rows;
# the original 'Data' column is no longer needed
df = df.dropna().drop(columns='Data')
The result is:
Name Address Household 1012 Address 1013 Address
3 Name1 Address State1 Household = 1 1012 Address 123 City 1013 Address Zip 12345
5 Name1 Address State1 Household = 1 1012 Address 234 City 1013 Address Zip 23456
9 Name2 Address State2 Household = 2 1012 Address 345 City 1013 Address Zip 34567
11 Name2 Address State2 Household = 2 1012 Address 456 City 1013 Address Zip 45678
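If you want the single-cell form from the question instead of separate columns, the pieces can be joined back together afterwards. Below is a minimal sketch along the same lines, not a definitive implementation. It assumes the four prefixes shown in the question, and it uses `str.startswith` with `where` in place of `str.match`. The names `prefixes`, `wide` and `df_out` are made up for illustration.

```python
import pandas as pd

# Mock data in the same shape as the question's DF1.
data = ['Address State1', 'Household = 1',
        '1012 Address 123 City', '1013 Address Zip 12345',
        '1012 Address 234 City', '1013 Address Zip 23456',
        'Address State2', 'Household = 2',
        '1012 Address 345 City', '1013 Address Zip 34567',
        '1012 Address 456 City', '1013 Address Zip 45678']
df = pd.DataFrame({'Data': data, 'Name': ['Name1'] * 6 + ['Name2'] * 6})

# Route each row's text into the column named after its prefix;
# rows that do not start with that prefix become NaN in that column.
prefixes = ['Address', 'Household', '1012 Address', '1013 Address']
wide = pd.DataFrame({p: df['Data'].where(df['Data'].str.startswith(p))
                     for p in prefixes})

# Carry the last seen Address / Household / 1012 value forward, then keep
# only the rows that hold a 1013 value (one per output record).
wide[prefixes[:3]] = wide[prefixes[:3]].ffill()
wide = wide.dropna()

# Join the four pieces into a single 'Data' cell and reattach the names.
df_out = pd.DataFrame({'Data': wide.agg(' '.join, axis=1),
                       'Name': df.loc[wide.index, 'Name']})
print(df_out.to_string())
```

The surviving rows keep their original index labels, so the names are looked up by label rather than by position.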
If you know that the same fields always appear in the same order, you can do something like this with numpy reshape:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']})
to_reshape = np.array(df['Data'])
reshaped = to_reshape.reshape((2, 3))  # 2 records of 3 fields each
df = pd.DataFrame(data=reshaped, columns=['1', '2', '3'])
print(df)
#     1   2   3
# 0  a1  a2  a3
# 1  b1  b2  b3
You can then add the name column. To know how many rows there are, you can count the unique names.
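To make the "add the name column" step concrete, here is a rough sketch. It assumes every record spans exactly three consecutive rows and that all rows of a record share one name, so it only applies when the record length is fixed. The `Name` values and the names `fields_per_record`, `n_records` and `out` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each record is exactly 3 consecutive rows,
# and every row of a record carries the same name (assumption).
df = pd.DataFrame({
    'Data': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3'],
    'Name': ['Name1', 'Name1', 'Name1', 'Name2', 'Name2', 'Name2'],
})

fields_per_record = 3                      # assumed fixed record length
n_records = len(df) // fields_per_record   # same as df['Name'].nunique() here

# One row per record, one column per field.
reshaped = df['Data'].to_numpy().reshape((n_records, fields_per_record))
out = pd.DataFrame(reshaped, columns=['1', '2', '3'])

# Take the name from the first row of each record.
out['Name'] = df['Name'].to_numpy()[::fields_per_record]
print(out)
```

Deriving the row count from the frame length avoids hard-coding it, and it matches the unique-name count only when each name covers exactly one record.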
Since there are only 10,000 rows, you can use this loop:
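DFOut = DF1.copy()
j = ''  # last 'Address State' line seen
k = ''  # last 'Household' line seen
l = ''  # last 'City' (1012) line seen
row = 0
# the column is named 'Data,' (with a trailing comma) in this frame;
# work on a copy named 'Data'
DFOut['Data'] = DFOut['Data,']
for i in DFOut['Data']:
    row += 1
    if 'Address State' in i:
        j = i
    elif 'Household' in i:
        k = i
    elif 'City' in i:
        l = i
    elif 'Zip' in i:
        # relies on the default 0..n-1 index, so row - 1 is this row's label
        DFOut.loc[row - 1, 'Data'] = f'{j} {k} {l} {i}'.replace(',', '')
# keep only the combined rows, i.e. the ones that end with the Zip line
DFOut = DFOut.loc[DFOut['Data'].str.contains('Zip'), ['Data', 'Name']]
DFOut = DFOut.rename({'Data' : 'Data,'}, axis=1)
DFOut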
Output:
Data, Name
3 Address State1 Household = 1 1012 Address 123 ... Name1
5 Address State1 Household = 1 1012 Address 234 ... Name1
9 Address State2 Household = 2 1012 Address 345 ... Name2
11 Address State2 Household = 2 1012 Address 456 ... Name2