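How to group or concatenate different cells by changing values in Python

Question   Votes: 0   Answers: 1

I posted this question before, but I have added some new comments. I have a large DataFrame and want to know how to join cells with different values into a single cell. The given DataFrame is shown below. DF1, with Data and Name as headers: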

    Data,                          Name
    Address State1,                Name1
    Household = 1,                 Name1
    1012 Address 123 City,         Name1
    1013 Address Zip 12345,        Name1
    1012 Address 234 City,         Name1
    1013 Address Zip 23456,        Name1
    Address State2,                Name2
    Household = 2,                 Name2
    1012 Address 345 City,         Name2
    1013 Address Zip 34567,        Name2
    1012 Address 456 City,         Name2
    1013 Address Zip 45678,        Name2
    .......... dataframe repeats with different values for 10,000+ lines

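The 1012 and 1013 rows form a sequence that repeats X times with different values. I can't just use a groupby, because the values in the 1012 and 1013 cells change. I am trying to merge Address, Household, 1012..., 1013... into a single cell. DFOut: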

    Data,                                                                                        Name
    Address State1   Household = 1   1012 Address 123 City        1013 Address Zip 12345,        Name1
    Address State1   Household = 1   1012 Address 234 City        1013 Address Zip 23456,        Name1
    Address State2   Household = 2   1012 Address 345 City        1013 Address Zip 34567,        Name2
    Address State2   Household = 2   1012 Address 456 City        1013 Address Zip 45678,        Name2
    ..... repeats for entire dataframe 10,000+ lines in DF1

Alternatively, the Data column in DFOut could be split into separate columns:

    Data,            Number,         Seq,                         Seq1,                          Name
    Address State1,  Household = 1,  1012 Address 123 City,       1013 Address Zip 12345,        Name1
    Address State1,  Household = 1,  1012 Address 234 City,       1013 Address Zip 23456,        Name1
    Address State2,  Household = 2,  1012 Address 345 City,       1013 Address Zip 34567,        Name2
    Address State2,  Household = 2,  1012 Address 456 City,       1013 Address Zip 45678,        Name2
    ..... repeats for entire dataframe 10,000+ lines in DF1

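I tried searching the Data column with some for loops and concatenating the different values into one column, but after doing that I lose the Name column that goes with each value, and I don't know why. I'm fairly new to Python, and any help would be much appreciated. Thanks in advance.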

python python-3.x excel pandas csv
1 Answer
0
votes

I tried using str.match to locate the address text.

import pandas as pd

# generating mock data:
col_data = ['Address State1','Household = 1','1012 Address 123 City', 
        '1013 Address Zip 12345','1012 Address 234 City','1013 Address Zip 23456',
        'Address State2','Household = 2', '1012 Address 345 City',
        '1013 Address Zip 34567','1012 Address 456 City','1013 Address Zip 45678']
col_name = ['Name1','Name1','Name1','Name1','Name1','Name1','Name2' ,'Name2','Name2' ,'Name2' ,'Name2','Name2']
df = pd.DataFrame({'Data': col_data, 'Name':col_name})
# solution: copy each row's text into the column named after its prefix
# (str.match anchors at the start, so 'Address' does not match '1012 Address ...')
df.loc[df['Data'].str.match('Address'), 'Address'] = df['Data']
df.loc[df['Data'].str.match('Household'), 'Household'] = df['Data']
df.loc[df['Data'].str.match('1012 Address'), '1012 Address'] = df['Data']
df.loc[df['Data'].str.match('1013 Address'), '1013 Address'] = df['Data']

# forward-fill every field except the last, so only the '1013' rows end up complete
df['Address'] = df['Address'].ffill()
df['Household'] = df['Household'].ffill()
df['1012 Address'] = df['1012 Address'].ffill()

# keep the complete rows and drop the raw column to get the table shown below
df = df.dropna().drop(columns='Data')

The result is:

     Name         Address      Household           1012 Address              1013 Address  
3   Name1  Address State1  Household = 1  1012 Address 123 City    1013 Address Zip 12345  
5   Name1  Address State1  Household = 1  1012 Address 234 City    1013 Address Zip 23456     
9   Name2  Address State2  Household = 2  1012 Address 345 City    1013 Address Zip 34567     
11  Name2  Address State2  Household = 2  1012 Address 456 City    1013 Address Zip 45678     
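
If the single-cell layout from the question is preferred, the same prefix-and-forward-fill idea can be written a little more compactly and the fields joined at the end. The following is only a sketch under assumptions: the column names Data, Number, Seq and Seq1 are borrowed from the question's second layout, the three-space separator is arbitrary, and every record is assumed to end with a row starting with "1013 Address".

```python
import pandas as pd

col_data = ['Address State1', 'Household = 1', '1012 Address 123 City',
            '1013 Address Zip 12345', '1012 Address 234 City', '1013 Address Zip 23456',
            'Address State2', 'Household = 2', '1012 Address 345 City',
            '1013 Address Zip 34567', '1012 Address 456 City', '1013 Address Zip 45678']
col_name = ['Name1'] * 6 + ['Name2'] * 6
df = pd.DataFrame({'Data': col_data, 'Name': col_name})

# Output column -> prefix that identifies its rows (names taken from the question)
patterns = {'Data': 'Address', 'Number': 'Household',
            'Seq': '1012 Address', 'Seq1': '1013 Address'}

wide = pd.DataFrame({'Name': df['Name']})
for column, pattern in patterns.items():
    # Keep the text where the row starts with the prefix, NaN everywhere else
    wide[column] = df['Data'].where(df['Data'].str.match(pattern))

# Carry the first three fields forward; Seq1 stays sparse and marks complete records
fill_cols = ['Data', 'Number', 'Seq']
wide[fill_cols] = wide[fill_cols].ffill()
wide = wide.dropna(subset=['Seq1']).reset_index(drop=True)

# Single-cell layout: join the four fields into one string per record
single = pd.DataFrame({
    'Data': wide[['Data', 'Number', 'Seq', 'Seq1']].agg('   '.join, axis=1),
    'Name': wide['Name'],
})
print(single)
```

Here `wide` corresponds to the split-column layout and `single` to the one-cell layout, and the Name column is kept in both because it is never separated from its row.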

0
votes

If you know that the same fields always appear in the same order, you can do something like this with numpy reshape.


import numpy as np
import pandas as pd

df = pd.DataFrame({'Data': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']})
to_reshape = np.array(df['Data'])
# 2 records of 3 fields each
reshaped = to_reshape.reshape((2, 3))
df = pd.DataFrame(data=reshaped, columns=['1', '2', '3'])
print(df)

>>>     1   2   3
>>> 0  a1  a2  a3
>>> 1  b1  b2  b3

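Then you can add the Name column. To find out how many rows there will be, you can count the unique names (this only works if each name has exactly one record).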
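
As a rough illustration of that last step, the sketch below reshapes the toy data and attaches a Name column. It assumes every record has exactly `n_fields` rows and that each name owns exactly one record; the names 'A' and 'B' and the variable `n_fields` are made up for the example. The data in the question does not satisfy these assumptions as posted, since Address and Household appear once per name while the 1012/1013 pair repeats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Data': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3'],
    'Name': ['A', 'A', 'A', 'B', 'B', 'B'],   # hypothetical names
})

n_fields = 3                       # rows per record (assumed constant)
n_rows = df['Name'].nunique()      # one record per name (assumed)

reshaped = df['Data'].to_numpy().reshape((n_rows, n_fields))
out = pd.DataFrame(reshaped, columns=['1', '2', '3'])

# Each record's name is the name on its first row
out['Name'] = df['Name'].iloc[::n_fields].to_numpy()
print(out)
```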


0
votes

Since there are only 10,000 rows, you can use this loop.

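# Assumes DF1 already exists, has the default 0..n-1 index,
# and that its first column is literally named 'Data,' (with the comma)
DFOut = DF1.copy()
address = ''
household = ''
city = ''
DFOut['Data'] = DFOut['Data,']
for row, value in enumerate(DFOut['Data']):
    if 'Address State' in value:
        address = value
    elif 'Household' in value:
        household = value
    elif 'City' in value:
        city = value
    elif 'Zip' in value:
        # write the combined record onto the Zip row, stripping the commas
        DFOut.loc[row, 'Data'] = f'{address} {household} {city} {value}'.replace(',', '')
# keep only the combined rows, then restore the original column name
DFOut = DFOut.loc[DFOut['Data'].str.contains('Zip'), ['Data', 'Name']]
DFOut = DFOut.rename({'Data' : 'Data,'}, axis=1)
DFOut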

Output:

    Data,                                               Name
3   Address State1 Household = 1 1012 Address 123 ...   Name1
5   Address State1 Household = 1 1012 Address 234 ...   Name1
9   Address State2 Household = 2 1012 Address 345 ...   Name2
11  Address State2 Household = 2 1012 Address 456 ...   Name2
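
A variation on the same loop collects the finished records in a list and builds the result once at the end, which avoids writing into the DataFrame while iterating over it and does not depend on the index. This is only a sketch: the function name `collapse_records` is hypothetical, the column names are assumed to be Data and Name without the trailing comma, and rows are classified by how they start rather than by the words City and Zip.

```python
import pandas as pd

def collapse_records(frame, data_col='Data', name_col='Name'):
    """Emit one combined row each time a '1013' row closes a record."""
    records = []
    address = household = city = ''
    for value, name in zip(frame[data_col], frame[name_col]):
        if value.startswith('Address State'):
            address = value
        elif value.startswith('Household'):
            household = value
        elif value.startswith('1012'):
            city = value
        elif value.startswith('1013'):
            combined = ' '.join([address, household, city, value])
            # The name travels with the row, so it is not lost
            records.append({data_col: combined, name_col: name})
    return pd.DataFrame(records, columns=[data_col, name_col])

DF1 = pd.DataFrame({
    'Data': ['Address State1', 'Household = 1',
             '1012 Address 123 City', '1013 Address Zip 12345',
             '1012 Address 234 City', '1013 Address Zip 23456'],
    'Name': ['Name1'] * 6,
})
DFOut = collapse_records(DF1)
print(DFOut)
```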