How do I split one row into multiple rows in Python?

Question (votes: 0, answers: 4)

Suppose I have sample data like this:

l = [
['Visitors', '1 February 2020', 'Saturday', 'Shop A', 'In', '100', '20', '30','150', 'Out', '90', '10', '15', '115'],
['Visitors', '1 February 2020', 'Saturday', 'Shop B', 'In', '20', '10', '40', '70', 'Out', '10', '9', '0', '19'],
['Visitors', '1 February 2020', 'Saturday', 'Shop C', 'In', '42', '18', '20', '80', 'Out', '40', '10', '20', '70'],
['Visitors', '1 February 2020', 'Saturday', 'Shop D', 'In', '0', '0', '0', '0', 'Out', '0', '0', '0', '0'],
['Visitors', '1 February 2020', 'Saturday', 'Shop E', 'In', '0', '0', '0', '0', 'Out', '0', '0', '0', '0'],
['Visitors', '1 February 2020', 'Saturday', 'Shop F', 'In', '20', '19', '11', '50', 'Out', '10', '9', '5', '24'],
['Visitors', '1 February 2020', 'Saturday', 'Shop G', 'In', '25', '8', '33', '66', 'Out', '20', '6', '30', '56'],
['Visitors', '1 February 2020', 'Saturday', 'Shop H', 'In', '180', '88', '6', '274', 'Out', '170', '80', '5', '255'],
['Visitors', '1 February 2020', 'Saturday', 'Shop I', 'In', '0', '0', '0', '0', 'Out', '0', '0', '0', '0'],
['Visitors', '1 February 2020', 'Saturday', 'Total', 'In', '387', '163', '140', '690', 'Out', '340', '124', '75', '539'],
]

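The numbers show how many men, women and children visit each shop per day, recorded separately for people entering and leaving. Each row of the list above has the following layout: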

[Ppl_type, Date, Weekday, Shop, In, Men, Women, Children, Total, Out, Men, Women, Children, Total]

[Image: the expected result]

This is the result I would like to see: the data above exported to Excel with the following header:

header= ['Ppl_type', 'Date', 'Weekday', 'Shop', 'In/Out', 'Visitor_Type', 'Number']

So each shop will have six rows (three "In" rows and three "Out" rows) summarizing the numbers above.

I would like to know how to do this in Python and write the result to Excel. I tried worksheet.write, but it only seems to work for the first four columns. Many thanks.

python pandas xlsxwriter
4 answers
2
votes

For a fully programmatic solution, you can use:

import pandas as pd

header= ['Ppl_type', 'Date', 'Weekday', 'Shop',
         'In', 'Men', 'Women', 'Children', 'Total',
         'Out', 'Men', 'Women', 'Children', 'Total']

df = pd.DataFrame(l, columns=header)

m1 = df.columns.isin(['In', 'Out'])
grp = df.columns.to_series().where(m1).ffill()
m2 = grp.notna()
m = m2 & ~m1

out = (
 df.loc[:, m2==m]
   .set_index(list(grp[~m2].index))
   .astype(int)
   .set_axis(pd.MultiIndex.from_arrays([df.columns[m], grp[m]],
                                       names=('Visitor_Type', 'In/Out')), axis=1)
   .stack(['In/Out', 'Visitor_Type']).reset_index(name='Number')
   # uncomment the line below to remove the Total
   #.loc[lambda d: d['Visitor_Type'].ne('Total')]
)

Output:

    Ppl_type             Date   Weekday    Shop In/Out Visitor_Type  Number
0   Visitors  1 February 2020  Saturday  Shop A     In     Children      30
1   Visitors  1 February 2020  Saturday  Shop A     In          Men     100
2   Visitors  1 February 2020  Saturday  Shop A     In        Total     150
3   Visitors  1 February 2020  Saturday  Shop A     In        Women      20
4   Visitors  1 February 2020  Saturday  Shop A    Out     Children      15
5   Visitors  1 February 2020  Saturday  Shop A    Out          Men      90
6   Visitors  1 February 2020  Saturday  Shop A    Out        Total     115
7   Visitors  1 February 2020  Saturday  Shop A    Out        Women      10
8   Visitors  1 February 2020  Saturday  Shop B     In     Children      40
9   Visitors  1 February 2020  Saturday  Shop B     In          Men      20
10  Visitors  1 February 2020  Saturday  Shop B     In        Total      70
...
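
The four masks in the answer above are dense, so here is a minimal sketch of what each one holds, run on the column names alone. It reuses the names m1, grp, m2 and m from the answer; the `cols` and `pairs` variables are introduced here only for illustration.

```python
import pandas as pd

header = ['Ppl_type', 'Date', 'Weekday', 'Shop',
          'In', 'Men', 'Women', 'Children', 'Total',
          'Out', 'Men', 'Women', 'Children', 'Total']
cols = pd.Index(header)

# m1: True only at the two marker columns, 'In' and 'Out'
m1 = cols.isin(['In', 'Out'])

# grp: the direction each column belongs to, found by forward-filling the
# markers; the four identifier columns before the first marker stay NaN
grp = cols.to_series().where(m1).ffill()

# m2: True from the first marker onward (markers and counts alike)
m2 = grp.notna()

# m: True only at the count columns (the marker region minus the markers)
m = m2 & ~m1

# Pair every count column with its direction (illustrative helper variable)
pairs = list(zip(grp[m.to_numpy()], cols[m.to_numpy()]))
print(pairs)
```

These (direction, visitor type) pairs are what the answer turns into the two-level column index before stacking.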

1
vote

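You can rename the columns with the suffixes 1 and 2 to remove the duplicate column names. Then, if you need to keep the original order of the data, you can reshape with wide_to_long and DataFrame.stack: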

import pandas as pd

L = [['Visitors', '1 February 2020', 'Saturday', 'Shop A', 'In', '100', '20', '30','150', 'Out', '90', '10', '15', '115'],
['Visitors', '1 February 2020', 'Saturday', 'Shop B', 'In', '20', '10', '40', '70', 'Out', '10', '9', '0', '19'],
['Visitors', '1 February 2020', 'Saturday', 'Shop C', 'In', '42', '18', '20', '80', 'Out', '40', '10', '20', '70'],
['Visitors', '1 February 2020', 'Saturday', 'Shop D', 'In', '0', '0', '0', '0', 'Out', '0', '0', '0', '0'],
['Visitors', '1 February 2020', 'Saturday', 'Shop E', 'In', '0', '0', '0', '0', 'Out', '0', '0', '0', '0'],
['Visitors', '1 February 2020', 'Saturday', 'Shop F', 'In', '20', '19', '11', '50', 'Out', '10', '9', '5', '24'],
['Visitors', '1 February 2020', 'Saturday', 'Shop G', 'In', '25', '8', '33', '66', 'Out', '20', '6', '30', '56'],
['Visitors', '1 February 2020', 'Saturday', 'Shop H', 'In', '180', '88', '6', '274', 'Out', '170', '80', '5', '255'],
['Visitors', '1 February 2020', 'Saturday', 'Shop I', 'In', '0', '0', '0', '0', 'Out', '0', '0', '0', '0'],
['Visitors', '1 February 2020', 'Saturday', 'Total', 'In', '387', '163', '140', '690', 'Out', '340', '124', '75', '539']]

cols = ['Ppl_type', 'Date', 'Weekday', 'Shop', 
        'In/Out1', 'Men1', 'Women1', 'Children1', 'Total1', 
        'In/Out2', 'Men2', 'Women2', 'Children2', 'Total2']
df = pd.DataFrame(L, columns=cols)
print (df)


df = (pd.wide_to_long(df, 
                      stubnames=['In/Out','Men','Women','Children','Total'], 
                      i=['Ppl_type', 'Date', 'Weekday', 'Shop'],
                      j='tmp').set_index('In/Out', append=True)
        .droplevel(-2)
        .rename_axis('Visitor_Type', axis=1)
        .stack()
        .reset_index(name='Number'))

print (df)
    Ppl_type             Date   Weekday    Shop In/Out Visitor_Type Number
0   Visitors  1 February 2020  Saturday  Shop A     In          Men    100
1   Visitors  1 February 2020  Saturday  Shop A     In        Women     20
2   Visitors  1 February 2020  Saturday  Shop A     In     Children     30
3   Visitors  1 February 2020  Saturday  Shop A     In        Total    150
4   Visitors  1 February 2020  Saturday  Shop A    Out          Men     90
..       ...              ...       ...     ...    ...          ...    ...
75  Visitors  1 February 2020  Saturday   Total     In        Total    690
76  Visitors  1 February 2020  Saturday   Total    Out          Men    340
77  Visitors  1 February 2020  Saturday   Total    Out        Women    124
78  Visitors  1 February 2020  Saturday   Total    Out     Children     75
79  Visitors  1 February 2020  Saturday   Total    Out        Total    539

[80 rows x 7 columns]

If you need to remove Total from the final output:

# df here must be the wide frame built from L and cols above, not the reshaped result
df = (pd.wide_to_long(df, 
                      stubnames=['In/Out','Men','Women','Children','Total'], 
                      i=['Ppl_type', 'Date', 'Weekday', 'Shop'],
                      j='tmp').set_index('In/Out', append=True)
        .droplevel(-2)
        .rename_axis('Visitor_Type', axis=1)
        .stack()
        .reset_index(name='Number')
        .query('Visitor_Type != "Total"'))

print (df.head(10))
    Ppl_type             Date   Weekday    Shop In/Out Visitor_Type Number
0   Visitors  1 February 2020  Saturday  Shop A     In          Men    100
1   Visitors  1 February 2020  Saturday  Shop A     In        Women     20
2   Visitors  1 February 2020  Saturday  Shop A     In     Children     30
4   Visitors  1 February 2020  Saturday  Shop A    Out          Men     90
5   Visitors  1 February 2020  Saturday  Shop A    Out        Women     10
6   Visitors  1 February 2020  Saturday  Shop A    Out     Children     15
8   Visitors  1 February 2020  Saturday  Shop B     In          Men     20
9   Visitors  1 February 2020  Saturday  Shop B     In        Women     10
10  Visitors  1 February 2020  Saturday  Shop B     In     Children     40
12  Visitors  1 February 2020  Saturday  Shop B    Out          Men     10

0
votes

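In Python, to split one line into several, you can use the split() method, which splits a string into a list of substrings at a given delimiter. Here is an example snippet: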

row = "John,Smith,25,New York"
delimiter = ","
split_row = row.split(delimiter)
print(split_row)

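In this example, the variable row holds a string of four comma-separated values. We define delimiter as a comma and pass it to split() on row, which splits the string into a list of substrings. The resulting split_row list contains four elements: "John", "Smith", "25" and "New York".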

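Once you have the list of substrings, you can use it to create multiple rows. For example, you can loop over the list and create a new row for each value: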

for value in split_row:
    new_row = value
    print(new_row)

This creates a new row for each value in the split_row list. The resulting output is:

John
Smith
25
New York
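
To tie this back to the visitor data, the same split() call can feed a loop that produces one labelled row per count. This is only a sketch: the comma-separated record layout and the names `line`, `labels` and `rows` are assumptions made for illustration, not part of the question's data.

```python
# Hypothetical comma-separated record; this layout is an assumption
line = "Shop A,In,100,20,30"

# The first two fields identify the record; the remaining fields are counts
shop, direction, *counts = line.split(",")

# One output row per visitor type, pairing each label with its count
labels = ["Men", "Women", "Children"]
rows = [[shop, direction, label, int(count)]
        for label, count in zip(labels, counts)]

for r in rows:
    print(r)
```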


0
votes

You can hardcode a parser for this:

def split_rows(row):
    base = [row[0], parsed_date(row[1]), row[2], row[3]]
    return [
        base + ['In', 'Man', row[5]],
        base + ['In', 'Woman', row[6]],
        base + ['In', 'Children', row[7]],
        base + ['Out', 'Man', row[10]],
        base + ['Out', 'Woman', row[11]],
        base + ['Out', 'Children', row[12]]
    ]

Then, assuming data is the list of lists holding your rows:

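import csv

# Expand every wide row into its six long rows
final_rows = []
for d in data:
    for row in split_rows(d):
        final_rows.append(row)

# newline='' stops the csv module from writing blank lines on Windows
with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(final_rows)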

Then you just need to implement parsed_date.
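
The answer leaves parsed_date unwritten. One possible version, as a sketch: it assumes every date looks like '1 February 2020' and that an ISO string is the wanted output, neither of which the answer states. Note that %B matches month names in the current locale, so this assumes an English locale.

```python
from datetime import datetime


def parsed_date(text):
    """Parse a date such as '1 February 2020' into an ISO string.

    Returning 'YYYY-MM-DD' is an assumption; the answer does not say
    which format the parsed date should take.
    """
    return datetime.strptime(text, "%d %B %Y").date().isoformat()


print(parsed_date("1 February 2020"))
```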

PS: The other solutions posted while I was writing this are certainly better than this one.
