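I currently have a long script whose goal is to read several CSV tables, merge them into one while performing various calculations along the way, and then output a final CSV table.
I originally had it in this layout (see LAYOUT A), but found that it made it hard to see which columns were being added or merged: the cleaning and operation functions are listed below everything else, so you have to move up and down in the file to see how the table changes. This was an attempt to follow the "keep things modular and small" advice I have read about: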
# LAYOUT A
import pandas as pd
#...
SOME_MAPPER = {'a': 1, 'b': 2}  # ...
COLUMNS_RENAMER = {'new_col1': 'aaa'}  # ...

def main():
    df1 = clean_table_1('table1.csv')
    df2 = clean_table_2('table2.csv')
    df3 = clean_table_3('table3.csv')
    df = pd.merge(df1, df2, on='col_a')
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

def some_operation(x, y, z):
    ...  # <calculations for performing on table column>

def some_other_operation(a, b):
    ...  # <some calculation>

def clean_table_1(fn_1):
    df = pd.read_csv(fn_1)
    df['some_col1'] = 400
    def do_operations_unique_to_table1(df):
        # <operations>
        return df
    df = do_operations_unique_to_table1(df)
    return df

def clean_table_2(fn_2):
    ...  # <similar to clean_table_1>

def clean_table_3(fn_3):
    ...  # <similar to clean_table_1>

if __name__ == '__main__':
    main()
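My next inclination was to move all the functions inline with the main script, so that it is obvious what is being done (see LAYOUT B). This makes the linear order of the operations easier to see, but it also makes things messier, so you can no longer skim the main function to get an overview of everything being done.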
# LAYOUT B
import pandas as pd
#...
SOME_MAPPER = {'a': 1, 'b': 2}  # ...
COLUMNS_RENAMER = {'new_col1': 'aaa'}  # ...

def main():
    def clean_table_1(fn_1):
        df = pd.read_csv(fn_1)
        df['some_col1'] = 400
        def do_operations_unique_to_table1(df):
            # <operations>
            return df
        df = do_operations_unique_to_table1(df)
        return df  # needed so that df1 below is not None
    df1 = clean_table_1('table1.csv')

    def clean_table_2(fn_2):
        ...  # <similar to clean_table_1>
    df2 = clean_table_2('table2.csv')

    def clean_table_3(fn_3):
        ...  # <similar to clean_table_1>
    df3 = clean_table_3('table3.csv')

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x, y, z):
        ...  # <calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a, b):
        ...  # <some calculation>
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__ == '__main__':
    main()
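So then I thought: why have these functions at all? Would it be easier to follow if everything were at the same level, as a plain script like this (LAYOUT C):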
# LAYOUT C
import pandas as pd
#...
SOME_MAPPER = {'a': 1, 'b': 2}  # ...
COLUMNS_RENAMER = {'new_col1': 'aaa'}  # ...

def main():
    df1 = pd.read_csv('table1.csv')
    df1['some_col1'] = 400
    # df1 = <operations on df1>

    df2 = pd.read_csv('table2.csv')
    df2['some_col2'] = 200
    # df2 = <operations on df2>

    df3 = pd.read_csv('table3.csv')
    df3['some_col3'] = 800
    # df3 = <operations on df3>

    df = pd.merge(df1, df2, on='col_a')

    def some_operation(x, y, z):
        ...  # <calculations for performing on table column>
    df['new_col1'] = df.apply(lambda r: some_operation(r['x'], r['y'], r['z']), axis=1)
    df['new_col2'] = df['new_col1'].map(SOME_MAPPER)
    df = pd.merge(df, df3, on='new_col2')

    def some_other_operation(a, b):
        ...  # <some calculation>
    df['new_col3'] = df['something'] + df['new_col2']
    df['new_col4'] = df.apply(lambda r: some_other_operation(r['a'], r['b']), axis=1)
    df = df.rename(columns=COLUMNS_RENAMER)
    return df

if __name__ == '__main__':
    main()
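The crux of the problem is finding a balance between clearly documenting which columns are being updated, changed, dropped, renamed, merged, and so on, while still keeping the code modular enough to fit the "clean code" paradigm.
Also, in practice, this script and others like it are much longer, with many more tables being merged into the mix, so this quickly becomes a long list of operations. Should I break the operations up into smaller files and write out intermediate files, or is that just asking to introduce errors? It is also a matter of being able to see all the assumptions made along the way and how they affect the data in its final state, without having to jump between files or scroll far up and down to follow the data from A to B, if that makes sense.
If anyone has insights on how best to write these kinds of data cleaning/manipulation scripts, I would love to hear them.
This is a very subjective topic, but here are my typical practices/remarks/tips: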
inplace=True
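main function: if you keep the script according to the rules above, there is nothing wrong with df being a global variable
read_csv: operations that can be performed through its parameters, such as parsing dates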
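As a rough illustration of the read_csv point, here is a minimal sketch that parses a date column while loading, rather than in a separate cleaning step. The in-memory CSV text and the column names (order_date, amount) are made up for this example and do not come from the tables in the question.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for one of the input tables.
CSV_TEXT = """order_date,amount
2021-01-05,10
2021-02-17,25
"""

# parse_dates converts the named column to datetimes while reading,
# so no separate pd.to_datetime step is needed in the cleaning code.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["order_date"])

print(df.dtypes)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.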