Fastest way to reformat terabytes of data

Question (votes: -3, answers: 2)

I have 100 files, each 10 GB. I need to reformat the files and combine them into a more usable tabular format so that I can group, sum, average, etc. the data. Reformatting the data with Python would take a week. Even once I have reformatted it into a table, I don't know whether it will be too big for a dataframe, but one problem at a time.

Can anyone suggest a faster way to reformat the text files? I'm open to anything: C++, Perl, etc.

Sample data:

Scenario:  Modeling_5305 (0.0001)

Position:  NORTHERN UTILITIES SR NT,

"  ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",

"2017/12/31",16.0137 T,4.4194 % SEMI 30/360,0.4980 % SEMI 30/360,"6,934,452.0000 USD","6,884,052.0000 USD","7,000,000.0000 USD",371.6160 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8548 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,

"2018/01/12",15.5666 T,4.8499 % SEMI 30/360,0.4980 % SEMI 30/360,"6,477,803.7492 USD","6,418,163.7492 USD","7,000,000.0000 USD",356.9428 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8219 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,

Scenario:  Modeling_5305 (0.0001)

Position:  OLIVIA ISSUER TR SER A (A,

"  ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",

"2017/12/31",1.3160 T,19.0762 % SEMI 30/360,0.2990 % SEMI 30/360,"3,862,500.0000 USD","3,862,500.0000 USD","5,000,000.0000 USD",2.3811 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.3288 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,

"2018/01/12",1.2766 T,21.9196 % SEMI 30/360,0.2990 % SEMI 30/360,"3,815,391.3467 USD","3,815,391.3467 USD","5,000,000.0000 USD",2.2565 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.2959 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,

I want to reformat it into this CSV table so that I can import it into a dataframe:

Position, Scenario, TimeSteps, THEO/Value

NORTHERN UTILITIES SR NT, Modeling_5305, 2018/01/12, 6477803.7492

OLIVIA ISSUER TR SER A (A, Modeling_5305, 2018/01/12, 3815391.3467
python pandas bigdata
2 Answers
0 votes

When you have to manipulate huge files, or a large number of files, there are two big bottlenecks. One is your file system, which is limited by the storage medium (HDD or SSD), by how that medium is connected, and by the operating system. You usually cannot change any of that, but you should ask yourself: what is my top speed? How fast can the system read and write? You will never go faster than that. A rough estimate of the best-case runtime is the time needed to read all of the data plus the time needed to write all of it back out.
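As a quick sanity check, you can measure that ceiling yourself. The sketch below times a plain sequential read in large blocks; the file name is a placeholder, so point it at one of your real 10 GB files. The result is roughly the best throughput any reformatting tool can hope for on that machine.

import time

CHUNK = 64 * 1024 * 1024  # read in 64 MB blocks

def read_throughput(path):
    """Time a plain sequential read of the whole file and return MB/s."""
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total / (time.time() - start) / 1e6

# "input_001.txt" is a placeholder for one of your 10 GB files
print("sequential read: %.0f MB/s" % read_throughput("input_001.txt"))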

The other bottleneck is the library you use to do the transformation. Not all Python packages are created equal; there are huge speed differences between them. I suggest trying a few approaches on a small test sample until you find one that works well for you.

Keep in mind that most file systems prefer reading or writing large blocks of data. So you should avoid alternating between reading one line and writing one line. In other words, it is not only the library that matters, but also how you use it.
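A minimal sketch of that pattern: parse line by line as usual, but collect the output rows and flush them in large batches instead of writing after every line. parse_line() here is only a placeholder for whatever reformatting you actually need.

FLUSH_EVERY = 100_000  # output lines per flush

def parse_line(line):
    # placeholder: return the reformatted CSV row, or None to skip the line
    return line

def reformat(src_path, dst_path):
    out_rows = []
    # large buffers so the OS sees big reads and big writes
    with open(src_path, "r", buffering=16 * 1024 * 1024) as src, \
         open(dst_path, "w", buffering=16 * 1024 * 1024) as dst:
        for line in src:
            row = parse_line(line)
            if row is not None:
                out_rows.append(row)
            if len(out_rows) >= FLUSH_EVERY:
                dst.writelines(out_rows)
                out_rows.clear()
        dst.writelines(out_rows)  # write whatever is left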

Different programming languages, while they may offer a good library for this task and can be a reasonable choice, will not speed up the process in any meaningful way (you will not get a 10x speedup or anything like that).


0 votes

I would use C/C++ with memory mapping.

With memory mapping you can examine the data as if it were one huge array of bytes (it also avoids copying the data from kernel space to user space; that is the case on Windows, I am not sure about Linux).

For very large files you can map one chunk (say, 10 GB) at a time.

For writing, collect the results in a buffer (say, 1 MB) and write that buffer out to the file each time it fills up (using fwrite()).

Whatever you do, do not use streaming I/O or readline().
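This answer is about C/C++ (mmap()/fwrite()); purely as an illustration of the same chunk-mapping idea, here is the shape of it using Python's built-in mmap module. The chunk size and the process_chunk() helper are placeholders, and a real implementation would also need to handle records that straddle a chunk boundary.

import mmap
import os

CHUNK = 1024 * 1024 * 1024  # map 1 GB at a time; mapping offsets must be a
                            # multiple of mmap.ALLOCATIONGRANULARITY, which 1 GB is

def process_chunk(view, dst):
    # placeholder: scan the mapped bytes and append reformatted rows to dst
    pass

def process_file(src_path, dst_path):
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while offset < size:
            length = min(CHUNK, size - offset)
            view = mmap.mmap(src.fileno(), length,
                             access=mmap.ACCESS_READ, offset=offset)
            try:
                process_chunk(view, dst)
            finally:
                view.close()
            offset += length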

The whole process should then take no longer (or at least not much longer) than simply copying the files on disk (or over the network, if you are using network file storage).

If you have the option, write the output to a different (physical) disk than the one you are reading from.
