Suggestions for optimizing `data.table::fread` speed


I'm trying to read a fairly large file in R with fread, and the read speed is much slower than I expected.

The file is roughly 60 million rows x 147 columns, of which I select only 27 columns, directly in the fread call via select; only 23 of the 27 are actually found in the file. (Some of the strings I typed may be incorrect, but I suppose that doesn't matter much here.)

# cols2Select: character vector of the 27 column names to keep (defined elsewhere, not shown)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
                  verbose = TRUE,
                  select = cols2Select)

The system in use is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10. I'm using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0.

I should also add that the file is a csv file I transferred onto the VM's local drive, so no network/ethernet is involved. I'm not sure how Azure VMs work or what drives they use, but I would assume it's the equivalent of an SSD. Nothing else is running/processing on the VM at the same time.

Please find the verbose fread output below:

omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000)    : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555  Quote rule 0
  Type codes (jump 001)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 002)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 003)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 010)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 031)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 098)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 100)    : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  |--------------------------------------------------|
  |==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
  |--------------------------------------------------|
  |==================================================|
Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
       124 : drop      '0'
         3 : int32     '5'
         7 : float64   '7'
        13 : string    'A'
=============================
   0.000s (  0%) Memory map 48.996GB file
   0.035s (  0%) sep=',' ncol=147 and header detection
   0.001s (  0%) Column type detection using 10045 sample rows
   6.000s (  0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s (  1%) Transpose
   +  144.273s (  8%) Waiting
  24.545s (  1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s        Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14

Basically, I would like to know whether this is normal, or whether there is anything I can do to improve these read speeds. Based on the various benchmarks I've seen, and on my own experience and intuition reading smaller files, I expected this to read much faster.

I'm also wondering whether the multi-core capability is being fully used, since I've heard that under Windows this may not always be straightforward. My knowledge of the topic is unfortunately quite limited, but the verbose output does seem to show fread detecting 16 cores.
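(For reference, a minimal way to check and pin data.table's thread usage, using data.table's own getDTthreads()/setDTthreads():

library(data.table)

# Report how many threads data.table will use; verbose also prints OpenMP details
getDTthreads(verbose = TRUE)

# Pin the thread count explicitly (0 means "use the default", i.e. all cores)
setDTthreads(16)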

Tags: r, data.table, fread
1 Answer

Ideas:

(1) If on Windows, use Microsoft Open R; all the more so if the cloud is Azure. In fact, there may be some coordination between Open R and the Azure client. I've found Microsoft Open R faster on Windows, thanks to Intel's MKL and Microsoft's built-in enhancements.

(2) I suspect 'select' and 'drop' do their work after the full file has been read. Perhaps read everything first, then subset or filter afterwards.
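If idea (2) is worth testing, here is a minimal sketch of the read-everything-then-subset variant; it reuses the question's cols2Select vector (an assumption, since its contents aren't shown), with intersect() guarding against the four names that don't exist in the file:

library(data.table)

# Read all 147 columns first, then keep only the wanted ones
# (cols2Select is the hypothetical character vector of column names from the question)
DT <- fread("..\\TOI\\TOI_RAW_APextracted.csv", verbose = TRUE)
DT <- DT[, intersect(cols2Select, names(DT)), with = FALSE]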

(3) I think restarting is overkill. I often run gc three times, like this: 'gc(); gc(); gc();'. I've heard others say it accomplishes nothing, but at least it makes me feel better. In fact, I've noticed it helping me on Windows.

(4) The latest versions of data.table's fread are implementing 'YAML'. This looks promising.
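A sketch of what that might look like, assuming data.table >= 1.12.4 (where fread() gained a yaml argument) and a hypothetical file data.csvy carrying a CSVY-style YAML header that declares the column types up front:

library(data.table)

# With yaml = TRUE, fread takes column names/types from the file's YAML
# front matter instead of sampling rows to guess them
DT <- fread("data.csvy", yaml = TRUE)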

(5) setDTthreads(0) uses all cores. Spinning up too many threads may work against you. Try halving the cores.
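A minimal sketch of halving the thread count on this 16-core VM before calling fread():

library(data.table)

# The default, setDTthreads(0), uses all available cores; try half of them instead
setDTthreads(8)
getDTthreads()   # confirm: should now report 8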
