How can I repeatedly read random lines of a large data file in Haskell?

Question

I have a data file with 60k lines, where each line has ~1k comma-separated Ints (which I want to turn into Doubles immediately).

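I want to iterate over a sequence of random "batches" of 32 lines, where each batch is a random subset of all the lines and no two batches share a line. With 60k lines and 32 lines per batch, there should be 1875 batches.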

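I'm willing to change this if necessary, but I'd like the batches to come as a lazily evaluated list. The code that needs them is a foldM, which I'm using like this: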

resulting_struct <- foldM fold_fn my_struct batch_list

so that it repeatedly calls fold_fn on the current accumulator (initially my_struct) and the next element of batch_list.
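For reference, here is a minimal sketch of how foldM threads an accumulator through a list of batches. The names foldFn, runFold and Batch are hypothetical stand-ins (the real fold_fn and my_struct are not shown in the question), and the accumulator is assumed to be just a running sum.

```haskell
import Control.Monad (foldM)

-- Hypothetical stand-in: a batch is a list of rows of Doubles.
type Batch = [[Double]]

-- One fold step: combine the current accumulator with the next batch.
-- Here the "struct" is just a running sum of every value seen so far.
foldFn :: Double -> Batch -> IO Double
foldFn acc batch = return (acc + sum (map sum batch))

-- foldM calls foldFn once per batch, left to right, passing each
-- result on as the accumulator for the next call.
runFold :: Double -> [Batch] -> IO Double
runFold = foldM foldFn
```

With this shape, runFold 0 batches plays the role of foldM fold_fn my_struct batch_list.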

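I'm stumped. When I didn't need to shuffle, this was easy: I just read the lines in and chunked them, and since they were evaluated lazily I had no problems. Now I'm completely stuck and feel like I must be missing something simple.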

I've tried the following:

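Reading the file into a list of lines and then naively shuffling the input. This doesn't work: readFile is lazy, but shuffling forces the entire file into memory, and it quickly eats all of my ~8 GB of RAM.
Getting the length of the file, then creating a shuffled list of indices from 0 to 60k and using them to read the batches: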

ind_batches <- get_shuffled_ind_batches_from_file fname
batch_list <- mapM (get_data_batch_from_ind_batch fname) ind_batches

where:

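import Control.Monad (forM)
import Data.Array.IO (IOArray, newListArray, readArray, writeArray)
import Data.List.Split (chunksOf)
import System.Random (randomRIO)

get_shuffled_ind_batches_from_file :: String -> IO [[Int]]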
get_shuffled_ind_batches_from_file fname = do
  contents <- get_contents_from_file fname -- uses readFile, returns [[Double]]
  let n_samps = length contents
      ind = [0..(n_samps-1)]
  shuffled_indices <- shuffle_list ind
  let shuffled_ind_chunks = take 1800 $ chunksOf 32 shuffled_indices
  return shuffled_ind_chunks

get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = do
  contents <- get_contents_from_file fname
  let data_batch = get_elems_at_indices contents ind_chunk
  return data_batch

shuffle_list :: [a] -> IO [a]
shuffle_list xs = do
        ar <- newArray n xs
        forM [1..n] $ \i -> do
            j <- randomRIO (i,n)
            vi <- readArray ar i
            vj <- readArray ar j
            writeArray ar j vi
            return vj
  where
    n = length xs
    newArray :: Int -> [a] -> IO (IOArray Int a)
    newArray n xs =  newListArray (1,n) xs

get_elems_at_indices :: [a] -> [Int] -> [a]
get_elems_at_indices my_list ind_list = map (my_list !!) ind_list

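However, it seems that mapM evaluates immediately and then tries to read the contents of the file over and over (I think it still runs out of RAM).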
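That observation is consistent with how mapM behaves in IO: every action runs, in order, before the result list is returned, so the list cannot be consumed lazily. A minimal sketch that makes this visible, using a hypothetical fakeRead action that only bumps a counter instead of reading a file:

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Hypothetical stand-in for an expensive read: it only bumps a counter.
fakeRead :: IORef Int -> Int -> IO Int
fakeRead counter i = do
  modifyIORef' counter (+ 1)
  return (i * 2)

-- Runs mapM over n actions, inspects at most the first result, and
-- reports how many of the actions were actually executed.
actionsRunByMapM :: Int -> IO Int
actionsRunByMapM n = do
  counter <- newIORef 0
  results <- mapM (fakeRead counter) [1 .. n]
  case results of
    -- Only the first result is looked at here...
    (x : _) -> x `seq` readIORef counter
    -- ...and with no actions there is nothing to inspect.
    [] -> readIORef counter
```

Even though only the first element is inspected, the counter equals n, which is why each batch is read up front.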

A bit more searching told me I could try unsafeInterleaveIO to make an action evaluate lazily, so I tried sticking it in like this:

get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = unsafeInterleaveIO $ do
  contents <- get_contents_from_file fname
  let data_batch = get_elems_at_indices contents ind_chunk
  return data_batch

But no luck: same result as above.

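I feel like I'm banging my head against a wall and must be missing something very simple. People have suggested using streams or pipes instead, but when I look at their documentation it isn't clear to me how to use them to solve this problem.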

How can I read in a large data file and shuffle it without using up all of my memory?


Answer 1

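hGetContents will return the contents of the file lazily, but if you do much of anything with the result you will realize the whole file at once. I'd suggest reading the file once and scanning it for newlines, so that you can build an index of the byte offset at which each line starts. That index will be small, so you can easily shuffle it. You can then iterate over the index, each time opening the file, reading only the sub-range it defines, and parsing just that chunk.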
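A minimal sketch of that idea, under stated assumptions: lines end in '\n', fields are plain integers with no surrounding spaces, and the names buildLineIndex, parseRow, readRowAt, readBatch and foldBatches are hypothetical helpers introduced here, not part of any library. Shuffling is left out so that the sketch depends only on base and bytestring; the question's own shuffle_list could be applied to the index.

```haskell
import Control.Exception (evaluate)
import Control.Monad (foldM)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as BL
import System.IO
  ( Handle, IOMode (ReadMode), SeekMode (AbsoluteSeek)
  , hFileSize, hSeek, withBinaryFile )

-- Byte offset at which each line starts, found in one streaming pass.
-- Assumes '\n' (byte 10) terminates lines.
buildLineIndex :: FilePath -> IO [Integer]
buildLineIndex path = withBinaryFile path ReadMode $ \h -> do
  size <- hFileSize h
  contents <- BL.hGetContents h
  let starts = 0 : map ((+ 1) . fromIntegral) (BL.elemIndices 10 contents)
      -- Drop a "start" that sits at end of file (trailing newline).
      valid = filter (< size) starts
  -- Force the whole index while the handle is still open.
  _ <- evaluate (length valid)
  return valid

-- Parse one comma-separated line of Ints into Doubles.
-- Fields that do not start with an integer are skipped.
parseRow :: B.ByteString -> [Double]
parseRow line =
  [fromIntegral n | field <- B.split ',' line, Just (n, _) <- [B.readInt field]]

-- Seek to a stored offset and read exactly one line.
readRowAt :: Handle -> Integer -> IO [Double]
readRowAt h offset = do
  hSeek h AbsoluteSeek offset
  parseRow <$> B.hGetLine h

-- Read only the rows named by one batch of offsets.
readBatch :: FilePath -> [Integer] -> IO [[Double]]
readBatch path offsets =
  withBinaryFile path ReadMode $ \h -> mapM (readRowAt h) offsets

-- Fold over batches of offsets; each batch is read from disk only when
-- its step runs, so one batch of rows needs to be live at a time.
foldBatches :: (acc -> [[Double]] -> IO acc) -> acc -> FilePath -> [[Integer]] -> IO acc
foldBatches step z path = foldM (\acc offs -> readBatch path offs >>= step acc) z
```

Assuming the question's helpers, usage might look like: build the index with buildLineIndex fname, shuffle it with shuffle_list, split it with chunksOf 32, and pass the result to foldBatches fold_fn my_struct fname. Folding over batches of offsets rather than a list of already-read batches sidesteps the need for a lazy list of IO results.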
