How to get a random line of a text file in Java?

Question

Say there is a file too big to fit in memory. How can I get a random line from it? Thanks.

Update: I want each line to be equally likely to be chosen.

Answer 1

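Reading the entire file seems a bit excessive if you only want one line. The following should be more efficient:

1. Use RandomAccessFile to seek to a random byte position in the file.
2. Seek left and right to the nearest line terminators. Let L be the line between them.
3. With probability (MIN_LINE_LENGTH / L.length), return L. Otherwise, start over at step 1.

This is a variant of rejection sampling.

Line lengths include the line terminator character, hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter bound on the line length.)

It is worth noting that the running time of this algorithm does not depend on the file size, only on the line length, i.e. it scales much better than reading the entire file.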
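The three steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's own code: the class name RejectionLineSampler and the method randomLine are made up for this example, and it assumes a non-huge line length, UTF-8 text, and '\n' as the line terminator (a preceding '\r' is simply counted as part of the line). A random byte lands in a line with probability proportional to the line's length, and the acceptance test in step 3 cancels that bias.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper class, named for this sketch only.
final class RejectionLineSampler {

    // Assumed lower bound on the line length in bytes, terminator included.
    static final int MIN_LINE_LENGTH = 1;

    /** Returns a uniformly chosen line, or null for an empty file. Assumes '\n' terminators. */
    static String randomLine(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileLength = file.length();
            if (fileLength == 0) {
                return null;
            }
            ThreadLocalRandom random = ThreadLocalRandom.current();
            while (true) {
                // Step 1: pick a random byte position in [0, fileLength).
                long pos = random.nextLong(fileLength);

                // Step 2a: move left until the byte before 'start' is a terminator (or start of file).
                long start = pos;
                while (start > 0) {
                    file.seek(start - 1);
                    if (file.read() == '\n') {
                        break;
                    }
                    start--;
                }

                // Step 2b: move right until just past the next terminator (or end of file).
                long end = pos;
                file.seek(end);
                while (end < fileLength) {
                    end++;
                    if (file.read() == '\n') {
                        break;
                    }
                }

                // The line occupies bytes [start, end), terminator included, so its length is >= 1.
                long lineLength = end - start;

                // Step 3: accept with probability MIN_LINE_LENGTH / lineLength, otherwise retry.
                if (random.nextDouble() * lineLength < MIN_LINE_LENGTH) {
                    byte[] bytes = new byte[(int) lineLength];
                    file.seek(start);
                    file.readFully(bytes);
                    // Drop the trailing terminator before decoding.
                    int n = bytes.length;
                    while (n > 0 && (bytes[n - 1] == '\n' || bytes[n - 1] == '\r')) {
                        n--;
                    }
                    return new String(bytes, 0, n, StandardCharsets.UTF_8);
                }
            }
        }
    }
}
```

Reading single bytes through RandomAccessFile is slow per call, so a real implementation would probably read a small buffer around the chosen position instead. The expected number of retries is roughly the average line length divided by MIN_LINE_LENGTH, which is why a tighter bound helps.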

Answer 2

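Here's a solution. Take a look at the choose() method, which does the real work (the main() method repeatedly calls choose() to show that the distribution is indeed quite uniform).

The idea is simple: when you read the first line, it has a 100% chance of being chosen as the result. When you read the 2nd line, it has a 50% chance of replacing the first line as the result. When you read the 3rd line, it has a 33% chance of becoming the result. The fourth line has 25%, and so on.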

import java.io.*;
import java.util.*;

public class B {

  public static void main(String[] args) throws FileNotFoundException {
     Map<String,Integer> map = new HashMap<String,Integer>();
     for(int i = 0; i < 1000; ++i)
     {
        String s = choose(new File("g:/temp/a.txt"));
        if(!map.containsKey(s))
           map.put(s, 0);
        map.put(s, map.get(s) + 1);
     }

     System.out.println(map);
  }

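  public static String choose(File f) throws FileNotFoundException
  {
     String result = null;
     Random rand = new Random();
     int n = 0;
     // try-with-resources closes the Scanner (and the underlying file) on every call
     try (Scanner sc = new Scanner(f))
     {
        // hasNextLine() pairs with nextLine(); hasNext() looks for a token and would skip trailing blank lines
        while (sc.hasNextLine())
        {
           ++n;
           String line = sc.nextLine();
           // the n-th line replaces the current result with probability 1/n
           if (rand.nextInt(n) == 0)
              result = line;
        }
     }

     return result;
  }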
}
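To see why this replacement rule gives every line the same chance, here is a short derivation. The symbols are introduced only for this note: n is the total number of lines, k is the index of a particular line, and j runs over the lines read after line k. Line k ends up as the result if it is picked when read (probability 1/k) and then survives every later line j, each of which replaces it with probability 1/j.

```latex
\begin{align*}
P(\text{result} = \text{line } k)
  &= \frac{1}{k} \prod_{j=k+1}^{n} \left(1 - \frac{1}{j}\right)
   = \frac{1}{k} \prod_{j=k+1}^{n} \frac{j-1}{j} \\
  &= \frac{1}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{k+2} \cdots \frac{n-1}{n}
   = \frac{1}{k} \cdot \frac{k}{n}
   = \frac{1}{n}.
\end{align*}
```

The product telescopes, so the result does not depend on k: every line has probability 1/n, even though n is not known until the file has been read to the end.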

Answer 3

Either you

read the file twice - once to count the number of lines, the second time to extract a random line, or
use reservoir sampling
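The first option might look something like the sketch below. The class name TwoPassLineSampler and the method randomLine are invented for this illustration, and the code assumes UTF-8 text and a file that does not change between the two passes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper class, named for this sketch only.
final class TwoPassLineSampler {

    /** Returns a uniformly chosen line, or null if the file has no lines. */
    static String randomLine(Path path) throws IOException {
        // Pass 1: count the lines without keeping any of them in memory.
        long lineCount = 0;
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                lineCount++;
            }
        }
        if (lineCount == 0) {
            return null;
        }

        // Pick a zero-based line index uniformly from [0, lineCount).
        long target = ThreadLocalRandom.current().nextLong(lineCount);

        // Pass 2: skip 'target' lines, then return the next one.
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            for (long i = 0; i < target; i++) {
                reader.readLine();
            }
            return reader.readLine();
        }
    }
}
```

This reads the file up to twice but needs only constant memory, and each line is equally likely because the index is drawn uniformly.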

Answer 4

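Looking at Itay's answer, it looks as though it reads the file a thousand times over after sampling one line, whereas true reservoir sampling should only go over the "tape" once. I've put together some code that goes over the file once with real reservoir sampling, based on this and the various descriptions on the web.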

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class reservoirSampling {

    public static void main(String[] args) throws FileNotFoundException, IOException{
        Sampler mySampler = new Sampler();
        List<String> myList = mySampler.sampler(10);
        for(int index = 0;index<myList.size();index++){
            System.out.println(myList.get(index));
        }
    }
}

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

public class Sampler {

    public Sampler(){}
    public List<String> sampler (int reservoirSize) throws FileNotFoundException, IOException
    {
        // reservoirList is where the selected lines are stored
        List<String> reservoirList = new ArrayList<String>(reservoirSize);
        // count is the number of lines read so far
        int count = 0;

        Random ra = new Random();
        // try-with-resources closes the Scanner once the single pass is done
        try (Scanner sc = new Scanner(new File("Open_source.html")))
        {
            while (sc.hasNextLine())
            {
                String currentLine = sc.nextLine();
                count++;
                if (count <= reservoirSize)
                {
                    // the first reservoirSize lines fill the reservoir
                    reservoirList.add(currentLine);
                }
                else
                {
                    // line number count replaces a random slot with probability reservoirSize/count
                    int randomNumber = ra.nextInt(count);
                    if (randomNumber < reservoirSize)
                    {
                        reservoirList.set(randomNumber, currentLine);
                    }
                }
            }
        }
        return reservoirList;
    }
}

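The basic premise is that you fill up the reservoir, and then each later line replaces a randomly chosen slot in it, with probability reservoirSize/count for line number count. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I knocked it up in half an hour.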

Answer 5

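Use RandomAccessFile:

1. Construct a RandomAccessFile, file
2. Get the length of that file, filelen, by calling file.length()
3. Generate a random number, pos, between 0 and filelen
4. Call file.seek(pos) to seek to the random position
5. Call file.readLine() to get to the end of the current line
6. Read the next line by calling file.readLine() again

Using this method I've been sampling lines from the Brown Corpus at random, and can easily retrieve 1000 random samples from randomly chosen files in a few seconds. If I tried to do the same by reading through each file line by line, it would take a lot longer.

The same principle can be used for selecting a random element from a list. Rather than reading through the list and stopping at a random place, if you generate a random number between 0 and the length of the list, you can index directly into the list.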
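For the list case, a minimal sketch could look like this. The class name RandomElement and the method pick are made up for the example, and the list is assumed to be non-empty and to support fast indexed access (an ArrayList, say).

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper class, named for this sketch only.
final class RandomElement {

    /** Returns a uniformly chosen element; the list must not be empty. */
    static <T> T pick(List<T> items) {
        // nextInt(bound) is uniform over 0 (inclusive) to bound (exclusive), so every index is valid.
        int index = ThreadLocalRandom.current().nextInt(items.size());
        return items.get(index);
    }
}
```

Every element is equally likely here because each index covers exactly one element. A random byte offset in a file does not have that property, since longer lines cover more bytes.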

Answer 6

Read a random line from a file in Java:

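import java.io.IOException;
import java.io.RandomAccessFile;

public String getRandomLineFromTheFile(String filePathWithFileName) throws IOException {
    // try-with-resources closes the file even if a read fails
    try (RandomAccessFile f = new RandomAccessFile(filePathWithFileName, "r")) {
        final long randomLocation = (long) (Math.random() * f.length());
        f.seek(randomLocation);
        f.readLine(); // will move file pointer to the next line
        String randomLine = f.readLine();
        if (randomLine == null) {
            // the random position fell in the last line: wrap around to the first line
            // (still null only if the file is empty)
            f.seek(0);
            randomLine = f.readLine();
        }
        return randomLine;
    }
}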

See the RandomAccessFile documentation.

Answer 7

Use a BufferedReader and read line by line. Use a java.util.Random object to stop at a random point ;)
