Frequency of each word in a 2GB UTF-8 encoded text file

Question

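I am working on a project in which I need to find the frequency of each word in a large corpus of more than 100 million Bengali words. The file is about 2GB. What I actually need are the 20 most frequent and the 20 least frequent words. I wrote the same code in PHP, but it took far too long (it was still running after a week), so I am now trying to do it in Java.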

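The code is supposed to work as follows:

- Read a line from the corpus nahidd_filtered.txt

- Split it on whitespace

- For each split word, read the whole frequency file freq3.txt. If the word is found, increment its frequency count and store it in that file; otherwise set count = 1 (new word) and store that count in the file.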

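I tried to read the large corpus nahidd_filtered.txt piece by piece in a loop and to store the word frequencies in freq3.txt. The freq3.txt file stores the frequency counts like this:

Word1 Frequency1 (separated by a single space)

Word2 Frequency2

...........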

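Put simply, I need the 20 most frequent and the 20 least frequent words, together with their frequency counts, from a large UTF-8 encoded corpus file. Please check the code and tell me why it does not work, or suggest any other approach. Thank you very much.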

import java.io.*;
import java.util.*;
import java.util.concurrent.TimeUnit;

public class Main {


private static String fileToString(String filename) throws IOException {
    FileInputStream inputStream = null;
    Scanner reader = null;
    inputStream = new FileInputStream(filename);
    reader = new Scanner(inputStream, "UTF-8");

    /*BufferedReader reader = new BufferedReader(new FileReader(filename));*/
    StringBuilder builder = new StringBuilder();


    // For every line in the file, append it to the string builder
    while (reader.hasNextLine()) {
        String line = reader.nextLine();
        builder.append(line);
    }

    reader.close();
    return builder.toString();
}

public static final String UTF8_BOM = "\uFEFF";

private static String removeUTF8BOM(String s) {
    if (s.startsWith(UTF8_BOM)) {
        s = s.substring(1);
    }
    return s;
}

public static void main(String[] args) throws IOException {

    long startTime = System.nanoTime();
    System.out.println("-------------- Start Contents of file: ---------------------");
    FileInputStream inputStream = null;
    Scanner sc = null;
    String path = "C:/xampp/htdocs/thesis_freqeuncy_2/nahidd_filtered.txt";
    try {
        inputStream = new FileInputStream(path);
        sc = new Scanner(inputStream, "UTF-8");
        int countWord = 0;
        BufferedWriter writer = null;
        while (sc.hasNextLine()) {
            String word = null;
            String line = sc.nextLine();
            String[] wordList = line.split("\\s+");

            for (int i = 0; i < wordList.length; i++) {
                word = wordList[i].replace("।", "");
                word = word.replace(",", "").trim();
                ArrayList<String> freqword = new ArrayList<>();
                String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
                /*freqword = freq.split("\\r?\\n");*/
                Collections.addAll(freqword, freq.split("\\r?\\n"));
                int flag = 0;
                String[] freqwordsp = null;
                int k;
                for (k = 0; k < freqword.size(); k++) {
                    freqwordsp = freqword.get(k).split("\\s+");
                    String word2 = freqwordsp[0];
                    word = removeUTF8BOM(word);
                    word2 = removeUTF8BOM(word2);
                    word.replaceAll("\\P{Print}", "");
                    word2.replaceAll("\\P{Print}", "");
                    if (word2.toString().equals(word.toString())) {

                        flag = 1;
                        break;
                    }
                }

                int count = 0;
                if (flag == 1) {
                    count = Integer.parseInt(freqwordsp[1]);
                }
                count = count + 1;
                word = word + " " + count + "\n";
                freqword.add(word);

                System.out.println(freqword);
                writer = new BufferedWriter(new FileWriter("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt"));
                writer.write(String.valueOf(freqword));
            }
        }
        // writer.close();
        System.out.println(countWord);
        System.out.println("-------------- End Contents of file: ---------------------");
        long endTime = System.nanoTime();
        long totalTime = (endTime - startTime);
        System.out.println(TimeUnit.MINUTES.convert(totalTime, TimeUnit.NANOSECONDS));

        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (sc != null) {
            sc.close();
        }
    }

}

}

Answer 1

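First of all:

For each split word, read the whole frequency file freq3.txt

Don't do this! Disk I/O is very slow. Do you have enough memory to read the file into memory? It seems so: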

String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
Collections.addAll(freqword, freq.split("\\r?\\n"));

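If you really need this file, load it once and work with it in memory. A Map (word to frequency) is also likely to be more convenient here than a List. Save the collection to disk once the computation is finished.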
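A minimal sketch of the in-memory Map approach, assuming the set of distinct words fits in the heap (the corpus itself is streamed line by line and never held in memory as a whole). The class name `WordCounter` and the method `countWords` are hypothetical names chosen for illustration; the cleanup rules simply mirror those in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the names are illustrative and not from the question.
class WordCounter {

    // Streams the input line by line and keeps only the counts in memory.
    static Map<String, Integer> countWords(BufferedReader reader) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            for (String token : line.split("\\s+")) {
                // Same cleanup as the question: drop the danda (U+0964),
                // commas, and a possible byte order mark (U+FEFF).
                String word = token.replace("\u0964", "")
                                   .replace(",", "")
                                   .replace("\uFEFF", "")
                                   .trim();
                if (!word.isEmpty()) {
                    // Insert 1 for a new word, otherwise add 1 to the old count.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

With this shape, each word costs one hash lookup instead of a full re-read and rewrite of freq3.txt, and the file only has to be written once at the end.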

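Next, you can buffer the input stream, which can improve performance significantly:

inputStream = new BufferedInputStream(new FileInputStream(path));

And don't forget to close the streams/readers/writers, either explicitly or with a try-with-resources statement.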
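As a rough illustration of buffered I/O combined with try-with-resources, the sketch below writes the counts once in the "word count" format described in the question, and reads a file back through a buffered reader. `FrequencyFile`, `save`, and `countLines` are hypothetical names; `Files.newBufferedReader` and `Files.newBufferedWriter` are standard `java.nio.file` calls that return buffered streams.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper; the names are illustrative and not from the question.
class FrequencyFile {

    // Writes one "word count" pair per line, separated by a single space.
    static void save(Map<String, Integer> counts, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                writer.write(entry.getKey() + " " + entry.getValue());
                writer.newLine();
            }
        } // the writer is flushed and closed here, even if an exception is thrown
    }

    // Counts the lines of a UTF-8 file through a buffered reader.
    static long countLines(Path source) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }
}
```

The code in the question never closes its `BufferedWriter` (the `writer.close()` call is commented out), so buffered output may never reach the disk; try-with-resources removes that risk.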

In general, the code can be simplified depending on the API you use.
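One possible simplification for the last step, picking the 20 most and 20 least frequent words out of an in-memory map, is sketched below with the Streams API. This is only an illustration under the assumption that the counts already sit in a `Map<String, Integer>`; `FrequencyRanking` and its methods are hypothetical names, and words with equal counts come out in no particular order.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; the names are illustrative and not from the question.
class FrequencyRanking {

    // Returns up to n entries with the highest counts, highest first.
    static List<Map.Entry<String, Integer>> mostFrequent(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    // Returns up to n entries with the lowest counts, lowest first.
    static List<Map.Entry<String, Integer>> leastFrequent(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Calling `FrequencyRanking.mostFrequent(counts, 20)` and `FrequencyRanking.leastFrequent(counts, 20)` would then give the two lists the question asks for. Sorting every entry is simple rather than optimal; a bounded priority queue would avoid the full sort if that ever became a bottleneck.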
