I am working on a project where I need to find the frequency of every word in a large corpus of more than 100 million Bangla words. The file is about 2 GB. I actually need the 20 most frequent words and the 20 least frequent words. I wrote the same code in PHP, but it took too long (the code was still running after a week), so I am trying to do it in Java.
In this code, it is supposed to work as follows:
- read a line from the corpus nahidd_filtered.txt
- split it on whitespace
I loop over the corpus nahidd_filtered.txt reading large amounts of text, and store the word frequencies in freq3.txt. The freq3.txt file stores the frequency counts like this:
Word1 Frequency1 (with a single space in between)
Word2 Frequency2
...........
Simply put, I need the 20 most frequent and the 20 least frequent words, along with their frequency counts, from the large UTF-8 encoded corpus file. Please check the code and tell me why it does not work, or give me any other suggestions. Thank you very much.
import java.io.*;
import java.util.*;
import java.util.concurrent.TimeUnit;

public class Main {

    private static String fileToString(String filename) throws IOException {
        FileInputStream inputStream = null;
        Scanner reader = null;
        inputStream = new FileInputStream(filename);
        reader = new Scanner(inputStream, "UTF-8");
        /*BufferedReader reader = new BufferedReader(new FileReader(filename));*/
        StringBuilder builder = new StringBuilder();
        // For every line in the file, append it to the string builder
        while (reader.hasNextLine()) {
            String line = reader.nextLine();
            builder.append(line);
        }
        reader.close();
        return builder.toString();
    }

    public static final String UTF8_BOM = "\uFEFF";

    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) throws IOException {
        long startTime = System.nanoTime();
        System.out.println("-------------- Start Contents of file: ---------------------");
        FileInputStream inputStream = null;
        Scanner sc = null;
        String path = "C:/xampp/htdocs/thesis_freqeuncy_2/nahidd_filtered.txt";
        try {
            inputStream = new FileInputStream(path);
            sc = new Scanner(inputStream, "UTF-8");
            int countWord = 0;
            BufferedWriter writer = null;
            while (sc.hasNextLine()) {
                String word = null;
                String line = sc.nextLine();
                String[] wordList = line.split("\\s+");
                for (int i = 0; i < wordList.length; i++) {
                    word = wordList[i].replace("।", "");
                    word = word.replace(",", "").trim();
                    ArrayList<String> freqword = new ArrayList<>();
                    String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
                    /*freqword = freq.split("\\r?\\n");*/
                    Collections.addAll(freqword, freq.split("\\r?\\n"));
                    int flag = 0;
                    String[] freqwordsp = null;
                    int k;
                    for (k = 0; k < freqword.size(); k++) {
                        freqwordsp = freqword.get(k).split("\\s+");
                        String word2 = freqwordsp[0];
                        word = removeUTF8BOM(word);
                        word2 = removeUTF8BOM(word2);
                        word.replaceAll("\\P{Print}", "");
                        word2.replaceAll("\\P{Print}", "");
                        if (word2.toString().equals(word.toString())) {
                            flag = 1;
                            break;
                        }
                    }
                    int count = 0;
                    if (flag == 1) {
                        count = Integer.parseInt(freqwordsp[1]);
                    }
                    count = count + 1;
                    word = word + " " + count + "\n";
                    freqword.add(word);
                    System.out.println(freqword);
                    writer = new BufferedWriter(new FileWriter("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt"));
                    writer.write(String.valueOf(freqword));
                }
            }
            // writer.close();
            System.out.println(countWord);
            System.out.println("-------------- End Contents of file: ---------------------");
            long endTime = System.nanoTime();
            long totalTime = (endTime - startTime);
            System.out.println(TimeUnit.MINUTES.convert(totalTime, TimeUnit.NANOSECONDS));
            // note that Scanner suppresses exceptions
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (sc != null) {
                sc.close();
            }
        }
    }
}
First:

    For every split word, you read the entire frequency file freq3.txt.

Don't do that! Disk IO operations are very slow. Do you have enough memory to read the file into memory? Apparently yes:

    String freq = fileToString("C:/xampp/htdocs/thesis_freqeuncy_2/freq3.txt");
    Collections.addAll(freqword, freq.split("\\r?\\n"));

If you really need this file, then load it once and work with it in memory. Also, in this case a Map (word to frequency) will probably be more convenient than a List. When the counting is finished, save the collection to disk.
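A minimal sketch of that in-memory approach (the class and method names here are illustrative, not from the question; the word-cleaning steps mirror the `replace`/`trim` calls in the original code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Count word frequencies from any character stream in a single pass,
    // keeping the whole table in memory instead of re-reading a file per word.
    static Map<String, Integer> countWords(Reader source) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.split("\\s+")) {
                // Strip the Bangla danda and commas, as the original code does
                word = word.replace("।", "").replace(",", "").trim();
                if (!word.isEmpty()) {
                    freq.merge(word, 1, Integer::sum); // increment in memory
                }
            }
        }
        return freq;
    }
}
```

One pass over the 2 GB corpus with a single write of freq3.txt at the end, instead of re-reading and rewriting freq3.txt for every word.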
Next, you can buffer the input stream, which can significantly improve performance:

    inputStream = new BufferedInputStream(new FileInputStream(path));

And don't forget to close the streams/readers/writers, either explicitly or with a try-with-resources statement.

In general, the code can be simplified considerably depending on the API used.
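For example, the whole task could be written compactly with the Stream API. This is a sketch under the question's assumptions (file path and cleaning rules taken from the question; the class and the `extremes` helper are illustrative names):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TopWords {
    // Return the `limit` entries with the highest (or lowest) counts.
    static List<Map.Entry<String, Long>> extremes(
            Map<String, Long> freq, int limit, boolean highest) {
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        return freq.entrySet().stream()
                .sorted(highest ? byCount.reversed() : byCount)
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // try-with-resources closes the underlying file automatically
        try (Stream<String> lines = Files.lines(
                Paths.get("C:/xampp/htdocs/thesis_freqeuncy_2/nahidd_filtered.txt"),
                StandardCharsets.UTF_8)) {
            Map<String, Long> freq = lines
                    .flatMap(line -> Stream.of(line.split("\\s+")))
                    .map(w -> w.replace("।", "").replace(",", "").trim())
                    .filter(w -> !w.isEmpty())
                    .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
            System.out.println("Top 20:    " + extremes(freq, 20, true));
            System.out.println("Bottom 20: " + extremes(freq, 20, false));
        }
    }
}
```

Note that sorting the full map is the simplest correct approach; for very large vocabularies a bounded priority queue per extreme would avoid sorting millions of entries.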