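I am trying to split a text file using multiple threads. The file is 1 GB. I am reading it char by char, and the execution time is 24 min 54 s. Is there a better way than reading char by char that would reduce the execution time? I am having trouble finding one. If there is a better overall approach to splitting a file with multiple threads, please suggest that too. I am new to Java.
Any help would be appreciated. :)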
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// The enclosing class declaration was missing from the post; the name FileSplitter is a placeholder.
public class FileSplitter {
    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis();
        List<String> filePositionList = new ArrayList<String>();
        try (RandomAccessFile raf = new RandomAccessFile("D:\\sample\\file.txt", "r")) {
            long numSplits = 10;
            long sourceSize = raf.length();
            System.out.println("file length:" + sourceSize);
            long bytesPerSplit = sourceSize / numSplits;
            long startPosition = 0;
            long endPosition = bytesPerSplit;
            for (int i = 0; i < numSplits; i++) {
                if (i == numSplits - 1 || endPosition >= sourceSize) {
                    // The last split takes everything that is left.
                    endPosition = sourceSize;
                } else {
                    // Move the split point forward to the end of the current line.
                    raf.seek(endPosition);
                    raf.readLine();
                    endPosition = raf.getFilePointer();
                }
                filePositionList.add(startPosition + "|" + endPosition);
                if (endPosition >= sourceSize) {
                    break;
                }
                startPosition = endPosition;
                endPosition = startPosition + bytesPerSplit;
            }
        }
        List<MultithreadedSplit> workers = new ArrayList<MultithreadedSplit>();
        for (String str : filePositionList) {
            String[] strArr = str.split("\\|");
            long startPositionFile = Long.parseLong(strArr[0]);
            long endPositionFile = Long.parseLong(strArr[1]);
            MultithreadedSplit objMultithreadedSplit = new MultithreadedSplit(startPositionFile, endPositionFile);
            objMultithreadedSplit.start();
            workers.add(objMultithreadedSplit);
        }
        // Wait for every worker to finish; otherwise only the thread start-up is timed.
        for (MultithreadedSplit worker : workers) {
            worker.join();
        }
        long endTime = System.currentTimeMillis();
        System.out.println("It took " + (endTime - startTime) + " milliseconds");
    }
}
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class MultithreadedSplit extends Thread {
    private final long start; // absolute file offset of the first byte to copy
    private final long end;   // absolute file offset one past the last byte to copy

    public MultithreadedSplit(long startPos, long endPos) {
        start = startPos;
        end = endPos;
    }

    @Override
    public void run() {
        String threadName = Thread.currentThread().getName();
        String outFile = "out_" + threadName + ".txt";
        System.out.println("Thread Reading started for start:" + start + ";End:" + end + ";threadname:" + threadName);
        try (RandomAccessFile file = new RandomAccessFile("D:\\sample\\file.txt", "r");
             FileOutputStream out2 = new FileOutputStream("D:\\sample\\" + outFile)) {
            file.seek(start);
            StringBuilder objBuilder = new StringBuilder();
            long position = start;
            // Each read() call fetches a single byte without buffering; this loop is the slow part.
            while (position < end) {
                int b = file.read(); // keep the int so that -1 (end of file) can be detected
                if (b == -1) {
                    break;
                }
                position++;
                objBuilder.append((char) b);
                if (b == '\n') {
                    // ISO_8859_1 maps each char back to the byte it was read from.
                    out2.write(objBuilder.toString().getBytes(StandardCharsets.ISO_8859_1));
                    objBuilder.setLength(0);
                }
            }
            // Write a trailing line that has no final newline.
            if (objBuilder.length() > 0) {
                out2.write(objBuilder.toString().getBytes(StandardCharsets.ISO_8859_1));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
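The fastest way is to map the file into memory segment by segment (mapping the whole large file at once may cause unwanted side effects). This skips some relatively expensive copy operations: the operating system loads the file into RAM, and the JRE exposes it to your application as a view of an off-heap memory area in the form of a ByteBuffer. It will usually let you squeeze out the last 2x/3x of performance.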
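The memory-mapped approach requires quite a bit of helper code (see the snippet at the bottom), and it is not always the best tactical choice. Instead, if your input is line-based and you just need reasonable performance (which you probably don't have right now), then simply do the following: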
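import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;
...
// try-with-resources closes the underlying file when the stream is done
try (Stream<String> lines = Files.lines(Paths.get("/path/to/the/file"), StandardCharsets.ISO_8859_1)) {
    lines
        // .parallel() // parallel processing is still possible
        .forEach(line -> { /* your code goes here */ });
}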
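Applied to the splitting task in the question, the line-based approach could look roughly like the sketch below. It is only an illustration under assumptions: the class name `LineSplitter`, the `part_N.txt` output names, and the "roughly equal size" policy are all made up here, and it runs in a single thread using the buffered readers and writers from `java.nio.file.Files`. Note that `readLine()` strips the line terminator, so the output always uses `\n`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class LineSplitter {
    /** Copies the lines of source into at most numParts files of roughly equal size. */
    static List<Path> split(Path source, Path outDir, int numParts) throws IOException {
        // ISO_8859_1 is one byte per char, so String.length() equals the byte count.
        Charset cs = StandardCharsets.ISO_8859_1;
        long targetSize = Math.max(1, Files.size(source) / numParts);
        List<Path> parts = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, cs)) {
            BufferedWriter writer = null;
            long written = 0;
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Open a new part once the current one is full; the last part takes the rest.
                    if (writer == null || (written >= targetSize && parts.size() < numParts)) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path part = outDir.resolve("part_" + parts.size() + ".txt");
                        parts.add(part);
                        writer = Files.newBufferedWriter(part, cs);
                        written = 0;
                    }
                    writer.write(line);
                    writer.write('\n');
                    written += line.length() + 1;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return parts;
    }
}
```

A call such as `LineSplitter.split(Paths.get("/path/to/the/file"), Paths.get("/path/to/out"), 10)` would produce up to ten part files. Because every read and write goes through a buffer, this avoids the per-byte calls of the original loop; whether adding threads on top helps will depend on the disk.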
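For comparison, a working example of code that processes a file through memory mapping is shown below. With fixed-size records (when segments can be chosen to align exactly with record boundaries), subsequent segments can be processed in parallel.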
static ByteBuffer mapFileSegment(FileChannel fileChannel, long fileSize, long regionOffset, long segmentSize) throws IOException {
long regionSize = Math.min(segmentSize, fileSize - regionOffset);
// small last region prevention
final long remainingSize = fileSize - (regionOffset + regionSize);
if (remainingSize < segmentSize / 2) {
regionSize += remainingSize;
}
return fileChannel.map(FileChannel.MapMode.READ_ONLY, regionOffset, regionSize);
}
...
final ToIntFunction<ByteBuffer> consumer = ...
try (FileChannel fileChannel = FileChannel.open(Paths.get("/path/to/file"), StandardOpenOption.READ)) {
final long fileSize = fileChannel.size();
long regionOffset = 0;
while (regionOffset < fileSize) {
final ByteBuffer regionBuffer = mapFileSegment(fileChannel, fileSize, regionOffset, segmentSize);
while (regionBuffer.hasRemaining()) {
final int usedBytes = consumer.applyAsInt(regionBuffer);
if (usedBytes == 0)
break;
}
regionOffset += regionBuffer.position();
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
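The snippet above leaves the `consumer` and `segmentSize` unspecified. As one possible way to fill them in, here is a minimal self-contained sketch that maps a file segment by segment and counts newline bytes. The class name `MappedLineCounter` and the choice of counting newlines are assumptions made for illustration; counting single bytes was picked because it gives the right answer wherever the segment boundaries fall, so no record alignment is needed. It also assumes `segmentSize` is positive and no larger than `Integer.MAX_VALUE`, which is the limit for a single mapping.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.ToIntFunction;

// Hypothetical example class, not part of the original answer.
class MappedLineCounter {
    /** Counts '\n' bytes in the file, mapping at most segmentSize bytes at a time. */
    static long countNewlines(Path path, long segmentSize) throws IOException {
        long[] total = {0};
        // The consumer drains the buffer and reports how many bytes it used.
        ToIntFunction<ByteBuffer> consumer = buffer -> {
            int used = buffer.remaining();
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    total[0]++;
                }
            }
            return used;
        };
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long fileSize = channel.size();
            long offset = 0;
            while (offset < fileSize) {
                long size = Math.min(segmentSize, fileSize - offset);
                ByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
                consumer.applyAsInt(region);
                offset += size;
            }
        }
        return total[0];
    }
}
```

For a 1 GB file, a segment size in the range of tens of megabytes, for example `64L * 1024 * 1024`, would be a reasonable starting point to experiment with.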