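I am writing a C# function that splits a large delimited file into smaller delimited files. I am writing it because a 2.7 GB file takes hours to ETL and is a bottleneck for the ETL batch. I assign a maximum number of lines per file and start a new file once that limit is reached. If the file has a header, I write the header to every file.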
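Initially, I tried StreamReader and StreamWriter, which many C# posts recommend because they are supposed to handle large files by reading one line at a time instead of holding everything in memory. I also explored File.ReadLines and File.WriteAllLines, since they are recommended as memory-efficient options as well.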
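Both approaches work for smaller files; however, on the 2.7 GB file with 1238951 lines, both fail with an OutOfMemoryException. Surprisingly, both successfully process 1238950 lines before throwing, leaving only one line unprocessed. My machine has 32 GB of RAM, yet that is apparently still not enough. What is happening here? Am I missing something?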
internal void SplitFileWithStream(string localFilePath, bool hasHeader)
{
    try
    {
        var targetDirectory = Path.GetDirectoryName(localFilePath);
        var fileNameWithoutExtension = Path.GetFileNameWithoutExtension(localFilePath);
        var extension = Path.GetExtension(localFilePath);
        var fileSuffix = 0;
        var maxLinesPerFile = 100000;
        string? header = null;
        using (var sr = new StreamReader(localFilePath))
        {
            while (!sr.EndOfStream)
            {
                var lineNumber = 0;
                var newFileName = $"{fileNameWithoutExtension}__split_{(++fileSuffix)}{extension}";
                var newFilePath = Path.Combine(targetDirectory, newFileName);
                if (File.Exists(newFilePath))
                    File.Delete(newFilePath);
                using (var sw = new StreamWriter(newFilePath))
                {
                    if (!sr.EndOfStream && hasHeader)
                    {
                        if (header == null)
                        {
                            header = sr.ReadLine();
                            maxLinesPerFile++; //add one to max for header
                        }
                        sw.WriteLine(header);
                        lineNumber++;
                    }
                    while (!sr.EndOfStream && lineNumber < maxLinesPerFile)
                    {
                        sw.WriteLine(sr.ReadLine());
                        lineNumber++;
                    }
                    sw.Close();
                }
            }
            sr.Close();
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.ReadLine()
   at MyFileUtilities.FileSplitter.SplitFileWithStream(String localFilePath, Boolean hasHeader) in C:\Repos\MySolution\MyFileUtilitites\FileSplitter.cs:line 341
internal void SplitFileWithReadLines(string localFilePath, bool hasHeader)
{
    try
    {
        var targetDirectory = Path.GetDirectoryName(localFilePath);
        var fileNameWithoutExtension = Path.GetFileNameWithoutExtension(localFilePath);
        var fileName = Path.GetFileName(localFilePath);
        var extension = Path.GetExtension(localFilePath);
        var fileSuffix = 0;
        var maxLinesPerFile = 100000;
        long position = 0;
        string? header = null;
        int i = 0;
        int skip = 0;
        string workingFilePath = localFilePath;
        while (true)
        {
            int take = maxLinesPerFile;
            if (hasHeader && header == null)
            {
                header = File.ReadLines(localFilePath).Take(1).First();
                skip = 1;
            }
            var linesToSplice = File.ReadLines(workingFilePath).Skip(skip).Take(take);
            if (!linesToSplice.Any())
                return;
            linesToSplice.Append(Environment.NewLine); //added just to see if it would try to write the actual last line, but did not work.
            var newFileName = $"{fileNameWithoutExtension}__split_{(++i)}{extension}";
            var newFilePath = Path.Combine(targetDirectory, newFileName);
            File.WriteAllLines(newFilePath, linesToSplice);
            skip += take;
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.ReadLine()
   at System.IO.ReadLinesIterator.MoveNext()
   at System.Linq.Enumerable.EnumerablePartition`1.MoveNext()
   at System.IO.File.InternalWriteAllLines(TextWriter writer, IEnumerable`1 contents)
   at System.IO.File.WriteAllLines(String path, IEnumerable`1 contents)
   at MyFileUtilities.FileSplitter.SplitFileWithReadLines(String localFilePath, Boolean hasHeader) in C:\Repos\MySolution\MyFileUtilitites\FileSplitter.cs\FileSplitter.cs:line 402
I realize there are Linux commands and other tools that can do this, but if possible I would like a C# solution so that I can add it to my library of C# utilities.
Edit: I tried
File.ReadLines(workingFilePath).Skip(skip).Take(take).ToList();
and got another OutOfMemoryException.
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.ReadLine()
   at System.IO.ReadLinesIterator.MoveNext()
   at System.Linq.Enumerable.EnumerablePartition`1.ToList()
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at MyFileUtilities.FileSplitter.SplitFileWithReadLines(String localFilePath, Boolean hasHeader) in ...
File.ReadLines returns an IEnumerable<string>, which is evaluated lazily.
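To make the lazy behaviour concrete, here is a minimal sketch. `FakeReadLines` and `Measure` are hypothetical names invented for this illustration; `FakeReadLines` is only a stand-in for File.ReadLines that counts how many lines are pulled from the source. It suggests two things: building the `Skip(...).Take(...)` query reads nothing, and enumerating it has to walk past every skipped line first, so each pass of a `skip += take` loop starts again from the top of the file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

internal static class LazyEnumerationDemo
{
    // Number of lines pulled from the fake source so far.
    private static int _pulled;

    // Hypothetical stand-in for File.ReadLines: yields lines one at a time, on demand.
    private static IEnumerable<string> FakeReadLines(int total)
    {
        for (var i = 0; i < total; i++)
        {
            _pulled++;
            yield return $"line {i}";
        }
    }

    // Returns how many lines had been pulled before and after enumeration,
    // plus the number of lines the query actually produced.
    internal static (int BeforeEnumeration, int AfterEnumeration, int Count) Measure(
        int total, int skip, int take)
    {
        _pulled = 0;

        // Building the query does not read anything yet.
        var query = FakeReadLines(total).Skip(skip).Take(take);
        var before = _pulled;

        // Enumeration starts here; the skipped lines are read and discarded first.
        var lines = query.ToList();

        return (before, _pulled, lines.Count);
    }
}
```

Calling `LazyEnumerationDemo.Measure(10, 5, 2)` should report zero lines pulled before enumeration and at least `skip + take` lines pulled afterwards, even though only two lines are returned.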
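Try
File.ReadLines(workingFilePath).Skip(skip).Take(take).ToList();
to reduce memory pressure. From the call stack you posted, it is clear that the WriteAllLines call is the first statement that starts enumerating the IEnumerable, and it ends up looping through everything, which is exactly what you wanted to avoid in the first place.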
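As a rough sketch of how the repeated enumeration could be avoided, the splitter below reads the input exactly once and keeps only the current line (plus the header) in memory. The class name `SinglePassSplitter` and the `openWriter` callback are assumptions made for this illustration, not part of the original code; the callback receives the 1-based chunk number and returns the writer for that chunk, which also makes the method easy to exercise with in-memory readers and writers.

```csharp
using System;
using System.IO;

internal static class SinglePassSplitter
{
    // Splits the input into chunks of at most maxLinesPerFile data lines,
    // repeating the header (if any) at the top of every chunk.
    // Returns the number of chunks written.
    internal static int Split(
        TextReader reader,
        bool hasHeader,
        int maxLinesPerFile,
        Func<int, TextWriter> openWriter) // hypothetical factory: chunk number -> writer
    {
        if (maxLinesPerFile <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxLinesPerFile));

        // The header is read once and is not counted against the per-file limit.
        string? header = hasHeader ? reader.ReadLine() : null;

        var chunk = 0;
        var linesInChunk = 0;
        TextWriter? writer = null;
        try
        {
            string? line;
            while ((line = reader.ReadLine()) != null)
            {
                // Open the next chunk only when there is a line to put in it.
                if (writer == null)
                {
                    writer = openWriter(++chunk);
                    if (header != null)
                        writer.WriteLine(header);
                }

                writer.WriteLine(line);

                // Close the chunk as soon as it is full.
                if (++linesInChunk == maxLinesPerFile)
                {
                    writer.Dispose();
                    writer = null;
                    linesInChunk = 0;
                }
            }
        }
        finally
        {
            writer?.Dispose();
        }

        return chunk;
    }
}
```

With real files this could be called with a `new StreamReader(localFilePath)` and a callback such as `n => new StreamWriter(Path.Combine(targetDirectory, $"{fileNameWithoutExtension}__split_{n}{extension}"))`. Note that `ReadLine` still builds each line as a single string, so this sketch only removes the re-reading and buffering of many lines; it does not bound the memory needed for one extremely long line.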