I have a fairly large CSV dataset, around 13.5 MB, with roughly 120,000 rows and 13 columns. The code below is my current solution.
    private IEnumerator readDataset()
    {
        starsRead = 0;
        var totalLines = File.ReadLines(path).Count();
        totalStars = totalLines - 1;
        string firstLine = File.ReadLines(path).First();
        int columnCount = firstLine.Count(f => f == ',');
        string[,] datasetTable = new string[totalStars, columnCount];
        int lineLength;
        char bufferChar;
        var bufferString = new StringBuilder();
        int column;
        int row;
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            string line = sr.ReadLine();
            while ((line = sr.ReadLine()) != null)
            {
                row = 0;
                column = 0;
                lineLength = line.Length;
                for (int i = 0; i < lineLength; i++)
                {
                    bufferChar = line[i];
                    if (bufferChar == ',')
                    {
                        datasetTable[row, column] = bufferString.ToString();
                        column++;
                    }
                    else
                    {
                        bufferString.Append(bufferChar);
                    }
                }
                row++;
                starsRead++;
                yield return null;
            }
        }
    }
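Fortunately, the program does not freeze when I run this through a Unity coroutine, but this current solution takes 31 minutes and 44 seconds to read the entire CSV file.
Is there any other way to do this? I am trying to get the parsing time under 1 minute.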
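The basic mistake you are making is that you process only one single line per frame, so you can roughly calculate how long it will take at about 60 fps:

    120,000 rows / 60 fps = 2000 seconds = 33.3333 minutes

This is due to the yield return null; inside the while loop, which runs once for every line.

You can speed this up a lot simply by using a Stopwatch, for example:

    var stopWatch = new Stopwatch();
    stopWatch.Start();

    // Use the last frame duration as a guide for how long one frame should take
    var targetMilliseconds = Time.deltaTime * 1000f;

    while ((line = sr.ReadLine()) != null)
    {
        ....

        // If you are too long in this frame render one and continue in the next frame
        // otherwise keep going with the next line
        if (stopWatch.ElapsedMilliseconds > targetMilliseconds)
        {
            yield return null;
            stopWatch.Restart();
        }
    }

This lets you work through multiple lines within one frame while still trying to keep a 60 fps frame rate. You may want to experiment a bit to find a good trade-off between frame rate and total duration. For example, you could allow it to run at only 30 fps but import faster, since it can then handle more lines in one frame.

Then, you should not read "manually" through every byte/char. Rather use the built-in methods for this, such as Regex.Split, or Regex.Matches as in the following version: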
    // Requires: using System.Collections; using System.Collections.Generic; using System.Diagnostics;
    //           using System.IO; using System.Linq; using System.Text.RegularExpressions; using UnityEngine;
    private const char Quote = '\"';
    private const string LineBreak = "\r\n";
    private const string DoubleQuote = "\"\"";

    private IEnumerator readDataset(string path)
    {
        starsRead = 0;
        var totalLines = File.ReadLines(path).Count();
        totalStars = totalLines - 1;

        string firstLine = File.ReadLines(path).First();
        // Number of commas in the header line, as in the question's code
        int columnCount = firstLine.Count(f => f == ',');
        string[,] datasetTable = new string[totalStars, columnCount];

        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            using (BufferedStream bs = new BufferedStream(fs))
            {
                using (StreamReader sr = new StreamReader(bs))
                {
                    var stopWatch = new Stopwatch();
                    stopWatch.Start();

                    // Use the last frame duration as a guide for how long one frame should take.
                    // You can also experiment with hardcoded target frame rates, e.g. "1000f / 30" for 30 fps.
                    var targetMilliseconds = Time.deltaTime * 1000f;

                    // Skip the header line
                    string row = sr.ReadLine();
                    var columns = new List<string>();

                    while ((row = sr.ReadLine()) != null)
                    {
                        // Look for the following expressions:
                        // (?<x>(?=[,\r\n]+))      --> an empty match group (?<x>...) at a position directly followed by a , a \r or a \n (?=[...]), i.e. an empty field
                        // OR |
                        // ""(?<x>([^""]|"""")+)"" --> a field wrapped in quotes (written as "" inside the verbatim string); the group takes every character that is not a quote [^""] or is a doubled quote
                        // OR |
                        // (?<x>[^,\r\n]+)),?)     --> a match group (?<x>...) that does not contain , \r or \n
                        var matches = Regex.Matches(row, @"(((?<x>(?=[,\r\n]+))|""(?<x>([^""]|"""")+)""|(?<x>[^,\r\n]+)),?)", RegexOptions.ExplicitCapture);

                        foreach (Match match in matches)
                        {
                            columns.Add(match.Groups[1].Value == "\"\"" ? "" : match.Groups[1].Value.Replace("\"\"", Quote.ToString()));
                        }

                        // If the last character is a `,` then there is an empty item missing at the end
                        if (row.Length > 0 && row[row.Length - 1].Equals(','))
                        {
                            columns.Add("");
                        }

                        for (var colIndex = 0; colIndex < Mathf.Min(columnCount, columns.Count); colIndex++)
                        {
                            datasetTable[starsRead, colIndex] = columns[colIndex];
                        }

                        columns.Clear();
                        starsRead++;

                        // If you are too long in this frame render one and continue in the next frame
                        // otherwise keep going with the next line
                        if (stopWatch.ElapsedMilliseconds > targetMilliseconds)
                        {
                            yield return null;
                            stopWatch.Restart();
                        }
                    }
                }
            }
        }
    }
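The pattern is quite complex and covers some special cases. If you know the format of your CSV well, you can probably also just stick to a plain Split and always split on the comma:

    var columns = row.Split(new []{ ','});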
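To make the difference between the two approaches concrete, here is a small standalone sketch (plain C#, no Unity) that runs both on the same rows. The class name, the method names and the sample rows are made up for illustration; only the pattern is copied from the readDataset code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvSplitComparison
{
    // Pattern copied from the readDataset code above.
    private const string Pattern =
        @"(((?<x>(?=[,\r\n]+))|""(?<x>([^""]|"""")+)""|(?<x>[^,\r\n]+)),?)";

    // Naive approach: cut at every comma, whether it is inside quotes or not.
    public static string[] SplitPlain(string row)
    {
        return row.Split(new[] { ',' });
    }

    // Quote-aware approach: same logic as the loop body in readDataset.
    public static List<string> SplitWithPattern(string row)
    {
        var columns = new List<string>();
        foreach (Match match in Regex.Matches(row, Pattern, RegexOptions.ExplicitCapture))
        {
            var value = match.Groups["x"].Value;
            columns.Add(value == "\"\"" ? "" : value.Replace("\"\"", "\""));
        }

        // A trailing comma means one more (empty) field at the end.
        if (row.Length > 0 && row[row.Length - 1] == ',')
        {
            columns.Add("");
        }

        return columns;
    }

    public static void Main()
    {
        // Made-up sample rows: one simple, one with a comma inside a quoted field.
        var simple = "Sirius,A1V,,8.6";
        var quoted = "Sirius,\"A1V, binary\",,8.6";

        Console.WriteLine("plain/simple:   " + SplitPlain(simple).Length);
        Console.WriteLine("pattern/simple: " + SplitWithPattern(simple).Count);
        Console.WriteLine("plain/quoted:   " + SplitPlain(quoted).Length);
        Console.WriteLine("pattern/quoted: " + SplitWithPattern(quoted).Count);
    }
}
```

On the simple row both approaches should agree. On the quoted row the plain split cuts the quoted field in two, so it is only safe if your file never contains commas inside fields.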
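In general: I would guess it would be even faster not to use a coroutine at all, but to do the whole thing in a thread and only return the result. File IO and string parsing are always slow.

30 minutes is far too slow! You may be running into a memory problem. Open Task Manager while the code is running to check whether you have hit the memory limit.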
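The suggestion above to do the whole parse in a thread and only return the result could look roughly like this. This is a minimal sketch, not the answer's own code: BackgroundCsvLoader, Parse and LoadAsync are hypothetical names, it uses a plain Split (so it assumes no quoted commas), and it assumes your Unity scripting runtime supports System.Threading.Tasks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BackgroundCsvLoader
{
    // Parses everything from the reader: one string[] per data row.
    // The first line is treated as a header and skipped.
    public static string[][] Parse(TextReader reader)
    {
        var rows = new List<string[]>();
        reader.ReadLine(); // skip the header line

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
            {
                continue; // ignore blank lines
            }
            rows.Add(line.Split(','));
        }

        return rows.ToArray();
    }

    // Runs the parse on a thread-pool thread and hands back only the finished result.
    public static Task<string[][]> LoadAsync(string path)
    {
        return Task.Run(() =>
        {
            using (var reader = new StreamReader(path))
            {
                return Parse(reader);
            }
        });
    }
}
```

In a Unity coroutine you could start LoadAsync once, yield return null while the task's IsCompleted is false, and then read Result on the main thread. The worker itself should not touch Unity objects, since most of the Unity API is only usable from the main thread.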