Fast ways to read a large CSV file

Question (votes: 0, answers: 3)

I have a fairly large CSV dataset, about 13.5MB, with roughly 120,000 rows and 13 columns. The code below is my current solution.

private IEnumerator readDataset()
{
    starsRead = 0;
    var totalLines = File.ReadLines(path).Count();
    totalStars = totalLines - 1;

    string firstLine = File.ReadLines(path).First();
    int columnCount = firstLine.Count(f => f == ',');

    string[,] datasetTable = new string[totalStars, columnCount];

    int lineLength;
    char bufferChar;
    var bufferString = new StringBuilder();
    int column;
    int row;

    using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        string line = sr.ReadLine();
        while ((line = sr.ReadLine()) != null)
        {
            row = 0;
            column = 0;
            lineLength = line.Length;
            for (int i = 0; i < lineLength; i++)
            {
                bufferChar = line[i];
                if (bufferChar == ',')
                {
                    datasetTable[row, column] = bufferString.ToString();
                    column++;
                }
                else
                {
                    bufferString.Append(bufferChar);
                }
            }
            row++;
            starsRead++;
            yield return null;
        }
    }
}

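Fortunately, because I run this through a Unity coroutine, the program does not freeze, but this solution takes 31 minutes and 44 seconds to read the whole CSV file.

Is there another way to do this? I am trying to keep the parsing time under 1 minute.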

c# csv unity3d streamreader
3 Answers
0 votes

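The basic mistake you are making is processing only one single line per frame, so you can roughly calculate how long it will take at about 60fps:

120,000 rows / 60fps = 2000 seconds = 33.3333 minutes

This is caused by the yield return null; that runs after every line.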
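The same arithmetic can be generalized to see how much batching several lines per frame would help. This is only a rough sketch: the class ImportEstimate and its parameters are hypothetical names, and it assumes the frame rate stays constant and ignores the actual parsing cost.

```csharp
using System;

public static class ImportEstimate
{
    // Estimated seconds when the coroutine yields once every `linesPerFrame` lines
    // and the game renders `framesPerSecond` frames per second.
    // Assumption: parsing itself is free; only the one-frame wait per yield counts.
    public static double Seconds(int rows, double framesPerSecond, int linesPerFrame)
    {
        if (framesPerSecond <= 0) throw new ArgumentOutOfRangeException(nameof(framesPerSecond));
        if (linesPerFrame <= 0) throw new ArgumentOutOfRangeException(nameof(linesPerFrame));

        double frames = Math.Ceiling(rows / (double)linesPerFrame);
        return frames / framesPerSecond;
    }
}
```

With one line per frame, ImportEstimate.Seconds(120000, 60, 1) reproduces the 2000 seconds above. With 1,000 lines per frame the wait caused by yielding shrinks to about 2 seconds, so the frame yield rather than the file size dominates the 31 minutes.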

I think you can speed this up considerably just by using a Stopwatch, like this:

    var stopWatch = new Stopwatch();
    stopWatch.Start();

    // Use the last frame duration as a guide for how long one frame should take
    var targetMilliseconds = Time.deltaTime * 1000f;

    while ((line = sr.ReadLine()) != null)
    {
        // ....

        // If this frame has already taken too long, render it and continue in the next frame;
        // otherwise keep going with the next line
        if (stopWatch.ElapsedMilliseconds > targetMilliseconds)
        {
            yield return null;
            stopWatch.Restart();
        }
    }

This lets you work through multiple lines within one frame while still trying to keep a 60fps frame rate. You may want to experiment a bit to find a good trade-off between frame rate and duration. For example, maybe you can let it run at only 30fps but import faster, because that way it can process more lines in one frame.

Then, you should not read through every byte/char "manually". Instead use the built-in methods, e.g. Regex.Split. The following version uses Regex.Matches:

    // Namespaces needed at the top of the script
    using System.Collections;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;
    using UnityEngine;

    private const char Quote = '\"';
    private const string LineBreak = "\r\n";
    private const string DoubleQuote = "\"\"";

    private IEnumerator readDataset(string path)
    {
        starsRead = 0;
        var totalLines = File.ReadLines(path).Count();
        totalStars = totalLines - 1;

        string firstLine = File.ReadLines(path).First();
        // Number of columns = number of commas in the header + 1
        int columnCount = firstLine.Count(f => f == ',') + 1;

        string[,] datasetTable = new string[totalStars, columnCount];

        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            // Use the last frame duration as a guide for how long one frame should take.
            // You can also experiment with hardcoded target frame rates, e.g. "1000f / 30" for 30fps
            var targetMilliseconds = Time.deltaTime * 1000f;

            // Skip the header line
            string row = sr.ReadLine();
            var columns = new List<string>();

            while ((row = sr.ReadLine()) != null)
            {
                // The pattern is a verbatim string, so "" stands for one double-quote character.
                // It looks for one of the following, each optionally followed by a comma:
                // (?<x>(?=[,\r\n]+))      --> an empty field: group x is empty and is directly followed by , \r or \n
                // ""(?<x>([^""]|"""")+)"" --> a field wrapped in double quotes: group x holds its content,
                //                             made of non-quote characters or doubled (escaped) quotes
                // (?<x>[^,\r\n]+)         --> an unquoted field: group x contains no , \r or \n
                var matches = Regex.Matches(row, @"(((?<x>(?=[,\r\n]+))|""(?<x>([^""]|"""")+)""|(?<x>[^,\r\n]+)),?)", RegexOptions.ExplicitCapture);

                foreach (Match match in matches)
                {
                    // With ExplicitCapture the named group x is the only capture group, so it has index 1.
                    // A field consisting of just "" is an empty quoted field; otherwise unescape doubled quotes.
                    columns.Add(match.Groups[1].Value == "\"\"" ? "" : match.Groups[1].Value.Replace("\"\"", Quote.ToString()));
                }

                // If the row ends with a `,` then an empty item is missing at the end
                if (row.Length > 0 && row[row.Length - 1].Equals(','))
                {
                    columns.Add("");
                }

                for (var colIndex = 0; colIndex < Mathf.Min(columnCount, columns.Count); colIndex++)
                {
                    datasetTable[starsRead, colIndex] = columns[colIndex];
                }

                columns.Clear();
                starsRead++;

                // If this frame has already taken too long, render it and continue in the next frame;
                // otherwise keep going with the next line
                if (stopWatch.ElapsedMilliseconds > targetMilliseconds)
                {
                    yield return null;
                    stopWatch.Restart();
                }
            }
        }
    }

This code is fairly complex and covers some special cases. If you know your CSV format very well, you can probably also just stick to a simple split such as Regex.Split, or always split on , with

    var columns = row.Split(new[] { ',' });
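To illustrate where a plain split on , stops being enough, here is a minimal sketch of a hand-written splitter that honors double-quoted fields and doubled quotes. The class CsvLine is a hypothetical helper, not part of the answer's code, and it assumes that a field never contains a line break (each line comes from ReadLine).

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits one CSV line on commas, honoring double-quoted fields and "" escapes.
    // Assumption: no embedded line breaks inside a quoted field.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c); // commas inside quotes are ordinary characters
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // a doubled quote is a literal quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear(); // start the next field with an empty buffer
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // the last field has no trailing comma
        return fields;
    }
}
```

For the line a,"b,c", this returns three fields (a, then b,c, then an empty string), whereas splitting on every comma would produce four. Clearing the buffer after each field and adding the final field after the loop are the two steps that a character-by-character loop most easily misses.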


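In general: I guess it might be faster not to use a coroutine at all, but instead to do the entire thing in a thread and only return the result! File IO and string parsing are always slow.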
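The thread-based idea could look roughly like the following sketch. CsvBackgroundLoader is a hypothetical name, and the code assumes a header row and plain comma-separated fields without quoting. It uses Task.Run from System.Threading.Tasks and deliberately touches no Unity API, since Unity objects should only be used from the main thread.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class CsvBackgroundLoader
{
    // Reads and splits the whole file on a thread-pool thread and returns the data rows.
    // Assumptions: the first line is a header, and fields contain no quoted commas.
    public static Task<List<string[]>> LoadAsync(string path)
    {
        return Task.Run(() =>
        {
            var rows = new List<string[]>();
            bool isHeader = true;

            foreach (string line in File.ReadLines(path))
            {
                if (isHeader)
                {
                    isHeader = false; // skip the header line
                    continue;
                }
                rows.Add(line.Split(','));
            }

            return rows;
        });
    }
}
```

A coroutine could start the task, yield return null while task.IsCompleted is false, and then read task.Result on the main thread. That way the frame rate no longer limits how many lines are parsed per second.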

1 vote

30 minutes is far too slow!


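-2 votes

You may be running into a memory problem. Open Task Manager while the code is running to see whether you have hit the maximum amount of memory.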
