在c ++中删除一行中的所有字符直到模式匹配的最快方法是什么？

Question

我有很大的文件需要读取到内存中。这些文件必须是人类可读的格式，因此它们会被制表符缩进污染，直到出现正常字符为止。例如，以下文本前面带有3个空格（等同于一个制表符缩进）

    /// There is a tab before this text.
    Sample Text   There is a tab in between the word "Text" and "There" in this line.
    9919
    2250
    {
        ""
        5
        255
    }

当前，我只是运行以下代码替换选项卡（在文件加载到内存中之后...

void FileParser::ReplaceAll(
   std::string& the_string,
   const std::string& from,
   const std::string& to) const
{
   size_t start_pos = 0;
   while ((start_pos = the_string.find(from, start_pos)) != std::string::npos)
   {
      the_string.replace(start_pos, from.length(), to);
      start_pos += to.length(); // In case 'to' contains 'from', like replacing 'x' with 'yx'
   }
}

此代码有两个问题...

完成此文本的替换需要18秒。
这将替换所有选项卡，我只希望这些选项卡直到第一个非选项卡字符为止。因此，如果该行在非制表符后面有制表符...。这些将不会被删除。

任何人都可以提供一种解决方案来加速该过程，并且仅删除每行的初始制表符缩进吗？

Answer 1

我会这样：

std::string without_leading_chars(const std::string& in, char remove)
{
    std::string out;
    out.reserve(in.size());
    bool at_line_start = true;
    for (char ch : in)
    {
        if (ch == '\n')
            at_line_start = true;
        else if (at_line_start)
        {
            if (ch == remove)
                continue; // skip this char, do not copy
            else
                at_line_start = false;
        }
        out.push_back(ch);
    }
    return out;
}

这是一次内存分配和一次通过，因此非常接近最佳。

Answer 2

一如既往。我们通常可以通过考虑好的算法并创建好的设计来提高速度。

第一条评论。我用100MB的源文件测试了您的方法，并且在我的计算机上以“释放”模式花费了至少30分钟并启用了所有优化功能。

而且，正如你自己提到的。它会重新排列所有空格，而不仅是文件开头的空格。因此，我们需要提出一个更好的解决方案

首先，我们考虑如何识别行首的空格。为此，我们需要一些布尔标志来指示我们在一行的开头。我们将其命名为beginOfLine，并将其初始设置为true，因为文件始终以一行开头。

然后，接下来，我们检查下一个字符是空格' '还是制表符'\t'。与其他解决方案相比，我们将同时检查两者。

如果是这种情况，那么我们就不必考虑输出中的空格或制表符，这取决于我们是否位于行首。因此，结果是beginOfLine的倒数。

如果字符不是空格或制表符，则我们检查换行符。如果找到一个，则将beginOfLine标志设置为true，否则设置为false。无论如何，我们都想使用该字符。

所有这些都可以放入一个简单的有状态Lambda中

        auto check = [beginOfLine = true](const char c) mutable -> bool {
            if ((c == ' ') || (c == '\t')  ) 
                return !beginOfLine;    
            beginOfLine = (c == '\n'); 
            return true; };

或更紧凑：

auto check = [beginOfLine = true](const char c) mutable -> bool {
    if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };

然后，下一个。我们不会删除原始字符串中的空格，因为这是一项巨大的内存转移活动，需要花费很长时间。相反，我们将数据（字符）复制到新的字符串中，但只复制一次。

为此，我们可以使用标准库中的std::copy_if。

std::copy_if(data.begin(), data.end(), data2.begin(), check);

这将完成工作。而对于100MB的数据，则需要160ms。与30分钟相比，这节省了很多时间。

请参阅示例代码（当然需要根据您的需要进行修改）：

#include <iostream>
#include <fstream>
#include <filesystem>
#include <iterator>
#include <algorithm>
#include <string>

namespace fs = std::filesystem;
constexpr size_t SizeOfIOStreamBuffer = 1'000'000;
static char ioBuffer[SizeOfIOStreamBuffer];

int main() {

    // Path to text file
    const fs::path file{ "r:\\test.txt" };

    // Open the file and check, if it could be opened
    if (std::ifstream fileStream(file); fileStream) {

        // Lambda, that identifies, if we have a spece or tab at the begin of a line or not
        auto check = [beginOfLine = true](const char c) mutable -> bool {
            if (c == ' ' || c == '\t') return !beginOfLine; beginOfLine = (c == '\n'); return true; };

        // Huge string with all file data
        std::string data{};

        // Reserve space to spped up things and to avoid uncessary allocations
        data.resize(fs::file_size(file));

        // Used buffered IO with a huge iobuffer
        fileStream.rdbuf()->pubsetbuf(ioBuffer, SizeOfIOStreamBuffer);

        // Elimiate spaces and tabs at the beginning of the line
        std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);
    }
    return 0;
}

如您所见，所有代码都归结为一个语句。对于100MB的文件，此操作（在我的计算机上）运行160毫秒。

可以进一步优化什么？当然，我们看到软件中有2个100MB std::string。真是浪费最终的优化是，将用于读取文件并删除一行开头的空格和制表符的2条语句放到一条语句中。

std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check);

然后，我们在内存中只有1倍的数据，消除了我们从不需要的文件中读取数据的废话。它的优点在于，通过使用现代C ++语言元素，只需要进行很小的修改即可。只需交换源迭代器即可：

std::copy_if(std::istreambuf_iterator<char>(fileStream), {}, data.begin(), check2);

是的，我知道字符串的结尾太大了，但是可以很容易地将其设置为实际值。例如，使用data.reserve（...）和back::inserter

在c ++中删除一行中的所有字符直到模式匹配的最快方法是什么？

问题描述投票：2回答：2

2个回答

最新问题

在c ++中删除一行中的所有字符直到模式匹配的最快方法是什么？

问题描述 投票：2回答：2

2个回答

最新问题

问题描述投票：2回答：2