How can I tell whether a std::string is correctly encoded?

Question

In my program, I get a std::string value from another function that reads strings from different sources. The strings here always contain non-ASCII characters.

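I am debugging the program with Visual Studio. Sometimes the string content looks correct in the VS debugger, and the next step works fine (for example, using the string as an input or output directory). But sometimes the string content looks incorrect, which causes the next step to fail.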

Currently I use QString as a bridge to convert the "incorrect" string into a "correct" one, with the code shown here.

// get string from somewhere else, sometimes correct sometimes incorrect
string str = getString(); 
QString strQ = QString::fromStdString(str);
str = string(strQ.toLocal8Bit().constData());

But sometimes str is already "correct" before the conversion, and in that case converting it with the code above breaks it.

So I guess the question is: how can I know whether a std::string has the correct encoding? I can't always judge it by eye.

Yes, encoding is a widely discussed topic on Stack Overflow, but I still couldn't find a suitable solution.

P.S. A correct string value looks like 孙夏^4735 in the VS debugger, and an incorrect one looks like ????.

Answer 1

You have to check whether the string is already UTF-8 encoded. Something like the code below (never tested; use it as inspiration).

#include <string>

enum DetectedCoding {ASCII, UTF8, OTHER};

DetectedCoding DetectEncoding(const std::string & s)
{
  const char * cs = s.c_str();
  DetectedCoding d = ASCII;
  while (*cs)
  {
    unsigned char b = (unsigned char)*(cs++);
    if (b & 0x80) { // not a plain ASCII character
      // if the string is already UTF8 encoded, then it must conform to a multibyte sequence standard. Let's verify it
      if (b < 0xC0) // first of all, b must start with 11
        return OTHER; //  no multibyte sequence starts with 10xxxxxx
      // now we expect a number of continuation bytes, depending on the number of ones following the 11
      size_t nCont = 0;
      if (b < 0xE0) // two bytes sequence: 110xxxxx 10xxxxxx
        nCont = 1;
      else if (b < 0xF0) // three bytes sequence: 1110xxxx 10xxxxxx 10xxxxxx
        nCont = 2;
      else if (b < 0xF8) // four-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        nCont = 3;
      else
        return OTHER; // RFC 3629 limits UTF-8 to four bytes, so no valid sequence starts with 11111xxx
      while (nCont--)
        if (((unsigned char)*(cs++) & 0xC0) != 0x80) // if the string ends here, the terminating 0 fails this test, so we return before reading past the end
          return OTHER; // each continuation byte must start with 10xxxxxx
      d = UTF8;
    }
  }
  return d;
}
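
The function above only checks the bit patterns of lead and continuation bytes. As a complement, here is a minimal sketch of a stricter check that also rejects overlong encodings, UTF-16 surrogates, and code points above U+10FFFF, following the rules in RFC 3629. The name isValidUtf8 is hypothetical and is not part of the answer or of any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an illustration, not from the answer above): returns
// true when every byte of s belongs to a well-formed UTF-8 sequence.
bool isValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;      // total length of the current sequence
        unsigned long cp = 0;     // code point decoded so far
        unsigned long minCp = 0;  // smallest code point allowed for this length
        if (b < 0x80)      { len = 1; cp = b;        minCp = 0; }
        else if (b < 0xC0) { return false; }  // stray continuation byte
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; minCp = 0x80; }
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; minCp = 0x800; }
        else if (b < 0xF8) { len = 4; cp = b & 0x07; minCp = 0x10000; }
        else               { return false; }  // 11111xxx never starts a sequence
        if (len > n - i)
            return false;  // sequence is cut off by the end of the string
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;  // continuation bytes must look like 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp)                   return false;  // overlong encoding
        if (cp > 0x10FFFF)                return false;  // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Assuming the "incorrect" strings in the question are the UTF-8 ones, a check like this could guard the QString round trip so that it runs only when isValidUtf8(str) is true and the string contains at least one non-ASCII byte. This is still a heuristic: a short string in a local code page can happen to form valid UTF-8, so tracking the encoding at the source is more reliable when that is possible.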
