Unable to extract Unicode symbols from a C++ std::string

Question

I am reading a C++ std::string and then passing that std::string to a function that should analyze it and extract both the Unicode symbols and the plain ASCII symbols from it.

I searched many tutorials online, but all of them mention that standard C++ does not fully support Unicode. Many of them suggest using ICU C++.

Here is my C++ program for understanding the basics of the above. It takes a raw string, converts it to an ICU Unicode string, and prints it:

#include <iostream>
#include <string>
#include "unicode/unistr.h"

int main()
{
    std::string s="Hello☺";
    // at this point s contains a line of text
    // which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
}

Expected output:

Hello☺

Actual output:

Hello?

Please point out what I am doing wrong. Also, please suggest any alternative or simpler approach.

Thanks.

Update: the working code is below. Special thanks to @ <> for the solution. I wish I could give his solution more upvotes!!! :)

#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"

void f(const std::string & s)
{
  std::wcout << "Inside called function" << std::endl;
  constexpr char locale_name[] = "";
  setlocale( LC_ALL, locale_name );
  std::locale::global(std::locale(locale_name));
  std::ios_base::sync_with_stdio(false);
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());

  // at this point s contains a line of text which may be ANSI or UTF-8 encoded

  // convert std::string to ICU's UnicodeString
  icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

  // convert UnicodeString to std::wstring
  std::wstring ws;
  for (int i = 0; i < ucs.length(); ++i)
  {
    ws += static_cast<wchar_t>(ucs[i]);
    std::wcout << static_cast<wchar_t>(ucs[i]) << std::endl;
  }

  std::wcout << ws << std::endl;
}

int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "Inside main function" << std::endl;

    std::string s=u8"hello☺";
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
    {
      ws += static_cast<wchar_t>(ucs[i]);
      std::wcout << static_cast<wchar_t>(ucs[i]) << std::endl;
    }

    std::wcout << ws << std::endl;
    std::wcout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Answer 1

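There are several stumbling blocks to clear here:

First, your source file (and the smiley in it) should be encoded as UTF-8. The smiley should then consist of the literal bytes 0xE2 0x98 0xBA.
You should use the u8 prefix to mark the string as containing UTF-8 data: u8"Hello☺"
Next, the documentation for icu::UnicodeString notes that it stores Unicode as UTF-16. You are lucky in this case, because U+263A fits in a single UTF-16 code unit. Other emoji may not! You should either convert to UTF-32 or be very careful and use the char32At function.
Finally, the encoding used by wcout should be configured with imbue to match the encoding your environment expects. See the answers to this question.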
