How do I get the UTF-8 int value of a Unicode character from a (w)string?

Question

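Situation

I need a function that takes a string, encodes every non-ASCII character as UTF-8, and replaces that character with the hexadecimal digits of its encoding.

For example, the ӷ in a word like "djvӷdio" should be replaced with "d3b7", while the rest stays unchanged.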

Explanation:
ӷ is encoded in UTF-8 as the bytes d3 b7; read as one integer that is 54199, or d3b7 in hexadecimal
djvӷdio --> djvd3b7dio
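
To make the arithmetic concrete, here is a minimal sketch that packs the UTF-8 bytes of a single character into one integer, most significant byte first. The name pack_utf8_bytes is hypothetical and only serves this illustration; it assumes the input is at most 4 bytes long.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: packs the bytes of a short UTF-8 sequence
// (at most 4 bytes) into one integer, most significant byte first.
std::uint32_t pack_utf8_bytes(const std::string& bytes)
{
    std::uint32_t value = 0;
    for (unsigned char byte : bytes)
        value = (value << 8) | byte;   // shift earlier bytes up, append this one
    return value;
}
```

Called with the two bytes "\xD3\xB7" (the UTF-8 encoding of ӷ), this yields 0xD3B7, which is 54199 in decimal.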

I already have a function that returns the hexadecimal value of an integer.

My machine

Kubuntu 19.10
Compiler: g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008

My ideas

Idea 1

std::string encode_utf8(const std::string &str);

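With the function above I iterate over the whole string containing Unicode, and if the current char is non-ASCII I replace it with its hex value.

Problem:

Iterating over a Unicode string char by char is not sensible, because a Unicode character can consist of up to 4 bytes, unlike a plain char. A single Unicode character is therefore seen as several chars, and the output is garbage. Put simply, the string cannot be indexed by character.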
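
The mismatch between bytes and characters can be shown with a short sketch. It assumes the string holds valid UTF-8; count_code_points is a hypothetical name. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern 10xxxxxx, so every other byte starts a new character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++count;
    return count;
}
```

For "djvӷdio" stored as UTF-8, size() reports 8 bytes while this function reports 7 characters, which is why indexing by position lands in the middle of ӷ.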

Idea 2

std::string encode_utf8(const std::wstring &wstr);

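Again I iterate over the whole string of Unicode characters, and if the current character is not ASCII I replace it with its hex value.

Problem:

Indexing works now, but it returns a wchar_t holding the corresponding UTF-32 number, and I absolutely need the UTF-8 number.

How can I get a character out of the string in a form from which I can obtain its UTF-8 decimal number?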

Answer 1

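Your input string is UTF-8 encoded, which means each character is encoded in anywhere from one to four bytes. You can't just scan the string and convert the characters unless your loop understands how Unicode characters are encoded in UTF-8.

You need a UTF-8 decoder.

Fortunately, there are really lightweight decoders available if decoding is all you need. UTF8-CPP is pretty much a single header and has functions that hand you individual Unicode characters. utf8::next will give you a uint32_t (the code point of the "largest" character fits into an object of this type). Now you can simply check whether the value is less than 128: if it is, cast it to char and append it; if it isn't, serialize the integer in whatever way you see fit.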
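
The decode-then-check loop described above might look roughly like the sketch below. To keep it self-contained it does not use UTF8-CPP; decode_next is a hypothetical hand-written stand-in for what a library decoder does, and it skips checks a real decoder performs (overlong forms, surrogates). The output format for non-ASCII characters, an HTML numeric character reference such as &#x4f7;, is an arbitrary choice for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library decoder: reads one code point
// starting at pos and advances pos past it. Throws on malformed input
// (std::out_of_range from at() if the sequence is truncated).
std::uint32_t decode_next(const std::string& s, std::size_t& pos)
{
    const unsigned char lead = static_cast<unsigned char>(s.at(pos));
    std::size_t extra = 0;   // number of continuation bytes to read
    std::uint32_t cp = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    ++pos;
    for (; extra > 0; --extra, ++pos) {
        const unsigned char next = static_cast<unsigned char>(s.at(pos));
        if ((next & 0xC0) != 0x80)
            throw std::invalid_argument("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (next & 0x3F);   // append the 6 payload bits
    }
    return cp;
}

// Formats a value as lowercase hex digits without leading zeros.
std::string to_hex_digits(std::uint32_t value)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    do {
        out.insert(out.begin(), digits[value & 0xF]);
        value >>= 4;
    } while (value != 0);
    return out;
}

// Copies ASCII through unchanged and serializes every other code point
// as an HTML numeric character reference (format chosen arbitrarily).
std::string escape_non_ascii(const std::string& utf8)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        const std::uint32_t cp = decode_next(utf8, pos);
        if (cp < 128)
            out += static_cast<char>(cp);
        else
            out += "&#x" + to_hex_digits(cp) + ";";
    }
    return out;
}
```

With the UTF-8 bytes of "djvӷdio" as input, escape_non_ascii returns "djv&#x4f7;dio", since ӷ is code point U+04F7.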

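I implore you to consider whether this is really what you want to do, though. Your output will be ambiguous. It will be impossible to tell whether a bunch of digits in it are actual digits or the representation of some non-ASCII character. Why not just stick with the original UTF-8 encoding, or use something like HTML entity encoding or quoted-printable? These encodings are widely understood and widely supported.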
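
The ambiguity is easy to reproduce. The sketch below assumes the output format from the question (the bare hex digits of each non-ASCII UTF-8 byte, with no prefix or delimiter) and that the input is valid UTF-8; hex_encode_non_ascii is a hypothetical name. It works byte by byte because in UTF-8 every byte of a non-ASCII character has its high bit set.

```cpp
#include <string>

// Hypothetical helper: keeps ASCII bytes as they are and replaces every
// byte >= 0x80 with its two lowercase hex digits.
std::string hex_encode_non_ascii(const std::string& utf8)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char byte : utf8) {
        if (byte < 0x80) {
            out += static_cast<char>(byte);
        } else {
            out += digits[byte >> 4];     // high nibble
            out += digits[byte & 0x0F];   // low nibble
        }
    }
    return out;
}
```

Both "djvӷdio" (as UTF-8) and the plain ASCII string "djvd3b7dio" come out as "djvd3b7dio", so the original text cannot be recovered from the result.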

Answer 2

I just solved the problem:

std::string Tools::encode_utf8(const std::wstring &wstr)
{
    std::string utf8_encoded;

    //iterate through the whole string
    for(size_t j = 0; j < wstr.size(); ++j)
    {
        if(wstr.at(j) <= 0x7F)
            utf8_encoded += static_cast<char>(wstr.at(j));
        else if(wstr.at(j) <= 0x7FF)
        {
            //our template for unicode of 2 bytes
            int utf8 = 0b11000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the last 5 remaining bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000111'11000000) << 2;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x"));
        }
        else if(wstr.at(j) <= 0xFFFF)
        {
            //our template for unicode of 3 bytes
            int utf8 = 0b11100000'10000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the last 4 remaining bits
             * put them 4 to the left so that the 10xx from 10xxxxxx (second byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b11110000'00000000) << 4;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x"));
        }
        else if(wstr.at(j) <= 0x10FFFF)
        {
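            //our template for unicode of 4 bytes
            //(unsigned: this value does not fit in a signed 32-bit int, so Tools::to_hex must accept an unsigned value)
            unsigned int utf8 = 0b11110000'10000000'10000000'10000000;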

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the next 6 bits
             * put them 4 to the left so that the 10xx from 10xxxxxx (second byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000011'11110000'00000000) << 4;

            /*
             * get the last 3 remaining bits
             * put them 6 to the left so that the 10xxxx from 10xxxxxx (third byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00011100'00000000'00000000) << 6;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x").insert(12, "\\x"));
        }
    }
    return utf8_encoded;
}
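
As a cross-check of the bit layout used above, here is a minimal sketch that builds the UTF-8 bytes of one code point directly, one byte at a time, instead of packing them into a single integer. The name utf8_bytes is hypothetical, and the sketch does not reject surrogate code points (U+D800 to U+DFFF), which strict UTF-8 forbids.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: returns the UTF-8 encoding of one code point as raw
// bytes. Returns an empty string for values above U+10FFFF.
std::string utf8_bytes(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

For example, U+04F7 (ӷ) gives the bytes D3 B7, U+20AC gives E2 82 AC, and U+1F600 gives F0 9F 98 80. Each byte can then be formatted as \xNN separately, which avoids relying on the hex string having a fixed width.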
