Count the actual number of characters (not chars) in a std::string?

Question

Can I count the number of "characters" a std::string contains, rather than the number of bytes? For example, std::string::size and std::string::length return the number of bytes (chars):

std::string m_string1 {"a"};
// This is 1
m_string1.size();

std::string m_string2 {"їa"};
// This is 3 because of Unicode
m_string2.size();

Is there a way to get the number of characters? For example, to find out that m_string2 has 2 characters.

Answer 1

Perhaps you can use utf8::distance from the utfcpp library:

#include "utf8.h"
#include <iostream>
#include <string>

int main() {
  std::string m_string1{"a"};
  std::string m_string2{"їa"};
  std::cout << "Get the number of characters using std::string::length():" << '\n';
  std::cout << m_string1 << ": " << m_string1.length() << '\n';
  std::cout << m_string2 << ": " << m_string2.length() << '\n';
  std::cout << "Get the number of unicode code points using utf8::distance():" << '\n';
  std::cout << m_string1 << ": " << utf8::distance(m_string1.begin(), m_string1.end()) << '\n';
  std::cout << m_string2 << ": " << utf8::distance(m_string2.begin(), m_string2.end()) << '\n';
}

Output:

Get the number of characters using std::string::length():
a: 1
їa: 3
Get the number of unicode code points using utf8::distance():
a: 1
їa: 2

Answer 2

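In general, it is not possible to count "characters" in a Unicode string with anything in the C++ standard library. It is not clear what exactly you mean by "character", and the closest you can get is to count code points by using UTF-32 literals and std::u32string. However, even for їa that is not going to match what you want.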

For example, ї may be a single code point

ї CYRILLIC SMALL LETTER YI (U+0457)

or two consecutive code points

і CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
◌̈ COMBINING DIAERESIS (U+0308)

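Unless you know that the string is normalized, you cannot distinguish the two with the standard library, and there is no way to force normalization either. Even for UTF-32 string literals, it is up to the implementation which one is chosen. When counting code points, you will get 2 or 3 for the string їa.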

If you want to do this properly, you need a Unicode library such as ICU.
