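I am implementing a C++ class with a static method that converts a UTF-8 string from Normalization Form D (NFD) to Normalization Form C (NFC).
The current implementation is: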
std::string Graphemes::normalizeToNFC(const std::string& input) {
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(input);
    icu::UnicodeString target;
    UErrorCode status = U_ZERO_ERROR;
    icu::Normalizer::normalize(source, UNORM_NFC, 0, target, status);
    if (U_FAILURE(status)) {
        CONSOLELOG_ERROR("Normalization failed: %s", u_errorName(status));
        return {};
    }
    std::string result;
    target.toUTF8String(result);
    return result;
}
My unit test for this method is:
TEST(ConversionTest, Utf8ToLatin2) {
    // "Soluções e Ações" at Normal Form D
    char input_UTF8_NFD[] = "\x53\x6F\x6C\x75\xE7\xF5\x65\x20\x65\x20\x41\xE7\xF5\x65\x73\x00";
    // "Soluções e Ações" at Normal Form C
    char expected_UTF8_NFC[] = "\x53\x6F\x6C\x75\xC3\xA7\xC3\xB5\x65\x73\x20\x65\x20\x41\xC3\xA7\xC3\xB5\x65\x73\x0A\x00";
    std::string NFCStr = Graphemes::normalizeToNFC(input_UTF8_NFD);
    std::string expected = expected_UTF8_NFC;
    ASSERT_EQ(expected, NFCStr);
}
I ran the test and got the following mismatch:
Expected equality of these values:
expected
Which is: "Solu\xC3\xA7\xC3\xB5" "es e A\xC3\xA7\xC3\xB5" "es\n"
As Text: "Soluções e Ações
"
rc1
Which is: "Solu\xEF\xBF\xBD\xEF\xBF\xBD" "e e A\xEF\xBF\xBD\xEF\xBF\xBD" "es"
As Text: "Solu��e e A��es"
With diff:
@@ -1,2 +1,1 @@
-Solu\xC3\xA7\xC3\xB5" "es e A\xC3\xA7\xC3\xB5" "es
-"
As Text: "Soluções e Ações
+Solu\xEF\xBF\xBD\xEF\xBF\xBD" "e e A\xEF\xBF\xBD\xEF\xBF\xBD" "es"
As Text: "Solu��e e A��es
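It looks like libicu is not converting between the normalization forms correctly in this case. Could you offer some advice on this implementation?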
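Following the hints in the comments, I got my test working. The original input was not valid UTF-8: 0xE7 and 0xF5 are the single-byte Latin-1 encodings of ç and õ, so ICU replaced each of them with U+FFFD (EF BF BD in UTF-8). The expected string also ended with a stray \x0A. The corrected test is: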
TEST(NormalizationTest, NFD_To_NFC) {
    // NFD and NFC sequences for "Soluções e Ações" :D
    char input_UTF8_NFD[] = "\x53\x6F\x6C\x75\x63\xCC\xA7\x6F\xCC\x83\x65\x73\x20\x65\x20\x41\x63\xCC\xA7\x6F\xCC\x83\x65\x73\x00";
    char expected_UTF8_NFC[] = "\x53\x6F\x6C\x75\xC3\xA7\xC3\xB5\x65\x73\x20\x65\x20\x41\xC3\xA7\xC3\xB5\x65\x73\x00";
    std::string NFCStr = Graphemes::normalizeToNFC(input_UTF8_NFD);
    std::string expected = expected_UTF8_NFC;
    ASSERT_EQ(expected, NFCStr);
}
It now works as expected.
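
For anyone hitting the same U+FFFD output: one way to catch a hand-written byte sequence that is not really UTF-8 is to validate it before handing it to normalizeToNFC. Below is a minimal sketch using only the standard library. isValidUtf8 is a hypothetical helper name of my own, not an ICU function, and it only checks well-formedness (it says nothing about the normalization form).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ICU): returns true if `s` is well-formed UTF-8.
bool isValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;   // total bytes in this encoded code point
        std::uint32_t cp = 0;  // decoded code point

        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or lead byte 0xF8..0xFF

        if (i + len > n) return false;  // sequence truncated at end of string

        // Every following byte must look like 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, UTF-16 surrogates, and values above U+10FFFF.
        static const std::uint32_t minCp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (len > 1 && cp < minCp[len]) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;
        if (cp > 0x10FFFF) return false;

        i += len;
    }
    return true;
}
```

In the test, an ASSERT_TRUE(isValidUtf8(input_UTF8_NFD)) placed before the call to normalizeToNFC would have failed on my original input, because 0xE7 is a three-byte lead byte and the 0xF5 after it is not a continuation byte.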