Converting a Unicode surrogate pair to a literal string

Question

I'm trying to read a high Unicode character from one string into another. For brevity, I've simplified my code as shown below:

public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; //Not the standard A

    var result1 = highUnicodeChar; //this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}

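When I assign highUnicodeChar directly to result1, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is one half of the surrogate pair of UTF-16 characters used to represent the UTF-32 character. I'm pretty sure this problem has to do with trying to implicitly convert a char to a string.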

In the end, I want result2 to yield the same value as result1. How can I do this?

Answer 1

In Unicode, you have code points. These are 21 bits long. Your character 𝐀, Mathematical Bold Capital A, has the code point U+1D400.

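In Unicode encodings, you have code units. These are the natural unit of the encoding: 8 bits for UTF-8, 16 bits for UTF-16, and so on. One or more code units encode a single code point.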
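To make the code point versus code unit distinction concrete, here is a minimal sketch that counts the code units needed for 𝐀 in each encoding. It is written as top-level statements, which assumes C# 9 or later, and the variable names are purely illustrative.

```csharp
using System;
using System.Text;

// U+1D400 written as an escape so the example does not depend on the source file's encoding.
string s = "\U0001D400";

int codePoint = char.ConvertToUtf32(s, 0);            // the single code point
int utf16Units = s.Length;                            // 16-bit code units in the .NET string
int utf8Units = Encoding.UTF8.GetByteCount(s);        // 8-bit code units
int utf32Units = Encoding.UTF32.GetByteCount(s) / 4;  // 32-bit code units

Console.WriteLine($"U+{codePoint:X}: UTF-8 {utf8Units}, UTF-16 {utf16Units}, UTF-32 {utf32Units}");
// Prints: U+1D400: UTF-8 4, UTF-16 2, UTF-32 1
```

One code point, but four, two, or one code units depending on the encoding.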

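In UTF-16, two code units that form a single code point are called a "surrogate pair". Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and above. This gets a little tricky in .NET, because a .NET Char represents a single UTF-16 code unit, and a .NET String is a collection of code units.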

So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:

var highUnicodeChar = "𝐀";
char a = highUnicodeChar[0]; // code unit 0xD835
char b = highUnicodeChar[1]; // code unit 0xDC00

That means when you index into the string like that, you're actually only getting half of the surrogate pair.
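For readers wondering where 0xD835 and 0xDC00 come from, the following sketch works through the standard UTF-16 arithmetic by hand and compares the result with char.ConvertFromUtf32. It assumes C# 9 or later (top-level statements); the variable names are illustrative only.

```csharp
using System;

int codePoint = 0x1D400;

// Subtract 0x10000 to get a 20-bit value, then split it into two 10-bit halves.
int offset = codePoint - 0x10000;               // 0xD400
char high = (char)(0xD800 + (offset >> 10));    // top 10 bits, in the high surrogate range
char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits, in the low surrogate range

string manual = new string(new[] { high, low });
string builtIn = char.ConvertFromUtf32(codePoint);

Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // Prints: D835 DC00
Console.WriteLine(manual == builtIn);               // Prints: True
```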

You can use IsSurrogatePair to test for a surrogate pair. For example:

string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
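As a usage sketch, here is how that helper could be applied to the variables from the question so that result2 matches result1. The helper is repeated as a local function to keep the example self-contained, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;

// Returns the full code point starting at idx: two chars for a surrogate pair, otherwise one.
string GetFullCodePointAtIndex(string s, int idx) =>
    s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

var highUnicodeChar = "\U0001D400"; // 𝐀, written as an escape
var result1 = highUnicodeChar;
var result2 = GetFullCodePointAtIndex(highUnicodeChar, 0);
var plain = GetFullCodePointAtIndex("AB", 0); // a BMP character comes back as one char

Console.WriteLine(result1 == result2); // Prints: True
Console.WriteLine(plain);              // Prints: A
```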

It's worth noting that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point.

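A grapheme cluster is the "visible thing" most people would ultimately call a "character" when asked. A grapheme cluster is made from one or more code points: a base character and zero or more combining characters. An example of a combining character is an umlaut, or any of the various other decorations/modifiers you might want to add. See this answer for a horrifying example of what combining characters can do. To test for a combining character, you can use GetUnicodeCategory to check for an enclosing mark, non-spacing mark, or spacing mark.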
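A minimal sketch of that category check follows. IsCombiningMark is a hypothetical helper name, not a framework API, and the example assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: true if the char falls in one of the three mark categories.
bool IsCombiningMark(char c)
{
    var category = CharUnicodeInfo.GetUnicodeCategory(c);
    return category == UnicodeCategory.NonSpacingMark
        || category == UnicodeCategory.SpacingCombiningMark
        || category == UnicodeCategory.EnclosingMark;
}

// 'u' followed by U+0308 COMBINING DIAERESIS, which renders as a single ü.
string decomposed = "u\u0308";
bool baseIsMark = IsCombiningMark(decomposed[0]);
bool diaeresisIsMark = IsCombiningMark(decomposed[1]);

Console.WriteLine(baseIsMark);      // Prints: False
Console.WriteLine(diaeresisIsMark); // Prints: True
```

This overload inspects one char at a time, so it only covers combining marks in the Basic Multilingual Plane.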

Answer 2

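It appears that you want to extract the first "atomic" character from the user's point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair. You can use StringInfo.GetTextElementEnumerator() to do just this, breaking the string down into atomic chunks and then taking the first.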

First, define the following extension method:

using System.Collections.Generic;
using System.Globalization;

public static class TextExtensions
{
    public static IEnumerable<string> TextElements(this string s)
    {
        // StringInfo.GetTextElementEnumerator returns a TextElementEnumerator, which dates
        // from .NET 1.1 and doesn't implement IEnumerable<string>, so convert.
        if (s == null)
            yield break;
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
            yield return enumerator.GetTextElement();
    }
}

Now you can do:

var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? "";

Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ rather than H.
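The grouping of combining characters can be checked with a short sketch. It uses StringInfo.GetNextTextElement and LengthInTextElements instead of the extension method above, spells the string out with escapes so the combining circumflex (U+0302) is explicit, and assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Globalization;

// "Ĥ=T̂+V̂" with each circumflex written as the combining character U+0302.
string s = "H\u0302=T\u0302+V\u0302";

int codeUnits = s.Length;                                   // 8 UTF-16 code units
int textElements = new StringInfo(s).LengthInTextElements;  // 5 grapheme clusters
string firstElement = StringInfo.GetNextTextElement(s, 0);  // "H" plus its circumflex

Console.WriteLine($"{codeUnits} code units, {textElements} text elements");
// Prints: 8 code units, 5 text elements
Console.WriteLine(firstElement.Length); // Prints: 2
```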

Sample fiddle here.

Update for .NET Core

In .NET 6 and later, you can use StringInfo.GetNextTextElementLength(ReadOnlySpan<Char>) to iterate through the text elements of a string as a sequence of slices, like so:

public static class TextExtensions
{
    public static IEnumerable<ReadOnlyMemory<char>> TextElements(this string s) => (s ?? "").AsMemory().TextElements();

    public static IEnumerable<ReadOnlyMemory<char>> TextElements(this ReadOnlyMemory<char> s)
    {
        for (int index = 0, length = StringInfo.GetNextTextElementLength(s.Span); 
             length > 0; 
             index += length, length = StringInfo.GetNextTextElementLength(s.Span.Slice(index)))
            yield return s.Slice(index, length);
    }
}

This avoids allocating a string for each grapheme.

Or, if you only want the first grapheme, you can do:

var first = highUnicodeChar.AsSpan()
    .Slice(0, StringInfo.GetNextTextElementLength(highUnicodeChar));

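Demo fiddle #2 here.

In .NET Core 3 and later, if you really just want to enumerate the Unicode code points of a string, treating surrogate pairs as single characters but ignoring combining characters, you can use String.EnumerateRunes() to enumerate it as a sequence of Rune structs: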

var highUnicodeChar = "𝐀"; //Not the standard A

foreach (var rune in highUnicodeChar.EnumerateRunes())
{
    Console.WriteLine($"{rune} = {rune.Value:X}"); // Prints 𝐀 = 1D400
}

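The Rune struct:

Represents a Unicode scalar value, which means any code point excluding the surrogate range (U+D800..U+DFFF). The type's constructors and conversion operators validate the input, so consumers can call the APIs assuming that the underlying Rune instance is well-formed.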
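The validation mentioned in that description can be seen in a small sketch. It assumes .NET Core 3.0 or later (where Rune is available) and C# 9 or later for top-level statements; the variable names are illustrative.

```csharp
using System;
using System.Text;

string s = "\U0001D400"; // 𝐀, stored as the surrogate pair D835 DC00

// Read the whole scalar value starting at index 0.
Rune first = Rune.GetRuneAt(s, 0);

// A lone surrogate is not a valid scalar value, so TryCreate rejects it.
bool loneOk = Rune.TryCreate(s[0], out _);

// Supplying both halves of the pair succeeds.
bool pairOk = Rune.TryCreate(s[0], s[1], out Rune fromPair);

Console.WriteLine($"{first.Value:X} {first.Utf16SequenceLength}"); // Prints: 1D400 2
Console.WriteLine($"{loneOk} {pairOk}");                           // Prints: False True
```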

Demo fiddle #3 here.
