How to count grapheme clusters or "perceived" emoji characters in Java

Question

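I'm trying to count the number of perceived emoji characters in a given Java String. I'm currently using the emoji4j library, but it doesn't work with grapheme clusters like this one: 👩\u200D👩\u200D👦\u200D👦

Calling EmojiUtil.getLength("👩‍👩‍👦‍👦") does not return the expected 1, and likewise calling EmojiUtil.getLength("👻👩‍👩‍👦‍👦") does not return the expected 2.

Is there an API or a method on String in Java that makes it easy to count grapheme clusters?

I've been looking around, but understandably the codePoints() method on String counts not only the visible emoji but also the zero-width joiners.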
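To make the codePoints() observation concrete, the following minimal sketch counts UTF-16 code units and code points for the two strings above, written with Unicode escapes so that the joiners are visible. The class name CodePointDemo and the method countCodePoints are hypothetical names introduced only for this illustration.

```java
// Hypothetical demo class; the name CodePointDemo is not from the question.
final class CodePointDemo {

    // Counts Unicode code points, which includes invisible zero-width joiners.
    static long countCodePoints(String text) {
        return text.codePoints().count();
    }

    public static void main(String[] args) {
        // Woman, joiner, woman, joiner, boy, joiner, boy
        String family = "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        // Ghost followed by the family sequence
        String ghostAndFamily = "\uD83D\uDC7B" + family;

        System.out.println(family.length());                 // 11 UTF-16 code units
        System.out.println(countCodePoints(family));         // 7 code points (4 emoji + 3 joiners)
        System.out.println(countCodePoints(ghostAndFamily)); // 8 code points
    }
}
```

The three joiners account for the gap between the code point count and the number of emoji a reader actually sees.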

I also tried using BreakIterator:

public static int getLength(String emoji) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(emoji);
    int emojiCount = 0;
    while (it.next() != BreakIterator.DONE) {
        emojiCount++;
    }
    return emojiCount;
}

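But it seems to behave the same as the codePoints() method, counting every code point (8 of them in a string such as "👻👩‍👩‍👦‍👦") rather than every grapheme cluster.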

Answer 1

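I ended up using the ICU library, which works much better. My original code block needed no changes (apart from the import statement, which now refers to ICU's com.ibm.icu.text.BreakIterator), because ICU simply provides a different implementation of BreakIterator.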

Answer 2

JDK 15 added support for extended grapheme clusters to the java.util.regex package. Here is a solution based on that:

/** Returns the number of grapheme clusters within `text` between positions
  * `start` and `end`.  Omits any partial cluster at the end of the span.
  */
int columnarSpan( String text, int start, int end ) {
    return columnarSpan( text, start, end, /*wholeOnly*/true ); }


/** @param wholeOnly Whether to omit any partial cluster at the end
  *   of the span.  Iff `true` and `end` bisects the final cluster,
  *   then the final cluster is omitted from the count.
  */
int columnarSpan( final String text, final int start, final int end,
      final boolean wholeOnly ) {
    graphemeMatcher.reset( text ).region( start, end );
    int count = 0;
    while( graphemeMatcher.find() ) ++count;
    if( wholeOnly  &&  count > 0  &&  end < text.length() ) {
        final int countNext = columnarSpan( text, start, end + 1, false );
        if( countNext == count ) --count; } /* The character at `end` bisects
          the final cluster, which therefore lies partly outside the span.
          Therefore exclude it from the count. */
    return count; }


// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
final Matcher graphemeMatcher = graphemePattern.matcher( "" );


/** The pattern of a grapheme cluster.
  */
public static final Pattern graphemePattern = Pattern.compile( "\\X" ); /*
  An alternative means of cluster discovery is `java.text.BreakIterator`.
  Long outdated in this regard,  [https://bugs.openjdk.org/browse/JDK-8174266]
  it was updated for JDK 20.  [https://stackoverflow.com/a/76109241/2402790] */

Call it like this:

String emoji = "👻👩‍👩‍👦‍👦";
int count = columnarSpan( emoji, 0, /*end*/emoji.length() );
System.out.println( count );

⇒ 2

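Note that it counts only whole clusters. If the given end bisects the final cluster (that is, the character at position end belongs to the same extended cluster as the character before it), then the final cluster is omitted from the count. For example: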

int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1 );
System.out.println( count );

⇒ 1

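This is usually the behaviour you want when printing a line of text with a character pointer positioned beneath it, so that the pointer points into the cluster of the character at the given index. To defeat this behaviour (and point after the cluster instead), call the base method as follows.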

int count = columnarSpan( emoji, 0, /*end*/emoji.length() - 1, false );
System.out.println( count );

⇒ 2

(Updated per Skomisa's comment.)
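When only the total number of clusters in a whole string is needed, the \X approach can be reduced to a few lines. The following is a minimal sketch rather than the code above: GraphemeCount and countGraphemes are hypothetical names, and how \X treats emoji joined by zero-width joiners depends on the JDK version in use, so the emoji result is deliberately left unstated.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; GraphemeCount and countGraphemes are illustrative names only.
final class GraphemeCount {

    // Each successful find() consumes exactly one grapheme cluster.
    private static final Pattern GRAPHEME = Pattern.compile("\\X");

    static int countGraphemes(String text) {
        Matcher matcher = GRAPHEME.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Letter e followed by a combining acute accent: two chars, one cluster
        System.out.println(countGraphemes("e\u0301")); // 1
        System.out.println(countGraphemes("abc"));     // 3

        // Ghost followed by the four-person family sequence from the question;
        // the printed value depends on the JDK's grapheme cluster rules.
        String ghostAndFamily = "\uD83D\uDC7B"
                + "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        System.out.println(countGraphemes(ghostAndFamily));
    }
}
```

A fresh Matcher is created per call here, which keeps the helper thread-safe at the cost of a small allocation; the answer's code reuses one Matcher instead.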

Answer 3

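More than six years after this question was asked, an enhancement to handle grapheme clusters in a String correctly has finally been implemented in Java 20, which was released a few weeks ago. See JDK-8291660, "Grapheme support in BreakIterator".

The API of the BreakIterator class is unchanged, but its underlying code now correctly treats a grapheme cluster as a single unit rather than as multiple characters.

Here is a sample application that uses the method and data provided in the question, without any changes: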

import java.text.BreakIterator;

public class Main {
    public static void main(String[] args) {
        String emojis1 = "👩‍👩‍👦‍👦";
        System.out.println("Length of the emoji string " + emojis1 + " is " + Main.getLength(emojis1));
        String emojis2 = "👻👩‍👩‍👦‍👦";
        System.out.println("Length of the emoji string " + emojis2 + " is " + Main.getLength(emojis2));
    }

    // Returns the correct number of perceived characters in a String.
    // Requires JDK 20+ to work correctly.
    // JDK-8291660 "Grapheme support in BreakIterator" (https://bugs.openjdk.org/browse/JDK-8291660) refers.
    public static int getLength(String emoji) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(emoji);
        int emojiCount = 0;
        while (it.next() != BreakIterator.DONE) {
            emojiCount++;
        }
        return emojiCount;
    }
}

Here is the output, showing the correct grapheme counts:

C:\Java\jdk-20\bin\java.exe -javaagent:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\231.8770.17\lib\idea_rt.jar=63197:C:\Users\johndoe\AppData\Local\JetBrains\Toolbox\apps\IDEA-U\ch-0\231.8770.17\bin -Dfile.encoding=UTF-8 -Dsun.stdout.encoding=UTF-8 -Dsun.stderr.encoding=UTF-8 -classpath D:\II2023.1\Graphemes\out\production\Graphemes Main
Length of the emoji string 👩‍👩‍👦‍👦 is 1
Length of the emoji string 👻👩‍👩‍👦‍👦 is 2

Process finished with exit code 0

I tested this in IntelliJ IDEA 2023.1.1 Preview using OpenJDK JDK 20.0.1.
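The same iterator can also return the clusters themselves rather than just their number, by taking the substring between consecutive boundaries. This is a minimal sketch that assumes java.text.BreakIterator is used as in the example above; GraphemeSplit and splitGraphemes are hypothetical names, and emoji sequences are only kept whole on JDK 20 or later.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; GraphemeSplit and splitGraphemes are illustrative names only.
final class GraphemeSplit {

    // Returns each perceived character of the text as its own substring.
    static List<String> splitGraphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary until the iterator reports DONE.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(text.substring(start, end));
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(splitGraphemes("abc")); // [a, b, c]

        // Ghost followed by the family sequence from the question;
        // the number of elements depends on the JDK version.
        String ghostAndFamily = "\uD83D\uDC7B"
                + "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC66\u200D\uD83D\uDC66";
        System.out.println(splitGraphemes(ghostAndFamily).size());
    }
}
```

Counting is then just the size of the returned list, and the list is handy for truncating or reversing text without cutting a cluster in half.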
