I want to check whether a letter is an emoji. I found some similar questions and came across this regex:
private final String emo_regex = "([\\u20a0-\\u32ff\\ud83c\\udc00-\\ud83d\\udeff\\udbb9\\udce5-\\udbb9\\udcee])";
However, when I do the following on a sentence:
for (int k=0; k<letters.length;k++) {
if (letters[k].matches(emo_regex)) {
emoticon.add(letters[k]);
}
}
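it does not add any of the letters that are emojis. I also tried using Matcher and Pattern, but that did not work either. Is something wrong with the regex, or am I missing something obvious in my code?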
This is how I get the letters:
String sentence = "Jij staat op 10 😂";
String[] letters = sentence.split("");
The 😂 at the end should be recognized and added to emoticon.
You can use the emoji4j library. The following should solve the problem.
String htmlifiedText = EmojiUtils.htmlify(text);
// regex to identify HTML entities in the htmlified text
Matcher matcher = htmlEntityPattern.matcher(htmlifiedText);
while (matcher.find()) {
String emojiCode = matcher.group();
if (isEmoji(emojiCode)) {
emojis.add(EmojiUtils.getEmoji(emojiCode).getEmoji());
}
}
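I created this function, which checks whether a given string consists only of emojis. In other words, it returns false if the string contains any character that is not covered by the regex.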
private static boolean isEmoji(String message){
return message.matches("(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|" +
"[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|" +
"[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|" +
"[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|" +
"[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|" +
"[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|" +
"[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|" +
"[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|" +
"[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|" +
"[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|" +
"[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)+");
}
Example usage:
public static int detectEmojis(String message){
int len = message.length(), NumEmoji = 0;
// only count if the given String consists solely of emojis.
if(isEmoji(message)){
for (int i = 0; i < len; i++) {
// if the charAt(i) is an emoji by it self -> ++NumEmoji
if (isEmoji(message.charAt(i)+"")) {
NumEmoji++;
} else {
// maybe the emoji is of size 2 - so lets check.
if (i < (len - 1)) { // some Emojis are two characters long in java, e.g. a rocket emoji is "\uD83D\uDE80";
if (Character.isSurrogatePair(message.charAt(i), message.charAt(i + 1))) {
i += 1; //also skip the second character of the emoji
NumEmoji++;
}
}
}
}
return NumEmoji;
}
return 0;
}
The function above runs over a string (containing only emojis) and returns the number of emojis in it. (Written with the help of other answers I found on Stack Overflow.)
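These emojis appear to be two characters long (a surrogate pair), but with split("") you split between every character, so none of the resulting letters can be the emoji you are looking for.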
Instead, you could try splitting between words:
for (String word : sentence.split(" ")) {
if (word.matches(emo_regex)) {
System.out.println(word);
}
}
However, this will of course miss emojis that are attached to a word or to punctuation.
Alternatively, you can just use a Matcher to find any group in the sentence that matches the regex.
Matcher matcher = Pattern.compile(emo_regex).matcher(sentence);
while (matcher.find()) {
System.out.println(matcher.group());
}
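To make the Matcher approach above easy to try, here is a minimal, self-contained sketch. The class name EmojiFinder and the two code point ranges in SIMPLE_EMOJI are assumptions chosen for illustration; they cover the question's 😂 but are far from a complete definition of "emoji".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the question's code.
final class EmojiFinder {
    // Assumption: a deliberately small pattern (U+2600..U+27BF plus
    // U+1F300..U+1FAFF). \x{...} names a whole code point, so the pattern
    // never refers to one half of a surrogate pair.
    private static final Pattern SIMPLE_EMOJI =
            Pattern.compile("[\\x{2600}-\\x{27BF}\\x{1F300}-\\x{1FAFF}]");

    // Scans the whole sentence and returns every match, in order.
    static List<String> findEmojis(String sentence) {
        List<String> found = new ArrayList<>();
        Matcher matcher = SIMPLE_EMOJI.matcher(sentence);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sentence = "Jij staat op 10 \uD83D\uDE02"; // ends with U+1F602
        System.out.println(findEmojis(sentence));
    }
}
```

Because find() scans the sentence itself, no splitting is needed and an emoji glued to a word or punctuation mark is still found.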
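You can use the Character class to determine whether a letter is part of a surrogate pair. It has some helpful methods for handling surrogate-pair emojis, for example: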
String text = "💩";
if (text.length() > 1 && Character.isSurrogatePair(text.charAt(0), text.charAt(1))) {
int codePoint = Character.toCodePoint(text.charAt(0), text.charAt(1));
char[] c = Character.toChars(codePoint);
}
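Building on those Character methods, one way to get the "letters" of a sentence without cutting an emoji in half is to walk the string by code point rather than by char. The following is only a sketch; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
final class CodePointSplitter {
    // Splits text into one string per Unicode code point, so a surrogate
    // pair stays together instead of becoming two unusable halves.
    static List<String> splitByCodePoint(String text) {
        List<String> letters = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int width = Character.charCount(codePoint); // 1 or 2 chars
            letters.add(text.substring(i, i + width));
            i += width;
        }
        return letters;
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDCA9b"; // 'a', U+1F4A9, 'b'
        System.out.println(text.length());                 // number of chars
        System.out.println(splitByCodePoint(text).size()); // number of code points
    }
}
```

Note that this keeps surrogate pairs together but still separates emojis made of several code points (flags, for instance) into their individual code points.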
It is worth remembering that Java source code can be written in Unicode. So you can simply do this:
@Test
public void containsEmoji_detects_smileys() {
assertTrue(containsEmoji("This 😂 is a smiley "));
assertTrue(containsEmoji("This 😄 is a different smiley"));
assertFalse(containsEmoji("No smiley here"));
}
private boolean containsEmoji(String s) {
String pattern = ".*[😂😄].*";
return s.matches(pattern);
}
Although see "Should source code be saved in UTF-8 format" for a discussion of whether this is a good idea.
In Java 8, you can use String.codePoints() to split a string into Unicode code points; it returns an IntStream. This means you can do the following:
Set<Integer> emojis = new HashSet<>();
emojis.add("😂".codePointAt(0));
emojis.add("😄".codePointAt(0));
String s = "1😂34😄5";
s.codePoints().forEach( codepoint -> {
System.out.println(
new String(Character.toChars(codepoint))
+ " "
+ emojis.contains(codepoint));
});
...which prints...
1 false
😂 true
3 false
4 false
😄 true
5 false
Of course, if you would rather not have literal Unicode characters in your code, you can put numbers in the set instead:
emojis.add(0x1F601);
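The same lookup can be wrapped in a small helper that returns the matching emojis instead of printing them. This is only a sketch: EmojiSetFilter and collectKnown are hypothetical names, and the helper recognizes nothing beyond the code points you put into the set.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration.
final class EmojiSetFilter {
    // Returns, in order, every code point of s that is contained in known.
    static List<String> collectKnown(String s, Set<Integer> known) {
        return s.codePoints()
                .filter(known::contains)
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<Integer> emojis = new HashSet<>();
        emojis.add(0x1F602); // face with tears of joy
        emojis.add(0x1F604); // grinning face with smiling eyes
        String s = "1\uD83D\uDE0234\uD83D\uDE045";
        System.out.println(collectKnown(s, emojis));
    }
}
```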
This is how Telegram does it:
private static boolean isEmoji(String message){
return message.matches("(?:[\uD83C\uDF00-\uD83D\uDDFF]|[\uD83E\uDD00-\uD83E\uDDFF]|" +
"[\uD83D\uDE00-\uD83D\uDE4F]|[\uD83D\uDE80-\uD83D\uDEFF]|" +
"[\u2600-\u26FF]\uFE0F?|[\u2700-\u27BF]\uFE0F?|\u24C2\uFE0F?|" +
"[\uD83C\uDDE6-\uD83C\uDDFF]{1,2}|" +
"[\uD83C\uDD70\uD83C\uDD71\uD83C\uDD7E\uD83C\uDD7F\uD83C\uDD8E\uD83C\uDD91-\uD83C\uDD9A]\uFE0F?|" +
"[\u0023\u002A\u0030-\u0039]\uFE0F?\u20E3|[\u2194-\u2199\u21A9-\u21AA]\uFE0F?|[\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55]\uFE0F?|" +
"[\u2934\u2935]\uFE0F?|[\u3030\u303D]\uFE0F?|[\u3297\u3299]\uFE0F?|" +
"[\uD83C\uDE01\uD83C\uDE02\uD83C\uDE1A\uD83C\uDE2F\uD83C\uDE32-\uD83C\uDE3A\uD83C\uDE50\uD83C\uDE51]\uFE0F?|" +
"[\u203C\u2049]\uFE0F?|[\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE]\uFE0F?|" +
"[\u00A9\u00AE]\uFE0F?|[\u2122\u2139]\uFE0F?|\uD83C\uDC04\uFE0F?|\uD83C\uDCCF\uFE0F?|" +
"[\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA]\uFE0F?)+");
}
This is line 21,026 of ChatActivity.
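Unicode has an entire document about this. Emojis and emoji sequences are much more complicated than a few character ranges. There are emoji modifiers (such as skin tones), regional indicator pairs (country flags), and some special sequences such as the pirate flag.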
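To illustrate why neither a single char nor a single code point is enough, the sketch below builds three such sequences from their code points and counts their parts. The class and helper names are hypothetical; the code point values are the ones Unicode assigns to these emojis.

```java
// Hypothetical demo class for illustration.
final class EmojiSequenceDemo {
    // Builds a string from a list of code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Number of code points (not chars) in s.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Thumbs up followed by the medium skin tone modifier.
        String thumbsUp = fromCodePoints(0x1F44D, 0x1F3FD);
        // Regional indicators N and L: the flag of the Netherlands.
        String flag = fromCodePoints(0x1F1F3, 0x1F1F1);
        // Black flag, zero width joiner, skull and crossbones, variation selector.
        String pirate = fromCodePoints(0x1F3F4, 0x200D, 0x2620, 0xFE0F);

        for (String emoji : new String[] {thumbsUp, flag, pirate}) {
            System.out.println(emoji + ": " + emoji.length() + " chars, "
                    + codePointLength(emoji) + " code points");
        }
    }
}
```

Each of these is displayed as one emoji, yet a check that looks at one code point at a time would report two to four separate matches, or miss the joiner and variation selector entirely.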
You can use Unicode's emoji data files to reliably find emoji characters and emoji sequences. This will keep working even as new, complex emojis are added:
import java.net.URL;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collection;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class EmojiCollector {
private static String emojiSequencesBaseURI;
private final Pattern emojiPattern;
public EmojiCollector()
throws IOException {
StringBuilder sequences = new StringBuilder();
appendSequencesFrom(
uriOfEmojiSequencesFile("emoji-sequences.txt"),
sequences);
appendSequencesFrom(
uriOfEmojiSequencesFile("emoji-zwj-sequences.txt"),
sequences);
emojiPattern = Pattern.compile(sequences.toString());
}
private void appendSequencesFrom(String sequencesFileURI,
StringBuilder sequences)
throws IOException {
Path sequencesFile = download(sequencesFileURI);
Pattern range =
Pattern.compile("^(\\p{XDigit}{4,6})\\.\\.(\\p{XDigit}{4,6})");
Matcher rangeMatcher = range.matcher("");
try (BufferedReader sequencesReader =
Files.newBufferedReader(sequencesFile)) {
String line;
while ((line = sequencesReader.readLine()) != null) {
if (line.trim().isEmpty() || line.startsWith("#")) {
continue;
}
int semicolon = line.indexOf(';');
if (semicolon < 0) {
continue;
}
String codepoints = line.substring(0, semicolon);
if (sequences.length() > 0) {
sequences.append("|");
}
if (rangeMatcher.reset(codepoints).find()) {
String start = rangeMatcher.group(1);
String end = rangeMatcher.group(2);
sequences.append("[\\x{").append(start).append("}");
sequences.append("-\\x{").append(end).append("}]");
} else {
Scanner scanner = new Scanner(codepoints);
while (scanner.hasNext()) {
String codepoint = scanner.next();
sequences.append("\\x{").append(codepoint).append("}");
}
}
}
}
}
private static String uriOfEmojiSequencesFile(String baseName)
throws IOException {
if (emojiSequencesBaseURI == null) {
URL readme = new URL(
"https://www.unicode.org/Public/UCD/latest/ReadMe.txt");
try (BufferedReader reader = new BufferedReader(
new InputStreamReader(readme.openStream(), "UTF-8"))) {
String line;
while ((line = reader.readLine()) != null) {
if (line.startsWith("Public/emoji/")) {
emojiSequencesBaseURI =
"https://www.unicode.org/" + line.trim();
if (!emojiSequencesBaseURI.endsWith("/")) {
emojiSequencesBaseURI += "/";
}
break;
}
}
}
if (emojiSequencesBaseURI == null) {
// Where else can we get this reliably?
String version = "15.0";
emojiSequencesBaseURI =
"https://www.unicode.org/Public/emoji/" + version + "/";
}
}
return emojiSequencesBaseURI + baseName;
}
private static Path download(String uri)
throws IOException {
Path cacheDir;
String os = System.getProperty("os.name");
String home = System.getProperty("user.home");
if (os.contains("Windows")) {
Path appDataDir;
String appData = System.getenv("APPDATA");
if (appData != null) {
appDataDir = Paths.get(appData);
} else {
appDataDir = Paths.get(home, "AppData");
}
cacheDir = appDataDir.resolve("Local");
} else if (os.contains("Mac")) {
cacheDir = Paths.get(home, "Library", "Application Support");
} else {
cacheDir = Paths.get(home, ".cache");
String cacheHome = System.getenv("XDG_CACHE_HOME");
if (cacheHome != null) {
Path dir = Paths.get(cacheHome);
if (dir.isAbsolute()) {
cacheDir = dir;
}
}
}
String baseName = uri.substring(uri.lastIndexOf('/') + 1);
Path dataDir = cacheDir.resolve(EmojiCollector.class.getName());
Path dataFile = dataDir.resolve(baseName);
if (!Files.isReadable(dataFile)) {
Files.createDirectories(dataDir);
URL dataURL = new URL(uri);
try (InputStream data = dataURL.openStream()) {
Files.copy(data, dataFile);
}
}
return dataFile;
}
public Collection<String> getEmojisIn(String letters) {
Collection<String> emoticons = new ArrayList<>();
Matcher emojiMatcher = emojiPattern.matcher(letters);
while (emojiMatcher.find()) {
emoticons.add(emojiMatcher.group());
}
return emoticons;
}
public static void main(String[] args)
throws IOException {
EmojiCollector collector = new EmojiCollector();
for (String arg : args) {
Collection<String> emojis = collector.getEmojisIn(arg);
System.out.println(arg + " => " + String.join("", emojis));
}
}
}
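The core of appendSequencesFrom is the translation of one data-file line into a regex fragment. The sketch below isolates that step so it can be tried without downloading anything. The class name and the two sample lines are illustrative assumptions; the samples only imitate the layout of emoji-sequences.txt and emoji-zwj-sequences.txt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration.
final class SequenceLineDemo {
    private static final Pattern RANGE =
            Pattern.compile("^(\\p{XDigit}{4,6})\\.\\.(\\p{XDigit}{4,6})");

    // Converts the code point field (everything before the first ';') of one
    // data line into a regex fragment. Assumes the line contains a ';'.
    static String toRegexFragment(String line) {
        String codepoints = line.substring(0, line.indexOf(';')).trim();
        Matcher range = RANGE.matcher(codepoints);
        if (range.find()) {
            // "231A..231B" becomes a character class covering the range.
            return "[\\x{" + range.group(1) + "}-\\x{" + range.group(2) + "}]";
        }
        // Otherwise the field is a space-separated sequence of code points.
        StringBuilder fragment = new StringBuilder();
        for (String codepoint : codepoints.split("\\s+")) {
            fragment.append("\\x{").append(codepoint).append("}");
        }
        return fragment.toString();
    }

    public static void main(String[] args) {
        // Sample lines modelled on the data files, not copied from them.
        System.out.println(toRegexFragment("231A..231B ; Basic_Emoji ; watch"));
        System.out.println(toRegexFragment(
                "1F3F4 200D 2620 FE0F ; RGI_Emoji_ZWJ_Sequence ; pirate flag"));
    }
}
```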
Here you go -
for (String word : sentence.split("")) {
if (word.matches(emo_regex)) {
System.out.println(word);
}
}
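Here is some Java logic that relies on the java.lang.Character API. I have found that it distinguishes emojis quite reliably from mere "special characters" and non-Latin letters. Give it a try.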
import static java.lang.Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS;
import static java.lang.Character.UnicodeBlock.MISCELLANEOUS_TECHNICAL;
import static java.lang.Character.UnicodeBlock.VARIATION_SELECTORS;
import static java.lang.Character.codePointAt;
import static java.lang.Character.codePointBefore;
import static java.lang.Character.isSupplementaryCodePoint;
import static java.lang.Character.isValidCodePoint;
public boolean checkStringEmoji(String someString) {
if(!someString.isEmpty() && someString.length() < 5) {
int firstCodePoint = codePointAt(someString, 0);
int lastCodePoint = codePointBefore(someString, someString.length());
if (isValidCodePoint(firstCodePoint) && isValidCodePoint(lastCodePoint)) {
if (isSupplementaryCodePoint(firstCodePoint) ||
isSupplementaryCodePoint(lastCodePoint) ||
Character.UnicodeBlock.of(firstCodePoint) == MISCELLANEOUS_SYMBOLS ||
Character.UnicodeBlock.of(firstCodePoint) == MISCELLANEOUS_TECHNICAL ||
Character.UnicodeBlock.of(lastCodePoint) == VARIATION_SELECTORS
) {
// string is emoji
return true;
}
}
}
return false;
}
Java 21 added Character::isEmoji (JavaDoc). For example:
String sentence = "This string contains an emoji 😂!";
sentence.codePoints()
.filter(Character::isEmoji)
.mapToObj(Character::toString)
.forEach(System.out::println);
These new methods are also accessible in regular expressions through the property construct:
Pattern.compile("\\p{IsEmoji}")
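One caveat worth knowing: the Unicode Emoji property also covers the digits 0-9, # and *, because they are the base characters of keycap sequences, so Character.isEmoji reports true for them. The sketch below assumes Java 21 or later and compares it with Character.isEmojiPresentation, which only accepts code points that are displayed as emoji by default. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; requires Java 21 or later.
final class Java21EmojiDemo {
    // Every code point that has the Unicode Emoji property.
    static List<String> withEmojiProperty(String sentence) {
        return sentence.codePoints()
                .filter(Character::isEmoji)
                .mapToObj(Character::toString)
                .collect(Collectors.toList());
    }

    // Only code points that render as emoji by default.
    static List<String> withEmojiPresentation(String sentence) {
        return sentence.codePoints()
                .filter(Character::isEmojiPresentation)
                .mapToObj(Character::toString)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sentence = "Jij staat op 10 \uD83D\uDE02"; // ends with U+1F602
        System.out.println(withEmojiProperty(sentence));     // includes the digits
        System.out.println(withEmojiPresentation(sentence)); // only the emoji
    }
}
```

For the question's sentence, which contains the number 10, the second variant is probably closer to what is wanted.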