如何从.NET中的字符串中删除变音符号(重音符号)?

问题描述 投票:388回答:18

我正在尝试转换一些法语加拿大语的字符串,基本上,我希望能够在保留字母的同时取出字母中的法语重音符号。 (例如,将é转换为e,因此crème brûlée将成为creme brulee

实现这一目标的最佳方法是什么?

.net string diacritics
18个回答
483
投票

我没有使用过这种方法,但迈克尔·卡普兰在他的博客文章(带有令人困惑的标题)中描述了这样做的方法,该文章讨论了剥离变音符号:Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

请注意,这是他早期帖子的后续内容:qazxsw poi

该方法使用Stripping diacritics....将输入字符串拆分为组成字形(基本上将“基本”字符与变音符号分开),然后扫描结果并仅保留基本字符。这有点复杂,但实际上你正在研究一个复杂的问题。

当然,如果你限制自己使用法语,你可能会按照@David Dibben的推荐,在String.Normalize中使用简单的基于表格的方法。


2
投票

这是VB版本(与GREEK合作):

导入System.Text

进口System.Globalization

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();            
        return keepSpace ? str : str.RemoveSpace();
    }

2
投票

试试 [TestMethod()] public void LatinizeAndConvertToASCIITest() { string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ"; string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN"; string actual = europeanStr.LatinizeAndConvertToASCII(); Assert.AreEqual(expected, actual); }

有一个方法RemoveAccents:

Public Function RemoveDiacritics(ByVal s As String)
    Dim normalizedString As String
    Dim stringBuilder As New StringBuilder
    normalizedString = s.Normalize(NormalizationForm.FormD)
    Dim i As Integer
    Dim c As Char
    For i = 0 To normalizedString.Length - 1
        c = normalizedString(i)
        If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
            stringBuilder.Append(c)
        End If
    Next
    Return stringBuilder.ToString()
End Function

2
投票

这就是我在所有.NET程序中将变音字符替换为非变音符号的方法

C#:

HelperSharp package

VB .NET:

 public static string RemoveAccents(this string source)
 {
     //8 bit characters 
     byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

     // 7 bit characters
     string t = Encoding.ASCII.GetString(b);
     Regex re = new Regex("[^a-zA-Z0-9]=-_/");
     string c = re.Replace(t, " ");
     return c;
 }

2
投票

您可以使用MMLib.Extensions nuget包中的字符串扩展名:

//Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter 'é' is substituted by an 'e'
public string RemoveDiacritics(string s)
{
    string normalizedString = null;
    StringBuilder stringBuilder = new StringBuilder();
    normalizedString = s.Normalize(NormalizationForm.FormD);
    int i = 0;
    char c = '\0';

    for (i = 0; i <= normalizedString.Length - 1; i++)
    {
        c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().ToLower();
}

Nuget页面:'Transforms the culture of a letter to its equivalent representation in the 0-127 ascii table, such as the letter "é" is substituted by an "e"' Public Function RemoveDiacritics(ByVal s As String) As String Dim normalizedString As String Dim stringBuilder As New StringBuilder normalizedString = s.Normalize(NormalizationForm.FormD) Dim i As Integer Dim c As Char For i = 0 To normalizedString.Length - 1 c = normalizedString(i) If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then stringBuilder.Append(c) End If Next Return stringBuilder.ToString().ToLower() End Function Codeplex项目网站using MMLib.RapidPrototyping.Generators; public void ExtensionsExample() { string target = "aácčeéií"; Assert.AreEqual("aacceeii", target.RemoveDiacritics()); }


2
投票

TL; DR - https://www.nuget.org/packages/MMLib.Extensions/

我认为保留字符串含义的最佳解决方案是转换字符而不是剥离它们,这在示例https://mmlib.codeplex.com/C# string extension methodcrème brûlée中得到了很好的说明。

我检查了crme brle并看到Lucene.Net代码是Apache 2.0许可的,所以我已经将类修改为一个简单的字符串扩展方法。你可以像这样使用它:

creme brulee

这个函数太长了,无法在StackOverflow答案中发布(~30k的30k字符允许lol)所以Alexander's comment above

var originalString = "crème brûlée";
var maxLength = originalString.Length; // limit output length as necessary
var foldedString = originalString.FoldToASCII(maxLength); 
// "creme brulee"

希望能帮助别人,这是我发现的最强大的解决方案!



1
投票

/* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ /// <summary> /// This class converts alphabetic, numeric, and symbolic Unicode characters /// which are not in the first 127 ASCII characters (the "Basic Latin" Unicode /// block) into their ASCII equivalents, if one exists. /// <para/> /// Characters from the following Unicode blocks are converted; however, only /// those characters with reasonable ASCII alternatives are converted: /// /// <ul> /// <item><description>C1 Controls and Latin-1 Supplement: <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a></description></item> /// <item><description>Latin Extended-A: <a href="http://www.unicode.org/charts/PDF/U0100.pdf">http://www.unicode.org/charts/PDF/U0100.pdf</a></description></item> /// <item><description>Latin Extended-B: <a href="http://www.unicode.org/charts/PDF/U0180.pdf">http://www.unicode.org/charts/PDF/U0180.pdf</a></description></item> /// <item><description>Latin Extended Additional: <a href="http://www.unicode.org/charts/PDF/U1E00.pdf">http://www.unicode.org/charts/PDF/U1E00.pdf</a></description></item> /// <item><description>Latin Extended-C: <a href="http://www.unicode.org/charts/PDF/U2C60.pdf">http://www.unicode.org/charts/PDF/U2C60.pdf</a></description></item> /// <item><description>Latin Extended-D: <a href="http://www.unicode.org/charts/PDF/UA720.pdf">http://www.unicode.org/charts/PDF/UA720.pdf</a></description></item> /// <item><description>IPA Extensions: <a href="http://www.unicode.org/charts/PDF/U0250.pdf">http://www.unicode.org/charts/PDF/U0250.pdf</a></description></item> /// <item><description>Phonetic Extensions: <a href="http://www.unicode.org/charts/PDF/U1D00.pdf">http://www.unicode.org/charts/PDF/U1D00.pdf</a></description></item> /// <item><description>Phonetic Extensions Supplement: <a href="http://www.unicode.org/charts/PDF/U1D80.pdf">http://www.unicode.org/charts/PDF/U1D80.pdf</a></description></item> /// <item><description>General Punctuation: <a href="http://www.unicode.org/charts/PDF/U2000.pdf">http://www.unicode.org/charts/PDF/U2000.pdf</a></description></item> /// <item><description>Superscripts and Subscripts: <a href="http://www.unicode.org/charts/PDF/U2070.pdf">http://www.unicode.org/charts/PDF/U2070.pdf</a></description></item> /// <item><description>Enclosed Alphanumerics: <a href="http://www.unicode.org/charts/PDF/U2460.pdf">http://www.unicode.org/charts/PDF/U2460.pdf</a></description></item> /// <item><description>Dingbats: <a href="http://www.unicode.org/charts/PDF/U2700.pdf">http://www.unicode.org/charts/PDF/U2700.pdf</a></description></item> /// <item><description>Supplemental Punctuation: <a href="http://www.unicode.org/charts/PDF/U2E00.pdf">http://www.unicode.org/charts/PDF/U2E00.pdf</a></description></item> /// <item><description>Alphabetic Presentation Forms: <a href="http://www.unicode.org/charts/PDF/UFB00.pdf">http://www.unicode.org/charts/PDF/UFB00.pdf</a></description></item> /// <item><description>Halfwidth and Fullwidth Forms: <a href="http://www.unicode.org/charts/PDF/UFF00.pdf">http://www.unicode.org/charts/PDF/UFF00.pdf</a></description></item> /// </ul> /// <para/> /// See: <a href="http://en.wikipedia.org/wiki/Latin_characters_in_Unicode">http://en.wikipedia.org/wiki/Latin_characters_in_Unicode</a> /// <para/> /// For example, '&amp;agrave;' will be replaced by 'a'. /// </summary> public static partial class StringExtensions { /// <summary> /// Converts characters above ASCII to their ASCII equivalents. For example, /// accents are removed from accented characters. /// </summary> /// <param name="input"> The string of characters to fold </param> /// <param name="length"> The length of the folded return string </param> /// <returns> length of output </returns> public static string FoldToASCII(this string input, int? length = null) { // See https://gist.github.com/andyraddatz/e6a396fb91856174d4e3f1bf2e10951c } }

Imports System.Text Imports System.Globalization Public Function DECODE(ByVal x As String) As String Dim sb As New StringBuilder For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark) sb.Append(c) Next Return sb.ToString() End Function

它实际上分裂了What this person said:这是一个字符(这是字符代码Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));,而不是å加上看起来相同的修饰符00E5)到0061加上某种修饰符,然后ASCII转换删除修饰符,留下唯一030A


1
投票

如果你还没有考虑过,可以在这里弹出这个库。看起来有各种各样的单元测试。

a


0
投票

我非常喜欢a提供的简洁和功能性代码。所以,我已经改变了一点,将其转换为扩展方法:

https://github.com/thomasgalliker/Diacritics.NET

-1
投票

没有足够的声誉,显然我不能评论亚历山大的优秀链接。 - Lucene似乎是在合理通用案例中工作的唯一解决方案。

对于那些想要一个简单的复制粘贴解决方案的人来说,就是利用Lucene中的代码:

string testbed =“ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíïïðñóôöøúüăăčĐęğıŁłńŌōřŞşšźžşţệủ”;

Console.WriteLine(Lucene.latinizeLucene(试验台));

AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

//////////

azrafe7

143
投票

29
投票

如果有人感兴趣,我正在寻找类似的东西并结束了以下内容:

string accentedStr;
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

18
投票

我需要一些转换所有主要unicode字符的东西,并且投票的答案留了一些,所以我创建了一个版本的CodeIgniter的 public static string NormalizeStringForUrl(string name) { String normalizedString = name.Normalize(NormalizationForm.FormD); StringBuilder stringBuilder = new StringBuilder(); foreach (char c in normalizedString) { switch (CharUnicodeInfo.GetUnicodeCategory(c)) { case UnicodeCategory.LowercaseLetter: case UnicodeCategory.UppercaseLetter: case UnicodeCategory.DecimalDigitNumber: stringBuilder.Append(c); break; case UnicodeCategory.SpaceSeparator: case UnicodeCategory.ConnectorPunctuation: case UnicodeCategory.DashPunctuation: stringBuilder.Append('_'); break; } } string result = stringBuilder.ToString(); return String.Join("_", result.Split(new char[] { '_' } , StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores } 到C#中,可以轻松定制:

convert_accented_characters($str)

用法

using System;
using System.Text;
using System.Collections.Generic;

public static class Strings
{
    static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
    {
        { "äæǽ", "ae" },
        { "öœ", "oe" },
        { "ü", "ue" },
        { "Ä", "Ae" },
        { "Ü", "Ue" },
        { "Ö", "Oe" },
        { "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
        { "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
        { "Б", "B" },
        { "б", "b" },
        { "ÇĆĈĊČ", "C" },
        { "çćĉċč", "c" },
        { "Д", "D" },
        { "д", "d" },
        { "ÐĎĐΔ", "Dj" },
        { "ðďđδ", "dj" },
        { "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
        { "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
        { "Ф", "F" },
        { "ф", "f" },
        { "ĜĞĠĢΓГҐ", "G" },
        { "ĝğġģγгґ", "g" },
        { "ĤĦ", "H" },
        { "ĥħ", "h" },
        { "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
        { "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
        { "Ĵ", "J" },
        { "ĵ", "j" },
        { "ĶΚК", "K" },
        { "ķκк", "k" },
        { "ĹĻĽĿŁΛЛ", "L" },
        { "ĺļľŀłλл", "l" },
        { "М", "M" },
        { "м", "m" },
        { "ÑŃŅŇΝН", "N" },
        { "ñńņňʼnνн", "n" },
        { "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
        { "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
        { "П", "P" },
        { "п", "p" },
        { "ŔŖŘΡР", "R" },
        { "ŕŗřρр", "r" },
        { "ŚŜŞȘŠΣС", "S" },
        { "śŝşșšſσςс", "s" },
        { "ȚŢŤŦτТ", "T" },
        { "țţťŧт", "t" },
        { "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
        { "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
        { "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" },
        { "ýÿŷỳỹỷỵй", "y" },
        { "В", "V" },
        { "в", "v" },
        { "Ŵ", "W" },
        { "ŵ", "w" },
        { "ŹŻŽΖЗ", "Z" },
        { "źżžζз", "z" },
        { "ÆǼ", "AE" },
        { "ß", "ss" },
        { "IJ", "IJ" },
        { "ij", "ij" },
        { "Œ", "OE" },
        { "ƒ", "f" },
        { "ξ", "ks" },
        { "π", "p" },
        { "β", "v" },
        { "μ", "m" },
        { "ψ", "ps" },
        { "Ё", "Yo" },
        { "ё", "yo" },
        { "Є", "Ye" },
        { "є", "ye" },
        { "Ї", "Yi" },
        { "Ж", "Zh" },
        { "ж", "zh" },
        { "Х", "Kh" },
        { "х", "kh" },
        { "Ц", "Ts" },
        { "ц", "ts" },
        { "Ч", "Ch" },
        { "ч", "ch" },
        { "Ш", "Sh" },
        { "ш", "sh" },
        { "Щ", "Shch" },
        { "щ", "shch" },
        { "ЪъЬь", "" },
        { "Ю", "Yu" },
        { "ю", "yu" },
        { "Я", "Ya" },
        { "я", "ya" },
    };

    public static char RemoveDiacritics(this char c){
        foreach(KeyValuePair<string, string> entry in foreign_characters)
        {
            if(entry.Key.IndexOf (c) != -1)
            {
                return entry.Value[0];
            }
        }
        return c;
    }

    public static string RemoveDiacritics(this string s) 
    {
        //StringBuilder sb = new StringBuilder ();
        string text = "";


        foreach (char c in s)
        {
            int len = text.Length;

            foreach(KeyValuePair<string, string> entry in foreign_characters)
            {
                if(entry.Key.IndexOf (c) != -1)
                {
                    text += entry.Value;
                    break;
                }
            }

            if (len == text.Length) {
                text += c;  
            }
        }
        return text;
    }
}

16
投票

如果有人感兴趣,这里是java等价物:

// for strings
"crème brûlée".RemoveDiacritics (); // creme brulee

// for chars
"Ã"[0].RemoveDiacritics (); // A

12
投票

我经常使用基于我在这里找到的另一个版本的扩展方法(参见qazxsw poi)快速解释:

  • 归一化以形成D分裂特征,如è到e和非间距`
  • 由此,删除了nospacing字符
  • 结果归一化为C(我不确定这是否是必要的)

码:

import java.text.Normalizer;

public class MyClass
{
    public static String removeDiacritics(String input)
    {
        String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder stripped = new StringBuilder();
        for (int i=0;i<nrml.length();++i)
        {
            if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
            {
                stripped.append(nrml.charAt(i));
            }
        }
        return stripped.toString();
    }
}

9
投票

希腊语(ISO)的CodePage可以做到这一点

有关此代码页的信息,请访问Replacing characters in C# (ascii)。了解:using System.Linq; using System.Text; using System.Globalization; // namespace here public static class Utility { public static string RemoveDiacritics(this string str) { if (null == str) return null; var chars = from c in str.Normalize(NormalizationForm.FormD).ToCharArray() let uc = CharUnicodeInfo.GetUnicodeCategory(c) where uc != UnicodeCategory.NonSpacingMark select c; var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC); return cleanStr; } // or, alternatively public static string RemoveDiacritics2(this string str) { if (null == str) return null; var chars = str .Normalize(NormalizationForm.FormD) .ToCharArray() .Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark) .ToArray(); return new string(chars).Normalize(NormalizationForm.FormC); } }

希腊语(ISO)的代码页为28597,名称为iso-8859-7。

转到代码... \ o /

System.Text.Encoding.GetEncodings()

所以,写这个功能......

https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

请注意...... string text = "Você está numa situação lamentável"; string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7")); //result: "Voce+esta+numa+situacao+lamentavel" string textDecode = System.Web.HttpUtility.UrlDecode(textEncode); //result: "Voce esta numa situacao lamentavel" 相当于public string RemoveAcentuation(string text) { return System.Web.HttpUtility.UrlDecode( System.Web.HttpUtility.UrlEncode( text, Encoding.GetEncoding("iso-8859-7"))); } ,因为第一个是名称,第二个是编码的代码页。


4
投票

这在java中工作正常。

它基本上将所有重音字符转换为deAccented对应字符,然后将它们组合为变音符号。现在你可以使用正则表达式去除变音符号。

Encoding.GetEncoding("iso-8859-7")

3
投票

有趣的是这样一个问题可以获得如此多的答案,但是没有一个符合我的要求:)有很多语言,完全与语言无关的解决方案是AFAIK真的不可能,因为其他人提到FormC或FormD提出问题。

由于最初的问题与法语有关,因此最简单的工作答案确实如此

Encoding.GetEncoding(28597)

1251应该由输入语言的编码代码替换。

但是,这只能用一个字符替换一个字符。由于我也在使用德语作为输入,我做了手动转换

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

它可能无法提供最佳性能,但至少它非常易于阅读和扩展。正则表达式是NO GO,比任何字符/字符串都要慢得多。

我还有一个非常简单的方法来删除空间:

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

最后,我使用了以上所有3个扩展的组合:

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

并通过一个小单元测试(不详尽)成功通过。

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }
© www.soinside.com 2019 - 2024. All rights reserved.