Replace the Unicode character “ï¿½” with a space

Question

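I am uploading a large amount of data from a .csv file, and I need to replace the non-ASCII character “ï¿½” with a normal space “ ”.

The character “ï¿½” corresponds to “\uFFFD” in C, C++, and Java, and it appears to be called the REPLACEMENT CHARACTER. There are others as well, such as the space types U+FEFF, U+205F, U+200B, U+180E, and U+202F listed in the official C# documentation.

I am trying to do the replacement this way: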

public string Errors = "";

public void test()
{
    string textFromCsvCell = "";
    string validCharacters = "^[0-9A-Za-z().:%-/ ]+$";
    textFromCsvCell = "This is my text from csv file"; // None of the spaces are normal spaces " "
    string cleaned = textFromCsvCell.Replace("\uFFFD", " ");
    if (Regex.IsMatch(cleaned, validCharacters))
    {
        // All code for insert
    }
    else
    {
        Errors = cleaned;
        // Print Errors
    }
}

The test method shows me this text:

“This is my text from csv file”

I have also tried a few other solutions:

Attempted solution 1: using Trim

 Regex.Replace(value.Trim(), @"[^\S\r\n]+", " ");

Attempted solution 2: using Replace

  System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ");

Attempted solution 3: using Trim

  String.Trim(new char[]{'\uFEFF', '\u200B'});

Attempted solution 4: adding [\S ] to the valid characters

  string validCharacters = @"^[\S\r\n0-9A-Za-z().:%-/ ]+$";

None of them worked.

How can I replace it?

Edit

This is the original string:

“SYSTEM OF MONITORING CONTINUES OF GLUCOSE”

In 0x... notation:

SYSTEM OF0xA0MONITORING CONTINUES OF GLUCOSE

Solution

Go to a Unicode code converter. Look at the conversion and do the replacement.

In my case, I did a simple replace:
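
As an alternative to an online converter, the code points can be inspected directly in C#. This is only a minimal sketch: the class name CodePointDump, the method name Describe, and the sample string are hypothetical and not part of the original post.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats every UTF-16 code unit of the input as U+XXXX so that
    // invisible or look-alike characters become visible.
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => $"U+{(int)c:X4}"));
    }

    public static void Main()
    {
        // Hypothetical sample: "OF" and "M" separated by a no-break space.
        string cell = "OF\u00A0M";
        Console.WriteLine(Describe(cell)); // U+004F U+0046 U+00A0 U+004D
    }
}
```

Once the offending code point is known (U+00A0 in this case), it can be targeted with String.Replace or a regular expression.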

 string value = "SYSTEM OF\u00A0MONITORING CONTINUES OF GLUCOSE";
 // value contains a non-breaking space (U+00A0)
 // value is displayed as "SYSTEM OFï¿½MONITORING CONTINUES OF GLUCOSE"
 string cleaned = "";
 string pattern = @"[^\u0000-\u007F]+";
 string replacement = " ";

 Regex rgx = new Regex(pattern);
 cleaned = rgx.Replace(value, replacement);

 if (Regex.IsMatch(cleaned, "^[0-9A-Za-z().:<>%-/ ]+$"))
 {
    // All code for insert
 }
 else
 {
    // Error messages
 }

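This character class lists the whitespace characters: space, tab, form feed, line feed, carriage return, vertical tab, and the Unicode space separators: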

[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]
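
The character class above can be used in C# to turn every listed whitespace character into an ordinary space. This is a rough sketch, not the poster's code: the class name WhitespaceNormalizer and the method name Normalize are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceNormalizer
{
    // The same character class as listed above, copied into a verbatim string
    // so that the regex engine (not the C# compiler) interprets the escapes.
    private static readonly Regex AnyWhitespace = new Regex(
        @"[ \f\n\r\t\v\u00a0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]");

    // Replaces each matched character with one ordinary space (U+0020).
    public static string Normalize(string text)
    {
        return AnyWhitespace.Replace(text, " ");
    }

    public static void Main()
    {
        string value = "SYSTEM OF\u00A0MONITORING CONTINUES OF GLUCOSE";
        Console.WriteLine(Normalize(value)); // SYSTEM OF MONITORING CONTINUES OF GLUCOSE
    }
}
```

Note that this also turns tabs and line breaks into spaces, and that it does not touch U+FFFD, which is not a whitespace character.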

References

Regular expressions (MDN)

Answer 1

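Using String.Replace:

Use a simple String.Replace().

I am assuming that the only characters you want to remove are the ones you mentioned in the question, ï¿½, and that you want to replace them with a normal space.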

string text = "impï¿½ortant";
string cleaned = text.Replace('\u00ef', ' ')
        .Replace('\u00bf', ' ')
        .Replace('\u00bd', ' ');
// Returns 'imp   ortant'

Or using Regex.Replace:

string cleaned = Regex.Replace(text, "[\u00ef\u00bf\u00bd]", " ");
// Returns 'imp   ortant'

Try it: Dotnet Fiddle

Answer 2

Define the ASCII character range, and replace anything that is not within that range.

We only want to find the non-ASCII characters, so we match those and replace them.

Regex.Replace("This is my te\uFFFDxt from csv file", @"[^\u0000-\u007F]+", " ")

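The pattern above matches any character that is not (^) in the set [ ] covering the range \u0000-\u007F (the ASCII characters; everything after \u007F is non-ASCII) and replaces it with a space.

Result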

This is my te xt from csv file

You can adjust the provided range \u0000-\u007F as needed to widen the set of allowed characters.
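
As an illustration of widening the range, the sketch below keeps ASCII plus the code points U+00C0 to U+00FF (mostly accented Latin letters). That extra range is only an assumption chosen for the example, and the names AllowListCleaner and Clean are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowListCleaner
{
    // Matches runs of characters that are neither ASCII (U+0000-U+007F)
    // nor in the assumed extra range U+00C0-U+00FF.
    private static readonly Regex Disallowed =
        new Regex(@"[^\u0000-\u007F\u00C0-\u00FF]+");

    // Replaces each run of disallowed characters with a single space.
    public static string Clean(string text)
    {
        return Disallowed.Replace(text, " ");
    }

    public static void Main()
    {
        // U+00E9 (é) is kept, U+FFFD is replaced.
        Console.WriteLine(Clean("caf\u00E9\uFFFDbar")); // café bar
    }
}
```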

Answer 3

If you only want ASCII, try the following:

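// Requires: using System.Text;
var ascii = new ASCIIEncoding();
// Characters that cannot be represented in ASCII are encoded as '?'.
byte[] encodedBytes = ascii.GetBytes(text);
// Note: this also replaces any '?' that was already in the original text.
var cleaned = ascii.GetString(encodedBytes).Replace("?", " ");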
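
One drawback of the snippet above is that question marks already present in the text are replaced too. A possible variation, offered only as a sketch, asks the encoder to substitute a space directly through EncoderReplacementFallback. The names AsciiFallbackCleaner and Clean are hypothetical.

```csharp
using System;
using System.Text;

public static class AsciiFallbackCleaner
{
    public static string Clean(string text)
    {
        // An ASCII encoding whose fallback writes a space, instead of the
        // default '?', for every character that ASCII cannot represent.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(" "),
            new DecoderReplacementFallback(" "));

        return ascii.GetString(ascii.GetBytes(text));
    }

    public static void Main()
    {
        // The literal '?' survives; only U+FFFD becomes a space.
        Console.WriteLine(Clean("te\uFFFDxt?")); // te xt?
    }
}
```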

Answer 4

Try using System.IO.File.ReadAllText(packageXmlfile, System.Text.Encoding.GetEncoding("Windows-1252")). Older systems sometimes still use this code page, and .NET Core does not seem to recognize it well.
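
On .NET Core and later, code page encodings such as Windows-1252 are only available after CodePagesEncodingProvider has been registered. Otherwise GetEncoding throws. The sketch below decodes a byte array instead of a file so that it stays self-contained. It assumes the System.Text.Encoding.CodePages support is available in the project, and the names Windows1252Reader and Decode are hypothetical.

```csharp
using System;
using System.Text;

public static class Windows1252Reader
{
    public static string Decode(byte[] bytes)
    {
        // Makes code page encodings (including Windows-1252) available
        // on .NET Core / .NET 5+. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
        return windows1252.GetString(bytes);
    }

    public static void Main()
    {
        // "OF", the byte 0xA0 (no-break space in Windows-1252), then "M".
        byte[] raw = { 0x4F, 0x46, 0xA0, 0x4D };

        string decoded = Decode(raw);                    // "OF\u00A0M"
        string cleaned = decoded.Replace('\u00A0', ' '); // "OF M"
        Console.WriteLine(cleaned);
    }
}
```

The same Encoding instance can be passed to File.ReadAllText as in the answer above, so that the byte 0xA0 is read as U+00A0 instead of being turned into U+FFFD by a UTF-8 decoder.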
