How can I count word frequency after removing the non-letter characters from a string?

Question (votes: 2, answers: 2)

I have a string:

var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";

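What is the best way to remove all non-letter characters and then put each word on its own line, so that I can store the words and count how many times each one occurs?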

var words = text.Split(' ');

foreach(var word in words)
{
    word.Trim(',','.','-');
}

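I have tried various things, such as replacing the unwanted characters with whitespace via text.Replace and then splitting. I have also tried regular expressions (which I would rather not use).

I also tried using the StringBuilder class to walk through the characters of the text and append only the letters a-z / A-Z.

I also tried calling sb.Replace or sb.Remove on the characters I don't need and then storing the words in a Dictionary. But I still seem to end up with unwanted characters.

Whatever I try, I always seem to be left with at least one character I don't want, and I can't figure out why it isn't working.

Thanks!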

c# string word-count distinct-values
2 Answers
1
vote

Using an extension method, without Regex or LINQ:

using System;
using System.Collections.Generic;
using System.Text;

public static class StringHelper
{
  // Counts how often each word occurs; a word is a run of letters.
  public static Dictionary<string, int> CountDistinctWords(this string text)
  {
    var words = new Dictionary<string, int>();
    var builder = new StringBuilder();

    // Adds the word collected in the builder (if any) to the dictionary,
    // then resets the builder for the next word.
    Action flush = () =>
    {
      if ( builder.Length == 0 )
        return;
      string word = builder.ToString();
      if ( words.ContainsKey(word) )
        words[word]++;
      else
        words.Add(word, 1);
      builder.Clear();
    };

    foreach ( char charCurrent in text )
    {
      if ( char.IsLetter(charCurrent) )
        builder.Append(charCurrent);
      else if ( !char.IsNumber(charCurrent) )
        flush(); // whitespace, punctuation and new lines end the current word
      // Digits are skipped without ending the current word.
    }
    flush(); // do not lose the last word
    return words;
  }
}

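It walks through every character, rejects everything that is not a letter, and builds a dictionary of the words together with their number of occurrences as it goes.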
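As a side note, one likely reason the loop in the question leaves characters untouched is that string.Trim returns a new string rather than changing word in place, so its result has to be stored. A minimal sketch of that point follows; the class name TrimDemo and the method SplitAndTrim are hypothetical and only there for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class TrimDemo
{
  // Hypothetical helper: splits on whitespace and keeps the trimmed pieces.
  public static List<string> SplitAndTrim(string text)
  {
    var result = new List<string>();
    var pieces = text.Split(new[] { ' ', '\r', '\n', '\t' },
                            StringSplitOptions.RemoveEmptyEntries);
    foreach ( string piece in pieces )
    {
      // Strings are immutable: Trim returns a new string that must be assigned.
      string trimmed = piece.Trim(',', '.', '-');
      if ( trimmed.Length > 0 )
        result.Add(trimmed);
    }
    return result;
  }
}
```

Trim only strips characters at the start and end of a piece, so "non-letter" would still come out as a single token here, which is why the character-by-character loop above is the more robust route.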

Test

var result = text.CountDistinctWords();
Console.WriteLine($"Found {result.Count} distinct words:");
Console.WriteLine();
foreach ( var item in result )
  Console.WriteLine($"{item.Key}: {item.Value}");

Result for your sample

Found 36 distinct words:

I: 3
have: 2
a: 2
long: 1
string: 1
with: 1
load: 1
of: 3
words: 1
and: 3
it: 1
includes: 1
new: 1
lines: 1
non: 1
letter: 1
characters: 1
want: 1
to: 2
remove: 1
all: 1
them: 1
split: 1
this: 1
text: 1
one: 1
word: 2
per: 1
line: 1
then: 1
can: 1
count: 1
how: 1
many: 1
each: 1
exist: 1
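If the most frequent words should be listed first, the dictionary can be sorted before printing, still without LINQ. This is only a sketch: the class WordCountReport and the method SortByCount are hypothetical names, and breaking ties by ordinal word order is an assumption rather than something the question asks for.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountReport
{
  // Hypothetical helper: returns the entries ordered by descending count,
  // with ties broken by ordinal comparison of the words.
  public static List<KeyValuePair<string, int>> SortByCount(Dictionary<string, int> counts)
  {
    var entries = new List<KeyValuePair<string, int>>(counts);
    entries.Sort((a, b) =>
    {
      int byCount = b.Value.CompareTo(a.Value);
      return byCount != 0 ? byCount : string.CompareOrdinal(a.Key, b.Key);
    });
    return entries;
  }
}
```

The test loop can then iterate over WordCountReport.SortByCount(result) instead of result.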

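-1
votes

Use a regular expression to exclude the non-letter characters. This also gives you a collection of all the words.

using System.Linq;
using System.Text.RegularExpressions;

var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";

// Each match is one run of letters, so no extra trimming or splitting is needed
// and repeated spaces cannot produce empty entries.
var words = Regex.Matches(text, @"[A-Za-z]+").Cast<Match>().Select(m => m.Value);
int wordCount = words.Count();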
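The regex approach yields the words and their total number, but not how often each word occurs. One way to get per-word counts is to group the matches, roughly as sketched below. The class RegexWordCounter and the method CountWords are hypothetical names, and the pattern assumes that only ASCII letters count as word characters.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordCounter
{
  // Hypothetical helper: treats each run of ASCII letters as one word
  // and returns the number of occurrences per word (case-sensitive).
  public static Dictionary<string, int> CountWords(string text)
  {
    return Regex.Matches(text, @"[A-Za-z]+")
                .Cast<Match>()
                .GroupBy(m => m.Value)
                .ToDictionary(g => g.Key, g => g.Count());
  }
}
```

Calling RegexWordCounter.CountWords(text) returns a dictionary that can be printed with the same foreach loop used in the first answer.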
