Counting the number of words in a text

Question

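I need to count the number of words and the number of unique words in a UTF-8 text file; the text can be in any language. I tried the approach below, but I get a higher word count than Word or online services report. I strip all symbols from the text, then count words by splitting on whitespace, checking whether each item is a word or whitespace. How can I get the correct result?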

string localText = System.IO.File.ReadAllText(fils[i], Encoding.UTF8);
localText = Regex.Replace(localText, @"[^\p{L}\d\s]", "");
var collection = Regex.Matches(localText, @"\b\w{1,}\b");
var wordsWithoutSpaces = collection.Cast<Match>().Select(m => m.Value).Where(word => !string.IsNullOrWhiteSpace(word));
var uniqueMatches = collection.OfType<Match>().Select(m => m.Value).Distinct(StringComparer.CurrentCultureIgnoreCase);

fils[i] is a file from the folder.

Answer 1

First of all, we should agree on what a word is.

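Let a word be a non-empty sequence of characters that must start with a letter or an apostrophe and can contain letters, apostrophes, and dashes (minus signs, -).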

For example:

'was           - 1 word 
Q.E.D.         - 3 words 
don't          - 1 word
forget-me-not  - 1 word
George W. Bush - 3 words
USA            - 1 word
PascalCase     - 1 word

Then we can extract all the words as follows:

var words = Regex
  .Matches(text, @"[\p{L}'][\p{L}'-]*")
  .Cast<Match>()
  .Select(match => match.Value)
  .ToArray();

And then:

int count = words.Length;
int uniqueCount = words.Distinct().Count();

// Case-sensitive, i.e. "And" != "and"
HashSet<string> uniqueWords = new HashSet<string>(words);
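
If "And" and "and" should count as the same word, as the Distinct call in the question suggests, the set can be built with a case-insensitive comparer. The following is only a sketch under a few assumptions: the names WordCounter, ExtractWords, and CountUnique are made up for illustration, and StringComparer.OrdinalIgnoreCase is one possible choice (the question used StringComparer.CurrentCultureIgnoreCase, which may suit culture-specific text better).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative and not part of any library.
public static class WordCounter
{
    // Same pattern as in the answer: a letter or apostrophe,
    // followed by any run of letters, apostrophes, and dashes.
    private static readonly Regex WordPattern = new Regex(@"[\p{L}'][\p{L}'-]*");

    // Returns every match of the word pattern, in order of appearance.
    public static string[] ExtractWords(string text) =>
        WordPattern.Matches(text)
                   .Cast<Match>()
                   .Select(match => match.Value)
                   .ToArray();

    // Counts distinct words while ignoring case.
    // Assumption: ordinal case-insensitive comparison is acceptable here.
    public static int CountUnique(IEnumerable<string> words) =>
        new HashSet<string>(words, StringComparer.OrdinalIgnoreCase).Count;
}
```

For the text "The cat and the Cat", ExtractWords returns 5 words and CountUnique returns 3, whereas the case-sensitive words.Distinct().Count() gives 5.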
