PHP: count word frequency, allowing for punctuation

Question

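I am trying to extract commonly used phrases from a large body of text. I don't just want single words; I want every run of words that falls between two stop words. For example, for https://en.wikipedia.org/wiki/Wuthering_Heights I would want the phrase "wuthering heights" to be counted, rather than "wuthering" and "heights" separately.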

if (in_array($word, $this->stopwords))
{
    // A stop word ends the current phrase: strip non-letters and record its count.
    $cleanPhrase = preg_replace("/[^A-Za-z ]/", '', $currentPhrase);
    $cleanPhrase = trim($cleanPhrase);
    if ($cleanPhrase != "" && strlen($cleanPhrase) > 2)
    {
        $this->Phrases[$cleanPhrase] = substr_count($normalisedText, $cleanPhrase);
        $currentPhrase = "";
    }
    continue;
}
else
{
    // Otherwise keep extending the current phrase.
    $currentPhrase = $currentPhrase . $word . " ";
}

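The problem with this is that if the word "stage" appears in the text, it is also counted as an occurrence of "age". One fix is to add whitespace on either side of the $cleanPhrase variable. That leads to another problem when there is no whitespace: the phrase may be followed by a comma, a full stop, or some other punctuation character. I want to count all of these occurrences. Is there a way to do this without having to do something like the following?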

$terminate = array(".", " ", ",", "!", "?");
$count = 0;
foreach($terminate as $tpun)
{
    $count += substr_count($normalisedText, $tpun . $cleanPhrase . $tpun);
}
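
To make both problems concrete, here is a small sketch on a made-up sample string (the text is an assumption for illustration, not taken from the page above). It suggests that plain substr_count over-counts because of "stage", while the terminator loop under-counts, since it requires the same character on both sides of the phrase and so misses cases such as " age," or a phrase at the very end of the text.

```php
<?php
// Hypothetical sample text, chosen only to illustrate the issue.
$normalisedText = "on the stage in this day and age, people age";
$cleanPhrase = "age";

// Plain substring counting also matches the "age" inside "stage".
$naive = substr_count($normalisedText, $cleanPhrase);

// The terminator loop needs the SAME character before and after the phrase,
// so " age," (space before, comma after) and the trailing "age" are missed.
$terminate = array(".", " ", ",", "!", "?");
$count = 0;
foreach ($terminate as $tpun) {
    $count += substr_count($normalisedText, $tpun . $cleanPhrase . $tpun);
}

echo "substr_count: ", $naive, "\n";
echo "terminator loop: ", $count, "\n";
```

Counting by hand, the standalone word "age" appears twice in this sample, so neither approach gives the wanted number.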

Answer 1

Adapting this answer with a slight modification, you can do the following:

$sentence = "Age: In this day and age, people of all age are on the stage.";
$word = 'age';
preg_match_all('/\b'.$word.'\b/i', $sentence, $matches);

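\b denotes a word boundary. So searching for age in this string gives a count of 3, because "stage" is not matched. The i modifier in the pattern makes the match case-insensitive; remove it if you want a case-sensitive match.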

If you match only one phrase at a time, you will find the count in count($matches[0]).
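
As a rough sketch of how the word-boundary approach could be applied to several phrases at once, the helper below wraps the same pattern in a loop. The function name countPhrases and the sample text are assumptions made for illustration; preg_quote and the integer return value of preg_match_all are standard PHP.

```php
<?php
/**
 * Count whole-word occurrences of each phrase in $text.
 * countPhrases() is a hypothetical helper name, not from the original answer.
 */
function countPhrases(string $text, array $phrases): array
{
    $counts = [];
    foreach ($phrases as $phrase) {
        // preg_quote() escapes regex metacharacters; the second argument
        // also escapes the "/" delimiter used in the pattern.
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
        // preg_match_all() returns the number of full matches, or false on error.
        $counts[$phrase] = (int) preg_match_all($pattern, $text);
    }
    return $counts;
}

// Hypothetical sample text for demonstration.
$text = "Age: In this day and age, people of all age are on the stage. "
      . "Wuthering Heights is a novel; wuthering heights, again.";
$counts = countPhrases($text, ['age', 'wuthering heights']);
print_r($counts);
```

Note that \b only behaves as expected when the phrase starts and ends with a word character (letter, digit, or underscore). That holds for the cleaned phrases in the question, which contain only letters and spaces.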
