Avoiding overlapping terms when using KWIC in quanteda

Question

I am using a dictionary to search for occurrences of terms in a corpus. The terms can appear on their own, but they very often overlap:

corpus <- c("According to the Canadian Charter of Rights and Freedoms, all Canadians...")

dict <- dictionary(list(constitution = c("charter of rights", "canadian charter"))) 

kwic(corpus, dict)

The above will (correctly) match the following sentence twice:

"According to the Canadian Charter of Rights and Freedoms, all Canadians..."

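To establish how often these terms occur without double counting, however, I need to ensure that an occurrence of "canadian charter" is counted only when it is not followed by "...of rights...".

How can I do this?

Edit: I just noticed that this is not an issue when using tokens_lookup, so the question is something of a moot point. Leaving it up in case it helps anyone.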
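
A minimal sketch of the tokens_lookup() route mentioned in the edit, reusing the sentence and dictionary from the question. The variable names txt, toks and looked are illustrative only. No output is shown because how overlapping values within one key are counted may depend on the quanteda version, so it is worth inspecting the result directly.

```r
library("quanteda")

# Sentence and dictionary taken from the question
txt <- "According to the Canadian Charter of Rights and Freedoms, all Canadians..."
dict <- dictionary(list(constitution = c("charter of rights", "canadian charter")))

# Tokenize, then replace dictionary matches with their key
toks <- tokens(txt)
looked <- tokens_lookup(toks, dict)

# Inspect how many times the "constitution" key was matched
print(looked)
print(dfm(looked))
```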

Answer 1

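When you ask for a kwic, you get all pattern matches, even when they overlap. So the way to avoid the overlap, in the sense I think you are asking about, is to manually convert the multi-word expressions (MWEs) into single tokens in a way that prevents them from overlapping. In your case, you want to count "Canadian charter" when it is not followed by "of rights". I would therefore suggest tokenizing the text and then compounding the MWEs in a sequence that guarantees they will not overlap.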

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- "The Canadian charter of rights and the Canadian charter are different."
dict <- dictionary(list(constitution = c("charter of rights", "canadian charter")))

toks <- tokens(txt)
tokscomp <- toks %>%
  tokens_compound(phrase("charter of rights"), concatenator = " ") %>%
  tokens_compound(phrase("Canadian charter"), concatenator = " ")
tokscomp
## tokens from 1 document.
## text1 :
## [1] "The"               "Canadian"          "charter of rights"
## [4] "and"               "the"               "Canadian charter" 
## [7] "are"               "different"         "."

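This turns the phrases into single tokens, delimited here by a space, which means that kwic() (if that is what you want to use) will not double count them, since each is now matched only as a single MWE token.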

kwic(tokscomp, dict, window = 2)
##                                                             
##  [text1, 3] The Canadian | charter of rights | and the      
##  [text1, 6]      and the | Canadian charter  | are different

Note that to simply count them, you could also use dfm() with your dictionary as the value of the select argument:

dfm(tokscomp, select = dict)
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    charter of rights canadian charter
##   text1                 1                1

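Finally, if what you mainly wanted was to distinguish "Canadian charter of rights" from "Canadian charter", you could compound the former first and then the latter (longest to shortest is best here). But that is not exactly what you asked.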
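
The longest-to-shortest ordering can be sketched as follows, assuming the same example text as above. The names toks_long and toks_longest are illustrative, and the two steps are written as separate assignments rather than with a pipe so the sketch does not depend on `%>%` being available.

```r
library("quanteda")

txt <- "The Canadian charter of rights and the Canadian charter are different."
toks <- tokens(txt)

# Step 1: compound the longest phrase first, so it is locked into one token
toks_long <- tokens_compound(toks, phrase("Canadian charter of rights"),
                             concatenator = " ")

# Step 2: compound the shorter phrase; it can now only match occurrences
# that were not already absorbed by the longer phrase
toks_longest <- tokens_compound(toks_long, phrase("Canadian charter"),
                                concatenator = " ")

print(as.character(toks_longest))
```

Because the first occurrence is already a single token after step 1, the shorter pattern only matches the standalone "Canadian charter" later in the sentence.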
