如何在R中输出一个字的固定距离值的所有可能的偏差?

问题描述 投票:1回答:1

我有一个词,并希望在R中输出所有可能的偏差(替换,替换,插入),将固定距离值输入到矢量中。

例如,单词“Cat”和固定距离值为1会产生一个向量,其元素为“cot”,“at”,......

r text-mining tidyverse stringr quanteda
1个回答
1
投票

我将假设您想要所有实际的单词,而不仅仅是编辑距离为1的字符的排列,其中包括非单词,例如“zat”。

我们可以使用adist()来计算目标单词和所有符合条件的英语单词之间的编辑距离,这些单词取自某些单词列表。在这里,我使用了quanteda包中的英语音节词典(你确实将这个问题标记为quanteda),但这可能是来自任何其他来源的任何英语词典单词的矢量。

为了缩小范围,我们首先从距离值中排除长度与目标词不同的所有单词。

distfn <- function(word, distance = 1) {
  # select eligible words for efficiency
  eligible_y_words <- names(quanteda::data_int_syllables)
  wordlengths <- nchar(eligible_y_words)
  eligible_y_words <- eligible_y_words[wordlengths >= (nchar(word) - distance) &
    wordlengths <= (nchar(word) + distance)]
  # compute Levenshtein distance
  distances <- utils::adist(word, eligible_y_words)[1, ]
  # return only those for the requested distance value
  eligible_y_words[distances == distance]
}

distfn("cat", 1)
##  [1] "at"   "bat"  "ca"   "cab"  "cac"  "cad"  "cai"  "cal"  "cam"  "can" 
## [11] "cant" "cao"  "cap"  "caq"  "car"  "cart" "cas"  "cast" "cate" "cato"
## [21] "cats" "catt" "cau"  "caw"  "cay"  "chat" "coat" "cot"  "ct"   "cut" 
## [31] "dat"  "eat"  "fat"  "gat"  "hat"  "kat"  "lat"  "mat"  "nat"  "oat" 
## [41] "pat"  "rat"  "sat"  "scat" "tat"  "vat"  "wat"

使用替代距离值演示如何对较长的单词进行操作。

distfn("coffee", 1)
## [1] "caffee"  "coffeen" "coffees" "coffel"  "coffer"  "coffey"  "cuffee" 
## [8] "toffee"

distfn("coffee", 2)
##  [1] "caffey"   "calfee"   "chafee"   "chaffee"  "cofer"    "coffee's"
##  [7] "coffelt"  "coffers"  "coffin"   "cofide"   "cohee"    "coiffe"  
## [13] "coiffed"  "colee"    "colfer"   "combee"   "comfed"   "confer"  
## [19] "conlee"   "coppee"   "cottee"   "coulee"   "coutee"   "cuffe"   
## [25] "cuffed"   "diffee"   "duffee"   "hoffer"   "jaffee"   "joffe"   
## [31] "mcaffee"  "moffet"   "noffke"   "offen"    "offer"    "roffe"   
## [37] "scoffed"  "soffel"   "soffer"   "yoffie"

(是的,根据CMU发音词典,这些都是实际的词......)

编辑:制作所有字母的排列,而不仅仅是实际的单词

这涉及来自字母表的排列,其具有与输入单词固定的编辑距离。在这里,我没有特别有效地通过在符合条件的范围内形成字母的所有排列,然后从目标词计算它们的编辑距离,然后选择它们来完成它。所以它是上面的变体,除了代替字典,它使用置换词。

distfn2 <- function(word, distance = 1) {
  result <- character()

  # start with deletions
  for (i in max((nchar(word) - distance), 0):(nchar(word) - 1)) {
    result <- c(
      result,
      combn(unlist(strsplit(word, "", fixed = TRUE)), i,
        paste,
        collapse = "", simplify = TRUE
      )
    )
  }

  # now for changes and insertions
  for (i in (nchar(word)):(nchar(word) + distance)) {
    # all possible edits
    edits <- apply(expand.grid(rep(list(letters), i)),
      1, paste0,
      collapse = ""
    )
    # remove original word
    edits <- edits[edits != word]
    # get all distances, add to result
    distances <- utils::adist(word, edits)[1, ]
    result <- c(result, edits[distances == distance])
  }

  result
}

对于OP示例:

distfn2("cat", 1)
##   [1] "ca"   "ct"   "at"   "caa"  "cab"  "cac"  "cad"  "cae"  "caf"  "cag" 
##  [11] "cah"  "cai"  "caj"  "cak"  "cal"  "cam"  "can"  "cao"  "cap"  "caq" 
##  [21] "car"  "cas"  "aat"  "bat"  "dat"  "eat"  "fat"  "gat"  "hat"  "iat" 
##  [31] "jat"  "kat"  "lat"  "mat"  "nat"  "oat"  "pat"  "qat"  "rat"  "sat" 
##  [41] "tat"  "uat"  "vat"  "wat"  "xat"  "yat"  "zat"  "cbt"  "cct"  "cdt" 
##  [51] "cet"  "cft"  "cgt"  "cht"  "cit"  "cjt"  "ckt"  "clt"  "cmt"  "cnt" 
##  [61] "cot"  "cpt"  "cqt"  "crt"  "cst"  "ctt"  "cut"  "cvt"  "cwt"  "cxt" 
##  [71] "cyt"  "czt"  "cau"  "cav"  "caw"  "cax"  "cay"  "caz"  "cata" "catb"
##  [81] "catc" "catd" "cate" "catf" "catg" "cath" "cati" "catj" "catk" "catl"
##  [91] "catm" "catn" "cato" "catp" "catq" "catr" "cats" "caat" "cbat" "acat"
## [101] "bcat" "ccat" "dcat" "ecat" "fcat" "gcat" "hcat" "icat" "jcat" "kcat"
## [111] "lcat" "mcat" "ncat" "ocat" "pcat" "qcat" "rcat" "scat" "tcat" "ucat"
## [121] "vcat" "wcat" "xcat" "ycat" "zcat" "cdat" "ceat" "cfat" "cgat" "chat"
## [131] "ciat" "cjat" "ckat" "clat" "cmat" "cnat" "coat" "cpat" "cqat" "crat"
## [141] "csat" "ctat" "cuat" "cvat" "cwat" "cxat" "cyat" "czat" "cabt" "cact"
## [151] "cadt" "caet" "caft" "cagt" "caht" "cait" "cajt" "cakt" "calt" "camt"
## [161] "cant" "caot" "capt" "caqt" "cart" "cast" "catt" "caut" "cavt" "cawt"
## [171] "caxt" "cayt" "cazt" "catu" "catv" "catw" "catx" "caty" "catz"

也适用于其他编辑距离,但对于较长的单词,它会变得非常慢。

d2 <- distfn2("cat", 2)
set.seed(100)
c(head(d2, 50), sample(d2, 50), tail(d2, 50))
##   [1] "c"     "a"     "t"     "ca"    "ct"    "at"    "aaa"   "baa"  
##   [9] "daa"   "eaa"   "faa"   "gaa"   "haa"   "iaa"   "jaa"   "kaa"  
##  [17] "laa"   "maa"   "naa"   "oaa"   "paa"   "qaa"   "raa"   "saa"  
##  [25] "taa"   "uaa"   "vaa"   "waa"   "xaa"   "yaa"   "zaa"   "cba"  
##  [33] "aca"   "bca"   "cca"   "dca"   "eca"   "fca"   "gca"   "hca"  
##  [41] "ica"   "jca"   "kca"   "lca"   "mca"   "nca"   "oca"   "pca"  
##  [49] "qca"   "rca"   "cnts"  "cian"  "pcatb" "cqo"   "uawt"  "hazt" 
##  [57] "cpxat" "aaet"  "ckata" "caod"  "ncatl" "qcamt" "cdtp"  "qajt" 
##  [65] "bckat" "qcatr" "cqah"  "rcbt"  "cvbt"  "bbcat" "vcaz"  "ylcat"
##  [73] "cahz"  "jcgat" "mant"  "jatd"  "czlat" "cbamt" "cajta" "cafp" 
##  [81] "cizt"  "cmaut" "qwat"  "jcazt" "hdcat" "ucant" "hate"  "cajtl"
##  [89] "caaty" "cix"   "nmat"  "cajit" "cmnat" "caobt" "catoi" "ncau" 
##  [97] "ucoat" "ncamt" "jath"  "oats"  "chatz" "ciatz" "cjatz" "ckatz"
## [105] "clatz" "cmatz" "cnatz" "coatz" "cpatz" "cqatz" "cratz" "csatz"
## [113] "ctatz" "cuatz" "cvatz" "cwatz" "cxatz" "cyatz" "czatz" "cabtz"
## [121] "cactz" "cadtz" "caetz" "caftz" "cagtz" "cahtz" "caitz" "cajtz"
## [129] "caktz" "caltz" "camtz" "cantz" "caotz" "captz" "caqtz" "cartz"
## [137] "castz" "cattz" "cautz" "cavtz" "cawtz" "caxtz" "caytz" "caztz"
## [145] "catuz" "catvz" "catwz" "catxz" "catyz" "catzz"

这可以通过减少所有排列的蛮力形成然后将adist()应用于它们来加速 - 它可以包括从letters算法生成的已知编辑距离的变化或插入。

© www.soinside.com 2019 - 2024. All rights reserved.