Using regular expressions to find text by similarity

Question

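I have a list of texts recognized from different PDF documents. Now I need to extract some values from each text using regular expressions. Some of my patterns look like this: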

some text[ -]?(.+)[ ,-]+some other text

The problem is that after recognition some letters may be wrong ("0" instead of "O", "I" instead of "l", and so on). That is why my patterns fail to match.

I would like to use something like a regular expression with Jaro-Winkler or Levenshtein similarity, so that I can extract MY_VALUE from text like this:

s0me text MY_VALUE, some otner text

I know this may sound far-fetched, but maybe there is a solution to this problem.

By the way, I am using Java, but solutions in other languages are welcome.

Answer 1

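package main

import (
    "fmt"
    "regexp"
    "strings"

    "github.com/agnivade/levenshtein"
)

var (
    // wordRe locates every whitespace-delimited word in the text.
    wordRe = regexp.MustCompile(`\S+`)
    // valueRe captures what follows the matched phrase, up to the next comma.
    valueRe = regexp.MustCompile(`^[ -]*([^,]+)`)
)

// findClosestMatch returns the first candidate whose case-insensitive
// Levenshtein distance to text is at most threshold.
func findClosestMatch(text string, candidates []string, threshold int) (string, bool) {
    for _, candidate := range candidates {
        if levenshtein.ComputeDistance(strings.ToLower(text), strings.ToLower(candidate)) <= threshold {
            return candidate, true
        }
    }
    return "", false
}

// findMatches slides a window over the words of text and, wherever the window
// is close to expectedPattern, extracts the value that follows it.
func findMatches(text, expectedPattern string, threshold int) []string {
    n := len(strings.Fields(expectedPattern)) // window size in words
    if n == 0 {
        return nil
    }
    words := wordRe.FindAllStringIndex(text, -1)

    var validMatches []string
    for i := 0; i+n <= len(words); i++ {
        // Candidate phrase: n consecutive words starting at word i
        end := words[i+n-1][1]
        candidate := text[words[i][0]:end]
        if _, isClose := findClosestMatch(candidate, []string{expectedPattern}, threshold); !isClose {
            continue
        }
        // The phrase is close to the expected one, so keep the value after it
        if m := valueRe.FindStringSubmatch(text[end:]); m != nil {
            validMatches = append(validMatches, strings.TrimSpace(m[1]))
        }
    }

    return validMatches
}

func main() {
    text := "This is a sample text with s0me text MY_VALUE, some otner text."
    threshold := 2 // maximum number of edits tolerated in the phrase

    matches := findMatches(text, "some text", threshold)
    fmt.Println("Matches found:", matches)
}

Regular expression patterns

\S+

splits the text into words, so that each run of consecutive words can be compared with "some text" by Levenshtein distance instead of being matched literally, and

^[ -]*([^,]+)

captures whatever follows a sufficiently close phrase, up to the next comma.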
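
If the recognition errors are mostly one-for-one character swaps like the ones named in the question ("0" for "O", "I" for "l"), another option is to stay with plain regular expressions and widen each confusable character into a character class. The following is only a minimal sketch of that idea using the Go standard library. The names fuzzyLiteral, buildPattern and extractValue are hypothetical helpers, and the confusable table is an assumption based on the question's examples (including "n" for "h", as in "otner"), not an established list.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// confusable groups characters assumed to be mixed up during recognition.
// These groups are illustrative only; extend them for your own data.
var confusable = []string{"0Oo", "1Il", "5Ss", "hn"}

// fuzzyLiteral turns a literal phrase into a regex fragment. Every character
// that belongs to a confusable group becomes a character class covering the
// whole group, every space becomes \s+, and anything else is quoted as-is.
func fuzzyLiteral(phrase string) string {
	var b strings.Builder
	for _, r := range phrase {
		if r == ' ' {
			b.WriteString(`\s+`)
			continue
		}
		written := false
		for _, group := range confusable {
			if strings.ContainsRune(group, r) {
				b.WriteString("[" + group + "]")
				written = true
				break
			}
		}
		if !written {
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	return b.String()
}

// buildPattern mirrors the question's pattern
//   prefix[ -]?(.+)[ ,-]+suffix
// but with tolerant versions of both literals and a lazy capture group.
func buildPattern(prefix, suffix string) *regexp.Regexp {
	return regexp.MustCompile(fuzzyLiteral(prefix) + `[ -]?(.+?)[ ,-]+` + fuzzyLiteral(suffix))
}

// extractValue returns the captured value, or false if the text does not match.
func extractValue(re *regexp.Regexp, text string) (string, bool) {
	m := re.FindStringSubmatch(text)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	re := buildPattern("some text", "some other text")
	if v, ok := extractValue(re, "s0me text MY_VALUE, some otner text"); ok {
		fmt.Println("Extracted:", v)
	}
}
```

For the sample line from the question this should extract MY_VALUE. The approach only tolerates substitutions listed in the table; inserted or dropped characters still break the match, which is where an edit-distance check such as Levenshtein is the better fit.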
