How do I find the best-matching anchor text from a paragraph and a list of titles?


I have a paragraph:

In today's world, keeping your personal information safe online is more important than ever. With cyber-attacks on the rise, having a strong cybersecurity strategy is essential. 
Whether it's protecting against viruses or securing your passwords, everyone needs to be vigilant. Understanding the digital threats out there can help you stay one step ahead. Building a resilient defence means using antivirus software and keeping your software updated. It's also important to be aware of phishing scams and suspicious emails. By investing in your cybersecurity, you can protect yourself and your data from harm. So, take the time to learn about online safety and protect your digital life.

And a list of other article titles:

titles = [
    "Keeping Your Data Safe: Building a Strong Cybersecurity Strategy",
    "Navigating the Online Minefield: Understanding Digital Threats",
    "Securing Your Online World: Navigating the Cybersecurity Landscape",
    "Strengthening Your Shield: Building a Resilient Cyber Defense",
    "Beyond the Basics: Exploring Advanced Cybersecurity Techniques",
    "Know Your Enemy: Understanding the Cyber Threat Landscape",
    "Protecting Your Digital Fort: Strengthening Ransomware Resilience",
    "Building Trust Online: Enhancing Customer Confidence in Your Security",
    "Compliance in Cybersecurity: Meeting Regulatory Standards for Online Safety",
    "Safeguarding Your Future: Investing in Cybersecurity for Peace of Mind",
]

I want to find the best-matching anchor text and title so that I can add internal links.

For example:

1. 
Anchor Text: Cybersecurity Strategy
Title: "Keeping Your Data Safe: Building a Strong Cybersecurity Strategy"

2.
Anchor Text: Security Landscape
Title: "Securing Your Online World: Navigating the Cybersecurity Landscape"

How can I do this? Can someone help me achieve this programmatically?

python elasticsearch pattern-matching match string-matching
1 Answer

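You can use a natural language processing (NLP) library such as spaCy in Python for this. Below is some code based on your paragraph and article titles. The output is the best-matching title for each anchor text, based on similarity scores computed with spaCy. You can adjust the list of anchor texts and supply your own as needed.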


import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.matcher import PhraseMatcher

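# Load an English model that includes word vectors. en_core_web_sm ships
# without them, which makes similarity scores unreliable.
# Install with: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")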

# Paragraph and article titles
paragraph = """
In today's world, keeping your personal information safe online is more important than ever. With cyber-attacks on the rise, having a strong cybersecurity strategy is essential. 
Whether it's protecting against viruses or securing your passwords, everyone needs to be vigilant. Understanding the digital threats out there can help you stay one step ahead. Building a resilient defence means using antivirus software and keeping your software updated. It's also important to be aware of phishing scams and suspicious emails. By investing in your cybersecurity, you can protect yourself and your data from harm. So, take the time to learn about online safety and protect your digital life.
"""

titles = [
    "Keeping Your Data Safe: Building a Strong Cybersecurity Strategy",
    "Navigating the Online Minefield: Understanding Digital Threats",
    "Securing Your Online World: Navigating the Cybersecurity Landscape",
    "Strengthening Your Shield: Building a Resilient Cyber Defense",
    "Beyond the Basics: Exploring Advanced Cybersecurity Techniques",
    "Know Your Enemy: Understanding the Cyber Threat Landscape",
    "Protecting Your Digital Fort: Strengthening Ransomware Resilience",
    "Building Trust Online: Enhancing Customer Confidence in Your Security",
    "Compliance in Cybersecurity: Meeting Regulatory Standards for Online Safety",
    "Safeguarding Your Future: Investing in Cybersecurity for Peace of Mind",
]

# Preprocess the paragraph and titles
def preprocess(text):
    doc = nlp(text.lower())
    tokens = [token.lemma_ for token in doc if token.is_alpha and token.text.lower() not in STOP_WORDS]
    return tokens

# Compute the similarity score between two lists of preprocessed tokens
def compute_similarity(anchor_tokens, title_tokens):
    anchor_doc = nlp(" ".join(anchor_tokens))
    title_doc = nlp(" ".join(title_tokens))
    return anchor_doc.similarity(title_doc)

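# Find best matching title for each anchor text
# Note: the paragraph argument is not used here; anchor texts are supplied manually.
def find_best_match(paragraph, titles):
    anchor_texts = ["Cybersecurity Strategy", "Security Landscape"]  # Your anchor texts

    best_matches = {}
    for anchor_text in anchor_texts:
        anchor_tokens = preprocess(anchor_text)
        # Use a fresh matcher per anchor so earlier patterns cannot fire for later anchors.
        # spaCy v3 expects the patterns as a list: matcher.add(key, [pattern_doc]).
        matcher = PhraseMatcher(nlp.vocab)
        matcher.add(anchor_text, [nlp.make_doc(" ".join(anchor_tokens))])

        candidates = []
        for title in titles:
            title_tokens = preprocess(title)
            # True if the preprocessed title contains the anchor as an exact phrase
            has_phrase = bool(matcher(nlp.make_doc(" ".join(title_tokens))))
            score = compute_similarity(anchor_tokens, title_tokens)
            candidates.append((has_phrase, score, title))

        # Prefer titles containing the exact phrase, then the highest similarity score
        best = max(candidates, key=lambda c: (c[0], c[1]), default=None)
        best_matches[anchor_text] = best[2] if best else None

    return best_matches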

# Find best matching titles for anchor texts
best_matches = find_best_match(paragraph, titles)
for anchor_text, best_title in best_matches.items():
    print(f"Anchor Text: {anchor_text}")
    print(f"Best Title: {best_title}")
    print()
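
For context, the score that spaCy's Doc.similarity returns is, by default, the cosine similarity between the averaged word vectors of the two texts. The sketch below reproduces that calculation in plain Python so you can see what the number means. The three-dimensional vectors in toy_vectors are invented for illustration and are not real spaCy vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # No direction to compare when a vector is all zeros
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical word vectors, made up for this example only
toy_vectors = {
    "cybersecurity": [0.9, 0.1, 0.0],
    "strategy": [0.2, 0.8, 0.1],
    "landscape": [0.1, 0.2, 0.9],
}

def text_vector(tokens):
    """Represent a token list as the average of its word vectors."""
    return average([toy_vectors[token] for token in tokens])

anchor = text_vector(["cybersecurity", "strategy"])
same_words = text_vector(["cybersecurity", "strategy"])
other_words = text_vector(["cybersecurity", "landscape"])

sim_same = cosine_similarity(anchor, same_words)
sim_other = cosine_similarity(anchor, other_words)
print(sim_same, sim_other)
```

Texts that share more of their vocabulary end up with averaged vectors that point in a similar direction, so their score is closer to 1.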

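The code above assumes you already know the anchor texts. If you would rather derive candidates from the paragraph itself, one simple option is to look for the longest word sequence that a title and the paragraph share. This is a different technique from the spaCy similarity approach, shown here only as a dependency-free sketch. The helper names shared_phrase and suggest_links and the small STOP set are hypothetical and exist just for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice)
STOP = {"a", "an", "the", "in", "of", "for", "your", "is", "and", "to", "on", "by", "with"}

def words(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())

def shared_phrase(paragraph, title, min_words=2):
    """Return the longest word sequence found in both texts, or None.

    Candidates must start and end with a non-stop-word so that the
    result reads as a usable anchor text.
    """
    haystack = " " + " ".join(words(paragraph)) + " "
    title_words = words(title)
    # Try the longest n-grams first so the first hit is the longest match
    for n in range(len(title_words), min_words - 1, -1):
        for i in range(len(title_words) - n + 1):
            gram = title_words[i:i + n]
            if gram[0] in STOP or gram[-1] in STOP:
                continue
            if " " + " ".join(gram) + " " in haystack:
                return " ".join(gram)
    return None

def suggest_links(paragraph, titles):
    """Map each title that shares a phrase with the paragraph to that phrase."""
    suggestions = {}
    for title in titles:
        phrase = shared_phrase(paragraph, title)
        if phrase:
            suggestions[title] = phrase
    return suggestions

sample_paragraph = (
    "With cyber-attacks on the rise, having a strong cybersecurity strategy is essential."
)
sample_titles = [
    "Keeping Your Data Safe: Building a Strong Cybersecurity Strategy",
    "Beyond the Basics: Exploring Advanced Cybersecurity Techniques",
]
print(suggest_links(sample_paragraph, sample_titles))
```

Exact word-sequence matching only finds phrases that appear verbatim in both texts, so it would miss a pairing such as "Security Landscape" with "Cybersecurity Landscape". For looser matches like that, the similarity-based approach is the better fit.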