What is the best algorithm for checking the similarity percentage between submitted assignments?

Problem description · Votes: -3 · Answers: 1

I am planning to build a similarity checker as my final-year project. In that project I plan to check the similarity percentage between submitted assignments, offline.

For example:

  1. When the first student submits an assignment, there is nothing to check it against.
  2. When the second student submits an assignment, it is checked against the first one.
  3. When the third student submits an assignment, it is checked against the first and second submissions.
  4. Likewise, if there are 35 students, the 36th submitted assignment is checked against all 35 earlier submissions.

Now, the question is how to compare two assignments. In this case, the comparison is between the text of the documents. I would like the result to look something like this:

Similarity checking between the documents

I only want to show the percentage of similar sentences, and which sentences they are.

What I have done:

I studied different algorithms such as tf-idf and cosine similarity, but I could not properly interpret the results of those algorithms.

So I would like to know which algorithm is best in this situation, and how this is usually done. Are there any websites or blogs that could help me?

algorithm project cosine-similarity sentence-similarity
1 Answer

Votes: 0

It depends on how the algorithm you use returns its comparison results.

For example, the following function compares a list of document contents and returns a dictionary that maps document pairs to the list of word sequences they have in common. It does not distinguish word sequences that contain one another, because a longer and a shorter overlapping word sequence may each occur a different number of times.

import re
from itertools import combinations

# split a document into lowercase words ("\d+" was redundant here: "\w" already matches digits)
def wordList(document): return re.findall(r"\w+", document.lower())

def compareDocs(documents, minSize=2, maxSize=25):
    result  = dict() # { (documentIndex,documentIndex) : [CommonExpressions] }
    def tallyDuplicates(expressionDocs):
        for expression,docIndexes in expressionDocs.items():
            for docIndex,otherDoc in combinations(docIndexes,2):
                result.setdefault((docIndex,otherDoc),[]).append(expression)

    documentWords    = [ wordList(document) for document in documents ]
    wordCounts       = [ len(words) for words in documentWords ]
    expressionRanges = dict() # { expression : [(docIndex,wordIndex), ...] }
    for docIndex,words in enumerate(documentWords):
        for wordIndex,word in enumerate(words):
            expressionRanges.setdefault(word,[]).append((docIndex,wordIndex))

    size = 1
    # "size == 1" bootstraps the first pass; afterwards the loop runs while
    # multi-document expressions remain and the maximum size is not exceeded
    while size == 1 or expressionDocs and size <= maxSize:
        nextExpressions   = dict()
        expressionDocs    = dict()
        for expression,starts in expressionRanges.items():
            for docIndex,startIndex in starts:
                endIndex = startIndex+size
                if endIndex >= wordCounts[docIndex]: continue
                extended = " ".join([expression,documentWords[docIndex][endIndex]])
                expressionDocs.setdefault(extended,set()).add(docIndex)
                nextExpressions.setdefault(extended,[]).append( (docIndex,startIndex) )
        expressionDocs   = { expression:docIndexes for expression,docIndexes in expressionDocs.items() if len(docIndexes) > 1 }
        expressionRanges = { expression:ranges for expression,ranges in nextExpressions.items() if expression in expressionDocs }  
        if size >= minSize: tallyDuplicates(expressionDocs)
        size += 1

    return result

Based on these comparison results, you will need to analyze the content of each document pair to count the words covered by the common expressions (word sequences). Since an expression contains several words, each expression accounts for several words in the similarity ratio: words-in-matching-expressions / words-in-document.

[EDIT] I placed the result analysis in its own function and added an HTML output to highlight the expressions within the document text:

def analyzeComparison(doc1,doc2,expressions):
    words1  = wordList(doc1)
    words2  = wordList(doc2)
    normalizedDoc1 = " ".join(words1)
    normalizedDoc2 = " ".join(words2)
    # process longer expressions first so that their words are not counted
    # again by shorter expressions they contain
    expressions.sort(key=lambda s:len(s),reverse=True)
    matches = []
    for expression in expressions:
        count1 = len(re.findall(expression,normalizedDoc1))
        count2 = len(re.findall(expression,normalizedDoc2))
        commonCount = min(count1,count2)
        if commonCount == 0: continue
        expressionId = "<#"+str(len(matches))+"#>"
        normalizedDoc1 = re.sub(expression,expressionId,normalizedDoc1,commonCount)
        normalizedDoc2 = re.sub(expression,expressionId,normalizedDoc2,commonCount)
        matches.append((expression,commonCount))
    commonWords = sum( count*len(expr.split(" ")) for expr,count in matches)
    percent1 = 100*commonWords/len(words1)
    percent2 = 100*commonWords/len(words2)
    for index,match in enumerate(matches):
        expressionId = "<#"+str(index)+"#>"
        expressionHighlight = "<span style='background-color:yellow'>"+match[0]+"</span>"
        normalizedDoc1 = re.sub(expressionId,expressionHighlight,normalizedDoc1)
        normalizedDoc2 = re.sub(expressionId,expressionHighlight,normalizedDoc2)
    return (percent1,percent2,matches,normalizedDoc1,normalizedDoc2)

For example, if you have the following 3 documents (you would normally read them from files):

doc1 = """
Plagiarism, one of the main scourges of the academic life, is quite an easy concept, but, nonetheless, harmful. In short, to plagiarize means to steal someone else’s idea or part of work and use it as your own. But why exactly it is considered to be so bad and immoral? And it is really considered immoral and a serious offence. In case it is discovered, it may lead to very unpleasant consequences; the higher the position of the offender is, the more unpleasant they are.
copy and paste
There are two major kinds of harm plagiarism causes. First, it is something as simple as stealing and lying – you just steal someone else’s work and trick somebody into believing it was you who had written it, which is as immoral as any other kind of theft is. It means that somebody had actually spent time and effort in order to create something, while you did nothing but ripping it off and submitting it.
copy and paste function
Second, it is a crime you commit against yourself. If you study at an educational institution, there are certain tasks copy and paste you are given in order to ensure that you learn something. When you resort to plagiarism, you undo all these efforts for, instead of actually doing something and understanding it in process, you use someone else’s work and the certain amount of experience that you were supposed to get just misses you.
"""
doc2 = """
Plagiarism has always been a problem in schools. However, with the invention of the internet,copy and paste  it has made plagiarism even more of a challenge. Plagiarism.org, “estimates that nearly 30 percent of all students may be plagiarizing on all their written assignments and that the use of the Internet has made plagiarism much worse.” [1] The act of plagiarism can be defined as, “To steal and pass off (the ideas or words of another) as one’s own, to use (another’s production) without crediting the source, to commit literary theft, to present as new and original as idea or product derived from an existing source”2. Plagiarism has become such a concern for colleges that almost all the sites on this topic are sponsored by schools. The three main topics with plagiarism are the copy and paste function, “paper mills” and the ways that can be used to prevent students from doing this. 
it is quite an easy concept
The first major concern with the internet would be the copy and paste function. Wittenberg copy and paste function lists that “Widespread availability of the internet and increased access to full text databases has made cut and paste plagiarism very easy”.3 While the function is actually very nice to have, people are using it the wrong way. Instead of just using it to copy quotes from websites, than pasting it to their word document and giving it the proper credit, people are passing it off as their own. This is where the problem occurs.
"""

doc3 = """
Plagiarism has always been a problem in schools. However, it is something as simple as stealing and lying
it is a crime you. some other text
"""

You would first call compareDocs() on the list of document contents, and, for each document pair returned by the function, you would use analyzeComparison() to obtain the percentages, counts, and highlights:

documents   = [doc1,doc2,doc3]
comparisons = compareDocs( documents )
for documentPair,expressions in comparisons.items():
    docIndex1,docIndex2 = documentPair
    doc1 = documents[docIndex1]
    doc2 = documents[docIndex2]        
    pct1,pct2,matches,doc1,doc2 = analyzeComparison(doc1,doc2,expressions)

    # print result on console ...
    print(int(pct1//1)," % of document #",docIndex1," is same as document #", docIndex2)
    print(int(pct2//1)," % of document #",docIndex2," is same as document #", docIndex1)
    print("Common expressions are:")
    for expression,count in matches:
        print( "    ",expression,"(",count,"times )")
    print("")

    # output comparison result to an HTML file...
    htmlPage = "<html><body><table border='1'>"
    htmlPage += "<tr><th>#" + str(docIndex1) + ": Source " + str(int(pct1//1)) + "% duplicate</th>"
    htmlPage += "<th>#" + str(docIndex2) + ": Target  " + str(int(pct2//1)) + "% duplicate</th></tr>"
    htmlPage += "<tr><td width='50%' valign='top'>" + doc1 + "</td><td valign='top'>" + doc2 + "</td></tr>"
    htmlPage +="</table></body></html>"        
    fileName = str(docIndex1)+"-"+str(docIndex2)+".html"
    with open(fileName,"w") as f: f.write(htmlPage)       

This prints the following information and creates a set of HTML files that look similar to the desired result:

3.0  % of document # 1  is same as document # 2
34.0  % of document # 2  is same as document # 1
Common expressions are:
     plagiarism has always been a problem in schools however ( 1 times )

6.0  % of document # 0  is same as document # 1
5.0  % of document # 1  is same as document # 0
Common expressions are:
     is quite an easy concept ( 1 times )
     copy and paste function ( 1 times )
     copy and paste ( 2 times )

5.0  % of document # 0  is same as document # 2
53.0  % of document # 2  is same as document # 0
Common expressions are:
     it is something as simple as stealing and lying ( 1 times )
     it is a crime you ( 1 times )

In summary, the whole process works as follows:

1) Run a comparison function to identify the expressions (word sequences) that each pair of documents has in common.

  • The compareDocs function does this in a single call, given a list of document texts.
  • If you use a different comparison algorithm, it may be designed to compare only two documents at a time or, in the case of a classifier, it may simply return a list of word/term frequencies for one document.
  • Depending on the algorithm's inputs and outputs, you will need to wrap more or less of your own code around it to obtain the desired result.
  • What you should be looking for at this stage is a list of common expressions (word sequences) between each pair of documents.
  • If you are working with an algorithm that only extracts term frequencies (e.g. tf-idf), then you will face a high-complexity problem of cross-matching term frequencies between document pairs. For example, a classifier could return the frequencies "cut" = 25 times, "and" = 97 times, "paste" = 31 times for a given document. This gives you no indication that the expression "cut and paste" is actually present in the document, or how many times. The document could be about toothpaste and never have those 3 words in sequence. Comparing documents based solely on word frequencies will find a high correlation between essays on the same topic, but that does not mean there is plagiarism. Also, even if your classifier manages to return all expressions of two or more words, each document will produce close to w * 2^n expressions, where w is the number of words in the document and n is the maximum length, in words, of an expression (a maximum you have to decide on). That will easily reach millions per document, and you would then need to match them against millions in other documents. This may not be a problem if you have Google's resources, but it is for the rest of us.
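To make that term-frequency limitation concrete, here is a small sketch (the two example sentences are invented for illustration): a plain bag-of-words cosine similarity scores the documents as nearly identical even though the expression "cut and paste" never occurs in the second one.

```python
import math
import re
from collections import Counter

def word_counts(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(c1, c2):
    # cosine similarity between two sparse word-count vectors
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

doc_a = "cut and paste the quote and cite the source"
doc_b = "the source and the quote: cite and paste, then cut"

# nearly identical word frequencies, hence a very high cosine score ...
print(round(cosine(word_counts(doc_a), word_counts(doc_b)), 3))
# ... yet the expression "cut and paste" never occurs in doc_b
print("cut and paste" in doc_b.lower())  # False
```

This is why the answer above works with word *sequences* rather than term frequencies: the sequence information is exactly what a frequency vector throws away.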

2) To measure the similarity percentage between documents, you will need to find the common expressions on both sides and measure how many of each document's words are covered by the common expressions.

  • Identifying the locations of the expressions is a simple text-search process.
  • You will, however, need to avoid counting any given word more than once, because the denominator of your percentage is the number of words in the document (and you don't want to overestimate or exceed 100%).
  • This can be achieved by processing the longer expressions first and removing them from the text (or masking them) so that their words are not counted again by subsequent (shorter) expressions.
  • The analyzeComparison() function masks the expressions found in the text by replacing them with a placeholder that is later used to re-inject the text with a highlighting tag (HTML).
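A minimal, standalone illustration of that longest-first masking step (the text and expressions below are made up for the example):

```python
# Replace longer common expressions with placeholders first, so the words
# of "copy and paste function" are not counted a second time when the
# shorter "copy and paste" is processed.
text = "copy and paste function is easy to copy and paste"
expressions = ["copy and paste", "copy and paste function"]
expressions.sort(key=len, reverse=True)

masked = text
for i, expr in enumerate(expressions):
    masked = masked.replace(expr, "<#" + str(i) + "#>")

print(masked)  # <#0#> is easy to <#1#>

# count words covered by the placed expressions, without double counting
covered = sum(len(expr.split()) for i, expr in enumerate(expressions)
              if "<#" + str(i) + "#>" in masked)
print(covered, "of", len(text.split()), "words covered")  # 7 of 10
```

Without the sort, the shorter expression would consume the first three words of the longer one and the coverage count would be wrong.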

3) Use the document-comparison analysis in your own program. This depends on how you want to present the information and on whether you need to store the results (up to you). For example, you could decide on a similarity threshold and only output suspicious document pairs. The threshold could be based on the percentage, the number of common expressions, the maximum or average length of the common expressions, and so on.
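As a sketch of such a thresholding step (the results dictionary and the 30% cutoff are invented for illustration), you could filter the per-pair percentages before reporting:

```python
# Hypothetical post-processing: keep only document pairs whose higher
# similarity percentage reaches a chosen cutoff.
SUSPICION_THRESHOLD = 30  # percent; tune this for your corpus

# shaped like the per-pair percentages computed in the loop above
results = {(0, 1): (6.0, 5.0), (0, 2): (5.0, 53.0), (1, 2): (3.0, 34.0)}

suspicious = {pair: pcts for pair, pcts in results.items()
              if max(pcts) >= SUSPICION_THRESHOLD}
print(suspicious)  # {(0, 2): (5.0, 53.0), (1, 2): (3.0, 34.0)}
```

Using max() of the two percentages flags a short document that is largely copied from a long one, which a single averaged score would dilute.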

[EDIT2] How compareDocs works ...

  • The function builds a dictionary of expressions, mapping them to the position of their first word in each document. It is stored in the expressionRanges variable. For example: { "copy and paste": [(0,57), (1,7), (1,32)], ... } means that the 3-word expression "copy and paste" is found in document #0 at position 57 (the position of the word "copy") and in document #1 at positions 7 and 32.
  • The expression dictionary (expressionRanges) starts out with single-word expressions and uses those to obtain 2-word expressions, then 3-word expressions, and so on.
  • Before moving on to the next expression size, the expression dictionary is cleaned up by removing all expressions that are found in only one document. size 1 ==> { "copy": [(0,57), (0,72), (1,7), (1,32), (1,92)], ... } clean up ... size 2 ==> { "copy and": [(0,57), (1,7), (1,32), (1,92)], ... } clean up ... size 3 ==> { "copy and paste": [(0,57), (1,7), (1,32)], ... }
  • This clean-up is achieved by building a separate dictionary (expressionDocs) that maps each expression to the set of document indexes that contain it. Expressions that end up with only one document in their set are removed from both dictionaries.
  • The expressionDocs dictionary is also used to produce the function's output. Expressions that appear in more than one document are mapped to document pairs (combinations of 2), forming a dictionary of { (document pair): [list of expressions] } (which is the function's result).
  • The tallyDuplicates sub-function performs the transposition from { expression: [list of document indexes] } to { (document pair): [list of expressions] } by adding the expression to every combination of 2 of the document indexes.
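The size-by-size refinement described above can be traced on a toy input (two tiny hand-made word lists), independently of the full function:

```python
# Toy trace of one refinement pass: extend every size-1 expression by one
# word, then keep only extensions still present in more than one document.
docs = [["copy", "and", "paste", "text"], ["copy", "and", "cite"]]

# size 1: { word : [(docIndex, wordIndex), ...] }
ranges = {}
for d, words in enumerate(docs):
    for i, w in enumerate(words):
        ranges.setdefault(w, []).append((d, i))

extended = {}
for expr, starts in ranges.items():
    for d, i in starts:
        if i + 1 >= len(docs[d]):
            continue  # extension would run past the end of the document
        longer = expr + " " + docs[d][i + 1]
        extended.setdefault(longer, []).append((d, i))

# clean-up: drop expressions found in only one document
kept = {e: s for e, s in extended.items() if len({d for d, _ in s}) > 1}
print(kept)  # {'copy and': [(0, 0), (1, 0)]}
```

Of the five candidate 2-word expressions, only "copy and" survives the clean-up, so the next pass has far less work to do.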

The successive refinements of expressionRanges considerably reduce the number of word matches to perform. Each pass merely adds one word to each expression, and the dictionary is cleaned up immediately before moving on to the next expression size. The expressionRanges dictionary starts out with as many entries as there are distinct words in the documents, but it rapidly shrinks to a much smaller size (unless the documents are practically identical).

One downside of this approach is that documents with a large number of very long matching expressions will cause the dictionary to grow instead of shrinking, and the while loop will run much longer. The worst case would be two identical documents. Introducing a maximum expression size to make the loop stop earlier avoids this. For example, if you set a maximum size of 25, the function will only report a 25-word common expression and a 5-word common expression instead of a 30-word expression, if there is one. That may be an acceptable compromise to avoid the very long processing time that nearly identical documents would entail. As far as the similarity percentage is concerned, the difference will be minimal (i.e. one common word could be ignored if there is a 26-word common expression and the maximum is 25, but a 27-word expression would come out as a 25-word and a 2-word match).
