我计划在最后一年建立一个类似于相似性检查器的项目。在该项目中,我计划检查提交的作业之间的相似性百分比,即离线。
例如:
现在,问题在于如何比较两个作业。在这种情况下,比较是文档中的文本之间的相似性。我希望结果与此类似:
我只想表明类似句子的百分比及它们是什么?
我做了什么:
我研究了不同的算法,如tf-idf,余弦相似度算法,但我无法正确插值算法的结果。
所以,我想知道在这种情况下哪种算法最好,我想知道这是怎么做的。是否有任何网站的引用,博客可以帮助我?
这取决于您使用的算法如何返回比较结果。
例如,以下函数比较文档内容列表并返回映射到它们之间的公共单词序列列表的文档对的字典。它不区分彼此包括的单词序列,因为可能存在重叠的较长和较短单词序列的不同出现次数。
import re
from itertools import combinations
def wordList(document): return re.findall("(\w+|\d+)",document.lower())
def compareDocs(documents, minSize=2, maxSize=25):
result = dict() # { (documentIndex,documentIndex) : [CommonExpressions] }
def tallyDuplicates(expressionDocs):
for expression,docIndexes in expressionDocs.items():
for docIndex,otherDoc in combinations(docIndexes,2):
result.setdefault((docIndex,otherDoc),[]).append(expression)
documentWords = [ wordList(document) for document in documents ]
wordCounts = [ len(words) for words in documentWords ]
expressionRanges = dict()
for docIndex,words in enumerate(documentWords):
for wordIndex,word in enumerate(words):
expressionRanges.setdefault(word,[]).append((docIndex,wordIndex))
size = 1
while size == 1 or expressionDocs and size <= maxSize:
nextExpressions = dict()
expressionDocs = dict()
for expression,starts in expressionRanges.items():
for docIndex,startIndex in starts:
endIndex = startIndex+size
if endIndex >= wordCounts[docIndex]: continue
extended = " ".join([expression,documentWords[docIndex][endIndex]])
expressionDocs.setdefault(extended,set()).add(docIndex)
nextExpressions.setdefault(extended,[]).append( (docIndex,startIndex) )
expressionDocs = { expression:docIndexes for expression,docIndexes in expressionDocs.items() if len(docIndexes) > 1 }
expressionRanges = { expression:ranges for expression,ranges in nextExpressions.items() if expression in expressionDocs }
if size >= minSize: tallyDuplicates(expressionDocs)
size += 1
return result
根据这些比较结果,您需要分析每个文档对的内容,以计算常用表达式(单词序列)所涵盖的单词。假设表达式包含多个单词,则每个表达式将考虑相似比中的多个单词:words-in matching-expressions / words-in-document。
[编辑]我将结果分析放在自己的函数中,并添加了一个html输出来突出显示文档文本中的表达式:
def analyzeComparison(doc1,doc2,commonExpr):
words1 = wordList(doc1)
words2 = wordList(doc2)
normalizedDoc1 = " ".join(words1)
normalizedDoc2 = " ".join(words2)
expressions.sort(key=lambda s:len(s),reverse=True)
matches = []
for expression in expressions:
count1 = len(re.findall(expression,normalizedDoc1))
count2 = len(re.findall(expression,normalizedDoc2))
commonCount = min(count1,count2)
if commonCount == 0: continue
expressionId = "<#"+str(len(matches))+"#>"
normalizedDoc1 = re.sub(expression,expressionId,normalizedDoc1,commonCount)
normalizedDoc2 = re.sub(expression,expressionId,normalizedDoc2,commonCount)
matches.append((expression,commonCount))
commonWords = sum( count*len(expr.split(" ")) for expr,count in matches)
percent1 = 100*commonWords/len(words1)
percent2 = 100*commonWords/len(words2)
for index,match in enumerate(matches):
expressionId = "<#"+str(index)+"#>"
expressionHighlight = "<span style='background-color:yellow'>"+match[0]+"</span>"
normalizedDoc1 = re.sub(expressionId,expressionHighlight,normalizedDoc1)
normalizedDoc2 = re.sub(expressionId,expressionHighlight,normalizedDoc2)
return (percent1,percent2,matches,normalizedDoc1,normalizedDoc2)
例如:如果您有以下3个文档(通常从文件中读取它们):
doc1 = """
Plagiarism, one of the main scourges of the academic life, is quite an easy concept, but, nonetheless, harmful. In short, to plagiarize means to steal someone else’s idea or part of work and use it as your own. But why exactly it is considered to be so bad and immoral? And it is really considered immoral and a serious offence. In case it is discovered, it may lead to very unpleasant consequences; the higher the position of the offender is, the more unpleasant they are.
copy and paste
There are two major kinds of harm plagiarism causes. First, it is something as simple as stealing and lying – you just steal someone else’s work and trick somebody into believing it was you who had written it, which is as immoral as any other kind of theft is. It means that somebody had actually spent time and effort in order to create something, while you did nothing but ripping it off and submitting it.
copy and paste function
Second, it is a crime you commit against yourself. If you study at an educational institution, there are certain tasks copy and paste you are given in order to ensure that you learn something. When you resort to plagiarism, you undo all these efforts for, instead of actually doing something and understanding it in process, you use someone else’s work and the certain amount of experience that you were supposed to get just misses you.
"""
doc2 = """
Plagiarism has always been a problem in schools. However, with the invention of the internet,copy and paste it has made plagiarism even more of a challenge. Plagiarism.org, “estimates that nearly 30 percent of all students may be plagiarizing on all their written assignments and that the use of the Internet has made plagiarism much worse.” [1] The act of plagiarism can be defined as, “To steal and pass off (the ideas or words of another) as one’s own, to use (another’s production) without crediting the source, to commit literary theft, to present as new and original as idea or product derived from an existing source”2. Plagiarism has become such a concern for colleges that almost all the sites on this topic are sponsored by schools. The three main topics with plagiarism are the copy and paste function, “paper mills” and the ways that can be used to prevent students from doing this.
it is quite an easy concept
The first major concern with the internet would be the copy and paste function. Wittenberg copy and paste function lists that “Widespread availability of the internet and increased access to full text databases has made cut and paste plagiarism very easy”.3 While the function is actually very nice to have, people are using it the wrong way. Instead of just using it to copy quotes from websites, than pasting it to their word document and giving it the proper credit, people are passing it off as their own. This is where the problem occurs.
"""
doc3 = """
Plagiarism has always been a problem in schools. However, it is something as simple as stealing and lying
it is a crime you. some other text
"""
您将首先在文档内容列表上调用compareDocs(),对于每对文档(由函数返回),您将使用analyzeComparison()来获取百分比,计数和突出显示:
documents = [doc1,doc2,doc3]
comparisons = compareDocs( documents )
for documentPair,expressions in comparisons.items():
docIndex1,docIndex2 = documentPair
doc1 = documents[docIndex1]
doc2 = documents[docIndex2]
pct1,pct2,matches,doc1,doc2 = analyzeComparison(doc1,doc2,expressions)
# print result on console ...
print(int(pct1//1)," % of document #",docIndex1," is same as document #", docIndex2)
print(int(pct2//1)," % of document #",docIndex2," is same as document #", docIndex1)
print("Common expressions are:")
for expression,count in matches:
print( " ",expression,"(",count,"times )")
print("")
# output comparison result to an HTML file...
htmlPage = "<html><body><table border='1'>"
htmlPage += "<tr><th>#" + str(docIndex1) + ": Source " + str(int(pct1//1)) + "% duplicate</th>"
htmlPage += "<th>#" + str(docIndex2) + ": Target " + str(int(pct2//1)) + "% duplicate</th></tr>"
htmlPage += "<tr><td width='50%' valign='top'>" + doc1 + "</td><td valign='top'>" + doc2 + "</td></tr>"
htmlPage +="</table></body></html>"
fileName = str(docIndex1)+"-"+str(docIndex2)+".html"
with open(fileName,"w") as f: f.write(htmlPage)
这将打印以下信息并创建一组看起来类似于所需结果的HTML文件:
3.0 % of document # 1 is same as document # 2
34.0 % of document # 2 is same as document # 1
Common expressions are:
plagiarism has always been a problem in schools however ( 1 times )
6.0 % of document # 0 is same as document # 1
5.0 % of document # 1 is same as document # 0
Common expressions are:
is quite an easy concept ( 1 times )
copy and paste function ( 1 times )
copy and paste ( 2 times )
5.0 % of document # 0 is same as document # 2
53.0 % of document # 2 is same as document # 0
Common expressions are:
it is something as simple as stealing and lying ( 1 times )
it is a crime you ( 1 times )
总而言之,整个过程的工作原理如下:
1)运行比较函数以识别每对文档共有的表达式(单词序列)。
2)要测量文档之间的相似性百分比,您需要在两侧找到公共表达式,并测量每个文档的单词被常用表达式覆盖的数量。
3)在您自己的程序中使用文档比较分析。这取决于您希望如何呈现信息以及是否需要存储结果(由您决定)。例如,您可以决定相似性的阈值,并且只输出可疑的文档对。此阈值可以基于百分比,常用表达式的数量,常用表达式的最大或平均长度等。
[EDIT2] compareDocs的工作原理......
expressionRanges的连续细化大大减少了要执行的单词匹配的数量。每个过程只是为每个表达式添加一个单词,并在转到下一个表达式大小之前立即清除。 expressionRanges字典开头的条目数与文档中的不同单词一样多,但是会迅速缩小到更小的大小(除非文档实际上相同)。
这种方法的一个缺点是,具有大量非常长的匹配表达式的文档将导致字典增长而不是缩小,而while循环将运行更长时间。最糟糕的情况是两个相同的文件。通过引入最大表达式大小以使循环更早停止可以避免这种情况。例如,如果您将最大大小设置为25,则该函数将仅报告25个字的公共表达式和5个字的公共表达式,而不是30个字的表达式(如果有)。这可能是一种可接受的折衷方案,以避免几乎相同文档所需的非常长的处理时间。就相似的百分比而言,差异将是最小的。 (即,如果有一个26个字的共同表达式,并且最大值为25但是27个字的表达式将作为25个字和2个字匹配,则可以忽略一个常用字)