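Course this assignment belongs to | Computer Science Class 2, 2022 cohort |
---|---|
Where the assignment requirements are | Assignment requirements |
Goal of this assignment | Design a plagiarism-detection algorithm for papers: given an original text file and a plagiarized version of it produced by insertions, deletions, and modifications, write the duplication rate to an answer file |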
GitHub repository: <>
Project design
Overall workflow
Function overview
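Function | Purpose |
---|---|
readTxt | Reads the contents of the file at the given path |
writeTxt | Writes the given content to a file |
getHash | Hashes a string with MD5 and converts the digest to a binary string |
getSimHash | Extracts keywords from the string, converts each keyword to a binary string, accumulates the bits with per-keyword weights, and finally reduces the weight vector to a 128-bit binary string |
getHammingDistance | Computes the Hamming distance between two simHash strings |
getSimilarity | Computes the similarity of two texts from the Hamming distance of their simHash values |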
Algorithm implementation
Computing the Hamming value
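package Utils;
import com.hankcs.hanlp.HanLP;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
public class SimHashUtil {
    /* Hash the input string with MD5 and return the digest as a binary string.
       BigInteger.toString(2) drops leading zeros, so the result can be shorter than 128 characters. */
    public static String getHash(String str) {
        try {
            MessageDigest messageDigest = MessageDigest.getInstance("MD5");
            return new BigInteger(1, messageDigest.digest(str.getBytes(StandardCharsets.UTF_8))).toString(2);
        } catch (NoSuchAlgorithmException e) {
            // Returning the raw input here would feed non-binary text into getSimHash, so fail loudly instead
            throw new IllegalStateException("MD5 is not available", e);
        }
    }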
    /* Compute the simHash value */
    public static String getSimHash(String str) {
        if (str.length() < 66) {
            System.out.println("Input text is too short");
        }
        int[] weight = new int[128];
        // Extract keywords with HanLP; earlier keywords receive a larger weight below
        List<String> keywordList = HanLP.extractKeyword(str, str.length());
        int size = keywordList.size();
        int i = 0;
        for (String keyword : keywordList) {
            // Left-pad with zeros to 128 bits, because the binary string has lost its leading zeros
            StringBuilder hash = new StringBuilder(getHash(keyword));
            while (hash.length() < 128) {
                hash.insert(0, '0');
            }
            // Weight runs from 10 (first keywords) down to 1 (last keywords);
            // i * 10 / size avoids the division by zero that size / 10 causes when size < 10
            int w = 10 - (i * 10 / size);
            for (int j = 0; j < weight.length; j++) {
                if (hash.charAt(j) == '1') {
                    weight[j] += w;
                } else {
                    weight[j] -= w;
                }
            }
            i++;
        }
        // Reduce each accumulated weight to a single bit
        StringBuilder simHash = new StringBuilder();
        for (int j = 0; j < weight.length; j++) {
            simHash.append(weight[j] > 0 ? '1' : '0');
        }
        return simHash.toString();
    }
}
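The weighted accumulation can be tried without HanLP. The following is a minimal sketch, not the project's code: the class `SimHashSketch`, the helpers `md5Bits` and `simHash`, and the hand-written token lists are all hypothetical stand-ins, and the tokens are simply assumed to be ordered from most to least important.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class SimHashSketch {
    static final int BITS = 128; // an MD5 digest is 128 bits long

    // MD5 digest of a token as a 128-character string of '0'/'1', left-padded with zeros.
    static String md5Bits(String token) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(token.getBytes(StandardCharsets.UTF_8));
            String bits = new BigInteger(1, digest).toString(2);
            StringBuilder padded = new StringBuilder(BITS);
            for (int k = bits.length(); k < BITS; k++) {
                padded.append('0');
            }
            return padded.append(bits).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumption: tokens are ordered from most to least important.
    static String simHash(List<String> tokens) {
        int[] weight = new int[BITS];
        int size = tokens.size();
        for (int i = 0; i < size; i++) {
            String bits = md5Bits(tokens.get(i));
            int w = 10 - (i * 10 / size); // 10 for the first tokens, down to 1 for the last
            for (int j = 0; j < BITS; j++) {
                weight[j] += bits.charAt(j) == '1' ? w : -w;
            }
        }
        // Keep only the sign of each accumulated weight.
        StringBuilder out = new StringBuilder(BITS);
        for (int v : weight) {
            out.append(v > 0 ? '1' : '0');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hand-written tokens standing in for extracted keywords.
        List<String> a = Arrays.asList("paper", "plagiarism", "detection", "algorithm");
        List<String> b = Arrays.asList("paper", "plagiarism", "detection", "method");
        System.out.println(simHash(a));
        System.out.println(simHash(b));
    }
}
```

With a single token the result equals that token's padded MD5 bits, which is a quick way to check that the padding and the sign reduction behave as intended.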
Test functions
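package Utils;
public class HammingUtil {
    /* Count the positions at which two simHash strings differ; returns -1 if their lengths differ */
    public static int getHammingDistance(String simHash1, String simHash2) {
        if (simHash1.length() != simHash2.length()) {
            return -1;
        }
        int distance = 0;
        System.out.println("simHash of str1: " + simHash1);
        System.out.println("simHash of str2: " + simHash2);
        for (int i = 0; i < simHash1.length(); i++) {
            if (simHash1.charAt(i) != simHash2.charAt(i)) {
                distance++;
            }
        }
        System.out.println("Hamming distance: " + distance);
        return distance;
    }
    /* Map a Hamming distance over 128 bits to a similarity in [0, 1].
       Callers should check for the -1 error value first, since it would yield a result above 1. */
    public static double getSimilarity(int distance) {
        return 1 - distance / 128.0;
    }
}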
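As a small worked example of the distance and the similarity formula `1 - distance / bits`, here is a sketch on 8-bit strings. The class `HammingSketch` and its method names are hypothetical, and unlike the class above it takes the bit length as a parameter and throws on a length mismatch instead of returning -1.

```java
final class HammingSketch {
    // Number of positions at which two equal-length bit strings differ.
    static int hammingDistance(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("bit strings must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    // Similarity in [0, 1]: the fraction of bit positions that agree.
    static double similarity(int distance, int bits) {
        return 1 - distance / (double) bits;
    }

    public static void main(String[] args) {
        String a = "10110010";
        String b = "10011010"; // differs from a at index 2 and index 4
        int d = hammingDistance(a, b);
        System.out.println(d);                           // prints 2
        System.out.println(similarity(d, a.length()));   // prints 0.75
    }
}
```

For the 128-bit simHash strings used in this project, the same call would be `similarity(d, 128)`.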
Exception handling
package Utils;
import java.io.*;
public class IOUtil {
    public static String readTxt(String txtPath) {
        StringBuilder str = new StringBuilder();
        String strLine;
        // Read the txt file line by line into str (line breaks are dropped)
        File file = new File(txtPath);
        // try-with-resources closes the reader and the underlying stream even if reading fails
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            // Concatenate the lines
            while ((strLine = bufferedReader.readLine()) != null) {
                str.append(strLine);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return str.toString();
    }
    public static void writeTxt(String str, String txtPath) {
        File file = new File(txtPath);
        // Open in append mode; try-with-resources closes the writer even if writing fails
        try (FileWriter fileWriter = new FileWriter(file, true)) {
            fileWriter.write(str, 0, str.length());
            fileWriter.write("\r\n");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
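For comparison, the same read and append operations can be written with `java.nio.file.Files`. This is only a sketch under a few assumptions: the class `IoSketch` and its methods `readAll` and `appendLine` are hypothetical names, `readAll` keeps line breaks (whereas `readTxt` drops them), and both methods propagate `IOException` to the caller instead of printing the stack trace.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class IoSketch {
    // Read the whole file as UTF-8 text, line breaks included.
    static String readAll(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    // Append one line of UTF-8 text, creating the file if it does not exist yet.
    static void appendLine(String text, String path) throws IOException {
        byte[] bytes = (text + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(Paths.get(path), bytes,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the answer file.
        Path answer = Files.createTempFile("answer", ".txt");
        appendLine("0.75", answer.toString());
        System.out.print(readAll(answer.toString()));
        Files.delete(answer);
    }
}
```

Letting the exception propagate means the caller decides how to report a missing or unreadable file, rather than continuing with an empty string.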
Results
Performance analysis
PSP analysis
PSP2.1 | Personal Software Process Stages | Estimated time (min) | Actual time (min) |
---|---|---|---|
Planning | Planning | 30 | 60 |
· Estimate | Estimate how long the task will take | 60 | 90 |
Development | Development | 180 | 240 |
· Analysis | Requirements analysis (including learning new technologies) | 60 | 90 |
· Design Spec | Write the design document | 30 | 30 |
· Design Review | Design review | 20 | 15 |
· Coding Standard | Coding standard (define suitable conventions for the current development) | 30 | 20 |
· Design | Detailed design | 20 | 30 |
· Coding | Coding | 120 | 240 |
· Code Review | Code review | 30 | 40 |
· Test | Testing (self-testing, fixing code, committing changes) | 30 | 40 |
Reporting | Reporting | 30 | 40 |
· Test Report | Test report | 20 | 10 |
· Size Measurement | Measure the workload | 10 | 10 |
· Postmortem & Process Improvement Plan | Postmortem and process improvement plan | 30 | 30 |
Total | Total | 610 | 915 |