Tags: plagiarism check, plagiarism, String, individual project, text, List

Introduction to Engineering, Assignment 2: Paper Plagiarism Check

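Course this assignment belongs to	Introduction to Engineering
Assignment requirements	Assignment requirements
Goal of this assignment	Learn plagiarism-checking methods for papers and basic GitHub operations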

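Requirements

Task: paper plagiarism check

Description:

Design a plagiarism-checking algorithm for papers. Given a file containing the original text and a file containing a plagiarized version produced by adding to, deleting from, and modifying that original, write the duplication rate to an answer file.

Original example: 今天是星期天，天气晴，今天晚上我要去看电影。 ("Today is Sunday, the weather is clear, and tonight I am going to see a movie.")
Plagiarized example: 今天是周天，天气晴朗，我晚上要去看电影。 ("Today is Sunday, the weather is fine, and in the evening I am going to see a movie.")
Input and output must go through files, according to the following convention:

Given as a command-line argument: the absolute path of the original paper file.
Given as a command-line argument: the absolute path of the plagiarized paper file.
Given as a command-line argument: the absolute path of the answer file to write.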
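
The three command-line arguments can be wired up roughly as follows. This is a minimal sketch rather than the original project's code: the class name `FileIo`, the helper names, the UTF-8 encoding, and the two-decimal output format are all assumptions, and the score computed in `main` is only a placeholder for the real similarity computation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper class; none of these names come from the original project.
class FileIo {

    // Read a whole file as UTF-8 text (works on Java 8).
    static String readText(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    // Write the score with two decimal places (the precision is an assumption).
    static void writeScore(Path path, double score) throws IOException {
        String line = String.format(Locale.ROOT, "%.2f", score);
        Files.write(path, line.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: java FileIo <original> <plagiarized> <answer>");
            return;
        }
        String original = readText(Paths.get(args[0]));
        String plagiarized = readText(Paths.get(args[1]));
        // Placeholder only: replace with the real similarity computation.
        double score = original.equals(plagiarized) ? 1.0 : 0.0;
        writeScore(Paths.get(args[2]), score);
    }
}
```

Passing `Locale.ROOT` keeps the decimal separator a dot regardless of the machine's regional settings.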

Development Environment

Operating system: Windows 10 Pro
Language: Java
JDK: JDK 1.8
IDE: IntelliJ IDEA 2023.1

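PSP Table

PSP2.1	Personal Software Process Stages	Estimated time (minutes)	Actual time (minutes)
Planning	Planning
· Estimate	· Estimate how long the task will take	400	500
Development	Development
· Analysis	· Requirements analysis (including learning new technologies)	70	90
· Design Spec	· Produce the design document	10	25
· Design Review	· Design review	20	45
· Coding Standard	· Coding standard (define suitable conventions for the current development)	35	35
· Design	· Detailed design	80	45
· Coding	· Implementation	30	30
· Code Review	· Code review	25	30
· Test	· Testing (self-testing, fixing code, committing changes)	25	60
Reporting	Reporting
· Test Report	· Test report	15	15
· Size Measurement	· Workload measurement	25	10
· Postmortem & Process Improvement Plan	· Postmortem and process improvement plan	20	25
· Total	300	400	500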

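Algorithm outline:

Preprocessing: preprocess the original text and the plagiarized text, for example by removing punctuation and converting to lowercase.
Tokenization: tokenize the original text and the plagiarized text separately, producing one word list for each.
Feature extraction: extract features from the word lists, such as term frequency or TF-IDF.
Similarity computation: compute a similarity score between the original and the plagiarized text using cosine similarity.
Decision: if the similarity score is greater than or equal to the similarity threshold, judge the text as plagiarized; otherwise judge it as not plagiarized.
Return the decision.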
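
The similarity step can be written out explicitly. The formula below is a sketch that assumes the features are raw word counts, one of the options the outline mentions; the symbols $a_w$ and $b_w$ (the number of times word $w$ occurs in the original and in the plagiarized text) are notation introduced here, not taken from the original post.

```latex
\[
\mathrm{sim}(A, B)
  = \frac{\sum_{w} a_w \, b_w}
         {\sqrt{\sum_{w} a_w^{2}} \; \sqrt{\sum_{w} b_w^{2}}}
\]
```

Because word counts are never negative, the score lies between 0 (no shared words) and 1 (identical word proportions).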

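Function list:

Function	Purpose
preprocessText(text: String): String	Preprocesses the text
tokenizeText(text: String): List	Tokenizes the text
extractFeatures(words: List): List	Extracts features (currently returns the word list unchanged)
calculateSimilarity(origWords: List, plagiarizedWords: List): double	Computes the similarity score
checkPlagiarism(origText: String, plagiarizedText: String, similarityThreshold: double): boolean	Compares the similarity score against the threshold
runPlagiarismChecker	Writes the result to the output file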

Flowchart

Code Implementation

Removing punctuation, tokenizing, and computing similarity

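import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PlagiarismChecker {

    public String preprocessText(String text) {
        // Remove punctuation (\p{Punct} covers ASCII punctuation only)
        text = text.replaceAll("\\p{Punct}", "");
        // Convert to lowercase
        return text.toLowerCase();
    }

    public List<String> tokenizeText(String text) {
        // Split on whitespace. Arrays.asList is used because List.of needs Java 9+,
        // while this project targets JDK 1.8.
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public List<String> extractFeatures(List<String> words) {
        // Other features could be extracted here; for now the word list is returned as is
        return words;
    }

    public double calculateSimilarity(List<String> origWords, List<String> plagiarizedWords) {
        // Cosine similarity between the word-count vectors of the two texts
        Map<String, Integer> origCounts = countWords(origWords);
        Map<String, Integer> plagCounts = countWords(plagiarizedWords);
        double dot = 0.0;
        double origNorm = 0.0;
        double plagNorm = 0.0;
        for (Map.Entry<String, Integer> entry : origCounts.entrySet()) {
            int count = entry.getValue();
            origNorm += count * count;
            dot += count * plagCounts.getOrDefault(entry.getKey(), 0);
        }
        for (int count : plagCounts.values()) {
            plagNorm += count * count;
        }
        // A text without words has no direction, so treat its similarity as 0
        if (origNorm == 0.0 || plagNorm == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(origNorm * plagNorm);
    }

    private Map<String, Integer> countWords(List<String> words) {
        // Count occurrences of each non-empty word
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public boolean checkPlagiarism(String origText, String plagiarizedText, double similarityThreshold)
            throws IOException {
        try {
            // Preprocess the original text and the plagiarized text
            origText = preprocessText(origText);
            plagiarizedText = preprocessText(plagiarizedText);
        } catch (Exception e) {
            // Wrap any preprocessing failure in an IOException and rethrow it
            throw new IOException("Text preprocessing failed", e);
        }

        // Tokenize both texts
        List<String> origWords = tokenizeText(origText);
        List<String> plagiarizedWords = tokenizeText(plagiarizedText);

        // Extract features
        origWords = extractFeatures(origWords);
        plagiarizedWords = extractFeatures(plagiarizedWords);

        // Compute the similarity score and compare it against the threshold
        double similarityScore = calculateSimilarity(origWords, plagiarizedWords);
        return similarityScore >= similarityThreshold;
    }
}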
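
One caveat about the tokenizer: Chinese text such as the samples in the requirements contains no spaces, so splitting on whitespace turns a whole sentence into a single token. One possible workaround, which is not part of the original post, is to compare overlapping two-character tokens (character bigrams) instead of words. The class and method names below are hypothetical, and the sketch assumes characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
class BigramTokenizer {

    // Split text into overlapping two-character tokens after dropping
    // whitespace and Unicode punctuation, symbols, and separators.
    static List<String> bigrams(String text) {
        String cleaned = text.replaceAll("[\\p{P}\\p{S}\\p{Z}\\s]", "");
        List<String> tokens = new ArrayList<>();
        if (cleaned.length() < 2) {
            // Too short for a bigram: keep the single character, if any
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
            return tokens;
        }
        for (int i = 0; i + 2 <= cleaned.length(); i++) {
            tokens.add(cleaned.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("天气晴，今天"));
    }
}
```

The resulting token list can be passed to a word-count based similarity function in place of the whitespace tokens. A dedicated Chinese word segmenter would likely give better tokens, at the cost of an extra dependency.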

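Exception Handling

// In checkPlagiarism: wrap any preprocessing failure in an IOException and rethrow it
try {
    origText = preprocessText(origText);
    plagiarizedText = preprocessText(plagiarizedText);
} catch (Exception e) {
    throw new IOException("Text preprocessing failed", e);
}

// In the caller: catch and report the IO exception
try {
    checker.checkPlagiarism(origText, plagiarizedText, similarityThreshold);
} catch (IOException e) {
    System.out.println("Plagiarism check failed: " + e.getMessage());
}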

Unit Test

Source text: This is the original text. It contains some unique content.
Text under check: This is the plagiarized text. It contains some unique content.
The check uses a similarity threshold of 0.8.

Test code
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        PlagiarismChecker checker = new PlagiarismChecker();

        String origText = "This is the original text. It contains some unique content.";
        String plagiarizedText = "This is the plagiarized text. It contains some unique content.";
        double similarityThreshold = 0.8;

        try {
            boolean isPlagiarized = checker.checkPlagiarism(origText, plagiarizedText, similarityThreshold);
            
            if (isPlagiarized) {
                System.out.println("The text is plagiarized.");
            } else {
                System.out.println("The text is not plagiarized.");
            }
        } catch (IOException e) {
            System.out.println("An error occurred: " + e.getMessage());
        }
    }
}
Result: The text is plagiarized.
This indicates that the similarity between the two texts is at or above the configured threshold of 0.8.
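
As a sanity check on that result, the arithmetic can be done by hand, assuming `calculateSimilarity` computes cosine similarity over word counts as the algorithm outline describes. After preprocessing, each text consists of 10 distinct words, each occurring once, and 9 of them are shared (only `original` and `plagiarized` differ):

```latex
\[
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert \, \lVert\mathbf{b}\rVert}
  = \frac{9}{\sqrt{10}\,\sqrt{10}}
  = 0.9 \ge 0.8
\]
```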

From: https://www.cnblogs.com/nihaohhh/p/17718618.html
