首页 > 编程语言 > COMP3009J 信息检索编程

COMP3009J 信息检索编程

时间:2023-05-20 13:23:20浏览次数:46  
标签:信息检索 编程 COMP3009J should program file small corpus your


COMP3009J – Information Retrieval
Programming Assignment
This assignment is worth 30% of the final grade for the module.
Due Date: Sunday 28th May 2023 at 23:55 (i.e. before beginning of Week 15)
Before you begin, download and extract the files ``small_corpus.zip’’ and ``large_corpus.zip’’
from Brightspace. These contain several files that you will need to complete this assignment.
The README.md file in each describes the files contained in the archive and their format1
.
Both corpora are in the same format, except for the relevance judgments. For the small
corpus, all documents not included in the relevance judgments have been judged nonrelevant. For the large corpus, documents not included in the relevance judgments have not
been judged.
You should first try the following tasks using the small corpus. If you can do this successfully,
then you should move to the large corpus. The large corpus will require more efficient
programs.
Part 1: BM25 Model
For this assignment you are required to implement the BM25 of Information Retrieval. You
must create a program (using Python) that can do the following.
1. Extract the documents contained in the document collections provided. You must divide
the documents into terms in an appropriate way. The strategy must be documented in your
source code comments.
2. Perform stopword removal. A list of stopwords to use is contained in the stopwords.txt
file that is provided in the ``files’’ directory.
3. Perform stemming.  代写COMP3009J 程序For this task, you may use the porter.py code in the ``files’’
directory.
4. The first time your program runs, it should create an appropriate index so that IR using the
BM25 method may be performed. Here, an index is any data structure that is suitable for
performing retrieval later.
This will require you to calculate the appropriate weights and do as much pre-calculation as
you can. This should be stored in an external file in some human-readable format. Do not use
database systems (e.g. MySQL, SQL Server, SQLite, etc.) for this.
1 This is a Markdown file. Although you can open and read it as plain text, a Markdown
editor like Remarkable (https://remarkableapp.github.io/ - Windows or Linux) or MacDown
(https://macdown.uranusjr.com/ - macOS) is recommended.
5. The other times your program runs, it should load the index from this file, rather than
processing the document collection again.
6. Run queries according to the BM25 model. This can be done in two ways:
- In “interactive” mode, a user can manually type in queries and see the first 15 results
in their command line, sorted beginning with the highest similarity score. The output should
have three columns: the rank, the document’s ID, and the similarity score. A sample run of
the program is contained later in this document. The user should continue to be prompted to
enter further queries until they type “QUIT”.
- In “automatic” mode, the standard queries should be read from the
``queries.txt’’ file (in the ``files’’ directory). This file has a query on each line, beginning
with its query ID. The results should be printed into a file named “results.txt”, which
should include four columns: query ID, rank, document ID and similarity score.
It is ESSENTIAL that this can be run as a standalone program, without requiring an IDE such
as IDLE, PyCharm, etc.
You can assume that your program will be run in the same directory as the README.md file
(i.e. the current directory will have the ``documents’’ and ``files’’ directories in it). Do not
use absolute paths in your code.
Non-standard libraries (other than the Porter stemmer provided) may not be used.
Users should be able to select the appropriate mode by using command line arguments, e.g.:
- python search_small_corpus.py -m interactive
Or for automatic mode:
- python search_small_corpus.py -m automatic
Part 2: Evaluation
For this part, your program should evaluate your results.txt file (that was created during
automatic mode above) to evaluate the effectiveness of the BM25 approach.
The user should be able to run the program using the following command:
- python evaluate_small_corpus.py
Based on the results.txt file, your program should calculate and print the following
evaluation metrics (based on the relevance judgments contained in the ``qrels.txt’’ file
in the ``files’’ directory):
- Precision
- Recall
- P@10
- R-precision
- MAP
- bpref
What you should submit
Submission of this assignment is through Brightspace. You should submit a single .zip archive
containing:
- Part 1: Python programs to run the queries.
o search_small_corpus.py
o search_large_corpus.py (if you have attempted the large corpus)
- Part 2: Python files to perform the evaluation.
o evaluate_small_corpus.py
o evaluate_large_corpus.py (if you have attempted the large corpus)
- A README.txt or README.md file that describes what your program can do (in
particular it should mention whether the program will work on both corpora or only
the small one).
Sample Run (Interactive)
$ python search_small_corpus.py -m interactive
Loading BM25 index from file, please wait.
Enter query: library information conference
Results for query [library information conference]
1 928 0.991997
2 1109 0.984280
3 1184 0.979530
4 309 0.969075
5 533 0.918940
6 710 0.912594
7 388 0.894091
8 1311 0.847748
9 960 0.845044
10 717 0.833753
11 77 0.829261
12 1129 0.821643
13 783 0.817639
14 1312 0.804034
15 423 0.795264
Enter query: QUIT
Note: In all of these examples, the results, and similarity scores were generated at random for
illustration purposes, so they are not correct scores.
Sample Run (Evaluation)
$ python evaluate_small_corpus.py
Evaluation results:
Precision: 0.138
Recall: 0.412
R-precision: 0.345
P@10: 0.621
MAP: 0.253
bpref: 0.345
Grading Rubric
Below are the main criteria that will be applied for the major grades (A, B, C, etc.). Other aspects
will be taken into account to decide minor grades (i.e. the difference between B+, B, B-, etc.).
- Readability and organisation of code (including use of appropriate functions, variable
names, helpful comments, etc.).
- Quality of solution (including code efficiency, presence of minor bugs, avoiding absolute
paths, etc.).
Questions should be sent to [email protected] or posted in the Brightspace forum.
Passing Grades
``D'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. This
includes extracting the documents from the document collection, preprocessing (stemming and
stopword removal), indexing and retrieval. The solution may contain some implementation errors.
It is clear that the student has correctly understood the Information Retrieval process.
``C'' Grade
Good implementation of the primary aspects of Information Retrieval, using the small corpus. The
program can also save and load the index to/from an external file, as appropriate.
``B'' Grade
Correct implementation of all sections of the assignment using the small corpus (some minor
implementation errors will be tolerated). It is clear that the student has understood both the
information retrieval process and the evaluation process. Note: This means that evaluation is only
taken into account if your search program can successfully retrieve documents for evaluation.
``A'' Grade
Excellent implementation of all sections of the assignment, including choice of appropriate efficient
data structures and efficient programming. The efficiency of the programs will be measured using
the large corpus. In particular, a response to a query must be returned in a reasonable amount of
time, although efficiency is important in indexing also. Note: This means that working efficiently on
the large corpus is only taken into account if your code can successfully work with the small corpus.
Failing Grades
``ABS'' Grade
No submission received.
``NM'' Grade
No relevant work attempted.
``G'' Grade
Wholly unacceptable, little or no evidence of meaningful work attempted.
``F'' Grade
Some evidence of work attempted, but little (if any) functionality operates in the correct manner.
``E'' Grade
Clear evidence that work has been attempted on implementing retrieval using BM25, but there
are serious errors in implementation, or in understanding of the process.
Other notes
1. This is an individual assignment. All code submitted must be your own work. Submitting the work
of somebody else or generated by AI tools such as ChatGPT is plagiarism, which is a serious
academic offence. Be familiar with the UCD Plagiarism Policy and the UCD School of Computer
Science Plagiarism Policy.
2. If you have questions about what is or is not plagiarism, ask!
Document Version History
v1.0: 2023-05-08, Initial Version.
v1.1: 2023-05-15, Updated requirements for the output format of automatic mode.

 WX:codehelp

标签:信息检索,编程,COMP3009J,should,program,file,small,corpus,your
From: https://www.cnblogs.com/messagejava/p/17417085.html

相关文章

  • BOM编程
    1. BOM编程window的open合close  45当前窗口  _self新窗口    _blank父窗口   _parent顶级窗口  _top代码在course10 010 BOM编程的open和close<!--BOM编程的open和close45--><!DOCTYPEhtml><html> <head> <metacharset="utf-8"> <title>BOM......
  • 个人编程踩坑或易错记录
    ASP.NETCoreArea视图里<environmentinclude="Development">不起作用ASP.NETCoremvc,在Area的view文件中,不能使用以下代码。有可能异常的代码示例:(本应该正常,参考:ASP.NETCore中的环境标记帮助程序|MicrosoftLearn)<environmentinclude="Development"><h1>Devel......
  • 软构笔记-7-面向对象的编程
    目录软构7基本概念Interface在interface中使用default方法继承与重写重写AbstractClass抽象类Polymorphism,subtypingandoverloading多态、子类型、重载三种多态Overloading重载重载的规则Overridingvs.Overloading子类型多态继承和子类型:层次结构一瞥软构7本章......
  • 我写了本开源书:《3D编程模式》
    大家好,我写了本开源书,罗列了我从自己的实战项目中提炼出来的关于3D编程(主要包括“3D引擎/游戏引擎”、“编辑器”开发)的各种编程模式本书的在线阅读地址在这里:在线阅读本书的源码在Github中,欢迎star,感恩您:Github地址本书的写作花了我300多个小时,将近3个月的全职写作,凝结了我一......
  • 组件化编程
    模块与组件模块向外提供特定功能的js程序--一般就是一个js文件作用--复用js、简化js的编写、提高js运行效率模块化--当应用中的js都是以模块来编写的--这个应用就是一个模块化的应用组件用来实现局部(特定)功能效果的代码集合作用--复用编码--简化项目编码、提高运行......
  • C#异步编程之async/await
    https://blog.csdn.net/wulex/article/details/127380403  异步、多线程的区别 异步:属于通信的范畴,在发出消息当下不等待对方回应,便开始继续自己的任务。所以c#的很多_异步_操作都在io、socket这些类库下面,都是通信性质的类库, io可以理解向系统内核发送文件......
  • windows 编程模板
    win32消息#include<Windows.h>#include<stdio.h>#include<string.h>//用于在win32中获取控制台HANDLEg_hOutput=NULL;//全局保存实例对象HINSTANCEg_hInstance=NULL;LRESULTCALLBACKWnProc( HWNDhwnd, UINTmsg, WPARAMwparam, LPARAMlpar......
  • Java编程进阶:掌握这五个高级特性让编码更得心应手
    当今,在软件开发中最流行和使用最广泛的语言之一是Java。Java不仅是一种面向对象的编程语言,还可以用于Web开发、移动应用程序编程、云计算和大数据处理。如果你是一个初学者,或者是想要进一步了解Java编程的高级特性,那么这篇文章会帮助你更好地了解Java编程。一、使用lambda表达式在J......
  • 编程打卡:面向对象程序设计测试
    ``gopackagemainimport"bufio"import"flag"import"fmt"import"io"import"os"import"strconv"varinfile*string=flag.String("i","unsorted.dat","Filecontains......
  • 开心档之C++ Web 编程
    C++Web编程什么是CGI?公共网关接口(CGI),是一套标准,定义了信息是如何在Web服务器和客户端脚本之间进行交换的。CGI规范目前是由NCSA维护的,NCSA定义CGI如下:公共网关接口(CGI),是一种用于外部网关程序与信息服务器(如HTTP服务器)对接的接口标准。目前的版本是CGI/1.1,CGI/......