COMP4650
COMP4650/6490 Document Analysis – Semester 2 / 2023
Assignment 1
Due 17:00 on Wednesday 16 August 2023 AEST (UTC +10)
Last updated July 28, 2023
Overview
In this assignment, your task is to implement a basic boolean and ranked information retrieval (IR)
system using a document collection, and then measure search performance based on predefined queries.
A document collection containing more than 30,000 government site descriptions is provided for this
assignment, along with a set of queries (in file gov/topics/gov.topics) and the expected returned
documents (in file gov/qrels/gov.qrels). The provided code implements most of an IR system.
Throughout this assignment you will make changes to the provided code to improve or complete existing
functions. Note that the provided code is designed to be simple to understand and modify; it is neither efficient nor scalable. When developing a real-world IR system you would be better off using high-performance software such as Apache Lucene [1].
Throughout this assignment:
1. You will develop a better understanding of indexing, including the tokeniser, parser, and normaliser
components, and how to improve the search performance given a predefined evaluation metric;
2. You will develop a better understanding of search algorithms, and how to obtain better search
results, and
3. You will find the best way to combine an indexer and search algorithm to maximise the performance
of an IR system.
Submission
You will produce an answers file with your responses to each question. Your answers file must be
a PDF file named u1234567.pdf where u1234567 should be replaced with your Uni ID.
The answers to this assignment (including your code files) must be submitted online via Wattle.
You should submit a ZIP file containing all of the code files and your answers PDF file, BUT NO
DATA.
No late submission will be permitted without a pre-arranged extension. A mark of 0 will be
awarded if not submitted by the due date.
Marking
This assignment will be marked out of 100, and it will contribute 10% of your final course mark.
Your answers to coding questions will be marked based on the quality of your code (is it efficient, is it
readable, is it extendable, is it correct) and the solution in general (is it appropriate, is it reliable, does it
demonstrate a suitable level of understanding).
Your answers to discussion questions will be marked based on how convincing your explanations are (are
they sufficiently detailed, are they well-reasoned, are they backed by appropriate evidence, are they clear,
do they use appropriate aids such as tables and plots where necessary).
[1] https://lucene.apache.org/
This is an individual assignment. Group work is not permitted. Assignments will be checked for
similarities.
Question 1: Implement Boolean Queries (40%)
The construction of the inverted index is implemented in indexer.py. You should first run indexer.py to
store the following index data:
a dictionary (called index) mapping a token string to a sorted list of (doc_id, term_frequency)
tuples,
a dictionary (called doc_freq) mapping a token string to its document frequency,
a dictionary (called doc_ids) mapping a doc_id to the path of the document, and
the number of documents in the collection (called num_docs).
Please double-check that you are using process_tokens_original within the function process_tokens
in string_processing.py before running indexer.py.
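To make these structures concrete, here is a minimal in-memory sketch of how such an index could be built. The function and variable names below are illustrative only (the real indexer.py reads the document collection from disk and stores its output differently):

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted-index builder; `docs` maps doc_id -> list of tokens.

    Returns structures in the same spirit as those stored by indexer.py:
    an index of sorted postings lists, document frequencies, and num_docs.
    """
    index = defaultdict(list)  # token -> sorted list of (doc_id, term_frequency)
    for doc_id in sorted(docs):  # visiting doc_ids in order keeps postings sorted
        counts = defaultdict(int)
        for token in docs[doc_id]:
            counts[token] += 1
        for token, tf in counts.items():
            index[token].append((doc_id, tf))
    # document frequency = number of postings for the token
    doc_freq = {token: len(postings) for token, postings in index.items()}
    return dict(index), doc_freq, len(docs)

docs = {1: ["cat", "dog", "cat"], 2: ["dog", "fish"]}
index, doc_freq, num_docs = build_index(docs)
# e.g. index["dog"] == [(1, 1), (2, 1)] and doc_freq["dog"] == 2
```

Keeping postings sorted by doc_id at build time is what makes the merge-style boolean query algorithms in Question 1 possible without any extra sorting.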
Your task is to implement the ability to run boolean queries on the built index. The starting code is
provided in query_boolean.py. Specifically, you shall implement a simplified boolean query grammar.
You may assume that input queries consist of only AND and OR operators separated by single tokens. For
example, cat AND dog is a valid query, while cat mask AND dog is not a valid query since cat mask is
not a single token. You are not required to implement NOT. The order of operations will be left to right
with no precedence for either of the operators; for example, the query cat AND dog OR fish AND fox
should be evaluated in the following order: ((cat AND dog) OR fish) AND fox. The brackets are provided
as an example; you can assume that the queries provided to your system will not contain brackets. To
score full marks on this question, your solution should implement O(m + n) sorted list intersection and
sorted list union algorithms – O(m + n) sorted list intersection was covered in the lectures, while union is
very similar – where m and n refer to the lengths of the two lists. Solutions using data structures such
as sets or dictionaries to implement the intersection and union operations will not score full marks.
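As a sketch of the expected merge-style approach (the function names and the simple left-to-right driver below are illustrative, not the required interface of query_boolean.py):

```python
def intersect(a, b):
    """O(m + n) intersection of two sorted doc-id lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1  # advance the pointer at the smaller doc_id
        else:
            j += 1
    return out

def union(a, b):
    """O(m + n) union of two sorted doc-id lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])  # one of these two tails is empty
    out.extend(b[j:])
    return out

def run_boolean_query(query, postings):
    """Evaluate a bracket-free query left to right, e.g. 'cat AND dog OR fish'."""
    parts = query.split()
    result = postings.get(parts[0], [])
    for op, term in zip(parts[1::2], parts[2::2]):
        nxt = postings.get(term, [])
        result = intersect(result, nxt) if op == "AND" else union(result, nxt)
    return result

postings = {"cat": [1, 2, 5], "dog": [2, 3, 5], "fish": [4]}
# run_boolean_query("cat AND dog OR fish", postings) -> [2, 4, 5]
```

Both helpers walk each list once with two pointers, which is where the O(m + n) bound comes from; a set-based solution would hide the same work behind hashing and would not demonstrate the algorithm.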
Once you have completed your boolean query system, please run it on the following queries and list the
relative paths (e.g., ./gov/documents/14/G00-14-2198849) of the retrieved documents in your answers
PDF file.
HINT: none of the queries below returns more than 10 results, and Query 0 has been done for you so that
you can check your system.
Query 0: Workbooks
Answer:
./gov/documents/14/G00-14-2198849
./gov/documents/36/G00-36-2337608
./gov/documents/50/G00-50-0062475
./gov/documents/69/G00-69-1400565
./gov/documents/69/G00-69-2624147
Query 1: Australasia OR Airbase
Query 2: Warm AND WELCOMING
Query 3: Global AND SPACE AND economies
Query 4: SCIENCE OR technology AND advancement AND PLATFORM
Query 5: Wireless OR Communication AND channels OR SENSORY AND INTELLIGENCE
Make sure you submit your code for this question (i.e., query_boolean.py) as well as your answers.
Question 2: Implement TF-IDF Cosine Similarity (30%)
Your task is to implement the get_doc_to_norm function and the run_query function in query_tfidf.py to
use TF-IDF and the cosine similarity applied to TF-IDF. In your solution both the query and the document
vectors should be TF-IDF vectors. Your implementation could be similar to the get_doc_to_norm and
run_query functions in query.py but should use TF-IDF instead of term frequency.
The TF-IDF variant you should implement is:

TF-IDF = tf × ln( N / (1 + df) ),

where tf is the term frequency, df is the document frequency, and N is the total number of documents
in the collection. This is almost the standard TF-IDF variant, except that 1 is added to the document
frequency to avoid division-by-zero errors.
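As a sketch of this computation (assuming the formula above; the helper names are illustrative and are not the required signatures in query_tfidf.py):

```python
import math
from collections import Counter

def tfidf(tf, df, num_docs):
    """The assignment's TF-IDF variant: tf * ln(N / (1 + df))."""
    return tf * math.log(num_docs / (1 + df))

def cosine_score(query_tokens, doc_tokens, doc_freq, num_docs):
    """Cosine similarity between the TF-IDF vectors of a query and one document."""
    q_vec = {t: tfidf(tf, doc_freq.get(t, 0), num_docs)
             for t, tf in Counter(query_tokens).items()}
    d_vec = {t: tfidf(tf, doc_freq.get(t, 0), num_docs)
             for t, tf in Counter(doc_tokens).items()}
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```

In the provided code the document norms are precomputed once (that is what get_doc_to_norm is for), so only the dot product and the query norm need computing per query.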
Once you have implemented TF-IDF cosine similarity, run the query_tfidf.py file and record the top-5
retrieved documents as well as their similarity scores in your answers PDF file for the two queries below:
Query 1: Is nuclear power plant eco-friendly?
Query 2: How to stay safe during severe weather?
You should then run evaluate.py to evaluate the query results (for the queries in gov/topics/gov.topics)
against the ground truth, and record the evaluation results in your answers PDF file. Make sure you submit
your query_tfidf.py.
Question 3: Explore Linguistic Processing Techniques (30%)
For this question you will explore ways to improve the process_tokens function in string_processing.py.
The current function removes stopwords and lowercases the tokens. You should modify the function and
explore the results. To modify the function, you should make changes to the functions process_token_1,
process_token_2, and process_token_3 and then uncomment the one you want to test within the main
process_tokens function. You should pick at least three different modifications and evaluate them (you
can add new process_tokens functions if you want to evaluate more than three modifications). See the lectures
for some possible modifications. You might find the Python nltk library useful. The modifications
you make do not need to require significant coding; the focus of this question is choosing reasonable
modifications and explaining the results.
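For example, one plausible modification is stemming. In practice you would likely reach for nltk's PorterStemmer; the crude suffix stripper below is only a dependency-free stand-in to show the shape of such a change (the stopword set and function names here are hypothetical, mirroring but not reproducing the assignment's process_token_1):

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and"}  # illustrative subset only

def crude_stem(token):
    """Very crude suffix stripping – a stand-in for nltk.stem.PorterStemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process_token_1(token):
    """Hypothetical modification: lowercase, drop stopwords, then stem."""
    token = token.lower()
    if token in STOPWORDS:
        return None  # caller would filter out None values
    return crude_stem(token)

# e.g. process_token_1("Economies") -> "econom", process_token_1("The") -> None
```

The intuition behind such a change is that queries like "Global AND economies" should also match documents containing "economy"; your write-up should test whether conflating these forms actually helps on the gov.topics queries.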
To evaluate each modification you made, you should
(1) run indexer.py to rebuild the index data, then
(2) run query.py (not query_tfidf.py), and
(3) run evaluate.py to evaluate the query results.
For each modification you make, you should describe in your answers:
What modifications you made.
Why you made them (in other words, why you thought they might work).
What the new performance is.
Why you think the modification did or did not work, making sure to give (and explain) examples of
possible failure or success cases.
Finally, you should compare all the modifications by choosing one appropriate metric and decide which
modification (or combination of modifications) performed the best. Your comparison should make use
of a table or chart as well as some discussion. Make sure to report all of this and your justification in
your answers, and to submit your string_processing.py showing each of the changes you made.
From: https://www.cnblogs.com/longtimeagos/p/17643827.html
    调试与性能分析 2022-08-03 7minread c/cpp , techs调试我常用的调试工具是GDB(g++-g)和二分查错法,先删除一半代码,看是否有问题,如果没有问题,那问题就在另一半代码中:)运行时运行时(runtimedebug)调试在一些场景下比较重要,比如调试阻塞的程序。运行时调试的工具有......