

INFS7410 Project - Part 1 - v3

Note: these instructions were modified on 28/08/2019

Preamble

The due date for this assignment is 19 September 2019 17:00, Eastern Australia Standard Time, together with part 2 (extended from the earlier deadlines of 29 August 2019 17:00 and 5 September 2019 17:00).

This project is worth 5% of the overall mark for INFS7410. A detailed marking sheet for this

assignment is provided at the end of this document.

We recommend that you make an early start on this assignment, and proceed in steps. There are a number of activities you may already tackle, including setting up the pipeline, manipulating the queries, implementing some retrieval functions and performing evaluation and analysis. There are some activities you do not yet know how to perform, in particular the implementation of the rank fusion algorithms: this will be the topic of the week 5 lecture and tutorials.

Aim

Project aim: The aim of this project is to implement a number of information retrieval methods,

evaluate them and compare them in the context of a real use-case.

Project Part 1 aim

The aim of part 1 is to:

set up the evaluation infrastructure, including collection and index, topics, qrels

implement common information retrieval baselines

implement ranking fusion methods

evaluate, compare and analyse baseline and ranking fusion methods

The Information Retrieval Task: Ranking of studies for

Systematic Reviews

In this project we will consider the problem of ranking research studies identified as part of a

systematic review. Systematic reviews are a widely used method to provide an overview of the

current scientific consensus, by bringing together multiple studies in a reliable, transparent way.

We will use the CLEF 2017 and 2018 eHealth TAR (task 2) collections. In CLEF TAR 2017, the task

we consider is referred to as subtask 1 (and is the only task); in CLEF TAR 2018, the task we

consider is referred to as subtask 2. We provide the CLEF 2017 and 2018 TAR task overview

papers in the assignment folder in Blackboard for your reference. These contain details about the

topics, the collection, the task, etc. These details are not necessary to complete the assignment,

but nevertheless you may want to know more about this task, its importance, approaches that

have been tried, and so on.

The task takes as its starting point the results of the Boolean search created by the researchers undertaking a systematic review, and consists of ranking the set of provided documents (these are identified by PMID, i.e. PubMed ID, in the files provided; for each PMID there is an associated title and abstract).

The goal is to produce an ordering of the documents such that all the relevant documents are

retrieved above the irrelevant ones. This is to be achieved through automatic methods that rank

all abstracts, with the goal of retrieving relevant documents as early in the ranking as possible.

There are two datasets to consider in this project: the CLEF 2017 TAR dataset and the CLEF 2018 TAR dataset. Each dataset consists of material for training and material for testing the developed information retrieval methods.

What we provide you with

We provide:

for each dataset, a list of topics to be used for training. Each topic is organised into a file.

Each topic contains a title and a Boolean query.

for each dataset, a list of topics to be used for testing. Each topic is organised into a file. Each

topic contains a title and a Boolean query.

each topic file (both those for training and those for testing) includes a list of retrieved

documents in the form of their PMIDs: these are the documents that you have to rank. Take

note: you do not need to perform the retrieval from scratch (i.e. execute the query against

the whole index); instead you need to rank (order) the provided documents.

for each dataset, and for each train and test partition, a qrels file, containing relevance

assessments for the documents to be ranked. This is to be used for evaluation.

for each dataset, and for test partitions, a set of runs from retrieval systems that participated in CLEF 2017/2018 to be considered for fusion.

a Terrier index of the entire PubMed collection. This index has been produced using the

Terrier stopword list and Porter stemmer.

a Java Maven project that contains the Terrier dependencies and skeleton code to give you

a start. NOTE: Tip #1 provides you with a restructured skeleton code to make the processing

of queries more efficient.

a template for your project report.

What you need to produce

You need to produce:

correct implementations of the methods required by this project specification

correct evaluation, analysis and comparison of the evaluated methods, written up into a

report following the provided template

a project report that, following the provided template, details: an explanation of the retrieval methods used, an explanation of the evaluation settings followed, the evaluation of results (as described above), inclusive of analysis, and a discussion of the findings.

Required methods to implement

In part 1 of the project you are required to implement the following retrieval methods:

1. TF-IDF: you can create your own implementation using the Terrier API to extract index

statistics, or use the implementation available through the Terrier API

2. BM25: you can create your own implementation using the Terrier API to extract index

statistics, or use the implementation available through the Terrier API

3. The ranking fusion method Borda; you need to create your own implementation of this

4. The ranking fusion method CombSUM; you need to create your own implementation of this

5. The ranking fusion method CombMNZ; you need to create your own implementation of this

We strongly recommend you use the provided Maven project to implement these methods. You

should have already attempted many of the implementations above as part of the tutorial

exercises.

In the report, detail how the methods were implemented, i.e. (i) which formula you implemented, (ii) whether you wrote your own implementation or leveraged Terrier's (for TF-IDF and BM25).
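To illustrate the kind of formula item (i) refers to, here is a minimal sketch of per-term TF-IDF and BM25 scoring in plain Java. It assumes one common textbook formulation of each; the class name `ScoringSketch`, the method names and the parameter values `K1` and `B` are hypothetical and are not part of the Terrier API. Terrier's built-in weighting models may use different variants, so state in the report which one you actually used.

```java
/** Hypothetical helper; class and method names are illustrative, not Terrier API. */
final class ScoringSketch {
    static final double K1 = 1.2;  // assumed, commonly used default
    static final double B = 0.75;  // assumed, commonly used default

    /** TF-IDF contribution of one query term: raw tf times log(N / df). */
    static double tfIdf(double tf, long docFreq, long numDocs) {
        return tf * Math.log((double) numDocs / docFreq);
    }

    /** BM25 contribution of one query term for one document. */
    static double bm25(double tf, double docLen, double avgDocLen,
                       long docFreq, long numDocs) {
        // IDF with +1 inside the log so that it never becomes negative.
        double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        // Document length normalisation controlled by B.
        double lengthNorm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf * tf * (K1 + 1.0) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        // Made-up statistics, for illustration only.
        System.out.println(tfIdf(3, 50, 1_000_000));
        System.out.println(bm25(3, 120, 200, 50, 1_000_000));
    }
}
```

The score of a document for a query would then be the sum of these per-term contributions over the query terms, with the statistics (tf, document length, document frequency, collection size) read from the index.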

For ranking fusion methods, consider fusing the runs from previous CLEF 2017/2018 participants that we provide, together with the TF-IDF and BM25 runs you will produce.
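The three fusion methods can be sketched as follows for a single topic. This is a simplified illustration under stated assumptions, not the required implementation: a run is represented as a map from document ID to score (or as a ranked list of IDs for Borda), scores are min-max normalised before summing, CombMNZ counts the runs in which a document appears, and the Borda variant gives n - r points to the document at 0-based rank r in a list of length n, with no points for absent documents. The class `FusionSketch` and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical fusion helpers for the runs of a single topic. */
final class FusionSketch {

    /** Min-max normalise the scores of one run into [0, 1] (assumed choice). */
    static Map<String, Double> minMax(Map<String, Double> run) {
        Map<String, Double> out = new HashMap<>();
        if (run.isEmpty()) return out;
        double min = Collections.min(run.values());
        double max = Collections.max(run.values());
        for (Map.Entry<String, Double> e : run.entrySet()) {
            out.put(e.getKey(), max == min ? 1.0 : (e.getValue() - min) / (max - min));
        }
        return out;
    }

    /** CombSUM: sum of the normalised scores a document receives across runs. */
    static Map<String, Double> combSum(List<Map<String, Double>> runs) {
        Map<String, Double> fused = new HashMap<>();
        for (Map<String, Double> run : runs) {
            for (Map.Entry<String, Double> e : minMax(run).entrySet()) {
                fused.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return fused;
    }

    /** CombMNZ: CombSUM multiplied by the number of runs containing the document. */
    static Map<String, Double> combMnz(List<Map<String, Double>> runs) {
        Map<String, Double> fused = combSum(runs);
        for (Map.Entry<String, Double> e : fused.entrySet()) {
            long hits = runs.stream().filter(r -> r.containsKey(e.getKey())).count();
            e.setValue(e.getValue() * hits);
        }
        return fused;
    }

    /** Borda (simplified): rank r (0-based) in a list of length n earns n - r points. */
    static Map<String, Double> borda(List<List<String>> rankings) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int r = 0; r < n; r++) {
                fused.merge(ranking.get(r), (double) (n - r), Double::sum);
            }
        }
        return fused;
    }

    /** Order document IDs by descending fused score, breaking ties by ID. */
    static List<String> rank(Map<String, Double> scores) {
        List<String> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int c = Double.compare(scores.get(b), scores.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return ids;
    }

    public static void main(String[] args) {
        // Made-up runs; Map.of and List.of require Java 9 or later.
        List<Map<String, Double>> runs = List.of(
                Map.of("d1", 3.0, "d2", 2.0, "d3", 1.0),
                Map.of("d2", 10.0, "d4", 0.0));
        System.out.println(rank(combSum(runs)));
        System.out.println(rank(combMnz(runs)));
        System.out.println(rank(borda(List.of(List.of("d1", "d2", "d3"), List.of("d2", "d4")))));
    }
}
```

Other choices are possible, for example a different score normalisation or a Borda variant that shares the remaining points among unranked documents; whichever variant you choose, describe it in the report.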

What queries to use

We ask you to consider two types of queries for each topic (the second type is optional and

attracts bonus points):

1. for each topic, a query created from the topic title. For example, consider the example

(partial) topic listed below: the query will be Rapid diagnostic tests for diagnosing

uncomplicated P. falciparum malaria in endemic countries (you may consider

performing text processing).

2. (OPTIONAL: 2% bonus if done) for each topic, a query created from the Boolean query associated with the topic. This query will be made up of the terms that appear in the Boolean query, but will ignore any operator (e.g., and, or, Exp, /, etc.) and field restrictions (e.g., .ti, .ab, .ti,ab, etc.). Note that some keywords in the Boolean query have been manually stemmed, e.g. diagnos* in the example topic below. As part of the query creation process, we ask you to use the Entrez API. For documentation on the Entrez esearch API, please refer to the Entrez Programming Utilities Help reference available at: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch. Example usage can be found at the following URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=diagnos*. Note the terms in the TranslationStack field. These are the terms you would use to replace diagnos* and therefore concatenate to form the query (along with the other terms).
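The two query types can be sketched as below. This is a rough illustration, assuming simple lowercasing and punctuation stripping as the text processing, an assumed operator list (and, or, not, exp) and field restrictions built only from ti and ab; real topic files may need more cases, and genuine query words that happen to match an operator would also be dropped. The class `QuerySketch`, its method names and the sample Boolean query in `main` are hypothetical. The sketch only builds the esearch URL; fetching it and reading the TranslationStack field is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical query-building helpers. */
final class QuerySketch {
    // Assumed operator list; extend it to match the operators in the topic files.
    static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("and", "or", "not", "exp"));
    static final String ESEARCH =
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=";

    /** Title query: lowercase the title and split it into alphanumeric tokens. */
    static List<String> titleTerms(String title) {
        return tokens(title.toLowerCase(Locale.ROOT));
    }

    /** Boolean query: drop field restrictions and operators, keep the terms. */
    static List<String> booleanTerms(String booleanQuery) {
        String s = booleanQuery.toLowerCase(Locale.ROOT)
                .replaceAll("\\.(ti|ab)(,(ti|ab))*\\b\\.?", " "); // .ti, .ab, .ti,ab
        List<String> out = new ArrayList<>();
        for (String tok : tokens(s)) {
            if (!OPERATORS.contains(tok)) out.add(tok);
        }
        return out;
    }

    /** Split on anything that is not a letter, digit or the * wildcard. */
    private static List<String> tokens(String lowerCased) {
        List<String> out = new ArrayList<>();
        for (String tok : lowerCased.replaceAll("[^a-z0-9*]+", " ").trim().split(" ")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    /** Build the esearch URL for one (possibly wildcarded) term. */
    static String esearchUrl(String term) {
        try {
            return ESEARCH + URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(titleTerms("Rapid diagnostic tests for diagnosing "
                + "uncomplicated P. falciparum malaria in endemic countries"));
        // Made-up Boolean query fragment, for illustration only.
        for (String term : booleanTerms("exp Malaria/ and (rapid diagnos* test*).ti,ab.")) {
            System.out.println(term.contains("*") ? esearchUrl(term) : term);
        }
    }
}
```

Terms that still contain a wildcard would then be looked up through esearch and replaced with the terms found in the TranslationStack field, as described above.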

[Figure: example topic file]