Diversifying the Legal Order

This page is a companion to the paper "Evaluation of Diversification Techniques for Legal Information Retrieval", written by Koniaris Marios (me), Ioannis Anagnostopoulos and Yannis Vassiliou.

This page hosts the complete dataset, ground-truth data, queries and relevance assessments used in the article. Our goal is to encourage progress on diversification in legal IR.

Dataset

Our corpus contains 3,890 Australian legal cases from the Federal Court of Australia. The cases were originally downloaded from AustLII and were used in

F. Galgani, P. Compton, and A. Hoffmann. Combining different summarization techniques for legal text. In Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pages 115-123, Avignon, France, April 2012. Association for Computational Linguistics

to experiment with automatic summarization and citation analysis. The legal corpus contains all cases from the Federal Court of Australia from 2006 to 2009. From the cases, we extracted all the text needed for our diversification framework.

Our dataset can be found here

West Law Digest Topics

West Law Digest Topics is a taxonomy for identifying points of law from reported cases and organizing them by topic and key number. It is used to organize the entire body of American law.

  1. WikiPedia entry
  2. PDF list

We downloaded this list from WestLaw, processed it, and obtained a textual representation of it.

  1. Original Topics/ queries

Each topic was issued as a candidate query to our retrieval system. Outlier queries, whether too specific/rare or too general, were removed using the interquartile range: queries falling below Q1 or above Q3 were discarded, applying the filter sequentially to the number of hits in the result set and to the score distribution of the hits, while also requiring a minimum cover of min|N| results.
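The outlier filter described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the paper: the `stats` layout, the use of the mean score as the summary of the score distribution, and applying the minimum-cover check before the quartiles are computed are all assumptions.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return (Q1, Q3) for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def filter_queries(stats, min_n):
    """Keep queries whose hit count and score summary lie within [Q1, Q3].

    stats maps query -> (num_hits, mean_score). Both the mapping layout and
    the choice of the mean as score summary are assumptions for illustration.
    """
    # Require a minimum number of results (assumed to be checked first).
    kept = {q: s for q, s in stats.items() if s[0] >= min_n}
    # Apply the IQR filter sequentially: first on hits, then on scores.
    for idx in (0, 1):
        if len(kept) < 2:  # quartiles need at least two data points
            break
        lo, hi = iqr_bounds([s[idx] for s in kept.values()])
        kept = {q: s for q, s in kept.items() if lo <= s[idx] <= hi}
    return kept
```

A query with a single hit or with thousands of hits falls outside [Q1, Q3] on the first pass and is dropped before the score pass runs.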

  1. Used Topics/ queries

Our final list of user queries. In total, we kept 289 queries.

Query assessments and ground-truth.

For each topic/query we kept the top-n results. An LDA topic model was trained on the top-n results of each query, using an open-source implementation (MALLET). From the resulting per-document topic distributions we derive relevance judgments for each query, document and subtopic, using an acceptance threshold of 20%. In other words, we treat the topics produced by LDA as aspects of each query, and use the topic/document distribution to infer whether a document is relevant to an aspect. Our ground-truth data can be found:
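As a rough sketch of the thresholding step, assuming the per-document topic proportions have already been exported from the topic model: the function names, the `doc_topics` layout, the inclusive comparison (`>=`) and the output line layout `query subtopic doc relevance` are illustrative assumptions, not the authors' code.

```python
def aspect_judgments(doc_topics, threshold=0.20):
    """Yield (doc_id, subtopic, relevance) triples.

    doc_topics maps doc_id -> list of topic proportions (one LDA row per
    document). A document counts as relevant to a subtopic when that
    topic's proportion reaches the threshold (inclusive: an assumption).
    """
    for doc_id, proportions in doc_topics.items():
        for subtopic, p in enumerate(proportions, start=1):
            if p >= threshold:
                yield doc_id, subtopic, 1


def qrels_lines(query_id, doc_topics, threshold=0.20):
    """Format judgments as 'query subtopic doc relevance' lines."""
    return [
        f"{query_id} {subtopic} {doc_id} {rel}"
        for doc_id, subtopic, rel in aspect_judgments(doc_topics, threshold)
    ]
```

With this scheme a document can be relevant to several aspects of the same query, as long as each of those topics accounts for at least 20% of the document.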

Stop Words

Our stop word list can be found here

Novelty and diversity evaluation measures

C source code for diversity task evaluation (TREC ndeval) can be found here

Results

Our results can be found here.

For each query, our initial set N contains the top-n query results. The interpolation parameter λ ∈ [0, 1] is tuned in steps of 0.1, separately for each method.
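The parameter sweep can be sketched as a simple grid search. Here `evaluate` is a hypothetical callback that runs one diversification method with a given parameter value and returns a score to maximise; how the best value was actually selected in the paper is not specified on this page.

```python
def lambda_grid(step=0.1):
    """Return [0.0, 0.1, ..., 1.0], rounded to avoid floating-point drift."""
    n = round(1 / step)
    return [round(i * step, 10) for i in range(n + 1)]


def tune_lambda(evaluate, step=0.1):
    """Return (best_lambda, best_score) over the grid.

    evaluate is a hypothetical callback: it runs one method with the given
    interpolation value and returns a score where higher is better.
    """
    return max(
        ((lam, evaluate(lam)) for lam in lambda_grid(step)),
        key=lambda pair: pair[1],
    )
```

Calling `tune_lambda` once per method mirrors the statement that the parameter is tuned separately for each method.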

We present the evaluation results for the methods employed, using the aforementioned evaluation metrics, at cut-off values of 5, 10, 20 and 30, as is typical in TREC evaluations. Note that each of the diversification variations is applied in combination with each of the diversification algorithms and for each user query.
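To illustrate what evaluating at fixed cut-offs means, the sketch below computes subtopic recall, a common diversity measure, at the four cut-off values. It is only an illustration: the reported numbers come from the ndeval tool, and the function names and the `judgments` layout here are assumptions.

```python
CUTOFFS = (5, 10, 20, 30)


def subtopic_recall(ranking, judgments, k):
    """Fraction of a query's subtopics covered by the top-k documents.

    ranking is a list of doc ids in ranked order; judgments maps
    doc_id -> set of subtopics the document is relevant to.
    """
    all_subtopics = set().union(*judgments.values())
    if not all_subtopics:
        return 0.0
    covered = set()
    for doc in ranking[:k]:
        covered |= judgments.get(doc, set())
    return len(covered) / len(all_subtopics)


def recall_at_cutoffs(ranking, judgments, cutoffs=CUTOFFS):
    """Return {cutoff: subtopic recall} for one ranked list."""
    return {k: subtopic_recall(ranking, judgments, k) for k in cutoffs}
```

A ranking that repeats documents from the same aspect gains nothing under this measure, which is why it rewards diversified result lists.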

Citing

If you use the queries and relevance assessments from this work in your research, please cite:

Koniaris, M.; Anagnostopoulos, I.; Vassiliou, Y. Evaluation of Diversification Techniques for Legal Information Retrieval. Algorithms 2017, 10(1), 22; (doi:10.3390/a10010022).
