CIKM 2020 Resource Track

L. Gallagher, A. Mallia, J. S. Culpepper, T. Suel, and B. Barla Cambazoglu. 2020. Feature Extraction for Large-Scale Text Collections. In Proc. CIKM. DOI: https://doi.org/10.1145/3340531.3412773

This repository contains scripts to build the dataset and reproduce the experiments from the paper Feature Extraction for Large-Scale Text Collections, published in the CIKM 2020 Resource Track.

Download the LTR Dataset

The dataset is available for download at the links below:

  • Release 1.0.1 - cikm20ltr-1.0.1 (sha1 cca713f3d331921f4d5d3093832d5f182da79c25)
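
Before using the download, it can be worth checking it against the SHA-1 digest listed above. The following is a minimal sketch, assuming GNU coreutils' `sha1sum` is installed (on macOS, `shasum -a 1` prints the same format). The helper name `verify_sha1` and the archive path are hypothetical and not part of this repository.

```shell
# Hypothetical helper: compare a file's SHA-1 digest with an expected value.
verify_sha1() {
    file="$1"
    expected="$2"
    # sha1sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(sha1sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Usage (the path is a placeholder for whatever file the release link downloads):
# verify_sha1 path/to/downloaded-archive cca713f3d331921f4d5d3093832d5f182da79c25
```

A non-zero return status means the file is missing, truncated, or differs from the published release.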

Environment Setup

The following environment configuration was used to build the dataset and run the experiments. We assume you have a working conda installation (recommended).

Clone this repo and set up the Conda environment:

git clone https://github.com/ten-blue-links/cikm20
cd cikm20
git submodule update --init --recursive --depth 1
conda env create -f env.yml
conda activate cikm20fxt
./src/sh/lgbm.sh
pip install -r requirements.txt

Build the Dataset

The following details the prerequisites and steps to configure and run the build scripts.

Prerequisites

  • Indri index of ClueWeb09B (example config)
  • ~350GiB RAM
  • ~300GiB disk space
  • Webgraph data: ClueWeb09_WG_50m.graph-txt.gz and ClueB-ID-DOCNO.txt.tar.gz. Once downloaded:
    • ClueWeb09_WG_50m.graph-txt.gz: leave this compressed, as is.
    • ClueB-ID-DOCNO.txt.tar.gz: decompress to ClueB-ID-DOCNO.txt.
  • Gradle build system (used to build the AlexaRank data)
  • GCC 8.x (not tested with Clang)
  • Boost (tested with 1.65.1)
  • CMake 3.x
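
A quick pre-flight check can catch a missing tool or a nearly full disk before a long build starts. The sketch below is illustrative only: `missing_tools` and `free_gib` are hypothetical helpers, not scripts shipped with this repository, and the check does not cover RAM, Boost, or the Indri index.

```shell
# Print the name of every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Print the free space, in whole GiB, on the filesystem holding a directory.
free_gib() {
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Usage:
# missing_tools gcc g++ cmake gradle      # prints nothing if all are found
# gcc --version | head -n 1               # expect an 8.x release
# [ "$(free_gib .)" -ge 300 ] || echo "less than ~300GiB free here"
```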

Configure and Run the Build Process

  1. Copy the configuration template: cp config/dataset.dist config/dataset
  2. Edit config/dataset and configure the following variables:
    • INDRI_INDEX_PATH - path to existing ClueWeb09B Indri index (example config)
    • FXT_INDEX_PATH - path where the Fxt index will be created
    • BOOST_INCLUDE_PATH - path to Boost headers
    • BOOST_LIBRARY_PATH - path to Boost libraries
    • INDRI_INCLUDE_PATH - path to Indri headers
    • INDRI_LIBRARY_PATH - path to Indri libraries
    • WEBGRAPH_PATH - path to ClueWeb09_WG_50m.graph-txt.gz (gzipped)
    • GRAPHPAIRS_PATH - path to ClueB-ID-DOCNO.txt (decompressed)
  3. Run ./src/dataset/main.sh (build may take ~32 hours)
  4. The dataset files are written to build/cikm20ltr
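
Since the build may run for more than a day, it can help to confirm that the configured paths exist before starting it. This is a minimal sketch that ASSUMES config/dataset is a shell-sourceable file of NAME=value lines; that format is not confirmed here, and `check_paths` is a hypothetical helper. FXT_INDEX_PATH is left out of the usage example because the build creates it.

```shell
# Check that every named variable is set and points at an existing path.
check_paths() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name" >&2
            status=1
        elif [ ! -e "$value" ]; then
            echo "missing: $name=$value" >&2
            status=1
        fi
    done
    return $status
}

# Usage, ASSUMING config/dataset can be sourced by the shell:
# . config/dataset
# check_paths INDRI_INDEX_PATH BOOST_INCLUDE_PATH BOOST_LIBRARY_PATH \
#     INDRI_INCLUDE_PATH INDRI_LIBRARY_PATH WEBGRAPH_PATH GRAPHPAIRS_PATH
```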

AlexaRank Notes

The snapshot for the AlexaRank data is from 2010. This was the temporally closest working snapshot to Jan-Feb 2009 for ClueWeb09B.

Reproduce the LTR Experiments

The term reproduce is used as defined in the ACM artifacts policy. Note that the definitions of the terms replicate and reproduce were swapped in August 2020.

  1. Copy the configuration template: cp config/experiment.dist config/experiment
    1. If the dataset files are not in the default location build/cikm20ltr, edit config/experiment and set DATASETD to the correct path
  2. Run ./src/experiment/main.sh
  3. Print the results: for i in build/result/wt??/test/eval/*.txt; do echo $i; cat $i; done
  4. The TREC run files are written to build/result/wt??/test/run
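
If the run files are to be passed to other evaluation tools, a quick format check may be useful. The sketch below assumes the files use the standard six-column TREC run format (query id, Q0, document id, rank, score, run tag); `check_trec_run` is a hypothetical helper and not part of the experiment scripts.

```shell
# Succeed only if every non-empty line of the file has exactly six
# whitespace-separated fields, as in the standard TREC run format.
check_trec_run() {
    awk '
        BEGIN { bad = 0 }
        NF > 0 && NF != 6 { bad = 1; printf "line %d: %d fields\n", NR, NF }
        END { exit bad }
    ' "$1"
}

# Usage:
# for f in build/result/wt??/test/run/*; do
#     check_trec_run "$f" || echo "malformed: $f"
# done
```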

LambdaMART Effectiveness

The experiment scripts should be able to reproduce the following results:

| Test Queries                     | RBP 0.9     | NDCG 5 | NDCG 20 | AP    |
|----------------------------------|-------------|--------|---------|-------|
| Web Track 2009 (Topics 1-50)     | 0.286+0.344 | 0.298  | 0.296   | 0.219 |
| Web Track 2010 (Topics 51-100)   | 0.187+0.295 | 0.224  | 0.245   | 0.131 |
| Web Track 2011 (Topics 101-150)  | 0.132+0.139 | 0.235  | 0.199   | 0.117 |
| Web Track 2012 (Topics 151-200)  | 0.193+0.185 | 0.193  | 0.189   | 0.164 |
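
The RBP 0.9 values have the form a+b, which appears to follow the common convention of reporting rank-biased precision as score+residual. As an illustration of that convention only, the sketch below computes RBP = (1 - p) * sum over ranks i of r_i * p^(i-1) for a ranked list of gains, plus the residual p^d left beyond depth d. It assumes binary gains and that every ranked document is judged. `rbp` is a hypothetical helper, not the evaluation tool used by the experiment scripts.

```shell
# rbp P GAIN...  prints "score+residual" for a ranked list of gains in [0, 1].
rbp() {
    p="$1"
    shift
    # One gain per line; NR is the rank, so the weight at rank i is (1-p) * p^(i-1).
    printf '%s\n' "$@" | awk -v p="$p" '
        { score += (1 - p) * $1 * p ^ (NR - 1) }
        END { printf "%.3f+%.3f\n", score, p ^ NR }
    '
}

# Usage: relevant documents at ranks 1 and 3, persistence p = 0.9
# rbp 0.9 1 0 1
```

For the usage line shown, the score is 0.1 * (1 + 0.81) = 0.181 and the residual is 0.9^3 = 0.729, so the output is 0.181+0.729.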
