
data_input_text_reuse

Text Reuse input data creator

Create input data for BLAST-based text reuse detection from ECCO and EEBO texts.

Setup

Copy or symlink the ECCO text files into ./data/raw/eccotxt and the EEBO files into ./data/raw/eebotxt, keeping the original directory structure. For example, ECCO files would look something like this:

./data/raw/eccotxt/ECCO_I/ECCO_1of2/HistAndGeo/0146000100/xml/0146000100.txt
./data/raw/eccotxt/ECCO_I/ECCO_1of2/HistAndGeo/0146000500/xml/0146000500.txt
...

And EEBO would look like:

./data/raw/eebotxt/eebo_phase1/A0/A00678/A00678.headed.txt
./data/raw/eebotxt/eebo_phase1/A0/A00671/A00671.headed.txt
./data/raw/eebotxt/eebo_phase1/A0/A00671/A00671.headed_note_at_99798.txt
...
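If you prefer symlinking over copying, the step above can be sketched in Python as follows. This is only an illustration: the helper name link_corpus and the source path in the usage note are hypothetical and not part of this repository.

```python
from pathlib import Path


def link_corpus(source_dir, link_path):
    """Create link_path as a symlink pointing at source_dir.

    Hypothetical helper, not part of this repository.
    """
    source = Path(source_dir).resolve()
    link = Path(link_path)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # Make sure ./data/raw (or whatever the parent is) exists first.
    link.parent.mkdir(parents=True, exist_ok=True)
    # Refuse to overwrite an existing file, directory, or link.
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists")
    link.symlink_to(source, target_is_directory=True)
    return link
```

For example, link_corpus("/path/to/ECCO", "data/raw/eccotxt") would expose the original ECCO directory tree under ./data/raw/eccotxt without copying any files; "/path/to/ECCO" is a placeholder for wherever your copy of the corpus lives.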

Create index to ECCO and EEBO texts

Example:

python ecco_index.py --eccodir "/home/local/vvaara/projects/comhis/data_input_text_reuse/data/raw/eccotxt/" --eebodir "/home/local/vvaara/projects/comhis/data_input_text_reuse/data/raw/eebotxt/"
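To give a rough idea of what indexing the text files involves, here is a minimal sketch that maps each text file to its path. It is an assumption for illustration only: the function build_index and the choice of the file name (without .txt) as the key are hypothetical, and the actual logic and output format of ecco_index.py may differ.

```python
from pathlib import Path


def build_index(root):
    """Map each .txt file name (without the extension) to its path.

    Hypothetical sketch; not the actual logic of ecco_index.py.
    """
    index = {}
    for path in sorted(Path(root).rglob("*.txt")):
        # "0146000100.txt" -> "0146000100"
        # "A00678.headed.txt" -> "A00678.headed"
        index[path.stem] = str(path)
    return index
```

Keying on the full file stem keeps files such as A00671.headed.txt and A00671.headed_note_at_99798.txt as separate entries.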

Usage

python create_json.py -i doc_ids_hume.txt -o hume_history_for_text_reuse.json

Output data generator

Turn finished text reuse data (indices in CSV files) back into JSON with actual text indices, text fragments, and text IDs.

Usage

python generate_json.py --datadir DATADIR --outdir OUTPUTDIR --iter ITERATION_NUMBER

python generate_json_multiprocess_lmdb.py --datadir "../../text-reuse-verify/data/raw/qpi1" --outdir "../../text-reuse-verify/data/work/processed" --threads 8 --db "../../text-reuse-verify/data/blast_work/db/original_data_DB" --iter 3976 --tqdm
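The general idea of turning index rows back into text fragments can be sketched as below. This is a simplified illustration under stated assumptions, not the code in generate_json.py: the column names doc_id, start and end, the function rows_to_records, and the in-memory texts dictionary are all hypothetical.

```python
import csv
import io
import json


def rows_to_records(csv_text, texts):
    """Convert CSV rows of (doc_id, start, end) into JSON-ready records.

    Hypothetical sketch: column names and record layout are assumptions.
    texts maps a document ID to the full text of that document.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc_id = row["doc_id"]
        start, end = int(row["start"]), int(row["end"])
        records.append({
            "doc_id": doc_id,
            "start": start,
            "end": end,
            # Slice the original text to recover the reused fragment.
            "text": texts[doc_id][start:end],
        })
    return records


if __name__ == "__main__":
    sample_texts = {"A00678": "The history of England"}
    sample_csv = "doc_id,start,end\nA00678,4,11\n"
    print(json.dumps(rows_to_records(sample_csv, sample_texts), indent=2))
```

Here the texts are held in a plain dictionary for brevity. The --db option in the multiprocess command above suggests the real script reads them from an LMDB database instead.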
