data_input_text_reuse's Introduction

Text Reuse input data creator

Create input data for BLAST text reuse detection from ECCO and EEBO texts.

Setup

Copy or symlink the ECCO text files into ./data/raw/eccotxt and the EEBO files into ./data/raw/eebotxt, maintaining the original directory structure. For example, ECCO files would look something like this:

./data/raw/eccotxt/ECCO_I/ECCO_1of2/HistAndGeo/0146000100/xml/0146000100.txt
./data/raw/eccotxt/ECCO_I/ECCO_1of2/HistAndGeo/0146000500/xml/0146000500.txt
...

And EEBO would look like:

./data/raw/eebotxt/eebo_phase1/A0/A00678/A00678.headed.txt
./data/raw/eebotxt/eebo_phase1/A0/A00671/A00671.headed.txt
./data/raw/eebotxt/eebo_phase1/A0/A00671/A00671.headed_note_at_99798.txt
...
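To check that the files landed where the scripts expect them, the layout above can be walked with a few lines of Python. This is a minimal sketch, not the logic of ecco_index.py: the helper names `doc_id_from_path` and `collect_text_files` are hypothetical, and it assumes that the document id is the part of the file name before the first dot (so `A00671.headed_note_at_99798.txt` belongs to `A00671`).

```python
import os
from pathlib import Path


def doc_id_from_path(path):
    """Return the part of the file name before the first dot.

    Assumption (not confirmed by the scripts in this repository):
    '0146000100.txt' -> '0146000100' and
    'A00671.headed_note_at_99798.txt' -> 'A00671'.
    """
    return Path(path).name.split(".", 1)[0]


def collect_text_files(root):
    """Map each document id to the list of .txt files found under root.

    followlinks=True makes os.walk descend into symlinked directories,
    which matters when the corpus is symlinked rather than copied.
    """
    files_by_id = {}
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for filename in sorted(filenames):
            if filename.endswith(".txt"):
                path = Path(dirpath) / filename
                files_by_id.setdefault(doc_id_from_path(path), []).append(path)
    return files_by_id


if __name__ == "__main__":
    # Directory names follow the Setup section above.
    for name in ("eccotxt", "eebotxt"):
        root = Path("data/raw") / name
        found = collect_text_files(root)
        print(f"{root}: {len(found)} document ids")
```

If a directory is missing or empty, the count for it is simply 0, which is a quick sign that the copy or symlink step needs another look.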

Create index to ECCO and EEBO texts

Example:

python ecco_index.py --eccodir "/home/local/vvaara/projects/comhis/data_input_text_reuse/data/raw/eccotxt/" --eebodir "/home/local/vvaara/projects/comhis/data_input_text_reuse/data/raw/eebotxt/"

Usage

python create_json.py -i doc_ids_hume.txt -o hume_history_for_text_reuse.json

Output data generator

Convert finished text reuse data (indices in CSV files) back to JSON with the actual text indices, text fragments, and text IDs.

Usage

python generate_json.py --datadir DATADIR --outdir OUTPUTDIR --iter ITERATION_NUMBER

python generate_json_multiprocess_lmdb.py --datadir "../../text-reuse-verify/data/raw/qpi1" --outdir "../../text-reuse-verify/data/work/processed" --threads 8 --db "../../text-reuse-verify/data/blast_work/db/original_data_DB" --iter 3976 --tqdm
