Make data for BLAST text reuse detection from ECCO and EEBO.
Copy or symlink the ECCO text files into ./data/raw/eccotxt
and the EEBO files into ./data/raw/eebotxt,
maintaining the original directory structure. E.g. ECCO files would look something like this:
./data/raw/eccotxt/ECCO_I/ECCO_1of2/HistAndGeo/0146000100/xml/0146000100.txt
./data/raw/eccotxt/ECCO_I/ECCO_1of2/HistAndGeo/0146000500/xml/0146000500.txt
...
And EEBO files would look like this:
./data/raw/eebotxt/eebo_phase1/A0/A00678/A00678.headed.txt
./data/raw/eebotxt/eebo_phase1/A0/A00671/A00671.headed.txt
./data/raw/eebotxt/eebo_phase1/A0/A00671/A00671.headed_note_at_99798.txt
...
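
If the corpora live elsewhere on disk, the tree can be mirrored with symlinks instead of copied. The following is only a minimal sketch and is not part of this repository: the function name mirror_with_symlinks and the ".txt" suffix filter are assumptions made for illustration.

```python
import os
from pathlib import Path


def mirror_with_symlinks(src_root, dst_root, suffix=".txt"):
    """Recreate the directory tree of src_root under dst_root and
    symlink every file ending in `suffix`. Returns the number of new links.

    Hypothetical helper, not one of the scripts in this repository.
    """
    src_root = Path(src_root).resolve()  # absolute targets keep links valid
    dst_root = Path(dst_root)
    linked = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = Path(dirpath).relative_to(src_root)
        for name in filenames:
            if not name.endswith(suffix):
                continue
            target_dir = dst_root / rel_dir
            target_dir.mkdir(parents=True, exist_ok=True)
            link = target_dir / name
            # Skip anything already present so the function can be re-run.
            if link.is_symlink() or link.exists():
                continue
            link.symlink_to(Path(dirpath) / name)
            linked += 1
    return linked
```

For example, mirror_with_symlinks("/path/to/ECCO", "./data/raw/eccotxt") would keep sub-directories such as ECCO_I/ECCO_1of2/HistAndGeo/... intact under ./data/raw/eccotxt. The source path here is a placeholder.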
Example:
python ecco_index.py --eccodir "/home/local/vvaara/projects/comhis/data_input_text_reuse/data/raw/eccotxt/" --eebodir "/home/local/vvaara/projects/comhis/data_input_text_reuse/data/raw/eebotxt/"
python create_json.py -i doc_ids_hume.txt -o hume_history_for_text_reuse.json
Turn finished text reuse data (indices in CSV files) back into JSON with the actual text indices, text fragments and text IDs:
python generate_json.py --datadir <DATADIR> --outdir <OUTPUTDIR> --iter <ITERATION NUMBER TO PROCESS>
python generate_json_multiprocess_lmdb.py --datadir "../../text-reuse-verify/data/raw/qpi1" --outdir "../../text-reuse-verify/data/work/processed" --threads 8 --db "../../text-reuse-verify/data/blast_work/db/original_data_DB" --iter 3976 --tqdm
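
To illustrate the general idea of this step, here is a minimal sketch of turning offset rows into JSON records with text fragments. It is not the code of generate_json.py: the CSV column names (a_id, a_start, a_end, b_id, b_start, b_end), the half-open character offsets, and the in-memory dict of texts are all assumptions made for the example. The real scripts read their input from the data directory and, in the multiprocess variant, from the database given with --db.

```python
import csv
import io
import json


def hits_to_records(csv_text, texts):
    """Convert offset rows into records that carry the matching fragments.

    csv_text: CSV content with the hypothetical columns
              a_id, a_start, a_end, b_id, b_start, b_end
    texts:    dict mapping text id -> full text (stand-in for a real store)
    Offsets are assumed to be 0-based and half-open, like Python slices.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        a_start, a_end = int(row["a_start"]), int(row["a_end"])
        b_start, b_end = int(row["b_start"]), int(row["b_end"])
        records.append({
            "a_id": row["a_id"],
            "a_start": a_start,
            "a_end": a_end,
            "a_fragment": texts[row["a_id"]][a_start:a_end],
            "b_id": row["b_id"],
            "b_start": b_start,
            "b_end": b_end,
            "b_fragment": texts[row["b_id"]][b_start:b_end],
        })
    return records


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    sample_texts = {"A": "the quick brown fox", "B": "a quick brown dog"}
    sample_csv = "a_id,a_start,a_end,b_id,b_start,b_end\nA,4,15,B,2,13\n"
    print(json.dumps(hits_to_records(sample_csv, sample_texts), indent=2))
```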