"crawls_preprocessing" automates the pipeline from warcExtractor OR langSepa through to the selection of the langSepa outputs.
The program accepts the following input files:
- the output of the web crawler, in the format *.warc.gz
- the output of warcExtractor, in the format *.source
Depending on the input file format, the matching workflow is selected automatically.
- wf_from_warc(): *.warc.gz → warcExtractor → *.source → langSepa → *.txt → output_select
- wf_from_langSepa(): *.source → langSepa → *.txt → output_select → *.tar.gz
- These workflows are defined in wf_types.sh.
- The functions used in wf_types.sh are defined in wf_functions.sh and db_functions.sh.
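The automatic selection by input format could look roughly like the following. This is only a sketch: the function name select_workflow is hypothetical and not taken from wf_types.sh, and it only reports which workflow would be chosen instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher (not from wf_types.sh): prints the name of the
# workflow that matches the suffix of the given input file.
select_workflow() {
    case "$1" in
        *.warc.gz) echo "wf_from_warc" ;;       # raw crawler output
        *.source)  echo "wf_from_langSepa" ;;   # warcExtractor output
        *)         echo "unsupported input: $1" >&2; return 1 ;;
    esac
}

select_workflow "fin_news.00004.warc.gz"          # prints wf_from_warc
select_workflow "de_web_2015.01668.warc.source"   # prints wf_from_langSepa
```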
The prefix of an input file (followed by an underscore '_') shall follow these rules:
- a two-character prefix (tld-prefix) is used for files crawled by top-level domain (tld), e.g. de_web_2015.01668.warc.source
- a three-character prefix (lang-prefix) is used for files crawled by language, e.g. fin_news.00004.warc.gz
BASENAME := input file name without suffix
E.g. input file = de_web_2015.01668.warc.source --> BASENAME = de_web_2015.01668
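As an illustration of the naming rules above, BASENAME and the prefix type can be derived with plain bash parameter expansion. The function names get_basename and get_prefix_type are hypothetical, and the sketch assumes the suffix is always .warc.gz or .warc.source, as in the examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the directory and the .warc.gz / .warc.source
# suffix (assumed to be the only suffixes) to obtain BASENAME.
get_basename() {
    local name
    name=$(basename "$1")
    name=${name%.warc.gz}
    name=${name%.warc.source}
    echo "$name"
}

# Hypothetical helper: classify the prefix (text before the first '_')
# by its length: 2 characters = tld-prefix, 3 characters = lang-prefix.
get_prefix_type() {
    local name prefix
    name=$(basename "$1")
    prefix=${name%%_*}
    case ${#prefix} in
        2) echo "tld" ;;
        3) echo "lang" ;;
        *) echo "unknown" ;;
    esac
}

get_basename "de_web_2015.01668.warc.source"   # prints de_web_2015.01668
get_prefix_type "fin_news.00004.warc.gz"       # prints lang
```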
The output files from langSepa are selected by the following rules:
- only files matching LANG_0000.txt are kept
- which languages to keep (the LANG part) is determined by the prefix of the input *.source file:
  - tld-prefix: all languages listed under that tld in the file TLD_commonlang.txt are kept (if they exist after langSepa). If the tld-prefix is not in TLD_commonlang.txt, all outputs from langSepa are kept.
  - lang-prefix: only the 'lang' specified in the input file name is kept
Once the languages to keep are determined, the following naming and selection rules are applied to each language:
- BASENAME_LANG_stopwort.txt: if the language is on the stopwort_list.ini list, its output in the Stopwort folder is kept
- BASENAME_LANG_uni.txt & BASENAME_LANG_tri.txt: if the language is on the uni_trigramm_list.ini list, its outputs in the Trigramm AND Unigramm folders are kept
- BASENAME.all.tar.gz: if the language is not on any list above, all outputs from langSepa are kept
At the end of each workflow, the following .tar.gz files are generated and copied to $output:
- BASENAME.all.tar.gz: this file is generated in two cases:
  - if the tld-prefix is not in TLD_commonlang.txt
  - if any of the languages to keep is on neither stopwort_list.ini nor uni_trigramm_list.ini
- BASENAME.selected.tar.gz: if all wanted languages are listed in stopwort_list.ini or uni_trigramm_list.ini, all outputs of the workflow (BASENAME_LANG_stopwort.txt, BASENAME_LANG_uni.txt and BASENAME_LANG_tri.txt) are collected and compressed into this file
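The choice between the two archives can be sketched as below. This is not the project's implementation: the helper names on_list and archive_name are hypothetical, the list files are assumed to hold one language code per line (the real .ini layout may differ), and an empty language list is used here to stand for a tld-prefix that is missing from TLD_commonlang.txt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if language $1 is a whole line of list file $2.
# Assumes one language code per line, which may not match the real .ini files.
on_list() { grep -qxF -- "$1" "$2"; }

# Hypothetical helper.
# Usage: archive_name BASENAME STOPWORT_LIST UNI_TRIGRAMM_LIST [LANG...]
# Prints the archive name that the rules above would produce.
archive_name() {
    local base=$1 stop=$2 unitri=$3 lang
    shift 3
    # Assumption: no languages means the tld-prefix was not in TLD_commonlang.txt.
    if [ "$#" -eq 0 ]; then
        echo "$base.all.tar.gz"
        return 0
    fi
    for lang in "$@"; do
        # A single language on neither list forces the "all" archive.
        if ! on_list "$lang" "$stop" && ! on_list "$lang" "$unitri"; then
            echo "$base.all.tar.gz"
            return 0
        fi
    done
    echo "$base.selected.tar.gz"
}
```

For example, archive_name de_web_2015.01668 stopwort_list.ini uni_trigramm_list.ini deu eng prints the .selected.tar.gz name only if both deu and eng each appear on at least one of the two lists.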
To create the MySQL database tables that keep track of job status, run create_preprocessing_tables.sql.
Before running this preprocessing program, set up the following:
- set WORKING_DIRECTORY (absolute path) on the first line of the file src/preprocessing_main.sh
- set the parameters in cfg/prepro.cfg
  - when running on a new server, the fields to be changed are denoted with **
- NOTE: the file LangSepa.ini holds the parameter tuning for LangSepa.jar. Due to the parsing rules in the preprocessing program, please do not set PREFIX in LangSepa.ini!
- make a WORKING_DIRECTORY, e.g. prepro_0104
- copy the folder preprocessing into WORKING_DIRECTORY
- navigate into WORKING_DIRECTORY
- run preprocessing with (nohup) ./src/run_preprocessing.sh WORKING_DIRECTORY &
- workflows are run under WORKING_DIRECTORY/run_prepro
- outputs of each workflow are moved to ${destination}
- folders/files under WORKING_DIRECTORY:
  - create_preprocessing_tables.sql
  - cfg/
  - src/
  - run_prepro
  - FINISHED_JOB_LOG
  - log
- folders/files inside each workflow (WORKING_DIRECTORY/run_prepro/BASENAME):
  - BASENAME.log
  - f1_warc_gz_dir
  - f2_warcExtractor_dir
  - f3_langSepa_dir
  - f4_selected_output
TLD_commonlang.txt
stopwort_list.ini
uni_trigramm_list.ini
stopwort_uni_trigramm_list.ini
tools.jWarcEx-0.0.1-SNAPSHOT.jar
1. WORKING_DIRECTORY/log: documents the start of each workflow
2. WORKING_DIRECTORY/run_prepro/BASENAME/BASENAME.log: documents the stages of the workflow BASENAME
- the individual preprocessing logs (item 2) are moved to the FINISHED_JOB_LOG folder once the workflow has finished
- get folder (in run_preprocessing.sh): folders listed in cfg/piority_folder.ini are processed first; then all folders in ${file_source} are processed, oldest folder first
- get file list within each folder (in run_preprocessing.sh): all files that have not been modified in the last minute are saved to a list and passed to the "parallel" command to run
- check file status (in preprocessing_main.sh): the file is first checked against the database to see whether it has already been preprocessed:
  - whether the file is in the database
  - whether the file shows 'finished' in the "langSepa" column of the database
  If and only if both conditions are true, the original file in ${file_source} is removed and the next file on the input list is processed.
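The file-list step above (skipping files modified within the last minute, so that files still being written are not picked up) might be sketched with find. The function name list_stable_files is hypothetical, and run_preprocessing.sh may build its list differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list regular files directly under folder $1 that
# were last modified more than one minute ago. Exact behaviour right at
# the one-minute boundary depends on how find rounds file ages.
list_stable_files() {
    find "$1" -maxdepth 1 -type f -mmin +1
}

# Possible usage (an assumption, not the actual call in run_preprocessing.sh):
#   list_stable_files "$folder" | parallel ./src/preprocessing_main.sh {}
```

Files that are skipped in one pass are simply picked up in a later pass, once they have been left unmodified for long enough.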