First edition of course-dataset for Investigative Journalism project.
- Wanted PDFs found in "NOR-pdfs/" folder
- Code to be generalized so that functions and methods can be reused for other wanted course extractions
- World Championship courses to be extracted
- Find out whether all courses/pdfs for WC destinations are relevant, or how to distinguish/separate those that are
Backup "Misc/"
Backup folder for storage if needed. Used as the default value for several parameters of the methods in "course_extractor.py" when nothing else is specified. Otherwise empty.
html-pages stored in "NOR-courses/"
Holds the scraped .txt-files of the html-pages for each destination with "nationcode=nor" in its url
Name of file: The name scraped from the html url
(potential for normalization of names here: "%2C" is actually a comma and "+" is a single space)
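The name normalization mentioned above could be sketched roughly as follows, assuming the file names are ordinary URL-encoded values. The helper name `normalize_name` and the sample name are illustrative only and are not taken from "course_extractor.py".

```python
from urllib.parse import unquote_plus


def normalize_name(scraped_name: str) -> str:
    """Decode a URL-encoded name: "%2C" becomes "," and "+" becomes a space."""
    return unquote_plus(scraped_name)


# Hypothetical scraped name, for illustration only.
print(normalize_name("Bodo%2C+Norway"))
```

Note that a comma or space in a file name may be unwanted on some systems, so a further replacement step (for example space to "_") might still be needed after decoding.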
Actual course-image-files stored in "NOR-pdfs/"
All pdfs extracted from the urls in "partial-urls.txt"
Name of file: The name found in the scraping of the .txt-version of the current html-page.
Corresponding to NOR_*destination name*_*Homologation-code*.pdf
Example: NOR_Bodo_20_51-05_2-5.pdf
html-pages stored in "WC-courses/"
Holds the scraped .txt-files of the html-pages for each destination with "homologationlevel=WC" in its url
Name of file: The name scraped from the html url
(potential for normalization of names here: "%2C" is actually a comma and "+" is a single space)
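Selecting pages by a url parameter such as "homologationlevel=WC" or "nationcode=nor" could look something like the sketch below. The function name `has_query_value` and the example urls are assumptions made for illustration; the real site and its other parameters are not shown in this README.

```python
from urllib.parse import urlparse, parse_qs


def has_query_value(url: str, key: str, value: str) -> bool:
    """Return True if the url's query string contains key=value (value compared case-insensitively)."""
    params = parse_qs(urlparse(url).query)
    return any(v.lower() == value.lower() for v in params.get(key, []))


# Hypothetical urls, for illustration only.
urls = [
    "https://example.org/courses?homologationlevel=WC&place=Nove+Mesto",
    "https://example.org/courses?nationcode=nor&place=Bodo",
]
wc_urls = [u for u in urls if has_query_value(u, "homologationlevel", "WC")]
```

Parsing the query string rather than searching for a substring avoids accidental matches inside other parameter values.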
Actual course-image-files stored in "WC-pdfs/"
All pdfs extracted from the urls in "part-urls.txt"
Name of file: The name found in the scraping of the .txt-version of the current html-page.
Corresponding to *countrycode*_*destination name*_*Homologation-code*.pdf
Example: CZE_Nove_Mesto_na_Morave_WC21_03-02_1-4.pdf
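The naming scheme used in "NOR-pdfs/" and "WC-pdfs/" can be taken apart again with a small helper. This is a minimal sketch that assumes the homologation code always makes up the last three underscore-separated parts, which holds for the two examples in this README but is not confirmed for every file. The names `CourseFile` and `parse_course_filename` are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple


class CourseFile(NamedTuple):
    country: str
    destination: str
    homologation_code: str


def parse_course_filename(filename: str) -> CourseFile:
    """Split a pdf file name into country code, destination and homologation code.

    Assumption: the homologation code is always the last three
    underscore-separated parts of the name (without the .pdf suffix).
    """
    parts = Path(filename).stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"unexpected file name: {filename}")
    return CourseFile(
        country=parts[0],
        destination="_".join(parts[1:-3]),
        homologation_code="_".join(parts[-3:]),
    )


print(parse_course_filename("CZE_Nove_Mesto_na_Morave_WC21_03-02_1-4.pdf"))
```

Destination names that themselves contain underscores, such as Nove_Mesto_na_Morave, are kept together because the split is anchored on the first and the last three parts.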