fis-extraction's Introduction

Image extraction

First edition of the course dataset for the Investigative Journalism project.

The wanted PDFs are in the "NOR-pdfs/" folder.
The code is to be generalized so that its functions and methods can be reused for other course extractions.
World Championship (WC) courses are to be extracted.
Find out whether all courses/PDFs for WC destinations are relevant, or how to distinguish those that are.

Backup "Misc/"

Backup folder for storage if needed. Unless another folder is specified, it is the default value for several parameters of the methods in "course_extractor.py". Otherwise it is empty.

html-pages stored in "NOR-courses/"

Holds the scraped .txt files of the HTML pages for each destination with "nationcode=nor" in its URL
Name of file: the name scraped from the HTML page's URL
(the names could be normalized: "%2C" is a URL-encoded comma and "+" is a single space)
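
The normalization mentioned above could be done with Python's standard library. This is a minimal sketch, not code from the repository: the function name `normalize_scraped_name` and the sample name are made up for illustration.

```python
from urllib.parse import unquote_plus


def normalize_scraped_name(raw_name: str) -> str:
    """Decode a name taken from a URL query string.

    unquote_plus turns "+" into a space and percent-escapes such as
    "%2C" back into the characters they encode (here a comma).
    """
    return unquote_plus(raw_name)


if __name__ == "__main__":
    # Hypothetical scraped name, used only to show the decoding.
    print(normalize_scraped_name("Holmenkollen%2C+Oslo"))
```

If the decoded names are used as file names, characters such as commas and spaces may still need replacing, depending on the file system and on how the files are matched later.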

Actual course-image-files stored in "NOR-pdfs/"

All PDFs extracted from the URLs in "partial-urls.txt"
Name of file: the name found when scraping the .txt version of the current HTML page,
corresponding to NOR_*destination name*_*Homologation-code*.pdf
Example: NOR_Bodo_20_51-05_2-5.pdf

html-pages stored in "WC-courses/"

Holds the scraped .txt files of the HTML pages for each destination with "homologationlevel=WC" in its URL
Name of file: the name scraped from the HTML page's URL
(the names could be normalized: "%2C" is a URL-encoded comma and "+" is a single space)
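
The two course folders differ only in which query parameter the page URL carries ("nationcode=nor" or "homologationlevel=WC"). The following sketch shows one way such a check could be written; the helper `has_query_value` and the example.org URLs are hypothetical and do not come from the repository.

```python
from urllib.parse import parse_qs, urlsplit


def has_query_value(url: str, key: str, value: str) -> bool:
    """Return True if the URL's query string contains key=value.

    The comparison of the value is case-insensitive; the key must
    match exactly.
    """
    query = parse_qs(urlsplit(url).query)
    return any(v.lower() == value.lower() for v in query.get(key, []))


if __name__ == "__main__":
    # Placeholder URLs, used only to demonstrate the filter.
    urls = [
        "https://example.org/courses.html?nationcode=nor&place=Bodo",
        "https://example.org/courses.html?homologationlevel=WC&place=Oslo",
    ]
    nor_pages = [u for u in urls if has_query_value(u, "nationcode", "nor")]
    wc_pages = [u for u in urls if has_query_value(u, "homologationlevel", "WC")]
    print(nor_pages)
    print(wc_pages)
```

Parsing the query string instead of searching for a substring avoids false matches when the same text appears elsewhere in the URL.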

Actual course-image-files stored in "WC-pdfs/"

All PDFs extracted from the URLs in "part-urls.txt"
Name of file: the name found when scraping the .txt version of the current HTML page,
corresponding to *countrycode*_*destination name*_*Homologation-code*.pdf
Example: CZE_Nove_Mesto_na_Morave_WC21_03-02_1-4.pdf
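
For reuse in other extractions, the file names can be split back into their parts. The sketch below is an assumption based only on the two examples in this README: it treats the first underscore-separated token as the country code and the last three tokens as the homologation code, with everything in between as the destination. `CourseFile` and `parse_course_filename` are hypothetical names.

```python
from pathlib import Path
from typing import NamedTuple


class CourseFile(NamedTuple):
    country: str
    destination: str
    homologation: str


def parse_course_filename(filename: str) -> CourseFile:
    """Split a course PDF name into country, destination and code.

    Assumption: the homologation code always spans the last three
    underscore-separated tokens, as in both README examples.
    """
    parts = Path(filename).stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"unexpected course file name: {filename!r}")
    country = parts[0]
    destination = " ".join(parts[1:-3])   # may contain several words
    homologation = "_".join(parts[-3:])   # kept in its file-name form
    return CourseFile(country, destination, homologation)


if __name__ == "__main__":
    print(parse_course_filename("NOR_Bodo_20_51-05_2-5.pdf"))
    print(parse_course_filename("CZE_Nove_Mesto_na_Morave_WC21_03-02_1-4.pdf"))
```

Counting tokens from the end is what lets multi-word destinations such as "Nove_Mesto_na_Morave" parse correctly; a file whose code has a different number of parts would need a different rule.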

jeanetmu / fis-extraction