KommunenCrawler

Work in progress

Focused web crawler to search for and collect unstructured, decentralized data on a specific topic. Content-based relevance check for .html and .pdf documents via taxonomy or scikit-learn document classification. Written in Python 3.5.2. Based on the Spider by Bucky Roberts: https://github.com/buckyroberts/Spider. Thanks mate!

Originally designed for crawling websites of German municipalities and collecting data about informal citizen engagement. This is my very first Python project. It might not be very Pythonic in some parts. It might not be 'best-practice'. I am grateful for any constructive feedback and contribution.

Content:

  1. Background & Concept
  2. Used Non-Standard Libraries
  3. Components & How to use
  4. Next steps

1. Background and Concept

This prototype is part of a bachelor's thesis with the purpose of collecting data about the efforts of German municipalities in citizen engagement and participation. Starting with one or more given relevant pages (e.g. the first ten results of a Google search for a relevant keyword, limited to the municipal website), the crawler will visit every link from every relevant page, and from every child page of a relevant page. It stops at irrelevant child pages of irrelevant pages (see Illustration 1). This restriction was chosen to limit the crawl to the relevant parts of the municipality's website.

Illustration 1: Crawling path

The relevance assessment is based on the cosine similarity between the term-frequency/inverse-document-frequency (TF-IDF) vectors of the training dataset and the test document. Jana Vembunarayanan provides an easy-to-understand introduction to the topic. For each document a value between 0 and 1 is calculated. Documents with a relevance value higher than 0.2 are labeled as relevant and saved to a database.
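A minimal sketch of this kind of check with scikit-learn (the training sentences and function name are illustrative; only the 0.2 threshold is taken from the project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative training documents about citizen engagement (not the real training data).
training_docs = [
    "buergerbeteiligung workshop stadtentwicklung",
    "buergerhaushalt vorschlaege einreichen und abstimmen",
]

def is_relevant(document, threshold=0.2):
    """Label a document as relevant if its best cosine similarity
    to any training document exceeds the threshold."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(training_docs + [document])
    # Last row is the candidate document; the other rows are the training set.
    similarities = cosine_similarity(matrix[-1], matrix[:-1])
    return similarities.max() > threshold
```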

2. Used Non-Standard Python Libraries

This prototype was developed in Python 3.5.2. The following non-standard libraries are required to run it:

  • Required:

  • Additional:

    • google-api-python-client - Can be used to automatically add Google results for a specific keyword to the starting queue (see the sketch after this list). An API key is required.
    • BeautifulSoup 4 - Might be used to automatically add Google results for a specific keyword to the starting queue without an API key. Disclaimer: this would violate Google's Terms and Conditions of Use and you really shouldn't do this!
    • lxml 3.4.0 - See BeautifulSoup 4.
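For the API variant, fetching the first ten Google results for a keyword restricted to one site could look roughly like this (a sketch assuming a Custom Search Engine ID and an API key; the variable names are hypothetical):

```python
from googleapiclient.discovery import build

API_KEY = "..."   # hypothetical Google API key
CSE_ID = "..."    # hypothetical Custom Search Engine ID

def google_links(keyword, site):
    """Return the URLs of the first ten results for a keyword, limited to one site."""
    service = build("customsearch", "v1", developerKey=API_KEY)
    query = "{} site:{}".format(keyword, site)
    response = service.cse().list(q=query, cx=CSE_ID, num=10).execute()
    return [item["link"] for item in response.get("items", [])]
```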

3. Components & How to use

First run:

  1. Adjust configuration.yaml and logging.yaml according to your needs.
    • You should definitely set the number of threads according to the capabilities of your system.
  2. Run main.py once to create the data structures.
  3. Insert at least one entity into the input database, including its name, homepage, and a unique identifier.
  4. Run main.py again for each crawl.
  5. Enter the unique identifier when asked for the gkz.
  6. Create a starting queue:
    • Either uncomment 'queue = erstell_queue_quicklane(...)' or 'queue = erstelle_queue(...)' in main.py
    • When quicklane is chosen, the first ten search results for the default keyword, as well as the homepage, are added to the queue.
    • Otherwise you have more options:
    • You can insert links manually.
    • You can add a list of links: use a .txt file with one link per line, put it in a folder 'Linklisten', and enter the filename when you are asked for it (see the sketch after this list).
    • You can run a Google search for as many keywords as you want. The first 10 results will be added to the queue.
    • You can insert the already saved links from the output database into the queue (useful if you have changed the relevance criteria).
  7. Wait for the process to finish. You can stop and resume the crawl at any point.
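For the link-list option, a file in 'Linklisten' is just a plain text file with one URL per line; reading it could look like this (a sketch, not the project's actual code):

```python
import os

def read_link_list(filename, folder="Linklisten"):
    """Read a link list: one URL per line, blank lines ignored."""
    with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Example (hypothetical file name): queue_links = read_link_list("beispielstadt.txt")
```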

main.py

Initializes the crawl: reads in the configuration, instantiates the RelevanceChecker, and sets up the data structures. This includes creating an input and an output database, as well as folders for the queue files, the list of crawled pages, the output CSV, and downloaded PDF documents. It then creates the starting queue based on user settings and starts the frontier.
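A rough sketch of what that setup amounts to (folder names and database schemas below are assumptions for illustration; the real values come from configuration.yaml):

```python
import os
import sqlite3

# Assumed folder names; the actual names are read from configuration.yaml.
for folder in ("queue", "crawled", "output_csv", "pdf_downloads"):
    os.makedirs(folder, exist_ok=True)

# Input database: entities to crawl. Output database: pages judged relevant.
# The schemas below are illustrative, not the project's actual schema.
with sqlite3.connect("input.db") as con:
    con.execute("CREATE TABLE IF NOT EXISTS entities "
                "(gkz TEXT PRIMARY KEY, name TEXT, homepage TEXT)")
with sqlite3.connect("output.db") as con:
    con.execute("CREATE TABLE IF NOT EXISTS pages "
                "(url TEXT PRIMARY KEY, gkz TEXT, relevance REAL)")
```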

relevancechecker.py

Spider.py passes the HTML document or the PDF text to relevancecheck.runcheck(). The text of HTML documents is extracted using a boilerpipe extractor. The text is then preprocessed: some word forms are replaced, stopwords are removed, the document is tokenized (nltk word_tokenize), each token is stemmed (nltk snowball GermanStemmer), and words from the list of weighted words are added twice to the stemmed tokens. The TF-IDF values are calculated by the TfidfVectorizer from scikit-learn, and the cosine similarity is computed pairwise against each document in the training dataset.
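The preprocessing part could be sketched like this with the libraries named above (the stopword list and weighted words below are placeholders; the project ships its own lists):

```python
from nltk.stem.snowball import GermanStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = {"und", "der", "die", "das"}   # placeholder stopword list
WEIGHTED_WORDS = {"buergerbeteiligung"}    # placeholder weighted words

def preprocess(text):
    """Tokenize, remove stopwords, stem, and count weighted words twice."""
    stemmer = GermanStemmer()
    weighted_stems = {stemmer.stem(w) for w in WEIGHTED_WORDS}
    tokens = [t.lower() for t in word_tokenize(text, language="german")]
    stems = [stemmer.stem(t) for t in tokens if t not in STOPWORDS]
    # Adding weighted stems a second time boosts their TF-IDF weight.
    boosted = [s for s in stems if s in weighted_stems]
    return " ".join(stems + boosted)
```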

4. Next steps

* Improve training data, stopword list, and word weightings.
* Complete the documentation
* ...
