Focused Crawler

Originally designed as part of Virginia Tech's Crisis & Tragedy Recovery Network (CTRnet), this project crawls the internet and collects webpages related to a given topic, often for archival purposes.

FocusedCrawler.py

  • Driver class for this project
  • Creates the configuration and classifier objects, then calls the crawler

crawler.py

  • Crawler class responsible for collecting and exploring new URLs to find relevant pages
  • Takes a priority queue and a scoring object that provides a calculate_score(text) method
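
The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual crawler.py: the in-memory PAGES dictionary stands in for real HTTP fetching, and KeywordScorer, crawl, page_limit and threshold are hypothetical names. Only the calculate_score(text) method comes from the description above.

```python
import heapq
import re

# Hypothetical in-memory "web" so the sketch runs without network access.
# Each entry maps a URL to (page text, out-links).
PAGES = {
    "a": ("earthquake rescue teams arrive", ["b", "c"]),
    "b": ("earthquake damage report", ["d"]),
    "c": ("celebrity gossip column", ["e"]),
    "d": ("earthquake relief donations", []),
    "e": ("more gossip", []),
}


class KeywordScorer:
    """Hypothetical scorer; only calculate_score(text) is from the README."""

    def __init__(self, keywords):
        self.keywords = set(keywords)

    def calculate_score(self, text):
        # Fraction of tokens that are topic keywords.
        tokens = re.findall(r"[a-z]+", text.lower())
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in self.keywords) / len(tokens)


def crawl(seeds, scorer, page_limit=10, threshold=0.2):
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < page_limit:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        text, links = PAGES[url]
        score = scorer.calculate_score(text)
        if score >= threshold:
            relevant.append(url)
            # Only relevant pages are expanded; their out-links inherit
            # the parent's score as their priority.
            for link in links:
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return relevant
```

In this sketch an off-topic page is fetched and scored but its links are never followed, which is what keeps the crawl focused on the topic.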

classifier.py

  • Parent class of classifiers (non-VSM), including NaiveBayesClassifier and SVMClassifier
  • Contains code for tokenization and vectorization of document text using sklearn
  • Child classes only have to assign self.model
  • NBClassifier.py

    • Subclass of Classifier, representing a Naïve Bayes classifier
  • SVMClassifier.py

    • Subclass of Classifier, representing an SVM classifier
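
The parent/child split described above might look something like the sketch below. It assumes scikit-learn is installed and is not the project's actual code: the train method, the choice of TfidfVectorizer, MultinomialNB and LinearSVC, and the convention that calculate_score returns a 0/1 label are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


class Classifier:
    """Parent class: owns vectorization; subclasses only assign self.model."""

    def __init__(self):
        # Assumed vectorizer; the README only says sklearn is used.
        self.vectorizer = TfidfVectorizer()
        self.model = None

    def train(self, docs, labels):
        # Hypothetical method name: learn the vocabulary, then fit the model.
        features = self.vectorizer.fit_transform(docs)
        self.model.fit(features, labels)

    def calculate_score(self, text):
        # Returns the predicted label (1 = relevant, 0 = not relevant).
        features = self.vectorizer.transform([text])
        return int(self.model.predict(features)[0])


class NaiveBayesClassifier(Classifier):
    def __init__(self):
        super().__init__()
        self.model = MultinomialNB()


class SVMClassifier(Classifier):
    def __init__(self):
        super().__init__()
        self.model = LinearSVC()
```

Because tokenization, vectorization and scoring all live in the parent, swapping in a different sklearn estimator takes one more subclass with a single assignment.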

scorer.py

  • Parent class of scorers, which are non-classifier models, typically VSM
  • tfidfscorer.py

    • Subclass of Scorer, representing a tf-idf vector space model
  • lsiscorer.py

    • Subclass of Scorer representing an LSI vector space model
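
To illustrate how a tf-idf vector space scorer can rate a page, here is a small pure-Python sketch. It is an assumption-laden stand-in for tfidfscorer.py, not a copy of it: the tokenize helper, the smoothed idf formula, and the idea of building one topic vector from a list of relevant documents are illustrative choices.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


class Scorer:
    """Parent class: subclasses implement calculate_score(text)."""

    def calculate_score(self, text):
        raise NotImplementedError


class TfidfScorer(Scorer):
    def __init__(self, relevant_docs):
        docs = [tokenize(d) for d in relevant_docs]
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        # Smoothed idf so terms seen in every document keep a nonzero weight.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        # The topic vector is the tf-idf vector of all relevant text combined.
        self.topic = self._vector([t for doc in docs for t in doc])

    def _vector(self, tokens):
        tf = Counter(tokens)
        # Terms outside the training vocabulary are ignored.
        return {t: tf[t] * self.idf[t] for t in tf if t in self.idf}

    def calculate_score(self, text):
        # Cosine similarity between the page vector and the topic vector.
        vec = self._vector(tokenize(text))
        dot = sum(w * self.topic.get(t, 0.0) for t, w in vec.items())
        page_norm = math.sqrt(sum(w * w for w in vec.values()))
        topic_norm = math.sqrt(sum(w * w for w in self.topic.values()))
        norm = page_norm * topic_norm
        return dot / norm if norm else 0.0
```

A page that shares no vocabulary with the relevant documents scores 0.0, while pages that reuse topic terms score somewhere between 0 and 1.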

config.ini

  • Configuration file for the focused crawler, in INI format

config.py

  • Class responsible for reading configuration file, using ConfigParser
  • Adds all configuration options to its internal dictionary (e.g. config["seedFile"])
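
As a rough sketch of the behaviour described above, the class below reads INI text and exposes every option through dictionary-style access. It uses the Python 3 module name configparser (the module was called ConfigParser in Python 2). The [crawler] section and the pageLimit option are made up for the example; only seedFile is named in this README.

```python
import configparser

# Hypothetical config.ini contents; only seedFile is named in the README.
SAMPLE_INI = """
[crawler]
seedFile = seeds.txt
pageLimit = 100
"""


class Config:
    def __init__(self, ini_text):
        parser = configparser.ConfigParser()
        # Keep option names case-sensitive; by default they are lowercased,
        # which would turn "seedFile" into "seedfile".
        parser.optionxform = str
        parser.read_string(ini_text)
        # Flatten every section into one internal dictionary.
        self._options = {}
        for section in parser.sections():
            for key, value in parser.items(section):
                self._options[key] = value

    def __getitem__(self, key):
        return self._options[key]


config = Config(SAMPLE_INI)
```

Values come back as strings, so a numeric option such as the hypothetical pageLimit would need an explicit int(...) conversion before use.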

utils.py

  • Contains various utility functions relating to reading files and sanitizing/tokenizing text

seeds.txt

  • Contains the URLs of relevant pages from which the focused crawler starts
  • seeds.txt is the default file name; it can be changed in config.ini

priorityQueue.py

  • Simple implementation of a priority queue using a heap
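
A heap-backed priority queue of this kind can be sketched with the standard heapq module. The push and pop method names, and the choice to return the highest score first, are assumptions made for illustration and may differ from priorityQueue.py.

```python
import heapq
import itertools


class PriorityQueue:
    def __init__(self):
        self._heap = []
        # Tie-breaker: items with equal scores pop in insertion order.
        self._counter = itertools.count()

    def push(self, score, item):
        # heapq is a min-heap, so the score is negated to make the
        # highest-scoring item come out first.
        heapq.heappush(self._heap, (-score, next(self._counter), item))

    def pop(self):
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

    def __len__(self):
        return len(self._heap)
```

The counter also means items are never compared with each other directly, so the queue can hold values that do not define an ordering.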

webpage.py

  • Uses BeautifulSoup and nltk to extract webpage text

FocusedCrawlerReport.docx

For the full technical report, please visit: https://docs.google.com/file/d/0B436PtOU57sJZkc5anMyNDZPaHM/edit?usp=sharing
