Logo Extraction

This program extracts the URL of a website's logo, given the website's URL.

Getting started

This program works with Python 3. To get started, clone this repository with Git or download it. The program was tested on Windows 10, Linux (Ubuntu 16) and Mac OS X, using Python 3.5 and higher in all of these tests.

Prerequisites

Firefox needs to be installed on your machine.

This program requires the selenium package. Installation instructions can be found here. (Note: There are specific instructions for Windows users.)

In general, running the following command on the command line should install selenium successfully (assuming pip is already installed):

pip install selenium

A headless browser is used to fetch webpage content. The program uses the Firefox driver, which comes by default with the Selenium package. The headless browser also needs geckodriver, which can be found here. Download the driver that matches your machine.
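As a rough sketch of what starting a headless Firefox session can look like, assuming Selenium 4 or later (older releases accepted the driver path through a different argument). The function names `make_headless_driver` and `fetch_page_source` are hypothetical and are not taken from this repository.

```python
def make_headless_driver(geckodriver_path):
    """Start Firefox without a visible window, using the given geckodriver binary."""
    # Imported inside the function so this module can still be loaded on a
    # machine where Selenium has not been installed yet.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.firefox.service import Service

    options = Options()
    options.add_argument("-headless")  # run Firefox with no GUI
    service = Service(executable_path=geckodriver_path)
    return webdriver.Firefox(service=service, options=options)


def fetch_page_source(url, geckodriver_path):
    """Load a page in headless Firefox and return its rendered HTML."""
    driver = make_headless_driver(geckodriver_path)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Calling `driver.quit()` in a `finally` block keeps stray Firefox processes from piling up when a page fails to load.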

Config file

The config file is in the Logo_Extraction_master directory. Set the path to geckodriver in it.
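The layout of the config file is not shown here. Purely as an illustration, assuming an INI-style file with a hypothetical `[driver]` section and `geckodriver_path` key, the path could be read with the standard `configparser` module:

```python
import configparser


def read_geckodriver_path(config_text):
    """Return the geckodriver path from INI-style config text.

    The section name "driver" and the key "geckodriver_path" are
    assumptions made for this sketch, not the repository's actual names.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get("driver", "geckodriver_path")


# Example config content with a placeholder path.
example_config = """
[driver]
geckodriver_path = /usr/local/bin/geckodriver
"""

print(read_geckodriver_path(example_config))
```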

Running the script

logo_extraction.py

This file performs the logo extraction task. It accepts either an input file or a URL from the command line.

To run the file:

  1. Open command prompt (The application is tested on anaconda command prompt) and cd to the Logo_Extraction_master directory.
  2. Run the file as follows:
  • For an input file use:
python logo_extraction.py /path/to/your/input file/your_input_file.txt
  • For a url use:
python logo_extraction.py http://python.org
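One way a script could tell the two kinds of argument apart is sketched below. This is not necessarily how logo_extraction.py does it; the helper names and the one-URL-per-line input file layout are assumptions.

```python
from urllib.parse import urlparse


def is_url(argument):
    """Return True if the argument looks like an http(s) URL."""
    parsed = urlparse(argument)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def collect_urls(argument):
    """Return the list of URLs named by a command-line argument.

    A URL is returned as a one-element list; anything else is treated as
    the path to a text file with one URL per line (an assumed layout).
    """
    if is_url(argument):
        return [argument]
    with open(argument, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

Checking the scheme explicitly means a Windows path such as `C:\urls.txt`, whose drive letter parses as a scheme, is still treated as a file.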

The output is written to a file named output.txt in the same directory. The output format is website url, logo_url. For some websites the logo is just stylized text; in such cases, the logo url is left blank.
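A minimal sketch of writing results in this shape, with a blank second field when no logo URL was found. The function name `write_results`, the exact delimiter, and the example URLs are assumptions for illustration only.

```python
import csv
import io


def write_results(results, handle):
    """Write (website_url, logo_url) pairs, one pair per line.

    A logo_url of None is written as an empty field.
    """
    writer = csv.writer(handle, lineterminator="\n")
    for website_url, logo_url in results:
        writer.writerow([website_url, logo_url or ""])


# In-memory buffer standing in for output.txt.
buffer = io.StringIO()
write_results(
    [
        ("http://example.com", "http://example.com/logo.png"),
        ("http://example.org", None),  # text-only logo, so no URL
    ],
    buffer,
)
print(buffer.getvalue())
```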

Running tests

logo_extraction_test.py

  1. Open command prompt (The application is tested on anaconda command prompt) and cd to the Logo_Extraction_master directory.
  2. Use the command:
python -m unittest -v logo_extraction_test.py

Interpreting logs

The log is written to the logo_extraction.log file in the same directory. It records the number of URLs processed, which tags the logo URLs were taken from, and errors such as invalid URLs.
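To make the log contents more concrete, here is a hedged sketch of how such entries could be produced with the standard `logging` module. The record layout, message wording, and function names are assumptions and will not match the script's actual log lines.

```python
import logging
from collections import Counter

logger = logging.getLogger("logo_extraction")


def configure_logging(log_path="logo_extraction.log"):
    """Send log records to a file in the current directory."""
    logging.basicConfig(
        filename=log_path,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def log_summary(records):
    """Log totals for one run and return the per-tag counts.

    Each record is (url, source_tag, error): source_tag is the HTML tag
    the logo URL was taken from (or None), error is a message (or None).
    """
    tag_counts = Counter(tag for _, tag, error in records if tag and not error)
    logger.info("URLs processed: %d", len(records))
    for tag, count in sorted(tag_counts.items()):
        logger.info("Logo URLs from <%s>: %d", tag, count)
    for url, _, error in records:
        if error:
            logger.error("%s: %s", url, error)
    return tag_counts
```

Calling `configure_logging()` once at startup and `log_summary(...)` at the end of a run would give a file with one line per total and one line per failed URL.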

Apart from the log generated by the script, geckodriver writes its own log, geckodriver.log, to the same directory. Refer to it for additional information.

Experimental Files

The Experimental directory contains a logo extraction implementation that attempts to parallelize the work with the Pool class from Python's multiprocessing library. It is not included in the final implementation because the WebDriver in selenium is not thread-safe.

Contributors

ssb10