Swifind

License: MIT

Overview

Swifind is a web scraping function builder. It is a toolset that makes web scraping functions simpler and more modular. It comes with its own scripting language (swipl) for planning web scraping and crawling strategies. A swipl script is interpreted into a sequence of Python functions, which makes it easier to recreate, reuse, or modify a web scraping script. It can be run as a standalone script or attached to an existing project.

Please check out these amazing libraries that I used to develop this project:

  1. BeautifulSoup to parse HTML pages.
  2. Requests to retrieve web pages.
  3. lxml to enable lxml parsing with BeautifulSoup.

Requirements

  • Python >= 3.6

Workflow

Swifind works in three simple phases:

Initiation

Catfish is initiated with the path of a swipl script as its argument. Catfish interprets and validates the script, then extracts information from it. That information is stored as a sequence of functions, held in a Strategy in the form of Plans. Catfish uses a Bag as the container for extracted or scraped data.

How does it work?
(Figure: interpretation flow)
  1. The swipl script is validated by Validator, which checks the syntax of each line or block of components. If there is an error, an exception is raised. Everything that passes is parsed into validated components.
  2. The validated components are used to generate a plan blueprint with Extractor. Extractor returns a function and an initiated Plan.
  3. Each Plan is assembled into a linked list of Plans. This sequence of Plans is assigned to the Strategy attached to the existing Catfish.
  4. Catfish uses its Strategy to carry out scraping and crawling activities.
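
The four steps above can be sketched in plain Python. This is a minimal illustration, not Swifind's actual implementation: the names KNOWN_ACTIVITIES, validate, and build_strategy, and the fields on Plan, are hypothetical, and validation is reduced to checking the activity keyword.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical: the two activities this README documents.
KNOWN_ACTIVITIES = {"ORIGIN", "PICK"}


@dataclass
class Plan:
    """One node in a singly linked list of planned activities."""
    activity: str
    arguments: str
    next: Optional["Plan"] = None


def validate(line: str, lineno: int) -> tuple:
    """Check one non-blank script line and split it into (activity, arguments)."""
    parts = line.split(maxsplit=1)
    activity = parts[0]
    if activity not in KNOWN_ACTIVITIES:
        raise ValueError(f"line {lineno}: unknown activity {activity!r}")
    if len(parts) < 2:
        raise ValueError(f"line {lineno}: {activity} needs arguments")
    return activity, parts[1]


def build_strategy(script: str) -> Optional[Plan]:
    """Validate every line, then chain the resulting Plans in script order."""
    head = tail = None
    for lineno, line in enumerate(script.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        plan = Plan(*validate(line, lineno))
        if head is None:
            head = plan
        else:
            tail.next = plan
        tail = plan
    return head


head = build_strategy("ORIGIN http://example.com\nPICK title 'h1*'")
```

In this sketch, a misspelled keyword such as IGIN raises ValueError during validation, before any Plan is linked.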

Swimming

Catfish executes every function assigned to its Strategy. Each Plan in the Strategy is executed in order, starting from the Strategy's origin. For data collection activities, each piece of scraped information is stored in the Bag.
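
As a rough sketch of this phase, the loop below walks a chain of plans from its origin and lets each one write into a bag. Everything here is an assumption made for illustration: the Plan class, the chain and swim helpers, the "logs" key, and the dictionary standing in for a fetched page are hypothetical and do not mirror Swifind's internals.

```python
class Plan:
    """Hypothetical plan node: a name, a function to run, and a link to the next plan."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.next = None


def chain(plans):
    """Link the plans in order and return the first one (the origin)."""
    for current, following in zip(plans, plans[1:]):
        current.next = following
    return plans[0]


def swim(origin, page):
    """Run every plan starting from the origin, collecting items and a simple log."""
    bag = {"items": {}, "logs": []}
    plan = origin
    while plan is not None:
        plan.func(page, bag)
        bag["logs"].append(plan.name)
        plan = plan.next
    return bag


def pick_title(page, bag):
    bag["items"]["title"] = page["h1"]


def pick_link(page, bag):
    bag["items"]["link"] = page["a_href"]


# A plain dict stands in for a fetched and parsed page, so no network access is needed.
page = {"h1": "Title Example", "a_href": "/link"}
origin = chain([Plan("PICK title", pick_title), Plan("PICK link", pick_link)])
bag = swim(origin, page)
```

Keeping the traversal separate from the individual plan functions means a new activity only needs a new function, not a change to the loop.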

Swipl Activity
Currently, there are two activities available in `swipl`:
  • ORIGIN: defines the starting point of Catfish (the first page).
  • PICK: defines an information extraction activity.

For more information about swipl activity definitions and usage, read this doc.
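
To make the shape of the two activities concrete, here is a small sketch that splits an activity line into named parts. The argument layout (ORIGIN followed by a URL; PICK followed by a name, a quoted selector, and an optional attribute) is inferred from the example script later in this README, and parse_activity is a hypothetical helper, not part of Swifind. It relies on shlex from the standard library to keep a quoted selector together as one token.

```python
import shlex


def parse_activity(line):
    """Split one swipl-style line into a dict of named parts (illustrative only)."""
    tokens = shlex.split(line)  # quoted selectors stay together as a single token
    if not tokens:
        raise ValueError("empty line")
    activity, args = tokens[0], tokens[1:]

    if activity == "ORIGIN":
        if len(args) != 1:
            raise ValueError("ORIGIN expects exactly one URL")
        return {"activity": "ORIGIN", "url": args[0]}

    if activity == "PICK":
        if len(args) not in (2, 3):
            raise ValueError("PICK expects a name, a selector, and an optional attribute")
        attribute = args[2] if len(args) == 3 else None
        return {
            "activity": "PICK",
            "name": args[0],
            "selector": args[1],
            "attribute": attribute,
        }

    raise ValueError(f"unknown activity: {activity!r}")


origin = parse_activity("ORIGIN http://example.com")
pick = parse_activity("PICK link 'div a' href")
```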

Unpacking

Catfish returns all collected items inside its Bag. The Bag also contains activity (journey) logs, which can be retrieved with Catfish's unpack method.

Example

For example, imagine there is a website (http://example.com) with the following HTML structure:

<body>
  <div class="container">
    <h1>Title Example</h1>
    <a href="/link">Example Link</a>        
    <ul>
      <li>First Item</li>
      <li>Second Item</li>
      <li>Third Item</li>
    </ul>
  </div>
</body>

We then plan to extract several things:

  • The title of the page, which we name title.
  • The link target of the example link, which we name link.
  • The second element of the unordered list, which we name second_elm.

Below is the swipl script that extracts those things; we name it example.swipl:

ORIGIN http://example.com
PICK title 'h1*'
PICK link 'div a' href
PICK second_elm 'ul* li[1]'

To use this script, we write a Python script as follows:

from swifind.catfish import Catfish

cf = Catfish('example.swipl')
cf.swim()
result = cf.unpack()

*The example above assumes the swipl script and the Python script are in the same directory.

The result will contain the extracted information as follows:

{
  "items":{
    "title": "Title Example",
    "link": "/link",
    "second_elm": "Second Item"
  }
}
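
Note that link comes back exactly as written in the href attribute, so here it is a relative path. A minimal sketch of resolving it against the origin with the standard library follows; it assumes the unpacked result behaves like a Python dictionary with the structure shown above, and the origin variable is introduced only for this illustration.

```python
from urllib.parse import urljoin

# Assumed shape of the unpacked result, copied from the output above.
result = {
    "items": {
        "title": "Title Example",
        "link": "/link",
        "second_elm": "Second Item",
    }
}

origin = "http://example.com"  # same URL as the ORIGIN line in example.swipl

# Resolve the relative href against the page it was scraped from.
absolute_link = urljoin(origin, result["items"]["link"])
print(absolute_link)  # http://example.com/link
```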

swifind's Issues

Typo on example.py and example.swipl

Hi, I encountered a typo that prevents the example from executing:

On example.py:
path = 'exampl.swipl' should be path = 'example.swipl'

On example.swipl:
IGIN https://quotes.toscrape.com/ should be ORIGIN https://quotes.toscrape.com/
