Scala Webscraper 0.4.1

Getting started

The project is built with Scala 2.10.2 and sbt 0.13.0; both can be installed using this install script.

To try the example, navigate to the project folder and run sbt "project scraper-demo" run, which starts the example scraper.

Installation

If you use sbt, add the following to build.sbt:

libraryDependencies += "nl.razko" %% "scraper" % "0.4.1"

To use bleeding-edge snapshot versions, add the Sonatype snapshots repository to the resolvers and depend on the SNAPSHOT version:

resolvers += "Sonatype Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"

libraryDependencies += "nl.razko" %% "scraper" % "0.4.1-SNAPSHOT"

DSL

The webscraper provides a simple DSL for writing scrape rules:

import org.rovak.scraper.ScrapeManager._
import org.jsoup.nodes.Element

object Google {
  val results = "#res li.g h3.r a"
  def search(term: String) = {
    "http://www.google.com/search?q=" + term.replace(" ", "+")
  }
}

// Open the search results page for the query "php elephant"
scrape from Google.search("php elephant") open { implicit page =>

  // Iterate through every result link
  Google.results each { x: Element =>
  
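    // Take the absolute link target and drop its first 28 characters,
    // presumably the "http://www.google.com/url?q=" redirect prefix
    val link = x.select("a[href]").attr("abs:href").substring(28)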
    if (link.isValidURL) {

      // Iterate through every found link in the found page
      scrape from link each (x => println("found: " + x))
    }
  }
}
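
The search helper above replaces only spaces, and the substring(28) call appears to strip Google's redirect prefix http://www.google.com/url?q= (which is 28 characters long). A slightly more defensive sketch of both steps, using only the JDK's java.net.URLEncoder, could look like this. The names SearchUrls, searchUrl and stripRedirect are hypothetical and not part of the library:

```scala
import java.net.URLEncoder

object SearchUrls {
  // Assumed redirect prefix; it is 28 characters long, matching substring(28) above
  val RedirectPrefix = "http://www.google.com/url?q="

  // Encode the whole query term instead of replacing only spaces
  def searchUrl(term: String): String =
    "http://www.google.com/search?q=" + URLEncoder.encode(term, "UTF-8")

  // Drop the redirect prefix only when it is actually present
  def stripRedirect(href: String): String =
    if (href.startsWith(RedirectPrefix)) href.substring(RedirectPrefix.length)
    else href
}
```

With this helper, a term such as "c++ & scala" is encoded safely, and links that do not carry the redirect prefix are left untouched instead of being truncated.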

Spiders

A spider is a scraper which recursively loads a page and opens every link it finds. It will keep scraping until all pages within the allowed domains are visited once.

The following snippet demonstrates a basic spider that crawls a website and provides hooks for processing each received page and each discovered link:

new Spider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()
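
The crawl behaviour described above (follow every link, stay within the allowed domains, visit each page once) can be sketched as a breadth-first traversal with a visited set. This is only an illustration of the idea, not the library's implementation; CrawlSketch, hostOf, crawl and the linksOf lookup function are hypothetical names, and the lookup function stands in for real page fetching:

```scala
import java.net.URI
import scala.collection.mutable

object CrawlSketch {
  // Host part of a URL, or "" when the URL has no host or cannot be parsed
  def hostOf(url: String): String =
    try Option(new URI(url).getHost).getOrElse("")
    catch { case _: java.net.URISyntaxException => "" }

  // Breadth-first crawl; linksOf returns the links found on a page.
  // Returns the visited pages in the order they were visited.
  def crawl(startUrls: List[String],
            allowedDomains: Set[String],
            linksOf: String => List[String]): List[String] = {
    val visited = mutable.LinkedHashSet.empty[String]
    val queue = mutable.Queue(startUrls: _*)
    while (queue.nonEmpty) {
      val url = queue.dequeue()
      // Skip pages outside the allowed domains and pages already seen
      if (allowedDomains.contains(hostOf(url)) && !visited.contains(url)) {
        visited += url
        queue ++= linksOf(url) // enqueue every link found on this page
      }
    }
    visited.toList
  }
}
```

Passing a lookup backed by an in-memory Map makes the traversal easy to try without network access. Links that point outside allowedDomains are dequeued and dropped, and the visited set guarantees termination even when pages link back to each other.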

The spider can be extended by mixing in traits. For example, to scrape email addresses, add the EmailSpider trait, which offers a new onEmailFound hook in which emails can be collected.

new Spider with EmailSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onEmailFound ::= { email: String =>
    // Email found
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()
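
The ::= syntax used for the hooks is ordinary Scala: when a var holds a List and no ::= method exists, x ::= y is rewritten to x = y :: x, so each call prepends a callback. The following sketch shows how a trait could add a hook on top of a base spider in that style. It is an assumption about the general pattern, not the library's code; MiniSpider, MiniEmailSpider and receive are hypothetical names, and pages are modelled as plain strings:

```scala
// Hypothetical mini-spider: hooks are plain lists of callbacks held in vars
trait MiniSpider {
  var onReceivedPage: List[String => Unit] = Nil

  // Run every registered page hook on the given page body
  def receive(body: String): Unit = onReceivedPage.foreach(hook => hook(body))
}

// Stackable trait that adds an email hook on top of the page hook
trait MiniEmailSpider extends MiniSpider {
  var onEmailFound: List[String => Unit] = Nil

  // Deliberately simple pattern; real-world address matching is more involved
  private val Email = """[\w.+-]+@[\w-]+(\.[\w-]+)+""".r

  // Register a page hook that forwards every match to the email hooks
  onReceivedPage ::= { (body: String) =>
    for (email <- Email.findAllIn(body); hook <- onEmailFound) hook(email)
  }
}
```

Because ::= prepends, hooks registered later run first in this sketch. The trait's own page hook is registered during trait initialization, before the body of the anonymous subclass runs.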

Multiple spider traits can be mixed together:

new Spider with EmailSpider with SitemapSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"
  sitemapUrls ::= "http://events.stanford.edu/sitemap.xml"

  onEmailFound ::= { email: String =>
    println("Found email: " + email)
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()

Contributors

rovak, tomer-ben-david
