webmuncher's Introduction

webmuncher is a tool for retrieving all the contents of a website or, more accurately, all the contents under a single domain. That is the use case it was originally created for; you can read more about that [here](http://geekabyte.blogspot.be/2014/12/a-web-scrapercrawler-in-java-krwkrw.html).

webmuncher is available from Maven Central, and you can add it to your project with these coordinates:

Maven:

    <dependency>
        <groupId>com.blogspot.geekabyte.webmuncher</groupId>
        <artifactId>webmuncher</artifactId>
        <version>${webmuncher.version}</version>
    </dependency>

Gradle:

    dependencies {
        compile "com.blogspot.geekabyte.webmuncher:webmuncher:${webmuncher.version}"
    }

Alternatively, you can build from source and put the resulting jar on your classpath.

The available releases are listed [here](https://github.com/dadepo/webmuncher/releases).

The announcement for the most recent release can be seen here

### How to use webmuncher

webmuncher is designed around the [Strategy Pattern](http://en.wikipedia.org/wiki/Strategy_pattern). The main object you work with is the webmuncher object. As the client, you provide an implementation of the FetchAction interface, which contains the code that operates on every fetched page. Each fetched page is represented by a FetchedPage object.

The FetchAction interface has only one method that needs to be implemented: execute(). The execute() method is given a FetchedPage object, which contains the information extracted from each crawled page, e.g. the HTML content of the page, the URI of the page, the title of the page, and the time it took webmuncher to retrieve the page.
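
As a rough illustration of that contract, here is a minimal sketch of a custom action that collects the title and URL of every page it receives. The `FetchedPage` and `FetchAction` types below are simplified stand-ins written for this example so that it compiles on its own; they are assumptions, not the library's real classes, which carry more fields. `TitleCollector` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the library's FetchedPage; only two of its fields are modelled.
class FetchedPage {
    private final String url;
    private final String title;

    FetchedPage(String url, String title) {
        this.url = url;
        this.title = title;
    }

    String getUrl() { return url; }
    String getTitle() { return title; }
}

// Stand-in for the library's single-method FetchAction interface.
interface FetchAction {
    void execute(FetchedPage page);
}

// The strategy: records "title <url>" for every page it is handed.
class TitleCollector implements FetchAction {
    private final List<String> seen = new ArrayList<>();

    @Override
    public void execute(FetchedPage page) {
        seen.add(page.getTitle() + " <" + page.getUrl() + ">");
    }

    List<String> getSeen() { return seen; }
}
```

With the real library, an object like this would be handed to the crawler, which would then call execute() once per fetched page.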

Since version 0.1.2, webmuncher ships with utility FetchActions that make it easy to persist crawled pages. The included utility actions are:

  1. JDBCAction - for persisting web pages into a relational database. (since 0.1.2)
  2. ElasticSearchAction - for indexing web pages into ElasticSearch. (since 0.1.2)
  3. CSVAction - for saving web pages into a CSV file. (since 0.1.2)
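
To give a feel for what a persisting action involves, the following is a minimal sketch of an action that writes one CSV row per page. It is not the library's CSVAction: `SimpleCsvAction` is a hypothetical name, `FetchedPage` and `FetchAction` are simplified stand-ins defined here so the example is self-contained, and the quoting follows the usual CSV convention of wrapping each field in double quotes and doubling any embedded quotes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Stand-in for the library's FetchedPage, reduced to three fields.
class FetchedPage {
    private final String url;
    private final String title;
    private final String html;

    FetchedPage(String url, String title, String html) {
        this.url = url;
        this.title = title;
        this.html = html;
    }

    String getUrl() { return url; }
    String getTitle() { return title; }
    String getHtml() { return html; }
}

// Stand-in for the library's single-method FetchAction interface.
interface FetchAction {
    void execute(FetchedPage page);
}

// Hypothetical action: appends "url,title,html" rows to the given writer.
class SimpleCsvAction implements FetchAction {
    private final Writer out;

    SimpleCsvAction(Writer out) {
        this.out = out;
    }

    // Wrap the field in double quotes and double any quotes inside it.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    @Override
    public void execute(FetchedPage page) {
        String row = quote(page.getUrl()) + ","
                + quote(page.getTitle()) + ","
                + quote(page.getHtml()) + "\r\n";
        try {
            out.write(row);
        } catch (IOException e) {
            // execute() declares no checked exceptions, so rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }
}
```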

For example, to use webmuncher to extract all the contents of http://www.example.com into a CSV file, you do:

    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.Set;

    // Use the builder to build the CSVAction
    CSVAction action = CSVAction.builder()
            .convertToPlainText(true) // converts HTML to plain text
            .setDestination(Paths.get("example-com.csv"))
            .buildAction();

    // Create an instance of the crawler with the action
    webmuncher crawler = new webmuncher(action);

    // Configure the crawler to your heart's desire

    // The crawler will wait 20 seconds between requests
    crawler.setDelay(20);

    // If at first you don't succeed?
    // Give up and move on to the next one after 3 attempts!
    crawler.setMaxRetry(3);

    // For each request, the crawler picks a user agent at random
    // from the list you give it
    crawler.setUserAgents(Arrays.asList(
            "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6)...",
            "Opera/9.80 (X11; Linux i686; Ubuntu/14.10)..."));

    // Provide the list of addresses to use as the referrer, so that when the
    // folks at example.com check their web server logs, the requests sometimes
    // come from Google, sometimes Yahoo, sometimes Bing...
    crawler.setReferrals(Arrays.asList(
            "http://www.google.com",
            "http://www.yahoo.com",
            "http://www.bing.com"));

    // Start the crawling operation as a blocking call.
    Set<String> strings = crawler.crawl("http://www.example.com");

    // If you want to run the crawl in another thread so that the current
    // thread does not block, use this call instead of the one above:
    // Set<String> strings = crawler.crawlAsync("http://www.example.com");

    // If you do the crawling in another thread, you most likely want
    // to be notified when the crawling operation terminates. In that
    // case, use crawler.onExit(FetchExitCallback callback)
    // to register the callback.
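
The general idea of running a blocking crawl in the background and reacting when it finishes can also be sketched with the JDK alone. The example below does not use crawlAsync() or onExit(), since their exact signatures are not shown here. Instead it wraps any blocking "seed URL in, crawled URLs out" function in a `CompletableFuture`. `BackgroundCrawl` and `fakeCrawl` are hypothetical names; with the real library you would presumably pass something like `crawler::crawl`.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class BackgroundCrawl {

    // Runs the blocking crawl function on another thread and returns a
    // future that completes with the set of crawled URLs.
    static CompletableFuture<Set<String>> start(
            Function<String, Set<String>> blockingCrawl, String seedUrl) {
        return CompletableFuture.supplyAsync(() -> blockingCrawl.apply(seedUrl));
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real blocking crawl.
        Function<String, Set<String>> fakeCrawl =
                seed -> Collections.singleton(seed + "/about");

        CompletableFuture<Void> done = start(fakeCrawl, "http://www.example.com")
                // Runs once the crawl has finished.
                .thenAccept(urls -> System.out.println("Crawl finished: " + urls));

        // The calling thread is free to do other work here.

        // Wait before exiting so the background task is not cut short.
        done.join();
    }
}
```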

The steps above make use of the CSVAction that comes with the library. If you have custom operations you want applied to the fetched web pages, you can easily implement your own FetchAction. For example, a JPA-backed FetchAction implementation may look like:

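    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;
    // ...plus the webmuncher imports for FetchAction and FetchedPage

    class CustomJpaAction implements FetchAction {

        // Creating the factory is expensive, so do it once rather than per page.
        private final EntityManagerFactory emf =
                Persistence.createEntityManagerFactory("FetchedPage");

        /**
         * Persists the given {@link com.blogspot.geekabyte.webmuncher.FetchedPage}.
         *
         * @param page the fetched page to store
         */
        @Override
        public void execute(FetchedPage page) {
            EntityManager em = emf.createEntityManager();
            try {
                em.getTransaction().begin();

                // FetchedPageEntity is your own JPA entity class
                FetchedPageEntity entity = new FetchedPageEntity();
                entity.setHtml(page.getHtml());
                entity.setLoadTime(page.getLoadTime());
                entity.setStatus(page.getStatus());
                entity.setTitle(page.getTitle());
                entity.setUrl(page.getUrl());
                entity.setSourceUrl(page.getSourceUrl());

                em.persist(entity);
                em.getTransaction().commit(); // commit() also flushes pending changes
            } catch (RuntimeException e) {
                // Undo a half-finished transaction before propagating the error
                if (em.getTransaction().isActive()) {
                    em.getTransaction().rollback();
                }
                throw e;
            } finally {
                em.close();
            }
        }
    }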

### Overview of the webmuncher API

The accompanying Javadoc gives an overview of the API. It can be generated using the [Javadoc tool](http://www.oracle.com/technetwork/articles/java/index-jsp-135444.html) or via Maven using the [Maven Javadoc plugin](http://maven.apache.org/plugins/maven-javadoc-plugin/).

More conveniently, thanks to Javadoc.io, you can also access the most recent Javadoc online

### License

[The MIT License (MIT)](http://www.opensource.org/licenses/mit-license.php)
