sitemap-parser's Introduction

Java sitemap parser

Java library to parse sitemaps. For details see the Javadoc comments in the respective classes.

General information:

  • works with Java 7 and higher
  • no other dependencies
  • currently built with Maven
  • MIT license

To get you started:

SitemapParser sitemapParser = new SitemapParser();

This creates a new SitemapParser.

Let's assume you want to parse the sitemap(s) of a website and want to make sure that this still works even when the location of the sitemap(s) changes.

Set<String> sitemapLocations = sitemapParser.getSitemapLocations("https://www.google.com/");

This returns a Set containing the URLs of the website's sitemaps (Google's, in this case). You can pass any URL of the website: getSitemapLocations uses only the protocol, hostname and port and ignores the rest. Internally, the method fetches the web server's robots.txt file and extracts the sitemap entries from it, so it only works when robots.txt contains at least one sitemap entry (which is the case for Google).
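To make the robots.txt step more concrete, here is a minimal sketch of the idea. It is not the library's actual implementation: the class RobotsTxtSketch and both of its methods are hypothetical names, and the robots.txt body is passed in as a string so that the example runs without network access.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of the sitemap-parser library.
final class RobotsTxtSketch {

    // Builds the robots.txt URL from any URL of the site: keeps protocol,
    // hostname and port, and drops path, query and fragment.
    static String robotsTxtUrl(String anyUrl) {
        URI uri = URI.create(anyUrl);
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        return uri.getScheme() + "://" + uri.getHost() + port + "/robots.txt";
    }

    // Collects the values of all "Sitemap:" lines (case-insensitive)
    // found in the given robots.txt body, in order of appearance.
    static Set<String> extractSitemapLocations(String robotsTxt) {
        Set<String> locations = new LinkedHashSet<String>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                String location = trimmed.substring(8).trim();
                if (!location.isEmpty()) {
                    locations.add(location);
                }
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        System.out.println(robotsTxtUrl("https://www.example.com:8443/some/page?x=1"));
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /private\n"
                + "Sitemap: https://www.example.com/sitemap.xml\n";
        System.out.println(extractSitemapLocations(robotsTxt));
    }
}
```

An empty result from extractSitemapLocations corresponds to the case described above in which robots.txt lists no sitemap.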

int sitemapLocationSize = sitemapLocations.size();
String lastSitemapLocation = sitemapLocations.toArray(new String[sitemapLocationSize])[sitemapLocationSize - 1];
Sitemap sitemap = sitemapParser.parseSitemap(lastSitemapLocation, false);

This parses the last sitemap that was returned. If it is a sitemap index (which it is in this case), only the index itself is parsed, not the sitemaps it contains. Pass true as the second parameter, or omit it, to recursively parse all contained sitemaps (quite a few in this case).
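The effect of the second parameter can be pictured with the following sketch. It only illustrates the idea and does not reflect the library's internals: RecursiveIndexSketch and its collect method are hypothetical, and an in-memory map stands in for the sitemap indexes that would normally be fetched over HTTP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of the sitemap-parser library.
final class RecursiveIndexSketch {

    // Returns the URLs that would be visited when starting at the given URL.
    // The map plays the role of the web: a key is a sitemap index URL and its
    // value lists the sitemaps the index refers to. URLs that are not keys
    // are treated as plain sitemaps. No cycle detection, for brevity.
    static List<String> collect(Map<String, List<String>> indexes,
                                String url, boolean recursive) {
        List<String> visited = new ArrayList<String>();
        visited.add(url);
        List<String> children = indexes.get(url);
        if (recursive && children != null) {
            for (String child : children) {
                visited.addAll(collect(indexes, child, true));
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> indexes = new HashMap<String, List<String>>();
        indexes.put("https://www.example.com/index.xml", Arrays.asList(
                "https://www.example.com/a.xml",
                "https://www.example.com/b.xml"));

        // Non-recursive: only the index itself is visited.
        System.out.println(collect(indexes, "https://www.example.com/index.xml", false));
        // Recursive: the index and every sitemap it lists are visited.
        System.out.println(collect(indexes, "https://www.example.com/index.xml", true));
    }
}
```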

System.out.println("Sitemap is of type " + sitemap.getSitemapType());
System.out.println("Sitemap contains " + sitemap.getSitemapIndexes().size() + " sitemap indexes");

This prints the type of the sitemap and the number of sitemap index entries it contains.

SitemapIndex firstSitemapIndex = sitemap.getSitemapIndexes().iterator().next();
sitemap = sitemapParser.parseSitemap(firstSitemapIndex.getLoc());

This gets the sitemap for the first sitemap index entry.

Date minLastModDate = new GregorianCalendar(2015, Calendar.AUGUST, 15).getTime();
sitemap = sitemap.getSitemapWithMinPriority(0.9).getSitemapModifiedAfter(minLastModDate);

This removes entries with a priority lower than 0.9 as well as entries last modified before 2015-08-15. The filtering is done locally, without fetching the sitemap again.
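For readers who want to see what such chained local filtering amounts to, here is a self-contained sketch. It is an assumption-laden illustration, not the library's code: EntryFilterSketch, its Entry class and both filter methods are hypothetical, and the boundary handling (a priority equal to the minimum is kept, an entry without a modification date is dropped) is a guess that the library may handle differently.

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.List;

// Hypothetical illustration; not part of the sitemap-parser library.
final class EntryFilterSketch {

    // Stand-in for a sitemap entry; the library's SitemapEntry may differ.
    static final class Entry {
        final String loc;
        final double priority;
        final Date lastMod;

        Entry(String loc, double priority, Date lastMod) {
            this.loc = loc;
            this.priority = priority;
            this.lastMod = lastMod;
        }
    }

    // Keeps entries whose priority is at least minPriority (assumed boundary).
    static List<Entry> withMinPriority(List<Entry> entries, double minPriority) {
        List<Entry> kept = new ArrayList<Entry>();
        for (Entry entry : entries) {
            if (entry.priority >= minPriority) {
                kept.add(entry);
            }
        }
        return kept;
    }

    // Keeps entries modified strictly after minLastMod; entries without a
    // modification date are dropped (assumed behavior).
    static List<Entry> modifiedAfter(List<Entry> entries, Date minLastMod) {
        List<Entry> kept = new ArrayList<Entry>();
        for (Entry entry : entries) {
            if (entry.lastMod != null && entry.lastMod.after(minLastMod)) {
                kept.add(entry);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Date september = new GregorianCalendar(2015, Calendar.SEPTEMBER, 1).getTime();
        Date july = new GregorianCalendar(2015, Calendar.JULY, 1).getTime();
        List<Entry> entries = new ArrayList<Entry>();
        entries.add(new Entry("/new-important", 1.0, september));
        entries.add(new Entry("/old-important", 0.9, july));
        entries.add(new Entry("/new-minor", 0.5, september));

        Date minLastMod = new GregorianCalendar(2015, Calendar.AUGUST, 15).getTime();
        // Both filters work on data already in memory; nothing is fetched.
        for (Entry entry : modifiedAfter(withMinPriority(entries, 0.9), minLastMod)) {
            System.out.println(entry.loc);
        }
    }
}
```

Because each filter returns a new list, the two calls can be chained in either order and yield the same entries.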

System.out.println("Sitemap is of type " + sitemap.getSitemapType());
System.out.println("Sitemap contains " + sitemap.getSitemapEntries().size() + " entries after filtering");
for (SitemapEntry sitemapEntry : sitemap.getSitemapEntries()) {
    System.out.println(sitemapEntry);
}

This prints the sitemap type (XML in this case) and the number of entries that remain after filtering, followed by the entries themselves.
