url-normalization

URL normalization (or URL canonicalization) in general is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs may be equivalent. For more detail see http://en.wikipedia.org/wiki/URL_normalization

Rather than providing the traditional types of normalization used for SEO purposes, this Java library transforms URLs into comparable and therefore sortable URLs. You can use it whenever a URL serves as a (primary) key in your application or storage system. By default the library produces URLs with the domain labels inverted (for example, www.example.com becomes com.example), but the inversion can be switched off.
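To illustrate what inverting the domain labels means, here is a minimal sketch. It is not the library's implementation; the class `HostLabelInverter` and its `invert` method are hypothetical names chosen for this example, and the sketch assumes a plain dotted host name (no IP addresses).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only, not part of the library's API.
class HostLabelInverter {

    // Reverses the dot-separated labels of a host name,
    // e.g. "blog.example.com" -> "com.example.blog".
    static String invert(String host) {
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels); // reverses the backing array in place
        return String.join(".", labels);
    }

    public static void main(String[] args) {
        System.out.println(invert("blog.example.com")); // com.example.blog
        System.out.println(invert("example.com"));      // com.example
    }
}
```

With the most significant label first, hosts that belong to the same domain share a common prefix, which is what makes the resulting strings useful as sortable keys.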

Examples

    ch.sentric/blog/berlin-buzzwords-2012-presentation-and-highlights
    ch.sentric/blog/berlin-buzzwords-2012-review-from-a-search-perspective
    ch.sentric/blog/comparing-cloudera-impala
    ch.sentric/blog/cucumber-goes-hadoop
    ch.sentric/blog/ein-treffen-mit-james-kinley-von-cloudera
    ch.sentric/blog/hadoop-best-practice-cluster-checklist
    ch.sentric/blog/hbase-sizing-notes
    ch.sentric/blog/highlights-of-apache-lucene-solr-4-0
    ch.sentric/blog/how-should-pig-and-hive-be-integrated-to-access-data-in-hadoop
    ch.sentric/blog/how-to-determine-hbase-row-sizes
    ch.sentric/blog/log-data-analysis-what-is-the-most-popular-apache-webserver-version
    ch.sentric/blog/monitoring-web-apps-with-cucumber
    ch.sentric/blog/rebuilding-a-solr-index-the-hard-way
    ch.sentric/blog/sentric-at-strata-conference-hadoop-world-2012-in-new-york
    ch.sentric/blog/sentric-becomes-cloudera-connect-partner
    ch.sentric/blog/sentric-speaking-at-apachecon-europe-2012
    ch.sentric/blog/whats-an-appropriate-use-case-for-kafka
    ch.sentric/blog/why-hadoop-and-why-now
    ch.sentric/blog/why-we-chose-solr-4-0-instead-of-elasticsearch
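Because every key above starts with ch.sentric/, a sorted store keeps them next to each other, and all pages of one site can be read with a single range scan. The following sketch shows that property with a plain `TreeSet` standing in for a sorted key-value store; the class `SortedKeyDemo` and the method `withPrefix` are hypothetical names, not part of this library.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical demo class; a TreeSet stands in for a sorted key-value store.
class SortedKeyDemo {

    // Returns all keys that start with the given prefix by using the sort
    // order of the set instead of testing every element.
    static SortedSet<String> withPrefix(TreeSet<String> keys, String prefix) {
        // The upper bound prefix + '\uffff' sorts after every key that
        // continues the prefix with an ordinary character.
        return keys.subSet(prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>();
        keys.add("ch.sentric/blog/hbase-sizing-notes");
        keys.add("com.example/search");
        keys.add("ch.sentric/blog/why-hadoop-and-why-now");
        keys.add("com.example/bar.html");

        // Prints only the two ch.sentric keys, in sorted order.
        System.out.println(withPrefix(keys, "ch.sentric/"));
    }
}
```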

Normalization process

Normalizations that Preserve Semantics

  • Converting the host (and scheme) to lower case: The host (and scheme) components of the URL are case-insensitive. This normalizer will convert them to lowercase. Example: HTTP://www.Example.com/seARch → com.example/search

  • Decoding percent-encoded octets of unreserved characters: For consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, they will be decoded to their corresponding unreserved characters by this normalizer. Example: http://www.example.com/%7Eusername/ → com.example/~username/

  • Removing the default port: The default port (port 80 for the “http” scheme) is removed from a URL. Example: http://www.example.com:80/bar.html → com.example/bar.html
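The three semantics-preserving steps above can be sketched with the standard `java.net.URI` class. This is an illustration under stated assumptions, not the library's code: the class `SafeNormalizations` and its methods are hypothetical, the input is assumed to be an absolute http URL with a host, and the sketch keeps the scheme and the original label order instead of producing the inverted form shown in the examples.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch of the semantics-preserving steps; not the library's API.
class SafeNormalizations {

    // True for RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~".
    static boolean isUnreserved(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Value of a hexadecimal digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Decodes %XX only when it encodes an unreserved character;
    // every other escape (for example %2F) is left untouched.
    static String decodeUnreserved(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = hexValue(s.charAt(i + 1));
                int lo = hexValue(s.charAt(i + 2));
                if (hi >= 0 && lo >= 0 && isUnreserved((char) (hi * 16 + lo))) {
                    out.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // Lower-cases scheme and host, drops port 80 for http, and decodes
    // unreserved escapes in the path. The fragment is ignored.
    static String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        int port = uri.getPort(); // -1 when no port is given
        boolean defaultPort = port == -1 || ("http".equals(scheme) && port == 80);

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) out.append(':').append(port);
        out.append(decodeUnreserved(uri.getRawPath()));
        if (uri.getRawQuery() != null) out.append('?').append(uri.getRawQuery());
        return out.toString();
    }

    public static void main(String[] args) {
        // http://www.example.com/~username/
        System.out.println(normalize("HTTP://www.Example.com:80/%7Eusername/"));
    }
}
```

Only unreserved characters are decoded because decoding a reserved character such as %2F would change how the path is split into segments.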

Normalizations that Change Semantics

  • Removing “www” as the first domain label: Some websites operate in two Internet domains: one whose least significant label is “www” and another whose name is the result of omitting the least significant label from the name of the first. For example, http://example.com/ and http://www.example.com/ may access the same website. Many websites redirect the user from the www to the non-www address or vice versa. This normalizer treats both forms as the same site and normalizes all URLs by removing the leading “www” label. Example: http://www.example.com/search → com.example/search

  • Sorting the query parameters: Some web pages use more than one query parameter in the URL. This normalizer can sort the parameters into alphabetical order (with their values), and reassemble the URL. Example: http://www.example.com/display?lang=en&article=fred → com.example/display?article=fred&lang=en

  • Removing the "?" when the query is empty: When the query is empty, there may be no need for the "?". Example: http://www.example.com/display? → com.example/display
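The three semantics-changing steps above are simple string operations. The sketch below shows one possible way to write them; the class `QueryNormalizations` and its methods are hypothetical names for this example and are not taken from the library. It assumes the query is a list of name=value pairs separated by "&" and sorts each pair as a whole string.

```java
import java.util.Arrays;

// Hypothetical sketch of the semantics-changing steps; not the library's API.
class QueryNormalizations {

    // Drops a leading "www." label from a host name.
    static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    // Sorts the "name=value" pairs alphabetically;
    // returns "" for a null or empty query.
    static String sortQuery(String rawQuery) {
        if (rawQuery == null || rawQuery.isEmpty()) return "";
        String[] params = rawQuery.split("&");
        Arrays.sort(params);
        return String.join("&", params);
    }

    // Reassembles path and query, emitting "?" only when a query remains.
    static String pathAndQuery(String path, String rawQuery) {
        String sorted = sortQuery(rawQuery);
        return sorted.isEmpty() ? path : path + "?" + sorted;
    }

    public static void main(String[] args) {
        System.out.println(stripWww("www.example.com"));                     // example.com
        System.out.println(pathAndQuery("/display", "lang=en&article=fred")); // /display?article=fred&lang=en
        System.out.println(pathAndQuery("/display", ""));                     // /display
    }
}
```

These steps are grouped as semantics-changing because a server is free to treat www.example.com and example.com, or differently ordered parameters, as different resources.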

Quickstart

  1. Grab the sources from GitHub:

    $ git clone https://github.com/sentric/url-normalization.git
    $ cd url-normalization  
    
  2. Build:

    $ mvn assembly:assembly
    
  3. Test:

    $ mvn test
    

Example Code

    // URL is the class provided by this library, not java.net.URL
    URL url = new URL("http://www.example.com:80/bar.html");
    url.getNormalizedUrl(); // --> com.example/bar.html

License

url-normalization is released under Apache License Version 2.0, see LICENSE.txt for details.

Contributors

jkoenig
