Giter VIP home page Giter VIP logo

apache-clickpath's Introduction

NAME
    Apache::ClickPath - Apache WEB Server User Tracking

SYNOPSIS
     LoadModule perl_module ".../mod_perl.so"
     PerlLoadModule Apache::ClickPath
     <ClickPathUAExceptions>
       Google     Googlebot
       MSN        msnbot
       Mirago     HeinrichderMiragoRobot
       Yahoo      Yahoo-MMCrawler
       Seekbot    Seekbot
       Picsearch  psbot
       Globalspec Ocelli
       Naver      NaverBot
       Turnitin   TurnitinBot
       dir.com    Pompos
       search.ch  search\.ch
       IBM        http://www\.almaden\.ibm\.com/cs/crawler/
     </ClickPathUAExceptions>
     ClickPathSessionPrefix "-S:"
     ClickPathMaxSessionAge 18000
     PerlTransHandler Apache::ClickPath
     PerlOutputFilterHandler Apache::ClickPath::OutputFilter
     LogFormat "%h %l %u %t \"%m %U%q %H\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" \"%{SESSION}e\""

ABSTRACT
    "Apache::ClickPath" can be used to track user activity on your web
    server and gather click streams. Unlike mod_usertrack it does not use a
    cookie. Instead the session identifier is transferred as the first part
    on an URI.

    Furthermore, in conjunction with a load balancer it can be used to
    direct all requests belonging to a session to the same server.

DESCRIPTION
    "Apache::ClickPath" adds a PerlTransHandler and an output filter to
    Apache's request cycle. The transhandler inspects the requested URI to
    decide if an existing session is used or a new one has to be created.

  The Translation Handler
    If the requested URI starts with a slash followed by the session prefix
    (see "ClickPathSessionPrefix" below) the rest of the URI up to the next
    slash is treated as session identifier. If for example the requested URI
    is "/-S:s9NNNd:doBAYNNNiaNQOtNNNNNM/index.html" then assuming
    "ClickPathSessionPrefix" is set to "-S:" the session identifier would be
    "s9NNNd:doBAYNNNiaNQOtNNNNNM".

    If no session identifier is found a new one is created.

    Then the session prefix and identifier are stripped from the current
    URI. Also a potentially existing session is stripped from the incoming
    "Referer" header.

    There are several exceptions to this scheme. Even if the incoming URI
    contains a session a new one is created if it is too old. This is done
    to prevent link collections, bookmarks or search engines generating
    endless click streams.

    If the incoming "UserAgent" header matches a configurable regular
    expression neither session identifier is generated nor output filtering
    is done. That way search engine crawlers will not create sessions and
    links to your site remain readable (without the session stuff).

    The translation handler sets the following environment variables that
    can be used in CGI programms or template systems (eg. SSI):

    SESSION
        the session identifier itself. In the example above
        "s9NNNd:doBAYNNNiaNQOtNNNNNM" is assigned. If the "UserAgent"
        prevents session generation the name of the matching regular
        expression is assigned, (see "ClickPathUAExceptions").

    CGI_SESSION
        the session prefix + the session identifier. In the example above
        "/-S:s9NNNd:doBAYNNNiaNQOtNNNNNM" is assigned. If the "UserAgent"
        prevents session generation "CGI_SESSION" is empty.

    SESSION_START
        the request time of the request starting a session in seconds since
        1/1/1970.

    CGI_SESSION_AGE
        the session age in seconds, i.e. CURRENT_TIME - SESSION_START.

    REMOTE_SESSION
        in case a friendly session was caught this variable contains it, see
        below.

    REMOTE_SESSION_HOST
        in case a friendly session was caught this variable contains the
        host it belongs to, see below.

  The Output Filter
    The output filter is entirely skipped if the translation handler had not
    set the "CGI_SESSION" environment variable.

    It prepends the session prefix and identifier to any "Location" an
    "Refresh" output headers.

    If the output "Content-Type" is "text/html" the body part is modified.
    In this case the filter patches the following HTML tags:

    <a ... href="LINK" ...>
    <area ... href="LINK" ...>
    <form ... action="LINK" ...>
    <frame ... src="LINK" ...>
    <iframe ... src="LINK" ...>
    <meta ... http-equiv="refresh" ... content="N; URL=LINK" ...>
        In all cases if "LINK" starts with a slash the current value of
        "CGI_SESSION" is prepended. If "LINK" starts with "http://HOST/" (or
        https:) where "HOST" matches the incoming "Host" header
        "CGI_SESSION" is inserted right after "HOST". If "LINK" is relative
        and the incoming request URI had contained a session then "LINK" is
        left unmodified. Otherwize it is converted to a link starting with a
        slash and "CGI_SESSION" is prepended.

  Configuration Directives
    All directives are valid only in *server config* or *virtual host*
    contexts.

    ClickPathSessionPrefix
        specifies the session prefix without the leading slash.

    ClickPathMaxSessionAge
        if a session gets older than this value (in seconds) a new one is
        created instead of continuing the old. Values of about a few hours
        should be good, eg. 18000 = 5 h.

    ClickPathMachine
        set this machine's name. The name is used with load balancers. Each
        machine of a farm is assigned a unique name. That makes session
        identifiers unique across the farm.

        If this directive is omitted a compressed form (6 Bytes) of the
        server's IP address is used. Thus the session is unique across the
        Internet.

        In environments with only one server this directive can be given
        without an argument. Then an empty name is used and the session is
        unique on the server.

        If possible use short or empty names. It saves bandwidth.

        A name consists of letters, digits and underscores (_).

        The generated session identifier contains the name in a slightly
        scrambled form to slightly hide your infrastructure.

    ClickPathUAExceptions
        this is a container directive like "<Location>" or "<Directory>".
        The container content lines consist of a name and a regular
        expression. For example

         1   <ClickPathUAExceptions>
         2     Google     Googlebot
         3     MSN        (?i:msnbot)
         4   </ClickPathUAExceptions>

        Line 2 maps each "UserAgent" containing the word "Googlebot" to the
        name "Google". Now if a request comes in with an "UserAgent" header
        containing "Googlebot" no session is generated. Instead the
        environment variable "SESSION" is set to "Google" and "CGI_SESSION"
        is emtpy.

    ClickPathUAExceptionsFile
        this directive takes a filename as argument. The file's syntax and
        semantic are the same as for "ClickPathUAExceptions". The file is
        reread every time is has been changed avoiding server restarts after
        configuration changes at the prize of memory consumption.

    ClickPathFriendlySessions
        this is also a container directive. It describes friendly sessions.
        What is a friendly session? Well, suppose you have a WEB shop
        running on "shop.tld.org" and your company site running on
        "www.tld.org". The shop does it's own URL based session management
        but there are links from the shop to the company site and back.
        Wouldn't it be nice if a customer once he has stepped into the shop
        could click links to the company without loosing the shopping
        session? This is where friendly sessions come in.

        Since your shop's session management is URL based the "Referer" seen
        by "www.tld.org" will be something like

         https://shop.tld.org/cgi-bin/shop.pl?session=sdafsgr;clusterid=25

        (if session and clusterid are passed as CGI parameters) or

         https://shop.tld.org/C:25/S:sdafsgr/cgi-bin/shop.pl

        (if session and clusterid are passed as URL parts) or something
        mixed.

        Assuming that "clusterid" and "session" both identify the session on
        "shop.tld.org" "Apache::ClickPath" can extract them, encode them in
        it's own session and place them in environment variables.

        Each line in the "ClickPathFriendlySessions" section decribes one
        friendly site. The line consists of the friendly hostname, a list of
        URL parts or CGI parameters identifying the friendly session and an
        optional short name for this friend, eg:

         shop.tld.org uri(1) param(session) shop

        This means sessions at "shop.tld.org" are identified by the
        combination of 1st URL part after the leading slash (/) and a CGI
        parameter named "session".

        If now a request comes in with a "Referer" of
        "http://shop.tld.org/25/bin/shop.pl?action=showbasket;session=213"
        the "REMOTE_SESSION" environment variable will contain 2 lines:

         25
         session=213

        Their order is determined by the order of "uri()" and "param()"
        statements in the configuration section between the hostname and the
        short name. The "REMOTE_SESSION_HOST" environment variable will
        contain the host name the session belongs to.

        Now a CGI script or a modperl handler or something similar can fetch
        the environment and build links back to "shop.tld.org". Instead of
        directly linking back to the shop your links then point to that
        script. The script then puts out an appropriate redirect.

    ClickPathFriendlySessionsFile
        this directive takes a filename as argument. The file's syntax and
        semantic are the same as for "ClickPathFriendlySessions". The file
        is reread every time is has been changed avoiding server restarts
        after configuration changes at the prize of memory consumption.

  Working with a load balancer
    Most load balancers are able to map a request to a particular machine
    based on a part of the request URI. They look for a prefix followed by a
    given number of characters or until a suffix is found. The string
    between identifies the machine to route the request to.

    The name set with "ClickPathMachine" can be used by a load balancer. It
    is immediately following the session prefix and finished by a single
    colon. The default name is always 6 bytes long.

  Logging
    The most important part of user tracking and clickstreams is logging.
    With "Apache::ClickPath" many request URIs contain an initial session
    part. Thus, for logfile analyzers most requests are unique which leads
    to useless results. Normally Apache's common logfile format starts with

     %h %l %u %t \"%r\"

    %r stands for *the request*. It is the first line a browser sends to a
    server. For use with "Apache::ClickPath" %r is better changed to "%m
    %U%q %H". Since "Apache::ClickPath" strips the session part from the
    current URI %U appears without the session. With this modification
    logfile analyzers will produce meaningful results again.

    The session can be logged as "%{SESSION}e" at end of a logfile line.

  A word about proxies
    Depending on your content and your users community HTTP proxies can
    serve a significant part of your traffic. With "Apache::ClickPath"
    almost all request have to be served by your server.

  Using with SSI
    Server Side Includes are also implemented as an output filter. Normally
    Perl output filters are called *before* mod_include leading to
    unexpected results if an SSI statement generated links. On the other
    hand one can configure the "INCLUDES" filter with "PerlSetOutputFilter"
    which preserves the order given in the configuration file. Unfortunately
    there is no "PerlSetOutputFilterByType" directive and and the "INCLUDES"
    filter processes everything independend of the "Content-Type". Thus,
    also images and other stuff is scanned for SSI statements.

    With Apache 2.2 there will be a filter dispatcher module that can maybe
    address this problem.

    Currently my only solution to this problem is a little module
    "Apache::RemoveNextFilterIfNotTextHtml" and setting up the filter chain
    with "PerlOutputFilterHandler" and "PerlSetOutputFilter":

     PerlOutputFilterHandler Apache::RemoveNextFilterIfNotTextHtml
     PerlSetOutputFilter INCLUDES
     PerlOutputFilterHandler Apache::ClickPath::OutputFilter

    Don't hesitate to contact me if you are interested in this little
    module.

  Debugging
    Sometimes it is useful to know the information encoded in a session
    identifier. This is why Apache::ClickPath::Decode exists.

SEE ALSO
    Apache::ClickPath::Decode(3) <http://perl.apache.org>,
    <http://httpd.apache.org>

AUTHOR
    Torsten Foertsch, <[email protected]>

COPYRIGHT AND LICENSE
    Copyright (C) 2004 by Torsten Foertsch

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

INSTALLATION
     perl Makefile.PL
     make
     make test
     make install

DEPENDENCIES
    mod_perl 1.9918 (aka 2.0.0-RC1), perl 5.8.0

apache-clickpath's People

Contributors

torsten-deriv avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.