referer-parser's Introduction

referer-parser

referer-parser is a database for extracting marketing attribution data (such as search terms) from referer URLs, inspired by the ua-parser project (an equivalent library for user agent parsing).

The referer-parser project also contains multiple libraries for working with the referer-parser database in different languages.

referer-parser is a core component of Snowplow, the open-source web-scale analytics platform powered by Hadoop and Redshift.

Note that we always use the original HTTP misspelling of 'referer' (and thus 'referal') in this project - never 'referrer'.

Database

The database is available in YAML and JSON format.

The latest database is always available at these URLs:

https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.yaml
https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-latest.json

The database is updated at most once a month. Each new version of the database is also uploaded with a timestamp:

https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-YYYYMMDD.yaml
https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-YYYYMMDD.json

Example:

https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-20200331.yaml
https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/third-party/referer-parser/referers-20200331.json

If there is an issue with the database necessitating a re-release within the month, the corresponding files will be overwritten.
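The URL scheme above can be sketched as a small helper; database_url is a hypothetical name for illustration, not part of any official library:

```python
# Sketch: build the hosted-asset URLs for the referer database described
# above. database_url is a hypothetical helper, not an official API.
BASE = ("https://s3-eu-west-1.amazonaws.com/snowplow-hosted-assets/"
        "third-party/referer-parser")

def database_url(fmt="yaml", date=None):
    """Return the URL of the latest database, or a timestamped snapshot.

    fmt:  'yaml' or 'json'
    date: optional 'YYYYMMDD' string for a dated release
    """
    stem = "referers-latest" if date is None else "referers-%s" % date
    return "%s/%s.%s" % (BASE, stem, fmt)
```

For example, `database_url("json", "20200331")` yields the dated JSON snapshot URL shown above.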

Language-specific repositories

referers.yml

referer-parser identifies whether a URL is a known referer or not by checking it against the referers.yml file; the intention is that this YAML file is reusable as-is by every language-specific implementation of referer-parser.

The file is broken out into sections for the different mediums that we support:

  • unknown for when we know the source, but not the medium
  • email for webmail providers
  • social for social media services
  • search for search engines

Then within each section, we list each known provider (aka source) by name, and then which domains each provider uses. For search engines, we also list the parameters used in the search engine URL to identify the search term. For example:

Google: # Name of search engine referer
  parameters:
    - 'q' # First parameter used by Google
    - 'p' # Alternative parameter used by Google
  domains:
    - google.co.uk  # One domain used by Google
    - google.com    # Another domain used by Google
    - ...

The list of referers and the domains they use is constantly growing - we need to keep referers.yml up-to-date, and we hope the community will help!
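As a sketch of how a language port might consume this structure, the snippet below indexes a referers.yml-style mapping by domain for fast lookup; the inline dict stands in for the parsed YAML file, which a real port would load with a YAML library:

```python
# Sketch: index referers.yml-style data by domain for O(1) lookup.
# The dict literal stands in for the parsed YAML file.
referers = {
    "search": {
        "Google": {
            "parameters": ["q", "p"],
            "domains": ["google.co.uk", "google.com"],
        },
    },
    "social": {
        "Facebook": {"domains": ["facebook.com", "fb.me"]},
    },
}

def build_index(data):
    """Map each domain to its (medium, source, parameters) triple."""
    index = {}
    for medium, sources in data.items():
        for source, details in sources.items():
            for domain in details["domains"]:
                index[domain] = (medium, source, details.get("parameters", []))
    return index

index = build_index(referers)
```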

Contributing

We welcome contributions to referer-parser:

  1. New search engines and other referers - if you notice a search engine, social network or other site missing from referers.yml, please fork the repo, add the missing entry and submit a pull request
  2. Ports of referer-parser to other languages - we welcome ports of referer-parser to new programming languages (e.g. Lua, Go, Haskell, C)
  3. Bug fixes, feature requests etc - much appreciated!

Please sign the Snowplow CLA before making pull requests.

Support

General support for referer-parser is handled by the team at Snowplow Analytics Ltd.

You can contact the Snowplow Analytics team through any of the channels listed on their wiki.

Copyright and license

referers.yml is based on Piwik's SearchEngines.php and Socials.php, copyright 2012 Matthieu Aubry and available under the GNU General Public License v3.

referer-parser's People

Contributors

235, alexanderdean, benfradet, blazy2k9, danm, donspaulding, emilssolmanis, eyepulp, fblundun, jethron, jhirbour, jobartim44, kaibinhuang, kingo55, lstrojny, mkatrenik, mleuthold, ramin, raulgenially, rgraff, rzats, saj1th, shuttie, silviucpp, swijnands, tiborb, tombar, tsileo, ukutaht, yoloseem

referer-parser's Issues

Java & Scala: make tests JSON-driven

referer-parser could take a cue from ua-parser by adding a YAML (or JSON) file of test cases, consisting of example referer URLs for each of the referers in search.yml together with the corresponding parse results.

This would greatly increase the confidence level in the consistency of future ports to other programming languages.
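A hypothetical shape for such a test file, plus a driver that runs each case through a port's parse function - the field names here are illustrative assumptions, not an agreed schema:

```python
import json

# Hypothetical shape for a JSON-driven test file. The field names
# (spec, uri, medium, source, term) are illustrative assumptions.
TEST_CASES = json.loads("""
[
  {
    "spec":   "Google search",
    "uri":    "http://www.google.com/search?q=gateway+oracle+cards",
    "medium": "search",
    "source": "Google",
    "term":   "gateway oracle cards"
  }
]
""")

def run_cases(parse, cases):
    """Run each case through a port's parse() and collect failures."""
    failures = []
    for case in cases:
        result = parse(case["uri"])  # expected to return a dict-like result
        for field in ("medium", "source", "term"):
            if result.get(field) != case[field]:
                failures.append((case["spec"], field))
    return failures
```

Each language port would then only need a thin adapter from its native result type to the shared field names.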

Can't use composer - possibility for a phar file?

Came across this tonight. I am interested in using the PHP port to parse domains and search terms out of URL logs. The problem is that I cannot use Composer on my current server (WHM/cPanel setup)... I ran into this situation once before; that project also offered a phar file with everything bundled, which I could use without problems and which eliminated the need for Composer.

Any chance of this happening?

Create explicit tests to express the recursive check logic

Basically, the v1 library looked for exact matches only, so even www.google.com and google.com were treated as different referers.

That approach is actually flawed:

  1. Social networks often let users have their own subdomain, and you obviously can't list all of them
  2. Yahoo! puts its search engine on load-balanced subdomains, and that's an unpredictable list

So this version instead tries multiple lookups:

  • It tries the host, then the host + path, then the host + one-level path
  • Then strips off a subdomain and tries again

We should write some explicit tests into the Specs2 test suite to check that this recursion logic works correctly.

/cc @donspaulding
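The lookup order described above can be sketched as follows, trying the more specific host + path candidates before the bare host so that domain + path referers (like Google Product Search) win; KNOWN and lookup are stand-in names for the database and the port's lookup routine:

```python
from urllib.parse import urlparse

# Sketch of the recursive lookup: try host + path, host + first path
# segment, then the bare host; on a miss, strip one subdomain level
# and retry. KNOWN stands in for the database loaded from referers.yml.
KNOWN = {
    "google.com":         "Google",
    "google.com/product": "Google Product Search",
}

def lookup(url):
    parsed = urlparse(url)
    host, path = parsed.netloc, parsed.path
    while host:
        first_segment = "/".join(path.split("/")[:2])  # e.g. "/product"
        for candidate in (host + path, host + first_segment, host):
            if candidate in KNOWN:
                return KNOWN[candidate]
        # Not found: strip one leading subdomain level and try again
        host = host.partition(".")[2] if "." in host else ""
    return None
```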

Java referer-parser doesn't work on Hadoop

Unfortunately the bump to httpclient 4.3.3 has broken referer-parser on Hadoop.

Specifically it's this line of code:

81b88ff#diff-729b6a9a457c5e4a0b244bf130a1e08eR192

This is the error on Hadoop:

Caused by: java.lang.NoSuchMethodError: org.apache.http.client.utils.URLEncodedUtils.parse(Ljava/lang/String;Ljava/nio/charset/Charset;)Ljava/util/List;
    at com.snowplowanalytics.refererparser.Parser.extractSearchTerm(Parser.java:205)
    at com.snowplowanalytics.refererparser.Parser.parse(Parser.java:154)
    at com.snowplowanalytics.refererparser.Parser.parse(Parser.java:116)
    at com.snowplowanalytics.refererparser.scala.Parser$.parse(Parser.scala:153)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.registry.RefererParserEnrichment.extractRefererDetails(RefererParserEnrichment.scala:107)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2$$anonfun$apply$5.apply(EnrichmentManager.scala:291)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2$$anonfun$apply$5.apply(EnrichmentManager.scala:277)
    at scala.Option.foreach(Option.scala:236)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2.apply(EnrichmentManager.scala:277)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$$anonfun$enrichEvent$2.apply(EnrichmentManager.scala:277)
    at scalaz.Validation$class.foreach(Validation.scala:126)
    at scalaz.Success.foreach(Validation.scala:329)
    at com.snowplowanalytics.snowplow.enrich.common.enrichments.EnrichmentManager$.enrichEvent(EnrichmentManager.scala:277)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1$$anonfun$apply$2.apply(EtlJob.scala:70)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1$$anonfun$apply$2.apply(EtlJob.scala:70)
    at scalaz.std.OptionFunctions$class.cata(Option.scala:157)
    at scalaz.std.option$.cata(Option.scala:209)
    at scalaz.syntax.std.OptionOps$class.cata(OptionOps.scala:9)
    at scalaz.syntax.std.ToOptionOps$$anon$1.cata(OptionOps.scala:103)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1.apply(EtlJob.scala:70)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$toCanonicalOutput$1.apply(EtlJob.scala:70)
    at scalaz.Validation$class.flatMap(Validation.scala:141)
    at scalaz.Success.flatMap(Validation.scala:329)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$.toCanonicalOutput(EtlJob.scala:69)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$7.apply(EtlJob.scala:170)
    at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob$$anonfun$7.apply(EtlJob.scala:169)
    at com.twitter.scalding.MapFunction.operate(Operations.scala:58)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
    ... 11 more

Hadoop bundles an old version of httpclient which doesn't have parse(String, Charset). There is talk of Hadoop removing that dependency, but in any case both we and EMR use an oldish version of Hadoop.

The problem is not using 4.3.3 per se, but using parse(String, Charset). /cc @squeed

Re-factor Ruby library

The architecture is very OO and convoluted.

Move towards an architecture similar to the Scala version: a Parser module which handles one-time instantiation, plus a static parse() method which returns a Referer object for a given URL.

@Tombar has moved it in the right direction with a public parse() method (Tombar@a280d9e).

Bug in Python module

I got the error:

>>> from referer_parser import Referer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\referer_parser\__init__.py", line 32, in <module>
    REFERERS = load_referers(JSON_FILE)
  File "C:\Python27\lib\site-packages\referer_parser\__init__.py", line 21, in load_referers
    params = list(map(text_type.lower, config['parameters']))
TypeError: descriptor 'lower' requires a 'unicode' object but received a 'str'

I fixed it temporarily by modifying the load_referers function in __init__.py:

if 'parameters' in config:
    # Coerce each parameter to text_type before lowercasing, so this
    # works whether the JSON loader returns str or unicode values
    params = [text_type(p).lower() for p in config['parameters']]

Python 2.7, OS: Windows 7

Add support for search engines that use subdomains for LB

Some search engines operate load balancing etc. on subdomains, leading to referers which can't be found in search.yml. For example, a Yahoo! referer URL might have the host "us.yhs4.search.yahoo.com".

The most performant way of supporting this is probably this algorithm:

  1. Lookup the full domain in search.yml. Found? Finish
  2. Not found? Strip off first sub-portion ("us."). Lookup. Found? Finish
  3. Not found? Strip off next sub-portion ("yhs4"). Lookup. Found? Finish
  4. Continue till found or no parts left!

This should be a lot faster than switching to a regexp-based approach.
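A minimal sketch of steps 1-4, with SEARCH standing in for the search section of the database:

```python
# Sketch of the stripping algorithm: look up the full host, then drop
# one leading sub-portion at a time until a match is found or nothing
# is left. SEARCH stands in for the search section of the database.
SEARCH = {"search.yahoo.com": "Yahoo!"}

def find_search_engine(host):
    parts = host.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in SEARCH:   # Steps 1-3: lookup. Found? Finish
            return SEARCH[candidate]
        parts.pop(0)              # Strip the next sub-portion
    return None                   # Step 4: no parts left
```

Each iteration is a single dict lookup, which is why this should beat a regexp scan over the whole database.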

Java: ParserTest failed

org.junit.ComparisonFailure: Internal subdomain HTTP medium
Expected: internal
Actual:   unknown

at org.junit.Assert.assertEquals(Assert.java:125)
at com.snowplowanalytics.refererparser.ParserTest.refererTests(ParserTest.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)

Update the PHP library with new internal domain tests

referer-tests.json now has two new tests for the custom internal domains functionality:
https://github.com/snowplow/referer-parser/blob/feature/json-tests/resources/referer-tests.json#L235-L248

The Referer-Parser is configured with a list of domains which should be counted as internal:
https://github.com/snowplow/referer-parser/blob/feature/json-tests/java-scala/src/test/scala/com/snowplowanalytics/refererparser/scala/JsonParseTest.scala#L41

The PHP version of the Referer-Parser already has support for internal hosts, so it should be possible to get it working with the new tests.

When done, please update sync_data.py to automatically copy the master copy of referer-tests.json into the PHP subfolder: https://github.com/snowplow/referer-parser/blob/23e3fd9f3bfaa8947fcb456ed8fbdb22f271dabc/sync_data.py#L58

Add test for domain + path search engine

A test to ensure that a google.com/product referer returns "Google Product Search" not "Google".

/cc @donspaulding as I wasn't sure that this was handled in the Python version.

The relevant line in the Ruby is:

https://github.com/snowplow/referer-parser/blob/master/ruby/lib/referer-parser/referers.rb#L28

The relevant line in the Java is:

https://github.com/snowplow/referer-parser/blob/master/java-scala/src/main/java/com/snowplowanalytics/refererparser/Parser.java#L78

Should return better result

Hi, all, please see this referer:
http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images

I've added the parameters as_q, as_epq and as_eq to my local referers.yml, and ran:

irb(main):002:0> require 'referer-parser'
=> true
irb(main):003:0> ref = "http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
=> "http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images"
irb(main):004:0> st = RefererParser::Referer.new(ref, 'referers.yml')
=> #<RefererParser::Referer:0x25feb227 @search_term="", @known=true, @referer="Google", @uri=#<URI::HTTP:0x746231ed URL:http://www.google.com/search?hl=en&as_q=&as_epq=carbonite+offer+code&as_oq=&as_eq=&num=100&lr=lang_en&as_filetype=&ft=i&as_sitesearch=&as_qdr=all&as_rights=&as_occt=any&cr=&as_nlo=&as_nhi=&safe=images>, @search_parameter="as_q">
irb(main):005:0> st.search_term
=> ""

I think that "carbonite offer code" is the better result.
https://github.com/snowplow/referer-parser/blob/master/ruby/lib/referer-parser/referer.rb#L85
If the referer contains two or more known parameters, I would prefer to return a search term that is not nil or empty, rather than the value of whichever parameter matches first.
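The proposed behaviour - return the first known parameter with a non-empty value - can be sketched in Python; extract_search_term is a hypothetical helper, and note that parse_qs drops empty values by default:

```python
from urllib.parse import urlparse, parse_qs

# Sketch of the proposed behaviour: return the first known parameter
# with a non-empty value, instead of whichever parameter matches first.
# The parameter list mirrors the reporter's local referers.yml additions.
def extract_search_term(url, parameters=("q", "as_q", "as_epq", "as_eq")):
    query = parse_qs(urlparse(url).query)  # empty values are dropped
    for param in parameters:
        values = query.get(param)
        if values and values[0]:
            return values[0]
    return None
```

On the referer above, as_q is empty so the non-empty as_epq value ("carbonite offer code") is returned instead.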

Update npm module.

Hi guys. Would you be so kind as to update the npm module? :3 It's a bit outdated ('0.0.2': '2013-08-16T20:29:36.351Z')

Query string not being extracted from below

Potentially because the &url= isn't being escaped? (Why isn't it being escaped?)

sa=t&rct=j&q=g+star&source=web&cd=3&ved=0CEEQFjAA&url=http://www.gstars.co.uk/?ito=GAG5362963510&itc=GAC19854885430&itkw=g-stars&itawnw=search&ei=8eMQUt_hAvTSpgLjqQE&usg=AFQjCNFFNpW7yF9pcqCfOpYvqafYS94p_Q
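For what it's worth, a plain stdlib query-string split does still recover q from this referer - the unescaped url= value is merely truncated at the next '&' - so the failure presumably lies elsewhere:

```python
from urllib.parse import parse_qs

# Quick check of what a plain query-string split yields for the referer
# above: the unescaped url= value is truncated at the next '&', but the
# q parameter itself is still recoverable.
qs = ("sa=t&rct=j&q=g+star&source=web&cd=3&ved=0CEEQFjAA"
      "&url=http://www.gstars.co.uk/?ito=GAG5362963510"
      "&itc=GAC19854885430&itkw=g-stars&itawnw=search"
      "&ei=8eMQUt_hAvTSpgLjqQE&usg=AFQjCNFFNpW7yF9pcqCfOpYvqafYS94p_Q")

params = parse_qs(qs)
```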

Add tracking of keyword ranks

Suggestion from Peter O'Neill based on the blog post A new method to track keyword ranking using Google Analytics.

On occasion, Google search exposes the position of the keyword that drove the click to your website as a cd= parameter in the page_referrer. We should extract this from the referrer_url so that it can be stored in SnowPlow, and used to track:

  1. The average position of keywords over periods of time. (Is search engine rank for particular terms getting better or worse?)
  2. Grouping keywords together around e.g. specific products and categories, and creating a performance index for those buckets. (As per the blog post.)

In order to implement this, we'll need to:

  • Extend the search engines YAML so it includes not just the query string parameter to identify the keywords, but also the parameter to identify the location of the search result. (Where this is available.)

Then in SnowPlow we'll need an additional mkt_xxx field to store the result, e.g. mkt_rank.
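A sketch of the extraction step, assuming the rank arrives as a cd= query parameter as described; extract_rank is a hypothetical helper name:

```python
from urllib.parse import urlparse, parse_qs

# Sketch: extract the keyword rank from the cd= parameter, when Google
# includes it in the referer. extract_rank is a hypothetical helper.
def extract_rank(referer_url):
    query = parse_qs(urlparse(referer_url).query)
    values = query.get("cd")
    if values and values[0].isdigit():
        return int(values[0])
    return None  # rank not exposed in this referer
```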

Strange Yahoo search data

I am seeing some strange Yahoo data showing up as "search".

A client ran a homepage takeover on Yahoo a few weeks back and sent a lot of traffic from the Yahoo homepage with the hostname "au.yahoo.com".

I know this isn't search traffic, so when I queried it in Snowplow, this hostname had no real keywords.

In contrast, the hostname "au.search.yahoo.com" had quite a few search terms.

Is this a case of not provided?

phar version?

Could someone with time create a phar version, or submit a pull request for one built using Box? This is above my head, but I cannot use the library without a phar... I would be extremely grateful if anyone has the time.

http://box-project.org/

Make JSON the standard format loaded by the libraries

Two reasons for this:

  1. JSON seems faster to load than YAML in most languages (see tobie/ua-parser#117)
  2. More languages have JSON handling built-in than YAML (e.g. Python)

This would involve updating the Java/Scala and Ruby ports, and making @donspaulding's build_json.py a standard part of the distribution for all ports.

Important note: we would still keep the master copy of the database in YAML for readability/editability purposes.
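Reason 2 in practice: loading the JSON form of the database needs only the standard library. The string below is a toy excerpt, not the real database:

```python
import json

# Reason 2 in practice: parsing the JSON database needs only the
# standard library. This string is a toy excerpt of the database.
REFERERS_JSON = """
{
  "search": {
    "Google": {
      "parameters": ["q", "p"],
      "domains": ["google.co.uk", "google.com"]
    }
  }
}
"""

referers = json.loads(REFERERS_JSON)
google = referers["search"]["Google"]
```

In Python 2 this avoids the PyYAML dependency entirely, since json ships with the interpreter.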
