opawg / user-agents

An open, platform-agnostic list of user-agent and referrer regexes for use in podcast analytics services

License: MIT License

Python 46.21% JavaScript 53.79%

user-agents's Introduction

THIS IS GOING AWAY

While this list is being kept updated, you should now be using user-agents v2. It's more performant, more regularly updated, and better for everyone.

User agent list

A list of apps, services and bots that consume podcast audio. This data is used by a number of podcast hosts to assist with their analytics.

One public example is this page at Podnews which uses this data alongside the RSS UA. We're aware that this data is used by a number of large podcast hosts and private podcasters too.

This page runs this data through a regex for 1,000 entries in OP3.

Contributing to the list

The simplest way is to add to the file at src/user-agents.json.

Each app, service or bot should have its own entry. The user_agents should be as exclusive as possible, to avoid multiple matches.

Each entry must contain the following property:

user_agents (array of strings): a list of regular expressions against which the requesting user-agent should be validated.

Take care that the file is correctly escaped.

Each entry can also contain any of the following properties:

bot (boolean): set to true when the requesting agent is a bot (no need to set to false otherwise).
app (string): set to the human-readable name of the app or service. We do not set this string if it's just a library or framework.
device (string): set to a slug of the device type, usually one of
- pc (meaning a desktop or laptop computer running Linux, macOS or Windows)
- phone
- radio (a smart radio)
- speaker (smart speaker)
- tablet
- watch
os (string): set to the slug of the operating system, usually one of
- android
- ios
- linux
- macos
- windows
examples (array of strings): a few different examples of the user-agent as seen in the wild. Caution should be taken to remove any personally identifying information.
description (string): intended to be a humanly readable description of the app, bot or other
info_url (string): a link to the homepage of the app, bot or other, for public consumption
svg (string): a name of a square SVG file, intended for use in app dashboards for identification purposes
developer_notes (string): freeform notes for developers, where it is helpful to leave notes on behaviour of certain useragents or bots.
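
As a rough illustration of the properties above, the sketch below defines one made-up entry and checks its keys against the lists in this README. The app name, the pattern and the key_problems helper are hypothetical and are not part of the repository.

```python
import json

# Property names taken from the lists above.
REQUIRED_KEYS = {"user_agents"}
OPTIONAL_KEYS = {
    "bot", "app", "device", "os", "examples",
    "description", "info_url", "svg", "developer_notes",
}

# A made-up entry in the shape described above. In the JSON file a regex
# backslash is doubled, so "\\d" decodes to the regex token \d.
EXAMPLE_JSON = r"""
{
    "user_agents": ["^ExamplePlayer/\\d+"],
    "app": "Example Player",
    "device": "phone",
    "os": "android",
    "examples": ["ExamplePlayer/12 (Android 13)"],
    "info_url": "https://example.com/player"
}
"""


def key_problems(entry):
    """Report required keys that are missing and keys the README does not list."""
    keys = set(entry)
    return {
        "missing": sorted(REQUIRED_KEYS - keys),
        "unknown": sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS),
    }


entry = json.loads(EXAMPLE_JSON)
print(key_problems(entry))
print(key_problems({"user-agents": ["^Example"]}))
```

With this check, a misspelt key such as user-agents would show up twice: once as a missing user_agents and once as an unknown key.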

Slugs

A slug is a lowercase alphanumeric (ASCII) representation of a string, consisting only of numbers, letters and, in our case, underscores. It's up to apps that implement the list to display this information however they see fit, and using a slug is better for disambiguation.
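
The slug rule above can be expressed as a small check. This is only a sketch: the is_slug and to_slug helpers are hypothetical names, and the way to_slug collapses other characters into underscores is an assumption rather than a rule from this list.

```python
import re

# Lowercase ASCII letters, digits and underscores only, as described above.
SLUG_RE = re.compile(r"[a-z0-9_]+")


def is_slug(value):
    """True if the whole string is made of slug characters."""
    return SLUG_RE.fullmatch(value) is not None


def to_slug(label):
    """Assumed conversion: lowercase, then turn other character runs into '_'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


print(is_slug("macos"), is_slug("macOS"))
print(to_slug("Smart Speaker"))
```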

Unknowns

It is proposed that we only specify a property above when it is known (not assumed). For example, it's often difficult to know whether an Android app is running on a phone or a tablet. We can assume that since Android tablets are rarer, almost all requests will be via Android phones, but we can't know that.

Parsing order

Multiple matches should ideally not happen for anything that has an app name, so parsing order shouldn't matter. For devices and OS, you may discover that multiple matches will give you more accurate data, but you should hopefully only see one app name.
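
One possible way to act on this is to collect every matching entry rather than stopping at the first. The sketch below assumes two made-up entries and a hypothetical classify helper. It keeps the first known device and os values, and reports app names only when exactly one is found.

```python
import re

# Hypothetical entries: one app-specific, one that only identifies the OS.
ENTRIES = [
    {"user_agents": [r"^ExamplePlayer/\d+"], "app": "Example Player"},
    {"user_agents": [r"Android"], "os": "android"},
]


def classify(entries, user_agent):
    """Merge every matching entry; flag the result if app names disagree."""
    apps = []
    result = {"app": None, "device": None, "os": None}
    for entry in entries:
        if not any(re.search(p, user_agent) for p in entry["user_agents"]):
            continue
        if "app" in entry and entry["app"] not in apps:
            apps.append(entry["app"])
        for key in ("device", "os"):
            # Keep the first known value; later matches only fill gaps.
            if result[key] is None and key in entry:
                result[key] = entry[key]
    if len(apps) == 1:
        result["app"] = apps[0]
    result["conflicting_apps"] = apps if len(apps) > 1 else []
    return result


info = classify(ENTRIES, "ExamplePlayer/12 (Android 13)")
print(info)
```

A non-empty conflicting_apps list would suggest that two patterns are not exclusive enough and one of them needs tightening.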

Testing

The /src folder contains a subfolder /tests with unit tests per programming language. Unit tests should try to compile all the regular expressions. In case of failure, the problematic regular expressions should be fixed before pushing the changes.

python

# Running tests with pytest
pytest

user-agents's Issues

Spotify bot regex may filter valid requests

Hello,
It appears that the Spotify bot regex, ^Spotify/\\d+, is not restrictive enough and may match other Spotify user agents that are not related to bots (see here for a wide list).
Can we update the regex to match only user agents of the form Spotify/1.0, i.e. ^Spotify/\\d+\\.\\d+$ ?
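
To make the difference concrete, here is a small comparison of the two patterns in their JSON-decoded form. The second user agent is a made-up, app-style string used only for illustration; it is not taken from the list referred to above.

```python
import re

current = re.compile(r"^Spotify/\d+")           # pattern quoted in this issue
proposed = re.compile(r"^Spotify/\d+\.\d+$")    # proposed replacement

bot_ua = "Spotify/1.0"
# Hypothetical app-style user agent, used only to show the difference.
app_ua = "Spotify/8.6.72 Android/29 (SM-G960F)"

for name, pattern in (("current", current), ("proposed", proposed)):
    print(name, bool(pattern.search(bot_ua)), bool(pattern.search(app_ua)))
```

Under these assumptions the current pattern matches both strings, while the proposed one matches only Spotify/1.0.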

Wrong format for user_agents key

This record has the wrong key: it has user-agents instead of user_agents.

user-agents/src/user-agents.json

Line 1604 in c5f3ad3

"user-agents": [

Quantifier at the beginning of the regex

I think this is invalid: a quantifier should not be at the beginning of a regex.

user-agents/src/user-agents.json

Line 636 in dcd5d78

"*GSA/",

Automatically combine separate user-agent JSON files into one file when a pull request is made, with versioning

I noticed this in the README:

To stop the list becoming unwieldy, in the future it may be possible to separate out the apps into separate files, that are then combined together automatically.

I thought it sounded fun and fairly simple to implement with GitHub Actions, so I tried building a proof of concept. The proposed process for adding to user-agents.json works as follows.

Shorter explanation:

- Add user-agent objects into src/organizations// as separate JSON files.
- Create a PR to merge the branch containing your JSON files into opawg/user-agents#master.
- The GitHub Action should take care of everything else, ultimately resulting in a combined JSON file at dist/user-agents.json :)

Longer explanation:

- User-agent objects are added into a src/organizations/ directory. (examples)
- When a PR is made to merge your changes into opawg/user-agents, a GitHub Action runs the following steps:
  - The patch version in package.json is automatically incremented (1.0.0 becomes 1.0.1).
  - The combine-jsons command is run from package.json. It searches the src/organizations directory for all JSON files and combines them into the array in user-agents.json, in alphabetical order by organization and sorted by a new priority field within each organization.
  - The new combined user-agents.json file is saved to dist/user-agents.json (the latest version) and to dist/archives/<package.json version number>/user-agents.json.
  - A corresponding user-agents.yaml file is generated and saved to dist/user-agents.yaml and dist/archives/<package.json version number>/user-agents.yaml.
  - The JSON output files in dist are validated using the validate-json-action GitHub Action.
  - If all of the previous steps succeed, the GitHub Action automatically pushes the new JSON and YAML files in dist to the branch.
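
The version bump and the combine step could be sketched in Python as below. This is not the proof-of-concept code, which runs through package.json. The function names are hypothetical, and two details are assumptions: lower priority values sort first, and entries without a priority are treated as 0.

```python
def bump_patch(version):
    """Increment the patch part of a major.minor.patch version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def combine(organizations):
    """Flatten {organization name: [entries]} into one list.

    Organizations are taken in alphabetical order (case-insensitive).
    Within each one, entries are sorted by their priority field.
    Assumption: lower priority first, missing priority counts as 0.
    """
    combined = []
    for name in sorted(organizations, key=str.lower):
        entries = organizations[name]
        combined.extend(sorted(entries, key=lambda e: e.get("priority", 0)))
    return combined


# Made-up input standing in for the per-organization JSON files.
sample = {
    "beta": [{"app": "Beta Player"}],
    "Alpha": [
        {"app": "Alpha Web", "priority": 2},
        {"app": "Alpha App", "priority": 1},
    ],
}

print(bump_patch("1.0.0"))
print([entry["app"] for entry in combine(sample)])
```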

Anyway, all of the steps above can be changed or optimized, I just wanted to get something that accomplishes automatically combining separate files into a single file, with versioning history in case it helps you get started.

If you'd like to see the code, note that the commit history for the PR is far too messy, so I don't propose merging it in as is. But if this is a direction that you think would be helpful for opawg/user-agents, I would be happy to create a new, cleaner PR with the changes you would like.

Proof-of-Concept PR: https://github.com/podverse/user-agents/pull/1/files

Sample dist folder: https://github.com/podverse/user-agents/tree/autoCombineJSONs

Sample organizations folder: https://github.com/podverse/user-agents/tree/autoCombineJSONs/src/organizations

Thanks for taking the lead on the user agents initiative! It's amazing how just a few lines of code change to podcast apps can make such a big improvement for the podcast ecosystem. Please let me know if there is more I can do to help.

doubleTwist has wrong app name

In the JSON:

    {
        "user_agents": [
            "^doubleTwist CloudPlayer"
        ],
        "examples": [
            "doubleTwist CloudPlayer"
        ],
        "app": "doubleTwitch CloudPlayer",
        "device": "phone",
        "info_url": "https://www.doubletwist.com/cloudplayer",
        "os": "android"
    },

You probably want app to be doubleTwist CloudPlayer.

YAML file is out of date

Unless this file can be updated automatically, I'd like to propose deleting it. It's a poor advertisement for this repo.

Feature request: human detail and information

In order for this data to be user-friendly, I'd like to suggest some additional fields for the JSON, to be available to be used in user dashboards.

My suggestions are (all optional):

description: a user-facing description to help explain what this device is and how it works
infourl: a user-facing link to allow people to go and find more information about this app. This is intended to be a homepage or other similar page
developernotes: developer-facing notes of interest. Not intended to be viewed by users.
svg: a square SVG icon of the app/device, for use in dashboards and others

A suggested example is in the screenshot below.

I might note that "app:" in the current specification is, presumably, intended to be user-facing.

Automated tests

Any interest in adding more tests to this project?

I hacked together jdelStrother@995af44 that just checks that all the examples listed in the json match one of the user_agents regexes.
(half-a-dozen or so seem to have bad example UAs: https://github.com/jdelStrother/user-agents/actions/runs/3439700273/jobs/5737297616).

I didn't bother adding an actual test framework (eg jest) because I struggled to think of other tests I wanted to add, but I could add one if we thought it might expand in usage.
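
For comparison, the same kind of check could be sketched in Python. This is not the linked commit. The unmatched_examples helper is a hypothetical name, and it assumes each example should match a pattern from its own entry.

```python
import re


def unmatched_examples(entries):
    """Return (app, example) pairs where no pattern in the entry matches the example."""
    failures = []
    for entry in entries:
        patterns = [re.compile(p) for p in entry.get("user_agents", [])]
        for example in entry.get("examples", []):
            if not any(p.search(example) for p in patterns):
                failures.append((entry.get("app"), example))
    return failures


entries = [
    # Entry quoted elsewhere on this page.
    {
        "user_agents": ["^doubleTwist CloudPlayer"],
        "examples": ["doubleTwist CloudPlayer"],
        "app": "doubleTwist CloudPlayer",
    },
    # Made-up entry whose example does not match its pattern.
    {"user_agents": ["^Foo/"], "examples": ["Bar/1"], "app": "Foo"},
]

print(unmatched_examples(entries))
```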

feature: $schema keyword

Firstly, thank you for the fine work!

The $schema keyword is used to declare which dialect of JSON Schema the schema was written for. The value of the $schema keyword is also the identifier for a schema that can be used to verify that the schema is valid according to the dialect $schema identifies.

What do you think of adding the $schema key to user-agents.json? Its value could reference a local schema or, better yet, point to a JSON Schema Store URL, such as https://json.schemastore.org/github-workflow-template-properties.json (for more information, see schema repositories).

VSCode honors the $schema keyword, and many schema validators could be used.

I could open a pull request with the change, but I would like the maintainers' opinion on whether the schema should be published to Schema Store and referenced by a global URL, or referenced locally.

Bad escaping for numerics

Heya - as of 0359556, it seems like we lost most/all of the use of \d+ to represent a number

As an example, before that commit, one of the Apple Podcasts' useragent matchers was

    "user_agents": [
         "^Podcasts/.*\\d$",
         "^Balados/.*\\d$",
         "^Podcasti/.*\\d$",
         "^Podcastit/.*\\d$",
         "^Podcasturi/.*\\d$",
         "^Podcasty/.*\\d$",
         "^Podcast’ler/.*\\d$",
         "^Podkaster/.*\\d$",
         "^Podcaster/.*\\d$",
         "^Podcastok/.*\\d$",
         "^Подкасти/.*\\d$",
         "^Подкасты/.*\\d$",
         "^פודקאסטים/.*\\d$",
         "^البودكاست/.*\\d$",
         "^पॉडकास्ट/.*\\d$",
         "^พ็อดคาสท์/.*\\d$",
         "^%E6%92%AD%E5%AE%A2/.*\\d$",
         "^播客/.*\\d$",
         "^팟캐스트/.*\\d$"
     ],

the same matcher is now:

    "user_agents": [
      "^Podcasts\/.*d$",
      "^Balados\/.*d$",
      "^Podcasti\/.*d$",
      "^Podcastit\/.*d$",
      "^Podcasturi\/.*d$",
      "^Podcasty\/.*d$",
      "^Podcast\u2019ler\/.*d$",
      "^Podkaster\/.*d$",
      "^Podcaster\/.*d$",
      "^Podcastok\/.*d$",
      "^\u041f\u043e\u0434\u043a\u0430\u0441\u0442\u0438\/.*d$",
      "^\u041f\u043e\u0434\u043a\u0430\u0441\u0442\u044b\/.*d$",
      "^\u05e4\u05d5\u05d3\u05e7\u05d0\u05e1\u05d8\u05d9\u05dd\/.*d$",
      "^\u0627\u0644\u0628\u0648\u062f\u0643\u0627\u0633\u062a\/.*d$",
      "^\u092a\u0949\u0921\u0915\u093e\u0938\u094d\u091f\/.*d$",
      "^\u0e1e\u0e47\u0e2d\u0e14\u0e04\u0e32\u0e2a\u0e17\u0e4c\/.*d$",
      "^%E6%92%AD%E5%AE%A2\/.*d$",
      "^\u64ad\u5ba2\/.*d$",
      "^\ud31f\uce90\uc2a4\ud2b8\/.*d$"
      ],

eg "^Podcasts\/.*d$" only matches user agents ending with a "d", not a numeric.

I'd also suggest the \/ in there is a bit weird - it's harmless, but in JSON "/" and "\/" are equivalent, AFAIK.
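
A quick way to see the effect is to decode both stored forms with a JSON parser and try them against a user agent. The user agent below is a made-up string that ends in a digit; the two patterns are the first lines of the before and after lists above.

```python
import json
import re

before = json.loads(r'"^Podcasts/.*\\d$"')   # as stored before the commit
after = json.loads(r'"^Podcasts\/.*d$"')     # as stored after the commit

# Hypothetical user agent that ends in a digit.
ua = "Podcasts/1430.46 CFNetwork/1125.2 Darwin/19.4.0"

print(before)                                # regex text after JSON decoding
print(after)
print(bool(re.search(before, ua)))           # digit at the end is required
print(bool(re.search(after, ua)))            # a literal "d" at the end is required
print(json.loads(r'"\/"') == json.loads('"/"'))
```

The last line compares the two spellings of the slash after decoding, which is the equivalence mentioned above.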

Feature request: Sample strings

Would be great to have samples of the user agents for each of the regex sections. That would simplify writing tests for the implementation on different platforms.
It would also be helpful in the future when writing updates for missing or new versions, since you can compare the newly reported user-agent with the existing one and either adjust the rule to cover both or add an additional one.

One non-valid Regex

From my error logs, there's one non-valid regex pattern in the current JSON. I'm not clever enough to quite work out where. I'll continue that work, but just flagging this as an error.

Feature Request: Version Numbers in Github

Friends,

Would it be possible to create releases/version numbers for when the list is updated? That would make it much easier to coordinate updates of gems, such as https://github.com/dan/podcast_agent_parser/ which relies on your fantastic list.

Thanks!

adding a GUID to each user agent record

We are looking at synchronizing our user agent list with this one and would find it very useful if there was a consistent unique identifier (GUID) associated with each record. Such an identifier would also be useful when the application name changes.

GUID: In this case would be a 128-bit integer number used to identify the user agent with a well-defined sequence of 32 hexadecimal digits grouped as 8-4-4-4-12

Example of the proposed addition to the first 2 records found in the json...

"guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",

    [
        {
            "guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",
            "user_agents": [
                "^Acast.+[Aa]ndroid"
            ],
            "app": "Acast",
            "device": "phone",
            "os": "android"
        },
        {
            "guid": "476757ae-28b4-47ed-94dd-753cf4832cdb",
            "user_agents": [
                "^Acast.+iOS"
            ],
            "app": "Acast",
            "device": "phone",
            "os": "ios"
        },
        ...
    ]

(also attached)

This would allow a developer to pull this json file, check the guid with what was saved to identify when a record should be updated vs inserted as new. The GUID could be used to match the title of the app and the regular expression used and if the "user_agent" or the "app" changed, then the application would know what record to update.

The result would require that all new user agents added include a unique guid. Thoughts?
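
The update-versus-insert logic described above might be sketched like this. The is_guid and sync helpers are hypothetical, the local store is just a dictionary keyed by guid, and the guid field itself is only the proposal in this issue, not an existing property of the list.

```python
import uuid


def is_guid(value):
    """True if value is a GUID written in the 8-4-4-4-12 hexadecimal form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def sync(local, upstream):
    """Apply upstream records to a {guid: record} store; count inserts and updates."""
    inserted = updated = 0
    for record in upstream:
        guid = record["guid"]
        if guid not in local:
            inserted += 1
        elif local[guid] != record:
            updated += 1
        local[guid] = record
    return {"inserted": inserted, "updated": updated}


upstream = [
    {"guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",
     "user_agents": ["^Acast.+[Aa]ndroid"], "app": "Acast", "os": "android"},
    {"guid": "476757ae-28b4-47ed-94dd-753cf4832cdb",
     "user_agents": ["^Acast.+iOS"], "app": "Acast", "os": "ios"},
]

store = {}
first = sync(store, upstream)

# Simulate a later release in which one record's app name has changed.
renamed = [dict(upstream[0], app="Acast (renamed)"), upstream[1]]
second = sync(store, renamed)
print(first, second)
```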

Thanks!

--Angelo

Incorrect user agent for Apple Podcasts

I noticed that Apple Podcasts on macOS Monterey uses the following user agent:

AppleCoreMedia/1.0.0.21G83 (Macintosh; U; Intel Mac OS X 12_5_1; en_gb)

In the list it states that AppleCoreMedia should not be treated as Apple Podcasts, but in this case it should. I am not sure at which point Apple changed the UA, but it means that platform detection does not work as expected.

Thanks!

duplicative of https://github.com/PRX/prx-podagent ?

Seems like maybe we should be working together?
https://github.com/PRX/prx-podagent

Feature request: referrers

Some apps and services use the referrer HTTP header, and this can also be useful information in terms of knowing which service has been used to play an episode.

Here are some of the ones I'm seeing, with the associated useragent.

Referrer	Useragent
https://www.gstatic.com/narrative_cast_receiver/receiver.html?feature=1	Mozilla/5.0%2520(X11;%2520Linux%2520armv7l)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/73.0.3683.47%2520Safari/537.36%2520CrKey/1.39.154941
https://breaker.audio	Breaker/iOS
https://co.radiocut.fm/podcast-episode/how-is-google-podcasts-doing-and/?replay=1	Mozilla/5.0%2520(Linux;%2520Android%25206.0.1;%2520Nexus%25205X%2520Build/MMB29P)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/41.0.2272.96%2520Mobile%2520Safari/537.36%2520(compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)
https://podcasts.apple.com/gb/podcast/podnews-podcasting-news/id1325018583	Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520Win64;%2520x64;%2520rv:67.0)%2520Gecko/20100101%2520Firefox/67.0
https://podcasts.google.com/	Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/74.0.3729.169%2520Safari/537.36
https://player.fm/series/podnews-podcasting-news/how-many-podcasts-are-no-longer-being-updated	Mozilla/5.0%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko;%2520compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)%2520Safari/537.36
http://pca.st/w6GI	Mozilla/5.0%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko;%2520compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)%2520Safari/537.36
https://ar.radiocut.fm/podcast-episode/audiobooks-pitted-against-movies/	Mozilla/5.0%2520(Linux;%2520Android%25206.0.1;%2520Nexus%25205X%2520Build/MMB29P)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/41.0.2272.96%2520Mobile%2520Safari/537.36%2520(compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)
https://podknife.com/tags/fitness-nutrition	Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/74.0.3729.169%2520Safari/537.36
https://www.gstatic.com/cast/sdk/default_receiver/1.0/app.html?skin=https://chromecast.pocketcasts.com/receiver.css	Mozilla/5.0%2520(X11;%2520Linux%2520armv7l)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/73.0.3683.47%2520Safari/537.36%2520CrKey/1.39.154941
https://tunein.com/radio/podnews-p1088271/?topicId=129761856	Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/75.0.3770.80%2520Safari/537.36

In the above, you'll spot plays from Google Podcasts (web); Apple Podcasts (web) and Player FM (web). You can see someone using PocketCasts to listen via their Chromecast; using TuneIn on the web, and a few others.

Some can be caught with the useragent (Breaker being an obvious example); some can be caught with a referrer alone (like Apple Podcasts web); and some can be caught by both (the Pocketcasts example).

My suggestion might be to add a "referrer" regex, to be used alongside the "useragent" regex, to allow us to correctly attribute these plays to an actual aggregator, rather than lazily attributing them to a browser.

And, yes, we should be catching the Googlebots here and marking them as a "bot".
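
As a sketch of how a referrer regex could sit alongside the useragent regex: the rules below are made up for illustration, and the referrers key and the rule_matches and attribute helpers are hypothetical names. The sketch assumes that when a rule has both lists, both must match.

```python
import re

# Made-up rules; "referrers" is a hypothetical key name for the proposal above.
RULES = [
    {"app": "Pocket Casts (Chromecast)",
     "user_agents": [r"CrKey/"],
     "referrers": [r"chromecast\.pocketcasts\.com"]},
    {"app": "Apple Podcasts (web)",
     "referrers": [r"^https://podcasts\.apple\.com/"]},
    {"app": "Breaker", "user_agents": [r"^Breaker/"]},
]


def rule_matches(rule, user_agent, referrer):
    """Every pattern list present in the rule must have at least one match."""
    checks = []
    if "user_agents" in rule:
        checks.append(any(re.search(p, user_agent) for p in rule["user_agents"]))
    if "referrers" in rule:
        checks.append(any(re.search(p, referrer) for p in rule["referrers"]))
    return bool(checks) and all(checks)


def attribute(user_agent, referrer=""):
    """Return the app of the first matching rule, or None."""
    for rule in RULES:
        if rule_matches(rule, user_agent, referrer):
            return rule["app"]
    return None


print(attribute(
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "https://podcasts.apple.com/gb/podcast/podnews-podcasting-news/id1325018583",
))
print(attribute("Breaker/iOS", "https://breaker.audio"))
```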

why escape backslashes?

It is confusing as to why backslashes \ are to be escaped. For example, the regexp ^AppleCoreMedia/1\\..*iPod doesn't match AppleCoreMedia/1.0.0.16G114 (iPod touch; U; CPU OS 12_4_2 like Mac OS X; en_us) (this is a user-agent/example pair from the provided json).

The requirement of escaping backslashes forces the user to de-escape them before testing the regexp. Is this the desired usage?
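
For what it is worth, the doubled backslash is JSON string escaping, so a JSON parser removes it when the file is loaded. A minimal sketch using the pair quoted above shows the difference between the text as written in the file and the decoded pattern:

```python
import json
import re

ua = ("AppleCoreMedia/1.0.0.16G114 "
      "(iPod touch; U; CPU OS 12_4_2 like Mac OS X; en_us)")

# The pattern exactly as it is written in the JSON file (two backslashes).
as_written = r"^AppleCoreMedia/1\\..*iPod"

# What a JSON parser hands back: one backslash, i.e. an escaped dot.
decoded = json.loads('"' + as_written + '"')

print(decoded)
print(bool(re.search(as_written, ua)))   # literal backslash expected, so no match
print(bool(re.search(decoded, ua)))      # escaped dot, so the example matches
```

So code that reads the list through a JSON parser should not need a separate de-escaping step; only patterns copied by hand from the raw file do.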

Filtering Apple Podcasts app watchOS user agent

We are working on implementing IAB's guidance to filter out downloads from Apple Podcasts app watchOS user agent documented here: https://iabtechlab.com/blog/apple-watch-os-podcast-filtering-guidance

We were hoping to match something like app == "Apple Podcasts" && os == "watchos". As of commit 4ac49c5, though, it seems this won't be possible, and there does not appear to be a definitive way to satisfy the IAB requirement using the OPAWG list. It allows identifying watchOS but not the Apple Podcasts app specifically.

I'm wondering if there is a more granular regex that would work or do you think this is beyond what is possible with user agents alone, at least at present?
