
wayback-google-analytics's Introduction



Wayback Google Analytics

A lightweight tool to gather current and historic Google Analytics codes for OSINT investigations.

· Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Installation
  3. Usage
  4. Contributing
  5. Development
  6. License
  7. Contact
  8. Acknowledgments

About The Project

Wayback Google Analytics is a lightweight tool that gathers current and historic Google Analytics data (UA, GA and GTM codes) from a collection of website URLs.

Read Bellingcat's article about using this tool to uncover disinformation networks online here.

Why do I need GA codes?

Google Analytics codes are a useful data point when examining relationships between websites. If two seemingly disparate websites share the same UA, GA or GTM code then there is a good chance that they are managed by the same individual or group. This useful breadcrumb has been used by researchers and journalists in OSINT investigations regularly over the last decade, but a recent change in how Google handles its analytics codes threatens to limit its effectiveness. Google began phasing out UA codes as part of its Google Analytics 4 upgrade in July 2023, making it significantly more challenging to use this breadcrumb during investigations.

How does this tool help me?

Luckily, the Internet Archive's Wayback Machine contains useful snapshots of websites containing their historic GA IDs. While you could feasibly check each snapshot manually, this tool automates that work with the Wayback Machine's CDX API to simplify and speed up the process. Enter a list of URLs and a time frame (along with extra, optional parameters) to collect current and historic GA, UA and GTM codes and return them in a format of your choosing (json, txt, xlsx or csv).
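A rough idea of the CDX query the tool automates for each URL (a minimal sketch using the requests library, not the tool's own code; the parameters shown follow the public CDX API):

    import requests

    CDX_URL = "http://web.archive.org/cdx/search/cdx"

    params = {
        "url": "someurl.com",
        "matchType": "domain",
        "filter": "statuscode:200",  # only successful captures
        "fl": "timestamp",           # return snapshot timestamps only
        "output": "json",
        "from": "20121001000000",    # time range as YYYYMMDDhhmmss
        "to": "20121025000000",
        "limit": "-100",             # negative limit = most recent 100 snapshots
    }

    rows = requests.get(CDX_URL, params=params, timeout=30).json()
    timestamps = [row[0] for row in rows[1:]]  # skip the header row
    print(timestamps)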

The raw JSON output for each provided URL looks something like this:

        "someurl.com": {
            "current_UA_code": "UA-12345678-1",
            "current_GA_code": "G-1234567890",
            "current_GTM_code": "GTM-12345678",
            "archived_UA_codes": {
                "UA-12345678-1": {
                    "first_seen": "01/01/2019(12:30)",
                    "last_seen": "03/10/2020(00:00)",
                },
            },
            "archived_GA_codes": {
                "G-1234567890": {
                    "first_seen": "01/01/2019(12:30)",
                    "last_seen": "01/01/2019(12:30)",
                }
            },
            "archived_GTM_codes": {
                "GTM-12345678": {
                    "first_seen": "01/01/2019(12:30)",
                    "last_seen": "01/01/2019(12:30)",
                },
        },
    }

Further reading

(back to top)

Built With

Python Pandas

Additional libraries/tools: BeautifulSoup4, Asyncio, Aiohttp
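For illustration, pulling analytics IDs out of a page's HTML boils down to pattern matching. A simplified sketch (the tool's own extraction logic, which uses BeautifulSoup4, may differ, and the patterns here are approximations):

    import re

    # Approximate patterns for the three ID formats (assumed, not exhaustive).
    UA_RE = re.compile(r"UA-\d{4,10}-\d{1,4}")
    GA_RE = re.compile(r"G-[A-Z0-9]{6,12}")
    GTM_RE = re.compile(r"GTM-[A-Z0-9]{4,9}")

    def extract_codes(html: str) -> dict:
        """Return the unique UA/GA/GTM codes found in an HTML document."""
        return {
            "UA_codes": sorted(set(UA_RE.findall(html))),
            "GA_codes": sorted(set(GA_RE.findall(html))),
            "GTM_codes": sorted(set(GTM_RE.findall(html))),
        }

    print(extract_codes('<script>gtag("config", "UA-12345678-1");</script>'))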

(back to top)

Installation

Install from PyPI (with pip)


The easiest way to install Wayback Google Analytics is from the command line with pip.

  1. Open a terminal window and navigate to your chosen directory.
  2. Create a virtual environment and activate it (optional, but recommended; if you use Poetry or pipenv those package managers do it for you)
    python3 -m venv venv
    source venv/bin/activate
    
  3. Install the project with pip
    pip install wayback-google-analytics
    
  4. Get a high-level overview
    wayback-google-analytics -h
    

Download from source

You can also clone and download the repo from GitHub and use the tool locally.

  1. Clone repo:

git clone git@github.com:bellingcat/wayback-google-analytics.git
    
  2. Navigate to root, create a venv and install requirements.txt:

    cd wayback-google-analytics
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
    
  3. Get a high-level overview:

    python -m wayback_google_analytics.main -h
    

(back to top)

Usage

Getting started

  1. Enter a list of URLs manually through the command line using --urls (-u) or from a given file using --input_file (-i).

  2. Specify your output format (.csv, .txt, .json or .xlsx) using --output (-o).

  3. Add any of the following options:

Options list (run wayback-google-analytics -h to see in terminal):

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        Enter a file path to a list of urls in a readable file type
                        (e.g. .txt, .csv, .md)
  -u URLS [URLS ...], --urls URLS [URLS ...]
                        Enter a list of urls separated by spaces to get their UA/GA
                        codes (e.g. --urls https://www.google.com
                        https://www.facebook.com)
  -o {csv,txt,json,xlsx}, --output {csv,txt,json,xlsx}
                        Enter an output type to write results to file. Defaults to
                        json.
  -s START_DATE, --start_date START_DATE
                        Start date for time range (dd/mm/YYYY:HH:MM) Defaults to
                        01/10/2012:00:00, when UA codes were adopted.
  -e END_DATE, --end_date END_DATE
                        End date for time range (dd/mm/YYYY:HH:MM). Defaults to None.
  -f {yearly,monthly,daily,hourly}, --frequency {yearly,monthly,daily,hourly}
                        Can limit snapshots to remove duplicates (1 per hr, day, month,
                        etc). Defaults to None.
  -l LIMIT, --limit LIMIT
                        Limits number of snapshots returned. Defaults to -100 (most
                        recent 100 snapshots).
  -sc, --skip_current   Add this flag to skip current UA/GA codes when getting archived
                        codes.

Examples:

To get current codes for two websites and archived codes between Oct 1, 2012 and Oct 25, 2012: wayback-google-analytics --urls https://someurl.com https://otherurl.org --output json --start_date 01/10/2012 --end_date 25/10/2012 --frequency hourly

To get current codes for a list of websites (from a file) from January 1, 2012 to the present day, checking for snapshots monthly and returning the results as an Excel spreadsheet: wayback-google-analytics --input_file path/to/file.txt --output xlsx --start_date 01/01/2012 --frequency monthly

To check a single website for its current codes plus codes from the last 2,000 archive.org snapshots: wayback-google-analytics --urls https://someurl.com --limit -2000

Output files & spreadsheets

Wayback Google Analytics allows you to export your findings to either .csv or .xlsx spreadsheets. When choosing to save your findings as a spreadsheet, the tool generates two tables: one where each URL is the primary index and another where each identified code is the primary index. In an .xlsx file this is one spreadsheet with two sheets, while the .csv option generates one file sorted by codes and another sorted by websites. All output files can be found in /output, which is created in the directory from which the code is executed.
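A rough sketch of how the two views can land in a single .xlsx file (assumed structure and hypothetical sheet names, not the tool's actual output code; pandas needs the openpyxl engine for .xlsx):

    import os
    import pandas as pd

    os.makedirs("output", exist_ok=True)

    # Hypothetical example rows: one table indexed by URL, one by code.
    urls_df = pd.DataFrame([
        {"url": "someurl.com", "code": "UA-12345678-1", "first_seen": "01/01/2019(12:30)"},
    ])
    codes_df = pd.DataFrame([
        {"code": "UA-12345678-1", "url": "someurl.com", "first_seen": "01/01/2019(12:30)"},
    ])

    with pd.ExcelWriter("output/results.xlsx") as writer:
        urls_df.to_excel(writer, sheet_name="by_website", index=False)
        codes_df.to_excel(writer, sheet_name="by_code", index=False)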

Example spreadsheet

Let's say we're looking into data from 4 websites from 2015 until the present and we want to save what we find in an Excel spreadsheet. Our start command looks something like this:

wayback-google-analytics -u https://yapatriot.ru https://zanogu.com https://whoswho.com.ua https://adamants.ru -s 01/01/2015 -f yearly -o xlsx

The result is a single .xlsx file with two sheets.

Ordered by website:

Ordered by code:

(back to top)

Limitations & Rate Limits

We recommend limiting your list of URLs to ~10 and your max snapshot limit to <500 per query. While Wayback Google Analytics doesn't have any hardcoded limitations on how many URLs or snapshots you can request, large queries can cause 443 errors (rate limiting). Being rate limited can result in a temporary 5-10 minute ban from web.archive.org and the CDX API.

The app currently uses asyncio.Semaphore() along with delays between requests, but large queries or long-running operations can still result in a 443 error. Use your judgment and break large queries into smaller, more manageable pieces if you find yourself getting rate limited.
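For reference, the throttling pattern looks roughly like this (an illustrative sketch, not the tool's exact code; the concurrency cap and delay are assumptions):

    import asyncio
    import aiohttp

    async def fetch(session, semaphore, url):
        async with semaphore:
            async with session.get(url) as resp:
                body = await resp.text()
            await asyncio.sleep(0.5)  # brief pause to stay under rate limits
            return body

    async def fetch_all(urls, max_concurrent=10):
        semaphore = asyncio.Semaphore(max_concurrent)  # at most 10 requests in flight
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))

    # results = asyncio.run(fetch_all(["https://web.archive.org/web/2019/https://someurl.com"]))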

(back to top)

Contributing

Bugs and feature requests

Please feel free to open an issue should you encounter any bugs or have suggestions for new features or improvements. You can also reach out to me directly with suggestions or thoughts.

(back to top)

Development

Testing

  • Run tests with python -m unittest discover
  • Check coverage with coverage run -m unittest

Using Poetry for Development

Wayback Google Analytics uses Poetry, a Python dependency management and packaging tool. A GitHub workflow automates the tests on PRs and pushes to main (see our workflow here); be sure to update the semantic version number in pyproject.toml when opening a PR.

If you have push access, follow these steps to trigger the GitHub workflow that builds and releases a new version to PyPI:

  1. Change the version number in pyproject.toml
  2. Create a new tag for that version: git tag "vX.0.0"
  3. Push the tag: git push --tags

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

You can contact me through email or social media.

Project Link: https://github.com/bellingcat/wayback-google-analytics

(back to top)

Acknowledgments

  • Bellingcat for hosting this project
  • Miguel Ramalho for constant support, thoughtful code reviews and suggesting the original idea for this project

(back to top)


wayback-google-analytics's Issues

Connection error if url list > ~10

Overview

The project uses aiohttp and asyncio to gather and run large numbers of tasks asynchronously. This is great for running 10 or so URLs as it is lightning fast, but large numbers of URLs can still cause us to get rate limited and see an aiohttp.client_exceptions.ClientConnectionError error.

Solution

The easiest and probably best solution is to add a semaphore to limit concurrent requests. This actually already happens when processing URLs (when requesting snapshots from the CDX URL, for example) but there's no limit at a higher level. The final semaphore limit should probably be set to somewhere between 10 and 20 and will take a bit of tweaking to find the balance between performance and not getting rate limited.

Module pandas is not mentioned in requirements.txt

Not a big issue, just a slight inconvenience:
    Traceback (most recent call last):
      File "/home/langley/SOFT/osint-google-analytics/main.py", line 14, in <module>
        from osint_ga.output import (
      File "/home/langley/SOFT/osint-google-analytics/osint_ga/output.py", line 4, in <module>
        import pandas as pd
    ModuleNotFoundError: No module named 'pandas'

bug: KeyError: 'current_UA_code'

wayback_google_analytics/output.py", line 101, in get_urls_df
    "UA_Code": info["current_UA_code"],
KeyError: 'current_UA_code'

This is due, I believe, to this code section where the keys are assigned, but only if the html variable is not None.

One simple solution is to use dict.get instead of dict[] for the optionally present keys, e.g. info.get("current_UA_code", ""), but I wonder if there's some missing logic between the two different parts and whether there's a better way to address it.

Unable to retrieve any codes

Hello, the tool only occasionally retrieves relevant codes from web.archive.org; otherwise the output is empty:

wayback-google-analytics -u https://yapatriot.ru https://zanogu.com https://whoswho.com.ua https://adamants.ru -s 01/01/2015 -f yearly -o xlsx

Gave me this:

[{'https://yapatriot.ru': {'archived_UA_codes': {'UA-65087228-1': {'first_seen': '20/01/2017:03:55', 'last_seen': '30/06/2019:05:32'}, 'UA-53176102-14': {'first_seen': '15/06/2015:19:36', 'last_seen': '15/06/2015:19:36'}}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://zanogu.com': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://whoswho.com.ua': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://adamants.ru': {'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}]

And this:

[{'https://yapatriot.ru': {'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://zanogu.com': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://whoswho.com.ua': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://adamants.ru': {'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}]

on two separate runs. I was able to retrieve UA codes for zanogu.com and none of the others on another try. No error messages are displayed in the process; I ran the tool in two virtual environments and on a remote machine.

feat: export findings to GraphML extension

Overview

Several investigations focused on Google Analytics IDs have visualized the relationships between UA codes and websites. Lawrence Alexander's 2015 Bellingcat piece comes to mind as a great example of this: he uses GraphML files and the yEd graph editor to neatly display the connections between websites.

It would be great to add support for the GraphML filetype to the project and allow users to export their findings to .graphml.

Solution

We would need to add .graphml as an export option and then write the appropriate code to generate the file. networkx is a popular Python library for building .graphml files and might be a good choice for this feature.

While building out this feature it might be a good idea to group UA codes together even if the final digit doesn't match. For example: UA-12345-1 and UA-12345-2 are currently treated as different codes, but if we're mapping websites visually it's likely best to have UA-12345 as its own node, with each website sharing that part of the code connected to it. There is an example of this in the linked article above.
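A minimal sketch of what the export could look like with networkx (hypothetical example data; node attribute names are assumptions):

    import networkx as nx

    results = {
        "https://site-a.com": ["UA-12345-1"],
        "https://site-b.com": ["UA-12345-2"],  # same property, different suffix
    }

    G = nx.Graph()
    for url, ua_codes in results.items():
        G.add_node(url, kind="website")
        for code in ua_codes:
            root = code.rsplit("-", 1)[0]  # UA-12345-1 / UA-12345-2 -> UA-12345
            G.add_node(root, kind="ua_code")
            G.add_edge(url, root)

    nx.write_graphml(G, "findings.graphml")  # open in yEd or similar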

2 - cli-interface

Needs a CLI to be more accessible to the average user. Some considerations:

  • CLI should include options for reading URLs directly from the command line or reading from a text file of urls
  • Should output results in the command line as basic text, with an option to write the results to a CSV or text file.

Better solutions to web.archive.org rate limiting

Overview

The tool works best when given smaller requests of <10 URLs and a snapshot limit of <500. Currently, the asyncio library's built-in semaphore does an OK job of avoiding rate limiting when kept to these recommended parameters, but I wonder if there is a better or more dynamic way to deal with this issue. The issue does not appear to be with the CDX API itself, but rather with making numerous requests to web.archive.org when getting snapshots, which causes a temporary ban. All in all, I'm finding web.archive.org to be a bit unpredictable and cannot find consistent documentation for making requests to the site.

Possible solutions

Incorporating a library with exponential delays

There are some Python libraries like Backoff and aiohttp_retry that provide wrappers for dealing with rate limiting. I've messed around with both, but wasn't able to get large requests (>50 URLs + >1000 limit) to work reliably.

Custom solution

There might be a way to determine the best parameters based on the size of the request. Such a solution might dynamically generate a semaphore value, incorporate some kind of jitter between calls, or pause the operation and prompt the user to wait 5 minutes before resuming.
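A sketch of what the jitter idea could look like (untested against archive.org; the retry count and delay schedule are assumptions):

    import asyncio
    import random
    import aiohttp

    async def fetch_with_backoff(session, url, retries=5):
        for attempt in range(retries):
            try:
                async with session.get(url) as resp:
                    return await resp.text()
            except aiohttp.ClientConnectionError:
                # 1s, 2s, 4s, ... plus up to 1s of random jitter
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError(f"Giving up on {url} after {retries} attempts")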

1 - async requests

The program could run significantly faster if requests to the Wayback API were async. Library options include aiohttp + asyncio.

Output File Naming Issue on Windows Leads to OSError

When attempting to run wayback-google-analytics with the --output json option on Windows, the program fails during the output file generation phase, raising an OSError related to an invalid argument in the file path. This issue appears to be caused by the inclusion of characters in the automatically generated file name that are not allowed in Windows file paths (e.g., :).

Steps to Reproduce:

Execute the command with the following parameters:
wayback-google-analytics --urls https://someurl.com https://otherurl.org --output json --start_date 01/10/2012 --end_date 25/10/2012 --frequency hourly

The command fails after processing with an OSError regarding an invalid file name argument.

The execution fails with the following error:
OSError: [Errno 22] Invalid argument: './output\\16-02-2024(15:38:18).json'
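One possible mitigation (a sketch, not an official fix) is to build the file name from a timestamp format that avoids characters Windows forbids in paths, such as ":":

    from datetime import datetime

    def safe_output_name(ext: str = "json") -> str:
        # e.g. "16-02-2024(15-38-18).json" rather than "16-02-2024(15:38:18).json"
        stamp = datetime.now().strftime("%d-%m-%Y(%H-%M-%S)")
        return f"{stamp}.{ext}"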

Additional Information:

  • Operating System: Windows
  • Python Version: 3.11.8
  • Tool Version: 0.2.2

Ciao!

start and end parameters often result in errors

Hey! First of all: great tool - this can be super useful!

I've been dabbling around with it for a bit, but it still throws a lot of errors. I especially encounter a couple of issues with the start and end parameters, e.g.:

This one works:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -s 01/01/2015 -f yearly -o xlsx

While this one throws an error:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -e 01/01/2015 -f yearly -o xlsx

Also, this one doesn't work:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -s 01/01/2012 -e 01/01/2015 -f yearly -o xlsx

That's what I'm getting in the console:

wayback-google-analytics -u https://tagesschau.de https://washingtonpost.com https://nytimes.com https://spiegel.de -s 01/01/2012 -e 01/01/2015 -f yearly -o xlsx
Retrieving current codes for: https://tagesschau.de
Finished gathering current codes for: https://tagesschau.de
Retrieving archived codes for: https://tagesschau.de
CDX url: http://web.archive.org/cdx/search/cdx?url=https://tagesschau.de&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Timestamps from CDX api: ['20120101145508', '20130101134651', '20140101002220']
Retrieving codes from url: https://web.archive.org/web/20140101002220/https://tagesschau.de
Finish gathering codes for: https://web.archive.org/web/20140101002220/https://tagesschau.de
Retrieving codes from url: https://web.archive.org/web/20130101134651/https://tagesschau.de
Finish gathering codes for: https://web.archive.org/web/20130101134651/https://tagesschau.de
Retrieving codes from url: https://web.archive.org/web/20120101145508/https://tagesschau.de
Finish gathering codes for: https://web.archive.org/web/20120101145508/https://tagesschau.de
Finished retrieving archived codes for: https://tagesschau.de
Retrieving current codes for: https://nytimes.com
Finished gathering current codes for: https://nytimes.com
Retrieving archived codes for: https://nytimes.com
CDX url: http://web.archive.org/cdx/search/cdx?url=https://nytimes.com&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Timestamps from CDX api: ['20120101010418', '20130101011853', '20140101005350']
Retrieving current codes for: https://spiegel.de
Finished gathering current codes for: https://spiegel.de
Retrieving archived codes for: https://spiegel.de
CDX url: http://web.archive.org/cdx/search/cdx?url=https://spiegel.de&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Retrieving codes from url: https://web.archive.org/web/20140101005350/https://nytimes.com
Finish gathering codes for: https://web.archive.org/web/20140101005350/https://nytimes.com
Retrieving codes from url: https://web.archive.org/web/20130101011853/https://nytimes.com
Finish gathering codes for: https://web.archive.org/web/20130101011853/https://nytimes.com
Timestamps from CDX api: ['20120101003932', '20130101025925', '20140101051638', '20120102042211', '20130107072516']
Retrieving codes from url: https://web.archive.org/web/20130101025925/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20130101025925/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20140101051638/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20140101051638/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20120102042211/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20120102042211/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20120101010418/https://nytimes.com
Finish gathering codes for: https://web.archive.org/web/20120101010418/https://nytimes.com
Finished retrieving archived codes for: https://nytimes.com
Retrieving codes from url: https://web.archive.org/web/20130107072516/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20130107072516/https://spiegel.de
Retrieving codes from url: https://web.archive.org/web/20120101003932/https://spiegel.de
Finish gathering codes for: https://web.archive.org/web/20120101003932/https://spiegel.de
Finished retrieving archived codes for: https://spiegel.de
Retrieving current codes for: https://washingtonpost.com
Finished gathering current codes for: https://washingtonpost.com
Retrieving archived codes for: https://washingtonpost.com
CDX url: http://web.archive.org/cdx/search/cdx?url=https://washingtonpost.com&matchType=domain&filter=statuscode:200&fl=timestamp&output=JSON&collapse=timestamp:4&limit=5&from=20120101000000&to=20150101000000
Timestamps from CDX api: ['20120101003932', '20130101002535', '20140101000350', '20120229065623', '20130626071932']
Retrieving codes from url: https://web.archive.org/web/20130626071932/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20130626071932/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20130101002535/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20130101002535/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20140101000350/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20140101000350/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20120101003932/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20120101003932/https://washingtonpost.com
Retrieving codes from url: https://web.archive.org/web/20120229065623/https://washingtonpost.com
Finish gathering codes for: https://web.archive.org/web/20120229065623/https://washingtonpost.com
Finished retrieving archived codes for: https://washingtonpost.com
[{'https://tagesschau.de': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://washingtonpost.com': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://nytimes.com': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}, {'https://spiegel.de': {'current_UA_code': [], 'current_GA_code': [], 'current_GTM_code': [], 'archived_UA_codes': {}, 'archived_GA_codes': {}, 'archived_GTM_codes': {}}}]
Traceback (most recent call last):
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/bin/wayback-google-analytics", line 8, in <module>
    sys.exit(main_entrypoint())
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/main.py", line 182, in main_entrypoint
    asyncio.run(main(args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/main.py", line 100, in main
    write_output(output_file, args.output, results)
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/output.py", line 67, in write_output
    codes_df = get_codes_df(results)
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/wayback_google_analytics/output.py", line 183, in get_codes_df
    codes_df.groupby("code")
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/pandas/core/frame.py", line 8872, in groupby
    return DataFrameGroupBy(
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1274, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/mypath/wayback-google-analytics/wayback-google-analyticsEnvironment/lib/python3.10/site-packages/pandas/core/groupby/grouper.py", line 1009, in get_grouper
    raise KeyError(gpr)
KeyError: 'code'

Any idea what's going on? Many thanks!

6 - Sometimes inaccurate timestamps for archived codes

Overview

On occasion, the tool returns incorrect start/end dates for codes. An example:

{
        "https://syria.tv": {
            "current_UA_code": [
                "UA-216952176-1"
            ],
            "current_GA_code": [],
            "current_GTM_code": [
                "GTM-MSBSQ22"
            ],
            "archived_UA_codes": {
                "UA-47211704-1": {
                    "first_seen": "16/02/2015:03:46",
                    "last_seen": "13/01/2016:06:26"
                },
                "UA-97335575-1": {
                    "first_seen": "18/01/2022:16:36",
                    "last_seen": "10/01/2021:16:21"
                },
                "UA-216952176-1": {
                    "first_seen": "09/06/2023:08:46",
                    "last_seen": "09/06/2023:08:46"
                }
            },
            "archived_GA_codes": {},
            "archived_GTM_codes": {
                "GTM-MSBSQ22": {
                    "first_seen": "18/01/2022:16:36",
                    "last_seen": "01/01/2023:03:50"
                }
            }
        }
    },

In some places last_seen comes before first_seen, and there isn't really a logical progression from code to code.

Cause and possible solution

I think the main culprit here is that the results dictionary is being updated asynchronously. Since we're calling asyncio.gather() on a collection of tasks, the snapshots aren't processed in chronological order. Instead, a bunch of async calls take place and whichever finishes first sets "first_seen".

This could be solved by adding some extra logic that assumes the dates will arrive out of order and updates first/last seen based on the value of each date as it is processed.
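A sketch of that extra logic (field names follow the JSON output above; the helper name is hypothetical):

    from datetime import datetime

    FMT = "%d/%m/%Y:%H:%M"

    def update_seen(entry: dict, timestamp: str) -> None:
        """Update first_seen/last_seen no matter what order snapshots arrive in."""
        ts = datetime.strptime(timestamp, FMT)
        first = datetime.strptime(entry.get("first_seen", timestamp), FMT)
        last = datetime.strptime(entry.get("last_seen", timestamp), FMT)
        entry["first_seen"] = min(ts, first).strftime(FMT)
        entry["last_seen"] = max(ts, last).strftime(FMT)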

3 - Fix conflicts with params

Using certain params together for the CDX API causes errors.

  • Mixing 'limit' and 'frequency': breaks unless the limit approximately matches how many snapshots should be returned given a start date and a frequency. A solution could be to regenerate the limit from the start date and frequency, or to add an extra layer of sorting after snapshots are received to filter out unneeded timestamps (see the sketch after this list).
  • Asc/desc results: currently not implemented. Need to see how ordering interacts with limit and whether it is feasible.
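A sketch of the post-filter idea mentioned above (truncating the 14-digit Wayback timestamp, YYYYMMDDhhmmss, to keep one snapshot per period; the helper is hypothetical):

    PREFIX = {"yearly": 4, "monthly": 6, "daily": 8, "hourly": 10}

    def collapse(timestamps: list[str], frequency: str) -> list[str]:
        """Keep at most one snapshot per year/month/day/hour."""
        seen, kept = set(), []
        for ts in sorted(timestamps):
            key = ts[: PREFIX[frequency]]
            if key not in seen:
                seen.add(key)
                kept.append(ts)
        return kept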

feat: Complete README

I'd suggest adding these to the README file:

  • installation instructions (maybe some badges too, e.g. here)
  • development setup (or a link to info on how to set up a Poetry project)
  • any other cleanup like my addition of the deployment instructions
