rehanvdm / serverless-website-analytics
A CDK construct that consists of a serverless backend, frontend and client side code to track website analytics
License: GNU General Public License v2.0
Create filters for all/most API calls, just like other platforms. I want to click on a page to narrow the stats that I am seeing, then continue clicking on another stat, say referrer, so these filters will be AND filters.
There will not be a time on page then; it will be 0, so no secondary request. But that is acceptable.
If it does not, Cognito thinks it is the wrong callback URL.
Need a logo and a favicon. Also add it to the readme
There is currently no monitoring, we need to track the important stuff like:
I am thinking that these need to all be flags so that the user can decide what to turn on. Because many CW Alarms and dashboards can push up the cost again.
Add time to the date time component and then when the chart is selected, update the date time picker and refresh
These still count as a view; it only means the secondary tracking event never made it to us. It is highly unlikely that the first arrived but the second did not, but it is possible.
I am defining the HostedZone and certificate within my stack, so the region check is failing because it gets a TOKEN instead of us-east-1:
Error: Certificate must be in us-east-1, not in ${Token[TOKEN.646]}, this is a requirement for CloudFront
Is having the cert (and hosted zone) in a separate stack such a strong best practice that the construct shouldn't support it?
If you want to allow this, then when a TOKEN is received, you could check the region of the stack. Alternatively, provide some prop to disable the region checking.
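A minimal sketch of what a token-tolerant check could look like. The helper names and the `skipRegionCheck` prop are assumptions, not the construct's actual code; in real CDK code, `cdk.Token.isUnresolved()` is the proper way to detect unresolved tokens.

```typescript
// Sketch: only hard-fail when the region is a concrete string that is not
// us-east-1. Unresolved CDK tokens serialize to strings like
// "${Token[TOKEN.646]}" and only resolve at deploy time, so they cannot be
// validated at synth time.
function isUnresolvedToken(value: string): boolean {
  return value.includes('${Token[');
}

function checkCertificateRegion(region: string, skipRegionCheck = false): string | null {
  if (skipRegionCheck) return null; // hypothetical escape-hatch prop
  if (isUnresolvedToken(region)) return null; // cannot know the region at synth time
  if (region !== 'us-east-1') {
    return `Certificate must be in us-east-1, not in ${region}, this is a requirement for CloudFront`;
  }
  return null; // region is fine
}
```

Returning null for tokens means the check degrades to a best-effort warning rather than a hard error when the certificate is defined in the same stack.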
Stack looks like this:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Swa, SwaProps } from 'serverless-website-analytics';

export class WebAnalyticsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const analyticsZone = new cdk.aws_route53.PublicHostedZone(this, 'HostedZone', {
      zoneName: 'analytics.example.com',
    });
    const certificate = new cdk.aws_certificatemanager.Certificate(this, 'Certificate', {
      domainName: 'analytics.example.com',
      subjectAlternativeNames: ['*.analytics.example.com'],
      validation: cdk.aws_certificatemanager.CertificateValidation.fromDns(analyticsZone),
    });

    const options: SwaProps = {
      environment: 'production',
      awsEnv: {
        account: this.account,
        region: this.region,
      },
      sites: ['example.com'],
      domain: {
        name: 'web.analytics.example.com',
        certificate: certificate,
        hostedZone: analyticsZone,
      },
      allowedOrigins: ['*'],
    };

    new Swa(this, 'web-analytics', options);
  }
}
The idea would be to auto attribute/tag certain points in time that are of importance, being able to identify that a spike in traffic is because of a certain referrer/UTM source/location. Maybe even going out to the internet and trying to find where it was shared.
A better name might be spike analysis or traffic finder.
The domain property creates a Route53 record for CloudFront with a fixed construct id, so if it already exists then you can only ever deploy one serverless-website-analytics component per stack.
This is on version 0.x.x.
You get the following error when using CDK deploy:
Error: There is already a Construct with name 'cloudfront-record' in App [xxx]
Also double-check if IPv6 works.
Collapse all google domain referrers as google; currently we get many depending on region, like:
https://www.google.co.in/
https://www.google.com.ph/
https://www.google.it/
https://www.google.com.ua/
So if it starts with https://www.google then we just make it google.
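The normalization above is a one-liner; a sketch (helper name assumed):

```typescript
// Collapse all regional Google referrer domains into a single "google"
// value, e.g. https://www.google.co.in/ and https://www.google.it/ both
// become "google". Anything else passes through unchanged.
function collapseReferrer(referrer: string): string {
  return referrer.startsWith('https://www.google') ? 'google' : referrer;
}
```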
I have Hosted Zone example.com. With this config:
new Swa(this, "swa-example-com", {
  ...
  auth: {
    cognito: {
      loginSubDomain: "login",
      users: [],
    },
  },
  domain: {
    name: 'swa.example.com',
    usEast1Certificate: props.certificate,
    hostedZone: route53.HostedZone.fromHostedZoneAttributes(this, "HostedZone", {
      hostedZoneId: 'Z01234',
      zoneName: 'example.com',
    }),
  },
});
What's created is:
swa.example.com: A record for the website ✅
login.swa.example.com ✅
login.example.com: A record for the Cognito Custom Domain ❌ -> should be login.swa.example.com
I'm pretty sure the fix is to change the record name to:
recordName: cognitoDomain.domainName,
The route53.ARecord construct accepts either a subdomain name or a fully qualified domain name: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.ARecord.html#recordname
Theoretically, the fully qualified domain name should end with a trailing dot, so we could do recordName: cognitoDomain.domainName + '.', but in fact CDK auto-fixes it 🤷♂️
The firehose and backend API calls are all done; it just needs testing and documentation. Oh, and a frontend 😅. Yes, it would need a quick design, in Excalidraw maybe.
This can speed up the planning phase of Athena queries, but I could not get it to work. See branch https://github.com/rehanvdm/serverless-website-analytics/tree/feature/experiment-column-level-statistics
Manually created a Glue table from a v2 of the Glue external table for experimentation, but it did not want to run:
I am trying to use the new Glue Table Statistics https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/ but I get an error:
Exception in User Class: java.lang.AssertionError : assertion failed: Conflicting partition column names detected:
Which does not help me at all 😅. Table details are in the screenshots. I wonder if it is because I am using partition projection, but the limitations do not mention anything about it 🤔
Anyone that can help me out? I don't want to/can't open a support ticket for this; it's for an open-source project: serverless-website-analytics
Not even selecting all columns individually, except the partition columns, worked; I still got that error.
Would be nice to have a basic version of the real-time view from GA:
As discussed with @rehanvdm on X/twitter, one way to do this would be to use Kinesis Streams - but that adds a $28 floor cost. Would thus definitely have to make this optional but are there any alternative methods that have a better cost scaling for small sites?
Expose the options on the constructor. The frontend is at 100 and the ingest API at 200.
We have a Lambdalith for both APIs, so we can throttle the API requests by throttling the Lambda concurrency.
Create a cron job lambda that runs a CTAS query to group and store the records by page_id and highest time_on_page, combining the initial page track and the final page track. This cuts down on the data stored. The system has been designed in such a way as to cater for this. It is basically the equivalent of a VACUUM command in Postgres terms, but can be done without locking.
Use JS to track href clicks; it should be possible, see https://gist.github.com/leighmcculloch/7596803
Store it as a tracking event? Not sure yet.
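A sketch of how outbound clicks could be detected (assumed approach, similar in spirit to the linked gist; the helper name is hypothetical):

```typescript
// Decide whether a clicked href leaves the current site. Relative or
// malformed hrefs throw in the URL constructor and are treated as on-site.
function isOutboundLink(href: string, currentHost: string): boolean {
  try {
    return new URL(href).host !== currentHost;
  } catch {
    return false;
  }
}

// In the browser this would hang off a delegated click listener, e.g.:
// document.addEventListener('click', (e) => {
//   const a = (e.target as HTMLElement).closest?.('a');
//   if (a?.href && isOutboundLink(a.href, location.host)) track('outbound', a.href);
// });
```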
This is currently decided for you by the logic here:
If the range is < 3 days then it will get data by the hour, otherwise by the day.
Sometimes you want to look at a week's data and see spikes to identify when traffic originated.
Make this a drop-down on the chart.
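The selection logic described above can be sketched as a small helper (the function name is an assumption):

```typescript
// Pick the chart bucket size from the selected date range: ranges shorter
// than 3 days are grouped by the hour, anything longer by the day.
function chartGrouping(fromDate: Date, toDate: Date): 'hour' | 'day' {
  const days = (toDate.getTime() - fromDate.getTime()) / (1000 * 60 * 60 * 24);
  return days < 3 ? 'hour' : 'day';
}
```

Exposing this as a drop-down would simply override the computed default.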
Maybe two flags on the construct:
Daily or weekly email reports configurable within the CDK construct
Thinking of caching the page URL + page view count in a DDB table, with a TTL of say 15 minutes. BUT this will scan over all the partitions. So instead, let's scan over the last 30 days only; then this feature has a liveliness counter to it as well. The average read time can also be exposed like this.
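The TTL part of the caching idea is straightforward; a sketch (helper name and constant are assumptions):

```typescript
// Compute the DynamoDB TTL attribute for a cached page-view count,
// expiring 15 minutes after it is written. DynamoDB TTL expects an epoch
// timestamp in seconds.
const CACHE_TTL_MINUTES = 15;

function cacheExpiry(nowMs: number): number {
  return Math.floor(nowMs / 1000) + CACHE_TTL_MINUTES * 60;
}
```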
Detect anomalies and send notifications via SNS (to email)
Bug as mentioned in this post https://dev.to/aws/deploying-a-serverless-web-analytics-solution-for-your-websites-5coh
I ended up having to change the code that the CDK stack deployed slightly. As part of the deployment there is a CloudFront function that implements basic authentication. I just needed to add an exclusion so that it would not try to authenticate requests to this client-script.js. From the CloudFront console, I updated the code as follows (the code is also in the repo).
Solution: just add a pass-through behavior on the cdn/* routes.
Use https://github.com/awslabs/llrt as the Lambda runtime to reduce Lambda execution time and cold starts.
See if it is possible and what improvements there are to gain by switching.
The referrer should not be added to the filter immediately; many times you want to copy the URL, so provide two options when left clicking:
Do the same with the page section as well.
This would mean that when that filter is selected we need to know about it and not show the Filter by option again.
By exposing the buffering time we let the user decide; the lower the value, the more it will cost for high-volume sites, because Firehose will write to S3 more frequently.
It is currently set at 1 min and the data then shows up on the dashboard ±1 min after that, so 2 mins total until you see your page view.
I think the maximum is 15 mins of buffering.
Currently, an anomaly is marked as over: the first time a value is under the breach threshold and the first time the slope is positive.
2024-02-15 19:00 07 (17) ========== |
2024-02-15 18:00 01 (17) = #
2024-02-15 17:00 07 (19) ========== #
2024-02-15 16:00 08 (21) =========== #
2024-02-15 15:00 17 (21) ========================
2024-02-15 14:00 21 (20) =============================#
2024-02-15 13:00 20 (19) ===========================#=
The above example is of a slow breach, where we could say that if the value is below the threshold, then mark it as over.
But for big spikes it is a different story; test and find a solution for both scenarios. Here we are trying to find the whole event; if we just stop when the value is no longer breaching then we lose the parts I marked in red.
Explore having a third EMA with a much slower reaction, maybe an Alpha of 0.01, then if that is crossed after an anomaly, mark the evaluation as OK? Something like that
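The slow third EMA idea can be sketched as follows (the 0.01 alpha is the value floated above; seeding with the first sample is my assumption, not the project's actual implementation):

```typescript
// Exponential moving average with a configurable alpha. A small alpha
// (e.g. 0.01) reacts slowly, so the series crossing back under this slow
// EMA after an anomaly could mark the whole anomaly window as over.
function ema(values: number[], alpha: number): number[] {
  const out: number[] = [];
  let prev = values[0]; // seed with the first sample
  for (const v of values) {
    prev = alpha * v + (1 - alpha) * prev;
    out.push(prev);
  }
  return out;
}
```

For the hourly series above, one would compare each value against `ema(values, 0.01)` in addition to the existing faster EMAs.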
Great work with this solution!
It’d be great to have in the readme a section that discusses the average expected pricing for the solution.
For instance: running this solution with about 10M events/page views per month and a certain amount of dashboard interactions would cost you $$/month.
I think this would help people evaluate whether this is for them or not.
There is a column (boolean) on each record that marks if the view was a bot or human. We can also show this on the dashboard.
We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly fewer S3 records.
Doing a range query is not as efficient as specifying the partitions directly. Compare these:
Using the exact partition field (page_opened_at_date):
AND (
page_opened_at_date = '2023-08-27' OR
page_opened_at_date = '2023-08-28' OR
page_opened_at_date = '2023-08-29' OR
page_opened_at_date = '2023-08-30' OR
page_opened_at_date = '2023-08-31' OR
page_opened_at_date = '2023-09-01' OR
page_opened_at_date = '2023-09-02' OR
page_opened_at_date = '2023-09-03' OR
page_opened_at_date = '2023-09-04')
Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB
Letting Athena extrapolate the partition from the timestamp field (page_opened_at):
page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB
Query Type | Run time
---|---
Specify partitions | 1.3s
Infer partitions from range query | 4.7s
Possible reasons for the increase:
The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct, but there seems to be quite a significant performance loss when it infers the partitions automatically on range queries.
We will specify the exact partitions instead of the range query. The Athena query maximum length is ±260k characters; if one date condition (page_opened_at_date = '2023-08-24' OR) is 37 characters, then assuming 31 days and 12 months, the extra length this adds per year is 31*37*12 = 13,764 characters, or about 5.3% of the maximum allowed length. Meaning it will even support 10-year queries, as the queries we have are not that complex.
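Generating the explicit partition predicate is mechanical; a sketch (the helper name is hypothetical, the column name is the one used in the queries below):

```typescript
// Expand an inclusive date range (YYYY-MM-DD strings) into explicit
// page_opened_at_date equality conditions, instead of a BETWEEN range
// that forces Athena to infer the partitions itself.
function partitionPredicate(from: string, to: string): string {
  const conditions: string[] = [];
  const d = new Date(from + 'T00:00:00Z');
  const end = new Date(to + 'T00:00:00Z');
  while (d.getTime() <= end.getTime()) {
    conditions.push(`page_opened_at_date = '${d.toISOString().slice(0, 10)}'`);
    d.setUTCDate(d.getUTCDate() + 1);
  }
  return `(${conditions.join(' OR ')})`;
}
```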
Exact queries used in the test:
WITH
cte_data AS (
  SELECT user_id, country_name, page_opened_at,
    ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
  FROM page_views
  WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests')
    AND (
      page_opened_at_date = '2023-08-27' OR
      page_opened_at_date = '2023-08-28' OR
      page_opened_at_date = '2023-08-29' OR
      page_opened_at_date = '2023-08-30' OR
      page_opened_at_date = '2023-08-31' OR
      page_opened_at_date = '2023-09-01' OR
      page_opened_at_date = '2023-09-02' OR
      page_opened_at_date = '2023-09-03' OR
      page_opened_at_date = '2023-09-04')
),
cte_data_filtered AS (
  SELECT *
  FROM cte_data
  WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
user_distinct_stat AS (
  SELECT
    user_id, country_name,
    COUNT(*) as "visitors"
  FROM cte_data_filtered
  WHERE country_name IS NOT NULL
  GROUP BY 1, 2
  ORDER BY 3 DESC
)
SELECT
  country_name as "group",
  COUNT(*) as "visitors"
FROM user_distinct_stat
GROUP BY country_name
ORDER BY visitors DESC
WITH
cte_data AS (
  SELECT user_id, country_name, page_opened_at,
    ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
  FROM page_views
  WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests')
    AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
cte_data_filtered AS (
  SELECT *
  FROM cte_data
  WHERE rn = 1
),
user_distinct_stat AS (
  SELECT
    user_id, country_name,
    COUNT(*) as "visitors"
  FROM cte_data_filtered
  WHERE country_name IS NOT NULL
  GROUP BY 1, 2
  ORDER BY 3 DESC
)
SELECT
  country_name as "group",
  COUNT(*) as "visitors"
FROM user_distinct_stat
GROUP BY country_name
ORDER BY visitors DESC
Make use of S3 One Zone if it is supported in that region. Maybe make this a flag that can be set; not everyone would want this. I am expecting that it will not improve performance by much, if at all, but it will cut the cost of S3 retrieval by 50%.
Anomaly detection for both pages and events on:
Store it in DDB to be used by the frontend. Create an overview page with this that becomes the new default page, showing page traffic, event traffic and then insights. Thinking of two columns and two rows: in the left column show the charts for page and event traffic below each other, and in the right column the insights as text, maybe even a condensed timeline.
Blocked by #69