
serverless-website-analytics's Issues

Implement filters

Create filters for all/most API calls. Just like on other platforms, I want to click on a page to narrow the stats that I am seeing, and then continue clicking on another stat, say a referrer. These filters will therefore be combined as AND filters.
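
A rough sketch of how the combined filters could be applied on the backend; the Filters shape and column names are hypothetical, it only illustrates the AND semantics:

// Hypothetical filter shape, one optional value per dimension.
interface Filters {
  page?: string;
  referrer?: string;
  utmSource?: string;
  countryName?: string;
}

// Every filter that is set narrows the result further (AND semantics).
// Values would need to be escaped/parameterized before being used in the Athena query.
function toWhereConditions(filters: Filters): string[] {
  const conditions: string[] = [];
  if (filters.page) conditions.push(`page_url = '${filters.page}'`);
  if (filters.referrer) conditions.push(`referrer = '${filters.referrer}'`);
  if (filters.utmSource) conditions.push(`utm_source = '${filters.utmSource}'`);
  if (filters.countryName) conditions.push(`country_name = '${filters.countryName}'`);
  return conditions; // joined with ' AND ' and appended to the existing WHERE clause
}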

Auto create partitions

  • A cron to run at the beginning of each month that adds the new partitions (sketched below)
  • A CFN custom resource to run if changing the sites
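
A minimal CDK sketch of the cron part, assuming a hypothetical addPartitionsFunction Lambda that creates the coming month's partitions:

import * as cdk from 'aws-cdk-lib';

// Runs at 00:00 UTC on the 1st of every month and invokes the (hypothetical)
// Lambda that adds the new Glue partitions for that month.
const rule = new cdk.aws_events.Rule(this, 'add-partitions-schedule', {
  schedule: cdk.aws_events.Schedule.cron({ minute: '0', hour: '0', day: '1', month: '*', year: '*' }),
});
rule.addTarget(new cdk.aws_events_targets.LambdaFunction(addPartitionsFunction));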

Add observability

There is currently no monitoring; we need to track the important things, like:

  • Lambda hard errors & lambda soft errors
  • Cloudfront/Lambda concurrency for API limits
  • Athena maybe?
  • Dashboard

I am thinking these all need to be flags so that the user can decide what to turn on, because many CloudWatch alarms and dashboards can push the cost up again.
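
A possible shape for those flags on the construct props; the names are hypothetical:

// Hypothetical observability props, all opt-in so that the default cost stays low.
interface ObservabilityProps {
  alarms?: {
    lambdaHardErrors?: boolean;  // Lambda invocation errors
    lambdaSoftErrors?: boolean;  // errors logged/returned by the handler itself
    apiConcurrency?: boolean;    // CloudFront/Lambda concurrency nearing the limits
    athena?: boolean;            // failed or long-running Athena queries
  };
  dashboard?: boolean;           // create a CloudWatch dashboard
}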

Add time filtering

Add time to the date-time component, and when a selection is made on the chart, update the date-time picker and refresh.

Certificate defined within the stack doesn't have a region so fails construct validation

I am defining the HostedZone and Certificate within my stack, so the region check fails because it gets a TOKEN instead of us-east-1:

Error: Certificate must be in us-east-1, not in ${Token[TOKEN.646]}, this is a requirement for CloudFront

Is having the cert (and hosted zone) in a separate stack such a strong best practice that the construct shouldn't support it?

If you want to allow this, then when a TOKEN is received, you could check the region of the stack. Alternatively, provide some prop to disable the region checking.
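
A sketch of that check, assuming the construct currently compares the certificate's region against us-east-1 directly; skipCertificateRegionCheck is a hypothetical prop:

import * as cdk from 'aws-cdk-lib';

// If the certificate's region is a token (certificate defined in the same stack),
// fall back to checking the owning stack's region, or skip the check entirely.
const certRegion = certificate.env.region;
if (!props.domain.skipCertificateRegionCheck) {
  if (cdk.Token.isUnresolved(certRegion)) {
    const stackRegion = cdk.Stack.of(certificate).region;
    if (!cdk.Token.isUnresolved(stackRegion) && stackRegion !== 'us-east-1') {
      throw new Error('Certificate must be in us-east-1, this is a requirement for CloudFront');
    }
  } else if (certRegion !== 'us-east-1') {
    throw new Error('Certificate must be in us-east-1, this is a requirement for CloudFront');
  }
}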

Stack looks like this:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Swa, SwaProps } from 'serverless-website-analytics'

export class WebAnalyticsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const analyticsZone = new cdk.aws_route53.PublicHostedZone(this, 'HostedZone', {
      zoneName: 'analytics.example.com',
    });

    const certificate = new cdk.aws_certificatemanager.Certificate(this, 'Certificate', {
      domainName: 'analytics.example.com',
      subjectAlternativeNames: ["*.analytics.example.com"],
      validation: cdk.aws_certificatemanager.CertificateValidation.fromDns(analyticsZone),
    });

    const options: SwaProps = {
      environment: 'production',
      awsEnv: {
        account: this.account,
        region: this.region
      },
      sites: ['example.com'],
      domain: {
        name: 'web.analytics.example.com',
        certificate: certificate,
        hostedZone: analyticsZone,
      },
      allowedOrigins: ["*"]
    }

    new Swa(this, 'web-analytics', options)
  }
}

Auto analyze traffic

The idea would be to automatically attribute/tag certain points in time that are of importance, being able to identify that a spike in traffic is because of a certain referrer/UTM source/location. Maybe even going out to the internet and trying to find where it was shared.

A better name might be spike analysis or traffic finder.

Cannot create multiple constructs in one stack

The domain property creates a Route53 record for CloudFront with a fixed construct id, so if it already exists then you can only ever deploy 1 serverless-website-analytics component per stack.

This is on version 0.x.x.

You get the following error when using CDK deploy:

Error: There is already a Construct with name 'cloudfront-record' in App [xxx]

Collapse all Google domain referrers as google

Collapse all Google domain referrers as google; currently we get many depending on the region, like:

https://www.google.co.in/
https://www.google.com.ph/
https://www.google.it/
https://www.google.com.ua/

So if it starts with https://www.google then we just record it as google.
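
A minimal sketch of that normalization, assuming the referrer is stored as the full URL:

// Collapse region-specific Google domains (google.co.in, google.it, ...) into 'google'.
function normalizeReferrer(referrer: string): string {
  if (referrer.startsWith('https://www.google')) {
    return 'google';
  }
  return referrer;
}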

Bug: wrong A record for Cognito domain if using a subdomain

I have Hosted Zone example.com. With this config:

new Swa(this, "swa-example-com", {
	...
	auth: {
		cognito: {
			loginSubDomain: "login",
			users: [],
		},
	},
	domain: {
		name: 'swa.example.com',
		usEast1Certificate: props.certificate,
		hostedZone: route53.HostedZone.fromHostedZoneAttributes(this, "HostedZone", {
			hostedZoneId: 'Z01234',
			zoneName: 'example.com',
		}),
	},
});

What's created is:

  • Hosted Zone swa.example.com A record for the website ✅
  • Cognito Custom Domain login.swa.example.com
  • Hosted Zone login.example.com A record for the Cognito Custom Domain ❌ -> should be login.swa.example.com

I'm pretty sure the fix is to change

recordName: props.auth.cognito.loginSubDomain,

to

recordName: cognitoDomain.domainName,

The route53.ARecord construct accepts either a subdomain name or a fully qualified domain name: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.ARecord.html#recordname
Theoretically, the fully qualified domain name should end with a ".", so we could do recordName: cognitoDomain.domainName + '.', but in fact CDK auto-fixes it 🤷‍♂️

Complete tracking events

The Firehose and backend API calls are all done; it just needs testing and documentation. Oh, and a frontend 😅. It would need a quick design in Excalidraw, maybe.

Experiment Glue Table Statistics

This can speed up the planning phase of Athena queries, but I could not get it to work. See branch https://github.com/rehanvdm/serverless-website-analytics/tree/feature/experiment-column-level-statistics

I manually created a v2 of the Glue external table for experimentation, but it did not want to run:

I am trying to use the new Glue Table Statistics https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/ but I get an error:
Exception in User Class: java.lang.AssertionError : assertion failed: Conflicting partition column names detected:
Which does not help me at all 😅 Table details are in the screenshots. I wonder if it is because I am using partition projection, but the limitations do not mention anything about it 🤔
Anyone that can help me out? I don't want to/can't open a support ticket for this, it's for an open-source project: serverless-website-analytics

Not even selecting all columns individually, except the partition columns, worked; I still got that error.

Basic real time view

Would be nice to have a basic version of the real-time view from GA:

  • number of users, visits, events and page views in the last 15 minutes
  • a graph of the last 4 hours or so that updates in real time
  • real-time referrer, UTM and OS counts

As discussed with @rehanvdm on X/Twitter, one way to do this would be to use Kinesis Data Streams, but that adds a $28 floor cost. It would thus definitely have to be optional, but are there any alternative methods with better cost scaling for small sites?

Expose the API Lambda concurrency limits

Expose these options on the constructor. The frontend is currently at 100 and the ingest API at 200.

We have a Lambdalith for both APIs, so we can throttle API requests by throttling the Lambda concurrency.
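
A possible shape for exposing this on the constructor; the prop names are hypothetical and the values shown are the current defaults:

// Hypothetical props to override the reserved concurrency of the two Lambdaliths.
interface RateLimitProps {
  frontendLambdaConcurrency?: number; // default 100
  ingestLambdaConcurrency?: number;   // default 200
}

// Applied to the functions via the standard Lambda prop, e.g.:
// new cdk.aws_lambda.Function(this, 'api-ingest', {
//   ...
//   reservedConcurrentExecutions: props.rateLimit?.ingestLambdaConcurrency ?? 200,
// });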

Janitor for pruning data

Create a cron-job Lambda that runs a CTAS query to group and store the records by page_id with the highest time_on_page, combining the initial page track and the final page track. This cuts down on the data stored. The system has been designed in such a way as to cater for this. It is basically the equivalent of a VACUUM command in Postgres terms, but it can be done without locking.
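
A rough sketch of what the janitor Lambda could start; the table, database and bucket names are placeholders, and the dedup logic mirrors the ROW_NUMBER pattern already used in the dashboard queries:

import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena';

const athena = new AthenaClient({});

// Rewrite a month of data, keeping only the row with the highest time_on_page per page_id,
// which merges the initial and the final page track into a single record. The compacted
// data would then replace the original objects for that month.
export async function handler() {
  const query = `
    CREATE TABLE page_views_compacted
    WITH (format = 'PARQUET', external_location = 's3://<analytics-bucket>/compacted/2023-08/')
    AS
    SELECT * FROM (
      SELECT *, ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
      FROM page_views
      WHERE page_opened_at_date BETWEEN '2023-08-01' AND '2023-08-31'
    ) WHERE rn = 1`;

  await athena.send(new StartQueryExecutionCommand({
    QueryString: query,
    QueryExecutionContext: { Database: '<analytics-db>' },
    ResultConfiguration: { OutputLocation: 's3://<athena-results-bucket>/' },
  }));
}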

Add option to view by hour/day

This is currently decided for you by the logic here

If the range is < 3 days then it gets data by the hour, otherwise by the day.

Sometimes you want to look at a week's data and see spikes to identify when traffic originated.

Make this a drop-down on the chart

Demo page improvements - self track & banner

Maybe two flags on the construct:

  1. Add a banner at the top pointing to this GH repo
  2. Make the page track itself and send data to its own endpoint. IF this flag is set, then we need to automatically add the domain to the specified sites, which will cause a CFN dependency loop between the frontend and backend. So either use a custom resource OR just let the user add the site themselves, which is the better solution for now.

Email reports

Daily or weekly email reports configurable within the CDK construct

Add an API for page view retrieval to show on the page itself

Thinking of caching the page URL + page view count in a DDB table with a TTL of, say, 15 minutes. BUT this will scan over all the partitions, so let's instead scan over only the last 30 days; then this feature has a liveliness aspect to it as well. The average read time can also be exposed like this.
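
A sketch of the caching idea, assuming a hypothetical DDB table keyed on the page URL and a hypothetical queryPageViewsLast30Days helper that runs the Athena query:

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const CACHE_TTL_SECONDS = 15 * 60; // 15 minutes

export async function getPageViews(pageUrl: string): Promise<number> {
  // 1. Return the cached count if it has not expired yet.
  const cached = await ddb.send(new GetCommand({ TableName: 'page-view-cache', Key: { pageUrl } }));
  if (cached.Item && cached.Item.expiresAt > Math.floor(Date.now() / 1000)) {
    return cached.Item.views;
  }

  // 2. Otherwise query only the last 30 days (not all partitions) and cache the result.
  const views = await queryPageViewsLast30Days(pageUrl); // hypothetical Athena helper
  await ddb.send(new PutCommand({
    TableName: 'page-view-cache',
    Item: { pageUrl, views, expiresAt: Math.floor(Date.now() / 1000) + CACHE_TTL_SECONDS },
  }));
  return views;
}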

Do not authenticate the client side script when using Basic Auth

Bug as mentioned in this post https://dev.to/aws/deploying-a-serverless-web-analytics-solution-for-your-websites-5coh

I ended up having to change the code that the CDK stack deployed slightly. As part of the deployment there is a CloudFront function that implements basic authentication. I just needed to add an exclusion so that it would not try to authenticate requests to the client-script.js. From the CloudFront console, I updated the code as follows (the code is also in the repo).

Solution: just add a pass-through behavior on the cdn/* routes.
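
In CDK terms that could look something like the sketch below, where frontendOrigin is a placeholder for the existing origin; requests matching cdn/* then never hit the basic auth CloudFront function:

// cdn/* (which serves client-script.js) bypasses the basic auth function;
// all other paths keep the default behavior with the auth function attached.
distribution.addBehavior('cdn/*', frontendOrigin, {
  // no functionAssociations here, so the basic auth CloudFront function never runs
});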

Clicking on a referrer should have a context menu instead

The referrer should not be added to the filter immediately; many times you want to copy the URL, so provide two options when left-clicking:

  • Filter by: "..."
  • Go to: "..."

Do the same with the page section as well.

This would mean that when that filter is already selected we need to know about it and not show the Filter by option again.

Expose the Firehose buffering time

By exposing the buffering time we let the user decide; the lower the value, the more it will cost for high-volume sites, because Firehose will write to S3 more frequently.

It is currently set to 1 minute, and the data then shows up on the dashboard ±1 minute after that, so 2 minutes in total until you see your page view.

I think the maximum is 15 minutes of buffering.
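
A possible prop for this (the name is hypothetical), mapped onto the delivery stream's buffering hints:

import * as cdk from 'aws-cdk-lib';

// Hypothetical construct prop, passed down to the Firehose buffering hints.
// Lower values mean fresher dashboards but more S3 writes (and cost) for high-volume sites.
interface IngestProps {
  firehoseBufferInterval?: cdk.Duration; // default: cdk.Duration.minutes(1), max 15 minutes
}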

Update the Anomaly Detection Algorithm so that it better detects when an anomaly is over

Currently, an anomaly is marked as over the first time a value is under the breach threshold and the first time the slope is positive.

2024-02-15 19:00 07 (17) ========== |
2024-02-15 18:00 01 (17) = #
2024-02-15 17:00 07 (19) ========== #
2024-02-15 16:00 08 (21) =========== #
2024-02-15 15:00 17 (21) ========================
2024-02-15 14:00 21 (20) =============================#
2024-02-15 13:00 20 (19) ===========================#=

The above example is of a slow breach; there we could say that if the value is below the threshold, then mark it as over.

But for big spikes it is a different story; test and find a solution for both scenarios. Here we are trying to find the whole event; if we just mark it as over when the value is no longer breaching, then we lose the parts I marked in red.
image

Explore having a third EMA with a much slower reaction, maybe an alpha of 0.01; then, if that is crossed after an anomaly, mark the evaluation as OK? Something like that.
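
A minimal sketch of that idea; the alphas and the surrounding evaluation are illustrative only:

// Exponential moving average: the smaller the alpha, the slower it reacts.
function ema(previous: number, value: number, alpha: number): number {
  return alpha * value + (1 - alpha) * previous;
}

// The proposed third, much slower EMA (alpha 0.01) acts as a long-term baseline.
// The anomaly would only be marked as over once the value is no longer breaching
// AND has dropped back to (or below) this slow baseline, so slow breaches and big
// spikes are both captured as a single, whole event.
function isAnomalyOver(value: number, breaching: boolean, slowEma: number): boolean {
  return !breaching && value <= slowEma;
}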

Consider adding pricing info in the readme

Great work with this solution!

It’d be great to have in the readme a section that discusses the average expected pricing for the solution.

For instance: running this solution with about 10M events/page views per month and a certain amount of dashboard interactions would cost you $$/month.

I think this would help people evaluate whether this is for them or not.

Increase query performance and efficiency

We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly reads fewer S3 objects.

Experiment

Doing a range query is not as efficient as specifying the partitions directly. Compare these:

Using the exact partition field (page_opened_at_date):

AND (
  page_opened_at_date = '2023-08-27' OR
  page_opened_at_date = '2023-08-28' OR
  page_opened_at_date = '2023-08-29' OR
  page_opened_at_date = '2023-08-30' OR
  page_opened_at_date = '2023-08-31' OR
  page_opened_at_date = '2023-09-01' OR
  page_opened_at_date = '2023-09-02' OR
  page_opened_at_date = '2023-09-03' OR
  page_opened_at_date = '2023-09-04')

Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB

image

Letting Athena extrapolate the partition from the timestamp field (page_opened_at):

 page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 

Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB

image

Results

Query Type                          Run time
Specify partitions                  1.3 sec
Infer partitions from range query   4.7 sec

Conclusion

Possible reasons for the increase:

  • Simplified logic, less planning
  • Athena does not have to identify partitions, we explicitly specify them
  • Less data to be scanned, the range query scans more than it needs to

The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct. But there seems to be quite a significant performance loss if it infers the partitions automatically on range queries.

We will specify the exact partitions instead of using the range query. The Athena query maximum length is ±260k; if one date condition (page_opened_at_date = '2023-08-24' OR) is 37 characters, then assuming 31 days and 12 months, the extra length this adds is 31*37*12 = 13,764 bytes, or about 5.3% of the maximum allowed length. Meaning it will even support 10-year queries, as the queries we have are not that complex.
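
A sketch of how the explicit partition predicate could be built from the selected date range:

// Build the explicit partition predicate for the selected range, e.g.
// (page_opened_at_date = '2023-08-27' OR page_opened_at_date = '2023-08-28' OR ...)
function partitionPredicate(from: Date, to: Date): string {
  const conditions: string[] = [];
  const d = new Date(from);
  while (d <= to) {
    conditions.push(`page_opened_at_date = '${d.toISOString().slice(0, 10)}'`);
    d.setUTCDate(d.getUTCDate() + 1);
  }
  return `(${conditions.join(' OR ')})`;
}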


Exact queries used in the test (first the explicit-partition query, then the range query):

WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') 
              AND (
              page_opened_at_date = '2023-08-27' OR
              page_opened_at_date = '2023-08-28' OR
              page_opened_at_date = '2023-08-29' OR
              page_opened_at_date = '2023-08-30' OR
              page_opened_at_date = '2023-08-31' OR
              page_opened_at_date = '2023-09-01' OR
              page_opened_at_date = '2023-09-02' OR
              page_opened_at_date = '2023-09-03' OR
              page_opened_at_date = '2023-09-04')
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC

WITH
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC

S3 Express One Zone

Make use of S3 Express One Zone if it is supported in that region. Maybe make this a flag that can be set, as not everyone would want this. I am expecting that it will not improve performance by much, if at all, but it will cut the cost of S3 retrieval by 50%.

Insights

Anomaly detection for both pages and events on:

  • Traffic
  • Referrer count
  • UTM Source

Store the results in DDB to be used by the frontend. Create an overview page with this that becomes the new default page, showing page traffic, event traffic and then insights. Thinking of two columns, two rows: in the left column show the page traffic and event traffic charts below each other, and in the right column the insights as text, maybe even a condensed timeline.

Blocked by #69
