
serverless-website-analytics's Introduction

Serverless Website Analytics

This is a CDK serverless website analytics construct that can be deployed to AWS. This construct creates backend, frontend and the ingestion APIs.

This solution was designed for multiple websites with low to moderate traffic. It is designed to be as cheap as possible, but it is not free. The cost is mostly driven by the ingestion API that saves the data to S3 through a Kinesis Firehose.

You can see a LIVE DEMO HERE and read about the simulated traffic here

Features

  • Serverless, only pay for the AWS services you use
  • Track multiple sites
  • Custom domain (use your own domain or a generic CloudFront domain)
  • Page view tracking (includes time on page)
  • Event tracking
  • Anomaly detection and alerts
  • Privacy focused, don't store any Personally Identifiable Information (PII)
  • You own your data
  • Three dashboard authentication options: none, Basic Auth or AWS Cognito
  • Tracks:
    • Referrers
    • Location by country and city
    • Device type
    • All UTM parameters
    • Query parameters
    • Bot detection
  • Easy integration in any JS framework through:
    • JS/TS SDK
    • Standalone 1-liner <script> import

Objectives

  • Serverless
  • Privacy focused
  • Lowest possible cost, pay for the AWS services you use (scale to 0)
  • KISS
  • No direct server-side state
  • Low maintenance
  • Easy to deploy in your AWS account, any *region
  • The target audience is small to medium websites with low to moderate page view traffic (10 million views or fewer)

The main objective is to keep it simple and the operational cost low, staying true to the "scale to 0" tenets of serverless, even if it goes against "best practices".

Getting started

📖 Alternatively, read a step-by-step guide written by Ricardo Sueiras

Server-side setup

⚠️ Requires your project's aws-cdk and aws-cdk-lib packages to be greater than 2.79.1

Install the CDK construct library in your project:

npm install serverless-website-analytics

Add the construct to your stack:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as route53 from 'aws-cdk-lib/aws-route53';
import { Swa, AllAlarmTypes } from 'serverless-website-analytics';

export class App extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    ...

    new Swa(this, 'swa-demo-codesnippet-screenshot', {
      environment: 'prod',
      awsEnv: {
        account: this.account,
        region: this.region,
      },
      sites: ['example.com', 'tests1', 'tests2'],
      allowedOrigins: ['*'],
      /* None and Basic Auth also available, see options below */
      auth: {
        cognito: {
          loginSubDomain: 'login',
          users: [
              { name: '<full name>',  email: '<email>' },
          ]
        }
      },
      /* Optional, if not specified uses default CloudFront and Cognito domains */
      domain: {
        name: 'demo.serverless-website-analytics.com',
        /* The certificate must be in us-east-1 */
        usEast1Certificate: wildCardCertUsEast1,
        /* Optional, if not specified then no DNS records will be created. You will have to create the DNS records yourself. */
        hostedZone: route53.HostedZone.fromHostedZoneAttributes(this, 'HostedZone', {
          hostedZoneId: 'Z00387321EPPVXNC20CIS',
          zoneName: 'demo.serverless-website-analytics.com',
        }),
      },
      /* Optional, adds alarms and dashboards but also raises the cost */
      observability: {
        dashboard: true,
        alarms: {
          alarmTopic,
          alarmTypes: AllAlarmTypes
        },
      },
      /* Optional, anomaly detection and alerts. Might raise cost */
      anomaly: {
        alert: {
          topic: alarmTopic,
        }
      }
    });

  }
}

Quick option rundown:

  • sites: The list of allowed sites. This does not have to be a domain name, it can also be any string. It can be anything you want to use to identify a site. The client-side script that sends analytics will have to specify one of these names.
  • firehoseBufferInterval: The number in seconds for the Firehose buffer interval. The default is 15 minutes (900 seconds), minimum is 60 and maximum is 900.
  • allowedOrigins: The origins that are allowed to make requests to the backend Ingest API. This CORS check is done as an extra security measure to prevent other sites from making requests to your backend. It must include the protocol and full domain. Ex: If your site is example.com and it can be accessed using https://example.com and https://www.example.com then both need to be listed. A value of * specifies all origins are allowed.
  • auth: Defaults to none. If you want to enable auth, you can specify either Basic Auth or Cognito auth but not both.
    • undefined: If not specified, then no authentication is applied, everything is publicly available.
    • basicAuth: Uses a CloudFront function to validate the Basic Auth credentials (see the sketch after this list). The credentials are hard coded in the Lambda function. This is not recommended for production; it also only secures the HTML page, and the API is still accessible without auth.
    • cognito: Uses an AWS Cognito user pool. Users will get a temporary password via email after deployment. They will then be prompted to change their password on the first login. This is the recommended option for production as it uses JWT tokens to secure the API as well.
  • domain: If specified, it will create the CloudFront and Cognito resources at the specified domain and optionally create the DNS records in the specified Route53 hosted zone. If not specified, it uses the default autogenerated CloudFront(cloudfront.net) and Cognito(auth.us-east-1.amazoncognito.com) domains. You can read the website URL from the stack output.
  • observability: Adds a CloudWatch Dashboard and Alarms if specified.
  • rateLimit: Adds a rate limit to the Ingest API and Frontend/Dashboard API. Defaults to 200 and 100 respectively.
  • anomaly: Adds anomaly detection for page views. The evaluation happens 20 min past the hour. The alert window defaults to 2 evaluations and both evaluations need to be breaching, where an evaluation is breaching if the value exceeds the breaching multiplier, which defaults to 2x the predicted value. An SNS Topic notifies the user via email when there is an anomaly.
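
To illustrate a few of the optional props above together, here is a minimal sketch. The exact property shapes for basicAuth, firehoseBufferInterval and rateLimit are assumptions and should be verified against the API.md docs; the values are illustrative only:

new Swa(this, 'swa-basic-auth-example', {
  environment: 'prod',
  awsEnv: { account: this.account, region: this.region },
  sites: ['example.com'],
  allowedOrigins: ['https://example.com', 'https://www.example.com'],
  /* Assumed shape: Basic Auth instead of Cognito, verify against API.md */
  auth: {
    basicAuth: {
      username: 'admin',
      password: 'change-me',
    },
  },
  /* Assumed: buffer Firehose writes for 5 minutes instead of the 15 minute default */
  firehoseBufferInterval: 300,
  /* Assumed shape: rate limits for the ingest and dashboard APIs (defaults 200 and 100) */
  rateLimit: {
    ingestLambdaConcurrency: 200,
    frontLambdaConcurrency: 100,
  },
});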

For a full list of options see the API.md docs.

You can see an example implementation of the demo site here

Certificate Requirements

When specifying a domain, the certificate must be in us-east-1 but your stack can be in ANY region. This is because CloudFront requires the certificate to be in us-east-1.

You have one of two choices:

  • Create the certificate in us-east-1 manually (Click Ops) and import it from the Cert ARN as in the demo example.
  • Create a us-east-1 stack that your main stack (the one that contains this construct) depends on; the main stack can be in any region. Create the certificate in the us-east-1 stack and export the cert ARN, then import the cert ARN in your main stack. Ensure that you have the crossRegionReferences flag set on both stacks so that the CDK can export and import the cert ARN via SSM, as shown in the sketch below. This is necessary because CloudFormation cannot export and import values across regions. Alternatively you can DIY it, here is a blog from AWS and a quick example from SO.
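
A minimal sketch of the second option, assuming standard CDK cross-region references; stack names, account and domain values below are illustrative:

import * as cdk from 'aws-cdk-lib';
import * as acm from 'aws-cdk-lib/aws-certificatemanager';

const app = new cdk.App();

// Certificate stack pinned to us-east-1, with cross-region references enabled.
const certStack = new cdk.Stack(app, 'analytics-cert', {
  env: { account: '123456789012', region: 'us-east-1' },
  crossRegionReferences: true,
});
const cert = new acm.Certificate(certStack, 'cert', {
  domainName: '*.serverless-website-analytics.com',
  validation: acm.CertificateValidation.fromDns(),
});

// Main stack in any region; CDK passes the cert ARN across regions via SSM.
const mainStack = new cdk.Stack(app, 'analytics', {
  env: { account: '123456789012', region: 'eu-west-1' },
  crossRegionReferences: true,
});
// new Swa(mainStack, 'swa', { ..., domain: { usEast1Certificate: cert, ... } });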

Client-side setup

There are two ways to use the client:

  • Standalone import script - Single line, standard JS script in your HTML.
  • SDK client - Import the SDK client into your project and use in any SPA.

Standalone Import Script Usage

Include the standalone script in your HTML:

<html lang="en">
<head> ... </head>
<body>
...
<script src="<YOUR BACKEND ORIGIN>/cdn/client-script.js" site="<THE SITE YOU ARE TRACKING>" attr-tracking="true"></script>
</body>
</html>

You need to replace <YOUR BACKEND ORIGIN> with the origin of your deployed backend. Available attributes on the script are:

  • site - Required. The name of your site, this must correspond with the name you specified when deploying the serverless-website-analytics backend.
  • attr-tracking - Optional. If "true", the script will track all <button> and <a> HTML elements that have the swa-event attribute on them. Example: <button swa-event="download">Download</button>. It is also possible to specify a category (swa-event-category) and data (swa-event-data), as shown in the snippet below.
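
For instance, a tracked element combining all three attributes could look like this (the attribute names come from the list above; the values are illustrative):

<button swa-event="download" swa-event-category="ebook" swa-event-data="1">Download</button>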

See the client-side library for more options.

Beacon/pixel tracking can be used as an alternative to HTML attribute tracking. Beacon tracking is useful for tracking events outside your domain, like email opens, external blog views, etc. See the client-side library for more info.

<img src="<YOUR BACKEND ORIGIN>/api-ingest/v1/event/track/beacon.gif?site=<SITE>&event=<EVENT>" height="1" width="1" alt="">

SDK Client Usage

Install the client-side library:

npm install serverless-website-analytics-client

Regardless of the framework, you have to do the following to track page views on your site:

  1. Initialize the client only once with analyticsPageInit. The site name must correspond with the one that you specified when deploying the serverless-website-analytics backend. You also need the URL to the backend. Make sure your frontend site's Origin is whitelisted in the backend config.
  2. On each route change call the analyticsPageChange function with the name of the new page.

Important

The serverless-website-analytics client can be used in ANY framework. To demonstrate this, find examples for Svelte and React in the client project.

Vue

./serverless-website-analytics-client/usage/vue/vue-project/src/main.ts

...
import * as swaClient from 'serverless-website-analytics-client';

const app = createApp(App);
app.use(router);

swaClient.v1.analyticsPageInit({
  inBrowser: true, //Not SSR
  site: "<Friendly site name>", //example.com
  apiUrl: "<YOUR BACKEND ORIGIN>", //https://my-serverless-website-analytics-backend.com
  // debug: true,
});
router.afterEach((event) => {
  swaClient.v1.analyticsPageChange(event.path);
});

app.mount('#app');

export { swaClient };

Event Tracking:

./usage/vue/vue-project/src/App.vue

import {swaClient} from "./main";
...
//                         (event: string, data?: number, category?: string)
swaClient.v1.analyticsTrack("subscribe", 1, "clicks")

Alternatively, you can use a beacon/pixel for tracking as described above in standalone import script usage.

Worst case projected costs

SEE THE FULL COST BREAKDOWN AND SPREADSHEET HERE

Important

These calculations do not consider the daily vacuum cron process, which reduces the number of S3 files stored by orders of magnitude. Real costs will be 10x to 100x lower than the worst-case costs.

The worst case projected costs are:

Views          Cost ($)
10,000         0.52
100,000        1.01
1,000,000      10.18
10,000,000     58.88
100,000,000    550.32

What's in the box

The architecture consists of four components: frontend, backend, ingestion API and the client JS library.

(Architecture diagram: serverless-website-analytics.drawio-2023-09-10.png)

See the highlights and design decisions sections in the CONTRIBUTING file for detailed info.

Frontend

AWS CloudFront is used to host the frontend. The frontend is a Vue 3 SPA app that is hosted on S3 and served through CloudFront. The Element UI Plus frontend framework is used for the UI components and Plotly.js for the charts.

(Dashboard screenshots: 2_frontend_1.png, 2_frontend_2.png, 2_frontend_3.png)

Backend

This is a Lambda-lith hit through the Lambda Function URLs (FURL) by reverse proxying through CloudFront. It is written in TypeScript and uses tRPC to handle API requests.

The queries to Athena are synchronous; the connection timeout between CloudFront and the FURL has been increased to 60 seconds. Partitions are dynamic and do not need to be added manually.

There are three available authentication configurations:

  • None, it is open to the public
  • Basic Authentication, basic protection for the index.html file
  • AWS Cognito, recommended for production

Anomaly detection

The serverless-website-analytics backend uses basic Anomaly Detection, see ANOMALY_DETECTION.md for more info.

(Screenshot: screenshot_normal_breach.png)

Ingestion API

Similarly to the backend, it is also a TS Lambda-lith that is hit through the FURL by reverse proxying through CloudFront. It also uses tRPC, but with the trpc-openapi package to generate an OpenAPI spec. This is used to generate the API types used in the client JS package, and can also be used to generate client libraries for other languages.

The Lambda function then saves the data to S3 through a Kinesis Firehose. The Firehose is configured to save the data in a partitioned manner, by site, year and month. The data is saved in Parquet format, buffered for 1 minute, which means the data will be stored after about 1 min ± 1 min.

Location data is obtained by looking the IP address up in the MaxMind GeoLite2 database. We don't store any Personally Identifiable Information (PII) in the logs or S3, the IP address is never stored.

Querying data manually

You can query the data manually using Athena. The data is partitioned by site and date. There are two tables, one for the page views (page_views) and another for the event tracking data (events).

Page views query:

WITH
cte_data AS (
  SELECT site, page_url, time_on_page, page_opened_at,
         ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
  FROM page_views
  WHERE (site = 'site1' OR site = 'site2') AND (page_opened_at_date = '2023-10-26' OR page_opened_at_date = '2023-10-27')
),
cte_data_filtered AS (
  SELECT *
  FROM cte_data
  WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-10-26 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
        AND parse_datetime('2023-11-03 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
cte_data_by_page_view AS (
SELECT
 site,
 page_url,
 COUNT(*) as "views",
 ROUND(AVG(time_on_page),2) as "avg_time_on_page"
FROM cte_data_filtered
GROUP BY site, page_url
)
SELECT *
FROM cte_data_by_page_view
ORDER BY views DESC, page_url ASC

Events query:

WITH
cte_data AS (
  SELECT site, category, event, data, tracked_at,
         ROW_NUMBER() OVER (PARTITION BY event_id) rn
  FROM events
  WHERE (site = 'site1' OR site = 'site2') AND (tracked_at_date = '2023-11-03' OR tracked_at_date = '2023-11-04')
),
cte_data_filtered AS (
  SELECT *
  FROM cte_data
  WHERE rn = 1 AND tracked_at BETWEEN parse_datetime('2023-11-03 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
        AND parse_datetime('2023-11-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
cte_data_by_event AS (
SELECT
 site,
 category,
 event,
 COUNT(data) as "count",
 ROUND(AVG(data),2) as "avg",
 MIN(data) as "min",
 MAX(data) as "max",
 SUM(data) as "sum"
FROM cte_data_filtered
GROUP BY site, category, event
)
SELECT *
FROM cte_data_by_event
ORDER BY count DESC, category ASC, event ASC

A few things to note:

  • The first CTE query is used to get the latest page view/event for each page/event, but it is only in the second query where we select the top row of that query.
  • The first query specifies the partitions, the site and dates. The dates can be specified with a range query, but it is more performant to specify the exact partitions.
  • The second query, along with selecting the latest row from the first, specifies the date range exactly, taking the time zone into consideration. Within the code we over-fetch the data by 2 days, to ensure that this second query has the data needed for the exact, time-zone-aware range.
  • The third query does the aggregation and the last one the ordering.

Upgrading

From V0 to V1

This upgrade brings two breaking changes:

  1. Daily partitions, querying is not backwards compatible. The data is still there, it is just in a different location so the dashboard will look empty after migrating.
  2. A change of Route53 record construct IDs that needs manual intervention (only if you specified the domain property)

Install the new version:

npm install serverless-website-analytics@~1

Data "loss" because of S3 path changes to accommodate daily partitions

Data will seem lost after upgrading to V1 because of the S3 path changes to accommodate daily partitions. The data is still there, it is just in a different location. The backend won't know about the old location and will only use the new location, so your dashboard will look empty after migrating. You can possibly run an Athena CTAS query to migrate the data to the new location, but it would need to be crafted carefully. If this is really important for you, please create a ticket and I can see if I can help.

Recreate the old Route53 records (only if you specified the domain property)

This is because we needed to change the CDK construct IDs of the Route53 records and Route53 cannot create duplicate record names. See issue: #26

There will be some downtime, it should be less than 10 minutes. If downtime is not acceptable then use CDK escape hatches to hardcode the Route53 record IDs of your existing constructs.
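
A rough escape-hatch sketch of that last option; the child path and logical ID below are placeholders, so inspect your synthesized template and construct tree to find the real ones:

import * as route53 from 'aws-cdk-lib/aws-route53';
import { Swa } from 'serverless-website-analytics';

const swa = new Swa(this, 'web-analytics', { /* ...props... */ });
// Hypothetical child path; check the output of `cdk synth` for the actual one.
const record = swa.node.findChild('cloudfront-record') as route53.ARecord;
const cfnRecord = record.node.defaultChild as route53.CfnRecordSet;
// Pin the logical ID to the one from your previous deployment so the record is not recreated.
cfnRecord.overrideLogicalId('ExistingRecordLogicalIdFromOldTemplate');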

IMPORTANT: Take note of the names and values of these DNS records as we need to recreate them manually after deleting them.

Order of operation:

  1. Delete the DNS records with the AWS CLI/Console:
     1.1 Delete the A record pointing to your CloudFront distribution as defined by the domain.name property.
     1.2 Optional: if using auth.cognito, also delete the Cognito login A record, which is defined as: {auth.cognito.loginSubDomain}.{domain.name}
  2. CDK deploy
  3. Recreate the DNS records with the AWS CLI/Console that you deleted in step 1.

If you do not delete them before upgrading, you will get one of these errors in CloudFormation and it will roll back.

[Tried to create resource record set [name='analytics.rehanvdm.com.', type='A'] but it already exists]
[Tried to create resource record set [name='login.analytics.rehanvdm.com.', type='A'] but it already exists]

Sponsors

Proudly sponsored by:

Contributing

See CONTRIBUTING.md for more info on how to contribute + design decisions.

Docs

Roadmap

Can be found here.

serverless-website-analytics's People

Contributors

astuyve, kc0bfv, omrilotan, rehanvdm, semantic-release-bot


serverless-website-analytics's Issues

Add observability

There is currently no monitoring, we need to track the important stuff like:

  • Lambda hard errors & lambda soft errors
  • Cloudfront/Lambda concurrency for API limits
  • Athena maybe?
  • Dashboard

I am thinking that these all need to be flags so that the user can decide what to turn on, because many CloudWatch alarms and dashboards can push up the cost again.

Implement filters

Create filters for all/most API calls. Just like on other platforms, I want to click on a page to narrow the stats that I am seeing, then continue clicking on another stat, say referrer. So these filters will be AND filters.

Experiment Glue Table Statistics

Can speed up the planning phase of Athena queries. But could not get it to work. See branch https://github.com/rehanvdm/serverless-website-analytics/tree/feature/experiment-column-level-statistics

Manually created a Glue table from a v2 of the Glue external table for experimentation, but it did not want to run:

I am trying to use the new Glue Table Statistics https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/ but I get an error:
Exception in User Class: java.lang.AssertionError : assertion failed: Conflicting partition column names detected: . Which does not help me at all 😅 Table details in the screenshots. I wonder if it is because I am using partition projection, but the limitations do not mention anything about it :thinking_face:
Anyone that can help me out? I don't want to/can't open a support ticket for this, it's for an open-source project: serverless-website-analytics

Not even selecting all columns individually except the partition columns worked, still got that error

Auto analyze traffic

The idea would be to auto attribute/tag certain points in time that are of importance, i.e. being able to identify that a spike in traffic is because of a certain referrer/UTM source/location. Maybe even going out to the internet and trying to find where it was shared.

A better name might be spike analysis or traffic finder.

Add an API for page view retrieval to show on the page itself

Thinking of caching the page URL + page view count in a DDB table with a TTL of, say, 15 minutes. BUT this will scan over all the partitions, so let's instead scan over the last 30 days, which also gives this feature a liveliness aspect. The average read time can also be exposed like this.

Email reports

Daily or weekly email reports configurable within the CDK construct

Basic real time view

Would be nice to have a basic version of the real-time view from GA:

  • number of users, visits, events and page views in the last 15 minutes
  • graph of the last 4 hours or so that updates in real time
  • real-time referrer, UTM and OS counts

As discussed with @rehanvdm on X/twitter, one way to do this would be to use Kinesis Streams - but that adds a $28 floor cost. Would thus definitely have to make this optional but are there any alternative methods that have a better cost scaling for small sites?

Bug: wrong A record for Cognito domain if using a subdomain

I have Hosted Zone example.com. With this config:

new Swa(this, "swa-example-com", {
	...
	auth: {
		cognito: {
			loginSubDomain: "login",
			users: [],
		},
	},
	domain: {
		name: 'swa.example.com',
		usEast1Certificate: props.certificate,
		hostedZone: route53.HostedZone.fromHostedZoneAttributes(this, "HostedZone", {
			hostedZoneId: 'Z01234',
			zoneName: 'example.com',
		}),
	},
});

What's created is:

  • Hosted Zone swa.example.com A record for the website ✅
  • Cognito Custom Domain login.swa.example.com ✅
  • Hosted Zone login.example.com A record for the Cognito Custom Domain ❌ -> should be login.swa.example.com

I'm pretty sure the fix is to change

recordName: props.auth.cognito.loginSubDomain,

to

recordName: cognitoDomain.domainName,

The route53.ARecord Construct accepts either subdomain name or fully qualified domain name: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.ARecord.html#recordname
Theoretically, the fully qualified domain name should end with . so we can do recordName: cognitoDomain.domainName + '.' but in fact CDK auto-fixes it 🤷‍♂️

Collapse all Google domain referrers as google

Collapse all Google domain referrers as google; currently we get many depending on region, like:

https://www.google.co.in/
https://www.google.com.ph/
https://www.google.it/
https://www.google.com.ua/

So if it starts with https://www.google then we just make it google
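
A one-line sketch of the proposed normalization (the function name is illustrative, not existing code):

function normalizeReferrer(referrer: string): string {
  return referrer.startsWith('https://www.google') ? 'google' : referrer;
}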

Certificate defined within the stack doesn't have a region so fails construct validation

I am defining the HostedZone and certificate within my stack, so the region check is failing because it gets a TOKEN instead of us-east-1:

Error: Certificate must be in us-east-1, not in ${Token[TOKEN.646]}, this is a requirement for CloudFront

Is having the cert (and hosted zone) in a separate stack such a strong best practice that the construct shouldn't support it?

If you want to allow this, then when a TOKEN is received, you could check the region of the stack. Alternatively, provide some prop to disable the region checking.
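
A possible sketch of that token-aware check (illustrative only, not the construct's actual validation code):

import { Stack, Token } from 'aws-cdk-lib';
import { ICertificate } from 'aws-cdk-lib/aws-certificatemanager';

// Skip the hard failure when the region is an unresolved token, falling back to the
// owning stack's region; only fail when a concrete region is known and is not us-east-1.
function validateCertRegion(cert: ICertificate): void {
  const region = Token.isUnresolved(cert.env.region) ? Stack.of(cert).region : cert.env.region;
  if (!Token.isUnresolved(region) && region !== 'us-east-1') {
    throw new Error(`Certificate must be in us-east-1, not in ${region}, this is a requirement for CloudFront`);
  }
}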

Stack looks like this:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Swa, SwaProps } from 'serverless-website-analytics'

export class WebAnalyticsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const analyticsZone = new cdk.aws_route53.PublicHostedZone(this, 'HostedZone', {
      zoneName: 'analytics.example.com',
    });

    const certificate = new cdk.aws_certificatemanager.Certificate(this, 'Certificate', {
      domainName: 'analytics.example.com',
      subjectAlternativeNames: ["*.analytics.example.com"],
      validation: cdk.aws_certificatemanager.CertificateValidation.fromDns(analyticsZone),
    });

    const options: SwaProps = {
      environment: 'production',
      awsEnv: {
        account: this.account,
        region: this.region
      },
      sites: ['example.com'],
      domain: {
        name: 'web.analytics.example.com',
        certificate: certificate,
        hostedZone: analyticsZone,
      },
      allowedOrigins: ["*"]
    }

    new Swa(this, 'web-analytics', options)
  }
}

Cannot create multiple constructs in one stack

The domain property creates a Route53 record for CloudFront with a fixed construct id, so if it already exists then you can only ever deploy 1 serverless-website-analytics component per stack.

This is on version 0.x.x.

You get the following error when using CDK deploy:

Error: There is already a Construct with name 'cloudfront-record' in App [xxx]

Do not authenticate the client side script when using Basic Auth

Bug as mentioned in this post https://dev.to/aws/deploying-a-serverless-web-analytics-solution-for-your-websites-5coh

I ended up having to change the code that the CDK stack deployed slightly. As part of the deployment there is a CloudFront function that implements basic authentication. I just needed to add an exclusion so that it would not try to authenticate requests to the client-script.js. From the CloudFront console, I updated the code as follows (the code is also in the repo)

Solution: just add a pass-through behavior on the cdn/* routes.
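
A sketch of that exclusion inside the Basic Auth CloudFront function (the real function runs on the CloudFront Functions JS runtime; this is illustrative only, not the repo's actual code):

function handler(event) {
  var request = event.request;
  // Pass through requests for the public client script instead of demanding Basic Auth.
  if (request.uri.indexOf('/cdn/') === 0) {
    return request;
  }
  // ...existing Basic Auth check on the Authorization header...
  return request;
}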

Janitor for pruning data

Create a cron job Lambda that runs a CTAS query to group and store the records by page_id and highest time_on_page, combining the initial page track and the final page track. This cuts down on the data stored. The system has been designed in such a way as to cater for this. It is basically the equivalent of a VACUUM command in Postgres terms, but can be done without locking.

Insights

Anomaly detection for both pages and events on:

  • Traffic
  • Referrer count
  • UTM Source

Store in DDB to be used by the frontend. Create an overview page with this that is the new default page showing page traffic, event traffic and then insights. Thinking of two columns, two rows. In the left column show the chart for each page and event traffic below each other and in the right column the insights as text, maybe even a condensed timeline.

Blocked by #69

Increase query performance and efficiency

We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly fewer S3 records.

Experiment

Doing a range query is not as efficient as specifying the partitions directly. Compare these:

Using the exact partition field (page_opened_at_date):

AND (
  page_opened_at_date = '2023-08-27' OR
  page_opened_at_date = '2023-08-28' OR
  page_opened_at_date = '2023-08-29' OR
  page_opened_at_date = '2023-08-30' OR
  page_opened_at_date = '2023-08-31' OR
  page_opened_at_date = '2023-09-01' OR
  page_opened_at_date = '2023-09-02' OR
  page_opened_at_date = '2023-09-03' OR
  page_opened_at_date = '2023-09-04')

Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB


Letting Athena extrapolate the partition from the timestamp field (page_opened_at):

 page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 

Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB


Results

Query Type                          Run time
Specify partitions                  1.3 s
Infer partitions from range query   4.7 s

Conclusion

Possible reasons for the improvement:

  • Simplified logic, less planning
  • Athena does not have to identify partitions, we explicitly specify them
  • Less data to be scanned, the range query scans more than it needs to

The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct. But there seems to be quite a significant performance loss if it infers the partitions automatically on range queries.

We will specify the exact partitions instead of the range query. The Athena query maximum length is ±260k characters; if one date condition (page_opened_at_date = '2023-08-24' OR) is 37 characters, then assuming 31 days and 12 months the extra length this adds is 31*37*12 = 13,764 characters, or about 5.3% of the maximum allowed length. Meaning it will even support 10-year queries, as the queries we have are not that complex.
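
An illustrative sketch (not the library's actual code) of building the explicit per-day partition predicate instead of a BETWEEN on the timestamp:

function datePartitionPredicate(from: Date, to: Date): string {
  const days: string[] = [];
  // Walk the range one day at a time and emit an equality condition per partition.
  for (const d = new Date(from); d <= to; d.setDate(d.getDate() + 1)) {
    days.push(`page_opened_at_date = '${d.toISOString().slice(0, 10)}'`);
  }
  return '(' + days.join(' OR ') + ')';
}

// datePartitionPredicate(new Date('2023-08-27'), new Date('2023-09-04'))
// => "(page_opened_at_date = '2023-08-27' OR ... OR page_opened_at_date = '2023-09-04')"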


Exact queries used in the test:

-- Query 1: specifying the exact partition dates
WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') 
              AND (
              page_opened_at_date = '2023-08-27' OR
              page_opened_at_date = '2023-08-28' OR
              page_opened_at_date = '2023-08-29' OR
              page_opened_at_date = '2023-08-30' OR
              page_opened_at_date = '2023-08-31' OR
              page_opened_at_date = '2023-09-01' OR
              page_opened_at_date = '2023-09-02' OR
              page_opened_at_date = '2023-09-03' OR
              page_opened_at_date = '2023-09-04')
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC

-- Query 2: letting Athena infer the partitions from the timestamp range
WITH 
          cte_data AS (
              SELECT user_id, country_name, page_opened_at,
                     ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
              FROM page_views
              WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests') AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
                    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS') 
          ),
          cte_data_filtered AS (
              SELECT *
              FROM cte_data
              WHERE rn = 1
          ),
          user_distinct_stat AS (
            SELECT
              user_id, country_name,
              COUNT(*) as "visitors"
            FROM cte_data_filtered
            WHERE country_name IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC
          )
          SELECT
            country_name  as "group",
            COUNT(*) as "visitors"
          FROM user_distinct_stat
          GROUP BY country_name
          ORDER BY visitors DESC

Clicking on a referrer should have a context menu instead

The referrer should not be added to the filter immediately; many times you want to copy the URL, so provide two options when left-clicking:

  • Filter by: "..."
  • Go to: "..."

Do the same with the page section as well.

This would mean that when that filter is selected we need to know about it and not show the Filter by option again.

S3 Express One Zone

Make use of S3 Express One Zone if it is supported in that region. Maybe make this a flag that can be set, not everyone would want this. I am expecting that it will not improve performance by much if at all. But it will cut the cost of S3 retrieval by 50%.

Expose the API Lambda concurrency limits

Expose the options on the constructor. The frontend is at 100 and the ingest API at 200.

We have a Lambda-lith for both APIs; we can throttle the API requests by throttling the Lambda concurrency.
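
Reserved concurrency is the usual CDK lever for this; a minimal sketch (illustrative only, not the construct's internals; the asset path is a placeholder):

import * as lambda from 'aws-cdk-lib/aws-lambda';

new lambda.Function(this, 'ingest-api', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('application/ingest'),
  // Caps concurrent executions, effectively rate limiting the ingest API.
  reservedConcurrentExecutions: 200,
});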

Complete tracking events

The Firehose and backend API calls are all done, it just needs testing and documentation. Oh and a frontend 😅. Yes, would need a quick design in Excalidraw maybe.

Update the Anomaly Detection Algorithm so that it better detects when an anomaly is over

Currently, an anomaly is marked as over: the first time a value is under the breach threshold and the first time the slope is positive.

2024-02-15 19:00 07 (17) ========== |
2024-02-15 18:00 01 (17) = #
2024-02-15 17:00 07 (19) ========== #
2024-02-15 16:00 08 (21) =========== #
2024-02-15 15:00 17 (21) ========================
2024-02-15 14:00 21 (20) =============================#
2024-02-15 13:00 20 (19) ===========================#=

The above example is of a slow breach, there we could say that if the value is below the threshold, then mark it as over.

But for big spikes it is a different story; test and find a solution for both scenarios. Here we are trying to find the whole event; if we just say it is over when the value is not breaching, then we lose the parts I marked in red.

Explore having a third EMA with a much slower reaction, maybe an Alpha of 0.01, then if that is crossed after an anomaly, mark the evaluation as OK? Something like that
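
A minimal EMA sketch to illustrate the proposed slow baseline (the alpha of 0.01 is the value suggested above; the function is illustrative):

// Exponential moving average: a small alpha means a slow reaction to new values.
function ema(previous: number, value: number, alpha = 0.01): number {
  return alpha * value + (1 - alpha) * previous;
}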

Expose the Firehose buffering time

By exposing the buffering time we let the user decide; the lower the value, the more it will cost for high-volume sites, because Firehose will write to S3 more frequently.

It is currently set at 1 min and the data shows up on the dashboard ±1 min after that, so 2 mins total until you see your page view.

I think the maximum is 15 mins buffering.

Add option to view by hour/day

This is currently decided for you by the logic here

If it is < 3 days then it will get data by the hour otherwise by the day.

Sometimes you want to look at a week's data and see spikes to identify when traffic originated.

Make this a drop-down on the chart

Add time filtering

Add time to the date time component and then when the chart is selected, update the date time picker and refresh

Auto create partitions

  • A cron to run at the beginning of each month that adds the new partitions
  • A CFN custom resource to run if changing the sites

Demo page improvements - self track & banner

Maybe two flags on the construct:

  1. Add a banner at the top pointing to this GH repo
  2. Make the page track itself and send data to its own endpoint. IF this flag is set then we need to automatically add the domain to the specified sites, which will cause a CFN dependency loop between frontend and backend. So either a custom resource OR just let the user add the site themselves, which is the better solution for now.

Consider adding pricing info in the readme

Great work with this solution!

It'd be great to have a section in the readme that discusses the average expected pricing for the solution.

For instance: running this solution with about 10M events/page views per month and a certain amount of dashboard interactions would cost you $$/month.

I think this would help people evaluate whether this is for them or not.
