rehanvdm / serverless-website-analytics
A CDK construct that consists of a serverless backend, frontend and client side code to track website analytics
License: GNU General Public License v2.0
Create filters for all/most API calls, just like other platforms. I want to click on a page to narrow the stats that I am seeing, then continue clicking on another stat, say referrer, so these filters will be AND filters.
There will not be a time on page then; it will be 0, so no secondary request. But that is acceptable.
If it does not, Cognito thinks it is the wrong callback URL.
Need a logo and a favicon. Also add it to the readme
There is currently no monitoring, we need to track the important stuff like:
I am thinking that these need to all be flags so that the user can decide what to turn on. Because many CW Alarms and dashboards can push up the cost again.
Add time to the date time component and then when the chart is selected, update the date time picker and refresh
These still count as a view; it only means the secondary tracking event never made it to us. It is highly unlikely that the first arrived but the second did not, but it is possible.
I am defining the HostedZone and certificate within my stack, so the region check is failing because it gets a TOKEN instead of us-east-1:
Error: Certificate must be in us-east-1, not in ${Token[TOKEN.646]}, this is a requirement for CloudFront
Is having the cert (and hosted zone) in a separate stack such a strong best practice that the construct shouldn't support it?
If you want to allow this, then when a TOKEN is received, you could check the region of the stack. Alternatively, provide some prop to disable the region checking.
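A minimal sketch of what a token-tolerant check could look like. The helper names and the `skipRegionCheck` prop are assumptions, not the construct's actual code; in real CDK code, `cdk.Token.isUnresolved()` is the proper way to detect unresolved tokens.

```typescript
// Sketch: only hard-fail when the region is a concrete string that is not
// us-east-1. Unresolved CDK tokens serialize to strings like
// "${Token[TOKEN.646]}" and only resolve at deploy time, so they cannot be
// validated at synth time.
function isUnresolvedToken(value: string): boolean {
  return value.includes('${Token[');
}

function checkCertificateRegion(region: string, skipRegionCheck = false): string | null {
  if (skipRegionCheck) return null; // hypothetical escape-hatch prop
  if (isUnresolvedToken(region)) return null; // cannot know the region at synth time
  if (region !== 'us-east-1') {
    return `Certificate must be in us-east-1, not in ${region}, this is a requirement for CloudFront`;
  }
  return null; // region is fine
}
```

Returning null for tokens means the check degrades to a best-effort warning rather than a hard error when the certificate is defined in the same stack.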
Stack looks like this:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Swa, SwaProps } from 'serverless-website-analytics';

export class WebAnalyticsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const analyticsZone = new cdk.aws_route53.PublicHostedZone(this, 'HostedZone', {
      zoneName: 'analytics.example.com',
    });
    const certificate = new cdk.aws_certificatemanager.Certificate(this, 'Certificate', {
      domainName: 'analytics.example.com',
      subjectAlternativeNames: ['*.analytics.example.com'],
      validation: cdk.aws_certificatemanager.CertificateValidation.fromDns(analyticsZone),
    });

    const options: SwaProps = {
      environment: 'production',
      awsEnv: {
        account: this.account,
        region: this.region,
      },
      sites: ['example.com'],
      domain: {
        name: 'web.analytics.example.com',
        certificate: certificate,
        hostedZone: analyticsZone,
      },
      allowedOrigins: ['*'],
    };

    new Swa(this, 'web-analytics', options);
  }
}
The idea would be to auto attribute/tag certain points in time that are of importance, being able to identify that a spike in traffic is because of a certain referrer/UTM source/location. Maybe even going out to the internet and trying to find where it was shared.
A better name might be spike analysis or traffic finder.
The domain property creates a Route53 record for CloudFront with a fixed construct id, so if it already exists then you can only ever deploy one serverless-website-analytics component per stack.
This is on version 0.x.x.
You get the following error when using CDK deploy:
Error: There is already a Construct with name 'cloudfront-record' in App [xxx]
Also double-check if IPv6 works.
Collapse all google domain referrers as google; currently we get many depending on region, like:
https://www.google.co.in/
https://www.google.com.ph/
https://www.google.it/
https://www.google.com.ua/
So if it starts with https://www.google then we just make it google.
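The normalization above is a one-liner; a sketch (helper name assumed):

```typescript
// Collapse all regional Google referrer domains into a single "google"
// value, e.g. https://www.google.co.in/ and https://www.google.it/ both
// become "google". Anything else passes through unchanged.
function collapseReferrer(referrer: string): string {
  return referrer.startsWith('https://www.google') ? 'google' : referrer;
}
```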
I have Hosted Zone example.com. With this config:
new Swa(this, "swa-example-com", {
  ...
  auth: {
    cognito: {
      loginSubDomain: "login",
      users: [],
    },
  },
  domain: {
    name: 'swa.example.com',
    usEast1Certificate: props.certificate,
    hostedZone: route53.HostedZone.fromHostedZoneAttributes(this, "HostedZone", {
      hostedZoneId: 'Z01234',
      zoneName: 'example.com',
    }),
  },
});
What's created is:
swa.example.com: A record for the website ✅
login.swa.example.com ✅
login.example.com: A record for the Cognito Custom Domain ❌ -> should be login.swa.example.com
I'm pretty sure the fix is to change the record name to:
recordName: cognitoDomain.domainName,
The route53.ARecord construct accepts either a subdomain name or a fully qualified domain name: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_route53.ARecord.html#recordname
Theoretically, the fully qualified domain name should end with a trailing dot, so we could do recordName: cognitoDomain.domainName + '.', but in fact CDK auto-fixes it 🤷♂️
The firehose and backend API calls are all done; it just needs testing and documentation. Oh, and a frontend 😅. Yes, it would need a quick design, in Excalidraw maybe.
This can speed up the planning phase of Athena queries, but I could not get it to work. See branch https://github.com/rehanvdm/serverless-website-analytics/tree/feature/experiment-column-level-statistics
Manually created a Glue table from a v2 of the Glue external table for experimentation, but it did not want to run:
I am trying to use the new Glue Table Statistics https://aws.amazon.com/blogs/big-data/enhance-query-performance-using-aws-glue-data-catalog-column-level-statistics/ but I get an error:
Exception in User Class: java.lang.AssertionError : assertion failed: Conflicting partition column names detected:
Which does not help me at all 😅. Table details are in the screenshots. I wonder if it is because I am using partition projection, but the limitations do not mention anything about it 🤔
Anyone that can help me out? I don't want to/can't open a support ticket for this; it's for an open-source project: serverless-website-analytics
Not even selecting all columns individually, except the partition columns, worked; I still got that error.
Would be nice to have a basic version of the real-time view from GA:
As discussed with @rehanvdm on X/twitter, one way to do this would be to use Kinesis Streams - but that adds a $28 floor cost. Would thus definitely have to make this optional but are there any alternative methods that have a better cost scaling for small sites?
Expose the options on the constructor. The frontend is at 100 and the ingest API at 200.
We have a Lambdalith for both APIs, so we can throttle the API requests by throttling the Lambda concurrency.
Create a cron job lambda that runs a CTAS query to group and store the records by page_id and highest time_on_page, combining the initial page track and the final page track. This cuts down on the data stored. The system has been designed in such a way as to cater for this. It is basically the equivalent of a VACUUM command in Postgres terms, but can be done without locking.
Use JS to track href clicks; it should be possible, see https://gist.github.com/leighmcculloch/7596803
Store it as a tracking event? Not sure yet.
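A sketch of how outbound clicks could be detected (assumed approach, similar in spirit to the linked gist; the helper name is hypothetical):

```typescript
// Decide whether a clicked href leaves the current site. Relative or
// malformed hrefs throw in the URL constructor and are treated as on-site.
function isOutboundLink(href: string, currentHost: string): boolean {
  try {
    return new URL(href).host !== currentHost;
  } catch {
    return false;
  }
}

// In the browser this would hang off a delegated click listener, e.g.:
// document.addEventListener('click', (e) => {
//   const a = (e.target as HTMLElement).closest?.('a');
//   if (a?.href && isOutboundLink(a.href, location.host)) track('outbound', a.href);
// });
```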
This is currently decided for you by the logic here:
If the range is < 3 days then it will get data by the hour, otherwise by the day.
Sometimes you want to look at a week's data and see spikes to identify when traffic originated.
Make this a drop-down on the chart.
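The selection logic described above can be sketched as a small helper (the function name is an assumption):

```typescript
// Pick the chart bucket size from the selected date range: ranges shorter
// than 3 days are grouped by the hour, anything longer by the day.
function chartGrouping(fromDate: Date, toDate: Date): 'hour' | 'day' {
  const days = (toDate.getTime() - fromDate.getTime()) / (1000 * 60 * 60 * 24);
  return days < 3 ? 'hour' : 'day';
}
```

Exposing this as a drop-down would simply override the computed default.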
Maybe two flags on the construct:
Daily or weekly email reports configurable within the CDK construct
Thinking of caching the page URL + page view count in a DDB table, with a TTL of say 15 minutes. BUT this will scan over all the partitions. So instead, let's scan over the last 30 days only; then this feature has a liveliness counter to it as well. The average read time can also be exposed like this.
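The TTL part of the caching idea is straightforward; a sketch (helper name and constant are assumptions):

```typescript
// Compute the DynamoDB TTL attribute for a cached page-view count,
// expiring 15 minutes after it is written. DynamoDB TTL expects an epoch
// timestamp in seconds.
const CACHE_TTL_MINUTES = 15;

function cacheExpiry(nowMs: number): number {
  return Math.floor(nowMs / 1000) + CACHE_TTL_MINUTES * 60;
}
```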
Detect anomalies and send notifications via SNS (to email)
Bug as mentioned in this post https://dev.to/aws/deploying-a-serverless-web-analytics-solution-for-your-websites-5coh
I ended up having to change the code that the CDK stack deployed slightly. As part of the deployment there is a CloudFront function that implements basic authentication. I just needed to add an exclusion so that it would not try to authenticate requests to this client-script.js. From the CloudFront console, I updated the code as follows (the code is also in the repo).
Solution: just add a pass-through behavior on the cdn/* routes.
Use https://github.com/awslabs/llrt as the Lambda runtime to reduce Lambda execution time and cold starts.
See if it is possible and what improvements there are to gain by switching.
The referrer should not be added to the filter immediately; many times you want to copy the URL, so provide two options when left clicking:
Do the same with the page section as well.
This would mean that when that filter is selected we need to know about it and not show the Filter by option again.
By exposing the buffering time we let the user decide; the lower the value, the more it will cost for high-volume sites, because Firehose will write to S3 more frequently.
It is currently set at 1 min and the data then shows up on the dashboard ±1 min after that, so 2 mins total until you see your page view.
I think the maximum is 15 mins of buffering.
Currently, an anomaly is marked as over: the first time a value is under the breach threshold and the first time the slope is positive.
2024-02-15 19:00 07 (17) ========== |
2024-02-15 18:00 01 (17) = #
2024-02-15 17:00 07 (19) ========== #
2024-02-15 16:00 08 (21) =========== #
2024-02-15 15:00 17 (21) ========================
2024-02-15 14:00 21 (20) =============================#
2024-02-15 13:00 20 (19) ===========================#=
The above example is of a slow breach, where we could say that if the value is below the threshold, then mark it as over.
But for big spikes it is a different story; test and find a solution for both scenarios. Here we are trying to find the whole event; if we just stop when the value is no longer breaching then we lose the parts I marked in red.
Explore having a third EMA with a much slower reaction, maybe an Alpha of 0.01, then if that is crossed after an anomaly, mark the evaluation as OK? Something like that
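The slow third EMA idea can be sketched as follows (the 0.01 alpha is the value floated above; seeding with the first sample is my assumption, not the project's actual implementation):

```typescript
// Exponential moving average with a configurable alpha. A small alpha
// (e.g. 0.01) reacts slowly, so the series crossing back under this slow
// EMA after an anomaly could mark the whole anomaly window as over.
function ema(values: number[], alpha: number): number[] {
  const out: number[] = [];
  let prev = values[0]; // seed with the first sample
  for (const v of values) {
    prev = alpha * v + (1 - alpha) * prev;
    out.push(prev);
  }
  return out;
}
```

For the hourly series above, one would compare each value against `ema(values, 0.01)` in addition to the existing faster EMAs.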
Great work with this solution!
It’d be great to have in the readme a section that discusses the average expected pricing for the solution.
For instance: running this solution with about 10M events/page views per month and a certain amount of dashboard interactions would cost you $$/month.
I think this would help people evaluate whether this is for them or not.
There is a column (boolean) on each record that marks if the view was a bot or human. We can also show this on the dashboard.
We can get a 2x to 3x improvement by specifying the partitions explicitly and not letting Athena infer them in queries. This also scans less data and possibly fewer S3 records.
Doing a range query is not as efficient as specifying the partitions directly. Compare these:
Using the exact partition field (page_opened_at_date):
AND (
page_opened_at_date = '2023-08-27' OR
page_opened_at_date = '2023-08-28' OR
page_opened_at_date = '2023-08-29' OR
page_opened_at_date = '2023-08-30' OR
page_opened_at_date = '2023-08-31' OR
page_opened_at_date = '2023-09-01' OR
page_opened_at_date = '2023-09-02' OR
page_opened_at_date = '2023-09-03' OR
page_opened_at_date = '2023-09-04')
Results (71) | Time in queue: 143 ms | Run time: 1.308 sec | Data scanned: 1.12 MB
Letting Athena extrapolate the partition from the timestamp field (page_opened_at):
page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
Results (71) | Time in queue: 129 ms | Run time: 4.663 sec | Data scanned: 1.49 MB
Query Type | Run time
---|---
Specify partitions | 1.3s
Infer partitions from range query | 4.7s
Possible reasons for the increase:
The initial assumption that Athena will infer the partitions if we are using automatic partition projection is still correct, but there seems to be quite a significant performance loss when it infers the partitions automatically on range queries.
We will specify the exact partitions instead of the range query. The Athena query maximum length is ±260k characters; if one date condition (page_opened_at_date = '2023-08-24' OR) is 37 characters, then assuming 31 days and 12 months, the extra length this adds per year is 31*37*12 = 13,764 characters, or about 5.3% of the maximum allowed length. Meaning it will even support 10-year queries, as the queries we have are not that complex.
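Generating the explicit partition predicate is mechanical; a sketch (the helper name is hypothetical, the column name is the one used in the queries below):

```typescript
// Expand an inclusive date range (YYYY-MM-DD strings) into explicit
// page_opened_at_date equality conditions, instead of a BETWEEN range
// that forces Athena to infer the partitions itself.
function partitionPredicate(from: string, to: string): string {
  const conditions: string[] = [];
  const d = new Date(from + 'T00:00:00Z');
  const end = new Date(to + 'T00:00:00Z');
  while (d.getTime() <= end.getTime()) {
    conditions.push(`page_opened_at_date = '${d.toISOString().slice(0, 10)}'`);
    d.setUTCDate(d.getUTCDate() + 1);
  }
  return `(${conditions.join(' OR ')})`;
}
```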
Exact queries used in the test:
WITH
cte_data AS (
  SELECT user_id, country_name, page_opened_at,
    ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
  FROM page_views
  WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests')
    AND (
      page_opened_at_date = '2023-08-27' OR
      page_opened_at_date = '2023-08-28' OR
      page_opened_at_date = '2023-08-29' OR
      page_opened_at_date = '2023-08-30' OR
      page_opened_at_date = '2023-08-31' OR
      page_opened_at_date = '2023-09-01' OR
      page_opened_at_date = '2023-09-02' OR
      page_opened_at_date = '2023-09-03' OR
      page_opened_at_date = '2023-09-04')
),
cte_data_filtered AS (
  SELECT *
  FROM cte_data
  WHERE rn = 1 AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
user_distinct_stat AS (
  SELECT
    user_id, country_name,
    COUNT(*) as "visitors"
  FROM cte_data_filtered
  WHERE country_name IS NOT NULL
  GROUP BY 1, 2
  ORDER BY 3 DESC
)
SELECT
  country_name as "group",
  COUNT(*) as "visitors"
FROM user_distinct_stat
GROUP BY country_name
ORDER BY visitors DESC
WITH
cte_data AS (
  SELECT user_id, country_name, page_opened_at,
    ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY time_on_page DESC) rn
  FROM page_views
  WHERE (site = 'rehanvdm.com' OR site = 'cloudglance.dev' OR site = 'blog.cloudglance.dev' OR site = 'docs.cloudglance.dev' OR site = 'tests')
    AND page_opened_at BETWEEN parse_datetime('2023-08-27 22:00:00.000','yyyy-MM-dd HH:mm:ss.SSS')
    AND parse_datetime('2023-09-04 21:59:59.999','yyyy-MM-dd HH:mm:ss.SSS')
),
cte_data_filtered AS (
  SELECT *
  FROM cte_data
  WHERE rn = 1
),
user_distinct_stat AS (
  SELECT
    user_id, country_name,
    COUNT(*) as "visitors"
  FROM cte_data_filtered
  WHERE country_name IS NOT NULL
  GROUP BY 1, 2
  ORDER BY 3 DESC
)
SELECT
  country_name as "group",
  COUNT(*) as "visitors"
FROM user_distinct_stat
GROUP BY country_name
ORDER BY visitors DESC
Make use of S3 One Zone if it is supported in that region. Maybe make this a flag that can be set; not everyone would want this. I am expecting that it will not improve performance by much, if at all, but it will cut the cost of S3 retrieval by 50%.
Anomaly detection for both pages and events on:
Store it in DDB to be used by the frontend. Create an overview page with this that becomes the new default page, showing page traffic, event traffic and then insights. Thinking of two columns and two rows: in the left column show the charts for page and event traffic below each other, and in the right column the insights as text, maybe even a condensed timeline.
Blocked by #69