evanboyle / pulumi-serverless-db

TypeScript 99.57% JavaScript 0.43%

pulumi-serverless-db's Issues

Support formats other than PARQUET

Parquet is a nice default for withStreamingInputTable because Kinesis handles transforming JSON into Parquet. There are npm libraries for doing this conversion yourself, but that can be inconvenient. We should support other file formats so that it is easy to create something like a static fact table:

const tableArgs: TableArgs = {
    columns,
    format: "JSON"
};

const dataWarehouse = new ServerlessDataWarehouse("analytics_dw")
    .withTable(impressionsTableName, tableArgs);

const s3Bucket = this.dataWarehouseBucket;

// ... configure an s3 client with a connection to the bucket ...
const dataFacts = [
    { ad: 1, author: "evan", date: "12/26/2019" },
    { ad: 2, author: "jordan", date: "12/27/2019" }
];
s3Client.put(s3Path, JSON.stringify(dataFacts));

Now we have a table with a static JSON dataset that is uploaded at deployment time. This will also be useful for batch tables.
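One thing to watch with the JSON format: Athena's JSON SerDes expect one record per line, so a file holding a single top-level array (what JSON.stringify on the whole array produces) will probably not be read as separate rows. A minimal sketch of writing newline-delimited JSON instead; the AdFact type and the toJsonLines helper are hypothetical names, not part of the library:

```typescript
// Hypothetical row type matching the example facts above.
interface AdFact {
    ad: number;
    author: string;
    date: string;
}

// Serialize rows as newline-delimited JSON: one object per line, no enclosing array.
function toJsonLines<T>(rows: T[]): string {
    return rows.map(row => JSON.stringify(row)).join("\n");
}

const adFacts: AdFact[] = [
    { ad: 1, author: "evan", date: "12/26/2019" },
    { ad: 2, author: "jordan", date: "12/27/2019" },
];

// This string is what would be uploaded to the table's S3 location.
const jsonLinesBody = toJsonLines(adFacts);
```

The upload call itself stays the same; only the body changes.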

Integration Tests

One thing that became painfully obvious to me during development of #8 was that we need an integration test. My current process for making changes is tearing down my stack (which requires manually deleting all of the data from the S3 buckets first), recreating it, and then issuing an Athena query to make sure I get data back from both tables. We could definitely write a script that automates this.

The integration test should do the following:

1. Configure a uniquely named test stack.
2. Run "pulumi up".
3. Run an Athena query on one or more tables, retrying until a specified timeout.
   i. Given that it could take a minute or two for the test data to trickle through Firehose and for the partition to be registered, the first query might not succeed.
   ii. We should retry with some sort of delay (10s?) over some period (2 mins?) before declaring the test a failure. We should experiment a little bit to figure out stable values for the delay and period.
4. Report results.
5. Tear down the stack with "pulumi destroy", making sure to delete the S3 buckets. This might be tricky, as an error is thrown if the data isn't deleted first. We need to make sure we do this even when the test fails, as we don't want to leak resources.
6. Expose some sort of npm run test:integration command. We don't want this to run as a part of the jest unit tests, but we should be able to invoke it easily with a separate command (could still use jest somehow).
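To make the retry and teardown requirements concrete, here is a rough sketch of the control flow. Every name in it (IntegrationSteps, pollForRows, runIntegrationTest) is hypothetical, and the stack and query operations are injected as plain async functions so the sketch does not assume any particular Pulumi or AWS SDK call:

```typescript
// Hypothetical step functions; a real test would wire these to the Pulumi CLI and Athena.
interface IntegrationSteps {
    up: () => Promise<void>;               // deploy the uniquely named test stack
    queryRowCount: () => Promise<number>;  // run a query, return the number of rows
    destroy: () => Promise<void>;          // empty the buckets, then destroy the stack
}

interface RetryOptions {
    delayMs: number;    // wait between attempts, e.g. 10_000
    timeoutMs: number;  // total retry budget, e.g. 120_000
    sleep?: (ms: number) => Promise<void>; // injectable so tests can skip real waiting
}

const defaultSleep = (ms: number): Promise<void> =>
    new Promise<void>(resolve => setTimeout(resolve, ms));

// Retry the query until it returns at least one row or the attempt budget is used up.
// The budget is counted in attempts (timeoutMs / delayMs), not wall-clock time.
async function pollForRows(query: () => Promise<number>, opts: RetryOptions): Promise<boolean> {
    const sleep = opts.sleep ?? defaultSleep;
    const maxAttempts = Math.floor(opts.timeoutMs / opts.delayMs) + 1;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            if ((await query()) > 0) return true;
        } catch {
            // The query itself failed (for example, no partition yet): treat as "not ready".
        }
        if (attempt < maxAttempts) await sleep(opts.delayMs);
    }
    return false;
}

// Deploy, poll, and always tear down, even if deployment or polling throws.
async function runIntegrationTest(steps: IntegrationSteps, opts: RetryOptions): Promise<boolean> {
    try {
        await steps.up();
        return await pollForRows(steps.queryRowCount, opts);
    } finally {
        await steps.destroy();
    }
}
```

Putting destroy in a finally block is what keeps a failed run from leaking resources; the delay and timeout values are left as parameters since we still need to find stable ones.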

Implement .withBatchInputTable()

https://github.com/EvanBoyle/pulumi-serverless-db/pull/8/files#diff-f95bdcb0f919d600e736e8e9da74022dR93

We should implement a batch counterpart to withStreamingInputTable. The primary use I have in mind is aggregation of streaming events. Consider a DW with two tables, impressions and clicks. The business would like to create a higher-level table that includes some aggregated hourly facts such as click-through rate. To accomplish this, you need to run a job once an hour that scans an hour's worth of clicks and an hour's worth of impressions, and outputs an S3 file with the contents of the summary:

adId: 1, impressions: 100, clicks: 5, CTR: .05
adId: 2, impressions: 10, clicks: 1, CTR: .1
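As an illustration of what the hourly job computes, here is a small in-memory sketch. The AdEvent and HourlySummary types and the summarizeHour function are hypothetical; a real job would more likely push the aggregation into an Athena query, but the arithmetic is the same:

```typescript
// Hypothetical event and summary shapes for illustration only.
interface AdEvent {
    adId: number;
}

interface HourlySummary {
    adId: number;
    impressions: number;
    clicks: number;
    ctr: number; // clicks divided by impressions
}

// Count how many events were recorded for each ad.
function countByAd(events: AdEvent[]): Map<number, number> {
    const counts = new Map<number, number>();
    for (const e of events) {
        counts.set(e.adId, (counts.get(e.adId) ?? 0) + 1);
    }
    return counts;
}

// Build one summary row per ad that had at least one impression in the hour.
// Ads with clicks but no impressions are skipped to avoid dividing by zero.
function summarizeHour(impressions: AdEvent[], clicks: AdEvent[]): HourlySummary[] {
    const impressionCounts = countByAd(impressions);
    const clickCounts = countByAd(clicks);
    const summaries: HourlySummary[] = [];
    impressionCounts.forEach((impressionCount, adId) => {
        const clickCount = clickCounts.get(adId) ?? 0;
        summaries.push({
            adId,
            impressions: impressionCount,
            clicks: clickCount,
            ctr: clickCount / impressionCount,
        });
    });
    return summaries;
}
```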

We can provide an API, withBatchInputTable, that enables this sort of scenario using ECS Fargate. Mainly, we can allow the user to define a function to run in a container (the code that issues a query to one or more tables, writes the results to the correct S3 location, and creates a partition if necessary) and define the interval on which the task should run.
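As a starting point for discussing the API surface, one possible shape is sketched below. None of this exists in the library: BatchInputTableArgs, its fields, and the DataWarehouseSketch class are all assumptions made up for illustration, and the sketch only records the configuration rather than creating any resources:

```typescript
// Everything below is a hypothetical shape for discussion, not the library's API.
interface BatchInputTableArgs {
    columns: { name: string; type: string }[];
    // How often the job runs, written as an AWS schedule expression such as "rate(1 hour)".
    scheduleExpression: string;
    // The job body: query source tables, write results to S3, register a partition if needed.
    jobFn: () => Promise<void>;
}

class DataWarehouseSketch {
    readonly batchTables = new Map<string, BatchInputTableArgs>();

    // Record a batch table definition and return this so calls can be chained,
    // in the same style as the withTable call shown earlier.
    withBatchInputTable(name: string, args: BatchInputTableArgs): this {
        if (this.batchTables.has(name)) {
            throw new Error(`Duplicate table name: ${name}`);
        }
        this.batchTables.set(name, args);
        return this;
    }
}

const sketch = new DataWarehouseSketch().withBatchInputTable("hourly_ctr", {
    columns: [
        { name: "adId", type: "int" },
        { name: "impressions", type: "int" },
        { name: "clicks", type: "int" },
        { name: "ctr", type: "double" },
    ],
    scheduleExpression: "rate(1 hour)",
    jobFn: async () => {
        // Placeholder: the real job would run the aggregation and write to S3.
    },
});
```

Whether the job runs on Lambda or Fargate could then be an implementation detail behind the same arguments.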

We can certainly start with a prototype using Lambda. I think that would be a good proof of concept to get the API surface area fleshed out. Eventually we should probably use something like ECS Fargate given Lambda's memory and disk limitations. ECS Fargate gives you up to 30 GB of RAM, which enables more flexibility in the scale of aggregations and queries that users will be able to do.

Some examples for Fargate, which involve building a Docker image:
https://github.com/pulumi/examples/tree/master/aws-ts-hello-fargate
https://www.pulumi.com/blog/get-started-with-docker-on-aws-fargate-using-pulumi/
https://www.pulumi.com/docs/tutorials/aws/ecs-fargate/
