pulumi-serverless-db's People

Contributors

dependabot[bot], evanboyle, jmaysrowland

Forkers

aterreno

pulumi-serverless-db's Issues

Support formats other than PARQUET

Parquet is a nice default for withStreamingInputTable, as Kinesis handles transforming JSON into Parquet. There are npm libraries for doing this conversion yourself, but that can be inconvenient. We should support other file formats so that it is easy to create something like a static fact table:

// Declare the table as JSON instead of the default Parquet.
const tableArgs: TableArgs = {
    columns,
    format: "JSON"
};

const dataWarehouse = new ServerlessDataWarehouse("analytics_dw")
    .withTable(impressionsTableName, tableArgs);

const s3Bucket = dataWarehouse.dataWarehouseBucket;

// ... configure an s3 client with a connection to the bucket ...
const dataFacts = [
    { ad: 1, author: "evan", date: "12/26/2019" },
    { ad: 2, author: "jordan", date: "12/27/2019" }
];
// Athena's JSON SerDes read one object per line, so write newline-delimited JSON.
s3Client.put(s3Path, dataFacts.map(fact => JSON.stringify(fact)).join("\n"));

Now we have a table with a static JSON dataset that is uploaded at deployment time. This will also be useful for batch tables.
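One way the format option could be wired up is as a lookup from the format name to the Hive storage classes that a Glue table definition needs. This is only a sketch: TableFormat, StorageDescriptor, and storageFor are hypothetical names that do not exist in the library. The class names themselves are the standard Hive and OpenX ones that Athena accepts for Parquet and JSON data.

```typescript
// Hypothetical names for illustration; not part of pulumi-serverless-db.
type TableFormat = "PARQUET" | "JSON";

interface StorageDescriptor {
    inputFormat: string;
    outputFormat: string;
    serializationLibrary: string;
}

// Standard Hive/OpenX class names understood by Athena.
const storageByFormat: Record<TableFormat, StorageDescriptor> = {
    PARQUET: {
        inputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        outputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        serializationLibrary: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
    },
    JSON: {
        inputFormat: "org.apache.hadoop.mapred.TextInputFormat",
        outputFormat: "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        serializationLibrary: "org.openx.data.jsonserde.JsonSerDe",
    },
};

// Keeps Parquet as the default so existing callers are unaffected.
function storageFor(format: TableFormat = "PARQUET"): StorageDescriptor {
    return storageByFormat[format];
}
```

Keeping the mapping in one table means that adding another format later is a one-entry change.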

Integration Tests

One thing that became painfully obvious to me during development of #8 was that we need an integration test. My current process for making changes is tearing down my stack (which requires manually deleting all of the data from the S3 buckets first), recreating it, and then issuing an Athena query to make sure I get data back from both tables. We could definitely write a script that automates this.

The integration test should do the following:

  1. Configure a uniquely named test stack.
  2. Run pulumi up.
  3. Run an Athena query on one or more tables, retrying until a specified timeout.
    i. It could take a minute or two for the test data to trickle through Firehose and for the partition to be registered, so the first query might not succeed.
    ii. We should retry with some sort of delay (10s?) over some period (2 mins?) before declaring the test a failure. We should experiment a little to figure out stable values for the delay and period.
  4. Report results.
  5. Tear down the stack with "pulumi destroy", making sure to delete the S3 buckets. This might be tricky, as an error is thrown if the data isn't deleted first. We need to do this even when the test fails, as we don't want to leak resources.
  6. Expose some sort of npm run test:integration command. We don't want this to run as part of the jest unit tests, but we should be able to invoke it easily with a separate command (which could still use jest somehow).
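The retry and teardown steps above could be sketched roughly as follows. All names here (retryUntil, withStack, RetryOptions) are hypothetical, and the Athena query, pulumi up, and pulumi destroy steps are passed in as functions so that the sketch stays self-contained and makes no assumptions about how they are invoked.

```typescript
// Hypothetical helpers for illustration; the real query and stack commands are injected.
interface RetryOptions {
    delayMs: number;   // wait between attempts, e.g. 10000
    timeoutMs: number; // total time budget, e.g. 120000
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Calls `attempt` until it returns something other than undefined,
// or until another delay would exceed the time budget.
async function retryUntil<T>(attempt: () => Promise<T | undefined>, opts: RetryOptions): Promise<T> {
    const deadline = Date.now() + opts.timeoutMs;
    let lastError: unknown = undefined;
    while (true) {
        try {
            const result = await attempt();
            if (result !== undefined) return result;
        } catch (err) {
            lastError = err; // e.g. the partition is not registered yet
        }
        if (Date.now() + opts.delayMs > deadline) {
            const detail = lastError ? `: ${String(lastError)}` : "";
            throw new Error(`no result within ${opts.timeoutMs}ms${detail}`);
        }
        await sleep(opts.delayMs);
    }
}

// Runs the test body and always tears the stack down, even when `up` or `body` throws.
async function withStack<T>(
    up: () => Promise<void>,
    destroy: () => Promise<void>,
    body: () => Promise<T>,
): Promise<T> {
    try {
        await up();
        return await body();
    } finally {
        await destroy();
    }
}
```

Putting the teardown in a finally block is what keeps a failed test from leaking resources. The destroy function would still need to empty the S3 buckets before destroying the stack.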

Implement .withBatchInputTable()

https://github.com/EvanBoyle/pulumi-serverless-db/pull/8/files#diff-f95bdcb0f919d600e736e8e9da74022dR93

We should implement a batch counterpart to withStreamingInputTable. The primary use I have in mind is aggregation of streaming events. Consider a DW with two tables, impressions and clicks. The business would like to create a higher-level table that includes some aggregated hourly facts, such as click-through rate. To accomplish this, you need to run a job once an hour that scans an hour's worth of clicks and an hour's worth of impressions, and outputs an S3 file with the contents of the summary:

adId: 1, impressions: 100, clicks: 5, CTR: 0.05
adId: 2, impressions: 10, clicks: 1, CTR: 0.1
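To make the shape of that summary concrete, here is a small sketch that computes it from two in-memory lists of events. The AdEvent and HourlySummary types and the summarize function are hypothetical. In practice the counting would more likely be pushed down into the Athena query itself, so this only illustrates the output the job has to produce.

```typescript
// Hypothetical row shapes; the real tables may carry more columns.
interface AdEvent {
    adId: number;
}

interface HourlySummary {
    adId: number;
    impressions: number;
    clicks: number;
    ctr: number; // clicks divided by impressions
}

function summarize(impressions: AdEvent[], clicks: AdEvent[]): HourlySummary[] {
    const counts = new Map<number, { impressions: number; clicks: number }>();
    const bump = (adId: number, key: "impressions" | "clicks") => {
        const entry = counts.get(adId) ?? { impressions: 0, clicks: 0 };
        entry[key] += 1;
        counts.set(adId, entry);
    };
    impressions.forEach(e => bump(e.adId, "impressions"));
    clicks.forEach(e => bump(e.adId, "clicks"));

    return Array.from(counts.entries())
        .sort(([a], [b]) => a - b)
        .map(([adId, c]) => ({
            adId,
            impressions: c.impressions,
            clicks: c.clicks,
            // Guard against clicks recorded for an ad with no impressions in the window.
            ctr: c.impressions === 0 ? 0 : c.clicks / c.impressions,
        }));
}
```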

We can provide an API, withBatchInputTable, that enables this sort of scenario using ECS Fargate. Mainly, we can allow the user to define a function to run in a container (the code that issues a query to one or more tables, writes the results to the correct S3 location, and creates a partition if necessary) and to define the interval that the task should run on.
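As a starting point for discussing the API surface, the arguments might look something like the sketch below. Everything here is hypothetical: BatchInputTableArgs, BatchWindow, and previousHour are illustrative names, and the column shape is an assumption. The only borrowed convention is the "rate(1 hour)" style of schedule expression used by AWS scheduled events.

```typescript
// Hypothetical API surface; none of these names exist in the library yet.
interface BatchWindow {
    start: Date; // inclusive
    end: Date;   // exclusive
}

interface BatchInputTableArgs {
    columns: { name: string; type: string }[]; // assumed column shape
    scheduleExpression: string;                // e.g. "rate(1 hour)"
    // User code: query the source tables for the window and write the results to S3.
    job: (window: BatchWindow) => Promise<void>;
}

// Returns the most recent complete UTC hour before `now`,
// which is the slice of data an hourly job would scan.
function previousHour(now: Date): BatchWindow {
    const HOUR_MS = 60 * 60 * 1000;
    const end = new Date(Math.floor(now.getTime() / HOUR_MS) * HOUR_MS);
    return { start: new Date(end.getTime() - HOUR_MS), end };
}

// Example arguments for an hourly click-through-rate table.
const ctrTableArgs: BatchInputTableArgs = {
    columns: [
        { name: "adId", type: "int" },
        { name: "impressions", type: "int" },
        { name: "clicks", type: "int" },
        { name: "ctr", type: "double" },
    ],
    scheduleExpression: "rate(1 hour)",
    job: async window => {
        // Placeholder body: a real job would run the queries and upload the summary.
        console.log(`aggregating ${window.start.toISOString()} to ${window.end.toISOString()}`);
    },
};
```

Passing the window into the job, rather than having the job read the clock, makes reruns and backfills for a specific hour straightforward.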

We can certainly start with a prototype using Lambda. I think that would be a good proof of concept to get the API surface area fleshed out. Eventually we should probably use something like ECS Fargate, given Lambda's memory and disk limitations. ECS Fargate gives you up to 30 GB of RAM, which enables more flexibility in the scale of aggregations and queries that users will be able to do.

Some examples for Fargate, which involve building a Docker image: https://github.com/pulumi/examples/tree/master/aws-ts-hello-fargate
https://www.pulumi.com/blog/get-started-with-docker-on-aws-fargate-using-pulumi/
https://www.pulumi.com/docs/tutorials/aws/ecs-fargate/
