meltanolabs / tap-dynamodb

Singer Tap for AWS DynamoDB built with the Meltano SDK

License: Apache License 2.0

Python 100.00%

tap-dynamodb's Issues

feature: add additional schema strategies and make configurable

Right now the tap always uses the "infer" strategy, which builds a JSON schema with genson from sample records. tap-mongodb has three common strategies that we'd probably want to support:

The strategy to use for schema resolution. Defaults to 'raw'. The 'raw' strategy uses a relaxed schema using additionalProperties: true to accept the document as-is leaving the target to respect it. Useful for blob or jsonl. The 'envelope' strategy will envelope the document under a key named document. The target should use a variant type for this key. The 'infer' strategy will infer the schema from the data based on a configurable number of documents.

infer
raw
envelope
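
A minimal sketch of how the three strategies could be switched on a single setting. The function names `build_schema` and `wrap_record` and the `infer_fn` callback are hypothetical; in the tap the callback would presumably wrap the existing genson-based inference.

```python
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


def build_schema(
    strategy: str,
    sample_records: Optional[List[Record]] = None,
    infer_fn: Optional[Callable[[List[Record]], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Return a JSON schema for one stream according to the chosen strategy."""
    if strategy == "raw":
        # Relaxed schema: accept each document as-is.
        return {"type": "object", "additionalProperties": True}
    if strategy == "envelope":
        # The whole document is nested under a single "document" key.
        return {
            "type": "object",
            "properties": {
                "document": {"type": "object", "additionalProperties": True}
            },
        }
    if strategy == "infer":
        if infer_fn is None:
            raise ValueError("the 'infer' strategy needs an inference function")
        return infer_fn(sample_records or [])
    raise ValueError(f"unknown schema strategy: {strategy!r}")


def wrap_record(strategy: str, record: Record) -> Record:
    """Shape a record so that it matches the schema produced above."""
    return {"document": record} if strategy == "envelope" else record
```

Keeping the record shaping next to the schema builder makes it harder for the two to drift apart when a strategy is added.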

feature: implement log based replication

This can be done using DynamoDB streams https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

The stream records within a shard are removed automatically after 24 hours.

I believe that it will require:

Also optionally using https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_ListStreams.html
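
A rough sketch of what reading a stream might look like, assuming a boto3 `dynamodbstreams` client is passed in and the stream ARN is already known (ListStreams could supply it). The function name `iter_stream_records` is hypothetical. As a simplification, it ignores shard pagination in DescribeStream and stops reading a shard at the first empty page.

```python
from typing import Any, Dict, Iterator


def iter_stream_records(
    streams_client: Any, stream_arn: str
) -> Iterator[Dict[str, Any]]:
    """Yield change records from every shard of one DynamoDB stream."""
    description = streams_client.describe_stream(StreamArn=stream_arn)
    for shard in description["StreamDescription"]["Shards"]:
        # TRIM_HORIZON starts at the oldest record still retained in the shard.
        iterator = streams_client.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        while iterator:
            page = streams_client.get_records(ShardIterator=iterator)
            records = page.get("Records", [])
            if not records:
                # Simplification: treat an empty page as "caught up".
                break
            yield from records
            iterator = page.get("NextShardIterator")
```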

docs: Document Required AWS Permissions

I think these are:

ListTables
Scan
DescribeTable
GetRecords

Future needs for Streams:

DescribeStream
ListStreams
GetShardIterator
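
For documentation purposes, the guessed permissions above could be rendered as an IAM policy document. This is only a sketch: the action list mirrors the unverified guesses in this issue, `build_policy` is a hypothetical helper, and the `"*"` resource is a placeholder that should be narrowed to specific table and stream ARNs.

```python
import json
from typing import Any, Dict

# Actions guessed in this issue for the current scan-based tap.
SCAN_ACTIONS = [
    "dynamodb:ListTables",
    "dynamodb:Scan",
    "dynamodb:DescribeTable",
    "dynamodb:GetRecords",
]
# Additional actions anticipated for Streams support.
STREAM_ACTIONS = [
    "dynamodb:DescribeStream",
    "dynamodb:ListStreams",
    "dynamodb:GetShardIterator",
]


def build_policy(include_streams: bool = False, resource: str = "*") -> Dict[str, Any]:
    """Build an IAM policy document granting the listed actions."""
    actions = list(SCAN_ACTIONS)
    if include_streams:
        actions += STREAM_ACTIONS
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": resource}],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(include_streams=True), indent=2))
```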

bug: properties in record message that aren't in schema

I noticed that with a small infer sample (e.g. 100 records) the tap sometimes doesn't see every property. The schema message then omits the property, but it still shows up in the record messages. The tap should filter out properties that aren't in the inferred schema.
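
A minimal sketch of the filtering step, assuming the inferred schema is a plain JSON schema dict with a top-level `properties` key. The helper name `filter_to_schema` is hypothetical, and nested objects are left untouched.

```python
from typing import Any, Dict


def filter_to_schema(record: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Drop top-level record keys that the schema does not declare."""
    allowed = schema.get("properties", {})
    return {key: value for key, value in record.items() if key in allowed}
```

For example, with a schema that declares only `id`, the record `{"id": 1, "late_field": "x"}` is reduced to `{"id": 1}`.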

bug: discover is querying more records than expected

The limit kwarg that's used was intended to query only a subset of the table, but it looks like the tap logic continues to iterate over more batches instead of breaking.

The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed dataset size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Working with Queries in the Amazon DynamoDB Developer Guide.

Potential Solutions:

Add an argument to the get_items_iter method

tap-dynamodb/tap_dynamodb/dynamo.py

Line 40 in 3d7ccda

self, table_name: str, scan_kwargs: dict = {"ConsistentRead": True}

to break after a certain number of batches. This way we could send in a limit of 100 and a batch count of 1 to return only 100 rows.
It's possible there's a cleaner way to pass these arguments into the method; consider other solutions.
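
One way the batch cap could look, as a sketch rather than the tap's actual method: this standalone version takes a table object with a `scan` method (such as a boto3 Table resource) instead of `self` and a table name, and `max_batches` is a hypothetical argument name. It also avoids the mutable default argument in the original signature.

```python
from typing import Any, Dict, Iterator, List, Optional


def get_items_iter(
    table: Any,
    scan_kwargs: Optional[Dict[str, Any]] = None,
    max_batches: Optional[int] = None,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of items per scan call, stopping after max_batches."""
    # Copy the kwargs so the caller's dict is never mutated.
    kwargs = dict(scan_kwargs) if scan_kwargs is not None else {"ConsistentRead": True}
    batches = 0
    while True:
        response = table.scan(**kwargs)
        yield response.get("Items", [])
        batches += 1
        start_key = response.get("LastEvaluatedKey")
        if start_key is None:
            break  # the scan reached the end of the table
        if max_batches is not None and batches >= max_batches:
            break  # the caller only wanted a sample
        kwargs["ExclusiveStartKey"] = start_key
```

Calling it with `scan_kwargs={"Limit": 100}` and `max_batches=1` issues a single scan request that evaluates at most 100 items.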

Require all credentials options to be explicit

After this is resolved we'll have a code path where implicit credentials are used. These could be config/credentials files installed on the machine, or instance roles.

The idea would be to require the tap user to always explicitly define where the credentials come from, so there's no chance of accidentally using the wrong ones.

My thoughts from slack https://meltano.slack.com/archives/C04TSH483DF/p1681752224881609?thread_ts=1681741163.334529&cid=C04TSH483DF

This was a temporary solution to an opinion that I had around requiring the tap user to explicitly configure how they want to authenticate. I described it a bit in #3 (comment). I've had weird behavior in taps that pull credentials from my environment or AWS config files on my machine, so I was hoping to require explicit auth for every method the tap supports. For your use case it might make sense to have an instance_auth=True tap setting that allows a session to be created without parameters, or maybe a generic installed_auth=True that also allows aws/config.json or aws/credentials files to be used. I would just want the user to tell the tap explicitly "use pre-installed configurations from my machine" rather than have the tap infer that.
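
A sketch of how the explicit opt-in could be enforced, assuming the `installed_auth` setting suggested above. The other setting names and the helper `resolve_session_kwargs` are hypothetical. The returned dict is meant to be passed as `boto3.Session(**kwargs)`, so an empty dict means "use whatever is installed on the machine".

```python
from typing import Any, Dict


def resolve_session_kwargs(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return boto3.Session keyword arguments, refusing to guess the source."""
    key_id = config.get("aws_access_key_id")
    secret = config.get("aws_secret_access_key")
    if key_id and secret:
        kwargs = {"aws_access_key_id": key_id, "aws_secret_access_key": secret}
        if config.get("aws_session_token"):
            kwargs["aws_session_token"] = config["aws_session_token"]
        return kwargs
    if config.get("aws_profile"):
        return {"profile_name": config["aws_profile"]}
    if config.get("installed_auth"):
        # The user explicitly asked for pre-installed credentials.
        return {}
    raise ValueError(
        "No credentials configured; set explicit keys, a profile, "
        "or installed_auth=True to use credentials installed on this machine."
    )
```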

feature: implement state capability

No state is tracked as of today.

LastEvaluatedKey + ExclusiveStartKey

I'm not exactly sure how the LastEvaluatedKey attribute works but we already have to use it for paginating within a single scan

tap-dynamodb/tap_dynamodb/dynamo.py

Line 53 in 9e582d9

start_key = response.get("LastEvaluatedKey", None)

so it might work for incrementally reading across syncs.

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html#DDB-Scan-request-ExclusiveStartKey
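
A sketch of carrying the key across runs, assuming a table object with a `scan` method and a plain dict as state. The function name and the `last_evaluated_key` state key are hypothetical. Note that this only resumes an interrupted scan: once the scan completes the key is cleared, and whether the approach can pick up new items across syncs is the open question above.

```python
from typing import Any, Dict, Iterator, Optional


def scan_from_bookmark(
    table: Any,
    state: Dict[str, Any],
    scan_kwargs: Optional[Dict[str, Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield items, starting from the key saved in state, and keep it updated."""
    kwargs = dict(scan_kwargs or {})
    if state.get("last_evaluated_key"):
        kwargs["ExclusiveStartKey"] = state["last_evaluated_key"]
    while True:
        response = table.scan(**kwargs)
        yield from response.get("Items", [])
        last_key = response.get("LastEvaluatedKey")
        # Persist the position only after the whole page has been emitted.
        state["last_evaluated_key"] = last_key
        if last_key is None:
            break
        kwargs["ExclusiveStartKey"] = last_key
```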

FilterExpression

A string that contains conditions that DynamoDB applies after the Scan operation, but before the data is returned to you. Items that do not satisfy the FilterExpression criteria are not returned.

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html#DDB-Scan-request-FilterExpression
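
A sketch of how a replication-key filter could be expressed as scan arguments, assuming the boto3 Table resource (which accepts plain Python values in `ExpressionAttributeValues`). The helper name and the idea of a single replication-key attribute are assumptions. Because the filter is applied after the scan reads the items, this would reduce what is returned but not what is read.

```python
from typing import Any, Dict, Optional


def incremental_scan_kwargs(
    replication_key: str, bookmark: Optional[Any]
) -> Dict[str, Any]:
    """Build scan kwargs that keep only items newer than the bookmark."""
    kwargs: Dict[str, Any] = {"ConsistentRead": True}
    if bookmark is not None:
        kwargs.update(
            FilterExpression="#rk > :bookmark",
            # A placeholder name avoids clashes with DynamoDB reserved words.
            ExpressionAttributeNames={"#rk": replication_key},
            ExpressionAttributeValues={":bookmark": bookmark},
        )
    return kwargs
```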

Stream based

Streams allow you to incrementally request stream data if it hasn't expired yet.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetRecords.html
Depends on #6

bug: exception when using implicit credentials

From a discussion in slack https://meltano.slack.com/archives/C04TSH483DF/p1681741163334529

I think I understand your idea behind it but I still feel like inferring as a last resort is still better than throwing an error for now. Maybe once the flag/setting is live it might make sense to not do that.

I agree with this statement. Until we decide how to handle #15 this should default to session = boto3.Session() instead of an exception.

refactor: consider using the Scan paginator instead of custom logic

I think this would eliminate much of the while loop in

tap-dynamodb/tap_dynamodb/dynamodb_connector.py

Line 63 in b159f2d

def get_items_iter(
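
A sketch of what the paginator-based version might look like, assuming a low-level boto3 DynamoDB client is passed in. The function name `scan_pages` and its arguments are hypothetical. One difference to keep in mind: the low-level client returns items in DynamoDB's typed attribute format, so the tap would likely still need a deserialization step.

```python
from typing import Any, Dict, Iterator, List, Optional


def scan_pages(
    client: Any,
    table_name: str,
    max_items: Optional[int] = None,
    page_size: Optional[int] = None,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of items per page using the built-in Scan paginator."""
    pagination: Dict[str, int] = {}
    if max_items is not None:
        pagination["MaxItems"] = max_items  # total items across all pages
    if page_size is not None:
        pagination["PageSize"] = page_size  # items requested per API call
    paginator = client.get_paginator("scan")
    pages = paginator.paginate(
        TableName=table_name,
        ConsistentRead=True,
        PaginationConfig=pagination,
    )
    for page in pages:
        yield page.get("Items", [])
```

The paginator handles `LastEvaluatedKey` and `ExclusiveStartKey` internally, which is the part of the while loop this would replace.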

feat: use input catalog properties to optimize dynamodb request

Following #11 (closed by #22), the tap respects input catalogs, but the selections aren't accounted for in the scan request. This is inefficient because data is retrieved that is never emitted. Instead, the tap could use the selected properties to build the ProjectionExpression argument for the scan operation so that the API request returns only the desired properties.
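
A minimal sketch of building that argument from a list of selected top-level property names. The helper name `projection_kwargs` is hypothetical. Every name goes through an `ExpressionAttributeNames` placeholder so that property names matching DynamoDB reserved words do not break the request.

```python
from typing import Any, Dict, List


def projection_kwargs(selected_properties: List[str]) -> Dict[str, Any]:
    """Build scan kwargs that return only the selected properties."""
    if not selected_properties:
        return {}  # nothing selected explicitly: leave the scan unrestricted
    names = {f"#p{i}": prop for i, prop in enumerate(selected_properties)}
    return {
        "ProjectionExpression": ", ".join(names),
        "ExpressionAttributeNames": names,
    }
```

For `["id", "name"]` this yields the expression `#p0, #p1` with a mapping from each placeholder to the real property name.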

bug: input catalog is not respected

From slack https://meltano.slack.com/archives/C04TSH483DF/p1681325283410999

As far as I know, the way this is set up, it will not respect a passed in catalog fwiw
It may be unexpected if a user "discovers" a catalog, mutates it via meltano extras on-the-fly or directly, and it is not respected. Changes might include dropping a stream or changing a data type incorrectly inferred by genson.