tap-dynamodb's People

Contributors

dependabot[bot], edgarrmondragon, meltybot, pnadolny13

Forkers

b2tgame

tap-dynamodb's Issues

feature: add additional schema strategies and make configurable

Right now the tap always uses the "infer" strategy, building a JSON schema with genson from sample records. From tap-mongodb, there are three common strategies that we'd probably want to support:

The strategy to use for schema resolution. Defaults to 'raw'. The 'raw' strategy uses a relaxed schema using additionalProperties: true to accept the document as-is leaving the target to respect it. Useful for blob or jsonl. The 'envelope' strategy will envelope the document under a key named document. The target should use a variant type for this key. The 'infer' strategy will infer the schema from the data based on a configurable number of documents.

  • infer
  • raw
  • envelope
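As a rough illustration of how the three strategies could differ, here is a minimal sketch. The function names (build_schema, infer_schema) and the exact shape of the envelope schema are assumptions, not the tap's code, and the toy infer_schema only stands in for genson so that the example has no third-party dependency.

```python
from decimal import Decimal
from typing import Any, Dict, Iterable

# Maps Python types to JSON Schema type names for the toy "infer" strategy.
_JSON_TYPES = {
    str: "string",
    bool: "boolean",
    int: "integer",
    float: "number",
    Decimal: "number",
    dict: "object",
    list: "array",
    type(None): "null",
}


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Toy stand-in for genson: collect the top-level types seen per property."""
    seen: Dict[str, set] = {}
    for record in records:
        for key, value in record.items():
            seen.setdefault(key, set()).add(_JSON_TYPES.get(type(value), "string"))
    properties = {key: {"type": sorted(types)} for key, types in seen.items()}
    return {"type": "object", "properties": properties}


def build_schema(strategy: str, sample: Iterable[Dict[str, Any]] = ()) -> Dict[str, Any]:
    """Return a JSON schema for the configured strategy (hypothetical helper)."""
    if strategy == "raw":
        # Relaxed schema: accept the document as-is.
        return {"type": "object", "additionalProperties": True}
    if strategy == "envelope":
        # Whole document nested under a single "document" key (assumed shape).
        return {
            "type": "object",
            "properties": {
                "document": {"type": "object", "additionalProperties": True}
            },
        }
    if strategy == "infer":
        return infer_schema(sample)
    raise ValueError(f"Unknown schema strategy: {strategy!r}")
```

With the envelope strategy, each record would also need to be wrapped as {"document": item} before it is emitted so that it matches the schema.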

feature: implement log based replication

bug: properties in record message that arent in schema

I noticed that when I have a small infer sample (e.g. 100 records), it sometimes doesn't see every property. The schema message then doesn't include the property, but those properties still show up in the record messages. The tap should filter out any properties that aren't in the inferred schema.
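The filtering described above could look roughly like the following sketch. The helper name filter_to_schema is hypothetical, and it only checks top-level properties; nested objects would need a recursive version.

```python
from typing import Any, Dict


def filter_to_schema(record: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Drop top-level keys that the schema does not declare (hypothetical helper)."""
    allowed = schema.get("properties", {})
    return {key: value for key, value in record.items() if key in allowed}
```

For example, a record {"id": 1, "extra": "x"} checked against a schema that only declares "id" would be emitted as {"id": 1}.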

bug: discover is querying more records than expected

The limit kwarg that's used was intended to query only a subset of the table, but it looks like the tap logic continues to iterate over more batches instead of breaking.

The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off. Also, if the processed dataset size exceeds 1 MB before DynamoDB reaches this limit, it stops the operation and returns the matching values up to the limit, and a key in LastEvaluatedKey to apply in a subsequent operation to continue the operation. For more information, see Working with Queries in the Amazon DynamoDB Developer Guide.

Potential Solutions:

  • Add an argument to the get_items_iter method
    self, table_name: str, scan_kwargs: dict = {"ConsistentRead": True}
    to break after a certain number of batches. This way we could pass a limit of 100 and a batch count of 1 to return only 100 rows.
  • It's possible there's a cleaner way to pass these arguments into the method; consider other solutions.
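The first option above could be sketched as follows. This is not the tap's actual method: iter_scan_batches and max_batches are hypothetical names, and the scan operation is passed in as a callable so that the example runs without AWS access. It also uses None as the default for scan_kwargs, which avoids sharing one mutable default dict across calls.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

ScanFn = Callable[..., Dict[str, Any]]


def iter_scan_batches(
    scan: ScanFn,
    scan_kwargs: Optional[Dict[str, Any]] = None,
    max_batches: Optional[int] = None,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of items per scan call, stopping after max_batches."""
    # Copy the kwargs so the caller's dict is never mutated.
    kwargs = {"ConsistentRead": True} if scan_kwargs is None else dict(scan_kwargs)
    batches = 0
    while True:
        response = scan(**kwargs)
        yield response.get("Items", [])
        batches += 1
        start_key = response.get("LastEvaluatedKey")
        # Stop when the table is exhausted or the batch budget is used up.
        if start_key is None or (max_batches is not None and batches >= max_batches):
            break
        kwargs["ExclusiveStartKey"] = start_key
```

With boto3, scan could be boto3.resource("dynamodb").Table(table_name).scan, and discovery could call iter_scan_batches(scan, {"Limit": 100}, max_batches=1) to evaluate at most 100 items.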

Require all credentials options to be explicit

After this is resolved we'll have a code path where implicit credentials are used. These could be config/credentials files installed on the machine or instance roles.

The idea would be to require the tap user to always explicitly define where the credentials are coming from so there's no chance of accidentally using the wrong ones.

My thoughts from Slack: https://meltano.slack.com/archives/C04TSH483DF/p1681752224881609?thread_ts=1681741163.334529&cid=C04TSH483DF

This was a temporary solution to an opinion that I had around requiring the tap user to explicitly configure how they want to authenticate. I described it a bit in #3 (comment). I've had weird behavior in taps that pull credentials from my environment or AWS config files on my machine, so I was hoping to require explicit auth for every method the tap supports. For your use case it might make sense to have an instance_auth=True tap setting that allows a session to be created without parameters, or maybe a generic installed_auth=True that means it could allow aws/config.json or aws/credentials files to be used as well. I would just want the user to tell the tap explicitly "use pre-installed configurations from my machine" rather than have the tap infer that.
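One way to enforce this is a small check that runs before any session is created. The sketch below is only an illustration: resolve_auth_method is a hypothetical helper, installed_auth is the setting name floated above, and aws_access_key_id / aws_secret_access_key are assumed names for the explicit key settings.

```python
from typing import Any, Dict


def resolve_auth_method(config: Dict[str, Any]) -> str:
    """Return which auth path to use, refusing to guess (hypothetical helper)."""
    has_keys = bool(config.get("aws_access_key_id")) and bool(
        config.get("aws_secret_access_key")
    )
    use_installed = config.get("installed_auth") is True

    if has_keys and use_installed:
        raise ValueError("Set either explicit keys or installed_auth, not both.")
    if has_keys:
        return "explicit_keys"
    if use_installed:
        # The user opted in to credentials already installed on the machine.
        return "installed"
    raise ValueError(
        "No auth method configured: provide keys or set installed_auth=True."
    )
```

Only when this returns "installed" would the tap create a session without parameters and let the AWS SDK look up credentials on its own.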

feature: implement state capability

No state is tracked as of today.

LastEvaluatedKey + ExclusiveStartKey

I'm not exactly sure how the LastEvaluatedKey attribute works, but we already have to use it for paginating within a single scan

start_key = response.get("LastEvaluatedKey", None)
so it might work for incrementally reading across syncs.

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html#DDB-Scan-request-ExclusiveStartKey
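A minimal sketch of that idea, assuming the state is a plain dict with a hypothetical last_evaluated_key entry and that the scan operation is passed in as a callable:

```python
from typing import Any, Callable, Dict, Iterator, Optional

ScanFn = Callable[..., Dict[str, Any]]


def scan_with_bookmark(
    scan: ScanFn,
    state: Dict[str, Any],
    scan_kwargs: Optional[Dict[str, Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield items, recording the latest LastEvaluatedKey in state as it goes."""
    kwargs = dict(scan_kwargs or {})
    # Resume from the key stored by a previous run, if there is one.
    if state.get("last_evaluated_key") is not None:
        kwargs["ExclusiveStartKey"] = state["last_evaluated_key"]
    while True:
        response = scan(**kwargs)
        yield from response.get("Items", [])
        start_key = response.get("LastEvaluatedKey")
        # Updated only after the whole page has been emitted.
        state["last_evaluated_key"] = start_key
        if start_key is None:
            break
        kwargs["ExclusiveStartKey"] = start_key
```

Note that a scan which runs to completion returns no LastEvaluatedKey, so the stored value ends up as None and the next sync starts from the beginning. As sketched, this would help resume an interrupted scan rather than pick up only new items.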

FilterExpression

A string that contains conditions that DynamoDB applies after the Scan operation, but before the data is returned to you. Items that do not satisfy the FilterExpression criteria are not returned.

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html#DDB-Scan-request-FilterExpression
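If the table has an attribute that can act as a replication key, the scan arguments could be built roughly as below. This is a sketch under assumptions: incremental_scan_kwargs is a hypothetical helper, and the plain Python value in ExpressionAttributeValues assumes the boto3 resource-level Table.scan; the low-level client expects typed values such as {"S": "..."}.

```python
from typing import Any, Dict, Optional


def incremental_scan_kwargs(
    replication_key: str, bookmark: Optional[Any] = None
) -> Dict[str, Any]:
    """Build scan kwargs that only return items newer than the bookmark."""
    kwargs: Dict[str, Any] = {"ConsistentRead": True}
    if bookmark is not None:
        kwargs.update(
            {
                # "#rk" is a placeholder so reserved words are safe as attribute names.
                "FilterExpression": "#rk > :bookmark",
                "ExpressionAttributeNames": {"#rk": replication_key},
                "ExpressionAttributeValues": {":bookmark": bookmark},
            }
        )
    return kwargs
```

Because the filter is applied after the scan, every item is still read from the table; the saving is in what is returned to the tap, not in the read capacity consumed.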

Stream based

Streams allow you to incrementally request stream data if it hasn't expired yet.
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetRecords.html
Depends on #6.

feat: use input catalog properties to optimize dynamodb request

Following #11 (closed by #22), the tap respects input catalogs, but the selections aren't accounted for in the scan request. This is inefficient because data is retrieved that is never emitted. Instead, the tap could use the selected properties to build the ProjectionExpression argument for the scan operation so that the API request returns only the desired properties.
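Building that argument could look like the sketch below. projection_kwargs is a hypothetical helper, it handles top-level property names only, and it routes every name through ExpressionAttributeNames so that DynamoDB reserved words do not break the expression.

```python
from typing import Any, Dict, Sequence


def projection_kwargs(selected: Sequence[str]) -> Dict[str, Any]:
    """Build scan kwargs that request only the selected top-level properties."""
    if not selected:
        # Nothing selected explicitly: leave the scan unprojected.
        return {}
    # Placeholders such as "#p0" stand in for the real attribute names.
    names = {f"#p{index}": name for index, name in enumerate(selected)}
    return {
        "ProjectionExpression": ", ".join(names),
        "ExpressionAttributeNames": names,
    }
```

The result can be merged into the existing scan arguments, e.g. scan_kwargs.update(projection_kwargs(["id", "name"])). If a FilterExpression is also in use, its ExpressionAttributeNames would need to be merged with these rather than overwritten.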
