
tap-postgres's Introduction

tap-postgres

Set rds.logical_replication = 1 in the RDS parameter group (a reboot is required).

This should also set max_wal_senders and max_replication_slots to values greater than 0.

Singer tap for PostgreSQL supporting Full Table & Logical Replication using the wal2json decoder plugin.

SELECT * FROM pg_create_logical_replication_slot('stitch', 'wal2json');
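To sanity-check these prerequisites before running the tap, something like the following can be used (a sketch only; the DSN is a placeholder and the slot name matches the example above):

import psycopg2

with psycopg2.connect("host=myhost dbname=mydb user=myuser") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        # On RDS, wal_level reports 'logical' once rds.logical_replication = 1.
        for setting in ('wal_level', 'max_wal_senders', 'max_replication_slots'):
            cur.execute('SHOW ' + setting)
            print(setting, '=', cur.fetchone()[0])
        cur.execute("SELECT slot_name, plugin FROM pg_replication_slots")
        print(cur.fetchall())  # expect ('stitch', 'wal2json') once the slot exists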

Copyright © 2018 Stitch

tap-postgres's People

Contributors

asaf-erlich, bhtowles, cosimon, dmosorast, drdee, dsprayberry, iterati, jacobrobertbaca, jyothinarayansetty, kallan357, kspeer825, luandy64, mdelaurentis, mortenhegewald, nick-mccoy, psantacl, robmoore, zachharris1


tap-postgres's Issues

Create (dataless) transaction to advance replication log for postgres RDS

When you have a Postgres database in Amazon RDS that receives very little or no new data, the hard drive will fill up with unreleased WAL file segments. The reason for this, according to the AWS support team, is:

In RDS, we write to a heartbeat table in our internal "rdsadmin" database every 5 minutes, so RDS instances are usually never "idle" for longer than 5 minutes, whereas PostgreSQL installed on an EC2 machine may appear "idle" indefinitely. On RDS hosts, we periodically write heartbeats to our admin DB on the host. As a result, even if there is no user activity, WAL files are generated.

A similar issue affected Debezium and is discussed at https://issues.jboss.org/browse/DBZ-892?attachmentViewMode=gallery&_sscc=t

One way of fixing this would be for tap-postgres to create a (dataless) transaction that advances the replication stream, so that RDS releases the unused WAL files and the hard drive does not fill up. We have implemented our own _heartbeat table on the RDS databases and upsert the current timestamp into that table; a sketch follows. Perhaps this is something tap-postgres could do out of the box.
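A hedged sketch of the workaround we run today (the _heartbeat table name and schema are our own, not part of tap-postgres):

import psycopg2

def heartbeat(dsn):
    # Any committed write advances the WAL past the idle period, so RDS can
    # release old segments once the replication slot confirms them.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS _heartbeat (
                id int PRIMARY KEY,
                beat_at timestamptz NOT NULL)""")
        cur.execute("""
            INSERT INTO _heartbeat (id, beat_at) VALUES (1, now())
            ON CONFLICT (id) DO UPDATE SET beat_at = EXCLUDED.beat_at""")

Note that psycopg2's connection context manager commits the transaction on exit.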

Lack of documentation

This project has no documentation on how to use it. If it really is similar to tap-mysql, as it appears to be, could the README be copied from there and tweaked a little? The current README references the WAL log, but there is no documentation as to why. Is it related to Stitch's log-based replication functionality? It would be good to know what it refers to, so this tap can be as useful as it should be.

[Question] The --state flag is not saving state between consecutive runs

Hi, I am running tap-postgres as follows:

~/.virtualenvs/tap-postgres/bin/tap-postgres -c postgres_config/tap_postgres_config.json --properties catalog_id.json --state state_id.json | ~/.virtualenvs/target-postgresql/bin/target-postgres -c target_postgres_config.json

But the last record is never written to state.json, so every run reads the same records instead of resuming from the last id.

This is the structure of the state.json. Do you know what is missing?

{
  "bookmarks": {
    "ebdb-public-zone_class": {
      "replication_key": "updated_at", 
      "version": 1600725425036,
      "replication_key_value": 0
    }
  }
}

Unsorted Record fields

The fields of a database table are sorted alphabetically when a record is returned. When using (in my case) target-csv, the headers of the CSV file are therefore alphabetical. I wish the list were not sorted (or that this behaviour were configurable) so I could keep my original column order.

tap-postgres incompatible with singer-target-postgres

This is a pretty annoying incompatibility between packages maintained by the singer-io and datamill-co orgs. This project seems to be the more out-of-date one.

Compare this project (tap-postgres):

'psycopg2==2.7.4',

to the datamill-co maintained project (singer-target-postgres), whose setup.py contains

        'psycopg2==2.8.5',

This makes it impossible to install both tap-postgres and singer-target-postgres in the same Python environment.

Another side issue: the Singer docs page for the target recommends installing target-postgresql, which does not exist on PyPI. There is a package target-postgres maintained by statsbot, but the docs link to datamill-co's singer-target-postgres. It's a bit risky to not reference a clear package.

OSError: [Errno 22] Invalid argument when using tap-postgres with target-bigquery

INFO Failed to retrieve SSL status
INFO Selected streams: ['poseidon-public-po_order']
INFO No currently_syncing found
INFO Beginning sync of stream(poseidon-public-po_order) with sync method(incremental)
INFO Stream poseidon-public-po_order is using incremental replication with replication key id
CRITICAL [Errno 22] Invalid argument
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\Scripts\tap-postgres.exe\__main__.py", line 7, in <module>
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\__init__.py", line 711, in main
    raise exc
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\__init__.py", line 708, in main
    main_impl()
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\__init__.py", line 702, in main_impl
    do_sync(conn_config, args.properties, args.config.get('default_replication_method'), state)
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\__init__.py", line 669, in do_sync
    state = sync_traditional_stream(conn_config, stream, state, sync_method_lookup[stream['tap_stream_id']], end_lsn)
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\__init__.py", line 558, in sync_traditional_stream
    state = do_sync_incremental(conn_config, stream, state, desired_columns, md_map)
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\__init__.py", line 467, in do_sync_incremental
    sync_common.send_schema_message(stream, [replication_key])
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\sync_strategies\common.py", line 31, in send_schema_message
    write_schema_message(schema_message)
  File "c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\tap_postgres\sync_strategies\common.py", line 15, in write_schema_message
    sys.stdout.flush()
OSError: [Errno 22] Invalid argument
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp1252'>
OSError: [Errno 22] Invalid argument

PGRES_COPY_BOTH

Getting this error after the initial full table sync, during the first incremental sync, about 1.5 minutes after the "Starting Logical Replication" INFO message:
psycopg2.DatabaseError: error with status PGRES_COPY_BOTH and no message from the libpq

Not sure what other information to include. "COPY_BOTH" seems to be a Postgres protocol mode that only appears during logical replication; psycopg2 added support for it in 2016 or so. Not sure why there's an issue here.

Running Postgres 11.4 on RDS.

full error:
00:40:09 CRITICAL connection already closed
00:40:09 Traceback (most recent call last):
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 353, in sync_tables
00:40:09     msg = cur.read_message()
00:40:09 psycopg2.DatabaseError: error with status PGRES_COPY_BOTH and no message from the libpq
00:40:09 During handling of the above exception, another exception occurred:
00:40:09 Traceback (most recent call last):
00:40:09   File "/usr/local/bin/tap-postgres", line 10, in <module>
00:40:09     sys.exit(main())
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/__init__.py", line 705, in main
00:40:09     raise exc
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/__init__.py", line 702, in main
00:40:09     main_impl()
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/__init__.py", line 696, in main_impl
00:40:09     do_sync(conn_config, args.catalog.to_dict() if args.catalog else args.properties, args.config.get('default_replication_method'), state)
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/__init__.py", line 671, in do_sync
00:40:09     state = sync_logical_streams(conn_config, list(streams), state, end_lsn)
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/__init__.py", line 581, in sync_logical_streams
00:40:09     state = logical_replication.sync_tables(conn_config, logical_streams, state, end_lsn)
00:40:09   File "/usr/local/lib/python3.7/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 377, in sync_tables
00:40:09     pass # recalculate timeout and continue
00:40:09 psycopg2.InterfaceError: connection already closed
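One common cause of PGRES_COPY_BOTH errors with no libpq message is the server closing the walsender after wal_sender_timeout because the client never sent a status update. A minimal sketch of a psycopg2 replication loop that keeps sending feedback (illustrative only, not the tap's actual code; the DSN, slot name, and handler are assumptions):

import select
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "host=myhost dbname=mydb user=myuser",  # placeholder DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()
cur.start_replication(slot_name='stitch', decode=True)

while True:
    msg = cur.read_message()
    if msg:
        handle(msg.payload)  # hypothetical record handler
        msg.cursor.send_feedback(flush_lsn=msg.data_start)
    else:
        # Nothing to consume: still answer server keepalives so the walsender
        # does not hit wal_sender_timeout and drop the connection.
        cur.send_feedback()
        select.select([cur], [], [], 10.0)  # wait up to 10s for more data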

Domain types unsupported

Catalog domain types correctly

This would mean that no additional catalog modifications are required for domain types after discovery, prior to running Singer. Currently we get errors if a primary key is a domain type, as it is automatically excluded.

There seems to be a line of SQL commented out above the system information query that would handle this use case, based on the Postgres documentation, but I can't see any reference to why it is excluded.

https://github.com/singer-io/tap-postgres/blob/e990e88e4c41720e982c49f77dee8fc6149124de/tap_postgres/__init__.py

Expected

The type of a domain is catalogued as the underlying data type. E.g. if I create a domain as follows:

CREATE DOMAIN posint AS integer CHECK (VALUE > 0);
CREATE TABLE mytable (id posint);

Then the type of the posint column would be integer, rather than posint.

Actual

Domain types are catalogued with their type as the domain name, rather than the underlying data type.

This type is then marked as unsupported and excluded by default, even if the column is a primary key with a supported underlying data type.
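For reference, a domain's underlying type can be resolved from the system catalogs; a hedged sketch using plain psycopg2 (not tap code, and the DSN is a placeholder):

import psycopg2

RESOLVE_DOMAINS_SQL = """
    SELECT t.typname AS domain_name, bt.typname AS base_type
    FROM pg_type t
    JOIN pg_type bt ON t.typbasetype = bt.oid
    WHERE t.typtype = 'd'
"""

with psycopg2.connect("dbname=mydb") as conn, conn.cursor() as cur:
    cur.execute(RESOLVE_DOMAINS_SQL)
    for domain_name, base_type in cur.fetchall():
        print(domain_name, '->', base_type)  # e.g. posint -> int4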

Catalog flag not supported despite usage statement

If I run tap-postgres without any options, I see this usage statement:

usage: tap-postgres [-h] -c CONFIG [-s STATE] [-p PROPERTIES]
                    [--catalog CATALOG] [-d]

So it looks like the catalog flag is supported. However, if I try to use the catalog flag it doesn't work.

$ tap-postgres --config tap-postgres-config.json --catalog catalog.json 
INFO No properties were selected

If I use properties instead, it works.

While this may seem like a small issue, downstream tools that support multiple taps use catalog instead of properties, so I run into issues when using them. It would be great if this could be remedied.

Re-license

Hi,

I would like to use tap-postgres in a product and my customers are unable to accept AGPL code.

Are you able to release this with a more permissive license such as Apache 2.0, MIT or BSD?

Kind regards,
Dylan Just


Logical replication does not advance when a high volume of changes is generated in a different database

Background:

  • The current code flushes changes and updates the bookmark only when it encounters messages for the replication slot.
  • Postgres reads the log files sequentially but responds only with messages for the replication slot's database.
  • If another database generates a large volume of changes, Postgres does not have enough time to read through the large log files and send the messages for the current replication slot that would advance the LSN.

Infrastructure:
Postgres instance:

  • database1 -> generates x GB of log files in short bursts -> has replication_slot1 -> replicates every hour
  • database2 -> generates 2000 records every 6 hours -> has replication_slot2 -> replicates every 6 hours

Scenario:

  1. All changes are replicated from both databases and the last LSN points to 10000.
  2. database1 has a burst, generating 100 GB of WAL files and advancing the LSN to 20000.
  3. database2 adds 2000 records 1 hour after the burst, advancing the LSN to 20100.
  4. Replication for database2 starts reading changes from LSN 10000.
  5. In 30 minutes the DB is only able to read logs up to LSN 15000 -> no replication_slot2 messages are produced.
  6. Since no messages were replicated, no flush is issued and the bookmark is not updated by this part of the code.
  7. The next time replication starts, it starts again from LSN 10000.

Possible Solution (see the sketch after this list):

  • Advance the bookmark using the wal_end data from the cursor when no messages have been produced.
  • Trigger flush_lsn with the same wal_end value.
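A hedged sketch of the proposal (illustrative, not a patch; psycopg2's replication cursor exposes wal_end, the server's WAL position as of the last message or keepalive, and update_bookmark_lsn is a hypothetical helper):

msg = cur.read_message()
if msg is not None:
    state = consume_message(logical_streams, state, msg, time_extracted, conn_info, end_lsn)
else:
    # No messages for this slot, but the server has still advanced: confirm
    # its position so Postgres can recycle WAL generated by other databases.
    if cur.wal_end:
        cur.send_feedback(flush_lsn=cur.wal_end)
        state = update_bookmark_lsn(state, logical_streams, cur.wal_end)  # hypothetical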

Minimum date range not supported

The source database has a date field with default values of "0001-01-01".

The tap fails when it hits one of these, with "Year 0 is out of range".

This happens in incremental.py, line 97.
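A plausible reproduction, assuming the failure comes from shifting the value across time zones (an assumption on my part, not confirmed against the tap's code):

import datetime

d = datetime.datetime(1, 1, 1, tzinfo=datetime.timezone.utc)  # "0001-01-01" in UTC
# Any zone west of UTC would need a year-0 date, which Python cannot
# represent (datetime.MINYEAR == 1), so this raises OverflowError:
d.astimezone(datetime.timezone(datetime.timedelta(hours=-5)))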

default Boolean values set as nulls when false

While copying data from a Postgres table with a Boolean column whose default value is false, the resulting data in the target database contains null values rather than false. The column is NOT NULL in the source database, so it is always either true or false.

Add secondary watermark to robustly handle the case where an xmin spans many rows

Currently, the xmin pseudo system column is used as the watermark column for the initial load of a table. When the table is very large and ingesting the data takes more than 6 hours, one of the following problems occurs:

  1. The job gets stuck on ingesting an xmin because it never processes all rows with the same xmin within a 6-hour window. Depending on the number of columns in the table, this happens at around 50M rows with the same xmin.
  2. The job does proceed past an xmin with many rows, but it typically takes two attempts. The first attempt happens near the end of the runtime window and fails; the second attempt passes because more runtime is available, since it is the first xmin processed in that run. However, the first attempt will already have loaded rows into the destination table, so there will be duplicate data in the destination that has to be cleaned up manually.

Adding support for a secondary watermark, either a timestamp column or an ID field, would prevent both problems; a sketch of the idea follows.
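A hedged sketch of what a composite (xmin, id) watermark query could look like (hypothetical: id stands in for whatever secondary column is chosen, and xmin::text::bigint is one common way to make xmin comparable):

SELECT_SQL = """
    SELECT *
    FROM mytable
    WHERE (xmin::text::bigint, id) > (%(last_xmin)s, %(last_id)s)
    ORDER BY xmin::text::bigint, id
    LIMIT %(batch_size)s
"""
cur.execute(SELECT_SQL, {'last_xmin': last_xmin, 'last_id': last_id, 'batch_size': 50000})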

filter by schema - enhancement

It would be nice to filter discovery by schema. This would reduce the amount of unnecessary data returned in some larger databases, including TimescaleDB (which has a separate schema for chunks).

During incremental copy, the last row is always updated

I cannot think of a reason why we have ">=" instead of just ">" in the INCREMENTAL replication logic.

It seems that we always end up updating the last row. Why is this a desired behavior?

if replication_key_value:
    select_sql = """SELECT {}
                    FROM {}
                    WHERE {} >= '{}'::{}
                    ORDER BY {} ASC""".format(
        ','.join(escaped_columns),
        post_db.fully_qualified_table_name(schema_name, stream['table_name']),
        post_db.prepare_columns_sql(replication_key),
        replication_key_value,
        replication_key_sql_datatype,
        post_db.prepare_columns_sql(replication_key))

Add support for byte arrays

Currently, tap-postgres does not offer support for columns of datatype bytea[], which gives the following error:

 do not know how to marshall value of class( <class 'memoryview'> ) and sql_datatype ( bytea )
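A hedged sketch of what the marshalling could look like (a hypothetical helper, not the tap's code; psycopg2 returns bytea values as memoryview):

def marshall_bytea(value):
    # Encode raw bytes as a hex string so the value survives JSON serialization.
    if isinstance(value, memoryview):
        return value.tobytes().hex()
    return value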

Better logging when row_to_singer_message fails

We ran into an issue where we had an invalid timestamp (year 21202) in our database and Python couldn't handle it. The error thrown was pretty generic (see the logs below): it didn't tell me what table or row the bad timestamp was in. Eventually I thought to peek into the WAL log using pg_logical_slot_peek_changes to see if I could find it, and I was able to spot it pretty quickly.

I have two suggestions here:

  1. Wrap selected_value_to_singer_value in a try/except and output a useful error message detailing the table and log entry that caused the failure (a sketch follows the logs below).
  2. The error message should suggest using pg_logical_slot_peek_changes as a way to troubleshoot and find the bad data.
2020-02-26 21:38:34,995Z    tap - CRITICAL year is out of range: 21202-02-26 00:00:00+00
2020-02-26 21:38:34,998Z    tap - Traceback (most recent call last):
2020-02-26 21:38:34,998Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/dateutil/parser/_parser.py", line 655, in parse
2020-02-26 21:38:34,998Z    tap -     ret = self._build_naive(res, default)
2020-02-26 21:38:34,998Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/dateutil/parser/_parser.py", line 1241, in _build_naive
2020-02-26 21:38:34,998Z    tap -     naive = default.replace(**repl)
2020-02-26 21:38:34,998Z    tap - ValueError: year is out of range
2020-02-26 21:38:34,998Z    tap - 
2020-02-26 21:38:34,999Z    tap - The above exception was the direct cause of the following exception:
2020-02-26 21:38:34,999Z    tap - 
2020-02-26 21:38:34,999Z    tap - Traceback (most recent call last):
2020-02-26 21:38:34,999Z    tap -   File "tap-env/bin/tap-postgres", line 11, in <module>
2020-02-26 21:38:34,999Z    tap -     load_entry_point('tap-postgres==0.0.67', 'console_scripts', 'tap-postgres')()
2020-02-26 21:38:34,999Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 708, in main
2020-02-26 21:38:34,999Z    tap -     raise exc
2020-02-26 21:38:34,999Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 705, in main
2020-02-26 21:38:34,999Z    tap -     main_impl()
2020-02-26 21:38:34,999Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 699, in main_impl
2020-02-26 21:38:34,999Z    tap -     do_sync(conn_config, args.properties, args.config.get('default_replication_method'), state)
2020-02-26 21:38:34,999Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 674, in do_sync
2020-02-26 21:38:34,999Z    tap -     state = sync_logical_streams(conn_config, list(streams), state, end_lsn)
2020-02-26 21:38:34,999Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 584, in sync_logical_streams
2020-02-26 21:38:34,999Z    tap -     state = logical_replication.sync_tables(conn_config, logical_streams, state, end_lsn)
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 361, in sync_tables
2020-02-26 21:38:35,000Z    tap -     state = consume_message(logical_streams, state, msg, time_extracted, conn_info, end_lsn)
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 269, in consume_message
2020-02-26 21:38:35,000Z    tap -     record_message = row_to_singer_message(target_stream, col_vals, stream_version, col_names, time_extracted, stream_md_map, conn_info)
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 209, in row_to_singer_message
2020-02-26 21:38:35,000Z    tap -     cleaned_elem = selected_value_to_singer_value(elem, sql_datatype, conn_info)
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 195, in selected_value_to_singer_value
2020-02-26 21:38:35,000Z    tap -     return selected_value_to_singer_value_impl(elem, sql_datatype, conn_info)
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 156, in selected_value_to_singer_value_impl
2020-02-26 21:38:35,000Z    tap -     return parse(elem).isoformat()
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/dateutil/parser/_parser.py", line 1374, in parse
2020-02-26 21:38:35,000Z    tap -     return DEFAULTPARSER.parse(timestr, **kwargs)
2020-02-26 21:38:35,000Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/dateutil/parser/_parser.py", line 657, in parse
2020-02-26 21:38:35,000Z    tap -     six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
2020-02-26 21:38:35,000Z    tap -   File "<string>", line 3, in raise_from
2020-02-26 21:38:35,000Z    tap - dateutil.parser._parser.ParserError: year is out of range: 21202-02-26 00:00:00+00
2020-02-26 21:38:35,035Z target - INFO Requests complete, stopping loop
2020-02-26 21:38:35,085Z   main - INFO Target exited normally with status 0
2020-02-26 21:38:35,087Z   main - INFO Closed tunnel and deleted temporary keyfile
2020-02-26 21:38:35,087Z   main - INFO Exit status is: Discovery succeeded. Tap failed with code 1 and error message: "year is out of range: 21202-02-26 00:00:00+00". Target succeeded.
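A hedged sketch of suggestion 1 (a hypothetical wrapper; the function names mirror those in the traceback above):

def safe_selected_value_to_singer_value(elem, sql_datatype, conn_info, table_name, column_name):
    try:
        return selected_value_to_singer_value(elem, sql_datatype, conn_info)
    except (ValueError, OverflowError) as exc:
        raise ValueError(
            "Failed to convert {}.{} value {!r} of type {}: {}. Try "
            "pg_logical_slot_peek_changes() to locate the offending WAL entry."
            .format(table_name, column_name, elem, sql_datatype, exc)) from exc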

Wrong psycopg2 version

Description of change

Hi, I was looking at the last commit 'Revert "Bump python 396 (#130)" (#131)' and noticed that the CircleCI config.yml uses psycopg2==2.8.4 while setup.py uses psycopg2==2.7.4. I think this was a typo.

multipleOf validation on M1 Macs

schema['multipleOf'] = post_db.numeric_multiple_of(scale)

This "multipleOf" validation seems to give problems with M1 processors.

My colleagues @omarghetti and @PaoloAlbano and I found this issue while trying to import data from a Postgres data source having some columns of type "number".

What's the reason for this validation check? We found our use cases to work as expected in all scenarios by simply removing it.

We run this code in Docker containers running the official python:3.6 and python:3.5 images.
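For context, one plausible mechanism (an assumption, not verified against the validator, and it would not by itself explain an architecture difference): multipleOf checks on binary floats are sensitive to rounding, while exact decimals are not:

from decimal import Decimal

print(0.07 % 0.01)                        # 0.009999999999999998, not 0.0
print(Decimal('0.07') % Decimal('0.01'))  # 0.00: exact decimal arithmetic passes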

common.py ModuleNotFoundError: import simplejson as json

The following error is thrown when running the tap-postgres extractor with Meltano.

(screenshot of the ModuleNotFoundError traceback)

Steps to reproduce:

  1. Create a meltano project
  2. Follow the custom extractor installation instructions here
    2a. Establish a target loader as well
  3. Run the following command: meltano --log-level=debug elt tap-postgres target-X

The issue is resolved by navigating to .meltano/extractors/tap-postgres/, activating the virtual environment venv there, and then installing simplejson with pip install simplejson.

hstore_to_array() does not escape single quotes in value

We have a column with an hstore datatype in a table that we are replicating using logical replication.

The select command that is being issued is:

SELECT hstore_to_array('"nickname"=>"Dave's Courtyard"')

The single quote should be escaped; as written, this throws the following error:

2019-07-03 13:57:31,042Z    tap - Traceback (most recent call last):
2019-07-03 13:57:31,042Z    tap -   File "tap-env/bin/tap-postgres", line 10, in <module>
2019-07-03 13:57:31,043Z    tap -     sys.exit(main())
2019-07-03 13:57:31,043Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 705, in main
2019-07-03 13:57:31,043Z    tap -     raise exc
2019-07-03 13:57:31,043Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 702, in main
2019-07-03 13:57:31,043Z    tap -     main_impl()
2019-07-03 13:57:31,043Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 696, in main_impl
2019-07-03 13:57:31,043Z    tap -     do_sync(conn_config, args.catalog.to_dict() if args.catalog else args.properties, args.config.get('default_replication_method'), state)
2019-07-03 13:57:31,043Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 671, in do_sync
2019-07-03 13:57:31,043Z    tap -     state = sync_logical_streams(conn_config, list(streams), state, end_lsn)
2019-07-03 13:57:31,043Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/__init__.py", line 581, in sync_logical_streams
2019-07-03 13:57:31,043Z    tap -     state = logical_replication.sync_tables(conn_config, logical_streams, state, end_lsn)
2019-07-03 13:57:31,043Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 355, in sync_tables
2019-07-03 13:57:31,044Z    tap -     state = consume_message(logical_streams, state, msg, time_extracted, conn_info, end_lsn)
2019-07-03 13:57:31,044Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 263, in consume_message
2019-07-03 13:57:31,044Z    tap -     record_message = row_to_singer_message(target_stream, col_vals, stream_version, col_names, time_extracted, stream_md_map, conn_info)
2019-07-03 13:57:31,044Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 203, in row_to_singer_message
2019-07-03 13:57:31,044Z    tap -     cleaned_elem = selected_value_to_singer_value(elem, sql_datatype, conn_info)
2019-07-03 13:57:31,044Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 189, in selected_value_to_singer_value
2019-07-03 13:57:31,044Z    tap -     return selected_value_to_singer_value_impl(elem, sql_datatype, conn_info)
2019-07-03 13:57:31,044Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 165, in selected_value_to_singer_value_impl
2019-07-03 13:57:31,044Z    tap -     return create_hstore_elem(conn_info, elem)
2019-07-03 13:57:31,044Z    tap -   File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_postgres/sync_strategies/logical_replication.py", line 75, in create_hstore_elem
2019-07-03 13:57:31,044Z    tap -     cur.execute(sql)
2019-07-03 13:57:31,044Z    tap - psycopg2.ProgrammingError: syntax error at or near "s"
2019-07-03 13:57:31,044Z    tap - LINE 1: SELECT hstore_to_array('"nickname"=>"Dave's Courtyard"')
2019-07-03 13:57:31,044Z    tap -                                                   ^
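A hedged sketch of a fix: let psycopg2 bind the value instead of interpolating it into the SQL string (an illustrative rewrite, not the actual patch; the real create_hstore_elem takes conn_info rather than a connection):

def create_hstore_elem(conn, elem):
    with conn.cursor() as cur:
        # A bound parameter is escaped by the driver, so embedded single
        # quotes like the one in "Dave's Courtyard" are handled safely.
        cur.execute("SELECT hstore_to_array(%s::hstore)", (elem,))
        return cur.fetchone()[0]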

Support for large transactions

Our integration (using Stitch, but anyone using tap-postgres would experience this) keeps hitting an out-of-memory error when it encounters a large transaction.

There are a few ways around this:

  1. For us, a specific table causes the problem, and we're not consuming anything from that table anyway, so provide a way for us to specify the filter-tables or add-tables parameter to wal2json. From Stitch, I'd imagine add-tables would be populated from the actual "tables to replicate" selected in the UI.
  2. Use write-in-chunks=true or format-version=2 to get around the issue (see the sketch after this list). These are the officially suggested solutions from the wal2json maintainers. format-version=2 seems like the easier/preferred route, but it is a fairly recent option and may not be widely supported yet, so maybe write-in-chunks is the way to go.
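A hedged sketch of how these wal2json options could be passed through psycopg2 (option names are from the wal2json README; the table list is hypothetical):

cur.start_replication(
    slot_name='stitch',
    decode=True,
    options={
        'write-in-chunks': 'true',                       # stream large transactions in pieces
        'add-tables': 'public.orders,public.customers',  # hypothetical whitelist
    })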

Selected fields do not honor order

The fields are alphabetized in the SELECT statement. If a new column is added, it is inserted somewhere in the middle of the column list, creating an inconsistent order in the target. As a result, the entire history must be re-extracted with the new column list.

Rather than alphabetizing, can we adjust this to honor the order of the columns in the catalog/properties file?
