
sling-cli's Introduction


Slings from a data source to a data target.

See docs.slingdata.io for more details.


Sling is a passion project turned into a free CLI product that offers an easy solution for creating and maintaining high-volume data pipelines using the Extract & Load (EL) approach. It focuses on data movement between:

  • Database to Database
  • File System to Database
  • Database to File System

(Demo video: sling.demo.2023.10.4k.mp4)

Some key features:

  • Single Binary deployment (built with Go). See installation page.
  • Use Custom SQL as a stream: --src-stream='SELECT * from my_table where col1 > 10'
  • Manage / View / Test / Discover your connections with the sling conns sub-command
  • Use environment variables as connections if you prefer (export MY_PG='postgres://...')
  • Provide YAML or JSON configurations (perfect for git version control).
  • Powerful replication logic, to replicate many tables with a wildcard (my_schema.*).
  • Reads your existing DBT connections
  • Use your environment variables in your YAML / JSON config (SELECT * from my_table where date = '{date}')
  • Convenient Transformations, such as the flatten option, which auto-creates columns from your nested fields.
  • Run Pre & Post SQL commands.
  • many more!

Example Replication:

replication.yaml
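
A minimal replication.yaml could look like the following (a sketch with placeholder connection and table names; see docs.slingdata.io for the full set of options):

source: PG_DB
target: SNOWFLAKE_DB

# defaults apply to every stream below
defaults:
  mode: full-refresh
  object: 'target_schema.{stream_schema}_{stream_table}'

streams:
  # replicate every table in a schema with a wildcard
  finance.*: {}

  # or override settings per table
  public.transactions:
    mode: incremental
    primary_key: [id]
    update_key: updated_at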


Available Connectors:

See docs.slingdata.io for the full list of supported connectors and additional links.


Ever wanted to quickly pipe a CSV or JSON file into your database? Use sling to do so:

cat my_file.csv | sling run --tgt-conn MYDB --tgt-object my_schema.my_table

Or want to copy data between two databases? Do it with sling:

sling run --src-conn PG_DB --src-stream public.transactions \
  --tgt-conn MYSQL_DB --tgt-object mysql.bank_transactions \
  --mode full-refresh

Sling can also easily manage your local connections with the sling conns command:

$ sling conns set MY_PG url='postgresql://postgres:myPassword@pghost:5432/postgres'

$ sling conns list
+--------------------------+-----------------+-------------------+
| CONN NAME                | CONN TYPE       | SOURCE            |
+--------------------------+-----------------+-------------------+
| AWS_S3                   | FileSys - S3    | sling env yaml    |
| FINANCE_BQ               | DB - BigQuery   | sling env yaml    |
| DO_SPACES                | FileSys - S3    | sling env yaml    |
| LOCALHOST_DEV            | DB - PostgreSQL | dbt profiles yaml |
| MSSQL                    | DB - SQLServer  | sling env yaml    |
| MYSQL                    | DB - MySQL      | sling env yaml    |
| ORACLE_DB                | DB - Oracle     | env variable      |
| MY_PG                    | DB - PostgreSQL | sling env yaml    |
+--------------------------+-----------------+-------------------+

$ sling conns discover LOCALHOST_DEV
9:05AM INF Found 344 streams:
 - "public"."accounts"
 - "public"."bills"
 - "public"."connections"
 ...

Installation

Brew on Mac

brew install slingdata-io/sling/sling

# You're good to go!
sling -h

Scoop on Windows

scoop bucket add sling https://github.com/slingdata-io/scoop-sling.git
scoop install sling

# You're good to go!
sling -h

Binary on Linux

curl -LO 'https://github.com/slingdata-io/sling-cli/releases/latest/download/sling_linux_amd64.tar.gz' \
  && tar xf sling_linux_amd64.tar.gz \
  && rm -f sling_linux_amd64.tar.gz \
  && chmod +x sling

# You're good to go!
sling -h

Compiling From Source

Linux or Mac

git clone https://github.com/slingdata-io/sling-cli.git
cd sling-cli
bash scripts/build.sh

./sling --help

Windows (PowerShell)

git clone https://github.com/slingdata-io/sling-cli.git
cd sling-cli

.\scripts\build.ps1

.\sling --help

Installing via Python Wrapper

pip install sling

Then you should be able to run sling --help from the command line.

Running an Extract-Load Task

CLI

sling run --src-conn POSTGRES_URL --src-stream myschema.mytable \
  --tgt-conn SNOWFLAKE_URL --tgt-object yourschema.yourtable \
  --mode full-refresh

Or pass a YAML / JSON string or file:

sling run -c '
source:
  conn: $POSTGRES_URL
  stream: myschema.mytable

target:
  conn: $SNOWFLAKE_URL
  object: yourschema.yourtable

mode: full-refresh
'
# OR
sling run -c /path/to/config.json

From Lib

package main

import (
	"log"

	"github.com/slingdata-io/sling-cli/core/sling"
)

func main() {
  // cfgStr can be JSON or YAML
	cfgStr := `
    source:
        conn: $POSTGRES_URL
        stream: myschema.mytable
    
    target:
        conn: $SNOWFLAKE_URL
        object: yourschema.yourtable
    
    mode: full-refresh
  `
	cfg, err := sling.NewConfig(cfgStr)
	if err != nil {
		log.Fatal(err)
	}

	err = sling.Sling(cfg)
	if err != nil {
		log.Fatal(err)
	}
}

Config Schema

Here is an example configuration. You can paste the JSON below into https://jsonschema.net/ to inspect its schema.

--src-conn/source.conn and --tgt-conn/target.conn can be a connection name or a URL (an example env.yaml entry follows the list below):

  • MY_PG (connection ref in db, profile or env)
  • $MY_PG (connection ref in env)
  • postgresql://user:password@host:5432/database
  • s3://my_bucket/my_folder/file.csv
  • gs://my_google_bucket/my_folder/file.json
  • file:///tmp/my_folder/file.csv (local storage)
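
For example, a named connection such as MY_PG can be defined in ~/.sling/env.yaml; a minimal sketch (exact keys may vary, following the URL-based format used by sling conns set above):

connections:
  MY_PG:
    type: postgres
    url: 'postgresql://postgres:myPassword@pghost:5432/postgres'

variables: {}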

--src-stream/source.stream can be an object name to stream from:

  • TABLE1
  • SCHEMA1.TABLE2
  • OBJECT_NAME
  • select * from SCHEMA1.TABLE3
  • /path/to/file.sql (if source conn is DB)

--tgt-object/target.object can be an object name to write to:

  • TABLE1
  • SCHEMA1.TABLE2

Example as JSON

{
  "source": {
    "conn": "MY_PG_URL",
    "stream": "select * from my_table",
    "options": {}
  },
  "target": {
    "conn": "s3://my_bucket/my_folder/new_file.csv",
    "options": {
      "header": false
    }
  }
}
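
The same configuration expressed as YAML:

source:
  conn: MY_PG_URL
  stream: select * from my_table
  options: {}

target:
  conn: s3://my_bucket/my_folder/new_file.csv
  options:
    header: false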

sling-cli's People

Contributors

flarco, miconnell, tired-engineer


sling-cli's Issues

Parquet: Row Group repeated for every row

Hi, Row Groups in Parquet are being repeated for each row in the file. I believe this should only be shown once, regardless of the number of rows in the parquet file.

E.g. below are the first 2 Row Groups. I chose file_max_rows: 10000, and so the below is repeated 10,000x.

  "RowGroups": [
    {
      "Ordinal": null,
      "NumRows": 1,
      "SortingColumns": [],
      "FileOffset": null,
      "TotalByteSize": 15460,
      "TotalCompressedSize": 12614
    },
    {
      "Ordinal": null,
      "NumRows": 1,
      "SortingColumns": [],
      "FileOffset": null,
      "TotalByteSize": 15460,
      "TotalCompressedSize": 12614
    },
   ...
]

Parquet is written to S3 bucket. Let me know if you need any additional information.

mssql: Invalid column name

Looks like the column casing for MSSQL is not preserved with the introduction of quotes in the update-key expression. This can be an issue for case-sensitive collations.
 

-- v1.0.60
WHERE GJ.CREATED > '2023-11-29 09:40:27.000'
-- v1.0.61
WHERE GJ."created" > '2023-11-29 09:40:27.000'

mssql: Invalid column name 'created'.

Mysql decimal(38,18) to Bigquery

Hello, I have a question about column types.

The source is MySQL and the column type is decimal(38,18).
The target is BigQuery, and when I synchronize, the source data is truncated because it is stored as NUMERIC.

ex: 2418474.103366277600000000 to 2418474.103366277

So I added table_ddl in yaml like below.

streams:
  test.historical:
    target_options:
      table_ddl: |
        CREATE TABLE IF NOT EXISTS `test_project.test.sling_historical`
        (
          id INT64,
          name STRING,
          test BIGNUMERIC
        )        
        ;

The table_ddl has created a BIGNUMERIC column in bigquery, but the data is still truncated like NUMERIC.

Is there any workaround?

Additionally, is it possible to use custom SQL in the replication like below?

streams:
  | 
  select id from test_table where id > 10;
  object: 'test.test_table'

Thanks.

Camel-Case update-key support

Update-keys that contain uppercase characters don't get properly quoted in the where clause for incremental updates.

In particular it is failing for me with a Postgres database with the update-key named "createDate", which is a timestamptz datatype.

I have a work-around using a stream configuration:

source:
    conn: SL_PG_CORE
    stream: |
        select * from (
        select *, "updateDate" as updatekey
        from public.order
        ) o
        where {incremental_where_cond}
    primary_key: [id]
    update_key: updatekey

But it would be nice not to HAVE to work-around it

Thanks

panic when using MSSQL as source

When using an Azure MSSQL database as the source, running a sync fails with:

8:14AM INF connecting to source database (azuresql)
8:14AM INF reading from source database
panic: Not implemented

goroutine 37 [running]:
github.com/denisenkom/go-mssqldb.passthroughConn.SetWriteDeadline(...)
	/home/runner/go/pkg/mod/github.com/denisenkom/[email protected]/net.go:167
crypto/tls.(*Conn).SetWriteDeadline(...)
	/opt/hostedtoolcache/go/1.19.12/x64/src/crypto/tls/conn.go:152
crypto/tls.(*Conn).closeNotify(0xc000d3e000)
	/opt/hostedtoolcache/go/1.19.12/x64/src/crypto/tls/conn.go:1399 +0xdb
crypto/tls.(*Conn).Close(0x0?)
	/opt/hostedtoolcache/go/1.19.12/x64/src/crypto/tls/conn.go:1369 +0x69
github.com/denisenkom/go-mssqldb.(*Conn).Close(0xc000249af0?)
	/home/runner/go/pkg/mod/github.com/denisenkom/[email protected]/mssql.go:361 +0x28
database/sql.(*driverConn).finalClose.func2()
	/opt/hostedtoolcache/go/1.19.12/x64/src/database/sql/sql.go:645 +0x3c
database/sql.withLock({0x2687988, 0xc0009ec6c0}, 0xc000249ba8)
	/opt/hostedtoolcache/go/1.19.12/x64/src/database/sql/sql.go:3439 +0x8c
database/sql.(*driverConn).finalClose(0xc0009ec6c0)
	/opt/hostedtoolcache/go/1.19.12/x64/src/database/sql/sql.go:643 +0x116
database/sql.(*DB).Close(0xc0005fc340)
	/opt/hostedtoolcache/go/1.19.12/x64/src/database/sql/sql.go:903 +0x297
github.com/flarco/dbio/database.(*BaseConn).Close(0xc00077d6c0)
	/home/runner/go/pkg/mod/github.com/flarco/[email protected]/database/database.go:664 +0x34
github.com/slingdata-io/sling-cli/core/sling.(*TaskExecution).runDbToFile(0xc000218b40)
	/home/runner/work/sling-cli/sling-cli/core/sling/task_run.go:262 +0x528
github.com/slingdata-io/sling-cli/core/sling.(*TaskExecution).Execute.func1()
	/home/runner/work/sling-cli/sling-cli/core/sling/task_run.go:87 +0x259
created by github.com/slingdata-io/sling-cli/core/sling.(*TaskExecution).Execute
	/home/runner/work/sling-cli/sling-cli/core/sling/task_run.go:68 +0x41b

Running in Docker (latest, 1.0.15) on M1 Mac

"Config.Source.Options.Columns not handled" when using the sling-python and sling-cli

This issue relates to the #68 issue.

  • Use this script to create a blank SQLite DB named people.db and a CSV file named people.csv, then use sling-python to sync the data into people.db:
from sling import Sling
import csv
import io
import sqlite3

# tested with version:  sling-1.0.60

csv_str = """NAME,BIRTHDATE,_CDC_SOURCE_DATE,AGE
Alice,1987-03-01,2023-11-21,25
Bob,1980-11-12,2023-11-21,30
Charlie,1975-01-31,2023-11-21,NAN
"""

# Convert the string to a file-like object
csv_file = io.StringIO(csv_str)

# Read the CSV data
reader = csv.reader(csv_file)

# Write the CSV data to a temporary file
with open('people.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in reader:
        writer.writerow(row)

# Create target SQLite database (if it doesnt exist)
database_path = './people.db'
conn = sqlite3.connect(database_path)

# We want Sling to create our database Tables.
# Since we just want an empty database, we'll just close the connection.
conn.close()

print(f"Target SQLite database at file: '{database_path}' .")

# Sling Configuration
config = {
    'source': {
        'stream': f'file://./people.csv',
        'options': {
            'columns': {  # enforce all column types as strings
                'NAME': 'string',
                'BIRTHDATE': 'string',
                '_CDC_SOURCE_DATE': 'string',
                'AGE': 'string'
            }
        }
    },
    'target': {
        'conn': f'sqlite://{database_path}',
        'object': 'MAIN.PEOPLE',
        'options': {}
    },
    'mode': 'full-refresh'
}
Sling(**config).run()

Here is the command output:

$ python test.py                                                                                                                                                                                                               
Target SQLite database at file: './people.db' .
6:46PM INF connecting to target database (sqlite)
6:46PM INF reading from source file system (file)
6:46PM WRN Config.Source.Options.Columns not handled     <--- WARNING HERE
6:46PM INF writing to target database [mode: full-refresh]
6:46PM INF streaming data
6:46PM INF dropped table "main"."people"
6:46PM INF created table "main"."people"
6:46PM INF inserted 3 rows in 0 secs [32 r/s]
6:46PM INF execution succeeded

I expect the data type of all columns to be string, but for some reason it skips the source option.

  • I also got the same warning when I tried sling-cli with a config.yml or config.json.
    Here is my config.json:
{
    "source": {
        "stream": "file://./people.csv",
        "options": {
            "columns": {
                "name": "string",
                "birthdate": "string",
                "_cdc_source_date": "string",
                "age": "string"
            }
        }
    },
    "target": {
        "conn": "sqlite://./people.db",
        "object": "MAIN.PEOPLE",
        "options": {}
    },
    "mode": "full-refresh"
}

and my config.yml

source:
  stream: file://./people.csv
  options:
    columns:
        name: string
        birthdate: string
        _cdc_source_date: string
        age: string
target:
  conn: sqlite://./people.db
  object: MAIN.PEOPLE
mode: full-refresh

And the command-line I use:

sling run -c config.yml
sling run -c config.json

Bug: Snowflake Driver for GCP Customers

Snowflake driver for GCP customers had a recent breaking behavior change with regard to the PUT operation.
This requires forcing a driver upgrade to at minimum Go driver v1.6.17.

Error when loading to Snowflake (GCP)
091032 (0A000): Your client app version, Go 1.6.15, is using a deprecated pre-signed URL for PUT. Please upgrade to a version that supports GCP downscoped token. See https://community.snowflake.com/s/article/faq-2023-client-driver-deprecation-for-GCP-customers.

Reference: https://community.snowflake.com/s/article/faq-2023-client-driver-deprecation-for-GCP-customers

MySQL connection error

If the password for the mysql user account contains special characters, an Access denied error occurs.

In my case, if the password is 'password!', an Access denied error occurs, but if I change the password to 'password', I can connect without error.

Can you help me?

Login Fails with `*` in password

Hello,

I tried to run sling conns discover MY_SQLSERVER_DB, but the command is failing with the error:

$ sling conns discover MY_SQLSERVER_DB
fatal:
~
~ could not discover MY_SQLSERVER_DB (See https://docs.slingdata.io/sling-cli/environment)
~ could not connect to MY_SQLSERVER_DB
~ could not connect to database
mssql: login error: Login failed for user 'britz'.

My ~/.sling/env.yaml looks like this:

connections:
  MY_SQLSERVER_DB:
    type: sqlserver
    url: 'sqlserver://britz:<pw-redacted>@<host-redacted>:1433/<db-redacted>'

variables: {}

My Password does have a * in it, and I think that may be what is throwing things off. I believe this is the case because I tried another connection with a SQL Server Database, with a similar env.yaml, but my password in that database does not have a * in it, and that does work.
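
One thing worth trying (an assumption on my part, not verified against sling's URL parsing): percent-encode the special character in the connection URL, e.g. write * as %2A, since the userinfo part of a URL is decoded before use:

connections:
  MY_SQLSERVER_DB:
    type: sqlserver
    # hypothetical values; the '*' in the password is written percent-encoded as %2A
    url: 'sqlserver://britz:my%2Apassword@myhost:1433/mydb'

variables: {}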

Provided column not found in table

sling --version
Version: 1.0.35

This looks like a problem with a name mangling mismatch between the quoted column name (from the original data) and the column name that sling has "cleaned up".

sling run --src-stream 'file:///tmp/001.parquet' --tgt-conn CLICKHOUSE --tgt-object default.ag3 -d --mode full-refresh
2:13PM DBG could not load dbt connection ~ required keys not provided
--- sling.go:398 func1 ---
--- sling.go:439 cliInit ---
--- cli.go:283 CliProcess ---
--- sling_logic.go:150 processRun ---
--- sling_logic.go:196 runTask ---
--- config.go:298 Prepare ---
--- local.go:53 GetLocalConns ---
--- connection.go:590 ReadDbtConnections ---
--- connection.go:92 NewConnectionFromMap ---
--- connection.go:66 NewConnection ---
--- connection.go:499 setURL ---
--- connection.go:369 func2 ---
Prop value not provided: database, connName: %!s(MISSING)
2:13PM DBG could not load dbt connection ~ required keys not provided
--- sling.go:398 func1 ---
--- sling.go:439 cliInit ---
--- cli.go:283 CliProcess ---
--- sling_logic.go:150 processRun ---
--- sling_logic.go:196 runTask ---
--- config.go:298 Prepare ---
--- local.go:53 GetLocalConns ---
--- connection.go:590 ReadDbtConnections ---
--- connection.go:92 NewConnectionFromMap ---
--- connection.go:66 NewConnection ---
--- connection.go:499 setURL ---
--- connection.go:369 func2 ---
Prop value not provided: instance, connName: %!s(MISSING)
2023-10-07 14:13:24 DBG type is file-db
2023-10-07 14:13:24 INF connecting to target database (clickhouse)
2023-10-07 14:13:24 INF reading from source file system (file)
2023-10-07 14:13:24 DBG ds.1696702404774.ujj, reading datastream from /tmp/001.parquet
2023-10-07 14:13:26 INF writing to target database [mode: full-refresh]
2023-10-07 14:13:26 DBG drop table if exists default.ag3_tmp
2023-10-07 14:13:26 DBG table default.ag3_tmp dropped
2023-10-07 14:13:26 DBG create table default.ag3_tmp (`county` Nullable(String),
`fips` Nullable(String),
`age_adjusted_incidence_raterate_note___cases_per_100,000` Nullable(String),
`lower_95pct_confidence_interval` Nullable(String),
`upper_95pct_confidence_interval` Nullable(String),
`ci_rankrank_note` Nullable(String),
`lower_ci_ci_rank` Nullable(String),
`upper_ci_ci_rank` Nullable(String),
`average_annual_count` Nullable(String),
`recent_trend` Nullable(String),
`recent_5_year_trend_trend_note_in_incidence_rates` Nullable(String),
`lower_95pct_confidence_interval.1` Nullable(String),
`upper_95pct_confidence_interval.1` Nullable(String),
`year` Nullable(String),
`sex` Nullable(String),
`stage` Nullable(String),
`race` Nullable(String),
`cancer` Nullable(String),
`areatype` Nullable(String),
`age` Nullable(String),
`_sling_loaded_at` Nullable(Int64)) engine=MergeTree ORDER BY tuple()
2023-10-07 14:13:26 INF streaming data
2023-10-07 14:13:26 DBG drop table if exists default.ag3_tmp
2023-10-07 14:13:26 DBG table default.ag3_tmp dropped
2023-10-07 14:13:26 DBG connection was closed, reconnecting
2023-10-07 14:13:26 DBG drop table if exists default.ag3_tmp
2023-10-07 14:13:26 DBG table default.ag3_tmp dropped
2023-10-07 14:13:26 INF execution failed
fatal:
--- sling.go:398 func1 ---
--- sling.go:439 cliInit ---
--- cli.go:283 CliProcess ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_logic.go:150 processRun ---
--- sling_logic.go:212 runTask ---
~ execution failed
--- task_run.go:125 Execute ---
--- task_run.go:83 func1 ---
--- task_run.go:490 runFileToDB ---
--- task_run_write.go:233 WriteToDb ---
--- database.go:2359 BulkImportFlow ---
--- database.go:2346 func1 ---
--- database_clickhouse.go:100 BulkImportStream ---
~ could not bulk import
--- database.go:1535 GetColumns ---
~ could not get matching list of columns from table
--- database.go:1484 GetTableColumns ---
provided field 'age_adjusted_incidence_raterate_note___cases_per_100_000' not found in table `default`.`ag3_tmp`

context canceled

--- task_run.go:83 func1 ---
~ could not write to database
--- task_run.go:490 runFileToDB ---
~ could not insert into default.ag3. Maybe try a higher sample size (SAMPLE_SIZE=2000)?
--- task_run_write.go:237 WriteToDb ---
--- task_run.go:83 func1 ---
--- task_run.go:490 runFileToDB ---
--- task_run_write.go:233 WriteToDb ---
--- database.go:2359 BulkImportFlow ---
--- database.go:2346 func1 ---
--- database_clickhouse.go:100 BulkImportStream ---
~ could not bulk import
--- database.go:1535 GetColumns ---
~ could not get matching list of columns from table
--- database.go:1484 GetTableColumns ---
provided field 'age_adjusted_incidence_raterate_note___cases_per_100_000' not found in table `default`.`ag3_tmp`

context canceled

Azure storage connection stores files locally instead of Azure storage account

I'm fetching data from SQL Server and trying to write it to files in an Azure storage account container. When running Sling, it stores the files locally in a folder with the same name as the connection name, instead of in the Azure storage account.

My config:

connections:
  adls:
    type: azure
    account: my_storage_account
    container: my_container
    conn_str: DefaultEndpointsProtocol=https;AccountName=...

Files are saved to a folder named adls.

Quoted Columns

Not sure if the option exists and I was reading the docs wrong, but for a replication between Snowflake and Postgres, it looks like sling doesn't correctly read, or provide an option to specify, that the Snowflake identifiers are quoted.

[sqlserver] use nvarchar to load unicode data?

  • $ sling --version

      Version: 1.0.48
    
  • file /c/temp/test_unicode.csv

      /c/temp/test_unicode.csv: Unicode text, UTF-8 text, with CRLF line terminators
    
  • cat /c/temp/test_unicode.csv

      a;b;c
      aaa;bbb;ñññ
    
  • cat /c/temp/test_unicode.csv | sling run --tgt-conn dev --tgt-object dbo.test

      INF connecting to target database (sqlserver)
      INF reading from stream (stdin)
      INF delimiter auto-detected: ";"
      INF writing to target database [mode: full-refresh]
      INF streaming data
      WRN bcp version 13 is old. This may give issues with sling, consider upgrading.
      INF dropped table "dbo"."test"
      INF created table "dbo"."test"
      INF inserted 1 rows in 0 secs [2 r/s]
      INF execution succeeded
    

This creates varchar columns:

CREATE TABLE [dbo].[test](
	[a] [varchar](255) NULL,
	[b] [varchar](255) NULL,
	[c] [varchar](255) NULL,
	[_sling_loaded_at] [bigint] NULL
) ON [PRIMARY]
GO

...which use the database default collation, corrupting any incompatible characters:

a	b	c
aaa	bbb	+-+-+-

Would it be possible to use the unicode nvarchar type instead of varchar?

Allow Direct INSERT in Final Table

This would only apply to snapshot mode, or incremental mode with only update_key (no primary_key).

See #33 (comment) for more details.

Changes needed in WriteToDb:

  • Add environment variable to capture setting. Something like SLING_ALLOW_DIRECT_INSERT=TRUE.
  • Add error handling, if mode is not snapshot or incremental mode with only update_key.
    • Add config method to determine if direct insert is allowed: DirectInsertAllowed()
  • Ensure PreSQL is executed before the direct insert. Currently, it is executed after a successful insert into the temp table.
  • Clean up and reorganize WriteToDb a bit for easier reading.

Problem when using MySQL as source with SSL true

Hi,

Once again, thank you for an amazing tool. I have tried to work with a specific MySQL source that has SSL.

Normally I would add ?ssl=true, or something similar. That has no impact, however. I tried to look at the dbio repo, but did not get any wiser.

Do you have any suggestion on how to set sslmode to true when using MySQL as source?

Thanks

sqlserver connection to server instance

I'm trying to connect to a SQL Server instance Server01\Instance01 and getting the error: invalid character "\\" in host name
Combined with Windows authentication I ended up with following config:

connections:
  mssql:
    type: sqlserver
    host: Server01\Instance01
    database: DB     
    schema: dbo
    user: DOMAIN%5Cuser01
    password: xxx

To test it out I have tried to connect with the same user config to another SQL server which is using default instance and it works fine.
How to connect to a specific SQL server instance?

SFTP Ciphers

Receiving this error when connecting to a vendor's SFTP:

ssh: handshake failed: ssh: no common algorithm for client to server cipher; client offered: [[email protected] [email protected] aes128-ctr aes192-ctr aes256-ctr], server offered: [3des-cbc blowfish-cbc aes256-cbc aes128-cbc cast128-cbc]

Is it possible to specify different ciphers?

Target-Schema with "-" not accepted in CLI

If the target database contains "-", an error will be thrown.

sling run \
  --src-stream file:///file.json \
  --src-options 'flatten: true' \
  --tgt-conn MYSQL_WAREHOUSE \
  --tgt-object db-123.test \
  --mode full-refresh

The error is:

fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table test_tmp
~ Error checking & creating schema
did not specify schema. Please specify schema in object name.

Db2 Support

Howdy,

Love how simple the dev ex is for this tool! Do you have any plans to support IBM Db2?

mssql unique identifier is mapped to a binary string

When running a MSSQL sync (thanks for updating!) where the source table has UUID columns, the following warning is emitted in the logs:

WRN using string since type 'uniqueidentifier' not mapped for col 'Id': ""

No big deal, because UUIDs can be stored as strings too, no problem. However, when pushing the data to ClickHouse, it seems the mapping is not entirely correct, because an example Id looks like �4��k@�听Ur�8

ClickHouse also supports UUIDs, so there is not really a need to convert to string first. Is there a way to configure this?

MSSQL connector relies on 2016+ feature "drop table if exists"

The docs don't mention it, but the MSSQL connector relies on the 2016+ feature "drop table if exists".

It's fairly easy to work around, but I understand that it might not be worth it to support archaic versions of MSSQL. Maybe add a note to the docs in that case?

$ cat test.csv | sling run --tgt-conn MSSQL --tgt-object dbo.test_tbl

~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not drop table dbo.test_tbl
~ Error for drop table if exists dbo.test_tbl
~ Could not execute SQL
~ Error executing [tx: false] drop table if exists dbo.test_tbl
mssql: Incorrect syntax near the keyword 'if'.

Define explicit column definitions for CSV file

When I specify string column definitions for a CSV sling run command to Snowflake, I expect that the data will be loaded and no coercion will happen.

In other words, I don't want Sling to infer any data types.

What seems to happen instead is that my CSV data is scanned to infer types, then a type coercion to string happens. This is totally redundant, as I just want to load string field types into string columns (VARCHAR in Snowflake).

How can I explicitly define the column types for my CSV input file (in my case I want them all as strings)?

for example

sling run -d \
--src-stream 'file://./csv/people.csv' \
--src-options '{columns: {NAME:string,BIRTHDATE:string,_CDC_SOURCE_DATE:string,AGE: string}}' \
--tgt-conn GNM_SNOWFLAKE \
--tgt-object 'DEFAULT.PEOPLE' \
--mode full-refresh

with a CSV file :

people.csv

"NAME","BIRTHDATE","_CDC_SOURCE_DATE","AGE"
"Alice","1987-03-01","2023-11-21",21
"Bob","1980-11-12","2023-11-21",82
"JIM","1980-11-12","2023-11-21",*

gives me this table in Snowflake with unexpected results:

  • BIRTHDATE and _CDC_SOURCE_DATE have column type defined as string but end up as a TIMESTAMP...
  • AGE gets the correct type but as you can see in the log output, this seems to be from coercion of an inferred number type to string.
create or replace TABLE DEV_DBT_MISC.DEFAULT.PEOPLE (
	NAME VARCHAR(16777216),
	BIRTHDATE TIMESTAMP_NTZ(9),          -- !! Should be VARCHAR
	_CDC_SOURCE_DATE TIMESTAMP_NTZ(9),   -- !! Should be VARCHAR
	AGE VARCHAR(16777216),               -- !! Is VARCHAR but was coerced from Number
	_SLING_LOADED_AT NUMBER(38,0)
);

Here is the command with debug output

2023-11-25 14:49:37 DBG type is file-db
2023-11-25 14:49:37 INF connecting to target database (snowflake)
2023-11-25 14:49:38 INF reading from source file system (file)
2023-11-25 14:49:38 DBG ds.1700884178386.w2B, reading datastream from ./csv/people.csv
2023-11-25 14:49:38 DBG casting column 'age' as 'string'
2023-11-25 14:49:38 INF writing to target database [mode: full-refresh]
2023-11-25 14:49:38 DBG drop table if exists "DEFAULT"."PEOPLE_TMP"
2023-11-25 14:49:38 DBG table "DEFAULT"."PEOPLE_TMP" dropped
2023-11-25 14:49:38 DBG create table "DEFAULT"."PEOPLE_TMP" ("NAME" text,
"BIRTHDATE" timestamp,
"_CDC_SOURCE_DATE" timestamp,
"AGE" varchar,
"_SLING_LOADED_AT" integer)
2023-11-25 14:49:38 INF streaming data
2023-11-25 14:49:38 DBG USE SCHEMA DEFAULT
2023-11-25 14:49:38 DBG writing to /var/folders/8q/cs5qcmm564d2q18k0z7k10nr0000gn/T/snowflake/put/2023-11-25T144938.997
2023-11-25 14:49:39 DBG writing to /var/folders/8q/cs5qcmm564d2q18k0z7k10nr0000gn/T/snowflake/put/2023-11-25T144938.997/part.01 [fileRowLimit=500000 fileBytesLimit=96000000 compression=ZSTD concurrency=7 useBufferedStream=false fileFormat=csv]
2023-11-25 14:49:39 DBG REMOVE @DEFAULT.sling_staging/"DEFAULT"."PEOPLE_TMP"/2023-11-25T144938.997
2023-11-25 14:49:39 DBG PUT file:///var/folders/8q/cs5qcmm564d2q18k0z7k10nr0000gn/T/snowflake/put/2023-11-25T144938.997/part.01.0001.csv.zst @DEFAULT.sling_staging/"DEFAULT"."PEOPLE_TMP"/2023-11-25T144938.997 PARALLEL=1 AUTO_COMPRESS=FALSE
2023-11-25 14:49:39 DBG COPY INTO "DEFAULT"."PEOPLE_TMP" ("NAME", "BIRTHDATE", "_CDC_SOURCE_DATE", "AGE", "_SLING_LOADED_AT")
FROM ( 
  SELECT T.$1, T.$2, T.$3, T.$4, T.$5
  FROM @DEFAULT.sling_staging/"DEFAULT"."PEOPLE_TMP"/2023-11-25T144938.997/part.01.0001.csv.zst as T
)
FILE_FORMAT = (
  TYPE = CSV
  RECORD_DELIMITER = '\n'
  ESCAPE_UNENCLOSED_FIELD = NONE
  FIELD_OPTIONALLY_ENCLOSED_BY='0x22'
  EMPTY_FIELD_AS_NULL=TRUE
  NULL_IF = ''
  SKIP_HEADER=1
  REPLACE_INVALID_CHARACTERS=TRUE
)
ON_ERROR = ABORT_STATEMENT

2023-11-25 14:49:40 DBG REMOVE @DEFAULT.sling_staging/"DEFAULT"."PEOPLE_TMP"/2023-11-25T144938.997
2023-11-25 14:49:40 DBG select count(*) cnt from "DEFAULT"."PEOPLE_TMP"
2023-11-25 14:49:40 DBG comparing checksums []string{"NAME", "BIRTHDATE", "_CDC_SOURCE_DATE", "AGE", "_SLING_LOADED_AT"} vs []string{"name", "birthdate", "_cdc_source_date", "age", "_sling_loaded_at"}: []string{"NAME", "BIRTHDATE", "_CDC_SOURCE_DATE", "AGE", "_SLING_LOADED_AT"}
2023-11-25 14:49:40 DBG select sum(length("NAME"::string)) as "NAME", sum(DATE_PART('epoch_microsecond', "BIRTHDATE")) as "BIRTHDATE", sum(DATE_PART('epoch_microsecond', "_CDC_SOURCE_DATE")) as "_CDC_SOURCE_DATE", sum(length("AGE"::string)) as "AGE", sum(abs("_SLING_LOADED_AT")) as "_SLING_LOADED_AT" from "DEFAULT"."PEOPLE_TMP" 
2023-11-25 14:49:41 DBG drop table if exists "DEFAULT"."PEOPLE"
2023-11-25 14:49:41 DBG table "DEFAULT"."PEOPLE" dropped
2023-11-25 14:49:41 INF dropped table "DEFAULT"."PEOPLE"
2023-11-25 14:49:41 DBG create table "DEFAULT"."PEOPLE" ("NAME" varchar,
"BIRTHDATE" timestamp,
"_CDC_SOURCE_DATE" timestamp,
"AGE" varchar,
"_SLING_LOADED_AT" bigint)
2023-11-25 14:49:41 INF created table "DEFAULT"."PEOPLE"
2023-11-25 14:49:41 DBG insert into "DEFAULT"."PEOPLE" ("NAME", "BIRTHDATE", "_CDC_SOURCE_DATE", "AGE", "_SLING_LOADED_AT") select "NAME", "BIRTHDATE", "_CDC_SOURCE_DATE", "AGE", "_SLING_LOADED_AT" from "DEFAULT"."PEOPLE_TMP"
2023-11-25 14:49:42 DBG inserted rows into `"DEFAULT"."PEOPLE"` from temp table `"DEFAULT"."PEOPLE_TMP"`
2023-11-25 14:49:42 INF inserted 3 rows in 4 secs [1 r/s]
2023-11-25 14:49:42 DBG connection was closed, reconnecting
2023-11-25 14:49:42 DBG drop table if exists "DEFAULT"."PEOPLE_TMP"
2023-11-25 14:49:42 DBG table "DEFAULT"."PEOPLE_TMP" dropped
2023-11-25 14:49:42 INF execution succeeded
FYI there is a new sling version released (1.0.60). Please run `brew upgrade slingdata-io/sling/sling`.


Sling docker image not starting

Hi,

Mac version works great, however when trying out the docker image(slingdata/sling), i get the following failures on startup

sling: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by sling)
sling: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by sling)
sling: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by sling)

This is both on mac m1, but also on linux. Any ideas?

The file split option does not work on AWS S3.

Hello.
I'm testing AWS S3 as a data load target.

I have written a YAML file like the one below to store the files in 4,000-byte increments.

defaults:
  mode: full-refresh
  
  target_options:
    file_max_bytes: 4000
    format: parquet

streams:
  test.dummy_data:
    object: s3://test-bucket/temporary/sling/

When I checked S3 after the synchronization was completed, the file splitting was not done and the data was loaded into only 1 file.

Is it possible that I wrote the yaml incorrectly?

thanks.

Clickhouse Connections Issue Using Broken Down Format

What Happened

When using the broken-down format for connecting to ClickHouse, we found that the connection test is successful when using the default user. When using another user, we get the same error every time. For example, when one of my teammates is using their connection detail:

  clickhouse_production:
    type: clickhouse
    host: [redacted_ip]
    user: [redacted_user]
    password: [redacted_password]
    database: 

we get the following error:

fatal:
~ 
~ could not test CLICKHOUSE_PRODUCTION (See https://docs.slingdata.io/sling-cli/environment)
~ could not connect to CLICKHOUSE_PRODUCTION
~ could not connect to database
code: 516, message: default: Authentication failed: password is incorrect, or there is no user with such name.

If you have installed ClickHouse and forgot password you can reset it in the configuration file.
The password for default user is typically located at /etc/clickhouse-server/users.d/default-password.xml
and deleting this file will reset the password.
See also /etc/clickhouse-server/users.xml on the server where ClickHouse is installed.

Possible Issue

Looking at the error message, we found that the following line is quite peculiar:

code: 516, message: default: Authentication failed: password is incorrect, or there is no user with such name.

it seems like the ClickHouse server receives connection information for the default user instead of the user that my teammate's environment suggests. Looking into this, my first assumption was that there were conflicts between multiple configuration methods, but after making sure that they only use ~/.sling/env.yml, the problem persists.

So I tried to understand how the connection test works and stumbled upon this line in the flarco/dbio repo:

	case dbio.TypeDbClickhouse:
		setIfMissing("username", "")
		setIfMissing("password", "")
		setIfMissing("schema", c.Data["database"])
		setIfMissing("port", c.Type.DefPort())
		template = "clickhouse://{username}:{password}@{host}:{port}/{database}"

While other database configurations use c.Data["user"] to set username if missing, the ClickHouse case doesn't. So looking at the key name, I assume that the user key in env.yaml won't work with ClickHouse, and that one should either use the URL format or use username as the key instead.

Current Workaround

As mentioned above currently I have two workarounds

Using username key

By defining the user name in the username key instead, I got it working as expected:

  clickhouse_production:
    type: clickhouse
    host: [redacted_host]
    username: [redacted_user]
    password: [redacted_password]
    database: 

but it introduces inconsistency in the yml config, which works, but I'd rather avoid it.

Using URL pattern instead

As seen above, I assume the template will extract each key from the pattern. So clickhouse://{username}:{password}@{host}:{port}/{database} will detect value in {username} as username key. And after testing this with the same user, it worked as well.

Proposed Solution

Currently the only option I can think of is using the same pattern as the other database connections, i.e. using c.Data["user"] to fill the username part, like the following:

	case dbio.TypeDbClickhouse:
-		setIfMissing("username", "")
+		setIfMissing("username", c.Data["user"])
		setIfMissing("password", "")
		setIfMissing("schema", c.Data["database"])
		setIfMissing("port", c.Type.DefPort())
		template = "clickhouse://{username}:{password}@{host}:{port}/{database}"

Better Backfill UX

Currently, when backfilling very large tables, there is no way to incrementally backfill without manual intervention.

Some ideas on improving the ease of backfilling (a hypothetical config sketch follows this list):

  • When mode is incremental, have some option, such as flush_interval, defaulting to 10M records, so that records are written into the final table incrementally instead of copying the full table.
    • This will enable some resilience, in case some error occurs mid-transfer of very large tables. We would not have to start from scratch.
  • Have an input for a range of backfill for the update_key? This will allow specific backfill periods.
  • Backfills are usually ad-hoc operations. Perhaps have backfill mode, which is really incremental mode, with a specific range for the update_key.
    • primary_key and update_key would be required.
    • range could be all, which is really incremental
    • have ability to specify particular columns that we want to backfill.
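
Purely as an illustration of the ideas above (the mode name and range key are hypothetical, not existing sling options), a backfill stream might be configured like this:

streams:
  public.events:
    mode: backfill                      # hypothetical mode, per the proposal
    primary_key: [id]                   # required in this proposal
    update_key: updated_at              # required in this proposal
    source_options:
      range: '2023-01-01,2023-06-30'    # hypothetical bounded range for the update_key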

Unsupported type for storing in int64 column

Hi

I am trying to copy data from a MSSQL DB to S3 Bucket (as a PARQUET format).

I am getting the following error due to a specific column.

fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
context canceled

~
context canceled

~ error writing row
unsupported type for storing in int64 column: int => 10

~ Could not FileSysWriteDataflow
context canceled

~
context canceled

~ error writing row
unsupported type for storing in int64 column: int => 10

This happens only on columns with datatype of smallint.
When I cast this column as int the copy goes through

Copy to CSV format works fine

Column Casing

Sling CLI currently migrates data with column names as is from the source.

Ideally, we should have a configuration option to specify whether we want to force the column names to be all upper-case or lower-case.

This would make it easier to use on databases like Snowflake, where upper-case is the default and thus we don't need to wrap the column name in quotes when querying.

Not able to read snappy compressed Parquet

Hi,

I am not able to read a compressed Parquet file.

I tried with following methods with errors:

  • pandas
ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet file size is 51961168 bytes, smaller than the size reported by footer's (143101950bytes)
  • Redshift Spectrum
Invalid metadata size in file footer: 143101950 bytes. File size: 51961168 bytes.

This could be related to #31
Let me know if you need any additional information.

Cannot validate certificate for <IP address> because it doesn't contain any IP SANs

Getting following error starting with v1.0.52 using docker:

TLS Handshake failed: tls: failed to verify certificate: x509: cannot validate certificate for <IP address> because it doesn't contain any IP SANs

Previous versions pass connection test with same settings:

...
host: '<IP address>'
trust_cert: true
...

Not sure if related to the CLI or the container image failing to skip certificate verification.

Enhancement: Update Audit Column in Target Table

Sling currently replicates the data from source to target as is.
It would help immensely for post-sling processing patterns if there was a timestamp audit column to represent when the record was last updated.

Only the snapshot mode will populate an extra _sling_loaded_at column to indicate when the data was loaded.
It would be great if we could enable this on other modes as a target option.

Uppercase Bigquery dataset name support

When trying to setup a target BigQuery dataset as:

export SL_BQ_STAGE='
type: bigquery
project: atlas-datawarehouse-uat
dataset: ADG_TEST_DATASET
location: us-east1
key_file: /workspaces/dagnote/creds/bigquery-atlas-datawarehouse-uat.json
'
I get the following error:

2:55PM INF writing to target database [mode: incremental]
2:55PM INF execution failed
FYI there is a new sling version released (1.0.60). Please run pip install -U sling.
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ Could not WriteToDb
~ could not drop table adg_test_dataset.order_tmp
~ Error for drop table if exists adg_test_dataset.order_tmp
~ Could not execute SQL
~ Error executing drop table if exists adg_test_dataset.order_tmp
googleapi: Error 404: Not found: Dataset atlas-datawarehouse-uat:adg_test_dataset was not found in location US, notFound

BTW I tested with both version 1.59.0 and 1.60.0.

All works fine with a lowercase dataset name.

I am confused about whether it is truly using the location "us-east1", because the error reads:
"Error 404: Not found: Dataset atlas-datawarehouse-uat:adg_test_dataset was not found in location US, notFound"

Use account and container values from connection config instead of full URL

Configuration for an Azure storage account requires account and container to be specified:

connections:
  AZURE_STORAGE:
    type: azure
    account: <account>
    container: <container>
    sas_svc_url: '<sas_svc_url>'

At the same time, the source stream and target object require a full URL, which also includes the account and container:

source:
  conn: MY_AZURE_CONT
  stream: https://my_account.blob.core.windows.net/my-container/my_file.csv

Would it make sense to use config from MY_AZURE_CONT and have stream/object starting from container as root?

source:
  conn: MY_AZURE_CONT
  stream: my-path/my-subpath/my_file.csv

Cannot open database "dev/stderr" (sqlserver)

Issue: Execution fails when loading to SQL Server due to bcp utility command.

Description:

  • Problem: Execution fails when trying to load data into SQL Server.
  • Probable Cause: There seems to be an issue with the bcp utility command, specifically with the -e flag.
  • Verification: I created a simple csv and tried running the bcp command which also failed with the same error.
  • Temporary Solution: By removing the "-e /dev/stderr" parameter, the load is successful.

Documentation Reference:
The Microsoft documentation mentions that the -e flag should have a full path to the error file. It also states not to have a space between -e and the path if the path starts with a forward slash or hyphen.

Recommendation:
If the code can be updated to specify a full path for the error file, this issue might be resolved. However, I couldn't pinpoint the exact location in the codebase where this is happening.

>> sling run -c C:\extract.yaml

3:17PM INF connecting to source database (oracle)
3:17PM INF connecting to target database (sqlserver)
3:17PM INF reading from source database
3:18PM INF writing to target database [mode: full-refresh]
3:18PM INF streaming data
54s 56,845 1064 r/s
3:19PM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
context canceled

~ could not bulk import
context canceled

~ SQL Server BCP Import Command -> bcp dbo.testtable_tmp in C:\Users\kevin\AppData\Local\Temp\sqlserver.dbo.testtable_tmp.1.csv2503925445 -S **** -d Test_DW_L -U Test_Login -P **** -t , -m 1 -c -q -b 50000 -F 2 -e /dev/stderr      
SQL Server BCP Import Error  ->
SQLState = 37000, NativeError = 4060
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Cannot open database "ev/stderr" requested by the login. The login failed.
SQLState = 28000, NativeError = 18456
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user 'Test_Login'.

exit status 1

~ could not insert into "dbo"."TestTable"
context canceled

~ could not bulk import
context canceled

~ SQL Server BCP Import Command -> bcp dbo.testtable_tmp in C:\Users\kevin\AppData\Local\Temp\sqlserver.dbo.testtable_tmp.1.csv2503925445 -S **** -d Test_DW_L -U Test_Login -P **** -t , -m 1 -c -q -b 50000 -F 2 -e /dev/stderr      
SQL Server BCP Import Error  ->
SQLState = 37000, NativeError = 4060
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Cannot open database "ev/stderr" requested by the login. The login failed.
SQLState = 28000, NativeError = 18456
Error = [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Login failed for user 'Test_Login'.

exit status 1

Error when not specifying a schema

Hey :)

While using the sling-cli for data transfer from a PostgreSQL database to a DuckDB database using the following command:
Command ❌
        sling run --src-conn dvdrental_db --src-stream 'category' --tgt-conn duck_db_conn --tgt-object 'category'

I encountered an issue when not specifying a schema for the target table in DuckDB, resulting in the following error message:
Error: near line 1: Parser Error: zero-length delimited identifier at or near """"
LINE 1: create schema ""
I know that in DuckDB the default schema is public. I modified the command to include the schema explicitly, and it worked.
Command ✅
        sling run --src-conn dvdrental_db --src-stream 'category' --tgt-conn duck_db_conn --tgt-object 'public.category'
I guess that the problem occurs during command parsing in the target, as it appears that the code requires schema specification; otherwise, it throws an error.

CLI options syntax confusion (--target-opts)

Attempting to specify inline options in --target-opts results in a rather confusing experience.

In #19 you said:

It accepts YAML or JSON syntax, but it has to be surrounded by quotes from the command line.

  • YAML syntax: --tgt-options '{use_bulk: false}'
  • JSON syntax: --tgt-options '{"use_bulk": false}'

...but my tests in cmd.exe on Windows demonstrate that:

  • '{use_bulk: false}' and '{"use_bulk": false}' trigger yaml: found unexpected end of stream
    c:\temp>type test.csv | sling run --tgt-conn MSSQL --tgt-object dbo.test -d --tgt-options '{use_bulk: false}'
    fatal:
    --- sling.go:398 func1 ---
    --- sling.go:439 cliInit ---
    --- cli.go:283 CliProcess ---
    ~ invalid target options -> '{use_bulk:
    --- sling_logic.go:100 processRun ---
    yaml: found unexpected end of stream
    
  • '{use_bulk:false}' yields yaml: unmarshal errors:
    c:\temp>type test.csv | sling run --tgt-conn MSSQL --tgt-object dbo.test -d --tgt-options '{use_bulk:false}'
    fatal:
    ~
    ~ invalid target options -> '{use_bulk:false}'
    yaml: unmarshal errors:
      line 1: cannot unmarshal !!str `{use_bu...` into sling.TargetOptions
    
  • {use_bulk: false} and {"use_bulk": false} yield yaml: line 1: did not find expected node content
    c:\temp>type test.csv | sling run --tgt-conn MSSQL --tgt-object dbo.test -d --tgt-options {use_bulk: false}
    fatal:
    ~
    ~ invalid target options -> {use_bulk:
    yaml: line 1: did not find expected node content
    
  • {use_bulk:false} and "{use_bulk:false}" are silently ignored
    c:\temp>type test.csv | sling run --tgt-conn MSSQL --tgt-object dbo.test -d --tgt-options {use_bulk:false}
    ...SQL Server BCP Import Error
    
  • "{use_bulk: false}", {'use_bulk':false}, and "{'use_bulk':false}" work as expected:
    c:\temp>type test.csv | sling run --tgt-conn MSSQL --tgt-object dbo.test -d --tgt-options "{use_bulk: false}"
    
c:\temp>type test.csv
a;b;c
1;2;3

c:\temp>sling --version
Version: 1.0.39

c:\temp>type %USERPROFILE%\.sling\env.yaml
# Environment Credentials for Sling CLI

# See https://docs.slingdata.io/sling-cli/environment

connections:
  MSSQL:
    type: sqlserver
    url: sqlserver://etl:REDACTED@sqlsrv:1433/master

variables: {}

Replication Wildcard issue for MySQL

The following replication YAML does not work.

source: MYSQL_SOURCE
target: MYSQL_TARGET

# default config options which apply to all streams
defaults:
  mode: full-refresh 

  # specify pattern to use for object naming in target connection, see below for options
  object: 'new_db.{stream_schema}_{stream_table}'

streams:
  "`my-db-name`.*": {}

The table name is rendered as "my-db-name"."table_name", when it should be using backticks.

Furthermore, when using object formatting, such as object: 'new_db.{stream_schema}_{stream_table}', the {stream_schema} does not replace the - from my-db-name. Should be my_db_name.

sling 1.0.18 bug

Hello,

The last update does not seem to work properly.

sling Version: 1.0.17 (DB is postgres + timescale)

10:57AM INF truncated table public.wh2a
10:57AM INF inserted 537 rows in 0 secs [1,732 r/s]
10:57AM INF execution succeeded
10:57AM WRN FYI, a new sling version is available (please run sling update)

After updating to 1.0.18

10:59AM INF writing to target database [mode: truncate]
10:59AM INF streaming data
10:59AM INF created table public.WH2A
10:59AM INF truncated table public.WH2A
10:59AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ Could not insert from temp
~ could not get column list for public.WH2A
~ could not get column list for public.WH2A
did not find any columns for table: "public"."WH2A"

Reverting to 1.0.17 makes sling working as expected.

flatten-Option does not work

I tried the example from the blog https://blog.slingdata.io/import-json-files-into-mysql and that JSON worked. However, the JSON below results in an error.

This happens because of the special character '@' in the keys. I will do a cleanup on my side, but maybe this was not intended.

[ { "Kennung": "M", "Sortennummer": "75118", "Bezeichnung1": "--Geschenk-Gutschein 3--", "Bilder": { "Bild": { "@NR": "2", "Datei": "G75118_2.jpg" } } } ]

Sling Changes File Path Capitalization Unintentionally.

Why does sling change the capitalization of the file path when I use the command or the python lib to load a PostgreSQL table into a CSV file? Specifically, I specify the path as 'file:///Downloads/assets/file.csv,' but sling saves the CSV file to 'downloads/assets' with lowercase letters. This behavior seems unexpected.

Parquet files created with Sling are unreadable

I'm using Sling to create parquet files where the source is either a SQL Server or Sqlite. When I dump the contents of a table or a query to a parquet file and then try to read it with parq I get the following error:

|> parq test_parquet.parquet

Traceback (most recent call last):
File "/Users/bplexico/dev/dagenv/bin/parq", line 8, in
sys.exit(main())
^^^^^^
File "/Users/bplexico/dev/dagenv/lib/python3.11/site-packages/parq/main.py", line 41, in main
pq_table = pq.read_table(cmd_args.file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/bplexico/dev/dagenv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2956, in read_table
dataset = _ParquetDatasetV2(
^^^^^^^^^^^^^^^^^^
File "/Users/bplexico/dev/dagenv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2496, in init
[fragment], schema=schema or fragment.physical_schema,
^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 1358, in pyarrow._dataset.Fragment.physical_schema.get
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

I saw in another issue that you're using parquet-cli under the hood so I tried that one out too. When I run parquet-tool on the file, I get the following:

|>>parquet-tool cat test_parquet.parquet

2023/11/29 10:07:52 failed to read the parquet header: "reading file meta data failed: invalid parquet file header"

Is this an issue with the library you're using to create the parquet files or is something else going on?

Incremental with update-key reinserting existing rows

Log

DBG type is db-db
INF connecting to source database (azuresql)
INF connecting to target database (sqlserver)
INF getting checkpoint value
DBG select max("UpdateDate") as max_val from "rpt"."Location"
INF reading from source database
DBG select * from "rpt"."Location" where UpdateDate >= '2023-10-24 15:00:34.253'

task_run_read.go

	if t.usingCheckpoint() {
		// select only records that have been modified after last max value
		incrementalWhereCond := "1=1"
		if cfg.IncrementalVal != "" {
			greaterThan := ">="
			if val := os.Getenv("SLING_GREATER_THAN_EQUAL"); val != "" {
				greaterThan = lo.Ternary(cast.ToBool(val), ">=", ">")
			}
			incrementalWhereCond = g.R(
				"{update_key} {gt} {value}",
				"update_key", cfg.Source.UpdateKey,
				"value", cfg.IncrementalVal,
				"gt", greaterThan,
			)
		}

Currently, the incrementalWhereCond is defined with operand >= instead of >. Overriding operand using environmental variable SLING_GREATER_THAN_EQUAL = '>' renders expected behavior.

rudder error

Hello,
First thanks a lot for sling !

I am getting the following errors on the log:

9:34AM INF connecting to target database (postgres)
9:34AM INF reading from source file system (file)
9:34AM INF writing to target database [mode: truncate]
9:34AM INF streaming data
1s 27,369 ? p/s 2s 86,156 58791 r/s 3s 149,002 59052 r/s 3s 208,232 73492 r/s 9:34AM INF truncated table public.wh2a
9:34AM INF inserted 257893 rows in 24 secs [10,544 r/s]
9:34AM INF execution succeeded
rudder 2023/08/19 09:34:30 INFO: response 401 401 Unauthorized – Failed to read writeKey from header
rudder 2023/08/19 09:34:30 ERROR: 1 messages dropped because they failed to be sent and the client was closed

It seems that they started after version 0.87.90.

Apart the error on the log everything seems fine.

DB is postgres + Timescale

Cheers

Max

SFTP Password Authentication?

When attempting to connect to an SFTP site using only a username and password, I get this error:

"ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain"

my env.yaml looks like this:

connections:

  MY_SFTP:
    type: sftp
    host: <host>
    user: <user>
    port: <port>
    password: <password>

Truncate Cascade

The truncate strategy should optionally allow a cascade option so that the replication doesn't break in the presence of foreign key constraints or other table dependencies
