meltanolabs / tap-csv
A Singer Tap for extracting data from CSV files built using the Meltano SDK.
License: Apache License 2.0
From a slack thread https://meltano.slack.com/archives/C01TCRBBJD7/p1679410813099339
It's sometimes helpful to have additional metadata about the files that the records were extracted from. There's precedent in s3-csv and sftp already.
I'd vote to leave this off by default to keep current behavior and allow it to be turned on using a config boolean. The other implementations do it by default, but I don't know the implications for existing users or whether it would cause problems if a new property started being extracted.
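A minimal sketch of the opt-in behavior proposed above, not the tap's actual implementation. The function name, the add_metadata_columns flag, and the _sdc_-prefixed column names are all hypothetical and chosen only for illustration.

```python
import csv
import os
from datetime import datetime, timezone
from typing import Dict, Iterator


def get_rows_with_metadata(
    file_path: str, add_metadata_columns: bool = False
) -> Iterator[Dict[str, object]]:
    """Yield CSV rows as dicts, optionally tagged with source-file metadata."""
    # Hypothetical metadata: last-modified time of the source file, in UTC.
    modified = datetime.fromtimestamp(os.path.getmtime(file_path), tz=timezone.utc)
    with open(file_path, "r", newline="", encoding="utf-8") as f:
        # Row numbers count data rows (the header is not counted), starting at 1.
        for row_number, row in enumerate(csv.DictReader(f), start=1):
            if add_metadata_columns:
                row["_sdc_source_file"] = file_path
                row["_sdc_source_row_number"] = row_number
                row["_sdc_source_file_mtime"] = modified.isoformat()
            yield row
```

With the flag left at its default of False, the rows are unchanged, so existing users would not see new properties.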
Allow using tap-csv in environments that use the standard Python.org interpreter version 3.10. Python.org released version 3.10 more than 4 months ago, and some environments have Python.org 3.10 as the only available interpreter. More and more environments will move to Python 3.10 as the only available interpreter in the very near future (e.g. Ubuntu will move to Python 3.10 as the default in less than 60 days with the release of Ubuntu 22.04 LTS).
1. Run meltano init
2. Add the tap-csv extractor into the new project with meltano add extractor tap-csv
3. Installation of tap-csv fails with error:
Added extractor 'tap-csv' to your Meltano project
Variant: meltanolabs (default)
Repository: https://github.com/MeltanoLabs/tap-csv
Documentation: https://hub.meltano.com/extractors/csv.html
Installing extractor 'tap-csv'...
Extractor 'tap-csv' could not be installed: failed to install plugin 'tap-csv'.
Running command git clone --filter=blob:none --quiet https://github.com/MeltanoLabs/tap-csv.git /tmp/pip-req-build-47nr2w7s
ERROR: Package 'tap-csv' requires a different Python: 3.10.2 not in '<3.10,>=3.6.2'
Failed to install plugin(s)
Same as steps 1-3 above.
4. Installation of tap-csv succeeds
Related to discussions in #11, we should implement state here.
We extract configs like encoding and others throughout the code (e.g. encoding default definitions) and cast to the defaults. It would be better to do the default coalescing at the topmost level for consistency, to allow --about to return those defaults, and for better readability.
Also, we define the files array in meltano.yml without the rich type definitions. We should update that to include the full output of --about --format=json.
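One way to picture "default coalescing at the topmost level" is a single merge step applied once to each file entry. This is only a sketch: the DEFAULTS values and the resolve_file_config helper are assumptions for illustration, not names from the tap.

```python
from typing import Any, Dict

# Hypothetical defaults; the real tap may define different keys and values.
DEFAULTS: Dict[str, Any] = {"encoding": "utf-8", "delimiter": ","}


def resolve_file_config(file_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a user-supplied file entry over the defaults, once, up front."""
    # Explicit None values are treated as "not set" and fall back to the default.
    provided = {key: value for key, value in file_config.items() if value is not None}
    return {**DEFAULTS, **provided}
```

Downstream code could then read config["encoding"] directly instead of repeating fallback logic in several places.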
hi @pnadolny13,
You seem to be the most active maintainer on this project so I took the liberty to ping you.
I am wondering if there is currently any dev activity on this project? Most of the commits appear to be dependabot generated.
If there is dev activity, I was wondering if there is anything going on with regard to adding support for other CSV separators, specifically ; (semicolon).
Thanks in advance for your help.
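For reference, Python's standard csv module already accepts a delimiter argument, so semicolon support is mostly a matter of passing a setting through. The sketch below shows only the stdlib call; the read_delimited helper is a hypothetical name and says nothing about how the tap wires up its config.

```python
import csv
import io
from typing import Dict, List


def read_delimited(text: str, delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse delimited text into a list of row dicts using the given separator."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
```

For example, read_delimited("a;b\n1;2\n", delimiter=";") parses the header as two columns, a and b, rather than one column named "a;b".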
Following up #125
It would be much more convenient to set proper types in advance, since we know all of them ahead of time.
Slack thread https://meltano.slack.com/archives/C01TCRBBJD7/p1665147787434089
OG issue #44
While testing Meltano I ran into this issue:
[info ] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5719: character maps to <undefined>
Why is this surfacing as [info] instead of error?
I wondered if this might happen because in the Pandas based extractor/loader I wrote before trying Meltano I had to play with encodings because of non-UTF8 chars.
I think a reasonable fix would be to add a config setting, encoding:, for tap-csv, and then, if set, use the specified encoding in the call to open() in the get_rows() function in client.py, e.g.:
with open(file_path, "r", encoding=encoding) as f:
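A slightly fuller sketch of that idea, under the assumption that get_rows() receives the encoding as a parameter (the real signature in client.py may differ). Passing encoding=None keeps the platform default, which on many Windows setups is cp1252, where byte 0x81 is undefined.

```python
import csv
from typing import Iterator, List, Optional


def get_rows(file_path: str, encoding: Optional[str] = None) -> Iterator[List[str]]:
    """Yield each CSV row as a list of strings, decoding with the given encoding."""
    # encoding=None falls back to the platform default (locale dependent).
    with open(file_path, "r", newline="", encoding=encoding) as f:
        yield from csv.reader(f)
```

A UTF-8 file containing "Á" (bytes 0xC3 0x81) reads fine with encoding="utf-8" but raises UnicodeDecodeError when decoded as cp1252, which matches the shape of the error reported above.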
I've already shared this in the #windows slack channel, but I want to make sure that my use case isn't lost.
The files I'm loading have about 20% of the columns that come and go.
If I load one file with one set of columns, then another file that has a new column into the same table, the new column is ignored:
[info ] time=2022-07-19 16:35:06 name=tap-csv level=WARNING message=Property 'Column Name (2021))' was present in the 'table_meltano' stream but not found in catalog schema. Ignoring.
In a loader I wrote before trying Meltano, I handled the transient columns (20% of the columns) by using Pandas and combining them into a single json column in the dataframe before loading. Then I use dbt to run some SQL that extracts that data back out into its own, properly designed table for reporting.
I think I could figure out how to create my own tap-csv to implement my method of handling the transient columns, but I've probably spent much longer working on ELT than I should and need to move on to the phase of my project where I'm setting up Superset, Hex, and perhaps Lightdash so others can start exploring the data.
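The approach described above (folding the transient columns into one JSON column) can be sketched without Pandas using plain dicts. This is a substitute illustration, not the poster's loader; pack_transient_columns, stable_columns, and the "extras" key are hypothetical names.

```python
import json
from typing import Dict, Iterable


def pack_transient_columns(
    row: Dict[str, str], stable_columns: Iterable[str], extras_key: str = "extras"
) -> Dict[str, str]:
    """Keep the stable columns as-is and fold all other columns into one JSON string."""
    stable = set(stable_columns)
    packed = {key: value for key, value in row.items() if key in stable}
    extras = {key: value for key, value in row.items() if key not in stable}
    # sort_keys gives a deterministic string regardless of column order in the file.
    packed[extras_key] = json.dumps(extras, sort_keys=True)
    return packed
```

Because every row then has the same set of top-level keys, a new column appearing in a later file would land inside the JSON value rather than being dropped for not matching the catalog schema.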
Hello,
I have installed the tap-csv using:
meltano add extractor tap-csv
It appears to successfully install. However when I use the command to invoke the tap and then pull the version, it's listed as an unrecognized argument:
meltano invoke tap-csv --version
tap-csv: error: unrecognized arguments: --version
I can still invoke and it gives me options, but none appear (at least from what I see here) to be able to provide this:
usage: tap-csv [-h] [-c CONFIG] [-s STATE] [-p PROPERTIES] [--catalog CATALOG] [-d]
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Config file
-s STATE, --state STATE
State file
-p PROPERTIES, --properties PROPERTIES
Property selections
--catalog CATALOG Catalog file
-d, --discover Do schema discovery
Is there a change here to be aware of? Thanks!
Documentation states “path: Local path to the file to be ingested. Note that this may be a directory, in which case all files in that directory and any of its subdirectories will be recursively processed”, but in the code I see the os.listdir function used to get the file list from the path (so only files in the top-level folder are processed).
As @pnadolny13 mentioned, the code was ported from the legacy version https://gitlab.com/meltano/tap-csv/-/blob/master/tap_csv/__init__.py#L35 and it seems like the recursion function is missing https://gitlab.com/meltano/tap-csv/-/blob/master/tap_csv/__init__.py#L46.
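For comparison, a recursive listing that matches the documented behavior could look roughly like the sketch below, using os.walk instead of os.listdir. The function name list_files_recursive is hypothetical and this is not the legacy implementation linked above.

```python
import os
from typing import List


def list_files_recursive(path: str) -> List[str]:
    """Return the file itself, or every file under a directory and its subdirectories."""
    if os.path.isfile(path):
        return [path]
    found: List[str] = []
    # os.walk visits the top directory and every nested subdirectory.
    for root, _dirs, files in os.walk(path):
        for name in files:
            found.append(os.path.join(root, name))
    # Sorted so the processing order is stable across platforms.
    return sorted(found)
```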
Hi!
I first used the files: configuration in Meltano to load one file, 'patients.csv', and all was good.
Then I used csv_files_definition: to load two files, 'payers.csv' and 'patients.csv',
but then it gave me this message:
$ meltano run tap-csv target-postgres
2022-12-29T19:07:04.462545Z [warning ] Failed to create symlink to 'meltano.exe': administrator privilege required
2022-12-29T19:07:04.593348Z [info ] Environment 'dev' is active
2022-12-29T19:07:07.404553Z [info ] INFO:tap-csv:Skipping deselected stream 'payers'. cmd_type=elb consumer=False name=tap-csv producer=True stdio=stderr string_id=tap-csv
2022-12-29T19:07:07.490515Z [info ] Block run completed. block_type=ExtractLoadBlocks err=None set_number=0 success=True
I checked this directory in meltano
run>.meltano>run>tap-csv>state.json:
{
"bookmarks": {
"patients": {}
}
}
Not sure where to make it select whatever in the csv_definition.json file:
[
{
"entity": "patients",
"path": "./extract/synthea/patients.csv",
"keys": ["Id"],
"encoding": "UTF-8"
},
{
"entity": "payers",
"path": "./extract/synthea/payers.csv",
"keys": ["Id"],
"encoding": "UTF-8"
}
]
Any ideas?
(This is a partial dup of another issue, but I'll let that one focus on file encoding)
I'm getting an error when trying to meltano run. Can you find it?
If the green info text accurately matched the level=WARNING or level=CRITICAL, it would be infinitely easier to figure out what's going on.
Currently state is a listed capability, but we don't bookmark files. Try to implement state tracking for files in an OS-agnostic way.
Or just remove the state capability from the README and MeltanoHub and note that it doesn't manage state.
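One possible shape for OS-agnostic file bookmarking is to record each file's modification time and skip files that have not changed since. This is a sketch under that assumption only; the helper names and the bookmark layout are hypothetical and do not reflect the Singer state format the tap would actually emit.

```python
import os
from typing import Dict, List


def files_to_sync(paths: List[str], bookmarks: Dict[str, float]) -> List[str]:
    """Return the files modified after their bookmarked time (or never bookmarked)."""
    # os.path.getmtime is available on Windows, macOS, and Linux alike.
    return [p for p in paths if os.path.getmtime(p) > bookmarks.get(p, float("-inf"))]


def update_bookmarks(paths: List[str], bookmarks: Dict[str, float]) -> Dict[str, float]:
    """Return a new bookmark dict with the current modification time of each file."""
    updated = dict(bookmarks)
    for p in paths:
        updated[p] = os.path.getmtime(p)
    return updated
```

Modification time is a coarse signal: a file rewritten with identical content would be re-synced, so a content hash might be preferable where that matters.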