anelendata / tap-rest-api
Singer.io tap for generic Rest API
License: Apache License 2.0
Creating this ticket to help scope and track effort of moving to the SDK for Taps, sponsored by Meltano and documented here: https://sdk.meltano.com
I am running the tap within Meltano, specifically the command meltano invoke tap-rest-api --infer_schema or meltano select --list --all tap-rest-api. I have the following meltano.yml:
version: 1
send_anonymous_usage_stats: false
elt.buffer_size: 52428800
plugins:
  extractors:
  - name: tap-rest-api
    pip_url: tap-rest-api
    namespace: tap_rest_api
    executable: tap-rest-api
    capabilities:
    - catalog
    - config
    - state
    - discover
    settings:
    - name: streams
    - name: url
    - name: catalog_dir
    - name: schema_dir
    - name: schema
    - name: auth_method
    config:
      url: http://<whatever>.com
      auth_method: no_auth
      catalog_dir: ./extract
      schema_dir: ./extract
      streams: test_stream
      schema: test_schema
Here's full text of the error. It appears to be trying to read a file that has not been created yet.
Catalog discovery failed: command ['/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/bin/tap-rest-api', '--config', '/Users/.../meltano/.meltano/run/tap-rest-api/tap.config.json', '--discover'] returned 1: INFO Loading Schemas
INFO Loading schema for test_stream
CRITICAL [Errno 2] No such file or directory: './extract/test_stream.json'
Traceback (most recent call last):
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/bin/tap-rest-api", line 8, in <module>
sys.exit(main())
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/singer/utils.py", line 229, in wrapped
return fnc(*args, **kwargs)
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/tap_rest_api/__init__.py", line 188, in main
discover(CONFIG, STREAMS)
File "/Users.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/tap_rest_api/schema.py", line 64, in discover
config["schema"])
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/tap_rest_api/schema.py", line 54, in _discover_schemas
stream)})
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/tap_rest_api/schema.py", line 39, in load_discovered_schema
schema = load_schema(schema_dir, stream.tap_stream_id)
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/tap_rest_api/schema.py", line 33, in load_schema
schema = utils.load_json(os.path.join(schema_dir, "{}.json".format(entity)))
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/singer/utils.py", line 108, in load_json
with open(path) as fil:
FileNotFoundError: [Errno 2] No such file or directory: './extract/test_stream.json'
One concern is that the command Meltano generates seems to use discover instead of infer_schema. So maybe this is a bug in Meltano, or does it just demonstrate an incompatibility with Meltano?
tap-rest-api keeps a copy of the last extracted record in the bookmark (aka state), together with the last recorded index or timestamp. Because the start index/time of the next run is inclusive, this extra information is used to skip the record that was already extracted.
However, it is not good practice to include a raw record in the bookmark, mainly for security reasons.
So we should keep a digest of the record in the bookmark instead of the raw record.
To ensure backward compatibility, the duplicate check is made against both the raw record and the digest.
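The duplicate check described above could look roughly like the following. This is a minimal sketch, not the tap's actual code: the bookmark keys last_record_digest and last_record, and both function names, are hypothetical, and SHA-256 over a key-sorted JSON dump is just one reasonable choice of digest.

```python
import hashlib
import json


def record_digest(record):
    """Return a SHA-256 hex digest of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_duplicate(record, bookmark):
    """Check a record against a bookmark holding a digest or a legacy raw record.

    The key names "last_record_digest" and "last_record" are hypothetical.
    """
    if bookmark.get("last_record_digest") == record_digest(record):
        return True
    # Backward compatibility: older bookmarks may still hold the raw record.
    return "last_record" in bookmark and bookmark["last_record"] == record


record = {"id": 42, "updated_at": "2021-01-01T00:00:00Z"}
new_bookmark = {"last_update": record["updated_at"],
                "last_record_digest": record_digest(record)}
old_bookmark = {"last_update": record["updated_at"], "last_record": record}
```

With this shape, a new-style bookmark never contains field values from the source system, while a state file written by an older version is still honored on the first run after an upgrade.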
I'm using this tap within a Meltano pipeline. I'm trying to run meltano select --list --all tap-rest-api but I'm getting the following error:
Cannot list the selected attributes: Catalog discovery failed: command ['/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/bin/tap-rest-api', '--config', '/Users/.../meltano/.meltano/run/tap-rest-api/tap.config.json', '--discover'] returned 1: CRITICAL local variable 'streams' referenced before assignment
Traceback (most recent call last):
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/bin/tap-rest-api", line 8, in <module>
sys.exit(main())
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/singer/utils.py", line 229, in wrapped
return fnc(*args, **kwargs)
File "/Users/.../meltano/.meltano/extractors/tap-rest-api/venv/lib/python3.7/site-packages/tap_rest_api/__init__.py", line 177, in main
for stream in streams:
UnboundLocalError: local variable 'streams' referenced before assignment
Operating System: MacOS Catalina
meltano.yml:
version: 1
send_anonymous_usage_stats: false
elt.buffer_size: 52428800
plugins:
  extractors:
  - name: tap-rest-api
    pip_url: tap-rest-api
    namespace: tap_rest_api
    executable: tap-rest-api
    capabilities:
    - catalog
    - config
    - state
    - discover
    settings:
    - name: url
    config:
      url: http://<something>.com
The Python backoff module lets code retry at a specified interval, up to a given maximum number of tries, when it encounters an exception such as an HTTP server error (5xx).
tap-rest-api uses the Singer-wrapped version, referenced as singer.utils.backoff.
But the parameters are passed as though it were the native backoff:
@utils.backoff((backoff.expo, requests.exceptions.RequestException), _giveup)
@utils.ratelimit(20, 1)
def generate_request(stream_id, url, auth_method="no_auth", headers=None,
                     username=None, password=None):
https://github.com/anelendata/tap-rest-api/blob/master/tap_rest_api/helper.py#L301
This results in a "TypeError: catching classes that do not inherit from BaseException is not allowed" error in the backoff routine:
CRITICAL catching classes that do not inherit from BaseException is not allowed
--
Traceback (most recent call last):
File "/app/workspace/proc_01/lib/python3.6/site-packages/backoff/_sync.py", line 94, in retry
ret = target(*args, **kwargs)
File "/app/workspace/proc_01/lib/python3.6/site-packages/singer/utils.py", line 95, in wrapper
return func(*args, **kwargs)
File "/app/workspace/proc_01/lib/python3.6/site-packages/tap_rest_api/helper.py", line 321, in generate_request
resp.raise_for_status()
File "/app/workspace/proc_01/lib/python3.6/site-packages/requests/models.py", line 960, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://xxxx
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/workspace/proc_01/bin/tap-rest-api", line 8, in <module>
sys.exit(main())
File "/app/workspace/proc_01/lib/python3.6/site-packages/singer/utils.py", line 229, in wrapped
return fnc(*args, **kwargs)
File "/app/workspace/proc_01/lib/python3.6/site-packages/tap_rest_api/__init__.py", line 191, in main
auth_method, raw=args.raw, filter_by_schema=filter_by_schema)
File "/app/workspace/proc_01/lib/python3.6/site-packages/tap_rest_api/sync.py", line 211, in sync
raise e
File "/app/workspace/proc_01/lib/python3.6/site-packages/tap_rest_api/sync.py", line 208, in sync
filter_by_schema=filter_by_schema)
File "/app/workspace/proc_01/lib/python3.6/site-packages/tap_rest_api/sync.py", line 98, in sync_rows
config.get("password"))
File "/app/workspace/proc_01/lib/python3.6/site-packages/backoff/_sync.py", line 95, in retry
except exception as e:
TypeError: catching classes that do not inherit from BaseException is not allowed
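The failure mode in the traceback above can be reproduced with the standard library alone. The sketch below is not singer's or backoff's code: retry_on and expo are hypothetical stand-ins, written on the assumption that the wrapped helper takes only the exception classes as its first argument. When a wait generator is mixed into that tuple, the except clause rejects it as soon as any exception is raised.

```python
import functools


def expo():
    """Stand-in for a wait generator such as backoff.expo; it is not an exception class."""
    wait = 1
    while True:
        yield wait
        wait *= 2


def retry_on(exceptions, max_tries=3):
    """Hypothetical wrapper that expects only exception classes, like the wrapped helper."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise
        return wrapper
    return decorator


calls = {"fixed": 0}


@retry_on((expo, ConnectionError))  # wrong: a wait generator mixed into the exception tuple
def broken_request():
    raise ConnectionError("503 Service Unavailable")


@retry_on(ConnectionError)  # only exception classes are passed
def fixed_request():
    calls["fixed"] += 1
    if calls["fixed"] < 3:
        raise ConnectionError("503 Service Unavailable")
    return "ok"
```

Calling broken_request() raises TypeError instead of retrying, while fixed_request() succeeds on its third attempt. Applied to the tap, the analogous change would presumably be to pass only requests.exceptions.RequestException (plus the giveup callable) to the wrapped decorator.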
--infer_schema mode produces a null type in the JSON schema:
.....
"tz": {
  "type": [
    "null"
  ]
},....
and it causes the sync to crash with
... File "/home/danyel/.virtualenvs/tap-rest-api/lib/python3.7/site-packages/getschema/impl.py", line 300, in fix_type
on_invalid_property)
File "/home/danyel/.virtualenvs/tap-rest-api/lib/python3.7/site-packages/getschema/impl.py", line 283, in fix_type
obj_type = obj_type[1]
IndexError: list index out of range
The sync runs fine when the schema is manually fixed to include a non-null type:
.....
"tz": {
  "type": [
    "null",
    "string"
  ]
},....
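The IndexError comes from indexing the second element of a type list that only contains "null". A defensive lookup might look like the sketch below. It is not the fix that went into getschema; the function name and the "string" fallback are assumptions chosen for illustration.

```python
def pick_non_null_type(type_spec, default="string"):
    """Return the first non-null type from a JSON schema "type" value.

    Falls back to `default` (an arbitrary choice here) when only "null" is listed.
    """
    types = [type_spec] if isinstance(type_spec, str) else list(type_spec)
    non_null = [t for t in types if t != "null"]
    return non_null[0] if non_null else default
```

Filtering out "null" rather than assuming a fixed position also handles lists where "null" comes second, and a bare string type.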
tap-rest-api depends on getschema and I introduced this bug with getschema 0.2.4 :(
anelendata/getschema#13
It is fixed with getschema 0.2.5 with this pull request (it is merged, but feel free to review and comment):
https://github.com/anelendata/getschema/pull/14/files
To fix quickly, just:
pip install -U getschema
Sorry for any inconvenience it might have caused.
Thank you @mlavoie-sm360 for reporting this issue!
While manually editing the JSON schema for the example, I changed all properties of type "number" to "sting" (a typo for "string"). This causes the tap to fail with the message:
UnboundLocalError: local variable 'filtered' referenced before assignment
This is because in json2schema.py types are handled between lines 172 and 213, with no failsafe for invalid types.
I think the most helpful way to handle this would be to add something like the following before line 213:
else:
    raise Exception("Schema file X contains invalid type Y")
Let me know if you want me to go ahead and make that PR.
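To illustrate the proposed failsafe, here is a small self-contained sketch of a type dispatch with a final else branch. The function name, its parameters, and the set of handled types are hypothetical and do not mirror json2schema.py.

```python
def convert_value(value, type_name, schema_file="schema.json"):
    """Convert a value by JSON schema type; names here are hypothetical."""
    if type_name == "string":
        converted = str(value)
    elif type_name == "integer":
        converted = int(value)
    elif type_name == "number":
        converted = float(value)
    elif type_name == "boolean":
        converted = bool(value)
    else:
        # Failsafe: without this branch `converted` would be unbound below.
        raise ValueError(
            "Schema file {} contains invalid type {!r}".format(schema_file, type_name))
    return converted
```

With the else branch in place, a typo such as "sting" produces an error that names the schema file and the bad type instead of an UnboundLocalError.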
Currently, properties can only be specified at the first level. Use JSONPath (https://restfulapi.net/json-jsonpath/) so that properties anywhere in the schema can be specified.
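As a rough illustration of addressing nested properties, the sketch below resolves a dotted path against a record. It covers only a small subset of JSONPath (child keys and list indexes, with an optional leading "$"), and get_by_path is a hypothetical helper, not part of the tap.

```python
def get_by_path(record, path):
    """Resolve a dotted path such as "$.meta.updated_at" or "items.0.id"."""
    current = record
    for part in path.lstrip("$").strip(".").split("."):
        if isinstance(current, list):
            current = current[int(part)]
        else:
            current = current[part]
    return current


sample = {"meta": {"updated_at": "2021-01-01"}, "items": [{"id": 7}]}
```

A config value such as "$.meta.updated_at" could then point at a timestamp that is not at the top level of the record. A full implementation would more likely rely on an existing JSONPath library.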
Some REST APIs paginate with a dynamically generated next-page URL or key included in the preceding response. We need to be able to identify that entry in the response and use it for the next call.
The tap also needs to detect the last page and complete the sync.
One example is the Crunchbase Search API:
https://data.crunchbase.com/docs/paginating-through-the-search-api
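The loop described above could be sketched as follows. Everything here is an assumption made for illustration: the fetch callable, the next_page_url and items key names, and the in-memory fake API. Real services name and place these fields differently.

```python
def sync_pages(fetch, first_url, next_key="next_page_url", items_key="items"):
    """Yield items page by page, following the next-page value in each response.

    `fetch`, `next_key`, and `items_key` are hypothetical; real APIs name these differently.
    """
    url = first_url
    seen = set()
    while url and url not in seen:  # stop on a missing key or a repeated URL
        seen.add(url)
        response = fetch(url)
        for item in response.get(items_key, []):
            yield item
        url = response.get(next_key)


# In-memory stand-in for an HTTP client, for illustration only.
FAKE_API = {
    "https://api.example.com/items": {
        "items": [1, 2],
        "next_page_url": "https://api.example.com/items?page=2",
    },
    "https://api.example.com/items?page=2": {"items": [3], "next_page_url": None},
}

rows = list(sync_pages(FAKE_API.__getitem__, "https://api.example.com/items"))
```

The sync ends when the response carries no next-page value. Tracking the URLs already visited guards against an API that keeps returning the same page.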
This particular API I'm dealing with (aXcelerate) requires that some requests be sent as POSTs. Could the request type be set in the config?
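One way such a setting might work is sketched below with the standard library. The config keys http_method and request_body are hypothetical and are not options the tap is known to support.

```python
import json
import urllib.request


def build_request(url, config):
    """Build a request whose HTTP method comes from a hypothetical "http_method" key."""
    method = config.get("http_method", "GET").upper()
    if method not in ("GET", "POST"):
        raise ValueError("Unsupported http_method: {}".format(method))
    body = None
    headers = {}
    if method == "POST":
        body = json.dumps(config.get("request_body", {})).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method=method)


request = build_request("https://api.example.com/search",
                        {"http_method": "post", "request_body": {"q": "x"}})
```

Defaulting to GET keeps existing configurations working unchanged.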
Things to do:
Make the url field accept both a string and a dictionary (key=stream, value=url).
This project seems like a good candidate to implement using an OpenAPI-standard specification as an input schema for generation of Singer models.
That is, take the JSON generation one step further and make use of an officially published OpenAPI specification for the target source API.
For example, here is the official OpenAPI specification for the Stripe API v3: https://raw.githubusercontent.com/stripe/openapi/master/openapi/spec3.yaml
With URL & bookmark keys per stream
fixed in #14
Thank you @ReptilianBrain
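For reference, a url setting that accepts either a single string or a dictionary keyed by stream could be normalized as in the sketch below. The function name and error handling are hypothetical and are not taken from the change in #14.

```python
def resolve_urls(url, streams):
    """Map each stream name to its URL.

    `url` may be a single string shared by all streams or a dict of stream -> URL.
    """
    if isinstance(url, str):
        return {stream: url for stream in streams}
    missing = [s for s in streams if s not in url]
    if missing:
        raise KeyError("No url configured for streams: {}".format(", ".join(missing)))
    return {stream: url[stream] for stream in streams}
```

Normalizing once at startup means the rest of the code can always look up the URL by stream name.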
Currently, tap-rest-api doesn't fully support OAuth. It can customize the header, so if the developer manually obtains a token and the token does not expire, the tap still works with OAuth-protected APIs. For most services, however, tokens do expire.
OAuth usually provides a refresh token with which we can obtain a new access token after the old one expires.
It's probably possible to implement a generic OAuth refresh token flow. If we can achieve this, we just need to set a refresh token in the config, then run the flow to obtain the access token.
I recently implemented such a flow in the PyPardotSF project, and I'm hoping to reuse some of the logic from it.
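A generic refresh step might be shaped like the sketch below, which builds (but does not send) a standard refresh_token grant request and turns a parsed token response into a header. The function names and the example token URL are hypothetical, and providers differ in detail: some expect the client credentials in an HTTP Basic header rather than in the form body.

```python
import urllib.parse
import urllib.request


def build_refresh_request(token_url, client_id, client_secret, refresh_token):
    """Build (but do not send) a refresh-token grant request."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(
        token_url, data=form, method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"})


def auth_header(token_response):
    """Turn a parsed token response into an Authorization header."""
    return {"Authorization": "Bearer {}".format(token_response["access_token"])}


refresh_request = build_refresh_request(
    "https://auth.example.com/oauth/token", "my-client", "my-secret", "my-refresh-token")
```

The tap would send this request before the sync, parse the JSON response, and merge the resulting header into the headers it already lets users customize.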