A minimalist (geospatial) ETL
Please read the documentation web site.
Released under the MIT License
The detailed list of Open Source dependencies can be found in our 3rd-party software report.
Home Page: https://kalisio.github.io/krawler/
License: MIT License
When an error occurs and the hook chain is stopped, it might be useful to have certain hooks run anyway, more specifically cleanup hooks, to avoid polluting stores with intermediate files.
We could add an onError: true flag in the options of each hook that should be applied even in case of errors.
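A minimal sketch of how such a flag could drive the hook chain (the runner below is illustrative, not the actual krawler pipeline code; the onError option is the proposal, not an existing API):

```javascript
// Sketch only: run a chain of hooks; after a failure, only hooks explicitly
// flagged with onError (e.g. cleanup hooks) still run, then the error is rethrown.
function runHooks (hooks, context) {
  let error = null
  for (const hook of hooks) {
    // Once the chain is broken, skip regular hooks but keep cleanup ones
    if (error && !(hook.options && hook.options.onError)) continue
    try {
      hook.run(context)
    } catch (err) {
      error = error || err
    }
  }
  // Rethrow so the job still reports the failure after cleanup has run
  if (error) throw error
}
```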
Just like unit mapping, we should be able to support date/time conversion based on e.g. https://momentjs.com/. This could be done conditionally as well, like discardIf.
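A sketch of what such a conversion could look like, using plain Date for illustration (the convertDates name and its options are hypothetical; a real implementation would rely on a library like moment for format handling):

```javascript
// Sketch only: convert date/time properties of a task item, mirroring the
// unit mapping hook. Targets here are 'iso' (ISO string) or epoch milliseconds.
function convertDates (item, options) {
  for (const property of Object.keys(options.properties)) {
    const target = options.properties[property]
    if (item[property] === undefined) continue
    const date = new Date(item[property])
    // Convert to an ISO string or epoch milliseconds depending on the target
    item[property] = (target === 'iso') ? date.toISOString() : date.getTime()
  }
  return item
}
```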
Launch the Vigicrue sample
The sample should run
(node:15636) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'create' of undefined at runJobWithOptions (D:\PROJECTS\KRAWLER\krawler\lib\cli.js:164:31)
Module versions (especially the part that's not working): 0.5
NodeJS version: 8
Operating System: windows
To enhance the global test coverage, we need to add some tests for the PostgreSQL hooks. We also need to update the Travis configuration in order to instantiate a PostgreSQL database. See the following links:
Build a job forwarding a mongo connection to tasks:
hooks: {
  jobs: {
    before: {
      connectMongo: {
        url: xxx,
        clientPath: 'taskTemplate.client'
      }
    },
    after: {
      disconnectMongo: {
        clientPath: 'taskTemplate.client'
      }
    }
  }
}
When the job is finished the client is correctly disconnected, but the property is still available in the task template, causing the tasks to access an invalid connection object if launched again with a CRON.
The task template should be correctly cleared
Module versions (especially the part that's not working):
NodeJS version: 8
Operating System: Windows
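A sketch of the kind of cleanup that could fix this, i.e. removing the dotted client path from the task template after disconnect (unsetPath is a hypothetical helper; lodash.unset would do the same job):

```javascript
// Sketch only: remove a dotted property path from an object, so a CRON re-run
// does not find a stale client reference in the task template.
function unsetPath (object, path) {
  const keys = path.split('.')
  const last = keys.pop()
  let target = object
  for (const key of keys) {
    if (!target[key]) return
    target = target[key]
  }
  delete target[last]
}
```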
Because by default we need to register the services on the Feathers app in order to reach them internally, they are always accessible from an external point of view, even when API mode is off.
They should be restricted using a disallow hook or middleware.
API calls are unauthorized (except healthcheck)
API calls are authorized
NodeJS version: 8
Operating System: Windows
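For illustration, this is the behaviour a disallow-style guard provides (a sketch in the spirit of the disallow hook from feathers-hooks-common; internal calls have no provider in their params, external ones do):

```javascript
// Sketch only: reject any call carrying an external transport provider
// (rest, socketio, ...) while still allowing internal service calls,
// for which hook.params.provider is undefined.
function disallowExternal (hook) {
  if (hook.params.provider) {
    throw new Error(`Provider '${hook.params.provider}' cannot call '${hook.method}'`)
  }
  return hook
}
```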
Now the same pipeline is applied to all tasks of a job. If you need a slightly different pipeline for some of your tasks you need to set up a new job for them.
This leads to a problem when you need to merge the results of all your tasks: this has to be done manually and cannot be done using a job after hook.
If we could select which hooks are skipped/run according to task properties through a filter, we could set up a single dynamic pipeline to manage all tasks.
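A sketch of such a filter predicate (matchTask and the filter shape are hypothetical, shown only to illustrate the idea of selecting hooks per task):

```javascript
// Sketch only: decide whether a hook should run for a given task by matching
// task properties against a filter of expected values (or arrays of values).
function matchTask (task, filter) {
  return Object.keys(filter).every(key => {
    const expected = filter[key]
    if (Array.isArray(expected)) return expected.includes(task[key])
    return task[key] === expected
  })
}
```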
For now we use json2csv and fast-csv for CSV management. The latter is more stream-friendly and should be used whenever possible.
If the command never returns.
For now we can gunzip but not unzip (i.e. extract ZIP archives).
When using Docker, it would be great to have hooks such as connectDocker (with the connection parameters) and disconnectDocker, in the same way as was done for Postgres and Mongo.
npm install -g krawler
krawler > Command not found
The command should be globally available after install
Module versions (especially the part that's not working): 0.5.1
NodeJS version: 8.9
Operating System: Windows
We should let the interval function run and not stop the server.
A lot of hooks are meant to be used as before OR after hooks while they could be used in both cases. This is mostly due to the fact that we directly access data or result on the hook object instead of using the generic Feathers utility functions getItems()/replaceItems().
We recently introduced callOnHookItems() to apply a given function on hook items to support such use cases; it should be used whenever possible.
Moreover, hooks reading/writing data like readJson/writeJson write/read data to/from a default path like result.data, which is not aware of the type of the hook. The data path should adapt itself to be easier to use, e.g. data.data in a before hook and result.data in an after hook.
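A sketch of what a type-aware default data path could look like (illustrative helpers only, in the spirit of Feathers' getItems()/replaceItems()):

```javascript
// Sketch only: pick the default item location according to the hook type,
// reading/writing data in before hooks and result in after hooks.
function getHookData (hook) {
  return hook.type === 'before' ? hook.data : hook.result
}
function setHookData (hook, items) {
  if (hook.type === 'before') hook.data = items
  else hook.result = items
}
```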
See https://docs.feathersjs.com/migrating.html, this should not be too difficult.
For now we can only launch a krawler job using the CLI (once or via cron). To use it as a web app we need to customize it, ie create another node project having krawler as a dependency.
It should be possible to also directly create a web app from the CLI.
It would be great to make it work the other way around from #24, i.e. krawler launching jobs/tasks triggered by external events.
As we already have the description as a JSON object, transforming it into a text form compliant with the mermaidjs file format should be easy.
As a source of inspiration, have a look at the KDK documentation diagrams.
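A sketch of such a transformation on a simplified hooks object (the input shape and the output layout are illustrative only, not an actual krawler job description):

```javascript
// Sketch only: turn a simplified hooks description into mermaid flowchart text,
// chaining the hooks of each stage (before/after) in declaration order.
function hooksToMermaid (hooks) {
  const lines = ['graph TD']
  for (const stage of Object.keys(hooks)) {
    let previous = stage
    for (const name of Object.keys(hooks[stage])) {
      lines.push(`  ${previous} --> ${name}`)
      previous = name
    }
  }
  return lines.join('\n')
}
```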
For now we can create/delete collections and read/insert documents, some use cases also require to be able to delete/update documents or use GridFS.
We should use https://github.com/feathersjs-ecosystem/feathers-sync to dispatch krawler events to external systems.
There are a lot of debug logs but not many production logs; it would be great to have more production logs, specifically reporting errors on fault-tolerant jobs, with a configurable log level. Moreover, we could allow creating logs using a hook for job-specific messages.
Krawler already allows plugging new hooks dynamically by code. Thus, it is easy to create plugins to make it more modular and remove some (most?) hooks from the core.
However, in order to keep it working without any specific code (i.e. a simple job file) we need to load the plugins required by the job on-the-fly and register their hooks.
For instance node-gdal is a big dependency that would be better to integrate through a plugin.
Run a non fault-tolerant job with a connectMongo hook without the DB available.
The job stops with error
The job hangs with the following message Cannot read property 'hasOwnProperty' of undefined
Tell us about the applicable parts of your setup.
NodeJS version: 8
Operating System: windows
To be used as a health check, e.g. for Docker.
Create a job with transform JSON/clear data hooks like this: transformJson: { dataPath: 'result', pick: ['data'] }, clearData: {}
Operations like pick actually allocate a new object (e.g. https://github.com/kalisio/krawler/blob/master/src/utils.js#L96), so the object referenced by the hook result will change.
Because some references to the initial job tasks do exist (e.g. https://github.com/kalisio/krawler/blob/master/src/services/jobs.js#L44 and https://github.com/kalisio/krawler/blob/master/src/jobs/jobs.async.js#L14), if the target referenced object is changed this will cause old data to be kept in memory until the job is finished, even when using the clear data hook.
References to initial task objects should not be changed by transformation hooks.
Module versions: 0.7
NodeJS version: 8
Operating System: Windows
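A sketch of the proposed behaviour: an in-place variant of pick that mutates the existing object instead of allocating a new one, so references held elsewhere (e.g. the job's task list) keep pointing at the same, now-trimmed object (pickInPlace is a hypothetical helper):

```javascript
// Sketch only: keep the listed keys and delete all others, mutating the
// object in place so existing references remain valid and memory is freed.
function pickInPlace (object, keys) {
  for (const key of Object.keys(object)) {
    if (!keys.includes(key)) delete object[key]
  }
  return object
}
```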
Create a filter with $and or $or and the corresponding predicates
Returns the filtered data
MongoError: $and must be an array
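For reference, MongoDB expects $and/$or to hold an array of predicate objects; a filter of this shape should be forwarded unchanged to the driver (the predicate values below are arbitrary examples):

```javascript
// Correct shape: $and/$or take an array of predicate objects
const filter = {
  $and: [
    { type: 'station' },
    { value: { $gt: 10 } }
  ]
}
```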
We need some hooks to be able to get/put files from/to an FTP server.
https://github.com/sergi/jsftp seems to be a good candidate to implement these hooks.
For now this is only supported by kue jobs.
For now a fault-tolerant job might run even if all tasks fail; it might be interesting to stop it once a given number of tasks have failed, similarly to the maximum number of attempts on a task.
Launch a job with e.g. 4 concurrent tasks and make one of the tasks fail.
The other tasks are correctly run and reported as having run.
The tasks of the same concurrency bucket that should run after the failure are skipped; this is because Promise.all() here rejects with the reason of the first promise that rejects.
Module versions : 0.7
NodeJS version: 8
Operating System: Windows
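A sketch of the classic workaround: reflect each task promise so the whole bucket settles before results are inspected (illustrative code, not the actual jobs implementation; Promise.allSettled is not available on Node 8, hence the helper):

```javascript
// Sketch only: wrap each promise so it always fulfills with its own state,
// then Promise.all never rejects and every task in the bucket runs to the end.
function reflect (promise) {
  return promise.then(
    value => ({ state: 'fulfilled', value }),
    error => ({ state: 'rejected', error })
  )
}

function runBucket (tasks) {
  // Never rejects: each entry reports its own outcome instead
  return Promise.all(tasks.map(task => reflect(task())))
}
```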
We should at least add:
In order to be able to download data for a specific run, time range and set of variables.
We can probably reuse part of https://github.com/weacast/weacast.
Today we have a mix between hooks, stores and task types that is not so easy to understand. Not sure if it is practically possible, but we could probably unify everything behind hooks, also to provide something more easily extensible, e.g.:
We don't need the stores service anymore; hooks will allocate required objects on the fly and can pass them from job to tasks using taskTemplate.
The tasks service will only instantiate "empty" tasks that will trigger the pipeline execution.
Kue will add support for failover and concurrency.
worker-farm might also be used as it is simpler and does not require a side tool like redis.
agenda also looks great and would allow job scheduling.
Launch the csv2db sample with MongoDB job file.
The app ends when job is completed.
The app is stuck after completion.
Module version: 0.6
NodeJS version: 8
Operating System: Windows
For now we can set up a run interval but supporting a CRON pattern would be better, e.g. using https://github.com/kelektiv/node-cron.
Using a specification like index: [{ gid: 1 }, { geometry: '2dsphere' }] raises the following error: The field 'geometry' is not valid for an index specification.
We should be able to create multiple single indices.
We can only create compound indices using the supported syntax.
Module versions : 0.7
When a server returns an HTTP error code to a download task, the output file is still written and contains the error message instead of the actual data.
The expected behavior is to raise an error and stop the job.
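A sketch of the expected guard, checking the status code before the response body is piped to the output store (checkResponse is a hypothetical helper, not an existing krawler function):

```javascript
// Sketch only: fail the download task on HTTP error codes instead of
// silently writing the error body to the store.
function checkResponse (statusCode) {
  if (statusCode >= 400) {
    throw new Error(`Download failed with HTTP error code ${statusCode}`)
  }
  return statusCode
}
```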
Now most hooks require data to be read from the file system (e.g. writeCSV, readXML); adding an in-memory store would allow reading from and writing to any store without requiring intermediate files.
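A sketch of what a minimal in-memory store could look like (the MemoryStore class and its readBuffer/writeBuffer interface are hypothetical, not the actual store contract):

```javascript
// Sketch only: a store keeping buffers in a map keyed by path, so hooks
// could exchange data without touching the file system.
class MemoryStore {
  constructor () {
    this.buffers = new Map()
  }
  writeBuffer (path, buffer) {
    this.buffers.set(path, Buffer.from(buffer))
  }
  readBuffer (path) {
    if (!this.buffers.has(path)) throw new Error(`No entry for ${path}`)
    return this.buffers.get(path)
  }
}
```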
When running a CRON job, the job is launched a first time as soon as krawler is run, which is usually not what is expected; e.g. if I want a CRON job to run once a day it will run twice on first launch.
It might be interesting to have an option to force a first run when required.
When a job combines multiple independent tasks (e.g. requesting different web services) we should be able to specify which ones can be fault-tolerant, i.e. any error is caught and the job continues anyway.
The faulty task should probably be removed from the results however, and hooks following the faulty one should be skipped.
Create a job with 2 tasks, each task running e.g. the writeJson hook.
Each task generates a different file in S3 according to its ID.
The second task overwrites the file content of the first task, no other file being created.