kalisio / krawler

A minimalist (geospatial) ETL

Home Page: https://kalisio.github.io/krawler/

License: MIT License

Languages: JavaScript 98.03%, Shell 1.96%, Batchfile 0.01%, Smarty 0.01%

Topics: etl, geospatial, gdal, geojson, s3, postgis, mongodb, wms, wcs, ogc

krawler's Introduction

A minimalist (geospatial) ETL

Documentation

Please read the documentation website (https://kalisio.github.io/krawler/).

License

Released under the MIT License

The detailed list of Open Source dependencies can be found in our 3rd-party software report.

krawler's Issues

Tag which hooks should run in case of errors

When an error occurs and the hook chain is stopped, it might be useful to have certain hooks run anyway, more specifically cleanup hooks, to avoid polluting stores with intermediate files.

We could add an onError: true flag in the options of each hook that should be applied even in case of errors.
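
A minimal sketch of how such a flag could behave, assuming a hypothetical runner named runHooks and hook entries shaped as { fn, options }; neither is part of krawler's actual API. After a first failure only the hooks tagged with onError: true keep running, and the original error is still reported at the end.

```javascript
// Hypothetical sketch: run a chain of hooks and, after a failure, keep running
// only the hooks tagged with `onError: true` (e.g. cleanup hooks).
async function runHooks (hooks, context) {
  let error = null
  for (const { fn, options = {} } of hooks) {
    // Once an error occurred, skip every hook not flagged for error handling
    if (error && !options.onError) continue
    try {
      await fn(context)
    } catch (err) {
      // Keep the first error only, it is the one reported to the caller
      if (!error) error = err
    }
  }
  if (error) throw error
  return context
}
```

With a chain such as download, convert, cleanup where only cleanup is tagged, a failure in download skips convert but still runs cleanup.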

Date/time format conversion

Just like unit mapping we should be able to support date/time conversion based on e.g. https://momentjs.com/.

Create an apply hook to run a custom function on items

Could be done conditionally as well, like discardIf.

Allow the template function to tackle properties whose type is an array of objects

readJson hook fails silently on parse error

Steps to reproduce

Launch the Vigicrue sample

Expected behavior

The sample should run

Actual behavior

(node:15636) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'create' of undefined at runJobWithOptions (D:\PROJECTS\KRAWLER\krawler\lib\cli.js:164:31)

System configuration

Module versions (especially the part that's not working): 0.5

NodeJS version: 8

Operating System: windows

Add tests for PostgreSQL hooks

To improve the global test coverage, we need to add tests for the PostgreSQL hooks. The Travis configuration also needs to be updated in order to instantiate a PostgreSQL database. See the following links:

Mongo client not correctly cleared

Steps to reproduce

Build a job forwarding a mongo connection to tasks:

hooks: {
  jobs: {
    before: {
      connectMongo: {
        url: xxx,
        clientPath: 'taskTemplate.client'
      }
    },
    after: {
      disconnectMongo: {
        clientPath: 'taskTemplate.client'
      }
    }
  }
}

Actual behavior

When the job is finished the client is correctly disconnected, but the property is still available in the task template, so the tasks access an invalid connection object if the job is launched again by a CRON.

Expected behavior

The task template should be correctly cleared

System configuration

Module versions (especially the part that's not working):

NodeJS version: 8

Operating System: Windows

Unauthorize API requests in non-API mode

Steps to reproduce

Because by default we need to register the services on the Feathers app in order to reach them internally, they are always accessible from the outside, even when API mode is off.

This should be restricted using a disallow hook or a middleware.

Expected behavior

API calls are unauthorized (except healthcheck)

Actual behavior

API calls are authorized

System configuration

NodeJS version: 8

Operating System: Windows

Add a filter to select which tasks are processed by a given hook

Actual behavior

Currently the same pipeline is applied to all tasks of a job. If you need a slightly different pipeline for some of your tasks you need to set up a new job for them.

This leads to a problem when you need to merge the results of all your tasks: it has to be done manually and cannot be done using a job after hook.

Expected behavior

If we could select which hooks are skipped/run according to task properties through a filter, we could set up a single dynamic pipeline to manage all tasks.
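
One possible shape for such a filter, sketched under assumptions: forTasks is a hypothetical wrapper, the task properties are assumed to be available in context.data, and the filter is a plain equality match rather than a full query language.

```javascript
// Hypothetical wrapper: run `hook` only for tasks whose properties equal those
// listed in `match`, and leave the other tasks untouched.
function forTasks (match, hook) {
  return context => {
    // Assumption: the task being processed is exposed as `context.data`
    const task = context.data || {}
    const selected = Object.keys(match).every(key => task[key] === match[key])
    return selected ? hook(context) : context
  }
}

// Example: a hook applied only to tasks flagged with `format: 'csv'`
const convertCsv = forTasks({ format: 'csv' }, context => {
  context.data.converted = true
  return context
})
```

Tasks that do not match simply pass through, so a single pipeline can still end with a common job after hook that merges all results.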

Only use fast-csv for CSV file management

For now we use json2csv and fast-csv for CSV management. The latter is more stream-friendly and should be used whenever possible.

runCommand hook might prevent the job from finishing

If the command never returns.

Add ability to unzip files

For now we can gunzip but not unzip

Enhance docker hooks

When using Docker, it would be great to have some hooks such as connectDocker (with the connection parameters) and disconnectDocker, in the same way as was done for Postgres and Mongo.

krawler command not available

Steps to reproduce

npm install -g krawler
krawler > Command not found

Expected behavior

The command should be globally available after install

System configuration

Module versions (especially the part that's not working): 0.5.1

NodeJS version: 8.9

Operating System: Windows

Support MongoDB like PG

Recurring jobs stop on error

We should let the interval function run and not stop the server.

Make hooks more generic

A lot of hooks can only be used as before OR after hooks while they could be used in both cases. This is mostly due to the fact that we directly access data or result on the hook object instead of using the generic Feathers utility functions getItems()/replaceItems().

We recently introduced callOnHookItems() to apply a given function on hook items to support such use cases. It should be used whenever possible.

Moreover, hooks reading/writing data like readJson/writeJson use a default path like result.data, which is not aware of the type of the hook. The data path should adapt itself to be easier to use, e.g. data.data in a before hook and result.data in an after hook.
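
The adaptive data path could be resolved along these lines. This is only a sketch: getDefaultDataPath and getData are hypothetical names, and the only thing relied upon is the type property ('before' or 'after') that Feathers exposes on the hook object.

```javascript
// Hypothetical helper: choose the default data path from the hook type
function getDefaultDataPath (hook) {
  return hook.type === 'before' ? 'data.data' : 'result.data'
}

// Hypothetical helper: read the value stored at the given (or default) dotted path
function getData (hook, dataPath = getDefaultDataPath(hook)) {
  return dataPath.split('.').reduce(
    (value, key) => (value == null ? undefined : value[key]),
    hook
  )
}
```

An explicit dataPath option still takes precedence, so existing job files that set it would keep working under this scheme.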

Unzip hook returns before file has been written

See kalisio/k-vigicrues#9.

Allow to create swarm service

Migrate to Feathers V3

See https://docs.feathersjs.com/migrating.html, should not be too difficult.

Generate web app from job description

For now we can only launch a krawler job using the CLI (once or via cron). To use it as a web app we need to customize it, ie create another node project having krawler as a dependency.

It should be possible to also directly create a web app from the CLI.

Add a WFS task type

Make krawler able to react to external systems

It would be great to make it work the other way around compared to #24, i.e. krawler launching jobs/tasks triggered by external events.

Generate job flowcharts automatically

As we already have the description as a JSON object, transforming it into a text form compliant with the mermaidjs file format should be easy.

As a source of inspiration have a look at the KDK documentation diagrams.
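
As a rough illustration, the sketch below turns one group of hooks into a Mermaid flowchart definition. The function name toMermaid is hypothetical, and it assumes the hooks are given as an object whose keys are hook names in execution order, as in a job file.

```javascript
// Hypothetical sketch: convert a group of hooks (e.g. the `before` hooks of a job)
// into a Mermaid top-down flowchart, one node per hook, chained in order.
function toMermaid (hooks) {
  const lines = ['graph TD']
  Object.keys(hooks).forEach((name, index) => {
    // Declare the node, using the hook name as its label
    lines.push(`  h${index}[${name}]`)
    // Link it to the previous hook of the chain
    if (index > 0) lines.push(`  h${index - 1} --> h${index}`)
  })
  return lines.join('\n')
}

console.log(toMermaid({ connectMongo: {}, readJson: {}, disconnectMongo: {} }))
```

A complete generator would also have to cover job versus task hooks and before versus after groups, which this sketch leaves out.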

Enhance MongoDB vocabulary

For now we can create/delete collections and read/insert documents. Some use cases also require deleting/updating documents or using GridFS.

Make krawler able to communicate with external systems

We should use https://github.com/feathersjs-ecosystem/feathers-sync to dispatch krawler events to external systems.

Integrate a logger like winston

There are a lot of debug logs but few production logs. It would be great to have more production logs, specifically reporting errors on tolerant jobs, with a configurable log level. Moreover, we could allow creating logs using a hook for job-specific messages.

Split features using hooks plugins

Krawler already allows plugging in new hooks dynamically by code. Thus, it is easy to create plugins to make it more modular and remove some (most?) hooks from the core.

However, in order to make it continue working without any specific code (i.e. simple job file) we need to 'load' required plugins for the job on-the-fly and register hooks.

For instance node-gdal is a big dependency that would be better to integrate through a plugin.

Job hangs after mongo connection failure

Steps to reproduce

Run a non fault-tolerant job with a connectMongo hook without the DB available.

Expected behavior

The job stops with error

Actual behavior

The job hangs with the following message: Cannot read property 'hasOwnProperty' of undefined

System configuration

NodeJS version: 8

Operating System: windows

Add a healthcheck entry point

To be used as health check for eg Docker.

Possible memory leak

Steps to reproduce

Create a job with a transform JSON/clear data hooks like this: transformJson: { dataPath: 'result', pick: ['data'] }, clearData: {}

Actual behavior

Operations like pick actually allocate a new object (eg https://github.com/kalisio/krawler/blob/master/src/utils.js#L96), so the object referenced by the hook result changes.

Because some references to the initial job tasks do exist, eg https://github.com/kalisio/krawler/blob/master/src/services/jobs.js#L44 and https://github.com/kalisio/krawler/blob/master/src/jobs/jobs.async.js#L14, changing the target referenced object causes old data to be kept in memory until the job is finished, even when using the clear data hook.

Expected behavior

References to initial task objects should not be changed by transformation hooks.
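
One way to keep references stable is to transform the object in place instead of allocating a new one. The sketch below is an assumption about how that could look, with pickInPlace as a hypothetical helper and not the function currently used in src/utils.js.

```javascript
// Hypothetical helper: keep only the listed keys by deleting the others on the
// original object, so that every existing reference still points to it.
function pickInPlace (object, keys) {
  for (const key of Object.keys(object)) {
    if (!keys.includes(key)) delete object[key]
  }
  return object
}

const task = { id: 'task-1', data: [1, 2, 3], headers: { etag: 'abc' } }
const picked = pickInPlace(task, ['data'])
console.log(picked === task) // the same object is returned, not a copy
```

Because the task object itself is mutated, a later clear data hook releases the data for every holder of the reference.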

System configuration

Module versions: 0.7

NodeJS version: 8

Operating System: Windows

Add a task which can handle an Overpass API query

templateQueryObject does not support filters containing arrays

Steps to reproduce

Create a filter with $and or $or and the corresponding predicates

Expected behavior

Returns the filtered data

Actual behavior

MongoError: $and must be an array
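
The error suggests that arrays are turned into plain objects while the query is templated. A recursive traversal that treats arrays separately avoids this. In the sketch below templateQuery is a hypothetical stand-in, not the actual templateQueryObject implementation, and the `<%= name %>` substitution is a simplified placeholder for a real template engine.

```javascript
// Hypothetical sketch: recursively template string values of a query while
// keeping arrays as arrays. Assumes plain JSON-like values only.
function templateQuery (query, context) {
  // Arrays (e.g. the operands of $and/$or) are mapped, not converted to objects
  if (Array.isArray(query)) return query.map(item => templateQuery(item, context))
  if (query && typeof query === 'object') {
    const result = {}
    for (const key of Object.keys(query)) result[key] = templateQuery(query[key], context)
    return result
  }
  if (typeof query === 'string') {
    // Minimal `<%= name %>` substitution standing in for a real template engine
    return query.replace(/<%=\s*(\w+)\s*%>/g, (match, name) => String(context[name]))
  }
  return query
}

const filter = templateQuery(
  { $and: [{ station: '<%= id %>' }, { level: { $gt: 2 } }] },
  { id: 'A42' }
)
console.log(Array.isArray(filter.$and))
```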

Add some capabilities to handle FTP

We need some hooks to be able to get/put files from/to an FTP server.
https://github.com/sergi/jsftp seems to be a good candidate to implement these hooks.

Add a retry capability for tasks in async job

For now this is only supported by kue jobs.

Stop job after a given number of failed tasks

For now a fault-tolerant job might keep running even if all tasks fail. It might be interesting to stop it once a given number of tasks have failed, similar to the maximum number of attempts in a task.

Concurrent tasks skipped when a task fails

Steps to reproduce

Launch a job with eg 4 concurrent tasks, make one of the tasks fail

Expected behavior

The other tasks are correctly run and reported as having been run

Actual behavior

The tasks of the same concurrency bucket that should run after the failure are skipped. This is because Promise.all() rejects with the reason of the first promise that rejects.
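
A possible fix, sketched with a hypothetical runBucket function: each task promise is converted into a report that never rejects, so Promise.all() waits for every task in the bucket. On Node.js 12.9 and later, Promise.allSettled() provides the same behavior natively.

```javascript
// Hypothetical sketch: run a bucket of concurrent tasks so that a failure in one
// task does not hide the outcome of the others.
function runBucket (tasks) {
  return Promise.all(tasks.map(task =>
    // Each promise is made non-rejecting by converting its outcome into a report
    Promise.resolve()
      .then(() => task())
      .then(
        value => ({ status: 'fulfilled', value }),
        reason => ({ status: 'rejected', reason })
      )
  ))
}
```

The caller can then count the rejected reports to decide whether the job as a whole has failed, instead of losing the results of the successful tasks.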

System configuration

Module versions : 0.7

NodeJS version: 8

Operating System: Windows

Improve transformJson

We should at least add:

data filtering > https://github.com/crcn/sift.js
units mapping > http://mathjs.org/docs/datatypes/units.html
omit/pick properties > https://lodash.com

Added numerical weather prediction models management

In order to be able to download data for a specific run, time range and set of variables.

We can probably reuse part of https://github.com/weacast/weacast.

Refactor using only hooks

Today we have a mix of hooks, stores and task types that is not so easy to understand. Not sure if it is practically possible, but we could probably unify everything behind hooks, also to provide something more easily extensible, eg:

hooks to create stores or other storage objects like DB connections
symmetric hooks to destroy stores or close DB connections
hooks to initiate read streams (httpRequest, etc.) in replacement of specialized task types

We don't need the stores service anymore, hooks will allocate required objects on the fly and can pass it from job to tasks using taskTemplate.

The tasks service will only instantiate "empty" tasks that will trigger the pipeline execution.

Integrate better job sequencers

Kue will add support for failover and concurrency.

worker-farm might also be used as it is simpler and does not require a side tool like redis.

agenda also looks great and will allow job scheduling.

MongoDB job never completes

Steps to reproduce

Launch the csv2db sample with MongoDB job file.

Expected behavior

The app ends when job is completed.

Actual behavior

The app is stuck after completion.

System configuration

Module version: 0.6

NodeJS version: 8

Operating System: Windows

Support CRON jobs

For now we can set up a run interval, but supporting a CRON pattern would be better, e.g. using https://github.com/kelektiv/node-cron.

Creating multiple single indices does not work

Steps to reproduce

Using a specification like index: [{ gid: 1 }, { geometry: '2dsphere' }] raises the following error: The field 'geometry' is not valid for an index specification.

Expected behavior

We should be able to create multiple single indices.

Actual behavior

We can only create compound indices using the supported syntax.
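
A sketch of how an array specification could be expanded into one index per entry. createSingleIndexes is a hypothetical helper; it only assumes a collection object exposing createIndex(keys), as the MongoDB Node.js driver does.

```javascript
// Hypothetical helper: accept a single index specification or an array of them,
// and create one index per specification instead of one compound index.
async function createSingleIndexes (collection, index) {
  const specifications = Array.isArray(index) ? index : [index]
  const results = []
  for (const keys of specifications) {
    // One call per entry, e.g. { gid: 1 } then { geometry: '2dsphere' }
    results.push(await collection.createIndex(keys))
  }
  return results
}
```

A non-array specification such as { gid: 1, name: 1 } still goes through a single call, so the existing compound index syntax would be preserved.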

System configuration

Module versions : 0.7

Steps to reproduce

Create a job with 2 tasks, each task running eg the writeJson hook.

Expected behavior

Each task generates a different file in S3 according to its ID.

Actual behavior

The second task overwrites the file content of the first task, no other file being created.
