krawler's People

Contributors

claustres, cnouguier, daffl, dependabot[bot], fossabot, robinbourianes-kalisio, snyk-bot

krawler's Issues

Tag which hooks to be run in case of errors

When an error occurs and the hook chain is stopped, it might be useful to have certain hooks run anyway, more specifically cleanup hooks, to avoid polluting stores with intermediate files.

We could add an onError: true flag in the options of each hook that should be applied even in case of errors.
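
A minimal sketch of how such a flag could behave, assuming a simplified synchronous hook chain. The runChain function and the hook description objects are hypothetical and only illustrate the idea, they are not part of krawler's API:

```javascript
// Hypothetical sketch: run a chain of hooks and, once a hook has failed,
// only run the remaining hooks flagged with `onError: true`.
function runChain (hooks, context) {
  let error = null
  for (const { name, fn, options = {} } of hooks) {
    // After a failure, skip everything except hooks tagged for error handling
    if (error && !options.onError) continue
    try {
      fn(context)
      context.ran.push(name)
    } catch (err) {
      // Keep the first error so that it can be rethrown after cleanup
      if (!error) error = err
    }
  }
  if (error) throw error
  return context
}

const context = { ran: [] }
const hooks = [
  { name: 'download', fn: () => {} },
  { name: 'convert', fn: () => { throw new Error('conversion failed') } },
  { name: 'upload', fn: () => {} },
  { name: 'cleanup', fn: () => {}, options: { onError: true } }
]

try {
  runChain(hooks, context)
} catch (err) {
  console.log(err.message, context.ran) // conversion failed [ 'download', 'cleanup' ]
}
```

The upload hook is skipped after the failure while cleanup still runs, and the original error is reported to the caller once the chain is done.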

CLI mode broken since v0.5

Steps to reproduce

Launch the Vigicrue sample

Expected behavior

The sample should run

Actual behavior

(node:15636) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'create' of undefined at runJobWithOptions (D:\PROJECTS\KRAWLER\krawler\lib\cli.js:164:31)

System configuration

Module versions (especially the part that's not working): 0.5

NodeJS version: 8

Operating System: windows

Mongo client not correctly cleared

Steps to reproduce

Build a job forwarding a mongo connection to tasks:

hooks: {
  jobs: {
    before: {
      connectMongo: {
        url: xxx,
        clientPath: 'taskTemplate.client'
      }
    },
    after: {
      disconnectMongo: {
        clientPath: 'taskTemplate.client'
      }
    }
  }
}

Actual behavior

When the job is finished the client is correctly disconnected, but the property is still available in the task template, causing the tasks to access an invalid connection object if the job is launched again by a CRON.

Expected behavior

The task template should be correctly cleared
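
A possible fix is to remove the property once the client is closed. The sketch below assumes a dotted clientPath as in the job above; unsetPath and disconnectAndClear are hypothetical helpers, and the client is a stand-in object rather than a real MongoDB client:

```javascript
// Hypothetical helper: remove the property found at a dotted path
function unsetPath (object, path) {
  const keys = path.split('.')
  const last = keys.pop()
  let target = object
  for (const key of keys) {
    if (target === null || typeof target !== 'object') return false
    target = target[key]
  }
  if (target === null || typeof target !== 'object') return false
  return delete target[last]
}

// Hypothetical disconnect step: close the client, then drop the reference
// so that a later run can not pick up the closed connection
function disconnectAndClear (job, clientPath) {
  const client = clientPath.split('.')
    .reduce((value, key) => (value == null ? value : value[key]), job)
  if (client) client.close()
  unsetPath(job, clientPath)
}

// Stand-in for a job holding a connected client in its task template
const job = {
  taskTemplate: {
    client: { closed: false, close () { this.closed = true } }
  }
}
disconnectAndClear(job, 'taskTemplate.client')
console.log('client' in job.taskTemplate) // false
```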

System configuration

Module versions (especially the part that's not working):

NodeJS version: 8

Operating System: Windows

Unauthorize API requests in non-API mode

Steps to reproduce

Because by default we need to register the services on the Feathers app in order to reach them internally, they are always accessible from the outside, even when API mode is off.

Access should be restricted using a disallow hook or a middleware.
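
As an illustration, a before hook can reject external calls by checking params.provider, which Feathers sets for calls coming through a transport (eg 'rest') and leaves undefined for internal calls. The disallowExternal function below is a hand-written sketch of that idea; feathers-hooks-common offers a similar disallow('external') hook. Where the hook gets registered, so that the healthcheck stays reachable, is left out here:

```javascript
// Sketch of a before hook rejecting any call made through an external
// transport while letting internal calls through
function disallowExternal (hook) {
  const provider = hook.params && hook.params.provider
  if (provider) {
    const error = new Error(`Provider '${provider}' can not call '${hook.method}'`)
    error.code = 405
    throw error
  }
  return hook
}

// Internal call: no provider, the hook object is passed through
console.log(disallowExternal({ method: 'create', params: {} }).method) // create

// External call: rejected
try {
  disallowExternal({ method: 'create', params: { provider: 'rest' } })
} catch (error) {
  console.log(error.code) // 405
}
```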

Expected behavior

API calls are unauthorized (except healthcheck)

Actual behavior

API calls are authorized

System configuration

NodeJS version: 8

Operating System: Windows

Add a filter to select which tasks are processed by a given hook

Actual behavior

Currently the same pipeline is applied to all tasks of a job. If you need a slightly different pipeline for some of your tasks you need to set up a new job for them.

This leads to a problem when you need to merge the results of all your tasks: this has to be done manually and cannot be done using a job after hook.

Expected behavior

If we could select which hooks are skipped/run according to task properties through a filter, we could set up a single dynamic pipeline to manage all tasks.
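
One way to picture such a filter is a wrapper that only applies a hook when the task properties match. The sketch below is hypothetical (onlyFor, the match syntax and the sample hook are assumptions) and supposes the task is available in hook.data, as in a before hook:

```javascript
// Hypothetical sketch: wrap a hook function so that it is only applied to
// tasks whose properties equal the ones given in `match`
function onlyFor (match, hookFn) {
  return (hook) => {
    const task = hook.data || {}
    const selected = Object.keys(match).every(key => task[key] === match[key])
    return selected ? hookFn(hook) : hook
  }
}

// Example hook marking the task as converted
const convert = (hook) => {
  hook.data.converted = true
  return hook
}
const convertCsvOnly = onlyFor({ format: 'csv' }, convert)

const csvTask = convertCsvOnly({ data: { id: 'a', format: 'csv' } })
const jsonTask = convertCsvOnly({ data: { id: 'b', format: 'json' } })
console.log(csvTask.data.converted, jsonTask.data.converted) // true undefined
```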

Enhance docker hooks

When using Docker, it would be great to have hooks such as connectDocker (taking the connection parameters) and disconnectDocker, in the same way as was done for Postgres and Mongo.

krawler command not available

Steps to reproduce

npm install -g krawler
krawler > Command not found

Expected behavior

The command should be globally available after install

System configuration

Module versions (especially the part that's not working): 0.5.1

NodeJS version: 8.9

Operating System: Windows

Make hooks more generic

A lot of hooks can only be used as before OR after hooks while they could be used in both cases. This is mostly due to the fact that we directly access data or result on the hook object instead of using the generic Feathers utility functions getItems()/replaceItems().

We recently introduced callOnHookItems() to apply a given function on hook items to support such use cases. It should be used whenever possible.

Moreover, hooks reading/writing data like readJson/writeJson read/write data from/to a default path like result.data, which does not take the type of the hook into account. The data path should adapt itself to be easier to use, eg data.data in a before hook and result.data in an after hook.
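
The adaptive default path could look like the following sketch, where getDataPath is a hypothetical helper relying only on the type property of the hook object:

```javascript
// Hypothetical helper: build the default data path from the hook type,
// ie items live in `data` for a before hook and in `result` for an after hook
function getDataPath (hook, subPath = 'data') {
  const root = hook.type === 'before' ? 'data' : 'result'
  return `${root}.${subPath}`
}

console.log(getDataPath({ type: 'before' })) // data.data
console.log(getDataPath({ type: 'after' })) // result.data
```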

Generate web app from job description

For now we can only launch a krawler job using the CLI (once or via cron). To use it as a web app we need to customize it, ie create another node project having krawler as a dependency.

It should be possible to also directly create a web app from the CLI.

Enhance MongoDB vocabulary

For now we can create/delete collections and read/insert documents; some use cases also require deleting/updating documents or using GridFS.

Integrate a logger like winston

There are a lot of debug logs but few production logs. It would be great to have more production logs, specifically reporting errors on tolerant jobs, with a configurable log level. Moreover, we could allow creating logs using a hook for job-specific messages.

Split features using hooks plugins

Krawler already allows plugging in new hooks dynamically by code. Thus, it would be easy to create plugins to make it more modular and remove some (most?) hooks from the core.

However, in order to keep it working without any specific code (i.e. with a simple job file) we need to 'load' the plugins required by the job on-the-fly and register their hooks.

For instance node-gdal is a big dependency that would be better integrated through a plugin.

Job hangs after mongo connection failure

Steps to reproduce

Run a non fault-tolerant job with a connectMongo hook without the DB available.

Expected behavior

The job stops with error

Actual behavior

The job hangs with the following message: Cannot read property 'hasOwnProperty' of undefined

System configuration

NodeJS version: 8

Operating System: windows

Possible memory leak

Steps to reproduce

Create a job with a transform JSON/clear data hooks like this: transformJson: { dataPath: 'result', pick: ['data'] }, clearData: {}

Actual behavior

Operations like pick actually allocate a new object (eg https://github.com/kalisio/krawler/blob/master/src/utils.js#L96), so the object referenced by the hook result changes.

Because some references to the initial job tasks do exist, eg https://github.com/kalisio/krawler/blob/master/src/services/jobs.js#L44 and https://github.com/kalisio/krawler/blob/master/src/jobs/jobs.async.js#L14, changing the referenced object causes the old data to be kept in memory until the job is finished, even when using the clear data hook.

Expected behavior

References to the initial task objects should not be changed by transformation hooks.
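
To illustrate the expected behavior, a pick operation can mutate the target object instead of allocating a new one, so that every existing reference keeps pointing to the transformed data. pickInPlace is a hypothetical helper, not the function used in utils.js:

```javascript
// Hypothetical sketch: keep only the given keys by deleting the others on
// the object itself, so that existing references stay valid
function pickInPlace (object, keys) {
  for (const key of Object.keys(object)) {
    if (!keys.includes(key)) delete object[key]
  }
  return object
}

const task = { id: 'task', data: [1, 2, 3], buffer: 'large intermediate content' }
const reference = task // eg a reference kept by the jobs service
const result = pickInPlace(task, ['data'])
console.log(result === reference, Object.keys(reference)) // true [ 'data' ]
```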

System configuration

Module versions: 0.7

NodeJS version: 8

Operating System: Windows

Stop job after a given number of failed tasks

For now a fault-tolerant job keeps running even if all of its tasks fail. It might be interesting to stop it once a given number of tasks have failed, similar to the maximum number of attempts in a task.
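
A minimal sketch of the idea, assuming tasks are run one after another and represented as functions returning promises. runTasks and the maxFailures parameter are hypothetical names:

```javascript
// Hypothetical sketch: run tasks in a fault-tolerant way, ie failures are
// counted instead of stopping the job, until `maxFailures` is reached
async function runTasks (tasks, maxFailures) {
  const results = []
  let failures = 0
  for (const task of tasks) {
    try {
      results.push(await task())
    } catch (error) {
      failures++
      if (failures >= maxFailures) {
        throw new Error(`Job stopped after ${failures} failed tasks`)
      }
    }
  }
  return results
}

const tasks = [
  async () => 'a',
  async () => { throw new Error('failed 1') },
  async () => 'b',
  async () => { throw new Error('failed 2') },
  async () => 'c'
]

runTasks(tasks, 2).catch(error => console.log(error.message))
// Job stopped after 2 failed tasks
```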

Concurrent tasks skipped when a task fails

Steps to reproduce

Launch a job with eg 4 concurrent tasks and make one of the tasks fail.

Expected behavior

The other tasks are correctly run and reported as such.

Actual behavior

The tasks of the same concurrency bucket that should run after the failure are skipped. This is because the Promise.all() call used to run the bucket rejects with the reason of the first promise that rejects.
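
A common workaround is to make each promise always resolve with an object describing its outcome, so that Promise.all() never rejects early. The settleAll helper below is an illustrative sketch and not krawler's code; recent Node.js versions provide Promise.allSettled() for the same purpose:

```javascript
// Sketch: wait for every promise and report each outcome instead of
// rejecting as soon as one promise fails
function settleAll (promises) {
  return Promise.all(promises.map(promise =>
    Promise.resolve(promise).then(
      value => ({ status: 'fulfilled', value }),
      reason => ({ status: 'rejected', reason })
    )
  ))
}

const bucket = [
  Promise.resolve('task 1 done'),
  Promise.reject(new Error('task 2 failed')),
  Promise.resolve('task 3 done')
]

settleAll(bucket).then(results => {
  console.log(results.map(result => result.status))
  // [ 'fulfilled', 'rejected', 'fulfilled' ]
})
```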

System configuration

Module versions : 0.7

NodeJS version: 8

Operating System: Windows

Refactor using only hooks

Today we have a mix between hooks, stores and task types not so easy to understand. Not sure if practically possible but we could probably unify everything behind hooks, also to provide something more easily extensible, eg:

  • hooks to create stores or other storage objects like DB connections
  • symmetric hooks to destroy stores or close DB connections
  • hooks to initiate read streams (httpRequest, etc.) in replacement of specialized task types

We don't need the stores service anymore: hooks will allocate the required objects on the fly and can pass them from job to tasks using taskTemplate.

The tasks service will only instantiate "empty" tasks that will trigger the pipeline execution.

MongoDB job never completes

Steps to reproduce

Launch the csv2db sample with MongoDB job file.

Expected behavior

The app ends when job is completed.

Actual behavior

The app is stuck after completion.

System configuration

Module version: 0.6

NodeJS version: 8

Operating System: Windows

Creating multiple single indices does not work

Steps to reproduce

Using a specification like index: [{ gid: 1 }, { geometry: '2dsphere' }] raises the following error: The field 'geometry' is not valid for an index specification.

Expected behavior

We should be able to create multiple single indices.
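
For instance, the hook could accept either a single specification or an array and create each index separately. In the sketch below createIndices is a hypothetical helper, and the collection is a stand-in exposing only a createIndex() method like the one of the MongoDB driver:

```javascript
// Sketch: normalize the index option to an array and create each index
// with its own createIndex() call
async function createIndices (collection, index) {
  const specs = Array.isArray(index) ? index : [index]
  for (const spec of specs) {
    await collection.createIndex(spec)
  }
  return specs.length
}

// Stand-in for a MongoDB collection, only recording the requested indices
const created = []
const fakeCollection = {
  createIndex: async (spec) => { created.push(spec) }
}

createIndices(fakeCollection, [{ gid: 1 }, { geometry: '2dsphere' }])
  .then(count => console.log(count, created))
```

A compound index would still be expressed as a single object with several fields, eg { gid: 1, name: 1 }.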

Actual behavior

We can only create compound indices using the supported syntax.

System configuration

Module versions : 0.7

HTTP error codes not correctly handled

When a server returns an HTTP error code to a download task, the output file is still written and contains the error message instead of the actual data.

The expected behavior is to raise an error and stop the job.
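
A sketch of the kind of check that could run before the response is written to the store. checkStatus is a hypothetical function; it relies on the statusCode and statusMessage properties of a Node.js HTTP response and assumes that redirects have already been followed, so any non-2xx status is treated as an error:

```javascript
// Hypothetical check to apply on the response before piping it to the
// output file: throw instead of storing the error page
function checkStatus (response) {
  const { statusCode, statusMessage } = response
  if (statusCode < 200 || statusCode >= 300) {
    const message = `Download failed with HTTP status ${statusCode} ${statusMessage || ''}`
    const error = new Error(message.trim())
    error.statusCode = statusCode
    throw error
  }
  return response
}

try {
  checkStatus({ statusCode: 404, statusMessage: 'Not Found' })
} catch (error) {
  console.log(error.message) // Download failed with HTTP status 404 Not Found
}
```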

Add an in-memory store

Currently most hooks require data to be read from the file system (eg writeCSV, readXML). Adding an in-memory store would allow reading from and writing to any store without requiring intermediate files.

CRON jobs are launched immediately

When running a CRON job, the job is launched a first time as soon as the krawler is run. This is usually not what is expected, eg if I want a CRON job to run once a day it will run twice on the first day.

It may be interesting to have an option to force a first run when required.

Fault-tolerant jobs/tasks

When a job combines multiple independent tasks (eg requesting different web services) we should be able to specify which ones can be fault-tolerant, i.e. any error is caught and the job continues anyway.

The faulty task should probably be removed from the results, however, and the hooks following the faulty one should be skipped.

Subsequent tasks overwrite first produced file when writing to S3

Steps to reproduce

Create a job with 2 tasks, each task running eg the writeJson hook.

Expected behavior

Each task generates a different file in S3 according to its ID.

Actual behavior

The second task overwrites the file content of the first task, no other file being created.
