
simpleimport's People

Contributors

cbleek, fedys, kilip, tisie


Forkers

mbo-s

simpleimport's Issues

Add Listener for EntityEraser events.

In yawik/jobs, an EntityLoader for the purge CLI command is implemented, which loads all jobs that have been expired for a certain number of days.
The purge command triggers an event to fetch entities dependent on a job entity.

SimpleImport needs a listener for these events to report and remove the crawler items associated with a job.


An example dependency loader can be found in the yawik/Applications:
LoadDependendEntities

which is registered in config:

'Core/EntityEraser/Dependencies/Events' => [
    'listeners' => [
        Listener\LoadDependendEntities::class => [
            'events' => [
                \Core\Service\EntityEraser\DependencyResultEvent::CHECK_DEPENDENCIES => '__invoke',
                \Core\Service\EntityEraser\DependencyResultEvent::DELETE             => 'onDelete',
            ],
        ],
    ],
]

The purge command to delete expired jobs can be triggered with:

bin/console purge expired-jobs
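Following the Applications example above, the registration in SimpleImport's own module.config.php could look like this (a sketch; `Listener\LoadCrawlerItemDependencies` is a hypothetical class name, not existing code):

```php
// Hypothetical registration in SimpleImport's module.config.php,
// mirroring the yawik/Applications example above.
'Core/EntityEraser/Dependencies/Events' => [
    'listeners' => [
        Listener\LoadCrawlerItemDependencies::class => [
            'events' => [
                \Core\Service\EntityEraser\DependencyResultEvent::CHECK_DEPENDENCIES => '__invoke',
                \Core\Service\EntityEraser\DependencyResultEvent::DELETE             => 'onDelete',
            ],
        ],
    ],
],
```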

TemplateValues cannot be unset/deleted

If a job was imported with templateValues and is updated later with some or all templateValues missing or set to an empty string, the existing templateValues remain unchanged.
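A minimal sketch of the intended behaviour: treat the incoming templateValues as the complete new state and drop empty values, so the hydrator can replace the stored templateValues instead of merging over them (plain PHP; the real hydrator works on entity objects, so this is illustrative):

```php
<?php
// Sketch: keep only non-empty incoming templateValues. Keys that are
// missing or empty in the update thereby delete the stored value when
// the result replaces the job's templateValues wholesale.
function filterTemplateValues(array $incoming): array
{
    return array_filter(
        $incoming,
        static fn ($v) => $v !== null && trim((string) $v) !== ''
    );
}

var_dump(filterTemplateValues([
    'tasks'        => '',              // should be removed on update
    'requirements' => '<p>New</p>',    // should be kept
]));
```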

Upgrade Geocoder library and add result cache

The geocoder library used by this module has released a new major version that changes its usage in a way that requires some adaptation.
It renames and splits the packages into the "geocoder-php/*" namespace, which has the most impact on this module.

The following actions need to be done:

  • Change dependencies in composer.json
    • "willdurand/geocoder" -> "geocoder-php/google-maps-provider"
    • New http-client compatible with "php-http/client-implementation" (e.g. "php-http/curl-client")
  • Adapt the code which uses the geocoder
  • Integrate the cache provider (geocoder-php/cache-provider)
    • Opt-In: configurable via module options
  • Tests

We created the branch upgrade-geocoder for developing this enhancement.
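The composer.json change from the first bullet might look like this (the version constraints are illustrative assumptions, not tested values):

```json
{
    "require": {
        "geocoder-php/google-maps-provider": "^4.0",
        "geocoder-php/cache-provider": "^4.0",
        "php-http/curl-client": "^2.0"
    }
}
```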

Setting up a development environment

cbleek@php7:~$ git clone https://github.com/yawik/SimpleImport.git
cbleek@php7:~$ cd SimpleImport/
cbleek@php7:~/SimpleImport$ git checkout upgrade-geocoder
cbleek@php7:~/SimpleImport$ composer install

The feature branch contains a YAWIK_TEST database which can be installed via:

cbleek@php7:~/SimpleImport$ composer db.init

The YAWIK_TEST database contains a user with an organization and a simpleimport.crawler. Executing
vendor/bin/yawik simpleimport import leads to an exception because of the geocoder-php changes.

cbleek@php7:~/SimpleImport$ vendor/bin/yawik simpleimport import
======================================================================
   The application has thrown an exception!
======================================================================
 Error
 Class 'Geocoder\Provider\GoogleMaps' not found
---------------------------------------------------------------------

Initial state of imported jobs

The initial status of imported jobs by a crawler must be configurable.

Currently they are imported as "Created", meaning the administrator has to approve every single job...

For some crawlers this might not be the desired behaviour, so it must be possible to set the initial state per crawler.
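A sketch of how the initial state could be resolved, with a module-wide default and a per-crawler override (the option names and the set of allowed states are assumptions):

```php
<?php
// Sketch: resolve the initial job status, preferring a per-crawler
// setting over a module-wide default. Option names are hypothetical.
function resolveInitialState(?string $crawlerOption, string $moduleDefault = 'created'): string
{
    $allowed = ['created', 'active'];
    $state   = $crawlerOption ?? $moduleDefault;

    // Fall back to the default for unknown values.
    return in_array($state, $allowed, true) ? $state : $moduleDefault;
}

var_dump(resolveInitialState(null));      // module default: "created"
var_dump(resolveInitialState('active'));  // per-crawler override: "active"
```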

Some values are not sanitized and lead to arbitrary code injection

Similar to cross-solution/YAWIK#543

The values from the JSON are imported without being sanitized. Affected fields are at least company and location; there may be other fields that produce the same error, e.g. classification fields.

Sample file

{
    "jobs": [
        {
            "id": 1,
            "title": "<h1>Title</h1><script>alert('Title-XSS');</script>",
            "location": "<h1>Location</h1><script>alert('Location-XSS');</script>",
            "link": "http://www.example.com/job/1"
        }
    ]
}
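A minimal sanitization sketch using strip_tags (the real fix would likely reuse the same filters as in cross-solution/YAWIK#543; this only illustrates the idea):

```php
<?php
// Sketch: strip markup from plain-text fields before hydrating the entity.
function sanitizeField(string $value): string
{
    // Remove <script>/<style> blocks first: strip_tags() would keep
    // their text content, which is not wanted here.
    $value = preg_replace('~<(script|style)\b[^>]*>.*?</\1>~is', '', $value);
    return trim(strip_tags($value));
}

$location = "<h1>Location</h1><script>alert('Location-XSS');</script>";
echo sanitizeField($location), "\n"; // Location
```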

Fetching HTML source if no description is given

This was discussed in yawik/Solr#12.

Adding a job posting without a full text to Solr makes no sense. So what can we do if a job is activated and no HTML is given?

My idea is to simply fetch the HTML. If fetching fails, the job is not inserted.

Can we/should we do this in the Solr module?

Keep in mind that all time-consuming tasks like fetching pages, inserting into Solr, and sending mails should be moved into some queuing system.

Failed plainTextFetch invalidates complete offer

If the templateValues are not set, the import tries to fetch the remote page as plain text. If this remote fetch fails, the job gets the status invalid and is not imported.

There can be many reasons for such a failure, e.g.

2019-01-17T17:06:03+01:00 ERR (3): Cannot fetch HTML digest for a job, import ID: "32", link: "example.com/3", reason: "Unable to fetch remote data, reason: "Read timed out after 5 seconds""
2019-01-17T17:10:08+01:00 ERR (3): Cannot fetch HTML digest for a job, import ID: "9", link: "example.com/1", reason: "Invalid HTTP status: "404""
2019-01-18T13:17:27+01:00 ERR (3): Cannot fetch HTML digest for a job, import ID: "7763de2bd66926f8fc8b49d384628896", link: "example.com/2", reason: "Unable to fetch remote data, reason: "Unable to enable crypto on TCP connection example.com: make sure the "sslcafile" or "sslcapath" option are properly set for the environment.""

None of these jobs were imported, and for at least two of them the error is not on the remote site's side.

Since the plain text is not critical, the job should always be imported, regardless of a plain-text exception.
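A sketch of the proposed behaviour: treat the plain-text fetch as best-effort and import the job either way. The fetcher is passed in as a callable here purely for illustration:

```php
<?php
// Sketch: a failed plain-text fetch must not invalidate the job.
// $fetch is any callable that returns the plain text or throws.
function fetchPlainTextOrNull(callable $fetch, string $link): ?string
{
    try {
        return $fetch($link);
    } catch (\Throwable $e) {
        // Log and continue; the job is imported without a plain text.
        error_log(sprintf('Cannot fetch HTML digest for "%s": %s', $link, $e->getMessage()));
        return null;
    }
}

$failing = static function (string $link): string {
    throw new \RuntimeException('Read timed out after 5 seconds');
};
var_dump(fetchPlainTextOrNull($failing, 'example.com/3')); // NULL, job still imported
```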

Enhance `check-classifications` action

It is currently only possible to add missing classifications to a job.

But sometimes it is also necessary to

  • replace a classification or
  • delete a classification

Scrape details like salary or workload from the title

Sometimes a job title contains information about the salary.

E.g.:

  • Helfer in der Metallbearbeitung (m/w/d) - Mindestlohn 10,00 EUR
  • Anlagenmechaniker Sanitär-,Heizungs- und Klimatechnik (m/w/d) 14 EUR /Std.
  • Fachkräfte für Lagerlogistik (m/w/d) ab 17,00 EUR/Std
  • Metallbearbeiter (m/w/d) 14 EUR/Std
  • Industriemechaniker (m/w/d) ab 21,00 EUR/h

If SimpleImport finds such information, it should be stored in the salary field.

regex salary

(\D|^)([\d]{2})([,.](\d\d|-))?\s?(EUR|Euro|€)\s*\/?\s*(STD|Std|h|Stunde)?

regex workload

(([\d]{1,2}(,[\d\d]{1,2})?)\s*-?\s*([\d]{1,2})?)\s*(STD|Std|h|Stunden?).?\s*(/|pro)\s*(Wo|Woche)

  • Werkstudent oder Projektunterstützung im Bereich Finanzen/Controlling/SAP S/4 HANA Cloud (m/w/d) – mit ca. 10-20 Stunden/Woche
  • Staplerfahrer (m/w/d) in Voll- und Teilzeit (40-20 Std/Woche)
  • Software-Entwickler (m/w/d) Web- and Sharepoint Applications (20 Std./ Woche)
  • Dozent (m/w/d) Englisch (2h/Woche)
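The two patterns can be exercised with a small harness (the regexes are taken verbatim from above; how the result maps onto the job's salary/workload fields is left open):

```php
<?php
// Sketch: apply the salary and workload regexes proposed above to job titles.
function extractSalary(string $title): ?string
{
    $pattern = '~(\D|^)(\d{2})([,.](\d\d|-))?\s?(EUR|Euro|€)\s*/?\s*(STD|Std|h|Stunde)?~u';
    if (!preg_match($pattern, $title, $m)) {
        return null;
    }
    // Group 2 = euros, group 4 = optional cents ("-" means none given).
    $cents = isset($m[4]) && $m[4] !== '' && $m[4] !== '-' ? '.' . $m[4] : '';
    return $m[2] . $cents;
}

function extractWeeklyHours(string $title): ?string
{
    $pattern = '~(([\d]{1,2}(,[\d\d]{1,2})?)\s*-?\s*([\d]{1,2})?)\s*(STD|Std|h|Stunden?).?\s*(/|pro)\s*(Wo|Woche)~u';
    return preg_match($pattern, $title, $m) ? trim($m[1]) : null;
}

echo extractSalary('Metallbearbeiter (m/w/d) 14 EUR/Std'), "\n";                            // 14
echo extractSalary('Helfer in der Metallbearbeitung (m/w/d) - Mindestlohn 10,00 EUR'), "\n"; // 10.00
echo extractWeeklyHours('Dozent (m/w/d) Englisch (2h/Woche)'), "\n";                         // 2
```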

Mapping job position categories

Job position categories, such as employment type, position, and industry, vary widely.

In order to unify the categories in YAWIK, there needs to be a mapping mechanism that allows mapping arbitrary values to categories known to the particular YAWIK instance.

So we need to implement a filter mechanism that maps these values according to user configuration.
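A sketch of such a filter: a per-instance configuration maps incoming values onto known categories, with unknown values handled by a fallback. The concrete mapping entries below are made up:

```php
<?php
// Sketch: map arbitrary incoming category values onto the categories
// known by this YAWIK instance. The mapping itself is user configuration.
function mapCategory(string $value, array $mapping, ?string $fallback = null): ?string
{
    return $mapping[$value] ?? $fallback;
}

// Hypothetical per-instance configuration for employment types.
$employmentTypes = [
    'Vollzeit'  => 'Full-Time',
    'Teilzeit'  => 'Part-Time',
    'full_time' => 'Full-Time',
];

var_dump(mapCategory('Vollzeit', $employmentTypes));         // "Full-Time"
var_dump(mapCategory('unknown', $employmentTypes, 'Other')); // fallback: "Other"
```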

Make run time period configurable

After adding a crawler and then starting the import process, the new crawler does not run.

  1. A newly added crawler must immediately be runnable.
  2. The time period, in which the crawler is considered "already run" must be configurable.
    (for each crawler preferably, or at least globally)

Commit a659f99053181f3c2a8e3dbc37b8d8da0c5a779c breaks tests

cbleek@php7-cb:~/AtomProjects/SimpleImport$ git checkout a659f99053181f3c2a8e3dbc37b8d8da0c5a779c
Note: checking out 'a659f99053181f3c2a8e3dbc37b8d8da0c5a779c'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at a659f99 Optional shuffle the Publish date on import
cbleek@php7-cb:~/AtomProjects/SimpleImport$ ./vendor/bin/phpunit 
PHPUnit 8.5.2 by Sebastian Bergmann and contributors.

...............................................................  63 / 181 ( 34%)
.............................F................................. 126 / 181 ( 69%)
..EE...................................................         181 / 181 (100%)

Time: 1.12 seconds, Memory: 28.00 MB

There were 2 errors:

1) SimpleImportTest\Hydrator\JobHydratorTest::testHydrate
ArgumentCountError: Too few arguments to function SimpleImport\Hydrator\JobHydrator::__construct(), 2 passed in /home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php on line 59 and exactly 3 expected

/home/cbleek/AtomProjects/SimpleImport/src/Hydrator/JobHydrator.php:43
/home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php:59

2) SimpleImportTest\Hydrator\JobHydratorTest::testHydrateInvalidObjectPassed
ArgumentCountError: Too few arguments to function SimpleImport\Hydrator\JobHydrator::__construct(), 2 passed in /home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php on line 59 and exactly 3 expected

/home/cbleek/AtomProjects/SimpleImport/src/Hydrator/JobHydrator.php:43
/home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php:59

--

There was 1 failure:

1) SimpleImportTest\Factory\CrawlerProcessor\JobProcessorFactoryTest::testInvoke
Psr\Container\ContainerInterface::get('FilterManager') was not expected to be called more than 3 times.

/home/cbleek/AtomProjects/SimpleImport/src/Factory/CrawlerProcessor/JobProcessorFactory.php:41
/home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Factory/CrawlerProcessor/JobProcessorFactoryTest.php:64

ERRORS!
Tests: 181, Assertions: 392, Errors: 2, Failures: 1.

Setting locations fails on updates.

if ($locations) {
    $job->getLocations()->fromArray($locations);
}

This works fine, if the job is a new entity, because getLocations() returns an ArrayCollection. But it fails on persisted entities, because it returns a Doctrine PersistentCollection then, which does not have the method fromArray.

Furthermore, even if this approach worked, it would append the locations, not replace them.

You probably want to simply set a new Collection of locations in any case:
$job->setLocations(new ArrayCollection($locations))

Use referenced documents for the crawler items.

Mongo documents have a maximum size.

Currently the crawler items (the metadata for a crawled job) are stored as embedded documents in the crawler entity. A crawler might import a lot of jobs, which could pose problems if we continue to embed the items.

So the items must be stored in a dedicated mongo collection and stored as referenced entities in the crawler entity.
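With Doctrine MongoDB ODM this amounts to replacing an EmbedMany mapping with a ReferenceMany mapping; a sketch (class and field names follow the issue text, not the actual entity code):

```php
// Sketch of the mapping change on the Crawler document (Doctrine MongoDB ODM).
// Before: items embedded in the crawler document.
//   @ODM\EmbedMany(targetDocument="SimpleImport\Entity\Item")
// After: items stored in their own collection and referenced by id.
//   @ODM\ReferenceMany(targetDocument="SimpleImport\Entity\Item", storeAs="id")
private $items;
```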

plaintext optimizations

Currently the following string gets stored in solr:

The encoding is wrong. Example: Sprüngli.

   ` "html":"Menu Shop Inspiration Spr&Atilde;&frac14;ngli Welt Standorte Kontakt Suche Suche Suche Reset My Spr&Atilde;&frac14;ngli My Spr&Atilde;&frac14;ngli Mein Konto Anmelden oder Registrieren Passwort Anmelden Passwort vergessen? Registrieren 0 0 Warenkorb ist leer Sie haben keine Artikel im Warenkorb. Weiter einkaufen DE Schlie&Atilde;&#159;en Sortiment Chlausgeschenke Weihnachtsgeschenke Luxemburgerli &Acirc;&reg; Pralin&Atilde;&copy;s &amp; Truffes Naschprodukte Tafelschokoladen Geschenkpakete &amp; Arrangements Cakes &amp; Torten P&Atilde;&cent;tisserie &amp; Dessert Sandwiches &amp; Geb&Atilde;&curren;ck Salate &amp; Bircherm&Atilde;&frac14;esli Ap&Atilde;&copy;ro Canap&Atilde;&copy; Glace Saisonal Osterhasen Ostereier Ostergeschenke Valentinstag Muttertagsgeschenke Aktuelles Saisonales Angebot Petits Plaisirs-Collection Shop entdecken Ap&Atilde;&copy;ro Business-Lunch Chlaus Cuba Firmengeschenke Geburtstag Geburt &amp; Taufe Hochzeit Luxemburgerli Muttertag Neuheiten Ostern Schokolade aus Bergheumilch Valentinstag Weihnachten Inspiration entdecken Treueprogramm Petits Plaisirs Aktuelles Brosch&Atilde;&frac14;ren Firmenkunden Geschichte Medien Partnerschaft mit Art Museums of Switzerland Handwerkskunst Individualisierung Kaffeekultur/Gastronomie Stellen bei Spr&Atilde;&frac14;ngli Spr&Atilde;&frac14;ngli-Klassiker Verantwortung Zutaten &amp; Herkunft Spr&Atilde;&frac14;ngli Welt entdecken Search My Spr&Atilde;&frac14;ngli Settings Shop Sortiment Chlausgeschenke Weihnachtsgeschenke Luxemburgerli &Acirc;&reg; Pralin&Atilde;&copy;s &amp; Truffes Naschprodukte Tafelschokoladen Geschenkpakete &amp; Arrangements Cakes &amp; Torten P&Atilde;&cent;tisserie &amp; Dessert Sandwiches &amp; Geb&Atilde;&curren;ck Salate &amp; Bircherm&Atilde;&frac14;esli Ap&Atilde;&copy;ro Canap&Atilde;&copy; Glace Saisonal Osterhasen Ostereier Ostergeschenke Valentinstag Muttertagsgeschenke Aktuelles Saisonales Angebot Petits Plaisirs-Collection Inspiration Chlaus Luxemburgerli 
Ap&Atilde;&copy;ro Business-Lunch Cuba Firmengeschenke Geburtstag Geburt &amp; Taufe Hochzeit Muttertag Neuheiten Ostern Schokolade aus Bergheumilch Valentinstag Weihnachten Spr&Atilde;&frac14;ngli Welt Treueprogramm Petits Plaisirs Aktuelles Brosch&Atilde;&frac14;ren Firmenkunden Geschichte Medien Partnerschaft mit Art Museums of Switzerland Handwerkskunst Individualisierung Kaffeekultur/Gastronomie Stellen bei Spr&Atilde;&frac14;ngli Spr&Atilde;&frac14;ngli-Klassiker Verantwortung Zutaten &amp; Herkunft Standorte Kontakt Home ... ... Spr&Atilde;&frac14;ngli WeltStellen bei Spr&Atilde;&frac14;ngliUnsere StellenangeboteDetailhandelsfachfrauShopInspirationSpr&Atilde;&frac14;ngli WeltStandorteKontaktDetailhandelsfachfrau80-100%, Z&Atilde;&frac14;rich Flughafen Arbeiten bei Spr&Atilde;&frac14;ngli Das 1836 gegr&Atilde;&frac14;ndete Schweizer Familienunternehmen z&Atilde;&curren;hlt heute mit seinem erlesenen Sortiment zu den renommiertesten Confiserien Europas. Die Produkte aus dem Hause Spr&Atilde;&frac14;ngli stehen f&Atilde;&frac14;r beste Qualit&Atilde;&curren;t, einmalige Frische und Nat&Atilde;&frac14;rlichkeit. Die vollendeten Kreationen bringen t&Atilde;&curren;glich Kundinnen und Kunden aus aller Welt ins Schw&Atilde;&curren;rmen. Zur Verst&Atilde;&curren;rkung des Teams unserer Filiale im Airside Center am Flughafen Z&Atilde;&frac14;rich suchen wir eine engagierte und begeisterte Detailhandelsfachfrau 80-100% mit Ausstrahlung f&Atilde;&frac14;r den Verkauf unserer liebevoll hergestellten K&Atilde;&para;stlichkeiten. 
Sie bringen mit:abgeschlossene Ausbildung als Detailhandelsfachfrau EFZ und/oder Verkaufserfahrung im Detailhandel (vorzugsweise Confiserie- oder Lebensmittelbranche)hohe Dienstleistungsbereitschaft und Freude an internationaler Kundschaftgepflegtes Erscheinungsbild und gute Umgangsformengute m&Atilde;&frac14;ndliche Englischkenntnisse, weitere Fremdsprachen von Vorteilhohe Flexibilit&Atilde;&curren;t bez&Atilde;&frac14;glich Arbeitszeiten (Schichteins&Atilde;&curren;tze zwischen 5.30 und 22.30 Uhr sowie 2 bis 3 Wochenend-Eins&Atilde;&curren;tze pro Monat)Wir bieten Ihnen eine abwechslungsreiche T&Atilde;&curren;tigkeit in einem gepflegten Umfeld, interessante Entwicklungsm&Atilde;&para;glichkeiten und attraktive Anstellungsbedingungen. Sind Sie interessiert? Dann freuen wir uns &Atilde;&frac14;ber Ihre elektronische Bewerbung [email protected]. Online Bewerben Einstellungen Deutsch / CHF GutscheinJetzt einl&Atilde;&para;sen NewsletterHier anmelden Haben Sie Fragen? Sie erreichen uns Mo. - Fr. von 8.00 Uhr - 12.00 Uhr und 13.00 Uhr - 17.00 Uhr unter +41 44 224 46 46. Kostenloser Versand schweizweit ab CHF 60.-Kauf auf Rechnung ab CHF 75.-Gratis GrusskarteInternationaler VersandSicheres Zahlen mit SSLAbholung in einer FilialeK&Atilde;&curren;ufer- und Datenschutz Spr&Atilde;&frac14;ngli OnlineshopShopInspirationSpr&Atilde;&frac14;ngli-WeltStandorteKontakt&Atilde;&#156;ber UnsAktuellesOffene StellenBrosch&Atilde;&frac14;renFirmenkundenOnlineshopLieferbedingungenZahlungsbedingungenR&Atilde;&frac14;ckgaberechtHilfe &amp; FAQMein Konto&Atilde;&#156;bersichtAdressbuchBestellungen Facebook Google+ &copy; 2017 Confiserie Spr&Atilde;&frac14;ngli AGBahnhofstrasse 21, 8001 Z&Atilde;&frac14;rich, Schweiz AGBAGB TreueprogrammImpressumKontaktAlle Preise inkl. MwSt. Schlie&Atilde;&#159;en Einstellungen Bitte w&Atilde;&curren;hlen Sie hier Ihre bevorzugte Sprache und W&Atilde;&curren;hrung. 
Deutsch Englisch Franz&Atilde;&para;sisch EUR CHF USD Speichern Newsletter Zum Newsletter anmelden HerrFrau Vorname Nachname email Speichern '; $(error_message).hide().appendTo($(this).closest('div')).fadeIn(1000); } } else { if($(this).closest('div').hasClass('has-error') &amp;&amp; !$(this).closest('div').hasClass('email-error')) { removeError($(this)); } if($(this).hasClass('validate-email')) { if(!emailValidation($(this).val())) { form_error = true; if(!$(this).closest('div').hasClass('email-error')) { $(this).addClass('validation-failed'); $(this).closest('div').addClass('has-error').addClass('email-error'); var error_message = 'Bitte geben Sie eine g&Atilde;&frac14;ltige E-Mail Adresse ein. Zum Beispiel [email protected].'; $(error_message).hide().appendTo($(this).closest('div')).fadeIn(1000); } } else { if($(this).closest('div').hasClass('email-error')) { removeError($(this)); } } } } }); function removeError(element) { $(element).removeClass('validation-failed'); $(element).closest('div').removeClass('has-error').removeClass('email-error'); $(element).closest('div').find('.validation-advice').fadeOut(1000).remove(); } function emailValidation(email) { var pattern = /^([a-z\\d!#$%&amp;'*+\\-\\/=?^_`{|}~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]+(\\.[a-z\\d!#$%&amp;'*+\\-\\/=?^_`{|}~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]+)*|\"((([ \\t]*\\r\\n)?[ \\t]+)?([\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f\\x21\\x23-\\x5b\\x5d-\\x7e\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0d-\\x7f\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]))*(([ \\t]*\\r\\n)?[ 
\\t]+)?\")@(([a-z\\d\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]|[a-z\\d\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF][a-z\\d\\-._~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]*[a-z\\d\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])\\.)+([a-z\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]|[a-z\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF][a-z\\d\\-._~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]*[a-z\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])\\.?$/i; return pattern.test(email); } if(form_error) { return false; } $('.icon-close-s-white:first').click(); var url = form.attr('action'); if ( !url ) return; $.ajax({ url : url, method : 'POST', data : form.serializeArray() }).done(function(data){ if (data.length){ var buttonSet = $('.ajaxpro-buttons-set:first'); if ( buttonSet.length ) { buttonSet.remove(); } var fakeMessage = '' + '' + 'Schlie&Atilde;&#159;en' + '' + '' + '' + '' + '' + '' + 'Sie wurden erfolgreich f&Atilde;&frac14;r den Newsletter angemeldet' + '' + '' + '' + ''; if ( $('#ajaxpro-notice-form-typo').length ) { $('#ajaxpro-notice-form-typo').remove(); } $(body).append($(fakeMessage).center()); } }); }); })(jQuery) } } //]]&gt;"}]`
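The entities above decode to `SprÃ¼ngli`, i.e. UTF-8 bytes that were read as Latin-1 and then HTML-encoded. As an illustration only (the proper fix is to decode the fetched page with the correct charset in the first place), such double-encoded strings can be repaired like this:

```php
<?php
// Sketch: undo the HTML entities, then undo the Latin-1 misinterpretation
// of the UTF-8 bytes. "Spr&Atilde;&frac14;ngli" should become "Sprüngli".
function repairDoubleEncoded(string $raw): string
{
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8'); // "SprÃ¼ngli"
    // Re-encode each code point as a single Latin-1 byte, which restores
    // the original UTF-8 byte sequence.
    return mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');
}

echo repairDoubleEncoded('Spr&Atilde;&frac14;ngli'), "\n"; // Sprüngli
```

Independent of the encoding, the stored string also contains leftover JavaScript; the plain-text extraction should drop script contents, not just tags.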

Description in TemplateValues does not get imported

Source looks like

{
      "templateValues": {
        "benefits": "<p>Content</p>",
        "requirements": "<p>Content</p>",
        "tasks": "<p>Content</p>",
        "description": "Companyname with Description"
      },
      "company": "Companyname",
      "location": "Germany",
      "title": "Jobtitle",
      "classifications": {
        "employmentTypes": "Vollzeit"
      },
      "link": "https://example.com/1351.html",
      "id": "1351"
    },

The job is stored without a description in the database; the other TemplateValues are stored correctly.

Upgrade Geocoder

Version 3 requires egeloen/http-adapter, which is abandoned.

Package egeloen/http-adapter is abandoned, you should avoid using it. Use php-http/httplug instead.

Exception, if no location is defined.

Hi @toni,

After the geocoder upgrade, the import behaves slightly differently: if a feed does not contain a location, an exception is thrown. I've added such a feed, "reifen", to the demo DB.

cbleek@php7-cb:~/SimpleImport$ composer db.init                  
> mongorestore --drop                                                                                                                                                               
...
cbleek@php7-cb:~/SimpleImport$ vendor/bin/yawik simpleimport info

moemax.................................. (5d0a2213403d4b050b219412)
reifen.................................. (5d19e95a403d4b0b8a06dea2)

Executing the "reifen" import leads to an exception. It should be possible to import a feed without a location.

cbleek@php7-cb:~/SimpleImport$ vendor/bin/yawik simpleimport import --name=reifen
The crawler with the name (ID) "reifen (5d19e95a403d4b0b8a06dea2)" has started its job:
                     [>                                                                                                                                      ]   0%             ======================================================================
   The application has thrown an exception!
======================================================================
 TypeError
 Argument 1 passed to Geocoder\Query\GeocodeQuery::create() must be of the type string, null given, called in /home/cbleek/SimpleImport/src/Job/GeocodeLocation.php on line 77
----------------------------------------------------------------------
/home/cbleek/SimpleImport/vendor/willdurand/geocoder/Query/GeocodeQuery.php:68
#0 /home/cbleek/SimpleImport/src/Job/GeocodeLocation.php(77): Geocoder\Query\GeocodeQuery::create(NULL)
#1 /home/cbleek/SimpleImport/src/Hydrator/JobHydrator.php(85): SimpleImport\Job\GeocodeLocation->getLocations(NULL)
#2 /home/cbleek/SimpleImport/src/CrawlerProcessor/JobProcessor.php(236): SimpleImport\Hydrator\JobHydrator->hydrate(Array, Object(Jobs\Entity\Job))
#3 /home/cbleek/SimpleImport/src/CrawlerProcessor/JobProcessor.php(117): SimpleImport\CrawlerProcessor\JobProcessor->syncChanges(Object(SimpleImport\Entity\Crawler), Object(SimpleImport\CrawlerProcessor\Result), Object(Zend\Log\Logger))
#4 /home/cbleek/SimpleImport/src/Controller/ConsoleController.php(132): SimpleImport\CrawlerProcessor\JobProcessor->execute(Object(SimpleImport\Entity\Crawler), Object(SimpleImport\CrawlerProcessor\Result), Object(Zend\Log\Logger))
#5 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/Controller/AbstractActionController.php(78): SimpleImport\Controller\ConsoleController->importAction()
#6 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(322): Zend\Mvc\Controller\AbstractActionController->onDispatch(Object(Zend\Mvc\MvcEvent))
#7 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(179): Zend\EventManager\EventManager->triggerListeners(Object(Zend\Mvc\MvcEvent), Object(Closure))
#8 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/Controller/AbstractController.php(106): Zend\EventManager\EventManager->triggerEventUntil(Object(Closure), Object(Zend\Mvc\MvcEvent))
#9 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc-console/src/Controller/AbstractConsoleController.php(56): Zend\Mvc\Controller\AbstractController->dispatch(Object(Zend\Console\Request), Object(Zend\Console\Response))
#10 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/DispatchListener.php(138): Zend\Mvc\Console\Controller\AbstractConsoleController->dispatch(Object(Zend\Console\Request), Object(Zend\Console\Response))
#11 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(322): Zend\Mvc\DispatchListener->onDispatch(Object(Zend\Mvc\MvcEvent))
#12 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(179): Zend\EventManager\EventManager->triggerListeners(Object(Zend\Mvc\MvcEvent), Object(Closure))
#13 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/Application.php(332): Zend\EventManager\EventManager->triggerEventUntil(Object(Closure), Object(Zend\Mvc\MvcEvent))
#14 /home/cbleek/SimpleImport/vendor/yawik/core/bin/yawik(27): Zend\Mvc\Application->run()
#15 {main}
======================================================================
   Previous Exception(s):
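A sketch of the guard: skip geocoding entirely when no location string is present, instead of passing null into GeocodeQuery::create(). The geocoder call is represented by a callable stand-in here:

```php
<?php
// Sketch: GeocodeLocation::getLocations() should return an empty list
// for a missing location instead of calling GeocodeQuery::create(null).
function getLocationsSafe(?string $address, callable $geocode): array
{
    if ($address === null || trim($address) === '') {
        return []; // nothing to geocode; the job is imported without locations
    }
    return $geocode($address);
}

$geocode = static fn (string $address): array => ['geocoded:' . $address]; // stand-in
var_dump(getLocationsSafe(null, $geocode));      // []
var_dump(getLocationsSafe('Berlin', $geocode));  // ["geocoded:Berlin"]
```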

Do not fetch remote content, if provided via source

At the moment the plain text for a remote job is always fetched by a remote GET:

https://github.com/yawik/SimpleImport/blob/master/src/CrawlerProcessor/JobProcessor.php#L175-L187

This does not work if the remote site loads the job content via JavaScript or uses an iframe.

If the remote data contains the needed templateValues (see http://scrapy-docs.yawik.org/build/html/guidelines/format.html for the format), use them; otherwise fall back to the remote fetch.

"templateValues":{ "description": "<p>We're a good company<\/p>", "tasks":"<b>Your Tasks<\/b><ul><li>Task 1<\/li><li>Task2<\/li><\/ul>", "requirements":"<b>Qualifications<\/b><ul><li>requirement 1<\/li><li>requirement 2<\/li<<\/ul>", "benefits":"<b>We offer<\/b><ul><li>offer 1<\/li><li>offer 2<\/li><\/ul>", "html": "<p>complete html<\/p>" }

something like

$data = $importData['templateValues'];
$concatenated = ($data['description'] ?? '') . ($data['tasks'] ?? '')
              . ($data['requirements'] ?? '') . ($data['benefits'] ?? '');

if (!empty($data['html'])) {
    $plainText = prettify($data['html']);
} elseif (trim($concatenated) !== '') {
    $plainText = prettify($concatenated);
} else {
    $plainText = remoteFetch($url);
}

where prettify($html) should remove all HTML tags.

Lock running crawlers

A crawler must not be able to run if it is already running in another process.
For example, if a crawler is run through a cron job and is also started in the terminal, that will lead to duplicate job entities in the database.
So we need a mechanism to lock a running crawler for other processes.

A flag in the crawler entity should be enough, although that means the entity must be flushed to the database before the crawling loop starts.
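A sketch of the flag-based lock. In MongoDB the check-and-set should be one atomic findAndModify/updateOne with `running: false` in the criteria; a plain array stands in for the crawler document here:

```php
<?php
// Sketch: acquire a run lock on the crawler before the crawling loop.
// The array is a stand-in for the crawler entity; in production the
// check-and-set must be a single atomic database operation.
function tryLock(array &$crawler): bool
{
    if (!empty($crawler['running'])) {
        return false; // another process is already crawling
    }
    $crawler['running'] = true; // must be flushed before the loop starts
    return true;
}

$crawler = ['name' => 'reifen', 'running' => false];
var_dump(tryLock($crawler)); // true  - first caller gets the lock
var_dump(tryLock($crawler)); // false - second caller is rejected
```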

status of job ads

Job advertisements that are available in the feed should be given the status "active".

Currently, an advertisement that has expired is not automatically reactivated when it reappears.

Can this be configured?

Specifying log file directory in module.config.php

Currently the location is hardcoded using the constant __DIR__ and assuming the module's base dir resides in the "module" directory of a yawik installation.

However, since YAWIK's structure changed with 0.32.0, modules are developed as standalone projects.
The module.config.php is then not in a module directory of a YAWIK instance (because YAWIK is run in test/sandbox/).

We need a way to set the log file location based on the condition the module is run under.

@kilip What do you think? Do you have an idea?

Missing Commands like delete, deactivate, activate

It is not possible to delete an existing crawler.
There are more commands that might be helpful:

  • delete
  • activate/deactivate
  • update: Updates the settings for a crawler, e.g. feed-uri
  • force-reload: Import all job advertisements as "new"
