yawik / SimpleImport
Simple Job Import Module. Imports job openings into YAWIK
License: MIT License
In yawik/jobs, an EntityLoader for the purge CLI command is implemented, which loads all jobs that expired a given number of days ago.
The purge command triggers an event to fetch entities dependent on a job entity.
SimpleImport needs a listener for these events to report and remove the crawler items associated with a job.
An example dependency loader can be found in yawik/Applications:
LoadDependendEntities
which is registered in config:
'Core/EntityEraser/Dependencies/Events' => [
    'listeners' => [
        Listener\LoadDependendEntities::class => [
            'events' => [
                \Core\Service\EntityEraser\DependencyResultEvent::CHECK_DEPENDENCIES => '__invoke',
                \Core\Service\EntityEraser\DependencyResultEvent::DELETE => 'onDelete',
            ],
        ],
    ],
]
The purge command to delete expired jobs can be triggered with:
bin/console purge expired-jobs
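A SimpleImport listener could be registered the same way. The following is only a config sketch mirroring the registration above; Listener\LoadCrawlerItems is a hypothetical class name for a listener that loads and deletes the crawler items of a purged job:

```php
// module.config.php (sketch; Listener\LoadCrawlerItems is a hypothetical
// class, not an existing part of the SimpleImport module)
'Core/EntityEraser/Dependencies/Events' => [
    'listeners' => [
        Listener\LoadCrawlerItems::class => [
            'events' => [
                \Core\Service\EntityEraser\DependencyResultEvent::CHECK_DEPENDENCIES => '__invoke',
                \Core\Service\EntityEraser\DependencyResultEvent::DELETE => 'onDelete',
            ],
        ],
    ],
],
```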
The SimpleImport module should be able to geocode locations. http://geocoder-php.org/ can do the task, but the latest version requires PHP 7.
The older version 3.3 works with PHP 5.6:
https://github.com/cbleek/geocoder-php-test
So SimpleImport should be able to create location objects with city, country, and coordinates.
@TiSiE do you agree?
@fedys could you implement this?
If a job was imported with templateValues and some time later gets updated with some or all templateValues unset or set to an empty string, the existing templateValues remain.
Feel free to assign me this task.
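A fix could rebuild the stored templateValues from the incoming data, so that unset or empty values actually clear the old content. A minimal sketch; the function name and array shapes are assumptions, not the module's actual API:

```php
<?php
// Sketch: rebuild templateValues from the import data so that missing or
// empty incoming values clear stale stored content instead of keeping it.
function applyTemplateValues(array $incoming): array
{
    $result = [];
    foreach ($incoming as $key => $value) {
        if ($value !== null && $value !== '') {
            $result[$key] = $value;
        }
    }
    return $result;
}
```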
The geocoder library used by this module released a new major version which changes its usage in a way that needs some adaptation.
It renames the packages and splits them into the "geocoder-php/*" namespace, which has the most impact on this module.
The following actions need to be done:
We created the branch upgrade-geocoder for developing this enhancement.
cbleek@php7:~$ git clone https://github.com/yawik/SimpleImport.git
cbleek@php7:~$ cd SimpleImport/
cbleek@php7:~/SimpleImport$ git checkout upgrade-geocoder
cbleek@php7:~/SimpleImport$ composer install
The feature branch contains a YAWIK_TEST database which can be installed via:
cbleek@php7:~/SimpleImport$ composer db.init
The YAWIK_TEST database contains a user with an organization and a simpleimport crawler. The execution of
vendor/bin/yawik simpleimport import
leads to an exception because of the geocoder-php API changes.
cbleek@php7:~/SimpleImport$ vendor/bin/yawik simpleimport import
======================================================================
The application has thrown an exception!
======================================================================
Error
Class 'Geocoder\Provider\GoogleMaps' not found
---------------------------------------------------------------------
The initial status of jobs imported by a crawler must be configurable.
Currently they are imported as "Created", meaning the administrator has to approve every single job...
For some crawlers this might not be the desired behaviour, so it must be possible to set the initial state per crawler.
Similar to cross-solution/YAWIK#543
The values from the JSON are imported without being sanitized. Affected fields are at least company and location. Other fields can produce the same errors, e.g. classification fields.
Sample file
{
    "jobs": [
        {
            "id": 1,
            "title": "<h1>Title</h1><script>alert('Title-XSS');</script>",
            "location": "<h1>Location</h1><script>alert('Location-XSS');</script>",
            "link": "http://www.example.com/job/1"
        }
    ]
}
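A possible counter-measure is to filter every incoming string field before hydration. A minimal sketch; sanitizeField is a hypothetical helper, not part of the module:

```php
<?php
// Sketch: drop <script> elements including their body, then strip all
// remaining tags from an imported field.
function sanitizeField(string $value): string
{
    $value = preg_replace('~<script\b[^>]*>.*?</script>~is', '', $value);
    return trim(strip_tags($value));
}
```

Applied to the sample above, only the plain text of the affected fields would remain.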
At the moment the applyMode is only set if an applyLink is provided:
https://github.com/yawik/SimpleImport/blob/master/src/Hydrator/JobHydrator.php#L67-L71
From the spider docs, we could use the contactEmail:
http://scrapy-docs.yawik.org/build/html/guidelines/format.html
"contactEmail": "email address for applications (if available)",
Is this correct?
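The fallback could look like this. The field names follow the scrapy guideline format linked above; deriveApply and the returned array shape are assumptions for illustration:

```php
<?php
// Sketch: prefer applyLink, fall back to contactEmail, otherwise leave the
// apply settings untouched.
function deriveApply(array $data): ?array
{
    if (!empty($data['applyLink'])) {
        return ['mode' => 'uri', 'uri' => $data['applyLink']];
    }
    if (!empty($data['contactEmail'])) {
        return ['mode' => 'email', 'email' => $data['contactEmail']];
    }
    return null;
}
```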
Due to cross-solution/YAWIK#532, the crawling process can stop and leave the crawler in an inconsistent state.
In combination with the Solr module https://github.com/yawik/Solr it can be that the Solr commit fails.
There will be an exception and the import stops. The problem is:
The job gets inserted into the database but not into the crawler, due to the Solr exception.
So the next time the import starts, the same job is created again.
There is no automatic deletion, since the job is not managed by the crawler.
This was discussed in yawik/Solr#12.
Adding a job posting without a fulltext to Solr makes no sense. So what can we do if a job is activated and no HTML is given?
My idea is to simply fetch the HTML. If fetching fails, the job is not inserted.
Can we/should we do this in the Solr module?
Keep in mind that all time-consuming tasks like fetching pages, inserting into Solr, and sending mails should be moved into some queueing system.
If the templateValues are not set the import tries to fetch the remote page as plain text. If this remote fetching fails, the job gets the status invalid and will not be imported.
There can be many reasons for such a failure, e.g.
2019-01-17T17:06:03+01:00 ERR (3): Cannot fetch HTML digest for a job, import ID: "32", link: "example.com/3", reason: "Unable to fetch remote data, reason: "Read timed out after 5 seconds""
2019-01-17T17:10:08+01:00 ERR (3): Cannot fetch HTML digest for a job, import ID: "9", link: "example.com/1", reason: "Invalid HTTP status: "404""
2019-01-18T13:17:27+01:00 ERR (3): Cannot fetch HTML digest for a job, import ID: "7763de2bd66926f8fc8b49d384628896", link: "example.com/2", reason: "Unable to fetch remote data, reason: "Unable to enable crypto on TCP connection example.com: make sure the "sslcafile" or "sslcapath" option are properly set for the environment.""
All these jobs were not imported, and for at least two of them the error is not on the remote site.
Since the plaintext is not critical, the job should always be imported regardless of a plaintext exception.
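The change could treat the HTML digest as optional: catch the fetch failure, log it, and import the job without a plaintext. A self-contained sketch; fetchHtmlDigest is a stub standing in for the module's remote fetching:

```php
<?php
// Stub standing in for the module's remote fetching; always fails here.
function fetchHtmlDigest(string $link): string
{
    throw new RuntimeException('Read timed out after 5 seconds');
}

// Sketch: the plaintext is not critical, so a failure must not make the job
// invalid. Log the error and return null instead of aborting the import.
function importPlainText(string $link): ?string
{
    try {
        return fetchHtmlDigest($link);
    } catch (RuntimeException $e) {
        error_log(sprintf('Cannot fetch HTML digest for "%s": %s', $link, $e->getMessage()));
        return null;
    }
}
```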
It is currently only possible to add missing classifications to a job.
But sometimes, it's also needed to
Sometimes a job title contains information about a salary.
If SimpleImport finds such information, we should store it in the salary field. Possible patterns:
(\D|^)([\d]{2})([,.](\d\d|-))?\s?(EUR|Euro|€)\s*\/?\s*(STD|Std|h|Stunde)?
(([\d]{1,2}(,[\d\d]{1,2})?)\s*-?\s*([\d]{1,2})?)\s*(STD|Std|h|Stunden?).?\s*(/|pro)\s*(Wo|Woche)
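Applied with preg_match, the first pattern could be used like this (delimiters and the u modifier for the € sign added; extractSalary is a hypothetical helper):

```php
<?php
// Sketch: extract an hourly wage like "12,50 € Std" from a job title using
// the first pattern above.
$pattern = '~(\D|^)([\d]{2})([,.](\d\d|-))?\s?(EUR|Euro|€)\s*/?\s*(STD|Std|h|Stunde)?~u';

function extractSalary(string $title, string $pattern): ?string
{
    return preg_match($pattern, $title, $matches) === 1 ? trim($matches[0]) : null;
}
```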
Job position categories, such as employment type, position, industry, etc. vary widely.
In order to unify the categories in YAWIK, there needs to be a mapping mechanism that allows mapping arbitrary values to categories known by the particular YAWIK instance.
So we need to implement a filter mechanism that maps these values according to user configuration.
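Such a filter could be a simple lookup table per category type. A sketch; the mapping values are made-up examples and would come from user configuration:

```php
<?php
// Sketch: map an arbitrary crawler-supplied value onto a value known by this
// YAWIK instance, falling back to the raw value for unknown input.
function mapCategory(string $value, array $map): string
{
    return $map[strtolower($value)] ?? $value;
}

// Example user configuration for employment types (made-up values).
$employmentTypes = [
    'vollzeit' => 'full-time',
    'teilzeit' => 'part-time',
];
```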
After adding a crawler and then starting the import process, this crawler does not run.
cbleek@php7-cb:~/AtomProjects/SimpleImport$ git checkout a659f99053181f3c2a8e3dbc37b8d8da0c5a779c
Note: checking out 'a659f99053181f3c2a8e3dbc37b8d8da0c5a779c'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
git checkout -b <new-branch-name>
HEAD is now at a659f99 Optional shuffle the Publish date on import
cbleek@php7-cb:~/AtomProjects/SimpleImport$ ./vendor/bin/phpunit
PHPUnit 8.5.2 by Sebastian Bergmann and contributors.
............................................................... 63 / 181 ( 34%)
.............................F................................. 126 / 181 ( 69%)
..EE................................................... 181 / 181 (100%)
Time: 1.12 seconds, Memory: 28.00 MB
There were 2 errors:
1) SimpleImportTest\Hydrator\JobHydratorTest::testHydrate
ArgumentCountError: Too few arguments to function SimpleImport\Hydrator\JobHydrator::__construct(), 2 passed in /home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php on line 59 and exactly 3 expected
/home/cbleek/AtomProjects/SimpleImport/src/Hydrator/JobHydrator.php:43
/home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php:59
2) SimpleImportTest\Hydrator\JobHydratorTest::testHydrateInvalidObjectPassed
ArgumentCountError: Too few arguments to function SimpleImport\Hydrator\JobHydrator::__construct(), 2 passed in /home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php on line 59 and exactly 3 expected
/home/cbleek/AtomProjects/SimpleImport/src/Hydrator/JobHydrator.php:43
/home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Hydrator/JobHydratorTest.php:59
--
There was 1 failure:
1) SimpleImportTest\Factory\CrawlerProcessor\JobProcessorFactoryTest::testInvoke
Psr\Container\ContainerInterface::get('FilterManager') was not expected to be called more than 3 times.
/home/cbleek/AtomProjects/SimpleImport/src/Factory/CrawlerProcessor/JobProcessorFactory.php:41
/home/cbleek/AtomProjects/SimpleImport/test/SimpleImportTest/Factory/CrawlerProcessor/JobProcessorFactoryTest.php:64
ERRORS!
Tests: 181, Assertions: 392, Errors: 2, Failures: 1.
SimpleImport/src/Hydrator/JobHydrator.php
Lines 60 to 62 in f917ac4
This works fine, if the job is a new entity, because getLocations() returns an ArrayCollection. But it fails on persisted entities, because it returns a Doctrine PersistentCollection then, which does not have the method fromArray.
Further on, even if this approach worked, it would append the locations, not replace.
You probably want to simply set a new Collection of locations in any case:
$job->setLocations(new ArrayCollection($locations))
Mongo documents have a maximum size.
Currently the crawler items (the metadata for a crawled job) are stored as embedded documents in the crawler entity. A crawler might import a lot of jobs, which could pose problems if we continue to embed the items.
So the items must be stored in a dedicated mongo collection and stored as referenced entities in the crawler entity.
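In Doctrine MongoDB ODM terms the change could look roughly like this; class names, field names, and cascade options are only a sketch:

```php
// Sketch (Doctrine MongoDB ODM annotations; actual class and field names
// in the module may differ):
use Doctrine\ODM\MongoDB\Mapping\Annotations as ODM;

/** @ODM\Document */
class Crawler
{
    /**
     * Previously @ODM\EmbedMany(targetDocument=Item::class); referencing
     * moves the items into their own Mongo collection.
     *
     * @ODM\ReferenceMany(targetDocument=Item::class, cascade="all")
     */
    private $items;
}
```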
When a job that does not have a crawler associated should be purged, a \Doctrine\ODM\MongoDB\DocumentNotFoundException is thrown.
Additionally, if for whatever reason the remaining job ids could not be resolved, an ArgumentError
is thrown.
Currently the following string gets stored in Solr.
The encoding is partly wrong, e.g. "Ãœber" instead of "Über" (see the dump below):
` "html":"Menu Shop Inspiration Sprüngli Welt Standorte Kontakt Suche Suche Suche Reset My Sprüngli My Sprüngli Mein Konto Anmelden oder Registrieren Passwort Anmelden Passwort vergessen? Registrieren 0 0 Warenkorb ist leer Sie haben keine Artikel im Warenkorb. Weiter einkaufen DE Schließen Sortiment Chlausgeschenke Weihnachtsgeschenke Luxemburgerli ® Pralinés & Truffes Naschprodukte Tafelschokoladen Geschenkpakete & Arrangements Cakes & Torten Pâtisserie & Dessert Sandwiches & Gebäck Salate & Birchermüesli Apéro Canapé Glace Saisonal Osterhasen Ostereier Ostergeschenke Valentinstag Muttertagsgeschenke Aktuelles Saisonales Angebot Petits Plaisirs-Collection Shop entdecken Apéro Business-Lunch Chlaus Cuba Firmengeschenke Geburtstag Geburt & Taufe Hochzeit Luxemburgerli Muttertag Neuheiten Ostern Schokolade aus Bergheumilch Valentinstag Weihnachten Inspiration entdecken Treueprogramm Petits Plaisirs Aktuelles Broschüren Firmenkunden Geschichte Medien Partnerschaft mit Art Museums of Switzerland Handwerkskunst Individualisierung Kaffeekultur/Gastronomie Stellen bei Sprüngli Sprüngli-Klassiker Verantwortung Zutaten & Herkunft Sprüngli Welt entdecken Search My Sprüngli Settings Shop Sortiment Chlausgeschenke Weihnachtsgeschenke Luxemburgerli ® Pralinés & Truffes Naschprodukte Tafelschokoladen Geschenkpakete & Arrangements Cakes & Torten Pâtisserie & Dessert Sandwiches & Gebäck Salate & Birchermüesli Apéro Canapé Glace Saisonal Osterhasen Ostereier Ostergeschenke Valentinstag Muttertagsgeschenke Aktuelles Saisonales Angebot Petits Plaisirs-Collection Inspiration Chlaus Luxemburgerli Apéro Business-Lunch Cuba Firmengeschenke Geburtstag Geburt & Taufe Hochzeit Muttertag Neuheiten Ostern Schokolade aus Bergheumilch Valentinstag Weihnachten Sprüngli Welt Treueprogramm Petits Plaisirs Aktuelles Broschüren Firmenkunden Geschichte Medien Partnerschaft mit Art Museums of Switzerland Handwerkskunst Individualisierung Kaffeekultur/Gastronomie Stellen bei Sprüngli 
Sprüngli-Klassiker Verantwortung Zutaten & Herkunft Standorte Kontakt Home ... ... Sprüngli WeltStellen bei SprüngliUnsere StellenangeboteDetailhandelsfachfrauShopInspirationSprüngli WeltStandorteKontaktDetailhandelsfachfrau80-100%, Zürich Flughafen Arbeiten bei Sprüngli Das 1836 gegründete Schweizer Familienunternehmen zählt heute mit seinem erlesenen Sortiment zu den renommiertesten Confiserien Europas. Die Produkte aus dem Hause Sprüngli stehen für beste Qualität, einmalige Frische und Natürlichkeit. Die vollendeten Kreationen bringen täglich Kundinnen und Kunden aus aller Welt ins Schwärmen. Zur Verstärkung des Teams unserer Filiale im Airside Center am Flughafen Zürich suchen wir eine engagierte und begeisterte Detailhandelsfachfrau 80-100% mit Ausstrahlung für den Verkauf unserer liebevoll hergestellten Köstlichkeiten. Sie bringen mit:abgeschlossene Ausbildung als Detailhandelsfachfrau EFZ und/oder Verkaufserfahrung im Detailhandel (vorzugsweise Confiserie- oder Lebensmittelbranche)hohe Dienstleistungsbereitschaft und Freude an internationaler Kundschaftgepflegtes Erscheinungsbild und gute Umgangsformengute mündliche Englischkenntnisse, weitere Fremdsprachen von Vorteilhohe Flexibilität bezüglich Arbeitszeiten (Schichteinsätze zwischen 5.30 und 22.30 Uhr sowie 2 bis 3 Wochenend-Einsätze pro Monat)Wir bieten Ihnen eine abwechslungsreiche Tätigkeit in einem gepflegten Umfeld, interessante Entwicklungsmöglichkeiten und attraktive Anstellungsbedingungen. Sind Sie interessiert? Dann freuen wir uns über Ihre elektronische Bewerbung [email protected]. Online Bewerben Einstellungen Deutsch / CHF GutscheinJetzt einlösen NewsletterHier anmelden Haben Sie Fragen? Sie erreichen uns Mo. - Fr. von 8.00 Uhr - 12.00 Uhr und 13.00 Uhr - 17.00 Uhr unter +41 44 224 46 46. 
Kostenloser Versand schweizweit ab CHF 60.-Kauf auf Rechnung ab CHF 75.-Gratis GrusskarteInternationaler VersandSicheres Zahlen mit SSLAbholung in einer FilialeKäufer- und Datenschutz Sprüngli OnlineshopShopInspirationSprüngli-WeltStandorteKontaktÃœber UnsAktuellesOffene StellenBroschürenFirmenkundenOnlineshopLieferbedingungenZahlungsbedingungenRückgaberechtHilfe & FAQMein KontoÃœbersichtAdressbuchBestellungen Facebook Google+ © 2017 Confiserie Sprüngli AGBahnhofstrasse 21, 8001 Zürich, Schweiz AGBAGB TreueprogrammImpressumKontaktAlle Preise inkl. MwSt. Schließen Einstellungen Bitte wählen Sie hier Ihre bevorzugte Sprache und Währung. Deutsch Englisch Französisch EUR CHF USD Speichern Newsletter Zum Newsletter anmelden HerrFrau Vorname Nachname email Speichern '; $(error_message).hide().appendTo($(this).closest('div')).fadeIn(1000); } } else { if($(this).closest('div').hasClass('has-error') && !$(this).closest('div').hasClass('email-error')) { removeError($(this)); } if($(this).hasClass('validate-email')) { if(!emailValidation($(this).val())) { form_error = true; if(!$(this).closest('div').hasClass('email-error')) { $(this).addClass('validation-failed'); $(this).closest('div').addClass('has-error').addClass('email-error'); var error_message = 'Bitte geben Sie eine gültige E-Mail Adresse ein. 
Zum Beispiel [email protected].'; $(error_message).hide().appendTo($(this).closest('div')).fadeIn(1000); } } else { if($(this).closest('div').hasClass('email-error')) { removeError($(this)); } } } } }); function removeError(element) { $(element).removeClass('validation-failed'); $(element).closest('div').removeClass('has-error').removeClass('email-error'); $(element).closest('div').find('.validation-advice').fadeOut(1000).remove(); } function emailValidation(email) { var pattern = /^([a-z\\d!#$%&'*+\\-\\/=?^_`{|}~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]+(\\.[a-z\\d!#$%&'*+\\-\\/=?^_`{|}~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]+)*|\"((([ \\t]*\\r\\n)?[ \\t]+)?([\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f\\x21\\x23-\\x5b\\x5d-\\x7e\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0d-\\x7f\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]))*(([ \\t]*\\r\\n)?[ \\t]+)?\")@(([a-z\\d\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]|[a-z\\d\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF][a-z\\d\\-._~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]*[a-z\\d\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])\\.)+([a-z\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]|[a-z\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF][a-z\\d\\-._~\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF]*[a-z\\u00A0-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFEF])\\.?$/i; return pattern.test(email); } if(form_error) { return false; } $('.icon-close-s-white:first').click(); var url = form.attr('action'); if ( !url ) return; $.ajax({ url : url, method : 'POST', data : form.serializeArray() }).done(function(data){ if (data.length){ var buttonSet = $('.ajaxpro-buttons-set:first'); if ( buttonSet.length ) { buttonSet.remove(); } var fakeMessage = '' + '' + 'Schließen' + '' + '' + '' + '' + '' + '' + 'Sie wurden erfolgreich für den Newsletter angemeldet' + '' + '' + '' + ''; if ( $('#ajaxpro-notice-form-typo').length ) { $('#ajaxpro-notice-form-typo').remove(); } $(body).append($(fakeMessage).center()); } 
}); }); })(jQuery) } } //]]>"}]`
The source looks like:
{
    "templateValues": {
        "benefits": "<p>Content</p>",
        "requirements": "<p>Content</p>",
        "tasks": "<p>Content</p>",
        "description": "Companyname with Description"
    },
    "company": "Companyname",
    "location": "Germany",
    "title": "Jobtitle",
    "classifications": {
        "employmentTypes": "Vollzeit"
    },
    "link": "https://example.com/1351.html",
    "id": "1351"
},
The job is stored without a description in the database; the other templateValues are stored correctly.
https://travis-ci.org/github/yawik/SimpleImport/jobs/725533308
Version 3 requires egeloen/http-adapter, which is abandoned:
Package egeloen/http-adapter is abandoned, you should avoid using it. Use php-http/httplug instead.
The add-crawler command does not check for an existing crawler and will add the same crawler twice.
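The missing uniqueness check, sketched with a plain array standing in for the crawler repository (the repository API itself is not shown here):

```php
<?php
// Sketch: refuse to add a crawler whose name is already known.
function addCrawler(array &$crawlers, string $name): bool
{
    if (in_array($name, $crawlers, true)) {
        return false; // already exists, do not add it twice
    }
    $crawlers[] = $name;
    return true;
}
```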
Hi @toni,
after the geocoder upgrade, the import behaves slightly differently. If a feed does not contain a location, an exception is thrown. I've added such a feed, "reifen", to the demo db.
cbleek@php7-cb:~/SimpleImport$ composer db.init
> mongorestore --drop
...
cbleek@php7-cb:~/SimpleImport$ vendor/bin/yawik simpleimport info
moemax.................................. (5d0a2213403d4b050b219412)
reifen.................................. (5d19e95a403d4b0b8a06dea2)
Executing the "reifen" import leads to an exception. It should be possible to import a feed without a location.
cbleek@php7-cb:~/SimpleImport$ vendor/bin/yawik simpleimport import --name=reifen
The crawler with the name (ID) "reifen (5d19e95a403d4b0b8a06dea2)" has started its job:
[> ] 0% ======================================================================
The application has thrown an exception!
======================================================================
TypeError
Argument 1 passed to Geocoder\Query\GeocodeQuery::create() must be of the type string, null given, called in /home/cbleek/SimpleImport/src/Job/GeocodeLocation.php on line 77
----------------------------------------------------------------------
/home/cbleek/SimpleImport/vendor/willdurand/geocoder/Query/GeocodeQuery.php:68
#0 /home/cbleek/SimpleImport/src/Job/GeocodeLocation.php(77): Geocoder\Query\GeocodeQuery::create(NULL)
#1 /home/cbleek/SimpleImport/src/Hydrator/JobHydrator.php(85): SimpleImport\Job\GeocodeLocation->getLocations(NULL)
#2 /home/cbleek/SimpleImport/src/CrawlerProcessor/JobProcessor.php(236): SimpleImport\Hydrator\JobHydrator->hydrate(Array, Object(Jobs\Entity\Job))
#3 /home/cbleek/SimpleImport/src/CrawlerProcessor/JobProcessor.php(117): SimpleImport\CrawlerProcessor\JobProcessor->syncChanges(Object(SimpleImport\Entity\Crawler), Object(SimpleImport\CrawlerProcessor\Result), Object(Zend\Log\Logger))
#4 /home/cbleek/SimpleImport/src/Controller/ConsoleController.php(132): SimpleImport\CrawlerProcessor\JobProcessor->execute(Object(SimpleImport\Entity\Crawler), Object(SimpleImport\CrawlerProcessor\Result), Object(Zend\Log\Logger))
#5 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/Controller/AbstractActionController.php(78): SimpleImport\Controller\ConsoleController->importAction()
#6 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(322): Zend\Mvc\Controller\AbstractActionController->onDispatch(Object(Zend\Mvc\MvcEvent))
#7 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(179): Zend\EventManager\EventManager->triggerListeners(Object(Zend\Mvc\MvcEvent), Object(Closure))
#8 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/Controller/AbstractController.php(106): Zend\EventManager\EventManager->triggerEventUntil(Object(Closure), Object(Zend\Mvc\MvcEvent))
#9 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc-console/src/Controller/AbstractConsoleController.php(56): Zend\Mvc\Controller\AbstractController->dispatch(Object(Zend\Console\Request), Object(Zend\Console\Response))
#10 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/DispatchListener.php(138): Zend\Mvc\Console\Controller\AbstractConsoleController->dispatch(Object(Zend\Console\Request), Object(Zend\Console\Response))
#11 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(322): Zend\Mvc\DispatchListener->onDispatch(Object(Zend\Mvc\MvcEvent))
#12 /home/cbleek/SimpleImport/vendor/zendframework/zend-eventmanager/src/EventManager.php(179): Zend\EventManager\EventManager->triggerListeners(Object(Zend\Mvc\MvcEvent), Object(Closure))
#13 /home/cbleek/SimpleImport/vendor/zendframework/zend-mvc/src/Application.php(332): Zend\EventManager\EventManager->triggerEventUntil(Object(Closure), Object(Zend\Mvc\MvcEvent))
#14 /home/cbleek/SimpleImport/vendor/yawik/core/bin/yawik(27): Zend\Mvc\Application->run()
#15 {main}
======================================================================
Previous Exception(s):
At the moment the plaintext for a remote job is always fetched by a remote GET:
https://github.com/yawik/SimpleImport/blob/master/src/CrawlerProcessor/JobProcessor.php#L175-L187
This does not work if the remote site loads the job content via JavaScript or uses an iframe.
If the remote data contains the needed templateValues (http://scrapy-docs.yawik.org/build/html/guidelines/format.html), take those; otherwise use the remote fetch.
"templateValues":{ "description": "<p>We're a good company<\/p>", "tasks":"<b>Your Tasks<\/b><ul><li>Task 1<\/li><li>Task2<\/li><\/ul>", "requirements":"<b>Qualifications<\/b><ul><li>requirement 1<\/li><li>requirement 2<\/li<<\/ul>", "benefits":"<b>We offer<\/b><ul><li>offer 1<\/li><li>offer 2<\/li><\/ul>", "html": "<p>complete html<\/p>" }
Something like:

$data = $importData['templateValues'] ?? [];

if (!empty($data['html'])) {
    $plainText = prettify($data['html']);
} elseif ('' !== ($concat = trim(($data['description'] ?? '') . ($data['tasks'] ?? '') . ($data['requirements'] ?? '') . ($data['benefits'] ?? '')))) {
    $plainText = prettify($concat);
} else {
    $plainText = remoteFetch($url);
}

prettify($html) should remove all HTML tags (e.g. via strip_tags()).
A crawler must not run if it is already running in another process.
For example, if a crawler is run through a cron job and also started on the terminal, that will lead to duplicate job entities in the database.
So we need a mechanism to lock a running crawler against other processes.
A flag in the crawler entity should be enough, although that means the entity must be flushed to the database before the crawling loop starts.
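The acquire/release logic of such a flag could look like this. This in-memory sketch only illustrates the idea; in the real module the flag would have to be flushed to MongoDB immediately after acquiring it:

```php
<?php
// Sketch: a "running" flag on the crawler entity used as a lock.
class CrawlerLock
{
    private $running = false;

    public function acquire(): bool
    {
        if ($this->running) {
            return false; // another process already runs this crawler
        }
        // In the real module: set the flag and flush the entity to the
        // database before the crawling loop starts.
        $this->running = true;
        return true;
    }

    public function release(): void
    {
        $this->running = false;
    }
}
```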
Include classifications into the import.
@TiSiE the simple import currently does not put jobs into the Solr index. Maybe it's because no job event is triggered, as is done when a job is activated?
Job advertisements that are available in the feed should be given the status "active".
Currently, an advertisement that has expired is not automatically reactivated when it reappears.
Can this be configured?
Currently the log file location is hardcoded using the constant __DIR__, assuming the module's base dir resides in the "module" directory of a YAWIK installation.
However, since YAWIK's structure changed with 0.32.0, modules are developed as standalone projects.
The module.config.php then is not in a module directory of a YAWIK instance (because YAWIK is run in test/sandbox/).
We need a way to set the log file location based on the condition the module is run under.
@kilip What do you think? Do you have an idea?
It is not possible to delete an existing crawler.
There are more commands that might be helpful:
The values for link and applyLink are not filtered correctly, similar to cross-solution/YAWIK#514.
I think only URLs like https://www.example.com/offer should be possible,
and not something like javascript:alert('xss');
so the fix provided in the issue above would not solve this.