ausdto / disco_layer
Code, outputs and information relevant to the discovery layer.
In haste, a hack was added to bypass the domain checking so the crawler could range across all of gov.au. It changes the domainValid function to accept any domain. There is also crawler.filterByDomain = true, though, so maybe that could be changed instead.
Regardless, the hack needs to be removed because simplecrawler is an external module. Once it is gone, node_modules can be removed from the repo.
https://github.com/AusDTO/discoveryLayer/blob/master/node/node_modules/simplecrawler/lib/crawler.js, line 536:
var crawler = this,
    crawlerHost = crawler.host;
//console.log("crawlerHost: " + crawlerHost);
// If we're ignoring the WWW domain, remove the WWW for comparisons...
if (crawler.ignoreWWWDomain)
    host = host.replace(/^www./i,"");
//console.log("in domainValid");
///TODO: HACKED This is hacked to let it go outside this domain. Should then get caught by my conditions.
return true;
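Rather than patching domainValid() inside node_modules, the same effect can likely be had through simplecrawler's public knobs. A minimal sketch, assuming simplecrawler's documented filterByDomain property and addFetchCondition(callback) API; isGovAuHost is a hypothetical helper, not project code:

```javascript
// Sketch: disable simplecrawler's built-in domain filter and add a fetch
// condition that allows any *.gov.au host instead of patching node_modules.
function isGovAuHost(host) {
  // Accept "gov.au" itself or any subdomain such as "ahl.gov.au".
  return host === "gov.au" || /\.gov\.au$/i.test(host);
}

// Hypothetical wiring in the crawler setup:
//   crawler.filterByDomain = false;
//   crawler.addFetchCondition(function (parsedURL) {
//     return isGovAuHost(parsedURL.host);
//   });

module.exports = { isGovAuHost };
```

This keeps the restriction in our own code, so the vendored module stays pristine and node_modules can be dropped from the repo.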
This will allow easier determination of changes for downstream processing.
currently, disco_service/govservices/management/commands/.. is a messy contraption that interfaces with a local git clone of the service catalogue repository.
It would be much better if it accessed an API on the node.js side for things like fetching lists of things that need to be synced into the DB. Better to have only one codebase for processing/managing that JSON graph.
Parse the cleaned content (use something like lxml) into some kind of temporary structure, then traverse that structure to create a corresponding OrientDB graph.
note: it should be possible to "roundtrip" test this: from cleaned content to ContentAST, and from ContentAST back to equivalent (if not identical) clean content.
At the moment I am just leaving a wait for the database commands to finish.
Need to move to closing the database only after all the queries are done.
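One way to sketch that: collect each query's promise and close the connection only after the whole batch settles. Illustrative only; runQuery and closeDb stand in for the real oriento calls.

```javascript
// Sketch: rather than sleeping and hoping the inserts finished, track each
// query's promise and close the database only once all have settled.
function flushAndClose(queries, runQuery, closeDb) {
  const pending = queries.map((q) => runQuery(q));
  // Close whether the batch succeeded or failed, but only afterwards.
  return Promise.all(pending)
    .then((results) => { closeDb(); return results; })
    .catch((err) => { closeDb(); throw err; });
}

module.exports = { flushAndClose };
```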
following #45, make some views so the service catalogue can be browsed.
seed "disco service" for the spider - these pages should be indexed (and boosted!).
The max limit is always applied, leaving no way to request an unlimited result set.
Some query params seem to be getting incorrectly defined.
Most likely this is an encode/decode issue.
Examples:
info: Url was 404: http://ahl.gov.au/%3Fq=partnerships
info: Url was 404: http://ahl.gov.au/%3Fq=our-organisation
info: Url was 404: http://ahl.gov.au/%3Fq=ahl-board
info: Url was 404: http://ahl.gov.au/%3Fq=customer-service-charter
info: Url was 404: http://ahl.gov.au/%3Fq=contact
info: Url was 404: http://ahl.gov.au/%3Fq=employment
info: Url was 404: http://ahl.gov.au/%3Fq=support-services
info: Url was 404: http://ahl.gov.au/%3Fq=node%2F222
http://lmip.gov.au/default.aspx%3FLMIP%2FContactUs
Related: #32
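All the failing examples contain "%3F", which is the percent-encoding of "?": somewhere the query marker is being escaped before the fetch, so the server sees the query string as part of the path. A hypothetical normaliser (not project code) that puts the separator back:

```javascript
// Sketch: the 404s above all contain "%3F", the percent-encoded form of "?".
// Restoring the first one turns the mangled path back into a real query URL.
function restoreQueryMarker(url) {
  // Non-global regex: only decode the first "%3F", so a literal "?"
  // appearing later in the query string is left alone.
  return url.replace(/%3F/i, "?");
}

// restoreQueryMarker("http://ahl.gov.au/%3Fq=contact")
//   → "http://ahl.gov.au/?q=contact"
module.exports = { restoreQueryMarker };
```

Note the last ahl.gov.au example also contains "%2F" (an encoded "/") inside the query value, so a blanket decodeURIComponent would be too aggressive; only the separator should be restored.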
I deleted all the documents that were for ACNC because they were all returning 400 because of the query strings.
Querystring stripping is now enabled, but there is a set of URLs still there that will cause issues when they come up for re-fetching.
It looks like something with aspx pages is the root cause.
If query stripping is all that needs to be done, we can just update those URLs before they are due again.
starting with raw content (spidered from web sites), create a shiny clean version of the content that's free of cruft.
following from #45, need to configure a search index
Some of the domains are not redirecting. We need to retry the www equivalent.
Dummy and then full
Enhance the Exclude Domains fetch condition to exclude state domains.
depends on #29
page and/or API. Given a URL (e.g. the current page hosting a widget), return a list of pages "like this one".
In disco_service/spiderbucket/management/commands/sync_docs_from_orientdb.py, I have hardcoded values for OrientDB.
These should be drawn from environment variables indirectly, through settings.py.
Need to add incremental commits
For example, if going to this website:
http://www.acnc.gov.au/findacharity
it redirects to:
http://www.acnc.gov.au/ACNC/FindCharity/QuickSearch/ACNC/OnlineProcessors/Online_register/Search_the_Register.aspx?noleft=1
But in the crawler I get a 599 and the redirected url is not fetched.
The callback chain is causing issues: this function effectively needs to wait until we get an answer, which means the promises should resolve before we move on.
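A sketch of that, wrapping a callback-style call in a Promise so the caller can actually wait for the redirect answer. fetchWithRedirects is a stand-in name for the real crawler call, not an existing function:

```javascript
// Sketch: promisify a callback-style fetch so callers can await the final
// (post-redirect) URL instead of racing the callback chain.
function fetchResolved(fetchWithRedirects, targetUrl) {
  return new Promise((resolve, reject) => {
    fetchWithRedirects(targetUrl, (err, finalUrl) => {
      if (err) return reject(err); // e.g. the 599 case surfaces here
      resolve(finalUrl);
    });
  });
}

module.exports = { fetchResolved };
```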
The actual database password is not being passed.
Also check whether the server password is still needed; at a minimum, create another account that is limited to just listing the DBs etc.
A potential fix is in local git.
The current dockerised solr container is Solr5, but things might be easier with solr4 (until haystack support for solr5 gets a bit more mature).
either shake the issues out OR downgrade to solr4 and upgrade later.
faceted browsing site with search features and recommendation engine
Kickstart semantic information extraction / enrichment. Reason over:
MVP might be "life event" facets based on service catalogue + content cagefight clustering (#30)
Some URLs give a 404 when the query params are left on, but I assume some will not work if they are stripped.
Example where it fails with the query param (400):
acnc.gov.au/ACNC/Manage/Reporting/ReportTransitional/ACNC/Report/ReportTransitional.aspx%3Fhkey=61d173e0-cabb-4be4-82d5-0bf37fa55c7c
Add hooks and a function to handle it.
http://orientdb.com/docs/last/Dynamic-Hooks.html
http://orientdb.com/docs/last/Functions.html
Currently just using console logs.
Need to move to a logging library.
Maybe winston.
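A sketch of what the winston setup might look like (winston 2.x-era API; the filename and levels are placeholders, not project settings):

```javascript
// Sketch of a winston logger replacing the bare console.log calls.
var winston = require("winston");

var logger = new winston.Logger({
  transports: [
    new winston.transports.Console({ level: "info", timestamp: true }),
    new winston.transports.File({ filename: "crawler.log", level: "warn" })
  ]
});

// logger.info("queued %s", url);
// logger.error("Url was 404: %s", url);
module.exports = logger;
```

Daily file rotation can be layered on later via the separate winston-daily-rotate-file transport.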
At the moment, errors (404, 500, timeout) are not stored. They should be, but at some stage we should stop trying to refresh them. Need to decide the rules.
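One possible rule set, sketched with invented thresholds (this is not decided policy, just a starting point for discussion):

```javascript
// Sketch: store every error, but stop re-fetching after repeated failures.
// Thresholds here are assumptions, not agreed rules.
function shouldRefetch(status, failureCount) {
  const MAX_FAILURES = 3; // assumption: give a URL three chances overall
  if (failureCount >= MAX_FAILURES) return false;
  // 404s rarely recover, so retry them only once; 5xx and timeouts
  // are often transient, so keep retrying up to the cap.
  if (status === 404) return failureCount < 1;
  return true;
}

module.exports = { shouldRefetch };
```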
I had a test_update_dimension that I'm sure used to pass, but I think I broke it when I refactored the tests such that agencies were stashed in an "ORM cache" class member (as a performance hack, to reduce traffic between the test suite and the test DB).
I don't understand why this bug occurs now, but want to get on with a major refactor that might end up removing it anyhow. So, current plan is to switch the test off (rename to "buggy_test_update_dimension").
/home/ec2-user/crawler/logs/greenpower.gov.au_investigate.log
There are a whole bunch from greenpower that say "completed" but never make it to the DB.
I can manually insert the url using studio.
It seems to be a lot of PDFs.
Need to get log files rotating daily.
this is WIP ATM.
When queuing a lot of results we can get duplicate inserts after the select count. Not a big issue, because the record has in fact been stored, which is all we want anyhow.
Need to handle database errors. Should just require attaching a catch handler.
connected to #30.
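A sketch of that catch handler, treating the duplicate-key race as success and surfacing everything else. The "duplicated key" string match mirrors the OrientDB error text seen in the logs here, but matching on message text is an assumption; insertFn is a stand-in for the real oriento insert:

```javascript
// Sketch: swallow duplicate-key errors (the record is already stored,
// which is all we want) and rethrow genuine database errors.
function insertIgnoringDuplicates(insertFn, doc) {
  return insertFn(doc).catch((err) => {
    if (/duplicated key/i.test(String(err && err.message))) {
      return null; // already stored by a concurrent insert
    }
    throw err; // real database error: let the caller handle it
  });
}

module.exports = { insertIgnoringDuplicates };
```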
Python or node module to generate the enhanced document to be added to solr
following from #45, create a jenkins job that maintains RDBMS content when json changes in github.
following from #27, we have jenkins testing the disco_service but not the node stuff yet.
Should be trivial; this already works: 'python manage.py test'
related to #30
long overdue, spiderbucket was always a stupid name
error: OrientDB.RequestError: Cannot index record webDocumentContainer{protocol:http,host:agriculture.gov.au,port:80,path:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,depth:3,pathname:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,url:http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3}: found duplicated key 'http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3' in index 'webDocumentContainer.url' previously assigned to the record #13:25476
at Operation.parseError (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:832:13)
at Operation.consume (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:422:35)
at Connection.process (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:360:17)
at Connection.handleSocketData (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:279:17)
at Socket.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:163:16)
at Socket.Readable.push (_stream_readable.js:126:10)
at TCP.onread (net.js:538:20)
Need to move the config db/params out of the primary functions
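One shape for that: a single config module driven by environment variables, so the query functions receive connection settings instead of embedding them. The variable names and defaults below are invented for illustration:

```javascript
// Sketch: pull the OrientDB connection settings out of the query functions
// into one env-driven module. Names/defaults are assumptions.
function dbConfigFromEnv(env) {
  return {
    host: env.ORIENTDB_HOST || "localhost",
    port: parseInt(env.ORIENTDB_PORT || "2424", 10),
    username: env.ORIENTDB_USER || "admin",
    password: env.ORIENTDB_PASSWORD || "", // no hardcoded fallback secret
  };
}

// Usage: const config = dbConfigFromEnv(process.env);
module.exports = { dbConfigFromEnv };
```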