ausdto / disco_layer
Code, outputs and information relevant to the discovery layer.
In haste, a hack was added to bypass the domain checking so the crawler could range across all of gov.au. It changes the domainValid function to accept any domain. There is also crawler.filterByDomain = true, though, so maybe that could be changed instead.
Regardless, the hack needs to be removed because simplecrawler is an external module. Once it is gone, node_modules can be removed from the repo.
https://github.com/AusDTO/discoveryLayer/blob/master/node/node_modules/simplecrawler/lib/crawler.js, line 536:
var crawler = this,
    crawlerHost = crawler.host;
//console.log("crawlerHost: " + crawlerHost);
// If we're ignoring the WWW domain, remove the WWW for comparisons...
if (crawler.ignoreWWWDomain)
    host = host.replace(/^www./i,"");
//console.log("in domainValid");
///TODO: HACKED This is hacked to let it go outside this domain. Should then get caught by my conditions.
return true;
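Rather than patching domainValid() inside node_modules, the same effect can likely be had through simplecrawler's public knobs. A minimal sketch, assuming simplecrawler's documented filterByDomain property and addFetchCondition(callback) API; isGovAuHost is a hypothetical helper, not project code:

```javascript
// Sketch: disable simplecrawler's built-in domain filter and add a fetch
// condition that allows any *.gov.au host instead of patching node_modules.
function isGovAuHost(host) {
  // Accept "gov.au" itself or any subdomain such as "ahl.gov.au".
  return host === "gov.au" || /\.gov\.au$/i.test(host);
}

// Hypothetical wiring in the crawler setup:
//   crawler.filterByDomain = false;
//   crawler.addFetchCondition(function (parsedURL) {
//     return isGovAuHost(parsedURL.host);
//   });

module.exports = { isGovAuHost };
```

This keeps the restriction in our own code, so the vendored module stays pristine and node_modules can be dropped from the repo.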
This will allow easier determination of changes for downstream processing.
currently, disco_service/govservices/management/commands/.. is a messy contraption that interfaces with a local git clone of the service catalogue repository.
It would be much better if it accessed an API on the node.js side for things like fetching lists of things that need to be synced into the DB. Better to have only one codebase for processing/managing that JSON graph.
Parse the cleaned content (use something like lxml) into some kind of temporary structure, then traverse that structure to create a corresponding OrientDB graph.
note: it should be possible to "roundtrip" test this: from cleaned content to ContentAST, and from ContentAST back to equivalent (if not identical) clean content.
At the moment I am just leaving a wait for the database commands to finish.
Need to move to closing the database only after all the queries are done.
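One way to sketch that: collect each query's promise and close the connection only after the whole batch settles. Illustrative only; runQuery and closeDb stand in for the real oriento calls.

```javascript
// Sketch: rather than sleeping and hoping the inserts finished, track each
// query's promise and close the database only once all have settled.
function flushAndClose(queries, runQuery, closeDb) {
  const pending = queries.map((q) => runQuery(q));
  // Close whether the batch succeeded or failed, but only afterwards.
  return Promise.all(pending)
    .then((results) => { closeDb(); return results; })
    .catch((err) => { closeDb(); throw err; });
}

module.exports = { flushAndClose };
```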
following #45, make some views so the service catalogue can be browsed.
seed "disco service" for the spider - these pages should be indexed (and boosted!).
The max limit is always applied, leaving no way to request an unlimited result set.
Some query params seem to be getting incorrectly defined.
Most likely this is an encode/decode issue.
Examples:
info: Url was 404: http://ahl.gov.au/%3Fq=partnerships
info: Url was 404: http://ahl.gov.au/%3Fq=our-organisation
info: Url was 404: http://ahl.gov.au/%3Fq=ahl-board
info: Url was 404: http://ahl.gov.au/%3Fq=customer-service-charter
info: Url was 404: http://ahl.gov.au/%3Fq=contact
info: Url was 404: http://ahl.gov.au/%3Fq=employment
info: Url was 404: http://ahl.gov.au/%3Fq=support-services
info: Url was 404: http://ahl.gov.au/%3Fq=node%2F222
http://lmip.gov.au/default.aspx%3FLMIP%2FContactUs
Related: #32
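All the failing examples contain "%3F", which is the percent-encoding of "?": somewhere the query marker is being escaped before the fetch, so the server sees the query string as part of the path. A hypothetical normaliser (not project code) that puts the separator back:

```javascript
// Sketch: the 404s above all contain "%3F", the percent-encoded form of "?".
// Restoring the first one turns the mangled path back into a real query URL.
function restoreQueryMarker(url) {
  // Non-global regex: only decode the first "%3F", so a literal "?"
  // appearing later in the query string is left alone.
  return url.replace(/%3F/i, "?");
}

// restoreQueryMarker("http://ahl.gov.au/%3Fq=contact")
//   → "http://ahl.gov.au/?q=contact"
module.exports = { restoreQueryMarker };
```

Note the last ahl.gov.au example also contains "%2F" (an encoded "/") inside the query value, so a blanket decodeURIComponent would be too aggressive; only the separator should be restored.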
I deleted all the documents that were for ACNC because they were all returning 400 because of the query strings.
Querystring stripping is now enabled, but there is a set of URLs still there that will cause issues when they come up for re-fetching.
It looks like something with aspx pages is the root cause.
If query stripping is all that needs to be done, we can just update those URLs before they are due again.
starting with raw content (spidered from web sites), create a shiny clean version of the content that's free of cruft.
following from #45, need to configure a search index
Some of the domains are not redirecting. We need to retry the www equivalent.
Dummy and then full
Enhance the Exclude Domains fetch condition to exclude state domains.
depends on #29
page and/or API. Given a URL (e.g. the current page hosting a widget), return a list of pages "like this one".
In disco_service/spiderbucket/management/commands/sync_docs_from_orientdb.py, I have hardcoded values for OrientDB.
These should be drawn from environment variables indirectly, through settings.py.
Need to add incremental commits
For example, if going to this website:
http://www.acnc.gov.au/findacharity
it redirects to:
http://www.acnc.gov.au/ACNC/FindCharity/QuickSearch/ACNC/OnlineProcessors/Online_register/Search_the_Register.aspx?noleft=1
But in the crawler I get a 599 and the redirected url is not fetched.
The callback chain is causing issues: this function effectively needs to wait until we get an answer, which means the promises should resolve before we move on.
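A sketch of that, wrapping a callback-style call in a Promise so the caller can actually wait for the redirect answer. fetchWithRedirects is a stand-in name for the real crawler call, not an existing function:

```javascript
// Sketch: promisify a callback-style fetch so callers can await the final
// (post-redirect) URL instead of racing the callback chain.
function fetchResolved(fetchWithRedirects, targetUrl) {
  return new Promise((resolve, reject) => {
    fetchWithRedirects(targetUrl, (err, finalUrl) => {
      if (err) return reject(err); // e.g. the 599 case surfaces here
      resolve(finalUrl);
    });
  });
}

module.exports = { fetchResolved };
```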
The actual database password is not being passed.
Also check whether the server password is still needed; at a minimum, create another account that is limited to just listing the DBs etc.
A potential fix is in local git.
The current dockerised solr container is Solr5, but things might be easier with solr4 (until haystack support for solr5 gets a bit more mature).
either shake the issues out OR downgrade to solr4 and upgrade later.
faceted browsing site with search features and recommendation engine
Kickstart semantic information extraction / enrichment. Reason over:
MVP might be "life event" facets based on service catalogue + content cagefight clustering (#30)
Some URLs give a 404 when the query params are left on, but I assume some will not work if they are stripped.
Example where it fails with the query param (400):
acnc.gov.au/ACNC/Manage/Reporting/ReportTransitional/ACNC/Report/ReportTransitional.aspx%3Fhkey=61d173e0-cabb-4be4-82d5-0bf37fa55c7c
Add hooks and a function to handle it.
http://orientdb.com/docs/last/Dynamic-Hooks.html
http://orientdb.com/docs/last/Functions.html
Currently just using console logs.
Need to move to a logging library.
Maybe winston.
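A sketch of what the winston setup might look like (winston 2.x-era API; the filename and levels are placeholders, not project settings):

```javascript
// Sketch of a winston logger replacing the bare console.log calls.
var winston = require("winston");

var logger = new winston.Logger({
  transports: [
    new winston.transports.Console({ level: "info", timestamp: true }),
    new winston.transports.File({ filename: "crawler.log", level: "warn" })
  ]
});

// logger.info("queued %s", url);
// logger.error("Url was 404: %s", url);
module.exports = logger;
```

Daily file rotation can be layered on later via the separate winston-daily-rotate-file transport.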
At the moment, errors (404, 500, timeout) are not stored. They should be, but at some stage we should stop trying to refresh them. Need to decide the rules.
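One possible rule set, sketched with invented thresholds (this is not decided policy, just a starting point for discussion):

```javascript
// Sketch: store every error, but stop re-fetching after repeated failures.
// Thresholds here are assumptions, not agreed rules.
function shouldRefetch(status, failureCount) {
  const MAX_FAILURES = 3; // assumption: give a URL three chances overall
  if (failureCount >= MAX_FAILURES) return false;
  // 404s rarely recover, so retry them only once; 5xx and timeouts
  // are often transient, so keep retrying up to the cap.
  if (status === 404) return failureCount < 1;
  return true;
}

module.exports = { shouldRefetch };
```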
I had a test_update_dimension that I'm sure used to pass, but I think I broke it when I refactored the tests such that agencies were stashed in an "ORM cache" class member (as a performance hack, to reduce traffic between the test suite and the test DB).
I don't understand why this bug occurs now, but want to get on with a major refactor that might end up removing it anyhow. So, current plan is to switch the test off (rename to "buggy_test_update_dimension").
/home/ec2-user/crawler/logs/greenpower.gov.au_investigate.log
There are a whole bunch from greenpower that say "completed" but never make it to the DB.
I can manually insert the url using studio.
It seems to be a lot of PDFs.
Need to get log files rotating daily.
this is WIP ATM.
When queuing a lot of results we can get duplicate inserts after the select count. Not a big issue, because the record has in fact been stored, which is all we want anyhow.
Need to handle database errors. Should just require attaching a catch handler.
connected to #30.
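A sketch of that catch handler, treating the duplicate-key race as success and surfacing everything else. The "duplicated key" string match mirrors the OrientDB error text seen in the logs here, but matching on message text is an assumption; insertFn is a stand-in for the real oriento insert:

```javascript
// Sketch: swallow duplicate-key errors (the record is already stored,
// which is all we want) and rethrow genuine database errors.
function insertIgnoringDuplicates(insertFn, doc) {
  return insertFn(doc).catch((err) => {
    if (/duplicated key/i.test(String(err && err.message))) {
      return null; // already stored by a concurrent insert
    }
    throw err; // real database error: let the caller handle it
  });
}

module.exports = { insertIgnoringDuplicates };
```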
Python or node module to generate the enhanced document to be added to solr
following from #45, create a jenkins job that maintains RDBMS content when json changes in github.
following from #27, we have jenkins testing the disco_service but not the node stuff yet.
Should be trivial; this already works: 'python manage.py test'
related to #30
long overdue, spiderbucket was always a stupid name
error: OrientDB.RequestError: Cannot index record webDocumentContainer{protocol:http,host:agriculture.gov.au,port:80,path:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,depth:3,pathname:/ScriptResource.axd?d=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3,url:http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3}: found duplicated key 'http://agriculture.gov.au/ScriptResource.axd%3Fd=c6-x5B8upWa01zaoYvIPGWKXX9r_j-wWUWV9FCPAuE2AJ0QqVe_xNXoqJ-ENMUA7BaVzaiSnT3p4CMPDOOVQ9VeiR6kFneRW3EA_el-OFUPXZkB0h9SKofzsQa9sElSIPeaoWOV631b-L71nDxxlHWyHzDP2z5_xU02TsidI_m6JdZKrnP_PjSEm6FyPSo9S0&t=ffffffff805766b3' in index 'webDocumentContainer.url' previously assigned to the record #13:25476
at Operation.parseError (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:832:13)
at Operation.consume (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/protocol28/operation.js:422:35)
at Connection.process (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:360:17)
at Connection.handleSocketData (/home/nokout/projects/discoveryLayer/crawler/node_modules/oriento/lib/transport/binary/connection.js:279:17)
at Socket.emit (events.js:107:17)
at readableAddChunk (_stream_readable.js:163:16)
at Socket.Readable.push (_stream_readable.js:126:10)
at TCP.onread (net.js:538:20)
Need to move the config db/params out of the primary functions
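One shape for that: a single config module driven by environment variables, so the query functions receive connection settings instead of embedding them. The variable names and defaults below are invented for illustration:

```javascript
// Sketch: pull the OrientDB connection settings out of the query functions
// into one env-driven module. Names/defaults are assumptions.
function dbConfigFromEnv(env) {
  return {
    host: env.ORIENTDB_HOST || "localhost",
    port: parseInt(env.ORIENTDB_PORT || "2424", 10),
    username: env.ORIENTDB_USER || "admin",
    password: env.ORIENTDB_PASSWORD || "", // no hardcoded fallback secret
  };
}

// Usage: const config = dbConfigFromEnv(process.env);
module.exports = { dbConfigFromEnv };
```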