seykron / ogov-importer
Argentina's Congress bill importer
License: GNU General Public License v2.0
Some projects have gone to the revision chamber and are assigned a different ID there. I think we may be duplicating those projects when they are actually just one. Please check it.
If we are, I think we should introduce something like a "bill group" that could contain several files (expedientes) representing the same project.
This idea of a "bill group" could also serve to create groups of projects based on user-defined issues, but that is a more complex use case that perhaps exceeds the scope of this scraper.
There seems to be a file descriptor leak when the vote importer creates a temporary file to read a motion PDF. It may be caused by node's streams: file write streams might not be closing their file descriptors properly.
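One way to rule streams in or out is to write the temporary file with explicit descriptor handling, so that every descriptor opened is closed in a `finally` block. This is a minimal sketch, not the importer's code: `withTempFile`, the `motion-` prefix and the `motion.pdf` file name are all hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: writes `buffer` to a temporary file, passes the file
// path to `callback`, and always closes the descriptor and removes the file.
function withTempFile(buffer, callback) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'motion-'));
  const file = path.join(dir, 'motion.pdf');
  const fd = fs.openSync(file, 'w');
  try {
    fs.writeSync(fd, buffer);
  } finally {
    fs.closeSync(fd); // released even if the write throws
  }
  try {
    return callback(file);
  } finally {
    fs.unlinkSync(file);
    fs.rmdirSync(dir);
  }
}
```

If the leak disappears with this approach, the write stream handling is the likely culprit. If it persists, the descriptors are probably leaking somewhere else, for example in the PDF reader.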
Here is a page with information on the latest database update:
http://www1.hcdn.gov.ar/proyectos_search/actualizacion.asp?tipo=proyectos
Would this be enough to crawl only the updated documents?
Here the subscribers field contains the name of the committee as a subscriber. I think it should contain the name of the project's author instead.
{
"type":"PROYECTO DE RESOLUCION",
"source":"Diputados",
"file":"3621-D-2007",
"publishedOn":"Trámite Parlamentario",
"creationTime":"2007-08-09T03:00:00.000Z",
"hasText":true,
"summary":"DECLARAR LA VALIDEZ DEL DECRETO DE NECESIDAD Y URGENCIA 861/2007.",
"subscribers":[
{
"name":"BICAMERAL PERMANENTE DE TRAMITE LEGISLATIVO - LEY 26122",
"party":"NONE",
"province":""
}
],
"committees":
["BICAMERAL PERMANENTE DE TRAMITE LEGISLATIVO - LEY 26122"],
"dictums":
[
{"file":"3621-D-2007",
"source":"Diputados","orderPaper":"Orden del día nº 2669/2007","date":"2007-08-09T03:00:00.000Z",
"result":"DICTAMEN DE MAYORIA: LA COMISION ACONSEJA APROBAR UN PROYECTO DE RESOLUCION Y DECLARAR LA VALIDEZ DEL DECRETO; 2 DICTAMENES DE MINORIA: ACONSEJAN RECHAZAR EL DECRETO"}],
"procedures":[{"file":"3621-D-2007","source":"Diputados","topic":"MOCION SOBRE TABLAS (PLAN DE LABOR) (AFIRMATIVA)","date":"2007-11-21T03:00:00.000Z","result":""},{"file":"3621-D-2007","source":"Diputados","topic":"CONSIDERACION Y APROBACION","date":"2007-11-28T03:00:00.000Z","result":"APROBADO"},{"file":"3621-D-2007",
"source":"Diputados",
"topic":"INSERCION DEL DIPUTADO LANDAU",
"date":"2007-11-28T03:00:00.000Z",
"result":""
}
]
}
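To find out how many stored bills are affected, a check along these lines could be run over the imported JSON. It is only a sketch and assumes the record shape shown above. `committeeSubscribers` is a hypothetical name, not part of the importer.

```javascript
// Returns the subscribers whose name matches one of the bill's committees.
// An empty result means no committee was stored as a subscriber.
function committeeSubscribers(bill) {
  const committees = new Set(bill.committees || []);
  return (bill.subscribers || []).filter(function (subscriber) {
    return committees.has(subscriber.name);
  });
}
```

For the bill above it returns the single BICAMERAL PERMANENTE entry.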
Example:
{"pictureUrl":"http://www4.hcdn.gob.ar/fotos/wsantillan_medium.jpg",
"name":"SANTILLAN, WALTER MARCELO",
"user":"wsantillan",
"email":"[email protected]",
"district":"TUCUMAN",
"start":"10/12/2011",
"end":"09/12/2015",
"party":"FRENTE PARA LA VICTORIA - PJ",
"role":"legislative",
"committees":[
{"id":"camunicipale","name":"ASUNTOS MUNICIPALES","position":"vicepresidente 1º"},
{"id":"ccultur","name":"CULTURA","position":"secretario"},
{"id":"cceinformatic","name":"COMUNICACIONES E INFORMATICA","position":"vocal"},
{"id":"cdhygarantia","name":"DERECHOS HUMANOS Y GARANTIAS","position":"vocal"},
{"id":"ceydregiona","name":"ECONOMIAS Y DESARROLLO REGIONAL","position":"vocal"},
{"id":"cltrabaj","name":"LEGISLACION DEL TRABAJO","position":"vocal"},
{"id":"copublica","name":"OBRAS PUBLICAS","position":"vocal"},
{"id":"cpyssocia","name":"PREVISION Y SEGURIDAD SOCIAL","position":"vocal"},
{"name":"","position":"COMUNICACIONES E INFORMATICA"},
{"name":"","position":"LEGISLACION DEL TRABAJO"},
{"name":"","position":"LEGISLACION DEL TRABAJO"}
]
}
See the last three entries in committees: they have no id, an empty name, and a committee name in the position field.
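A quick way to list the broken entries, assuming a well-formed committee always has a non-empty id and name, as the first eight entries do. `malformedCommittees` is a hypothetical helper, not importer code.

```javascript
// Returns the committee entries that lack an id or a name.
function malformedCommittees(legislator) {
  return (legislator.committees || []).filter(function (committee) {
    return !committee.id || !committee.name;
  });
}
```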
This might sound like a purely aesthetic improvement, so you might not want to give it a high priority, but I think bills should be stored in one folder per year. That way it is easier to count the bills in a given year with find and wc.
When I'm processing the files, it will also be easier to tell whether I have processed all the files from the same folder/year. Right now, with that mess of meaningless numbers, I get confused.
Inside each year folder, bills should be split between s and d, and within those folders they should all be kept together.
Also, adding a .json extension to the files will make it easier for GitHub to display them.
Example:
2012/s/5454-s-2013.json
2012/d/5453-d-2013.json
2010/s/1253-s-2010.json
2010/d/1253-d-2010.json
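The layout could be derived from the file identifier alone. The sketch below assumes identifiers always look like `number-chamber-year` and that the folder year is the year embedded in the identifier. `billPath` is a hypothetical name, and identifiers with other chamber codes (such as CD) would simply get their own folder.

```javascript
const path = require('path');

// Maps a bill file identifier to a relative "<year>/<chamber>/<file>.json" path.
function billPath(fileId) {
  const match = /^(\d+)-([A-Za-z]+)-(\d{4})$/.exec(fileId);
  if (!match) {
    throw new Error('Unexpected file identifier: ' + fileId);
  }
  const chamber = match[2].toLowerCase();
  const year = match[3];
  return path.posix.join(year, chamber, fileId.toLowerCase() + '.json');
}
```

With this layout, counting a year's bills is a matter of running find on the year folder and piping the result to wc.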
The Orden del Día document contains the version of the text that will be voted on by the whole chamber. We need this text because it is more important than the original project's text, which is never updated.
Parsing this Orden del Día PDF file will also allow versions of the text to be compared.
Example URL: http://www4.diputados.gov.ar/dependencias/dcomisiones/periodo-130/130-1998.pdf
This can be obtained from the same place in the search results as everything else.
FileSystemStorer must implement a directory balance strategy.
I run node importer bills and I get this after 20 minutes:
Imported items: 128000
info: Queue empty. Adding another 4 pages to the queue.
fs.js:432
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^
Error: EISDIR, illegal operation on a directory '/home/billit/ogov-importer/data/bills'
at Object.fs.openSync (fs.js:432:18)
at Object.fs.writeFileSync (fs.js:971:15)
at /home/billit/ogov-importer/lib/FileSystemStorer.js:86:12
at process._tickCallback (node.js:415:13)
There is a bill with null summary.
File: bills/07/02/0702-D-2012
Summary should be: PROGRAMA NACIONAL DE ASISTENCIA DE LAS ADICCIONES: CREACION EN EL AMBITO DEL MINISTERIO DE SALUD DE LA NACION.
Source: http://www.hcdn.gov.ar/proyectos/resultado.html?palabras=0702-D-2012
Full JSON content:
{"type":"PROYECTO DE LEY","source":"Diputados","file":"0702-D-2012","publishedOn":"Trámite Parlamentario nº 8","creationTime":"2012-03-13T03:00:00.000Z","summary":null,"subscribers":[{"name":"DONDA PEREZ, VICTORIA ANALIA","party":"LIBRES DEL SUR","province":"BUENOS AIRES"}],"committees":["PREVENCION DE ADICCIONES Y CONTROL DEL NARCOTRAFICO","ACCION SOCIAL Y SALUD PUBLICA","PRESUPUESTO Y HACIENDA"],"dictums":[{"file":"0702-D-2012","source":"Diputados","orderPaper":"Orden del día nº 1251/2012 - DICTAMEN CONJUNTO DE LOS EXPEDIENTES 0398-D-2012, 0662-D-2012, 0702-D-2012, 3044-D-2012, 4195-D-2012, 4215-D-2012, 5480-D-2012 y 5833-D-2012","date":null,"result":"02/11/2012"},{"file":"0702-D-2012","source":"Senado","orderPaper":"Orden del día nº 0030/2014 - DICTAMEN CONJUNTO DE LOS EXPEDIENTES 0114-CD-2012, 0281-S-2013, 1242-S-2013, 0731-S-2013, 0204-S-2014 y 0637-S-2014","date":"2014-04-23T03:00:00.000Z","result":"LA COMISION ACONSEJA APROBAR EL PROYECTO DE LEY VENIDO EN REVISION; CON TRES DISIDENCIAS PARCIALES; ANEXO: DICTAMEN DE MINORIA: CON MODIFICACIONES, LA COMISION ACONSEJA APROBAR OTRO PROYECTO DE LEY"}],"procedures":[{"file":"0702-D-2012","source":"Diputados","topic":"CITACION SESION ESPECIAL CONJUNTAMENTE PARA LOS EXPEDIENTES 0398-D-2012, 0662-D-2012, 0702-D-2012, 3044-D-2012, 4195-D-2012, 4215-D-2012, 5480-D-2012 y 5833-D-2012","date":"2012-11-14T03:00:00.000Z","result":null},{"file":"0702-D-2012","source":"Diputados","topic":"CONSIDERACION Y APROBACION CON MODIFICACIONES (VOTACION NOMINAL) CONJUNTAMENTE PARA LOS EXPEDIENTES 0398-D-2012, 0662-D-2012, 0702-D-2012, 3044-D-2012, 4195-D-2012, 4215-D-2012, 5480-D-2012 y 5833-D-2012","date":"2012-11-14T03:00:00.000Z","result":"MEDIA SANCION"},{"file":"0702-D-2012","source":"Senado","topic":"PASA A SENADO -","date":null,"result":null},{"file":"0702-D-2012","source":"Senado","topic":"CONSIDERACION Y SANCION CONJUNTAMENTE PARA LOS EXPEDIENTES 0114-CD-2012, 0204-S-2014 y 0637-S-2014","date":"2014-04-30T03:00:00.000Z","result":"SANCIONADO"}]}
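To see whether other bills have the same problem, the stored records could be checked for empty fields. This is a sketch only: the list of required fields is my assumption, and `missingFields` is a hypothetical helper rather than something the importer provides.

```javascript
// Assumed set of fields that every imported bill should carry.
const REQUIRED_FIELDS = ['type', 'source', 'file', 'creationTime', 'summary'];

// Returns the names of required fields that are null, undefined or empty.
function missingFields(bill) {
  return REQUIRED_FIELDS.filter(function (field) {
    const value = bill[field];
    return value === null || value === undefined || value === '';
  });
}
```

For the record above it reports only the summary field.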
The bill importer should have a way to limit the query to bills introduced since the beginning of last year. Projects introduced before then can no longer be updated, so it will be faster to crawl only the projects that may actually have changed.
Idea:
node importer bills 2013
Creates a query where fecha_inicio is 01/01/2013 instead of 1999.
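A sketch of how the optional year argument could be turned into the fecha_inicio value. `startDateFromArgs` is a hypothetical function, and the dd/mm/yyyy format is taken from the example above. Anything that is not a whole year from 1999 onwards is rejected.

```javascript
const DEFAULT_START_YEAR = 1999;

// Builds the fecha_inicio value from the arguments that follow "bills".
function startDateFromArgs(args) {
  const year = args[0] === undefined ? DEFAULT_START_YEAR : Number(args[0]);
  if (!Number.isInteger(year) || year < DEFAULT_START_YEAR) {
    throw new Error('Invalid start year: ' + args[0]);
  }
  return '01/01/' + year;
}
```

With `node importer bills 2013`, the arguments after the command would be `process.argv.slice(3)`, assuming the command name is the third element of `process.argv`.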
The bill importer currently uses a lot of memory, and node's garbage collector does not free it as fast as needed.
Most of the memory is used by jsdom to generate the virtual DOM environment. It works pretty well up to ~7000 bills on a laptop with 4GB of RAM, then it starts to swap.
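One thing worth trying is to close each jsdom window as soon as its page has been processed, since jsdom documents `window.close()` as the way to shut a window down and release its resources. The wrapper below is a sketch: `withWindow` is a hypothetical name, and `createWindow` stands for whatever the importer uses to build a window from HTML. Whether this is enough to fix the memory growth is untested.

```javascript
// Runs `extract` against a freshly created window and always closes the
// window afterwards, so its DOM can be garbage collected.
function withWindow(createWindow, html, extract) {
  const window = createWindow(html);
  try {
    return extract(window);
  } finally {
    window.close();
  }
}
```

The value returned by `extract` should be plain data (strings, numbers, arrays) rather than DOM nodes, otherwise the nodes keep the closed window reachable.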
Which version of the node VM?
Any special config?
Thanks.
For the vote event we would like to have the type of majority and type of quorum (which define whether the session should be held and whether the project should be approved).
These can be obtained from the header of the PDF.
Also, the official counts (affirmative/negative/absent) should be in their own field, because the granular vote data sometimes differs from them.
The counts are already in the popolo standard: http://popoloproject.com/specs/vote-event.html
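As a rough sketch, the vote event could look like the object built below. The `counts` array with `option` and `value` follows my reading of the Popolo vote event spec. `majorityType` and `quorumType` are hypothetical field names and not part of Popolo, and the shape of `header` (the values parsed from the PDF header) is assumed.

```javascript
// Builds a vote event from the parsed PDF header and the individual votes.
function buildVoteEvent(header, votes) {
  return {
    majorityType: header.majorityType, // hypothetical field, not in Popolo
    quorumType: header.quorumType,     // hypothetical field, not in Popolo
    // Official counts, kept separate from the granular votes.
    counts: [
      { option: 'yes', value: header.affirmative },
      { option: 'no', value: header.negative },
      { option: 'absent', value: header.absent }
    ],
    votes: votes
  };
}
```

Keeping `counts` apart from `votes` makes it possible to flag sessions where the official totals and the per-legislator data disagree.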