
ogov-importer's People

Contributors

matias-mi, seykron

ogov-importer's Issues

Group projects by revision chamber ID

Some projects have gone to the revision chamber, where they are assigned a different ID. I think we may be importing those as separate projects when they are actually the same one. Please check.
If we are, I think we should create something like a "bill group" that can contain several files (expedientes) representing the same project.

This "bill group" idea could also serve to create groups of projects based on user-defined issues, but that is a more complex use case that perhaps exceeds the scope of this scraper.
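
A minimal sketch of what a "bill group" could look like, assuming each imported bill carried a field pointing at its file ID in the chamber of origin. The `originFile` field and the `groupBills` function are hypothetical; the importer does not produce them today.

```javascript
// Hypothetical: `originFile` is an assumed field holding the file ID the
// project had in its chamber of origin. Bills without it are their own origin.
function groupBills(bills) {
  const groups = new Map();
  for (const bill of bills) {
    const key = bill.originFile || bill.file;
    if (!groups.has(key)) {
      groups.set(key, { id: key, files: [] });
    }
    groups.get(key).files.push(bill.file);
  }
  return Array.from(groups.values());
}

// Two records for the same project end up in one group.
console.log(
  groupBills([
    { file: "0702-D-2012" },
    { file: "0114-CD-2012", originFile: "0702-D-2012" },
  ])
);
```

How the link between the two IDs would be discovered (for example, from the procedures list) is left open here.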

File descriptors leak in vote importer

There seems to be a file descriptor leak when the vote importer creates a temporary file to read a motion PDF. It may be caused by Node's streams: file write streams may not be closing their file descriptors properly.
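
One way to rule out the write stream as the culprit is to let `stream.pipeline` manage it. This is a sketch, not the importer's current code: `withTempFile` is a hypothetical helper, and it assumes a Node version that has `stream.pipeline`, `Readable.from` and `fs.promises.rm`.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");
const { pipeline, Readable } = require("stream");
const { promisify } = require("util");

const pipelineAsync = promisify(pipeline);

// Writes `source` (any readable stream, e.g. the downloaded motion PDF) to a
// temporary file, passes the path to `handler`, and always removes the file.
async function withTempFile(source, handler) {
  const dir = await fs.promises.mkdtemp(path.join(os.tmpdir(), "motion-"));
  const file = path.join(dir, "motion.pdf");
  try {
    // pipeline() resolves only after the write stream has finished, and it
    // destroys both streams on error, so a failed download does not leave
    // the descriptor open.
    await pipelineAsync(source, fs.createWriteStream(file));
    return await handler(file);
  } finally {
    await fs.promises.rm(dir, { recursive: true, force: true });
  }
}

// Usage with an in-memory stream standing in for the PDF download.
withTempFile(Readable.from([Buffer.from("%PDF-1.4")]), (file) =>
  fs.statSync(file).size
).then((size) => console.log("bytes written:", size));
```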

Encoding issues

As far as I can tell, several fields have encoding issues. The publishedOn field has values like Tr�mite Parlamentario n� 107, and several Ñ characters in the subscribers' names are also garbled. The characters are not garbled in the source:

[Screenshot: screen shot 2015-08-15 at 12 43 30]
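
A likely cause, though I have not confirmed it, is that the source pages are served as ISO-8859-1 and the importer decodes the bytes as UTF-8. The sketch below reproduces the symptom under that assumption; `decodeLatin1` is a hypothetical helper.

```javascript
// Assumption: the source is ISO-8859-1 (latin1). These are the latin1 bytes
// for the word "Trámite".
const raw = Buffer.from([0x54, 0x72, 0xe1, 0x6d, 0x69, 0x74, 0x65]);

// Decodes a buffer of latin1 bytes into a JavaScript string.
function decodeLatin1(buffer) {
  return buffer.toString("latin1");
}

// Decoding as UTF-8 turns the invalid byte into the U+FFFD replacement
// character, which is the garbling seen in the imported data.
console.log(raw.toString("utf8"));
// Decoding as latin1 keeps the accented character.
console.log(decodeLatin1(raw));
```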

Implement storer to integrate this importer with ogov-api project.

Expected result:

  • Keep the integration with ogov-api up to date.
  • WISH: remove mongo and use mysql. Mongo does not work very well for mapping relationships between domain entities.

Actual results:

  • There's no more mongo-specific code to generate the document structure used by ogov-api.

Bill entries have wrong data in the subscribers field

Here the subscribers field has the name of the committee as a subscriber; I think it should be the name of the creator of the project.

{
  "type": "PROYECTO DE RESOLUCION",
  "source": "Diputados",
  "file": "3621-D-2007",
  "publishedOn": "Tr�mite Parlamentario",
  "creationTime": "2007-08-09T03:00:00.000Z",
  "hasText": true,
  "summary": "DECLARAR LA VALIDEZ DEL DECRETO DE NECESIDAD Y URGENCIA 861/2007.",
  "subscribers": [
    {
      "name": "BICAMERAL PERMANENTE DE TRAMITE LEGISLATIVO - LEY 26122",
      "party": "NONE",
      "province": ""
    }
  ],
  "committees": [
    "BICAMERAL PERMANENTE DE TRAMITE LEGISLATIVO - LEY 26122"
  ],
  "dictums": [
    {
      "file": "3621-D-2007",
      "source": "Diputados",
      "orderPaper": "Orden del d�a n� 2669/2007",
      "date": "2007-08-09T03:00:00.000Z",
      "result": "DICTAMEN DE MAYORIA: LA COMISION ACONSEJA APROBAR UN PROYECTO DE RESOLUCION Y DECLARAR LA VALIDEZ DEL DECRETO; 2 DICTAMENES DE MINORIA: ACONSEJAN RECHAZAR EL DECRETO"
    }
  ],
  "procedures": [
    {
      "file": "3621-D-2007",
      "source": "Diputados",
      "topic": "MOCION SOBRE TABLAS (PLAN DE LABOR) (AFIRMATIVA)",
      "date": "2007-11-21T03:00:00.000Z",
      "result": ""
    },
    {
      "file": "3621-D-2007",
      "source": "Diputados",
      "topic": "CONSIDERACION Y APROBACION",
      "date": "2007-11-28T03:00:00.000Z",
      "result": "APROBADO"
    },
    {
      "file": "3621-D-2007",
      "source": "Diputados",
      "topic": "INSERCION DEL DIPUTADO LANDAU",
      "date": "2007-11-28T03:00:00.000Z",
      "result": ""
    }
  ]
}

People entries have duplicated/wrong entries in committees.

Example:

{"pictureUrl":"http://www4.hcdn.gob.ar/fotos/wsantillan_medium.jpg",
"name":"SANTILLAN, WALTER MARCELO",
"user":"wsantillan",
"email":"[email protected]",
"district":"TUCUMAN",
"start":"10/12/2011",
"end":"09/12/2015",
"party":"FRENTE PARA LA VICTORIA - PJ",
"role":"legislative",
"committees":[
{"id":"camunicipale","name":"ASUNTOS MUNICIPALES","position":"vicepresidente 1º"},
{"id":"ccultur","name":"CULTURA","position":"secretario"},
{"id":"cceinformatic","name":"COMUNICACIONES E INFORMATICA","position":"vocal"},
{"id":"cdhygarantia","name":"DERECHOS HUMANOS Y GARANTIAS","position":"vocal"},
{"id":"ceydregiona","name":"ECONOMIAS Y DESARROLLO REGIONAL","position":"vocal"},
{"id":"cltrabaj","name":"LEGISLACION DEL TRABAJO","position":"vocal"},
{"id":"copublica","name":"OBRAS PUBLICAS","position":"vocal"},
{"id":"cpyssocia","name":"PREVISION Y SEGURIDAD SOCIAL","position":"vocal"},
{"name":"","position":"COMUNICACIONES E INFORMATICA"},
{"name":"","position":"LEGISLACION DEL TRABAJO"},
{"name":"","position":"LEGISLACION DEL TRABAJO"}
]
}

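See the last three entries: they have no id, an empty name, and a committee name in the position field. Two of them are identical.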
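
Until the scraper itself is fixed, a post-processing filter could drop the bad entries. This is only a workaround sketch; `cleanCommittees` is a hypothetical name and it assumes every valid committee entry has both an id and a name, as in the example above.

```javascript
// Keeps committee entries that carry an id and a name, dropping repeated ids.
function cleanCommittees(committees) {
  const seen = new Set();
  return committees.filter((committee) => {
    if (!committee.id || !committee.name) return false;
    if (seen.has(committee.id)) return false;
    seen.add(committee.id);
    return true;
  });
}

const sample = [
  { id: "ccultur", name: "CULTURA", position: "secretario" },
  { id: "cltrabaj", name: "LEGISLACION DEL TRABAJO", position: "vocal" },
  { name: "", position: "LEGISLACION DEL TRABAJO" },
  { name: "", position: "LEGISLACION DEL TRABAJO" },
];
console.log(cleanCommittees(sample).map((committee) => committee.id));
```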

Improve bill file storage structure

This might sound like a purely aesthetic improvement, so you might not want to give it high priority, but I think bills should be stored in one folder per year. That way it is easier to count the bills for a given year with find and wc.

When I'm processing the files, it will be easier to tell whether I have processed all the files from the same folder/year. Right now, with that mess of meaningless numbers, I get confused.

Inside each year folder, bills should be split between s and d, and within those I think they should all be together.

Also, adding the .json extension to the files will make it easier for GitHub to display them.

Example:
2012/s/5454-s-2013.json
2012/d/5453-d-2013.json

2010/s/1253-s-2010.json
2010/d/1253-d-2010.json
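
A sketch of how the path could be derived from the file ID. `billPath` is a hypothetical helper, and it assumes the year folder is taken from the ID itself and that IDs look like number-chamber-year. IDs from other origins (for example CD) would land in their own folder with this rule.

```javascript
const path = require("path");

// Maps a bill file ID such as "1253-D-2010" to "2010/d/1253-d-2010.json".
function billPath(file) {
  const match = /^(\d+)-([A-Za-z]+)-(\d{4})$/.exec(file);
  if (!match) {
    throw new Error("Unexpected bill file id: " + file);
  }
  const [, , chamber, year] = match;
  return path.posix.join(
    year,
    chamber.toLowerCase(),
    file.toLowerCase() + ".json"
  );
}

console.log(billPath("1253-D-2010"));
console.log(billPath("1253-S-2010"));
```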

Add link to Orden del Día

The Orden del Día document contains the version of the text that will be voted on by the whole chamber. We need this text because it is more important than the original project's text, which doesn't get updated.
Also, parsing this Orden del Día PDF file will allow comparison between versions of the text.

Example URL: http://www4.diputados.gov.ar/dependencias/dcomisiones/periodo-130/130-1998.pdf

This can be obtained from the same place in the search results as everything else.

Importer error at 128000 projects

I ran node importer bills and got this after 20 minutes:

Imported items: 128000
info: Queue empty. Adding another 4 pages to the queue.

fs.js:432
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^
Error: EISDIR, illegal operation on a directory '/home/billit/ogov-importer/data/bills'
at Object.fs.openSync (fs.js:432:18)
at Object.fs.writeFileSync (fs.js:971:15)
at /home/billit/ogov-importer/lib/FileSystemStorer.js:86:12
at process._tickCallback (node.js:415:13)

Bill with no summary

There is a bill with null summary.
File: bills/07/02/0702-D-2012
Summary should be: PROGRAMA NACIONAL DE ASISTENCIA DE LAS ADICCIONES: CREACION EN EL AMBITO DEL MINISTERIO DE SALUD DE LA NACION.
Source: http://www.hcdn.gov.ar/proyectos/resultado.html?palabras=0702-D-2012

Full JSON content:

{"type":"PROYECTO DE LEY","source":"Diputados","file":"0702-D-2012","publishedOn":"Tr�mite Parlamentario n� 8","creationTime":"2012-03-13T03:00:00.000Z","summary":null,"subscribers":[{"name":"DONDA PEREZ, VICTORIA ANALIA","party":"LIBRES DEL SUR","province":"BUENOS AIRES"}],"committees":["PREVENCION DE ADICCIONES Y CONTROL DEL NARCOTRAFICO","ACCION SOCIAL Y SALUD PUBLICA","PRESUPUESTO Y HACIENDA"],"dictums":[{"file":"0702-D-2012","source":"Diputados","orderPaper":"Orden del d�a n� 1251/2012 - DICTAMEN CONJUNTO DE LOS EXPEDIENTES 0398-D-2012, 0662-D-2012, 0702-D-2012, 3044-D-2012, 4195-D-2012, 4215-D-2012, 5480-D-2012 y 5833-D-2012","date":null,"result":"02/11/2012"},{"file":"0702-D-2012","source":"Senado","orderPaper":"Orden del d�a n� 0030/2014 - DICTAMEN CONJUNTO DE LOS EXPEDIENTES 0114-CD-2012, 0281-S-2013, 1242-S-2013, 0731-S-2013, 0204-S-2014 y 0637-S-2014","date":"2014-04-23T03:00:00.000Z","result":"LA COMISION ACONSEJA APROBAR EL PROYECTO DE LEY VENIDO EN REVISION; CON TRES DISIDENCIAS PARCIALES; ANEXO: DICTAMEN DE MINORIA: CON MODIFICACIONES, LA COMISION ACONSEJA APROBAR OTRO PROYECTO DE LEY"}],"procedures":[{"file":"0702-D-2012","source":"Diputados","topic":"CITACION SESION ESPECIAL CONJUNTAMENTE PARA LOS EXPEDIENTES 0398-D-2012, 0662-D-2012, 0702-D-2012, 3044-D-2012, 4195-D-2012, 4215-D-2012, 5480-D-2012 y 5833-D-2012","date":"2012-11-14T03:00:00.000Z","result":null},{"file":"0702-D-2012","source":"Diputados","topic":"CONSIDERACION Y APROBACION CON MODIFICACIONES (VOTACION NOMINAL) CONJUNTAMENTE PARA LOS EXPEDIENTES 0398-D-2012, 0662-D-2012, 0702-D-2012, 3044-D-2012, 4195-D-2012, 4215-D-2012, 5480-D-2012 y 5833-D-2012","date":"2012-11-14T03:00:00.000Z","result":"MEDIA SANCION"},{"file":"0702-D-2012","source":"Senado","topic":"PASA A SENADO -","date":null,"result":null},{"file":"0702-D-2012","source":"Senado","topic":"CONSIDERACION Y SANCION CONJUNTAMENTE PARA LOS EXPEDIENTES 0114-CD-2012, 0204-S-2014 y 0637-S-2014","date":"2014-04-30T03:00:00.000Z","result":"SANCIONADO"}]}

Import only last 2 years of bills

The bill importer should have a way to limit the query so that it starts at the beginning of last year. Projects added before last year can no longer be updated, so it will be faster to crawl only the projects that could have been updated.
Idea:
node importer bills 2013

This creates a query where fecha_inicio is 01/01/2013 instead of 1999.
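
A sketch of how the optional year argument could become the fecha_inicio value. `startDate` is a hypothetical helper; the 1999 default and the dd/mm/yyyy format are taken from the description above, and how the importer actually builds its query is not shown here.

```javascript
// Turns the optional CLI year into the fecha_inicio value for the query.
function startDate(arg, defaultYear = 1999) {
  const year = arg === undefined ? defaultYear : Number(arg);
  if (!Number.isInteger(year) || year < defaultYear) {
    throw new Error("Invalid start year: " + arg);
  }
  return "01/01/" + year;
}

// With `node importer bills 2013`, the year would arrive as process.argv[3].
console.log(startDate("2013"));
console.log(startDate(undefined));
```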

Bill importer memory performance issues

The bill importer currently uses a lot of memory, and Node's garbage collector does not free memory as fast as needed.

Most of the memory is used by jsdom to generate the virtual DOM environment. It works pretty well up to ~7000 bills on a laptop with 4GB of RAM, then it starts to swap.

Expected result:

  • Linear memory usage
  • jsdom may be replaced by an XML pull parser.

Actual results:

  • A huge memory leak

Add data to voting

For the vote event we would like to have the type of majority and the type of quorum (which determine whether the session can be held and whether the project is approved).
These can be obtained from the header of the PDF.

Also, the official counts (affirmative/negative/absent) should be in their own field, because these are the official counts and the granular vote data sometimes differs.
Counts are already part of the Popolo standard: http://popoloproject.com/specs/vote-event.html
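
A sketch of the resulting vote event. The counts and votes arrays follow my reading of the Popolo vote event spec (each count has an option and a value). The `majority_type` and `quorum_type` fields, the shape of `header`, and the function names are assumptions of this sketch and are not part of Popolo.

```javascript
// Official totals from the PDF header, expressed as Popolo count objects.
function officialCounts(totals) {
  return [
    { option: "yes", value: totals.affirmative },
    { option: "no", value: totals.negative },
    { option: "absent", value: totals.absent },
  ];
}

// Tallies the per-legislator votes so they can be compared with the official
// totals, since the two sometimes differ.
function tallyVotes(votes) {
  const tally = {};
  for (const vote of votes) {
    tally[vote.option] = (tally[vote.option] || 0) + 1;
  }
  return tally;
}

function buildVoteEvent(header, votes) {
  return {
    majority_type: header.majorityType, // hypothetical field, not in Popolo
    quorum_type: header.quorumType, // hypothetical field, not in Popolo
    counts: officialCounts(header.totals),
    votes: votes,
  };
}
```

Keeping the official counts separate from the votes array means a mismatch between the two stays visible instead of being silently recomputed.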

Resume bill importer if the process is somehow killed

Expected results:

  • The bill importer must continue from the last page it imported (resuming is supported, but it is not properly implemented in billImporter.js).

Actual results:

  • Bill importer always starts again from the first page.
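
A minimal checkpoint sketch for resuming, independent of how billImporter.js currently tracks pages. The function names, the checkpoint file and its `lastPage` field are all hypothetical.

```javascript
const fs = require("fs");

// Returns the last page recorded in the checkpoint, or 0 if there is none.
function loadLastPage(checkpointFile) {
  try {
    return JSON.parse(fs.readFileSync(checkpointFile, "utf8")).lastPage;
  } catch (err) {
    if (err.code === "ENOENT") return 0;
    throw err;
  }
}

// Records a page as imported. Writing to a temporary file and renaming it
// avoids leaving a half-written checkpoint if the process is killed mid-write.
function saveLastPage(checkpointFile, page) {
  const tmp = checkpointFile + ".tmp";
  fs.writeFileSync(tmp, JSON.stringify({ lastPage: page }));
  fs.renameSync(tmp, checkpointFile);
}
```

The importer would call loadLastPage once at startup and saveLastPage after each page has been fully stored, so a restart repeats at most one page.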
