openownership / register-ingester-psc

Register Ingester PSC is an application designed for use with the People with Significant Control (PSC) data published by Companies House in the UK.

Home Page: https://bods-data.openownership.org/source/PSC/

License: Apache License 2.0

Register Ingester PSC

Register Ingester PSC is a data ingester for the OpenOwnership Register project. It processes the bulk data about People with Significant Control (PSC) published by Companies House in the UK, and ingests the records into Elasticsearch. Optionally, it can also publish new records to AWS Kinesis. It works with raw records only, and doesn't convert them into the Beneficial Ownership Data Standard (BODS) format.

Installation

Install and boot Register.

Configure your environment using the example file:

cp .env.example .env

Then set the following variables in .env as needed:

  • PSC_STREAM: AWS Kinesis stream to which new records are published (optional)
  • PSC_STREAM_API_KEY: PSC Stream API registration key (optional; only necessary if ingesting via a stream rather than snapshots)

Create the Elasticsearch indexes:

docker compose run ingester-psc create-indexes

Testing

Run the tests:

docker compose run ingester-psc test

Usage

There are three ways to ingest data:

  • ingest via snapshots by using the helper script
  • ingest via snapshots by running the commands step-by-step
  • ingest via a stream by running the commands step-by-step (not fully functional)

Snapshots using the helper script

To ingest the bulk data from a snapshot (published daily):

docker compose run ingester-psc ingest-bulk

Snapshots step-by-step

Decide on an import ID relating to the data to download, e.g. 2023_10_06 (the form used in the commands below). The same ID is then passed to each subsequent command.

Discover the snapshots available for the import ID by retrieving the list of snapshots:

docker compose run ingester-psc discover-snapshots 2023_10_06

Ingest the snapshots. This iterates through the list of files uploaded to the designated prefix for the import ID and ingests them into Elasticsearch:

docker compose run ingester-psc ingest-snapshots 2023_10_06
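
The two step-by-step commands above can be wrapped in a small shell sketch. The function names are hypothetical, and the date conversion assumes that the import ID is simply the snapshot date written with underscores, as the example commands suggest. Only the docker compose subcommands come from this README.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the repository.

# Convert an ISO date such as 2023-10-06 into the underscore form
# (2023_10_06) that the example commands use as the import ID.
# ASSUMPTION: the import ID is just the date with underscores.
import_id_from_date() {
  printf '%s\n' "$1" | sed 's/-/_/g'
}

# Run both snapshot steps for one date, skipping ingestion
# if the discovery step fails.
ingest_snapshots_for() {
  id=$(import_id_from_date "$1")
  docker compose run ingester-psc discover-snapshots "$id" &&
    docker compose run ingester-psc ingest-snapshots "$id"
}
```

With these functions loaded, calling ingest_snapshots_for 2023-10-06 would run discover-snapshots and then ingest-snapshots with the ID 2023_10_06.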

Stream step-by-step (not fully functional)

Connect to the PSC Stream API, consume any new records, and ingest them into Elasticsearch (PSC_STREAM_API_KEY must be set):

docker compose run ingester-psc ingest-stream

Alternatively, to connect to the PSC Stream API starting from a given stream position (which must be valid and not too old), pass it as STREAM_POSITION:

docker compose run ingester-psc ingest-stream <STREAM_POSITION>

register-ingester-psc's Issues

Investigate UK PSC ingest issue where people are getting very recent dates as birth dates

While testing queries using the Datasette instance connected to the Register, we've spotted an issue where a number of individuals are being given very recent birth dates, so we need to check whether a mapping error is happening here.

Take the example of 'Umar Mukhtar Ismail'. Here is a search on Companies House:
https://find-and-update.company-information.service.gov.uk/search?q=Umar+Mukhtar+Ismail

There seem to be two people with that name and the same address:
https://find-and-update.company-information.service.gov.uk/officers/jdyknQBCOBKkhahF-RQ8lHviQCk/appointments
https://find-and-update.company-information.service.gov.uk/company/14185951/persons-with-significant-control

And it looks like Umar entered his details, put a recent date in the birth date field rather than his date of birth, realised his mistake five days later, and then deleted himself and re-entered his details: https://register.openownership.org/search?q=Umar+Mukhtar+Ismail

Is there any way that we can prevent this from coming through in the UK PSC data? Or do we just need to get them to do better verification checks on dates?
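
One possible mitigation, offered only as a sketch, is a plausibility check on the birth year that flags suspicious records for review. Nothing in the source says the project does this. The function name and the default minimum age of 16 are assumptions chosen for illustration, and because the ingester stores raw records only, a check like this would more naturally sit in a downstream step.

```shell
#!/bin/sh
# Hypothetical check; not part of the repository.
# Succeeds (exit status 0) when the birth year implies an age of at
# least MIN_AGE years in the reference year, and fails otherwise.
# Usage: plausible_birth_year BIRTH_YEAR [REFERENCE_YEAR] [MIN_AGE]
# ASSUMPTION: the default MIN_AGE of 16 is arbitrary and should be tuned.
plausible_birth_year() {
  birth_year=$1
  reference_year=${2:-$(date +%Y)}  # defaults to the current year
  min_age=${3:-16}
  [ $((reference_year - birth_year)) -ge "$min_age" ]
}
```

For example, plausible_birth_year 2022 2023 fails, so a record with that birth year would be flagged, whereas plausible_birth_year 1980 2023 succeeds.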
