alephdata / followthemoney Goto Github PK

View Code? Open in Web Editor NEW

196.0 19.0 46.0 14.46 MB

Data model and processing tools for investigative entity data

Home Page: https://followthemoney.tech

License: MIT License

Python 87.71% Makefile 0.29% Dockerfile 0.33% TypeScript 10.75% JavaScript 0.72% Shell 0.20%

ontology journalism entities

followthemoney's Introduction

Follow the Money

This repository contains a pragmatic data model for the entities most commonly used in investigative reporting: people, companies, assets, payments, court cases, etc.

The purpose of this is not to model reality in an ideal data model, but rather to have a working data structure for researchers.

followthemoney also contains code used to validate and normalize many of the elements of data, and to map tabular data into the model.

Documentation

For a general introduction to followthemoney, check the high-level introduction:

https://followthemoney.tech

Part of this package is a command-line tool that can be used to process and transform data in various ways. You can find a tutorial here:

https://followthemoney.tech/docs/cli/

Besides the introductions, there is also a full reference documentation for the library and the contained ontology:

https://followthemoney.tech/explorer/

There's also a number of viewers for the RDF schema definitions generated from FollowTheMoney, e.g.:

LODE documentation
WebVOWL
RDF/OWL specification in XML.

Development environment

For local development with a virtualenv:

python3 -mvenv .env
source .env/bin/activate
pip install -e ".[dev]"

Now you can run the tests with

make test

Releasing

We release a lot of version of followthemoney because even small changes to the code base require a pypi release to begin being used in aleph. To this end, here's the steps for making a release:

git pull --rebase
make build
make test
git add . && git commit -m "Updating translation files"
bumpversion patch
git push --atomic origin main $(git describe --tags --abbrev=0)

This will create a new patch release and upload a distribution of it. If the changes are more significant, you can run bumpversion with the minor or major arguments.

When the schema is updated, please update the docs, ideally including the diagrams. For the RDF namespace and JavaScript version of the model, run make generate.

followthemoney's People

Contributors

Stargazers

Watchers

followthemoney's Issues

Introduce Namespaces for IDs

IDs in Aleph are essentially natural keys: in most cases (e.g. when generated by a mapping) they are of the form: sha1(collection_foreign_id, *identifying_criteria). This essentially puts all entities in a global namespace - which is OK as long as the IDs are managed by Aleph.

Now that we have the bulk loading API and ftm command-line tool, it's becoming possible to think about an attack vector in this: a user could generate entities with IDs that already exist in other collections in the system and then bulkload them to a collection they have access to. This would then overwrite the entities in the other collection (unless an expensive merge operation is performed).

I've been thinking about this in terms of how we can abort when such an escalation occurs, but a I think it might actually be nicer to just engineer around it entirely by introducing a notion of entity namespaces. The idea here is to introduce some semantics to entity IDs which would make it clear what dataset they are part of. Possible forms:

namespace.entity_id (where namespace is a foreign ID in Aleph).
sha1(namespace).entity_id
entity_id.sha1(namespace, entity_id)
entity_id.sha1(namespace, entity_id)[:8] - i.e. include only the first N characters of the hash

The benefit of this scheme is that it could be re-applied server-side inside the bulk API, thus making key collisions cryptographically hard to engineer.

TaxRoll Schema appears to have gotten some breaking changes

Usage of it only throws this when creating a new object via the API:
{"status": "error", "errors": "Unknown property (<Schema('TaxRoll')>): name"}

For some reason, it does not end up in the API console.

Will work-around it for now by declaring a Thing, might look into this issue later if nobody beats me too it.

Add new entity type Debt.

Intervals we have currently doesn't fit debtor/creditor relationship.

I'm attaching proposed schema

Debt:
  label: "Debt"
  plural: "Debts"
  description: "A monetary debt between two parties."
  extends: 
    - Interval
    - Value
  matchable: false
  featured:
    - debtor
    - creditor
    - date
    - amount
  edge:
    source: debtor
    target: creditor
  properties:
    debtor:
      label: "Debtor"
      reverse:
        name: debtDebtor
        label: "Debts"
      type: entity
      range: LegalEntity
    creditor:
      label: "creditor"
      reverse:
        name: debtCreditor
        label: "Credits"
      type: entity
      range: LegalEntity

Add crypto wallet as property

Implement entity link constraints

When one entity references another, there should be a check to make sure the linked entity is of some schema. Example: the owner of a company should be a legal entity. Nobody should be able to own things like People that are not Assets.

Create links data ontology

What is the hierarchy here? What are the specific attributes for each level?

Enable complex query filters

Riffing off of #366, we've been discussing a more advanced filter mechanism in entity mappings both for queries and for schema assignment. I'm sketching this out here to get some feedback.

The basic idea is to make it possible to compile nested complex queries. This might have to be incompatible to the existing filter, filter_not rules in mappings, so we would be introducing a new query section (which filter and filter_not would be transparently re-written to). The query syntax itself could riff off of the MongoDB syntax, which manages to encapsulate compound queries in JSON.

The second purpose of this enhancement would be to make schema conditional on column contents. For example, we often see tables where companies and people are mixed, and whether a row describes a person or company is determined by a value like individual/entity in one column.

Here's a sketch of how we could address both issues:

query:
  table: zz_donors
  query:
    $and:
      country: xk
      $or:
        political_party:
          $like: "%conservative%"
        district: ["Southern", "Northern"]
  entities:
    donor:
      schema_query:
        Person:
          donor_type: "individual"
        Company:
          donor_type: ["corporate", "business"]

Especially keen for feedback from @uhhhuh @brrttwrks :)

Add descriptions for all schema and properties

FtM.js

move FtM.js functionality to this repo,
https://github.com/alephdata/vis2.js/tree/master/src/followthemoney

Add `processingStatus` and `processingError` properties to document entity schema

We'll use them to store document ingestion status till we come up with a better solution.

Define reverse properties throughout the model

We need to go through all of the cases in which an entity points to another and define a name for the inverse of the relationship.

cf. alephdata/aleph#419 (comment)

Generic MISP to Aleph exporter

This is a long shot question: I want to export MISP data to aleph. We have any possible type of information in MISP, and the generic representation is the following:

Event
- contains standalone attributes and objects
- has metadata (tags, date, comment)
Object
- contains attributes
- Has a comment, will have tags
Attribute
- value of a certain type
- tags

In a lucky case, the object(s) are the 1:1 mapping to FTM schema, so we can directly export them.

In all the other cases, we have anything possible, from a file object representing a malware sample to the description of an attack against the ss7 infrastructure. Mapping each individual MISP Object to a dedicated FTM schema would be an awful mess, and anyone needing that granularity should just use MISP.

The use case I have in mind is using aleph as a organized container for data coming from not very structured sources, but have correlation with the structured information I have in a MISP instance.

Would it make sense to just consider each individual objects as documents, with the event itself as a parent?

Add literal component to key.

We currently have a few conditions where the fields in the source data are not sufficient to generate a unique ID for certain entities (especially those which define a relationship between two other entities, e.g. a company and it's director). We should therefore add a literal extra component to the key like this:

ownership:
  schema: Ownership
  keys:
    - director_name
    - company_name
  key_literal:
    - "Ownership"
directorship:
  schema: Directorship
  keys:
    - director_name
    - company_name
  key_literal:
    - "Directorship"

Filter Example mentioned in the Docs does not work?

after discussion here alephdata/aleph#1764 reopening it in the correct place:

It seems Filtering source data as described in the docs do not work? See @ https://docs.alephdata.org/developers/mappings

Aleph generated all 653 entities present in the CSV instead of the filtered (conservative & not male) 67 entities.

I am grateful for any hints, maybe i am doing something wrong? CSV and YML are according to the docs.
I want to use mixed entities in a CSV and only load specific rows ... i suppose the filter would let me do that, no?
acc. to @pudo this might be relevant (Ref: https://github.com/alephdata/followthemoney/blob/master/followthemoney/mapping/source.py#L22 )

To reproduce:

I am using the UI and Version 3.10.0 (docker)

I uploaded the CSV and before generating the entities i loaded filters.yml (see below)

source data

http://bit.ly/uk-mps-csv

filters.yml

gb_parliament_57:
  queries:
  - csv_url: http://bit.ly/uk-mps-csv
    filters:
      group: "Conservative"
    filters_not:
      gender: "male" 
    entities:
      member:
        schema: Person
        keys:
        - id
        properties:
          name:
            column: name

Add label property to all edge schemas

In addition to "source" and "target" the edge key in relationship schemas should have a label property

Validation for the properties

Add validation on the list of featured properties in entities' definitions

OpenOwnership enricher

A process for enriching a followthemoney entity and its relationships by looking it up in the OpenOwnership Register. From their website:

Freely licensed data, in bulk

We release all of our data as an open ledger, formatted using the Beneficial Ownership Data Standard (BODS).

This data is updated monthly, and made available as a free download under the Open Data Commons Attribution License.

Currently containing nearly 20 million BODS statements, in a 10GB JSONLines file.

Navigating the maze of corporate shell companies that own one another is a big challenge for investigative reporting and money laundering investigations. Beneficial ownership data seems to be especially useful for following the money on a project like aleph.

Proposal: declarative mapping to convert complex nested documents into the FTM entities

Foreword

Aleph already has some tools (including nice UI) to map flat files (like CSV/XLSX) into the FTM entities. Those mappings are yaml based and allow you to version control them and can be read and used by non-programmers. Great success.

Problem

Some input files are not born tables and cannot be converted to tables (without blowing up the data). For example, we have an asset declaration, where person declares his real estate, cars, incomes, bank accounts. That creates bunch of problems:

Each section has it's own rules, fields and generally produces different subset of entities.
There might be more than hundred of records in each section.
Some records in some sections has even more levels of nesting as well as back-referencing. For example, rights on the asset. You have a plot of the land, which you co-own with your relatives and third-parties. So, each record of such kind will yield not only record on the asset (for example, Real Estate) but also N records of Company or Person type + intervals to connect them to the Asset.
The entities that are generated can be of a different type. Back to the example above: some real estate can be co-owned/co-used by 2 persons and 1 company.

Usual solution for this kind of the data sources is to write some python code.

Proposal

It would be nice to have an extension (or a separate project/product/tool) to map such data sources into entities.
Here are some principles:

Declarative
YAML based
Somewhat compatible with existing mappings (so it can read them too)
Probably JMESPath based, where you can describe a jmespath to extract the section and jmespathes to extract the content of the section
Probably possessing it's own pseudo-language for expressions or a way to call predefined macro/user python function to deal with the objects of a different nature in the same list. For example, if this condition on some flag or field or combination of fields is met, we yield Person.

No idea what to do with backreferencing (for example, when you have one section describing relatives and another section, where data refers to those relatives using their internal id)

@pudo, what do you think.

JS: Icons integrating

@pudo:
So basically we have the following problems:
a) We want to show icons for entities in aleph-ui
b) Ideally, we want to use Palantir’s icon rendering mechanism for that, because it would help us to make the code simpler. At the moment we’re using the react-svg mechanism, so there are always two types of icons in aleph-ui and they follow different layout routes
c) We want to show icons in vis2
In VIS2, I think the relationship between an entity and it’s icon is going to be more flexible than in Aleph
For example, a user might want to create a Person and then choose whether to show a male of female icon
Or they might create a Company, and want to pick between a Factory building, an office building or a little offshore palm icon
d) The icons in Aleph and in VIS2 are not going to be the same size
So we might have a 16px optimised icon in aleph-ui, and a 64px optimised icon in VIS2
e) In vis2 we need to embed the icons into an SVG graphic

Thing's name properties are just strings

It seems to me that a Thing's name property should be of name type and not string type. This way thing's names would be matchable by the various comparison operations.

Thoughts?

Ensure that we have unique labels for ftm properties

Topics

We've been discussing for a while to add topics as a controlled vocabulary to FtM. The idea is that while we have schema (Person, Company), these are very neutral and often don't capture the investigative semantics of an entity. In that sense, topics would be like adjectives: grammatical qualifiers on the knowledge graph.

This draft list is based on the categorisation systems from a few leaked due diligence lists, with personal flourishes:

CRIME
    FRAUD
    CYBER
    FINANCIAL
    THEFT
    WAR
    TERROR
    TRAFFICKING
        DRUGS
        HUMANS
        SMALL THINGS MADE FROM STRAW
CORPORATE
    OFFSHORE
GOVERNMENT
    NATIONAL
    STATE
    MUNICIPAL
    SOE (STATE OWNED)
    PARTY
FINANCIAL SERVICES
    BANK
    FUND
ROLES
    POLITICIAN
    ASSOCIATE
    JUDGE
    DIPLOMAT
    LAWYER
    SPY
    JOURNALIST
    ACTIVIST
UNION
MILITARY
RELIGION
SANCTIONING
    FROZEN
    BANNED
    INVESTIGATED

Crew Lists for Vessels (Multiple Names)

Would it be possible for you to add a property field for Crew on the Vessel Entity?

I'm working with finding some people in historical Crew Lists and it would have been great if there was a Field where I in an easy way could add Names of all the Crew Members (and Occupations and Age if possible) instead of have to create one Entity for each and every one (can be 20-50 for each ship).

And if those Names was automatically linked to a Person Entity (Node) if it existed in the Graph, same way as Places and Dates are...

Licenses and Permits

FtM has a License schema, which is used to model extractive mining rights in some cases. However, the more use case for the term License is for a company or individual to have a permit to engage in some form of regulated activity: banking, selling alcohol, driving a taxi, even driving more generally.

In some ways, the extractives thing is just a particularly weird example of this. But the initial case of mining has lead us to design a schema that's not well suited to model the more normal cases. I'd like to propose a change to the License schema to make it more re-usable. It could be:

an Interval
linked to a LegalEntity, it's holder
allow for a type, a resource and a licenseNo

I don't think there's a lot of licenses actually in use in Aleph right now. If that's not true, maybe we can introduce a new Permit schema instead and mark License as deprecated.

Wikidata enricher

A process for enriching a followthemoney entity by looking it up in Wikidata, then mapping the entities properties from Wikidata RDF to FtM properties, and mapping Wikidata relationships to FtM entities.

Neo4J export overrides name property of entities

https://github.com/alephdata/followthemoney/blob/master/followthemoney/export/graph.py#L156

Fix is quite easy but I don't understand if it's bug or feature.

Adding a polyline property to Trip or other entity types.

As we are considering modeling a Trip in FtM, the thought occurred to me that, like an Address has a lat/lon, a Trip could/should have a polyline geometry to record not just start and end destinations, but the actual trajectory. I can imagine a number of cases where modeling this might be advantageous. One potential is to document a trip's trajectory - a flight, a vessels path from port to port, the path of a person (timeline?). Another might be to support more interesting UI components if we have map views in Aleph.

I am throwing this out there, but I am sure I am overlooking a number of things that make this more complicated than I am imagining it :).

[Memorious] Error: "followthemoney.exc.InvalidModel: Invalid type: gender" when setting the FTM_MODEL_PATH env var

Context:
I'm trying (hard) to extend the FtM schema with a new entity.

I'm running Memorious with FTM_MODEL_PATH set to a specific directory for override.
The directory contains a copy of FtM schema directory.
The directory is well mounted and YAML files are accessible from within the Docker container

dockercompose.yml

volumes:
      - "./build/data:/data"
      - "./config:/crawlers/config"
      - "./src:/crawlers/src"
      - "./ftm-schema:/data/ftm-schema"
 env_file:
      - memorious.env

memorious.env

FTM_MODEL_PATH=/data/ftm-schema

The error happens at loading time, whatever the scraper being run.

root@5cc7cbcfeefc:/memorious# memorious run malta_tenders_scraper
2022-07-12 14:42:53.209217 [info     ] [malta_tenders_scraper->init(seed)]: 09287abea50e4928982f9ee4896380fc [malta_tenders_scraper.init]
2022-07-12 14:42:54.136851 [error    ] Invalid type: gender           [malta_tenders_scraper.init]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/memorious/logic/context.py", line 93, in execute
    return self.stage.method(self, data)
  File "/usr/local/lib/python3.8/dist-packages/memorious/logic/stage.py", line 23, in method
    func = get_entry_point("memorious.operations", self.method_name)
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/extensions.py", line 21, in get_entry_point
    return get_entry_points(section).get(name)
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/extensions.py", line 16, in get_entry_points
    EXTENSIONS[section][ep.name] = ep.load()
  File "/usr/local/lib/python3.8/dist-packages/pkg_resources/__init__.py", line 2458, in load
    return self.resolve()
  File "/usr/local/lib/python3.8/dist-packages/pkg_resources/__init__.py", line 2464, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python3.8/dist-packages/memorious/operations/ftm.py", line 2, in <module>
    from ftmstore import get_dataset as get_ftmstore_dataset
  File "/usr/local/lib/python3.8/dist-packages/ftmstore/__init__.py", line 4, in <module>
    from ftmstore.store import Store
  File "/usr/local/lib/python3.8/dist-packages/ftmstore/store.py", line 6, in <module>
    from ftmstore.dataset import Dataset
  File "/usr/local/lib/python3.8/dist-packages/ftmstore/dataset.py", line 5, in <module>
    from followthemoney import model
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/__init__.py", line 14, in <module>
    model = Model(model_path)
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/model.py", line 37, in __init__
    self._load(os.path.join(path, filename))
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/model.py", line 61, in _load
    self.schemata[name] = Schema(self, name, config)
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/schema.py", line 186, in __init__
    self.properties[name] = Property(self, name, prop)
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/property.py", line 94, in __init__
    raise InvalidModel("Invalid type: %s" % type_)
followthemoney.exc.InvalidModel: Invalid type: gender

Exporter for CSV and Excel

For a given stream of entities, let's implement an abstraction to:

Serialise each entity into a table entry, with the table named after the entity schema.
Merge multi-valued properties in a semantically useful way. Default could just be '; '.join(values)
Sort the output columns so that featured properties appear first, alphabetically after that.
Also have convenience functions to write to CSV and OpenPyxl.

Also, let's implement a CLI version of this in util/followthemoney_util.

Refactor type validation/normalization

Every type should have only three methods: validate, clean or normalize.

[Memorious][Aleph] Custom entities emitted by Memorious scraper to Aleph are not visible

I added a simple CallForTender entity in the FtM schema,
added the extended schema directory to aleph docker containers
set the FTM_MODEL_PATH variable
I did make upgrade aleph
started aleph
ran the memorious scraper that emits CallForTender entities

The logs look clean but the dataset appears empty in the web UI (0 entities in created Dataset)

What aleph worker logs say:

2022-07-13 19:55:15.035389 [info     ] [malta_call_for_tenders_scraper] Update entity: 436398349d0f0020cab915393e04cc7f.a81dabaf7384e36f4fe344f78ec9f394d9b1a248 [aleph.logic.entities]
2022-07-13 19:55:15.168270 [debug    ] Pipeline entity [436398349d0f0020cab915393e04cc7f.a81dabaf7384e36f4fe344f78ec9f394d9b1a248]: Call for tender [aleph.queues]
2022-07-13 19:55:15.169629 [info     ] Task [2]: updateentity (done)  [aleph.worker]
2022-07-13 19:55:15.178224 [info     ] Task [2]: index (started)      [aleph.worker]
2022-07-13 19:55:15.184360 [debug    ] [malta_call_for_tenders_scraper] Indexed 1 entities [aleph.logic.collections]
2022-07-13 19:55:15.191560 [info     ] Task [2]: index (done)         [aleph.worker]
2022-07-13 19:55:25.242442 [info     ] Task [2]: updateentity (started) [aleph.worker]
(edited)
[10:12](https://alephdata.slack.com/archives/CE111PS2Y/p1657743176470059)

What aleph api logs say:

aleph-api-1  | 172.27.0.8 - - [13/Jul/2022 19:55:12] "POST /api/2/entities/436398349d0f0020cab915393e04cc7f HTTP/1.1" 200 -
aleph-api-1  | 2022-07-13 19:55:12.148271 [info     ] 172.27.0.8 - - [13/Jul/2022 19:55:12] "POST /api/2/entities/436398349d0f0020cab915393e04cc7f HTTP/1.1" 200 - [werkzeug]
aleph-api-1  | 2022-07-13 19:55:17.085889 [info     ] Request handled                [aleph.views.context] request_logging=True

Split values by string into a list in a mapping

The opposite of join

TypeError: 'TraversibleType' object is not subscriptable

Hi! When running any ftm command I get:

(base) ~ ♥ ftm --help
Traceback (most recent call last):
  File "/home/zufank/anaconda3/bin/ftm", line 5, in <module>
    from followthemoney.cli.cli import cli
  File "/home/zufank/anaconda3/lib/python3.9/site-packages/followthemoney/__init__.py", line 3, in <module>
    from followthemoney.model import Model
  File "/home/zufank/anaconda3/lib/python3.9/site-packages/followthemoney/model.py", line 9, in <module>
    from followthemoney.mapping import QueryMapping
  File "/home/zufank/anaconda3/lib/python3.9/site-packages/followthemoney/mapping/__init__.py", line 1, in <module>
    from followthemoney.mapping.query import QueryMapping
  File "/home/zufank/anaconda3/lib/python3.9/site-packages/followthemoney/mapping/query.py", line 6, in <module>
    from followthemoney.mapping.sql import SQLSource
  File "/home/zufank/anaconda3/lib/python3.9/site-packages/followthemoney/mapping/sql.py", line 46, in <module>
    class SQLSource(Source):
  File "/home/zufank/anaconda3/lib/python3.9/site-packages/followthemoney/mapping/sql.py", line 66, in SQLSource
    def get_column(self, ref: Optional[str]) -> Label[Any]:
TypeError: 'TraversibleType' object is not subscriptable

I'm on Fedora 35
python 3.9
pip 22.0.4

Here are all the packages versions:

(base) ~ ♥ python -m pip install --upgrade followthemoney
Requirement already satisfied: followthemoney in ./anaconda3/lib/python3.9/site-packages (2.9.1)
Requirement already satisfied: babel<3.0.0,>=2.9.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (2.9.1)
Requirement already satisfied: python-levenshtein<1.0.0,>=0.12.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (0.12.2)
Requirement already satisfied: openpyxl<4.0.0,>=3.0.5 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (3.0.9)
Requirement already satisfied: pyyaml<7.0.0,>=5.0.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (6.0)
Requirement already satisfied: fuzzywuzzy[speedup]<1.0.0,>=0.18.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (0.18.0)
Requirement already satisfied: types-PyYAML in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (6.0.7)
Requirement already satisfied: fingerprints<2.0.0,>=1.0.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.0.3)
Requirement already satisfied: normality<3.0.0,>=2.1.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (2.2.5)
Requirement already satisfied: prefixdate<1.0.0,>=0.4.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (0.4.1)
Requirement already satisfied: sqlalchemy2-stubs in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (0.0.2a22)
Requirement already satisfied: sqlalchemy<2.0.0,>=1.2.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.4.22)
Requirement already satisfied: requests<3.0.0,>=2.21.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (2.26.0)
Requirement already satisfied: phonenumbers<9.0.0,>=8.12.22 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (8.12.48)
Requirement already satisfied: click<9.0.0,>=7.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (7.1.2)
Requirement already satisfied: pantomime<1.0.0,>=0.5.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (0.5.1)
Requirement already satisfied: pytz>=2021.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (2021.3)
Requirement already satisfied: networkx<2.9,>=2.5 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (2.6.3)
Requirement already satisfied: stringcase<2.0.0,>=1.2.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.2.0)
Requirement already satisfied: languagecodes<2.0.0,>=1.1.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.1.1)
Requirement already satisfied: rdflib<6.2.0,>=6.1.0 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (6.1.1)
Requirement already satisfied: countrynames<2.0.0,>=1.9.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.11.1)
Requirement already satisfied: python-stdnum<2.0.0,>=1.16 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.17)
Requirement already satisfied: banal<1.1.0,>=1.0.1 in ./anaconda3/lib/python3.9/site-packages (from followthemoney) (1.0.6)
Requirement already satisfied: text-unidecode in ./anaconda3/lib/python3.9/site-packages (from normality<3.0.0,>=2.1.1->followthemoney) (1.3)
Requirement already satisfied: chardet in ./anaconda3/lib/python3.9/site-packages (from normality<3.0.0,>=2.1.1->followthemoney) (4.0.0)
Requirement already satisfied: et-xmlfile in ./anaconda3/lib/python3.9/site-packages (from openpyxl<4.0.0,>=3.0.5->followthemoney) (1.1.0)
Requirement already satisfied: setuptools in ./anaconda3/lib/python3.9/site-packages (from python-levenshtein<1.0.0,>=0.12.0->followthemoney) (58.0.4)
Requirement already satisfied: pyparsing in ./anaconda3/lib/python3.9/site-packages (from rdflib<6.2.0,>=6.1.0->followthemoney) (3.0.4)
Requirement already satisfied: isodate in ./anaconda3/lib/python3.9/site-packages (from rdflib<6.2.0,>=6.1.0->followthemoney) (0.6.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.21.0->followthemoney) (1.26.7)
Requirement already satisfied: certifi>=2017.4.17 in ./anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.21.0->followthemoney) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in ./anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.21.0->followthemoney) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in ./anaconda3/lib/python3.9/site-packages (from requests<3.0.0,>=2.21.0->followthemoney) (3.2)
Requirement already satisfied: greenlet!=0.4.17 in ./anaconda3/lib/python3.9/site-packages (from sqlalchemy<2.0.0,>=1.2.0->followthemoney) (1.1.1)
Requirement already satisfied: typing-extensions>=3.7.4 in ./anaconda3/lib/python3.9/site-packages (from sqlalchemy2-stubs->followthemoney) (3.10.0.2)
Requirement already satisfied: six in ./anaconda3/lib/python3.9/site-packages (from isodate->rdflib<6.2.0,>=6.1.0->followthemoney) (1.16.0)

Is there something I can do?

Make Passport more general

We have a Passport schema, which is relatively specific but in its description states "this can be some other type of identification". I'd like to make a separate Identification schema from which Passport inherits and which can be used for many other types of ID (e.g. ID cards, voter registration, etc.)

Support arrays for filters

So we can filter a field by a list of possible things.

Proposed schema: Role/Post/Position

This one is an odd one, so I'm pitching it here before implementing it in a PR. Basically, a lot of data sources I see at OpenSanctions contain the notion of someone holding an office (e.g. President of country X, member of parliament Y). Now, in theory a lot of this could be expressed as Membership, Employment or Directorship, but there's some issues:

a) You don't always know which one of these it is from the data (e.g. is Biden an employee of the US, or a Member of it?)
b) It ends up leading to the creation of dumb stub entities, like "Republic of X", or "Commission to do Z" that exist only to express someone holding a position.

So I'd like to propose making a new model for Post, in which the "far side" (Organization) is optional and can also be a string. Otherwise it'd be a pretty normal Interval linking to a Person. It's duplicative with the schemata mentioned above, but just allows expressing a different category of data.

What do people think?

Add Boolean Functionality To Filter Operation (in yaml schema definition)

I've been working with the Aleph Schema on yaml files with ftm(cli) in the past month and I think it's fit if you add boolean functionality (if not present) to the Filter operation and allow it to span mutliple columns.

Currently, I've seen that among the supported operations are:

filter and filter_not

And apparently, they work for single columns. Listing multiple columns has no effect on all columns listed after the first one (correct me if I'm wrong).

The proposed functionality will have something like this:

  filter_and:
    columnA: "FilterClauseA"
    columnB: "FilterClauseB"

This filters each row entry based on the stated criteria from columnA, columnB, columnC,......, columnN. Returned rows must meet all stated criteria per column. Just like an AND operation.

  filter_or:
    columnA: "FilterClauseA"
    columnB: "FilterClauseB"

This filters each row entry based on the stated criteria from columnA, columnB, columnC,......, columnN. Returned rows should meet at least one stated criteria per column. Just like an OR operation.

IMO, the schema definition on yaml files (like a DSL) is one of the interesting features in FtM/Aleph. I don't know if the proposed features exist, if not, it seems fit for them to be added to improve the DSL's flexibility and reduce dependency on writing extra code.

Transaction is vague

Improve how Transactions are modelled. Figure out better terms than to and from for the involved parties. Maybe flatten it altogether. Discuss?

Transaction kind of things include:

Payments
Property
Debts/mortgages
Economic activity
Human communication?
...

Map csv as a memorious operation

Whenever we download a bulky csv file and want to run a mapping against it.

Implement normalisations

Name normalisation (by counting the Levenstein distance);
Birth dates;
Max on retreivedAt/last-seen.

URLs miss the fragment after '#'

urlnormalizer.normalize_url returns urls with the part that follows "#" dropped.
"https://a.b.c/index.jsp#/ROC/companyDetailsRO.do?action=companyDetails&companyId=123" becomes "https://a.b.c/index.jsp"

Introduce Address schema

I think it's time we introduced an Address object with a link from Thing. What fields do we need?

co
street
apartment
city, town, settlement
postalCode
region
country
text (for when it's just a long string).

How can we then have a special function that keys it into a single string?

All the automation

Make Passport entity type?

See alephdata/aleph#213 - not sure this is ideal because it means passport IDs aren't part of the core entity denormalisations.

Proposal: namespaces/external entity definitions.

Foreword

Basic FTM ontology is nice (despite carrying some legacy fields) but not always does it fit the task. Great success

In the case you want to extend the basic ontology you have following options:

Rewrite it from scratch according to your needs
Alter key entities by adding new fields, maybe with prefix
Create your own descendants to existing ontology to give your version some regional or task-specific flavor.

First two methods doesn't really fit some tasks, as long as you probably want to have the backward compatibility to reuse the data that aleph, opensanctions and other sources provide you for free.

Problem

In case when you are extending the ontology you have to have the way to distinguish original entities from the one you've added. For example, to be able to export in the original format by finding each parent that came from original ontology. Also you need a clear cue to show what was added to it.

Proposed solution

This can be solved with a new base field, for example, a namespace, where all the original entity types will have namespace=base or something. In this case the developer will be able not only to extend the original ontology but also to re-use ontologies published by others (again, with regional/task-specific flavor, maybe committed to contrib part of this repo).

@pudo also proposed to extend this and allow to specify the public url of the entity type. In this case Aleph will be able to deal with the data that was mapped to an extended ontology, loading/caching/maintaining the list of external definitions, as long as they are derived from unaltered original ontology.

"Error: No schema for entity." while loading from opensanctions

I am getting an "Error: No schema for entity" while trying to load eu_fsf and us_ofac_sdn from opensanctions into an aleph instance.

This is the versions of python and the relevant packages I am using:

# python3 --version
Python 3.8.10

# pip show followthemoney
Name: followthemoney
Version: 2.7.3

# pip show alephclient
Name: alephclient
Version: 2.3.5

Here is what I am calling and the output I am getting:

# curl -s https://data.opensanctions.org/datasets/latest/eu_fsf/entities.ftm.json | /usr/local/bin/alephclient write-entities -f eu_fsf;
INFO:alephclient.cli:[eu_fsf] Bulk load entities: 1000...
Error: No schema for entity.

Also:

# curl -s https://data.opensanctions.org/datasets/latest/us_ofac_sdn/entities.ftm.json | /usr/local/bin/alephclient write-entities -f us_ofac_sdn;
INFO:alephclient.cli:[us_ofac_sdn] Bulk load entities: 1000...
INFO:alephclient.cli:[us_ofac_sdn] Bulk load entities: 2000...
INFO:alephclient.cli:[us_ofac_sdn] Bulk load entities: 3000...
INFO:alephclient.cli:[us_ofac_sdn] Bulk load entities: 4000...
INFO:alephclient.cli:[us_ofac_sdn] Bulk load entities: 5000...
Error: No schema for entity.

It seems that it comes from here: https://github.com/alephdata/followthemoney/blob/master/followthemoney/proxy.py#L61

Alignment of BvD & OC returns, where available (thinking complementary not overwriting)
Alignment with FTM schema, where not already in place (q for Pudo : is current corpint aligned?.. mostly?)
Model selection/filtering of primary-company api returns
Incorporation of corporate group api calls where available
Batch run scheduling with prioritization option for research requests (scheduled for off hours in the case of BvD) (work with Sunu to set this up?)

Add FtM schema for Calls, Messages, UserAccounts and Locations

As part of alephdata/aleph#1028 and going forward, we need these schema to model data into FtM land.

Here's the rough outline we discussed:

UserAccount <- Thing
name
email
number
service
password
owner [LegalEntity]
---

Call <- Interval
caller [LegalEntity]
callerNumber
receiver [LegalEntity] -- 1...n
receiverNumber
duration
(( recording [Audio] ))
---

Communication <- Interval
---

Message <- Communication
(i.e. SMS, InstantMessage)
sender [LegalEntity]
senderAccount [UserAccount]
receiver [LegalEntity] -- 1...n
receiverAccount [UserAccount]
Content
Attachment
Timestamp [date]
(( conversationID ))
(( sequenceNumber ))
inReplyTo [Message]
deleted
---

Location
(i.e. Journey, Location, BluetootDevice?, WirelessNetwork?)

alephdata / followthemoney Goto Github PK

followthemoney's Introduction

Follow the Money

Documentation

Development environment

Releasing

followthemoney's People

Contributors

Stargazers

Watchers

Forkers

followthemoney's Issues

Freely licensed data, in bulk

Foreword

Problem

Proposal

Foreword

Problem

Proposed solution

Recommend Projects

Recommend Topics

Recommend Org