alephdata / ingest-file Goto Github PK

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

License: GNU Affero General Public License v3.0

Makefile 0.58% Python 96.80% Dockerfile 2.62%

documents excel forensics forensics-investigations metadata-extraction ocr email-forensics document-extraction

ingest-file's Introduction

ingestors

ingestors extract useful information from documents of different types in a structured standard format. It retains folder structures across directories, compressed archives and emails. The extracted data is formatted as Follow the Money (FtM) entities, ready for import into Aleph, or processing as an object graph.

Supported file types:

Plain text
Images
Web pages, XML documents
PDF files
Emails (Outlook, plain text)
Archive files (ZIP, Rar, etc.)

Other features:

Extendable and composable using classes and mixins.
Generates FollowTheMoney objects to a database as result objects.
Lightweight worker-style support for logging, failures and callbacks.
Throughly tested.

Development environment

For local development with a virtualenv:

python3 -mvenv .env
source .env/bin/activate
pip install -r requirements.txt

Release procedure

git pull --rebase
make build
make test
source .env/bin/activate
bump2version {patch,minor,major} # pick the appropriate one
git push --atomic origin $(git branch --show-current) $(git describe --tags --abbrev=0)

Usage

Ingestors are usually called in the context of Aleph. In order to run them stand-alone, you can use the supplied docker compose environment. To enter a working container, run:

make build
make shell

Inside the shell, you will find the ingestors command-line tool. During development, it is convenient to call its debug mode using files present in the user's home directory, which is mounted at /host:

ingestors debug /host/Documents/sample.xlsx

License

As of release version 3.18.4 ingest-file is licensed under the AGPLv3 or later license. Previous versions were released under the MIT license.

ingest-file's People

Contributors

Stargazers

Watchers

ingest-file's Issues

Tabular data result format

While working on the #3 I ended up reviewing the way the tabular data is handled.

Currently, Aleph schema is replicating the exact format a tabular document requires in order to be displayed. To be precise, an ingested tabular document will result in a bunch of document_record records. Any other document is treated as a document_page.

In order to standardise the ingestors result, I suggest the following changes:

the tabular ingestor result will return every row as it would be a page
the sheet name and other details will become part of the metadata, where the headers and cell values will end up in a mapping (similar to the way it is stored in Aleph right now)
the cell type will be cast to the cell value
the final result can be stored within the same DB schema as any other document format

Let's take an example of the tabular file bellow with two sheets:
(1st sheet named: With Name)

name, timestamp, price
Mihai Viteazu, 29/12/1923 22:00, 99.11

(2st sheet named: With Title)

title, timestamp, price
Vlad Țepeș, 20/01/1857 16:59, 111

Will end up as the following ingestor:

ingestor(file_name='my.xls').pages[0].result.sheet -> 'With Name'
ingestor(file_name='my.xls').pages[0].result.content -> {"name": "Mihai Viteazu", "timestamp": "ISO 8601 formatted 29/12/1923 22:00", "price": 99.11}

ingestor(file_name='my.xls').pages[1].result.sheet -> 'With Title'
ingestor(file_name='my.xls').pages[1].result.content -> {"title": "Vlad Țepeș", "timestamp": "ISO 8601 formatted 20/01/1857 16:59", "price": 111}

One thing to note, this will be leveraging fully the JSON data types, which means that dates/times will have to be considered text. This eventually can be solved by the Dalet project, but I don't think we should be concerned about this limitation (please correct me if I'm wrong).

This approach, in my opinion, will simplify and improve the parsing and display of the large tabular files. And eventually, will allow us to simplify the Aleph documents UI.

In order to preserve the Aleph backwards compatibility, ingestors can provide all required details for that, but I don't think it should replicate the exact same format.

/cc @pudo @mgax @smmbllsm

Replace nosetests to adapt to Python 3.10

nosetests is dead, and probably the better fixture handling in pytest will be an overall gain for the ingestors.

Update PDF parsing strategy

What's broken?

We're seeing incorrect text extraction out of some documents, especially those containing Arabic text.
Text from images isn't being extracted into the right location in the remaining text
We have to maintain our own PDF binary bindings
We are extracting images in the documents to files first, then running OCR. Don't really need to put them on disk.

What are our options?

Continue with pdflib
Try out pdfreader - https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-start
Try out pdfminer.six

Make ingestors emit FtM entities

We should start a feature branch to make the ingestors emit data in the format defined by followthemoney. This may require a set of changes:

followthemoney needs new schemata and properties, including: Record, and its children Row, Page.
Generate natural IDs for the FtM entities emitted so they are stable across loads. The IDs should include: a. the dataset name, b. the foreign_id or content_hash of the data, c. an index within the document (for Page and Row).
Make email ingestors generate valid entity links for sender, recipient, cc and bcc. Also make them link to other messages via thread IDs. This is going to be complicated, might require maintaining a list of mappings in memory.
Switch all of ingestors data normalisers to use FtM ingestors. We may want to introduce extra property types in FtM, e.g. mimetype.

Should OCR be a microservice injected into the ingestors

Pro:

Allows better allocation across cluster
Makes it easier to migrate to Tesseract 4 (and perhaps even use 3 and 4 at the same time)
Can be swapped out for Google Vision API or similar

Con:

Microservices suck

Consider doing PST support inline?

There's a C library with Python bindings, libpff which we could use to extract messages from Outlook PST files. This may be significantly faster and require less disk space than the current readpst method.

But looking at the library, the message data model seems very limited, e.g. it doesn't seem to expose the recipients of a message. Need to find out if that can be gotten, and if this is stable.

How To Fix sqlalchemy.exc.TimeoutError

Hello
Thank you for your contribution.
I am using the latest version(3.16.1) of docker image of this project and facing the following error when I ingest the file.

Could you kindly guide me how I can fix this?
I am using docker image and it used to be restarted cause of this error.
Thank you

Test ingest-breaking files collection

cf. pcbje/gransk#9

Material is here: https://gitlab.com/dzmitry-lahoda/ediscovery-files/tree/master/assets

We're already using a good number of their fixtures, but we need more edge cases.

Generate events from iCal files

Add support for iCal calendar invites and turn them into FtM events.

Consider switching to a more reliable `chardet` alternative

... like https://github.com/Ousret/charset_normalizer

Lingua as an alternative language detection library to fasttext

https://github.com/pemistahl/lingua-py
https://pemistahl.github.io/lingua-py/

Upsides: better and well studied accuracy than the lid.175 model of fasttext
Downsides: 75 languages supported vs 176 for fasttext lid.175 model

Support using balkhash as an output sink

balkhash is the mechanism for persisting followthemoney data. This library should by default output its results to a balkhash dataset (either in postgresql, leveldb, or google datastore).

PDFs with large spacing between letters are not properly parsed

This is a regression that came with the change to pdfminer.six in #42 and currently affects 3.18.0-rc2.

Spacing like in this file cause the parser to identify individual letters instead of words:

Handling of Outlook MSG files and RTF bodies

You have to give Microsoft credit for its consistency: instead of storing E-Mail messages in Outlook as RFC822 plain text, they came up with their own super funky file format based on OLE. We often see these in leaks (for example: the entire Panama Papers).

In Python, the most popular parser for MSG files is msg-extractor, but it's maintained by a developer who seems to prioritise implementing the spec over building a tool that could parse all files found in the wild. I tried to fix up the encoding support in the library at some point, but the PR was rejected on the basis that I should request that the source of the files fix their Exchange settings. This did not seem like a healthy option, vis a vis the Russian mafia.

So I started to maintain a fork and eventually ended up cleaning it up quite significantly. Still, two issues are pretty persistent:

a) Encodings in this file format are a mess and many files seem to be outright damaged. msglite 0.30 does much more work on this, but I'm still pretty sure we'll see issues in the future.

b) Outlook re-formats the body of many emails into RTF (Rich text format) upon receipt. In the best case, this means that each .msg file contains an HTML, a plain text and an RTF version. But sometimes the plain text is encoding-fucked, while the RTF is not. Now, msglite does provide a msg.rtfBody property with that version, but processing it further in Python is a bit difficult.

We can either - as we do currently - save the RTF as its own file and essentially declare it an attachment to the message. The attachment is then processed using convert-document and turned into a PDF. This is at best annoying (because people now need to understand that the attachment is part of a real email), and also duplicative if the main message body was extracted correctly.

The other option would be to use striprtf to turn the RTF body into a plain text body. Unfortunately, the lib currently provides an extremely naive implementation of RTF that does not handle encodings other than unicode (cf. joshy/striprtf#11 - but the issue is larger than described there). We might want to consider PRing proper encoding support into striprtf and then adopting it.

Improve convert retry handling

Our current retry logic for converting documents (shelling out to LibreOffice) is based on two constants: the number of retry attempts and the timeout

ingest-file/ingestors/support/convert.py

Lines 16 to 17 in fca65fb

 TIMEOUT = 3600 # seconds 

 CONVERT_RETRIES = 5

What would be more desirable is a faster first fail which could be increased to a maximum.

For instance: right now we retry up to 5 times and timeout after 3600s (1 hour). We could potentially get much better throughput by having a first timeout after 600s (10 minutes) which gets progressively larger (with a potential max cap). To illustrate:

TIMEOUT_START=600
TIMEOUT_INCREASE=900
TIMEOUT_MAX=3600
CONVERT_RETRIES=5

This would result in up to 5 retries with timeouts of 10, 25, 40, 55 and 60 minutes. Ideally "stuck" convert tasks would time out much sooner and get queued up for a retry faster.

TODO

try get some data on average(and maximum?) time it takes to convert a document
make the timeout and retry settings respect their respective settings.

Emit LegalEntities for participants in email conversations

When parsing email messages, the sender, recipient and carbon-copied parties should be emitted as LegalEntities in followthemoney schema. We may need to extend the schema for Email to support linking them to the message object. I assume the ID of each of these entities should be based on their email address in a normalised form.

In Aleph, tasks seem to be stuck and no progress is made

For whatever reason (logs output nothing helpful), some tasks seem to be failing and this blocks all other tasks. We have millions of enqueued tasks, and tasks continuously start and fail causing the containers to crash.

It would be helpful if failed tasks were moved to the back of the queue so things would progress at all until only failing tasks remain.

Missing extracted entities

On some documents we are not seeing all the extracted entities that we expect. When running the entity extraction model manually we see many more organizations pop up than what we find in the UI for that pdf.

As an example, this document only has a small list of mentioned entities:

https://aleph.occrp.org/entities/37674183.14f08c45eefae92e8c9eb0a109dc42673faed3f6#mode=tags&page=2

Running that page through spacy though gives us many other entities, including "Joseph Elvinger" who is not shown in Aleph.

Handle TIFFs inside PDFs

Tracking alephdata/aleph#2810

Seems like there is a regression in 3.18.0 which breaks the OCR process when TIFF files are inside a PDF.

Add NER support for Norwegian, Danish and Dutch

We had people ask for Norwegian support on the Slack. Danish and Dutch are good to have too. spacy has support for these languages.

Use servicelayer

For OCR, cache and archive access.

cf. alephdata/aleph#616

Update tesseract library

From tesserwrap. Probably: https://github.com/sirfz/tesserocr - seems to have the most features of the several out there.

Attachments in emails from PST files are not linked to the corresponding Email entity

attachments are being extracted as separate files in the same folder as the email, but their parent property is not being set to the email entity. As a result, when viewing the email, it looks like the email had no attachments.

XLSM file is processed as an archive rather than an Excel file

Drag and drop .xlsm file into Aleph dataset
A folder with the name of the file is generated, contained a bunch of XML files

This is as though on your desktop you rename the file extension to .cab and extract it using Keka or similar.

Willing to supply a sample file privately.

Create ingest-file microservice to run ingest process outside Aleph

Pull out Ingestors out of Aleph into a microservice called ingest-file. We'll have 3 main components

An API
- Lets Aleph push a file hash, metadata and the associated collection id to be ingested
- Lets Aleph query the progress of ingestion of a collection
- Uses Redis as a queue to keep track of what's to be ingested
A set of daemons running Ingestors
- Pull files from storage based on the Redis queue
- Run them through Ingestors and put the extracted FtM entities into Balkhash and push the new extracted files into storage
- Update the progress on Redis
- We can make use of AsyncIO because pulling and pushing files and entities will be I/O heavy.
A daemon pushing entities into the bulk API
- Check which files are done being ingested from Redis
- Push the new FtM entities from Balkhash into Aleph through the bulk API

We'll use service-layer for accessing Redis and the OCR service.

  +---------------+       +------------+
  |               |       |            |
  |               |       |            |
+-> STORAGE LAYER <-------+  ALEPH API +<------------------+
| |               |       |            |                   |
| |               |       |            |                   |
| +-+-------------+       +-+--+-------+                   |
|   |                       |  ^                           |
|   |       +---------------+  |                           |
| +-----------------------------------------------------------------------+
| | |       |      |-----------+                           |              |
| | |  +----v---------+         +---------+                |              |
| | |  |              |         |         |                |              |
| | |  |              +--------->         |                |              |
| | |  | INGESTOR API |         |         |                |              |
| | |  |              |         |         |       +--------+-----------+  |
| | |  |              <---------+         |       |                    |  |
| | |  +--------------+         |         |       |                    |  |
| | |                           |  REDIS  |       |                    |  |
| | |                           |         +------->                    |  |
| | |  +------------------+     |         |       |  BULK PUSH DAEMON  |  |
| | |  |                  |     |         <-------+                    |  |
| | |  |                  |     |         |       |                    |  |
| | +->+                  +----->         |       |                    |  |
| |    | INGESTOR THREADS |     |         |       |                    |  |
| |    |                  |     |         |       |                    |  |
+------+                  |     +---------+       +-----------^--------+  |
  |    |                  |                                   |           |
  |    +------------------+                                   |           |
  +-----------------------------------------------------------------------+
                 |                                            |
                 |                                            |
                 |              +-----------------------------+-----------+
                 |              |                                         |
                 |              |                                         |
                 +-------------->                BALKHASH                 |
                                |                                         |
                                |                                         |
                                +-----------------------------------------+

Error while processing rows of a sheet creates orphaned empty tables at the root of a collection

While processing excel files and other tabular formats, we iterate through the rows and emit Table fragments with only each row's text in indexText and generate a csv file out of all the rows. Only after that's all done, we emit a Table entity with parent information and other metadata like name etc.

Some times we encounter errors while iterating through the rows. At this point we have put some Table fragments in ftm-store with no parent info on them. Due to the error, we stop processing the entity and we never send the parent, name properties to ftm-store. The merged entity from all these fragments ends up creating a empty Table entity at the root of the collection since it doesn't have a parent property set.

Validate IBANs

ingest-file extracts IBANs using a rather simple regex. This can lead to a lot of false positives. ingest-file could add additional validation for matches in order to improve precision:

Validating the length depending on country
Validating checksums
…

We should consider that the text the extraction is performed on is often the result of OCR processing which may detect characters incorrectly. If an IBAN’s checksum isn’t correct, that may be due to OCR having misdetected a character etc.

Use Tesseract 4

With an option to switch back to 3.04 in case of emergencies.

Add the version of ingest-file used to parse a document into the document entity

#466

Revert to Python's built-in email parsing module

We currently use mailgun's flanker library to process RFC822 messages. The library tries to normalise email messages pretty heavily, but has a number of downsides (see below). At the same time, it looks like Python 3 has significantly cleaned up it's email handling compared to Python 2, so it might be palatable to just use the built-in parser instead of the extensive wrapping.

flanker issues:

No clear release process, releases are not shipped to PyPI
Removes Re: , Fwd: from email subjects, which is super disorienting

Dependabot can't resolve your Python dependency files

Dependabot can't resolve your Python dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

pip._internal.exceptions.DistributionNotFound: No matching distribution found for pdflib==0.3.0 (from -r requirements.in (line 23))

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

Extract crypto wallet addresses

ingest-file could extract crypto wallet addresses for popular crypto currencies using regular expressions, similar to it already extracts email addresses and IBANs.

While ElasticSearch and Aleph do support searching using regexes which can be used to find mentions of such addresses, ElasticSearch’s regex capabilities are limited, e.g. a regex must always match a full token. It can be difficult or impossible to come up with a valid ES regex that matches valid addresses and is precise at the same time.

CLI interface to feed tasks to the Redis task queue

We should have a way to feed documents into the Redis ingestion queue via the command line for debugging purposes.

Ingestor for HEIC / HEVC files

Ingestors don't currently support HEIC / HEVC images.

refs alephdata/aleph#1982

Extract metadata from Audio and Video files

Support for:

ID3
What are other container formats that have metadata? MKV?

Norwegian NER does not seem to work

Hi. I am trying to upload documents (PDFs with text) to Aleph v 3.11.1 using ingest-file 3.16.0

As far as I can see the newest ingest-file includes NER-support for Norwegian (nb).

The dataset is set to Norwegian and I have also tried to specify nor as language in multipart upload and tested manual upload of the PDFs in Aleph UI.

But I still can not see that entities/mentions are extracted. Any hints on where to start debugging this? Example file: https://www.bergen.kommune.no/api/rest/filer/V105369 (this should contain a lot of NER entities)

These are extract from the logs from ingest-file and convert-document:

637249502.7992988, "stage": "ingest", "message": "OCR: 2 chars (from 50082 bytes)", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.343269", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "w: 946, h: 165, l: eng+nor, c: 95, took: 0.04451", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.353464", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "OCR: 2 chars (from 73257 bytes)", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.456324", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "w: 946, h: 165, l: eng+nor, c: 90, took: 0.09722", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.support.ocr", "timestamp": "2021-11-18 15:31:52.466132", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "cf0b59f2-fbe4-4cdb-aab7-23a547ff7132", "start_time": 1637249502.7992988, "stage": "ingest", "message": "OCR: 79 chars (from 93366 bytes)", "severity": "INFO"}
ingest-file_1       | {"logger": "ingestors.analysis.language", "timestamp": "2021-11-18 15:31:52.578159", "dataset": "15", "job_id": "6:89a95cfa-8346-4f45-8573-ee4de6e4df36", "version": "3.16.0", "trace_id": "549ad73f-eb18-494e-a8d9-6f8d71f92b56", "start_time": 1637249512.5727026, "stage": "analyze", "message": "Detected (2586 chars): no -> 0.836", "severity": "DEBUG"}

There is trace of OCR and language detection (which is no while the NER model is nb, maybe that confuses the system), but no trace of NER.

Excerpts from the document returns entities when running it with:

import spacy

nlp = spacy.load("nb_core_news_sm")

text = ("""Det ble utarbeidet tekniske planer for dam Munkebotsvatnet for valgt løsning, ferdig den 22.4.2016.
NVE godkjente tekniske planer (NC) i brev 12.8.2016. I brevet presiserer NVE følgende:
I og med at dammen tilhører et vassdragsanlegg uten konsesjon vil det for resten av utbyggingen være
Bergen kommune som skal stå for saksbehandling og kontroll etter plan- og bygningsloven (PBL), jf.
Forskrift om byggesak. Følgelig er det kommunen som gir nødvendige tillatelser til å gjennomføre de
deler av utbyggingsprosjektet som ikke angår selve dammen, dvs. fangdam, tilkomstveier m.m.""")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Returns

Munkebotsvatnet ORG
NVE ORG
NC ORG
NVE ORG
Bergen kommune ORG
PBL ORG

Switch to using pathlib everywhere

We are dealing with a lot of file paths, some of which may come as semi-broken Unicode thanks to the unix file system conventions. We should be passing around pathlib objects rather than strings of any sort in order to make path traversal happen.

Consider merging convert-document into main app

At the moment, the ingestors will call on an HTTP service provided by convert-document (in this repo) to convert documents in various types (Word, Powerpoint, etc.) to PDF files, which then the ingestors know how to handle. convert-document the uses LibreOffice to do the conversion and returns the PDF. The reasoning behind this was:

a) We wanted to run a single instance of LibreOffice in each instance of convert-document and then use UNO, a built in IPC mechanism, to submit documents from the HTTP server. This would have shaved off the startup overhead for LibreOffice. However, the python UNO bindings are badly maintained and LibreOffice, when called via UNO, has a tendency to lock up when prompted to process broken office documents. After exhaustively debugging this with limited success, we've now switched convert-document back to a process-based model where it shells out to call soffice --headless.

b) Another advantage of running convert-document in its own container is that we can insulate LibreOffice, a complex application that does networky things. Unlike the ingest-file service, convert-document does not need credentials for ftmstore, the task redis, or even outbound network access at all. We've been using NetworkPolicy in k8s production to essentially block it into its container.

Still, there are also downsides to running c-d in it's own container:

a) LibreOffice can only ever run one action at the same time. So the c-d HTTP app implements exclusive locking, and will return HTTP 503 codes if it receives a request while already processing a document. The 503 is then used by ingest-file to round-robin the available c-d instances and to implement incremental back-off. Doing all this is quite a big of hassle and in some cases becomes a major perf bottle neck on document imports.

b) We're shipping two separate units, both with a somewhat fattish ubuntu installed. This probably also creates extra attack surface.

So if we're not going to be using the UNO API going forward, I feel like we should perhaps just inline convert-document to make the ingestors directly shell out to soffice --headless in their own container. It's a small security loss but a big simplification of the overall operation of the system. We also then need to plan for the memory spikes that would occur in ingest-file while it starts up libreoffice (ca. 200MB or so).

Investigate failing tests (currently skipped)

There are two tests currently marked as @skip in the tests dir:

Both fail.
The root cause for the failure should be investigated. Ideally, all tests should pass.

pymupdf 1.22.x + breaks jbig2 support

See #511

Dependabot can't resolve your Python dependency files

Dependabot can't resolve your Python dependency files.

As a result, Dependabot couldn't update your dependencies.

The error Dependabot encountered was:

ERROR: Could not find a version that satisfies the requirement pdflib==0.3.0 (from -r requirements.in (line 23)) (from versions: 0.1, 0.1.1, 0.1.2)
Traceback (most recent call last):
  File "/usr/local/.pyenv/versions/3.9.1/bin/pip-compile", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/scripts/compile.py", line 458, in cli
    results = resolver.resolve(max_rounds=max_rounds)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/resolver.py", line 173, in resolve
    has_changed, best_matches = self._resolve_one_round()
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/resolver.py", line 278, in _resolve_one_round
    their_constraints.extend(self._iter_dependencies(best_match))
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/resolver.py", line 388, in _iter_dependencies
    dependencies = self.repository.get_dependencies(ireq)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/repositories/local.py", line 74, in get_dependencies
    return self.repository.get_dependencies(ireq)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/repositories/pypi.py", line 243, in get_dependencies
    self._dependencies_cache[ireq] = self.resolve_reqs(
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/piptools/repositories/pypi.py", line 194, in resolve_reqs
    results = resolver._resolve_one(reqset, ireq)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/pip/_internal/resolution/legacy/resolver.py", line 385, in _resolve_one
    dist = self._get_dist_for(req_to_install)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/pip/_internal/resolution/legacy/resolver.py", line 336, in _get_dist_for
    self._populate_link(req)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/pip/_internal/resolution/legacy/resolver.py", line 302, in _populate_link
    req.link = self._find_requirement_link(req)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/pip/_internal/resolution/legacy/resolver.py", line 267, in _find_requirement_link
    best_candidate = self.finder.find_requirement(req, upgrade)
  File "/usr/local/.pyenv/versions/3.9.1/lib/python3.9/site-packages/pip/_internal/index/package_finder.py", line 927, in find_requirement
    raise DistributionNotFound(
pip._internal.exceptions.DistributionNotFound: No matching distribution found for pdflib==0.3.0 (from -r requirements.in (line 23))

If you think the above is an error on Dependabot's side please don't hesitate to get in touch - we'll do whatever we can to fix it.

View the update logs.

Moshi moshi

💯

Search index text regression

This affects the current 3.18.0-rc2. The changes to the pdf libraries and the ingestion mechanism there have lead to a decrease in search index text quality.

Reproduction steps:

upload/ingest a pdf file or document that gets converted to pdf
search for text inside it
expect the text inside the document to be found

Support .aleph files

Ability to upload FtM iJSON files with the extensions .aleph and .ftm.

Handle XLSX files that have protection/encryption enabled

It seems like we fail to parse files which are created in Excel with write-protection, even though they are readable without a password in the app. There has to be a way of extracting content from them.

Make error handling more robust in the "analyze" step

Error while analysing an ingested document stops the document processing pipeline and the document doesn't get indexed or show up on Aleph.

Example of such an error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/encodings/idna.py", line 165, in encode
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/servicelayer/worker.py", line 31, in handle_safe
    self.handle(task)
  File "/ingestors/ingestors/worker.py", line 52, in handle
    entity_ids = self._analyze(dataset, task)
  File "/ingestors/ingestors/worker.py", line 38, in _analyze
    analyzer.feed(entity)
  File "/ingestors/ingestors/analysis/__init__.py", line 46, in feed
    for (prop, tag) in extract_patterns(self.entity, text):
  File "/ingestors/ingestors/analysis/patterns.py", line 26, in extract_patterns
    value = prop.type.clean(match_text, proxy=entity)
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/types/common.py", line 86, in clean
    return self.clean_text(text, fuzzy=fuzzy, format=format, proxy=proxy)
  File "/usr/local/lib/python3.8/dist-packages/followthemoney/types/email.py", line 71, in clean_text
    domain = domain.encode("idna").decode("ascii")
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label empty or too long)

Reported by @brrttwrks on slack.

High number of context switching when ingesting files

We're ingesting some files and we're getting an alert in our monitorization system regarding a high number of context switching from the ingestors processes.

I know it's a hard issue to deal with, but do you think that there could be an improvement on the ingest process to improve the performance by reducing the number of switches?

Thanks

Unsupported image format

3.18.2 has difficulties with PDFs with unsupported image formats when we try to get a PIL image out of a pikepdf Image.

Some research suggests this might be related to the DecodeParams used by pikepdf (see pikepdf/pikepdf#423).

Some related stacktraces suggest breakage in pikepdf even before the point of converting to a PIL Image and for those cases it seems like pikepdf 7.0.0 has a fix which works:

Fixed an issue with extracting images that were compressed with multiple compression filters that also had custom decode parameters.

Run ingestors on the CLI to generate "ftm-bundles"

What is an ftm-bundle?

An ftm-bundle is a zip file containing structured FtM entities and document blobs. The structure of the zip file may look something like:

bundle.zip/
  entities.json
  index.json
  archive/
    ab/cd/..
    a1/b2/..

Aleph should know how to load an ftm-bundle into a collection without the need pass the files to ingest-file. So ideally when an ftm-bundle is uploaded to Aleph, Aleph will copy the entities into ftm-store and the document blobs to the document archive (gcs, s3 etc) and trigger an reindex.

The ingestor script that generates an ftm-bundle from a directory may look something like this:

./ingest-bundle.sh --parallel 6 --source-dir my_data --dest-path . --dest-file my_data.ftmbundle

why is this useful?

If we can run ingestors on the CLI to generate ftm-bundles, it should help in debugging the ingest process and processing large amounts of data incrementally. When we encounter bugs in ingestors, it will also help us iterate quickly without the need to reupload the data to Aleph or waiting for the deployment of a new Aleph version.
ftm-bundles will enable us to share both structured and unstructured data across Aleph instances. See alephdata/aleph#1523
If we print a nice summary of errors after an ingestors run on the CLI, it will be helpful in large scale data processing in figuring out which files are problematic.
It will be helpful for organizations that run Aleph but don't have the infrastructure to monitor the logs of ingest-file service or run many parallel workers to do faster processing. They will be able to process the data offline and upload the artifacts to Aleph which should make the process a bit easier to monitor and manage.

Make SQLAlchemy pool size configurable

when ingest-file is run on a machine with a large number of cores, the default pool size of SQLAlchemy may not be enough. See #251