deepform's People

Contributors

andrealowe, danielfennelly, gray-davidson-00, jstray, moredatarequired, ngrayluna, radkoff, staceysv

deepform's Issues

train.py crashes on save if passed a custom model name

The crash comes from this line in train.py: when config.model_path is set, basename is a plain string rather than a Path.

basename = config.model_path or default_model_name(config.window_len)

Traceback (most recent call last):
  File "/root/.asdf/installs/python/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.asdf/installs/python/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/deepform/deepform/train.py", line 207, in <module>
    main(config)
  File "/root/deepform/deepform/train.py", line 173, in main
    save_model(model, config)
  File "/root/deepform/deepform/model.py", line 138, in save_model
    basename.parent.mkdir(parents=True, exist_ok=True)
AttributeError: 'str' object has no attribute 'parent'
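
A minimal sketch of one possible fix (untested; assumes default_model_name returns a Path, which the traceback suggests): coerce the user-supplied string to a Path so that basename.parent works in save_model.

    from pathlib import Path

    # If the caller set config.model_path, it arrives as a str; wrap it in a
    # Path so downstream code like basename.parent.mkdir(...) keeps working.
    basename = (
        Path(config.model_path)
        if config.model_path
        else default_model_name(config.window_len)
    )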

Pull PDFs on demand for annotation

Instead of requiring ~30G of attached PDFs covering all of our source documents, we should keep them in a publicly-available known location and download them on demand; a given training run only needs a handful of them at the end, for annotation.
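
A hedged sketch of what the on-demand fetch could look like; the bucket URL is hypothetical and the slug-to-URL mapping is an assumption:

    from pathlib import Path
    import requests

    PDF_BASE_URL = "https://example-bucket.s3.amazonaws.com/pdfs"  # hypothetical location

    def fetch_pdf(slug, dest_dir="data/pdfs"):
        """Download a single source PDF only when annotation needs it."""
        dest = Path(dest_dir) / f"{slug}.pdf"
        if not dest.exists():  # cache locally so repeat runs don't re-download
            dest.parent.mkdir(parents=True, exist_ok=True)
            resp = requests.get(f"{PDF_BASE_URL}/{slug}.pdf", timeout=30)
            resp.raise_for_status()
            dest.write_bytes(resp.content)
        return dest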

How are non-entity tokens handled?

As far as I understand, in data/training we only have the entity tokens, and each token is labeled with the class that has the maximal value in its row.

However, how are the non-entity tokens handled? That is, tokens that don't fit into any of the classes?
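
For illustration, here is a sketch of one common scheme (an assumption about this repo, not confirmed by it): tokens whose best match score against every target field falls below a threshold get a catch-all "none" class.

    import numpy as np

    def label_tokens(match_scores, threshold=0.5):
        """match_scores: (n_tokens, n_classes) array of per-field match values.
        Returns one integer label per token; 0 means not part of any entity."""
        best = match_scores.max(axis=1)
        labels = match_scores.argmax(axis=1) + 1  # entity classes are 1..n_classes
        labels[best < threshold] = 0              # background / non-entity tokens
        return labels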

How to obtain the original pdf files for 2012 and 2014

Thanks for your repo, it is very meaningful!

I have a question: I want to download the original pdf files and link each pdf with its corresponding labels.

I find that the 2020 data is easy to link to its PDF files, since you provide the pdf link like this:
[screenshot showing the provided pdf link]

However, it is hard to find the original pdf files for the 2012 and 2014 data.
I can only find a way to link each tokenized parquet file with its corresponding labels via file_id.
But how can I obtain the original pdf files, via file_id or some other way?

I find that this file presents the url, but how can I download pdf files with this url?
[screenshot of the file containing the url]

Thanks!
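
For what it's worth, if that url column holds a direct link to each PDF (an assumption), a download loop like this sketch could work, naming each file by its file_id:

    import pandas as pd
    import requests

    manifest = pd.read_parquet("manifest.parquet")  # hypothetical path
    for _, row in manifest.iterrows():
        resp = requests.get(row["url"], timeout=30)  # assumes a `url` column
        if resp.ok:
            with open(f"{row['file_id']}.pdf", "wb") as f:
                f.write(resp.content)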

Stop logging password as a config variable

Currently db_password is a config variable, and along with all the others it is logged as part of the run, openly visible for each run on W&B and in the output logs.

My inclination is that it should be an environment variable added to .env, but one way or another we should stop reporting the plaintext password as one of the run parameters.
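
A minimal sketch of the .env approach, assuming python-dotenv and a DB_PASSWORD variable name:

    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads DB_PASSWORD from .env into the process environment
    db_password = os.environ["DB_PASSWORD"]  # used directly, never logged to W&B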

Create infer.py

Refactor the train.py code and create infer.py, which takes a PDF on the command line (assume it's already OCR'd) and outputs the extracted fields (a minimal CLI sketch follows this list):

  • contract number
  • advertiser
  • start date
  • end date
  • total
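
A sketch of the entry point; load_model and extract_fields are hypothetical names for the pieces refactored out of train.py:

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Extract fields from an OCR'd PDF")
        parser.add_argument("pdf", help="path to an already-OCR'd PDF")
        args = parser.parse_args()
        model = load_model()                      # hypothetical: load trained model
        fields = extract_fields(model, args.pdf)  # hypothetical: run extraction
        for name in ["contract_number", "advertiser", "start_date", "end_date", "total"]:
            print(f"{name}: {fields.get(name)}")

    if __name__ == "__main__":
        main()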

Enable data conversion to run without huge memory allocation

Right now the data conversion script convert_to_parquet.py loads all the data into memory at once. As our dataset has grown, this is no longer possible on a typical desktop computer; the script should operate chunk-by-chunk so it can run on pretty much any machine.
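
A sketch of chunked conversion with pandas (the column details and output layout are assumptions; the real script may need per-chunk processing too):

    import pandas as pd

    def convert(csv_path, out_dir, chunksize=100_000):
        """Stream the CSV in fixed-size chunks, writing one parquet part per chunk."""
        for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
            chunk.to_parquet(f"{out_dir}/tokens-{i:05}.parquet")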

Add license

Are we an open source project? If we are, we should have a file in our repo that states what license we're operating under.

Update wandb version 0.9 -> 0.10

Version 0.10 of the Weights and Biases client library contains a few breaking changes that will need to be fixed manually.

Issue writing the dataset as parquet in add_features

I'm running the following (locally, not docker) with the latest from mainline: python -m deepform.data.add_features data/3_year_manifest.csv

And I'm getting this traceback:

Traceback (most recent call last):
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/evan/deepform/deepform/data/add_features.py", line 263, in <module>
    extend_and_write_docs(
  File "/Users/evan/deepform/deepform/data/add_features.py", line 98, in extend_and_write_docs
    doc_index.to_parquet(pq_index)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 2365, in to_parquet
    to_parquet(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 270, in to_parquet
    return impl.write(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 101, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1376, in pyarrow.lib.Table.from_pandas
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 565, in convert_column
    raise e
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 0.0 with type str: tried to convert to double', 'Conversion failed for column gross_amount with type object')

I inspected the DataFrame, and the issue appears to be that the document with slug 499480-cancel-68803-13518579030793-_-pdf has gross_amount stored as the string "0.0", which prevents conversion to double.

One solution might be to do:
doc_index['gross_amount'] = doc_index.gross_amount.apply(pd.to_numeric, errors='coerce')
before exporting to parquet format, but I wanted to confirm with you all that this field is supposed to be float, and that the 0.0 amount isn't a mistake. I'm also not sure why no one else has run into this, so maybe something else is up.
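
A slightly fuller version of that idea (a sketch; assumes the index has a slug column), which also surfaces any rows that fail to parse instead of silently coercing them to NaN:

    import pandas as pd

    amounts = pd.to_numeric(doc_index["gross_amount"], errors="coerce")
    bad_slugs = doc_index.loc[amounts.isna(), "slug"]  # rows that didn't parse
    if len(bad_slugs):
        print(f"{len(bad_slugs)} documents have unparseable gross_amount:")
        print(list(bad_slugs))
    doc_index["gross_amount"] = amounts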

Add script to automate retrieving training data

Right now getting our training data is a manual process that needs to be done independently by every developer. We should have a script that pulls our training data down from a known location and puts it in a consistent place in our repo.
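
A sketch of such a script; the URL is a placeholder for whatever known location we settle on:

    from pathlib import Path
    from urllib.request import urlretrieve

    DATA_URL = "https://example.com/deepform/training-data.parquet"  # hypothetical

    def fetch_training_data(dest="data/training-data.parquet"):
        """Download the training data once and cache it inside the repo."""
        path = Path(dest)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            urlretrieve(DATA_URL, str(path))
        return path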

Merge 2012 and 2014 training data

Merge into one file that explicitly indicates which labels are available for each document (a merge sketch follows this list):

  • 2012: contract number, advertiser, total
  • 2014: contract number, advertiser, start date, end date
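
A pandas sketch of the merge; file names and column names are assumptions:

    import pandas as pd

    LABELS = ["contract_number", "advertiser", "start_date", "end_date", "total"]

    df12 = pd.read_csv("2012_manifest.csv")  # hypothetical path
    df14 = pd.read_csv("2014_manifest.csv")  # hypothetical path
    df12["year"], df14["year"] = 2012, 2014

    merged = pd.concat([df12, df14], ignore_index=True, sort=False)
    for label in LABELS:
        # Explicit per-document availability flag; columns missing from one
        # year become NaN in the concat, so notna() marks exactly the
        # documents that carry the label.
        merged[f"has_{label}"] = merged[label].notna() if label in merged else False
    merged.to_csv("combined_manifest.csv", index=False)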

Match output token more intelligently

Currently doc_val_acc scores as zero some answers that are really close to the volunteer-sourced answers, such as:

  • guessed "$17,925.00" with score 29.85, correct "17,925"
  • guessed "$19,850.00" with score 29.67, correct "19850"
  • guessed "$25,200.00" with score 29.02, correct "$25,200"
  • guessed "$152,900,00" with score 29.15, correct "$152,900.00"

These are all in some sense the same value, but either the volunteer data or the OCR'd token is a little bit funny. We should be able to match any of these examples (see the sketch below) without increasing our false positive rate on tokens that actually disagree about the number.
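
One possibility (a sketch, not existing repo code): normalize both tokens to a numeric value before comparing, so currency symbols, thousands separators, and an OCR'd comma-for-period still match, while genuinely different amounts still fail.

    import re

    def normalize_amount(token):
        """Parse a currency-ish token into a float, tolerating OCR quirks."""
        s = token.strip().lstrip("$")
        # A trailing separator plus two digits is treated as cents, even when
        # the separator is a comma (handles OCR output like "$152,900,00").
        m = re.fullmatch(r"([\d.,]*)[.,](\d{2})", s)
        if m:
            s = re.sub(r"[.,]", "", m.group(1)) + "." + m.group(2)
        else:
            s = re.sub(r"[.,]", "", s)  # strip thousands separators
        try:
            return float(s)
        except ValueError:
            return None

    def amounts_match(guess, truth):
        a, b = normalize_amount(guess), normalize_amount(truth)
        return a is not None and a == b  # all four examples above now match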

@jstray discusses some of these issues (and more) in https://app.wandb.ai/deepform/extract_total/reports/What-type-of-errors-does-windowed-total-extraction-make%3F--VmlldzoxMTA4NjE

Create test version of sweep

There should be a convenient way to test that our system works all the way through running a wandb sweep, without having to run a full-blown sweep.
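
A sketch of a smoke-test sweep: one tiny run through the real wandb sweep machinery (the parameter values and project name are assumptions):

    import wandb

    sweep_config = {
        "method": "random",
        "parameters": {"window_len": {"values": [10]}},  # one tiny setting
    }

    def smoke_train():
        # Would invoke the real training entry point with minimal settings.
        with wandb.init() as run:
            run.log({"doc_val_acc": 0.0})

    sweep_id = wandb.sweep(sweep_config, project="deepform-test")  # hypothetical project
    wandb.agent(sweep_id, function=smoke_train, count=1)  # exactly one run, then stop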

No token file for ...

There are a bunch of files for which it complains "No token file", for example:

No token file for Political Files/2020/Federal/US Senate/amy mcgrath order 519-525 30/mcgrath 30
No token file for Political Files/2020/Federal/President/Mike Bloomberg 2020/mike bloomberg egxa 117 invoice 1.21
No token file for Political Files/2020/Federal/President/Sanders/Orders/sanders.wsoc.663906R.02.28.2020-2
No token file for Political Files/2020/Federal/US House/Sheila Jackson/sheilajacksonfinalinvoice
No token file for Political Files/2020/Federal/President/Steyer/Telemundo/Orders/steyer2020.esoc.649599.1.21.20
No token file for Political Files/2020/Federal/President/tom steyer invoice kgwn 12.15

This ends up generating only 8990 parquet files under data/training.
