deepform's People

Contributors

andrealowe, danielfennelly, gray-davidson-00, jstray, moredatarequired, ngrayluna, radkoff, staceysv

deepform's Issues

train.py crashes on save if passed a custom model name

The crash comes from this line in train.py: when config.model_path is set, basename is a plain string rather than a Path.

basename = config.model_path or default_model_name(config.window_len)

Traceback (most recent call last):
  File "/root/.asdf/installs/python/3.8.5/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.asdf/installs/python/3.8.5/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/deepform/deepform/train.py", line 207, in <module>
    main(config)
  File "/root/deepform/deepform/train.py", line 173, in main
    save_model(model, config)
  File "/root/deepform/deepform/model.py", line 138, in save_model
    basename.parent.mkdir(parents=True, exist_ok=True)
AttributeError: 'str' object has no attribute 'parent'
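
A minimal sketch of one possible fix (untested; assumes default_model_name returns a Path, which the traceback suggests): coerce the user-supplied string to a Path so that basename.parent works in save_model.

    from pathlib import Path

    # If the caller set config.model_path, it arrives as a str; wrap it in a
    # Path so downstream code like basename.parent.mkdir(...) keeps working.
    basename = (
        Path(config.model_path)
        if config.model_path
        else default_model_name(config.window_len)
    )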

Pull PDFs on demand for annotation

Instead of requiring ~30G of attached PDFs covering all of our source documents, we should keep them in a publicly-available known location and download them on demand; a given training run only needs a handful of them at the end, for annotation.
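
A hedged sketch of what the on-demand fetch could look like; the bucket URL is hypothetical and the slug-to-URL mapping is an assumption:

    from pathlib import Path
    import requests

    PDF_BASE_URL = "https://example-bucket.s3.amazonaws.com/pdfs"  # hypothetical location

    def fetch_pdf(slug, dest_dir="data/pdfs"):
        """Download a single source PDF only when annotation needs it."""
        dest = Path(dest_dir) / f"{slug}.pdf"
        if not dest.exists():  # cache locally so repeat runs don't re-download
            dest.parent.mkdir(parents=True, exist_ok=True)
            resp = requests.get(f"{PDF_BASE_URL}/{slug}.pdf", timeout=30)
            resp.raise_for_status()
            dest.write_bytes(resp.content)
        return dest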

How are non-entity tokens handled?

As far as I understand, in data/training we only have the entity tokens, and each token is labeled with the class that has the maximal value in its row.

However, how are the non-entity tokens handled? That is, tokens that don't fit into any of the classes?
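
For illustration, here is a sketch of one common scheme (an assumption about this repo, not confirmed by it): tokens whose best match score against every target field falls below a threshold get a catch-all "none" class.

    import numpy as np

    def label_tokens(match_scores, threshold=0.5):
        """match_scores: (n_tokens, n_classes) array of per-field match values.
        Returns one integer label per token; 0 means not part of any entity."""
        best = match_scores.max(axis=1)
        labels = match_scores.argmax(axis=1) + 1  # entity classes are 1..n_classes
        labels[best < threshold] = 0              # background / non-entity tokens
        return labels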

How to obtain the original pdf files for 2012 and 2014

Thanks for your repo, it is very meaningful!

I have a question: I want to download the original pdf files and link each pdf with its corresponding labels.

I find that the 2020 data is easy to link to its PDF files, since you provide the pdf link like this:
[screenshot showing the provided pdf link]

However, it is hard to find the original pdf files for the 2012 and 2014 data.
I can only find a way to link each tokenized parquet file with its corresponding labels via file_id.
But how can I obtain the original pdf files, via file_id or some other way?

I find that this file presents the url, but how can I download pdf files with this url?
[screenshot of the file containing the url]

Thanks!
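
For what it's worth, if that url column holds a direct link to each PDF (an assumption), a download loop like this sketch could work, naming each file by its file_id:

    import pandas as pd
    import requests

    manifest = pd.read_parquet("manifest.parquet")  # hypothetical path
    for _, row in manifest.iterrows():
        resp = requests.get(row["url"], timeout=30)  # assumes a `url` column
        if resp.ok:
            with open(f"{row['file_id']}.pdf", "wb") as f:
                f.write(resp.content)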

Stop logging password as a config variable

Currently db_password is a config variable, and along with all the others it is logged as part of the run, openly visible for each run on W&B and in the output logs.

My inclination is that it should be an environment variable added to .env, but one way or another we should stop reporting the plaintext password as one of the run parameters.
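
A minimal sketch of the .env approach, assuming python-dotenv and a DB_PASSWORD variable name:

    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads DB_PASSWORD from .env into the process environment
    db_password = os.environ["DB_PASSWORD"]  # used directly, never logged to W&B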

Create infer.py

Refactor the train.py code and create infer.py, which takes a PDF on the command line (assume it's already OCR'd) and outputs the extracted fields (a minimal CLI sketch follows this list):

  • contract number
  • advertiser
  • start date
  • end date
  • total
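
A sketch of the entry point; load_model and extract_fields are hypothetical names for the pieces refactored out of train.py:

    import argparse

    def main():
        parser = argparse.ArgumentParser(description="Extract fields from an OCR'd PDF")
        parser.add_argument("pdf", help="path to an already-OCR'd PDF")
        args = parser.parse_args()
        model = load_model()                      # hypothetical: load trained model
        fields = extract_fields(model, args.pdf)  # hypothetical: run extraction
        for name in ["contract_number", "advertiser", "start_date", "end_date", "total"]:
            print(f"{name}: {fields.get(name)}")

    if __name__ == "__main__":
        main()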

Enable data conversion to run without huge memory allocation

Right now the data conversion script convert_to_parquet.py loads all the data into memory at once. As our dataset has grown, this is no longer possible on a typical desktop computer; the script should operate chunk-by-chunk so it can run on pretty much any machine.
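
A sketch of chunked conversion with pandas (the column details and output layout are assumptions; the real script may need per-chunk processing too):

    import pandas as pd

    def convert(csv_path, out_dir, chunksize=100_000):
        """Stream the CSV in fixed-size chunks, writing one parquet part per chunk."""
        for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=chunksize)):
            chunk.to_parquet(f"{out_dir}/tokens-{i:05}.parquet")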

Add license

Are we an open source project? If we are, we should have a file in our repo that states what license we're operating under.

Update wandb version 0.9 -> 0.10

Version 0.10 of the Weights and Biases client library contains a few breaking changes that will need to be fixed manually.

Issue writing the dataset as parquet in add_features

I'm running the following (locally, not docker) with the latest from mainline: python -m deepform.data.add_features data/3_year_manifest.csv

And I'm getting this traceback:

Traceback (most recent call last):
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/evan/deepform/deepform/data/add_features.py", line 263, in <module>
    extend_and_write_docs(
  File "/Users/evan/deepform/deepform/data/add_features.py", line 98, in extend_and_write_docs
    doc_index.to_parquet(pq_index)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/core/frame.py", line 2365, in to_parquet
    to_parquet(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 270, in to_parquet
    return impl.write(
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pandas/io/parquet.py", line 101, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1376, in pyarrow.lib.Table.from_pandas
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/local/opt/[email protected]/Frameworks/Python.framework/Versions/3.8/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 565, in convert_column
    raise e
  File "/Users/evan/Library/Caches/pypoetry/virtualenvs/deepform-kruIrF0o-py3.8/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert 0.0 with type str: tried to convert to double', 'Conversion failed for column gross_amount with type object')

I inspected the DataFrame, and the issue appears to be that the document with slug 499480-cancel-68803-13518579030793-_-pdf has gross_amount stored as the string "0.0", which prevents conversion to double.

One solution might be to do:
doc_index['gross_amount'] = doc_index.gross_amount.apply(pd.to_numeric, errors='coerce')
before exporting to parquet format, but I wanted to confirm with you all that this field is supposed to be float, and that the 0.0 amount isn't a mistake. I'm also not sure why no one else has run into this, so maybe something else is up.
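
A slightly fuller version of that idea (a sketch; assumes the index has a slug column), which also surfaces any rows that fail to parse instead of silently coercing them to NaN:

    import pandas as pd

    amounts = pd.to_numeric(doc_index["gross_amount"], errors="coerce")
    bad_slugs = doc_index.loc[amounts.isna(), "slug"]  # rows that didn't parse
    if len(bad_slugs):
        print(f"{len(bad_slugs)} documents have unparseable gross_amount:")
        print(list(bad_slugs))
    doc_index["gross_amount"] = amounts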

Add script to automate retrieving training data

Right now getting our training data is a manual process that needs to be done independently by every developer. We should have a script that pulls our training data down from a known location and puts it in a consistent place in our repo.
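
A sketch of such a script; the URL is a placeholder for whatever known location we settle on:

    from pathlib import Path
    from urllib.request import urlretrieve

    DATA_URL = "https://example.com/deepform/training-data.parquet"  # hypothetical

    def fetch_training_data(dest="data/training-data.parquet"):
        """Download the training data once and cache it inside the repo."""
        path = Path(dest)
        if not path.exists():
            path.parent.mkdir(parents=True, exist_ok=True)
            urlretrieve(DATA_URL, str(path))
        return path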

Merge 2012 and 2014 training data

Merge into one file that explicitly indicates which labels are available for each document (a merge sketch follows this list):

  • 2012: contract number, advertiser, total
  • 2014: contract number, advertiser, start date, end date
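
A pandas sketch of the merge; file names and column names are assumptions:

    import pandas as pd

    LABELS = ["contract_number", "advertiser", "start_date", "end_date", "total"]

    df12 = pd.read_csv("2012_manifest.csv")  # hypothetical path
    df14 = pd.read_csv("2014_manifest.csv")  # hypothetical path
    df12["year"], df14["year"] = 2012, 2014

    merged = pd.concat([df12, df14], ignore_index=True, sort=False)
    for label in LABELS:
        # Explicit per-document availability flag; columns missing from one
        # year become NaN in the concat, so notna() marks exactly the
        # documents that carry the label.
        merged[f"has_{label}"] = merged[label].notna() if label in merged else False
    merged.to_csv("combined_manifest.csv", index=False)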

Match output token more intelligently

Currently doc_val_acc scores as zero some answers that are really close to the volunteer-sourced answers, such as:

  • guessed "$17,925.00" with score 29.85, correct "17,925"
  • guessed "$19,850.00" with score 29.67, correct "19850"
  • guessed "$25,200.00" with score 29.02, correct "$25,200"
  • guessed "$152,900,00" with score 29.15, correct "$152,900.00"

These are all in some sense the same value, but either the volunteer data or the OCR'd token is a little bit funny. We should be able to match any of these examples (see the sketch below) without increasing our false positive rate on tokens that actually disagree about the number.
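
One possibility (a sketch, not existing repo code): normalize both tokens to a numeric value before comparing, so currency symbols, thousands separators, and an OCR'd comma-for-period still match, while genuinely different amounts still fail.

    import re

    def normalize_amount(token):
        """Parse a currency-ish token into a float, tolerating OCR quirks."""
        s = token.strip().lstrip("$")
        # A trailing separator plus two digits is treated as cents, even when
        # the separator is a comma (handles OCR output like "$152,900,00").
        m = re.fullmatch(r"([\d.,]*)[.,](\d{2})", s)
        if m:
            s = re.sub(r"[.,]", "", m.group(1)) + "." + m.group(2)
        else:
            s = re.sub(r"[.,]", "", s)  # strip thousands separators
        try:
            return float(s)
        except ValueError:
            return None

    def amounts_match(guess, truth):
        a, b = normalize_amount(guess), normalize_amount(truth)
        return a is not None and a == b  # all four examples above now match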

@jstray discusses some of these issues (and more) in https://app.wandb.ai/deepform/extract_total/reports/What-type-of-errors-does-windowed-total-extraction-make%3F--VmlldzoxMTA4NjE

Create test version of sweep

There should be a convenient way to test that our system works all the way through running a wandb sweep, without having to run a full-blown sweep.
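
A sketch of a smoke-test sweep: one tiny run through the real wandb sweep machinery (the parameter values and project name are assumptions):

    import wandb

    sweep_config = {
        "method": "random",
        "parameters": {"window_len": {"values": [10]}},  # one tiny setting
    }

    def smoke_train():
        # Would invoke the real training entry point with minimal settings.
        with wandb.init() as run:
            run.log({"doc_val_acc": 0.0})

    sweep_id = wandb.sweep(sweep_config, project="deepform-test")  # hypothetical project
    wandb.agent(sweep_id, function=smoke_train, count=1)  # exactly one run, then stop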

No token file for ...

There are a bunch of files for which it complains "No token file", for example:

No token file for Political Files/2020/Federal/US Senate/amy mcgrath order 519-525 30/mcgrath 30
No token file for Political Files/2020/Federal/President/Mike Bloomberg 2020/mike bloomberg egxa 117 invoice 1.21
No token file for Political Files/2020/Federal/President/Sanders/Orders/sanders.wsoc.663906R.02.28.2020-2
No token file for Political Files/2020/Federal/US House/Sheila Jackson/sheilajacksonfinalinvoice
No token file for Political Files/2020/Federal/President/Steyer/Telemundo/Orders/steyer2020.esoc.649599.1.21.20
No token file for Political Files/2020/Federal/President/tom steyer invoice kgwn 12.15

This ends up generating only 8990 parquet files under data/training.
