
Parsr's Introduction


Turn your documents into data!

Français | Portuguese | Spanish | 中文

  • Parsr is a minimal-footprint document (image, PDF, DOCX, EML) cleaning, parsing and extraction toolchain which generates readily available, organized and usable data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.

  • It provides analysts, data scientists and developers with a clean, structured and label-enriched information set for ready-to-use applications ranging from data entry and document analysis automation to archival and many others.

  • Currently, Parsr can perform: document cleaning, hierarchy regeneration (words, lines, paragraphs), detection of headings, tables, lists, table of contents, page numbers, headers/footers, links, and others. Check out all the features.

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
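As a quick check that the API is up, a document can be submitted and its status polled directly over HTTP. Below is a minimal sketch using Python's requests library; the POST /api/document and GET /api/queue/{id} endpoints are the ones described in the usage guide, while the multipart field names ('file', 'config') and the sample file names are assumptions made for illustration.

# Minimal sketch: submit a document to a locally running Parsr API and poll its status.
# The multipart field names and file names below are assumptions, not a definitive reference.
import time
import requests

API = 'http://localhost:3001/api'

with open('sample.pdf', 'rb') as doc, open('config.json', 'rb') as conf:
    response = requests.post(
        f'{API}/document',
        files={'file': ('sample.pdf', doc), 'config': ('config.json', conf)},
    )
response.raise_for_status()
queue_id = response.text  # the API answers with an identifier for the queued job

# Poll the queue; the exact status semantics are described in the API documentation.
while True:
    status = requests.get(f'{API}/queue/{queue_id}')
    print(status.status_code, status.text)
    if status.status_code != 200:  # the response changes once processing is finished
        break
    time.sleep(2)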

  1. To install the Python client for the Parsr API, issue:

    pip install parsr-client

    To try the Jupyter notebook that uses the Python client, head over to the jupyter demo (a usage sketch of the client follows after this list).

  2. To use the GUI tool (the API needs to be running already), issue:
    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
    Then, access it through http://localhost:8080.
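For the Python client, a minimal usage sketch is shown below, assuming the interface used in the project's Jupyter demo (ParsrClient, send_document and the get_* accessors); method names and arguments may differ between parsr-client versions.

# Minimal sketch based on the interface shown in the Jupyter demo; the method names
# and arguments are assumptions and may differ between parsr-client versions.
from parsr_client import ParsrClient

parsr = ParsrClient('localhost:3001')   # the API started above

parsr.send_document(
    file_path='sample.pdf',             # illustrative file names
    config_path='config.json',
    document_name='sample',
    wait_till_finished=True,            # block until the pipeline completes
)

print(parsr.get_markdown())             # other accessors: get_json(), get_text(), get_tables()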

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

The API based usage and the command line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Licenses of the third-party libraries used as dependencies:

  1. QPDF: Apache http://qpdf.sourceforge.net
  2. ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
  3. Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  4. PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
  5. Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
  6. Camelot: MIT https://github.com/camelot-dev/camelot
  7. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  8. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).


Parsr's Issues

Certain headings being detected as body text

Certain headings are not being detected as such: perhaps due to the small difference in font size compared to the body text?

It could be interesting to use different parameters as weights (bold, etc) for picking heading candidates in the HeadingDetectionModule.
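A rough illustration of the idea, as a hypothetical scoring function rather than the actual HeadingDetectionModule logic: combine the font-size ratio with other font attributes such as bold weight, instead of relying on size alone.

# Hypothetical sketch of weighted heading-candidate scoring; the weights and threshold
# are illustrative and not the actual HeadingDetectionModule parameters.
def heading_score(line_font_size, body_font_size, is_bold, is_upper_case):
    size_ratio = line_font_size / body_font_size
    score = 2.0 * max(0.0, size_ratio - 1.0)   # reward even small size differences
    score += 1.0 if is_bold else 0.0           # bold text is a strong heading hint
    score += 0.5 if is_upper_case else 0.0     # all-caps lines are weaker evidence
    return score

# A line only slightly larger than the body text but bold would still qualify:
print(heading_score(10.5, 10.0, is_bold=True, is_upper_case=False) > 1.0)  # True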

Source and resulting Markdown screenshots are attached to the issue.

Use ghostscript instead of convert to generate intermediate TIFF files (from PDF) before extracting using tesseract

Currently, with some specific documents, the PDF pre-processing step (a call to convert that generates TIFF files) can generate low-resolution images. These images are then passed to tesseract and the results are quite bad.

We should make sure that:

  • the target images are big enough to get good results from tesseract.
  • the processing time is fast enough.

We had a look at calling ghostscript directly.
300 dpi seems good enough, and Ghostscript is faster than convert.

$ gs -dNOPAUSE -q -sDEVICE=tiff48nc -dBATCH -sOutputFile=a.tiff -r300 a.pdf -c quit

Note that this creates some pretty large files.

Handle PDFs composed of text and images that contain text

  1. A possible solution could be to handle each file page by page: if a page yields no text from the text extractor, we forward it to the OCR extractor (see the sketch after this list).

  2. The cleanest solution would be to extract images with mupdf, run OCR on them and put the results back into the Document.
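A sketch of the first option, where extract_text_elements and run_ocr stand in for the existing text and OCR extractors (they are placeholders, not functions from the codebase):

# Hypothetical sketch of the page-by-page fallback; extract_text_elements and run_ocr
# are placeholders for the existing text extractor and OCR extractor.
def extract_document(pages, extract_text_elements, run_ocr):
    elements = []
    for page in pages:
        page_elements = extract_text_elements(page)  # regular text extraction first
        if not page_elements:                        # nothing found: likely an image-only page
            page_elements = run_ocr(page)            # fall back to the OCR extractor
        elements.extend(page_elements)
    return elements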

In raw text extraction outputs, handle bulleted and numbered lists

Currently, the outputs of the pdf2json and tesseract extractors do not provide any information about numbered or bulleted lists.
The current bullet/numbered list detection module, ListDetectionModule, needs to be improved.

Current state: source and resulting Markdown screenshots are attached to the issue.

Sample files

Provide some representative samples in the repository under samples/.

Documentation of the optional parameters for each module

The documentation for each module (residing in /docs/modules) does not yet reference the optional parameters of each module. The only place to check them is the code itself.

It would increase usability if these values could be referred to in the documentation along with the descriptions.

Automatic LinesToParagraph parameters calculation

The current implementation of paragraph formation from lines (in the LinesToParagraph.ts module) depends heavily on the values of its parameters (most importantly maxInterline) to determine whether lines can be merged.

This can be automated using the following steps:

  1. Determine whether the data is rotated: take the smallest text elements (characters or words) and calculate their general direction with respect to neighbouring elements of the same type. If there is a common pattern (an angle alpha), the page is rotated.
  2. Taking alpha into account, calculate the most common, the maximum and the minimum inter-line vertical distances, and use them to derive the maxInterline value that the module needs to function correctly (see the sketch below).
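A rough sketch of step 2 (ignoring rotation), assuming line bounding boxes with top/height coordinates similar to those in the JSON output; the 1.5x margin over the most common gap is an illustrative choice, not the module's actual formula.

# Illustrative sketch: derive a maxInterline value from observed inter-line gaps.
# The box format and the 1.5x margin are assumptions, not the module's actual logic.
from collections import Counter

def estimate_max_interline(line_boxes):
    # line_boxes: dicts with 't' (top) and 'h' (height), sorted top to bottom
    gaps = []
    for prev, cur in zip(line_boxes, line_boxes[1:]):
        gap = cur['t'] - (prev['t'] + prev['h'])   # vertical distance between consecutive lines
        if gap >= 0:
            gaps.append(round(gap))
    if not gaps:
        return None
    most_common_gap, _ = Counter(gaps).most_common(1)[0]
    return 1.5 * most_common_gap                   # allow some slack above the typical gap

lines = [{'t': 100, 'h': 12}, {'t': 116, 'h': 12}, {'t': 132, 'h': 12}, {'t': 180, 'h': 12}]
print(estimate_max_interline(lines))               # gaps of 4, 4, 36 -> most common 4 -> 6.0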

Replace the usage of the `which` command with `where` when Parsr is used under Windows

Summary
The file server/src/extractors/extract-fonts.ts:30 uses the which command, which is *nix-only.
For Windows-based clients, this needs to be replaced with where (available on Windows XP 32-bit and above).
A solution for older Windows clients needs to be integrated too.

Steps To Reproduce

  1. Run the tool, and process any file.
  2. Observe an error reporting that join cannot be called on null. This comes from the fact that spawning the which command produces a null output.

Expected behavior
The input/extractor module runs successfully.

Actual behavior
The following error is observed:

[2019-09-09T05:10:55] ERROR (parsr): Cannot read property 'join' of null
    TypeError: Cannot read property 'join' of null
        at /Users/me/Code/parsr/dist/src/extractors/extract-fonts.js:30:81
        at new Promise (<anonymous>)
        at Object.extractFonts (/Users/me/Code/parsr/dist/src/extractors/extract-fonts.js:29:12)
        at PdfJsonExtractor.run (/Users/me/Code/parsr/dist/src/extractors/pdf2json/PdfJsonExtractor.js:63:43)
        at Orchestrator.run (/Users/me/Code/parsr/dist/src/Orchestrator.js:47:31)
        at runOrchestrator (/Users/me/Code/parsr/dist/bin/index.js:97:14)
        at main (/Users/me/Code/parsr/dist/bin/index.js:88:5)
        at Object.<anonymous> (/Users/me/Code/parsr/dist/bin/index.js:246:1)
        at Module._compile (internal/modules/cjs/loader.js:775:14)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:789:10)


Environment

  • OS: Windows (version to be confirmed)

Additional context
The following discussions on an alternative can be useful:

The following function can be used to detect the current OS:

Handling formatting (bold, italics...) in the markdown export

Currently, all text content is outputted without any formatting (bold, italics) information.
If any formatting information is available, it should be used to enrich the markdown output.

This feature has been requested for NLP-based use-cases, where bold, italics and other formatting information can be exploited to better identify a definition or a term, for example.

Vue - Allow next & prev selection of current element selected

We need a way in the UI to select the next/previous element of the same type (Word, Paragraph, Line, Heading, Table...) as the currently selected element.

I would like the "Element Inspector" to contain a component with just a left arrow and a right arrow that selects the previous or next element of the same type as the current selection.


[Vue] Add persistence to configuration form

Each time a file is uploaded, the custom configuration is lost. During development, a custom configuration is required to test Parsr's behaviour, and losing it on each upload is a nightmare.

I would like to reuse the current custom configuration across multiple uploads, and to have a "reset" button to restore the default configuration.

This feature can be done quickly by moving the custom configuration state from the component "/views/Upload" to the store "/vuex/Store".


Keep the original order of elements in their properties

Currently, the TextOrderDetection module detects an order to the extracted text depending on the physical layout.

It could be interesting to keep the original text node order from PDF in the properties of the element, so that it can still be referred to, and this information is not lost.

Add an API endpoint to serve the list of modules, their documentation and default parameter values (defaultConfig.json)

Description
Currently, there is no way for the client (GUI, etc) to know which parameters are to be supplied for each module, what the default values are, what their max/min values are, etc.
In a deployed environment, the client needs to be aware of this information, which inherently belongs to the server.
An API endpoint /modules returning an object with all this information would even make future server-side changes automatically available to the client.

An error on table export to CSV format

Summary
The table export procedure fails for a particular table.
There seems to be an error somewhere in the pipeline where the table is converted to an array.

Steps To Reproduce
Steps to reproduce the behavior:

  1. Make sure table-detection is turned on.
  2. Pass the attached file through the pipeline.
  3. Observe a TypeError in Table.ts.

Expected behavior
A seamless export of the table to CSV.

Actual behavior
The following error is returned:

[2019-09-09T04:43:06] ERROR (parsr): Cannot set property '0' of undefined
    TypeError: Cannot set property '0' of undefined
        at Table.toArray (/Users/me/Code/parsr/dist/src/types/DocumentRepresentation/Table.js:337:47)
        at /Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:89:47
        at Array.forEach (<anonymous>)
        at /Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:50:27
        at Array.forEach (<anonymous>)
        at MarkdownExporter.getMarkdown (/Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:49:24)
        at MarkdownExporter.export (/Users/me/Code/parsr/dist/src/exporters/MarkdownExporter.js:44:48)
        at /Users/me/Code/parsr/dist/bin/index.js:130:107
        at process._tickCallback (internal/process/next_tick.js:68:7)

Environment

  • Reference commit/version: 16a4e23
  • Other platform details: npm 6.9.0
  • OS: MacOS 10.14.5

Attached file
t2 (dragged).pdf

Automatic high performance Header/Footer detection

The current header/footer detection module HeaderFooterDetectionModule requires an estimate, as a percentage, of the maximal distance from the page boundary within which headers and footers lie.
It would be great to have this module automatically detect headers and footers (using techniques like NLP, vision, etc.) without the need for such a parameter.

Evaluate PDFMiner, pdf-extractor, pdfreader, and ghostscript with a possibility of including them as possible extractors

Recent tests have shown that PDFMiner outperforms pdf2json as a PDF extractor. Its pdf2txt.py -A -t xml -o <output.xml> <input.pdf> command produces XML output very similar in content to that of pdf2json.
It could be nice to provide PDFMiner as an extraction option.

Note: PDFMiner's installation depends on python2 and not python3.

Other extractors to be evaluated:

Fails on openshift Kubernetes `EACCES: permission denied, mkdir '/opt/app-root/src/api/server/dist/output'`

Summary
Trying to run the container on OpenShift 3 Kubernetes fails. By default, OpenShift runs containers in non-root mode. The program tries to create a directory (dist/output) and this fails.

The code that writes this output directory is here:
https://github.com/axa-group/Parsr/blob/develop/api/server/src/api.ts#L55-L59

Steps To Reproduce

$ oc run parsr --image=axarev/parsr --expose=true --port=3001
...
$ oc logs parsr-1-6wchv
Starting par.sr API : node api/server/dist/index.js
fs.js:115
    throw err;
    ^

Error: EACCES: permission denied, mkdir '/opt/app-root/src/api/server/dist/output'
    at Object.mkdirSync (fs.js:753:3)
    at new ApiServer (/opt/app-root/src/api/server/dist/api.js:63:16)
    at Object.<anonymous> (/opt/app-root/src/api/server/dist/index.js:19:11)
    at Module._compile (internal/modules/cjs/loader.js:689:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
    at Module.load (internal/modules/cjs/loader.js:599:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
    at Function.Module._load (internal/modules/cjs/loader.js:530:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
    at startup (internal/bootstrap/node.js:283:19)

Expected behavior
The pod should be running. It does run properly in a default GKE cluster.

Actual behavior
The container fails to run, because it cannot create the directory.
It goes into a restart loop.

Environment

Additional context
The fact that OpenShift by default only allows non-root is documented, e.g. here:
https://stackoverflow.com/questions/42363105/permission-denied-mkdir-in-container-on-openshift

I think it would be better if we changed the container set-up so that it can run as non-root.

pdf2json's installation procedure to be changed for Arch Linux and Ubuntu

Summary
The installation procedure for pdf2json mentioned under Arch Linux assumes the package's presence in the official repositories, but it has recently been removed.
UPDATE: The package pdf2json has also been removed from the latest versions of Ubuntu.

Steps To Reproduce

  1. Boot into a machine running Arch Linux.
  2. In a terminal, execute as superuser: pacman -Ss pdf2json. No packages will be returned.

Expected behavior
An installation of pdf2json.

Actual behavior
The package is not found in the repositories.

Environment

  • OS: Arch Linux

Additional context

  1. A new AUR package that builds pdf2json from source should be created and submitted to https://aur.archlinux.org/
  2. The README files should be modified to reflect the new commands to install this package.

Text rendering mode 3 handling

Some documents are filled with text elements having their rendering mode set to 3.

These blocks of text are supposed to be invisible and thus can spoil paragraph detection and other similar algorithms. They should not be removed, though, because they are sometimes used in OCR-enhanced documents, where the detected invisible text overlaps with the image.

Accept .docx format as an output type

Several use-cases involve direct interpretation/manipulation of .docx files for translation, ingestion into databases, etc.
Along with Markdown, raw text and JSON, it could be a nice feature to have DOCX as an output format.

Add an API endpoint to serve the default server configuration

Description
Currently, the default server configuration is supplied along with the project source.
In a deployed environment, the client cannot be automatically notified if the server's defaults are changed.
An API endpoint /default-config could provide this information, which would remove the need to hardcode anything on the client/GUI side.

EDIT: changed the proposition from /defaultConfig to /default-config to follow the API endpoint naming convention.

Add support for exporting higher levels of granularity (line, paragraph)

Currently, the output JSON supports exporting text elements at the granularity levels of word and character, which are the two most fine-grained levels of detail.

It could be nice to make a more compact JSON export possible, keeping the finest granularity at either Line or Paragraph.

[axarev/parsr:latest] @grpc/grpc-js only works on Node ^8.13.0 || >=10.10.0

Summary

The latest version of Parsr published on Docker Hub throws an error when trying to get the queue status for a document previously submitted to the /document endpoint.

Steps To Reproduce

  1. Use the following docker-compose.yml:
version: '3.3'

services:
  duckling:
    image: axarev/duckling:latest
    ports:
      - 8000:8000

  parsr:
    image: axarev/parsr:latest
    ports:
      - 8080:3000
      - 3001:3001
    environment:
      DUCKLING_HOST: http://duckling:8000
      ABBYY_SERVER_URL:
    volumes:
      - ./pipeline/:/opt/app-root/src/demo/web-viewer/pipeline/

volumes:
  pipeline:
    driver: local
  2. Run docker-compose up.
  3. Call POST http://localhost:3001/api/document with a sample PDF and the "exempli gratia" config.json file.
  4. Call GET http://localhost:3001/api/queue/{id} with the ID produced at step 3.

Expected behavior
Get an appropriate response.

Actual behavior
An HTTP 500 Internal Server Error is received with the following payload:

/opt/app-root/src/node_modules/@grpc/grpc-js/build/src/index.js:47
throw new Error(`@grpc/grpc-js only works on Node ${supportedNodeVersions}`);
^

Error: @grpc/grpc-js only works on Node ^8.13.0 || >=10.10.0
at Object.<anonymous> (/opt/app-root/src/node_modules/@grpc/grpc-js/build/src/index.js:47:11)
	at Module._compile (module.js:652:30)
	at Object.Module._extensions..js (module.js:663:10)
	at Module.load (module.js:565:32)
	at tryModuleLoad (module.js:505:12)
	at Function.Module._load (module.js:497:3)
	at Module.require (module.js:596:17)
	at require (internal/module.js:11:18)
	at Object.<anonymous> (/opt/app-root/src/node_modules/google-gax/build/src/grpc.js:37:14)
		at Module._compile (module.js:652:30)

Environment

  • Reference commit/version: axarev/parsr:latest
  • OS: Windows 10
  • Docker Desktop Community for Windows:
  • Version: 2.1.0.3 (38240)
  • Engine: 19.03.2
  • Compose: 1.24.1

Handle Password Protected PDFs

Password-protected PDF files should be treated with something like QPDF (to get rid of the password requirement) before letting them go through the extractor.
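A sketch of that pre-processing step, calling the qpdf command-line tool from Python; qpdf's --password and --decrypt options exist, but the surrounding file handling is illustrative and the password would still have to be supplied by the caller.

# Illustrative pre-processing step: strip the password from a protected PDF with qpdf
# before handing it to the extractor. File names are placeholders; qpdf must be installed.
import subprocess

def decrypt_pdf(input_path, output_path, password):
    subprocess.run(
        ['qpdf', f'--password={password}', '--decrypt', input_path, output_path],
        check=True,   # raise if qpdf fails (e.g. wrong password)
    )

decrypt_pdf('protected.pdf', 'decrypted.pdf', password='secret')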

Accept .docx files on input

Currently, the system accepts either images or PDF files.
It would be a nice feature if the system could accept DOCX files as input, keeping the hierarchical structure and contents intact as a valid types/DocumentRepresentation instance.

Missing fields in the SVG-Line export into the output JSON format

Inside the document representation, the SvgLine object reads:

{
  _id: 27627,
  _box: BoundingBox {
    _left: 211,
    _top: 6711,
    _width: 4538,
    _height: 11
  },
  _metadata: [],
  _properties: {},
  _children: [],
  content: null,
  _thickness: 2,
  _fromX: 212,
  _fromY: 6725,
  _toX: 4750,
  _toY: 6725
}

Upon export, the same object reads:

{
  id: 27627,
  type: 'svg-line',
  properties: {},
  metadata: [],
  box: {
    l: 211,
    t: 6711,
    w: 4538,
    h: 11
  },
  fromX: undefined,
  fromY: undefined,
  toX: undefined,
  toY: 6725,
  thickness: 2
}

NOTE: the 'undefined' fields.

Add unit tests to getElementOfType()

This function is widely used throughout the code and is not straightforward.
It should be more thoroughly tested, including all the various edge cases one can think of.

Add an architecture diagram to the project documentation

It could be nice to explain the whole Parsr pipeline diagrammatically for easier understanding and adaptation.
It would be great to store the diagram source files in the repository as well.

  • doc/architecture.md
  • doc/assets/architecture-drawing.xxx

Error on docker-compose build

Summary
An error shows up when docker-compose build is launched in the root directory, causing the installation to fail.

Steps To Reproduce
Steps to reproduce the behavior:

  1. Clone the repository.
  2. In the root folder, type docker-compose build
  3. Observe an output similar to: https://gist.github.com/aarohijohal/380f197cf990815ca49dc9919fbed211

Expected behavior
A working installation of Parsr.

Actual behavior
https://gist.github.com/aarohijohal/380f197cf990815ca49dc9919fbed211

Environment

  • Reference commit/version: 455799ad19e06e4ed4763033c985e279d2f00594 (https://github.com/axa-group/Parsr/commit/455799ad19e06e4ed4763033c985e279d2f00594)
  • Other platform details: docker-compose version 1.24.1, build 4667896b
  • OS: MacOS Mojave 10.14.5

python3 command is just python sometimes

Summary
On Arch Linux and some other OSes (maybe Windows as well?), Python 3.x is run using the python command and Python 2.x using the python2 command. Here, we're assuming it's always python3:

const tableExtractor = child_process.spawnSync('python3', [

Steps To Reproduce
Steps to reproduce the behavior:

  1. Have python in your PATH pointing to Python 3.x.
  2. Remove python3 from your PATH.

Expected behavior
Table extraction should run.

Actual behavior
A node error: Error: spawnSync python3 ENOENT.

Environment

  • Reference commit/version: dc3bec7ae62b0db12a075f31c26ba8951a63c2f7
  • Other platform details: Python 3.x is just python, not python3
  • OS: Arch Linux, maybe others?
