
mmda's Introduction

MMDA - multimodal document analysis

This is a work in progress.

Setup

conda create -n mmda python=3.8
conda activate mmda
pip install -r requirements.txt

Parsers

  • SymbolScraper - Apache 2.0

    • Quoted from their README: From the main directory, issue make. This will run the Maven build system, download dependencies, etc., compile source files and generate .jar files in ./target. Finally, a bash script bin/sscraper is generated, so that the program can be easily used in different directories.
  • PDFPlumber - MIT License

  • Grobid - Apache 2.0

Rasterizers

Library walkthrough

1. Creating a Document for the first time

In this example, we use the SymbolScraperParser to convert a PDF into a bunch of text and PDF2ImageRasterizer to convert that same PDF into a bunch of page images.

from typing import List
from mmda.parsers.symbol_scraper_parser import SymbolScraperParser
from mmda.rasterizers.rasterizer import PDF2ImageRasterizer 
from mmda.types.document import Document
from mmda.types.image import PILImage

# PDF to text
ssparser = SymbolScraperParser(sscraper_bin_path='...')
doc: Document = ssparser.parse(input_pdf_path='...pdf')

# PDF to images
pdf2img_rasterizer = PDF2ImageRasterizer()
images: List[PILImage] = pdf2img_rasterizer.rasterize(input_pdf_path='...pdf', dpi=72)

# attach those images to the document
doc.annotate_images(images=images)

2. Saving a Document

You can convert a Document into a JSON object.

import os
import json

# usually, you'll probably want to save the text & images separately:
with open('...json', 'w') as f_out:
    json.dump(doc.to_json(with_images=False), f_out, indent=4)

os.makedirs('.../', exist_ok=True)
for i, image in enumerate(doc.images):
    image.save(os.path.join('.../', f'{i}.png'))
    
    
# you can also save images as base64 strings within the JSON object
with open('...json', 'w') as f_out:
    json.dump(doc.to_json(with_images=True), f_out, indent=4)
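
To make the with_images=True option more concrete, here is a minimal standard-library sketch of how binary image data can be embedded in a JSON object as a base64 string and recovered afterwards. It only illustrates the general technique: the helper names (encode_bytes, decode_bytes) and the dictionary layout are hypothetical, and this is not necessarily how mmda serializes images internally.

```python
import base64
import json


def encode_bytes(raw: bytes) -> str:
    """Encode raw bytes as an ASCII base64 string that JSON can store."""
    return base64.b64encode(raw).decode('ascii')


def decode_bytes(encoded: str) -> bytes:
    """Invert encode_bytes: recover the original bytes."""
    return base64.b64decode(encoded.encode('ascii'))


# Stand-in for the contents of an image file (here, just the 8-byte PNG signature).
fake_png = b'\x89PNG\r\n\x1a\n'

# Hypothetical layout: the text plus a list of base64-encoded page images.
payload = {'symbols': 'some text', 'images': [encode_bytes(fake_png)]}
serialized = json.dumps(payload)

# Round trip: parse the JSON and decode the image bytes again.
restored = json.loads(serialized)
restored_png = decode_bytes(restored['images'][0])
assert restored_png == fake_png
```

Base64 output is roughly one third larger than the raw bytes, which is one reason to prefer saving page images as separate files when the JSON needs to stay small.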

3. Loading a serialized Document

You can create a Document from its saved output.

import json
import os
from typing import List

from mmda.types.document import Document
from mmda.types.image import PILImage, pilimage

# directly from a JSON.  This should also handle the case where `images` were serialized as base64 strings.
with open('...json') as f_in:
    doc_dict = json.load(f_in)
    doc = Document.from_json(doc_dict=doc_dict)

# if you saved your images separately, then you'll want to reconstruct them & re-attach
outdir = '.../'  # the directory the page images were saved to
images: List[PILImage] = []
for i, page in enumerate(doc.pages):
    image_path = os.path.join(outdir, f'{i}.png')
    assert os.path.exists(image_path), f'Missing file for page {i}'
    image = pilimage.open(image_path)
    images.append(image)
doc.annotate_images(images=images)

4. Iterating through a Document

The minimum requirement for a Document is its .symbols field, which is just a <str>. For example:

doc.symbols
> "Language Models as Knowledge Bases?\nFabio Petroni1 Tim Rockt..."

The library becomes really useful when you have multiple ways of segmenting .symbols. For example, you can segment the paper into Pages, and then each Page into Rows:

for page in doc.pages:
    print(f'\n=== PAGE: {page.id} ===\n\n')
    for row in page.rows:
        print(row.symbols)
        
> ...
> === PAGE: 5 ===
> ['tence x, s′ will be linked to s and o′ to o. In']
> ['practice, this means RE can return the correct so-']
> ['lution o if any relation instance of the right type']
> ['was extracted from x, regardless of whether it has']
> ...

This example shows two nice aspects of this library:

  • Document provides iterables for different segmentations of symbols. Options include things like pages, tokens, rows, sents, paragraphs, sections, .... Not every Parser will provide every segmentation, though. For example, SymbolScraperParser only provides pages, tokens, rows. More on how to obtain other segmentations later.

  • Each one of these segments (in our library, we call them SpanGroup objects) is aware of (and can access) other segment types. For example, you can call page.rows to get all Rows that intersect a particular Page. Or you can call sent.tokens to get all Tokens that intersect a particular Sentence. Or you can call sent.rows to get the Row(s) that intersect a particular Sentence. These indexes are built dynamically when the Document is created and each time a new DocSpan type is loaded. In the extreme, one can do:

for page in doc.pages:
    for paragraph in page.paragraphs:
        for sent in paragraph.sents:
            for row in sent.rows:
                for token in sent.tokens:
                    pass
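
As a rough illustration of how a lookup like sent.rows could work in principle, the sketch below models each segment as a character range over the symbols string and finds overlapping ranges with a linear scan. This is a toy model under stated assumptions, not mmda's actual implementation: the Span class and find_intersecting function are hypothetical names, and a real index would likely avoid scanning every candidate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Half-open character range [start, end) into the symbols string."""
    start: int
    end: int

    def intersects(self, other: 'Span') -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


def find_intersecting(query: Span, candidates: List[Span]) -> List[Span]:
    """Return every candidate span that overlaps the query span."""
    return [span for span in candidates if query.intersects(span)]


symbols = 'One two. Three four.'

# Two hypothetical segmentations of the same string.
sents = [Span(0, 8), Span(9, 20)]   # 'One two.' and 'Three four.'
rows = [Span(0, 14), Span(15, 20)]  # 'One two. Three' and 'four.'

# The second sentence is split across both rows, so it intersects both.
for row in find_intersecting(sents[1], rows):
    print(symbols[row.start:row.end])
```

Because every segmentation refers back to the same string, any two segment types can be related purely by comparing character offsets.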

You can check which fields are available in a Document via:

doc.fields
> ['pages', 'tokens', 'rows']

5. Loading a new SpanGroup field

Not all Documents will have all segmentations available at creation time. You may need to load new fields to an existing Document.

TBD...

6. Editing existing fields in the Document

We currently don't provide any tools for mutating the data in a Document once it's been created, aside from loading new data. Edit existing fields at your own risk.

TBD...

mmda's People

Contributors

kyleclo, lolipopshock, yoganandc, cmwilhelm, rauthur, soldni, rodneykinney, geli-gel, regan-huff, stefanc-ai2, whattabatt, egork520
