Deep Learning for XML

This is a deep learning framework for operating on XML structured data. It is implemented in PyTorch. The framework has modular and extensible components for training, debugging, inference, checkpointing, model schema migration, etc. XML is the first-class format for a large number of applications (web pages, office documents, SVG, etc.).

In this release, we have implemented an equivalent of seq2seq. Given a set of input and output XMLs, the framework can automatically learn the transformations between them and then apply those transformations to novel XML inputs.

This is an alpha release. We appreciate any kind of feedback or contribution. In particular, we are looking for:

New application scenarios in your domain of interest.
Bug reports.
Code contributions. If you would like to join the project, please contact us on the forum.

Deep Learning for XML
- Key Features
Installation
- Prerequisites
Getting Started
- Datasets
- Some plots
  - Tensorboard integration with toy1
  - Evaluating toy1 training
Roadmap
Troubleshooting and Contributing

Key Features

Encoder decoder architecture.
Encoder designed to capture hierarchical structure of an XML.
1. Typical seq2seq models operate on relatively short sentences, and information flows linearly (unidirectionally or bidirectionally). XML data is instead hierarchical and requires information to flow along tree edges as well.
2. Regular RNN for text and attributes of XML nodes.
3. Inspired by GraphRNN for capturing the structure of an XML tree.
4. Respects the order of children of an XML element.
5. Treats the order of XML attributes as insignificant.
Decoder is designed to generate output XML.
1. Use of attention (1, 2) to find the appropriate character position, XML node, or node attribute to focus on.
2. Use of pointer networks for learning to copy portions of text verbatim from the input XML.
3. Custom GPU implementation of performance critical modules.
4. Support for beam decoding during inference for better accuracy.
Use of shortcut connections between layers in the network for more stable convergence.
Tensorboard integration (over PyTorch tensors).
Schema versioning: We keep tweaking our models, so we often need a way to migrate training done on an old model into a new schema. This can be called a kind of "self-transfer learning". It is supported via schema versioning.
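
As a rough illustration of the two ordering rules listed for the encoder (child order is respected, attribute order is not), the sketch below flattens an XML tree into per-node records with parent links. It uses only the Python standard library. The names `NodeRecord` and `flatten_tree` are hypothetical and chosen for this example; they are not part of the framework's API, and the real encoder may represent trees differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeRecord:
    """Hypothetical per-node record; not the framework's actual data structure."""
    index: int                                # position in pre-order (document) order
    parent: Optional[int]                     # index of the parent record, None for the root
    tag: str
    text: str
    attributes: Tuple[Tuple[str, str], ...]   # sorted, so source order does not matter


def flatten_tree(xml_string: str) -> List[NodeRecord]:
    """Flatten an XML document into pre-order node records with parent links."""
    root = ET.fromstring(xml_string)
    records: List[NodeRecord] = []

    def visit(element: ET.Element, parent: Optional[int]) -> None:
        index = len(records)
        records.append(NodeRecord(
            index=index,
            parent=parent,
            tag=element.tag,
            text=(element.text or "").strip(),
            attributes=tuple(sorted(element.attrib.items())),
        ))
        # Children are visited in document order, so sibling order is preserved.
        for child in element:
            visit(child, index)

    visit(root, None)
    return records


for record in flatten_tree('<a y="2" x="1"><b>hi</b><c/></a>'):
    print(record)
```

The parent indices carry the tree edges along which information would flow, while sorting the attributes makes two documents that differ only in attribute order produce identical records.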

Installation

This package supports Python 3.6. We recommend creating a new virtual environment for this project (using virtualenv or conda).

Prerequisites

Install Python and ninja. On macOS, use the following commands to install them with MacPorts:

$ sudo port install python36
$ sudo port install py36-pip
$ sudo port select --set pip pip36
$ sudo port select --set python python36
$ sudo port install ninja
Check out the repository.

git clone https://github.com/nishantsharma/xml.ai
Install all Python packages listed in requirements.txt.

$ sudo pip install -r requirements.txt

Getting Started

Datasets

Currently, we run on generated datasets. We support generating 4 toy datasets.

S.No	Dataset ID	Description	Input Example	Output Example
1.	toy0	Inverts node.text	<toyrev>ldhmo</toyrev>	<toyrev>omhdl</toyrev>
2.	toy1	Swaps parent and child node tags	<tag1><tag2 /></tag1>	<tag2><tag1 /></tag2>
3.	toy2	Swaps shipping and billing address fields.	Generated data compliant with schema.xsd.	The two addresses swapped.
4.	toy3	Children order is reversed. Attribute list is rotated. Tail and text swapped.	<a><b p1="p1"></b> <c p2="p2"></c></a>	<a><c p1="p1"></c> <b p2="p2"></b></a>
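
To make the first two rows of the table concrete, here is a minimal sketch of the toy0 and toy1 transformations written directly in Python with the standard library. This is only an illustration of the target mappings that the model is expected to learn from examples; the function names are hypothetical, and this is not the code used by scripts/generate.sh.

```python
import xml.etree.ElementTree as ET


def toy0_transform(xml_string: str) -> str:
    """Reverse the text of the root node (toy0-style)."""
    root = ET.fromstring(xml_string)
    root.text = (root.text or "")[::-1]
    return ET.tostring(root, encoding="unicode")


def toy1_transform(xml_string: str) -> str:
    """Swap the tags of the root node and its child (toy1-style)."""
    root = ET.fromstring(xml_string)
    child = root[0]  # assumes a single child, as in the table's example
    root.tag, child.tag = child.tag, root.tag
    return ET.tostring(root, encoding="unicode")


print(toy0_transform("<toyrev>ldhmo</toyrev>"))  # <toyrev>omhdl</toyrev>
print(toy1_transform("<tag1><tag2 /></tag1>"))   # <tag2><tag1 /></tag2>
```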

Preparing toy datasets

Run the generation script to create a toy dataset. By default, the generated data is stored in data/inputs/<domainId>.

./scripts/generate.sh --domain toy1
./scripts/generate.sh --domain toy2

To get help on generation parameters, run the following command.

./scripts/generate.sh --domain toy1 --help
./scripts/generate.sh --domain toy2 --help

Training

To continue the last training run on the default domain:

./scripts/train.sh

To continue the last training run for a specific domain:

./scripts/train.sh --domain toy1
./scripts/train.sh --domain toy2

For help.

./scripts/train.sh -h

Evaluate a model

To evaluate the latest trained model of a domain:

./scripts/evaluate.sh --domain <domainId>

To evaluate on domain toy1.

./scripts/evaluate.sh --domain toy1

For help.

./scripts/evaluate.sh -h

Tensorboard

To view tensorboard logs, first make sure that tensorboard is already installed.

pip3 install tensorboard

Then, run the following command:

tensorboard --logdir ./data/training/runFolders/

Checkpoints

Training checkpoints are organized by domainId, runNo, modelSchemaNo and function as shown in the following file structure.

data/
  +-- training/
        +-- runFolders/
              +-- run.<runNo>.<domainId>_<modelSchemaNo>/
              +-- run.00000.toy1_0/
                    +-- Chk<epochNo>.<batchNo>/
                          +-- input_vocab*.pt
                          +-- output_vocab.pt
                          +-- model.pt
                          +-- modelArgs
                          +-- trainer_states.pt
  +-- testing/
        +-- runFolders/
              +-- run.00000.toy1_0/
                    +-- Chk*/
              +-- run.<runNo>.<domainId>_<modelSchemaNo>/
                    +-- Chk*/
  +-- inputs/
        +-- <domainId>/
              +-- dev/
                    +-- dataIn*.xml
                    +-- dataOut*.xml
              +-- test/
                    +-- dataIn*.xml
                    +-- dataOut*.xml
              +-- train/
                    +-- dataIn*.xml
                    +-- dataOut*.xml

By default, the scripts save checkpoints under the data/training/runFolders folder of the root directory, as shown in the file structure above. Look at the usages of the sample code for more options, including resuming and loading from checkpoints.
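
The run folder naming scheme in the file structure above, run.<runNo>.<domainId>_<modelSchemaNo>, can be parsed with a small helper when you want to locate runs for a domain programmatically. The following is a sketch under the assumption that folder names always follow the pattern of the example run.00000.toy1_0; the names `RunFolder`, `parse_run_folder`, and `latest_run` are hypothetical and not part of the framework.

```python
import re
from typing import List, NamedTuple, Optional


class RunFolder(NamedTuple):
    run_no: int
    domain_id: str
    model_schema_no: int


# Pattern inferred from the example folder name "run.00000.toy1_0" (an assumption).
RUN_FOLDER_PATTERN = re.compile(r"^run\.(\d+)\.(.+)_(\d+)$")


def parse_run_folder(name: str) -> Optional[RunFolder]:
    """Split a run folder name into its parts, or return None if it does not match."""
    match = RUN_FOLDER_PATTERN.match(name)
    if match is None:
        return None
    run_no, domain_id, schema_no = match.groups()
    return RunFolder(int(run_no), domain_id, int(schema_no))


def latest_run(names: List[str], domain_id: str) -> Optional[RunFolder]:
    """Return the run with the highest run number for the given domain, if any."""
    parsed = (parse_run_folder(name) for name in names)
    runs = [run for run in parsed if run is not None and run.domain_id == domain_id]
    return max(runs, key=lambda run: run.run_no, default=None)


names = ["run.00000.toy1_0", "run.00003.toy1_1", "run.00002.toy2_0", "notes.txt"]
print(latest_run(names, "toy1"))
```

In practice the list of names could come from `os.listdir("data/training/runFolders")`; it is passed in explicitly here so the helper stays independent of the file system.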

Some plots

Tensorboard integration with toy1

Evaluating toy1 training

Roadmap

The goal of this library is to facilitate the development of XML-to-XML transformation techniques and applications.

Application scenarios

We plan to bring the following application scenarios to life.

Given a few XMLs, propose an XML schema that best describes them. It may be a standard open schema.
XSLT Extractor: Given an input and output XML, generate the simplest XSLT that translates one to the other. Something like what prose does.
Learn aesthetic transformations for common XML formats like SVG, PPT, DOC.
...

Model roadmap

We have the following on our roadmap.

Currently, our decoder generates an output sequence, and the learning process forces it to be XML. We want to generate output XML directly.
We generate the complete training output as text. Instead, we want to generate XML transformations: think of XSLT that turns input XML into output XML.
We currently operate at the supervised learning level, which may be sufficient. But imagine a scenario where a human is editing an XML document (say, a resume) for aesthetics. In that case, we can interpret aesthetics as an objective function. We would like to discover this underlying aesthetics objective function, for example by using inverse reinforcement learning.

Framework roadmap

While constantly improving the performance and the quality of code and documentation, we will also focus on the following items:

Identification and evaluation with benchmarks;
Provide more flexible model options, improving the usability of the library;
Support features in the new versions of PyTorch.

Troubleshooting and Contributing

If you have any questions, bug reports, or feature requests, please open an issue on GitHub. For live discussions, please go to our Gitter lobby.

We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.
