opensuse / rstxml2docbook


Converts RST XML to DocBook XML

Home Page: http://opensuse.github.io/rstxml2docbook/

License: GNU General Public License v3.0

Python 51.17% XSLT 48.83%

sphinx rst converter docbook python3

rstxml2docbook's Introduction

Convert RST to DocBook XML

The rstxml2db script converts RST XML files to DocBook XML.

Quick Start

To use the program without pip or a virtual environment, run the following command after cloning this repository:

$ PYTHONPATH=src python3 -m rstxml2db -h

Installing

To install rstxml2db in a Python virtual environment, use the following steps:

Clone this repository:

$ git clone http://github.com/openSUSE/rstxml2docbook.git
$ cd rstxml2docbook

Create a Python 3 environment and activate it:

$ python3 -m venv .env
$ source .env/bin/activate

Update the pip and setuptools modules:
```
$ pip install -U pip setuptools
```
Install the package:
```
$ ./setup.py develop
```

If you need to install it from GitHub directly, use this URL:

git+https://github.com/openSUSE/rstxml2docbook.git@develop

After installation in your Python virtual environment, two executable scripts are available: rstxml2db and rstxml2docbook. Both do the same thing; the two names exist only for convenience.

Workflow

The script does the following steps:

Read the intermediate XML files from a previous Sphinx conversion step (see Building the Intermediate XML Files).
Resolve any references to external files and create a single XML tree in memory.
Transform the tree with XSLT into DocBook and, if requested, split it into several smaller files.
Write the result to stdout or save it into one or more files, depending on whether splitting mode is activated.

Building the Intermediate XML Files

Usually, you first create the intermediate XML file (using the XML builder with the -b option):

$ sphinx-build -b xml -d .../build/html.doctree src/ xml/

The src/ directory contains all of your RST files, whereas the xml/ directory is the output directory.

Each RST file generates a corresponding XML file.

Building the DocBook Files

After you have created the intermediate XML files, it's now time to use the rstxml2db script. The script reads in all XML files and creates DocBook files, for example:

$ rstxml2db xml/index.xml

By default, the previous step uses the index.xml file and generates several DocBook files all located in the out/ directory.

If you need a single DocBook file, use the -ns option to write the resulting DocBook file to stdout.

The Internal Workflow

The workflow for converting RST XML files into DocBook involves these steps:

Load the index.xml file.
Resolve all external references to other files; create one single RST XML tree.
If --legalnotice is used, add the legalnotice file into bookinfo.
If --conventions is used, replace first chapter with preface content.
Clean up XML:
1. Remove IDs with no corresponding <xref/>.
2. Convert absolute column widths into relative values.
3. Add a processing instruction in <screen> if the longest line inside the screen exceeds a certain number of characters.
Output the tree, either by saving it or by printing it to stdout.
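
The column-width cleanup in step 2 could look roughly like the following sketch. It uses only the Python standard library; the function name relative_colwidths and the proportional "N*" notation are assumptions chosen for illustration, not the script's actual code.

```python
import xml.etree.ElementTree as ET


def relative_colwidths(tgroup):
    """Rewrite absolute colspec widths as proportional 'N*' values.

    Hypothetical helper: the real cleanup module may work differently.
    """
    colspecs = tgroup.findall("colspec")
    widths = [float(c.get("colwidth", "1")) for c in colspecs]
    total = sum(widths) or 1.0
    for colspec, width in zip(colspecs, widths):
        # Express each width as its percentage share of the total width.
        colspec.set("colwidth", "%d*" % round(100 * width / total))
    return tgroup


tgroup = ET.fromstring(
    '<tgroup cols="2"><colspec colwidth="30"/><colspec colwidth="90"/></tgroup>'
)
relative_colwidths(tgroup)
print([c.get("colwidth") for c in tgroup.findall("colspec")])
```

With widths 30 and 90, the two columns end up as 25* and 75*.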

The transformation from separate RST XML files into a single RST XML tree uses mainly the element list_item[@classes='toctree-l1']. Anything that is referenced is used as a file for inclusion. Everything else is copied as it is.

The transformation from the single RST XML tree into DocBook 5 uses the rstxml2db.xsl stylesheet.

Things to Know During Conversion

The conversion internally creates a single RST XML tree. This tree contains all the information that is needed.

For example, the following things work:

Internal referencing from one section to another (element reference[@internal='True'])
Internal references to a glossary entry (element reference[@internal='True'], but with @refuri containing a # character)
External referencing to a remote site (element reference[@refuri])
Different, nested sections are correctly converted into the DocBook structures (book, chapter, section etc.)
Admonition elements
Tables and figures
Lists like bullet_list, definition_list, and enumerated_list
Glossary entries
Inline elements like strong, literal_emphasis

The following issues are still problematic:

Double IDs: When RST contains the same title more than once, the RST XML builder generates the same ID for each occurrence. I consider this a bug.
Invalid structures: RST allows structures which are not valid in DocBook, for example paragraphs that follow the last section. This leads to validation errors in DocBook. The script currently does not detect these structural issues, so you need to adapt the structure manually.


rstxml2docbook's Issues

Show available XSLT parameters

Problem description

Currently, the user does not know which parameters are available for the -p/--param option.

Expected solution

Use an additional --help-xsl-param option which just shows all the available parameters and does nothing else.
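
One possible way to collect the parameters for such an option is to read the top-level xsl:param elements of the stylesheet. The following is only a sketch: list_xsl_params and the two parameter names in SAMPLE are hypothetical and not taken from rstxml2db.xsl.

```python
import xml.etree.ElementTree as ET

XSL_NS = "{http://www.w3.org/1999/XSL/Transform}"


def list_xsl_params(stylesheet_text):
    """Return (name, default) pairs for all top-level xsl:param elements."""
    root = ET.fromstring(stylesheet_text)
    params = []
    # findall() without a path prefix only looks at direct children,
    # so template-local parameters are ignored.
    for param in root.findall(XSL_NS + "param"):
        default = param.get("select") or (param.text or "").strip()
        params.append((param.get("name"), default))
    return params


# Hypothetical stylesheet fragment used only for this example.
SAMPLE = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="productname"/>
  <xsl:param name="rootlang" select="'en'"/>
</xsl:stylesheet>"""

for name, default in list_xsl_params(SAMPLE):
    print(name, default)
```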

Root Elements in Split Files are Incomplete

Problem description

When splitting files, the script creates something like this:

<section xmlns="http://docbook.org/ns/docbook">
  <title>Bla...</title>
 <!-- ... -->
</section>

However, the script omits the following items from the root element:

version="5.1"
XLink namespace xmlns:xlink="http://www.w3.org/1999/xlink"

Proposed solution

Add the missing items to the root element.

Exception when screen is empty

Problem description

When a screen element is empty, we get an exception:

AttributeError: 'NoneType' object has no attribute 'split'

The exception is raised in the function add_pi_in_screen of the rstxml2db.cleanup module.

Proposed solution

Fix the issue in the rstxml2db.cleanup module
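
The error message suggests that .split() is called on the text of an empty element, which is None. A minimal sketch of a guard, assuming the function needs the longest line of the screen content (max_line_length is a hypothetical helper, not the real add_pi_in_screen):

```python
import xml.etree.ElementTree as ET


def max_line_length(screen):
    """Return the length of the longest line in a screen element (0 if empty)."""
    text = screen.text or ""  # .text is None for an empty <screen/>
    return max((len(line) for line in text.split("\n")), default=0)


print(max_line_length(ET.fromstring("<screen/>")))
print(max_line_length(ET.fromstring("<screen>ls -l\ncd /tmp</screen>")))
```

The same `or ""` fallback applies to lxml elements, whose .text attribute is also None when the element has no text.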

Handle sequence diagrams properly

Problem description

Some RST files contain sequence diagrams (.diag).

Actual behaviour

The sequence diagrams are not converted.

Expected behaviour

The sequence diagrams are converted automatically:

The conversion script creates an image automatically.
The image filename is created automatically.

Move block elements outside of paragraph

Problem description

Some structures contain block elements inside a paragraph like this:

<paragraph>The quick brown fox jumps over the lazy dog...
   <bullet_list bullet="*">
      <list_item>...</list_item>
   </bullet_list>
Another lazy fox jumps over the brown dog...
</paragraph>

Actual behaviour

Block elements inside paragraphs as shown above.

Expected behaviour

All block elements (like bullet_list from the above examples) have to be placed outside of a paragraph like this:

<paragraph>The quick brown fox jumps over the lazy dog...</paragraph>
<bullet_list bullet="*">
    <list_item>...</list_item>
</bullet_list>
<paragraph>Another lazy fox jumps over the brown dog...</paragraph>
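
The expected restructuring can be sketched in Python as follows. This is an illustration using the standard library only; the set of block element names in BLOCKS and the function split_paragraph are assumptions, and the project itself may solve this in XSLT instead.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed set of block-level elements; extend as needed.
BLOCKS = {"bullet_list", "enumerated_list", "definition_list"}


def split_paragraph(par):
    """Split a paragraph at block-level children; return the new elements."""
    result = []
    current = ET.Element("paragraph")
    current.text = par.text
    for child in par:
        child = copy.deepcopy(child)
        if child.tag in BLOCKS:
            # Text after the block becomes the start of a new paragraph.
            tail, child.tail = child.tail, None
            result.extend([current, child])
            current = ET.Element("paragraph")
            current.text = tail
        else:
            current.append(child)  # inline elements stay in the paragraph
    result.append(current)
    # Drop paragraphs that contain neither children nor visible text.
    return [
        el for el in result
        if el.tag != "paragraph" or len(el) or (el.text or "").strip()
    ]


par = ET.fromstring(
    "<paragraph>The quick brown fox..."
    "<bullet_list><list_item>...</list_item></bullet_list>"
    "Another lazy fox...</paragraph>"
)
print([el.tag for el in split_paragraph(par)])
```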

Make consistent IDs

Problem description

RST documentation can contain references in the following form:

An external reference like:

<reference name="Django documentation"
refuri="https://docs.djangoproject.com/en/dev/ref/settings/#allowed-hosts">Django documentation</reference>

Internal reference no 1, pointing to a specific RST file with anchor:

<reference internal="True" name="Foo" refuri="../admin/customize#foo">Foo</reference>

Internal reference no 2, pointing to an anchor:

<reference name="Bar" refid="bar">Bar</reference>

Internal reference to another file:

<reference internal="True" refuri="book/quickstart#quickstart">Quickstart</reference>

The third example is the one that causes the most problems. As most sections contain auto-generated IDs, the chances of getting ambiguous IDs are pretty high.

Actual behaviour

Internal reference no 2 causes the most problems.

Convert OpenStack Documentation (e.g. Horizon) to DocBook 5 and try to validate. You will get some ID problems.

Proposed solution

Add a prefix for each ID
The prefix is retrieved/extracted from the book ID
Use the prefix plus current ID wherever it is needed.
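
A rough sketch of the proposed prefixing, assuming the prefix is the first ID of the root element and that only the ids and refid attributes need to be rewritten. The function prefix_ids and the "-" separator are hypothetical.

```python
import xml.etree.ElementTree as ET


def prefix_ids(root, sep="-"):
    """Prefix every ids/refid value below root with the root's first ID."""
    root_ids = root.get("ids", "").split()
    if not root_ids:
        return root  # nothing to use as a prefix
    prefix = root_ids[0]
    for el in root.iter():
        if el is root:
            continue
        for attr in ("ids", "refid"):
            if attr in el.attrib:
                values = el.get(attr).split()
                el.set(attr, " ".join(prefix + sep + v for v in values))
    return root


doc = ET.fromstring(
    '<document ids="horizon">'
    '<section ids="bar"><reference refid="bar">Bar</reference></section>'
    "</document>"
)
prefix_ids(doc)
print(doc.find("section").get("ids"), doc.find(".//reference").get("refid"))
```

Because definitions and references receive the same prefix, links inside one book still resolve, while identical auto-generated IDs in different books no longer clash.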

Define root element in DocBook output

Problem description

Currently, when transforming the RST XML intermediate format to DocBook, the root element is book.
This can be inconvenient if you want to have a set with different books.

Actual behaviour

The DocBook output contains only a book root element.

Proposed solution

Add an option --root or similar where you can set the root element. It should support book and set as root elements.
Child elements of the root element should be appropriately adjusted.

Wrong XML raises XMLSyntaxError

Problem description

When using an XML file which isn't syntactically correct (shouldn't be the case, but, well...), the script raises an XMLSyntaxError.

Actual behaviour

An XML file with a missing end tag gives the following stack trace:

lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: reference line 13 and paragraph, line 14, column 51
Stack (most recent call last):
  File "/local/repos/GH/rstxml2docbook/.env/bin/rstxml2db", line 11, in <module>
    load_entry_point('rstxml2docbook', 'console_scripts', 'rstxml2db')()
  File "/local/repos/GH/rstxml2docbook/src/rstxml2db/__init__.py", line 44, in main
    log.fatal(error, exc_info=error, stack_info=True)

Add new option to store result trees

Problem description

For debugging purposes, it would be nice to have the result of each step somewhere stored in /tmp.

Actual behaviour

No result trees are stored yet.

Expected behaviour

Option --result-tree / -R enables this feature
Option --result-tree-dir contains the directory where to store it. If the directory doesn't exist, it will be created.
Each step produces a file <STEP-NUMBER>.xml in the directory given by --result-tree-dir.

Concat multiple IDs

Problem description

Multiple IDs are evil. Some look like this:

<section ids="a b">

Actual behaviour

Currently, we use the b part of the IDs.

Expected behaviour

To make IDs unique, use both parts.

DeprecationWarning: inspect.getargspec() is deprecated

Problem description

rstxml2docbook/tests/test_docstrings.py:79: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() or inspect.getfullargspec()
  for arg in inspect.getargspec(func).args:
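
The warning itself names the replacement. A minimal sketch of the same loop source using inspect.signature() (argument_names and example are illustrative names only):

```python
import inspect


def argument_names(func):
    """Return the parameter names of func without inspect.getargspec()."""
    return list(inspect.signature(func).parameters)


def example(source, target=None):
    pass


print(argument_names(example))
```

Note that signature().parameters also lists *args, **kwargs, and keyword-only parameters, which getargspec().args did not include.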

Conversion fails when external references are present

File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
lxml.etree.XSLTApplyError: Cannot resolve URI user/freezer-agent

Missing template for list_item/strong

Problem description

The RST XML contains this structure

<bullet_list>
 <list_item>
  <strong>UUID tokens: </strong>
  <literal classes="sp_impl_complete"
   ids="operation_create_unscoped_token_uuid">complete</literal>
 </list_item>
 <list_item>
  <strong>Fernet tokens: </strong>
  <literal classes="sp_impl_complete"
   ids="operation_create_unscoped_token_fernet">complete</literal>
 </list_item>
</bullet_list>

Actual behaviour

It is transformed into a <listitem> without a <para> inside:

<itemizedlist>
  <listitem>
    <emphasis role="bold">UUID tokens: </emphasis>
    <literal>missing</literal>
  </listitem>
  <listitem>
    <emphasis role="bold">Fernet tokens: </emphasis>
    <literal>complete</literal>
  </listitem>
</itemizedlist>

Expected behaviour

Add the missing <para>.

Failed to convert parts of the document

Keystone, admin-identity-tokens.xml file:

Example:

               <inline>
                  <reference refid="operation_create_unscoped_token">
                    <info/>
                    <strong>Create unscoped token</strong>
                  </reference>
                </inline>
              </entry>
              <entry>
                <inline classes="sp_feature_mandatory">mandatory</inline>
              </entry>

Use all parts of the attribute ids value

Problem description

Some sections come with multiple IDs like this:

<section ids="module-keystone.v2_crud module-contents" names="module\ contents">

It's hard to decide which one is the "better" one.

Actual behaviour

Usually, the last part after the space was taken. However, in Keystone for example, this led to IDs which were the same.

Expected behaviour

Unique IDs.

This can be done by replacing the space with a _, for example. So basically, the two parts are combined.
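
As a small illustration of the idea, the following sketch joins all parts of an ids value with an underscore. The helper name join_ids is hypothetical.

```python
def join_ids(ids, sep="_"):
    """Combine all whitespace-separated parts of an ids value into one ID."""
    return sep.join(ids.split())


print(join_ids("module-keystone.v2_crud module-contents"))
```

For the section shown above this yields module-keystone.v2_crud_module-contents.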

literal should be screen in some cases

Problem description

Some RST documents end up with this structure:

<bullet_list>
   <list_item>
      <literal classes="sp_cli">...</literal>
   </list_item>
</bullet_list>

Actual behaviour

The result is just <inline>

Expected behaviour

The DocBook file contains a <screen>.

Logging levels are wrong

Problem description

When using option -v, it should output only warnings. With -vv it should output a bit more, and so on.

Actual behaviour

With option -v, INFO and WARNING messages are shown mixed together.
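
One common way to get this behaviour is to count the -v options and map the count to a logging level. The sketch below assumes that no -v means errors only; that default, the LEVELS tuple, and verbosity_to_level are assumptions, not the script's current code.

```python
import argparse
import logging

# Index = number of -v options given (assumed mapping).
LEVELS = (logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG)


def verbosity_to_level(count):
    """Map a -v count to a logging level, clamping out-of-range values."""
    return LEVELS[min(max(count, 0), len(LEVELS) - 1)]


parser = argparse.ArgumentParser()
parser.add_argument("-v", dest="verbose", action="count", default=0)
args = parser.parse_args(["-vv"])
print(logging.getLevelName(verbosity_to_level(args.verbose)))
```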

Support .seealso directive

Problem description

The .. seealso:: directive is currently not supported. Example RST file:

.. seealso::
   For more information, see ...

This creates the following content in the RST XML file:

<seealso>
   <paragraph>For more information, see ...</paragraph>
</seealso>

Actual behaviour

Files containing this directive create this warning message:

[WARNING ] - rstxml2db.xml.process - Unknown element 'seealso'

<mediaobject> needs <informalfigure> or <figure>

GeekoDoc used to allow bare <mediaobject>s, but does not allow them anymore. The conversion from RST needs to take this into account and wrap them in <informalfigure>.

Problem with <reference> to remote URL

Reported by @dmpop via IRC and pastebin.

Problem description

It seems that a reference to a remote resource in the TOC is not processed correctly:

<list_item classes="toctree-l1">
  <compact_paragraph classes="toctree-l1">
    <reference anchorname="" internal="False"
            refuri="https://developer.openstack.org/api-ref/baremetal/"
            >API Reference (latest)</reference>
  </compact_paragraph>
</list_item>

Actual behaviour

(.env)  dpopov@e219  ~/Git/openstack-docs/ironic  rstxml2db index.xml
Traceback (most recent call last):
  File "/run/media/dpopov/DATAPART1/Git/rstxml2docbook/.env/bin/rstxml2db", line 11, in <module>
    load_entry_point('rstxml2docbook', 'console_scripts', 'rstxml2db')()
  File "/run/media/dpopov/DATAPART1/Git/rstxml2docbook/src/rstxml2db/cli.py", line 235, in main
    return process(args)
  File "/run/media/dpopov/DATAPART1/Git/rstxml2docbook/src/rstxml2db/xml/process.py", line 137, in process
    xml = transform(doc, args)
  File "/run/media/dpopov/DATAPART1/Git/rstxml2docbook/src/rstxml2db/xml/process.py", line 68, in transform
    rst = resolve_trans(doc)
  File "src/lxml/xslt.pxi", line 580, in lxml.etree.XSLT.__call__
  File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
lxml.etree.XSLTApplyError: Cannot resolve URI https://developer.openstack.org/api-ref/baremetal/.xml

Steps to reproduce the behaviour

See openstack/ironic

Unknown elements raw and meta

Problem description

Some OpenStack documentation contains raw and meta macros like in this example:

.. meta::
  :description: Bla
  :keywords: Architecture, bla, Foo

.. raw:: html

   <!-- Links -->

These are converted into <meta> tags, or <raw> in the RST XML.

Actual behaviour

Files containing these macros create this warning message:

[WARNING ] - rstxml2db.xml.process - Unknown element 'meta'
[WARNING ] - rstxml2db.xml.process - Unknown element 'raw'

para below book

Comment out broken parts. Add the FIXME flag.

<book xml:lang="en" xml:id="welcome-to-glance-s-documentation"><title>Welcome to Glance’s documentation!</title><info/>
    
    
    <para>The Image service (glance) project provides a service where users can upload and
            discover data assets that are meant to be used with other services. This currently
            includes images and metadata definitions.</para>

Move tests/doc* into tests/data

Problem description

Under the tests/ directory, we have this tree:

tests/
├── doc
│   └── [...]
├── doc.001
│   └── [...]
├── doc.002
│   └── [...]
└── doc.003
    └── [...]

For consistency reasons, we should move all the doc* directories into data.

Table is missing cols attribute

Problem description

When transforming a Docutils table into a DocBook table, the cols attribute is missing.

Actual behaviour

Missing cols attribute in DocBook table.

Expected behaviour

The cols attribute is available.

Switch to DocBook5 output format

Problem description

When running rstxml2db, the script internally creates a DocBook4 structure which is then transformed into DocBook 5. This is inconvenient as this additional step has some drawbacks:

You easily forget there is another transformation stylesheet (suse-upgrade.xsl)
It makes it harder to maintain
Our documents are DocBook 5 anyway now
Duplication of suse-upgrade.xsl in daps-xslt/migrate/.

Solution

Get rid of suse-upgrade.xsl and output DocBook 5.
Remove option -4/--db4 in rstxml2db/cli.py
Adapt rstxml2db/xml/process.py::process()
Adapt rstxml2db/core.py
Make rstxml2db/rstxml2db.xsl DocBook5 compatible
Adapt testcases, if needed
Adapt documentation/manpage

version attribute on root element is missing

Problem description

The root element is missing a @version attribute

Actual behaviour

No version attribute.

Expected behaviour

version attribute is there, yeah!

Missing template for doctest_block

Problem description

Some files create a definition list like this:

<definition_list>
   <definition_list_item>
      <term>Usage:</term>
      <definition>
         <doctest_block xml:space="preserve">&gt;&gt;&gt; @wip('Expected Error', expected_exception=Exception, bug="#000000")
&gt;&gt;&gt; def test():
&gt;&gt;&gt;     pass</doctest_block>
      </definition>
   </definition_list_item>
</definition_list>

Actual behaviour

The DocBook's <listitem/> is empty.

Expected behaviour

<variablelist>
  <varlistentry>
    <term>Usage:</term>
    <listitem>
        <screen>...</screen>
    </listitem>
  </varlistentry>
</variablelist>

support_matrix is not supported

@dmpop This is an exquisite bug that I've found. 😉

Problem description

The RST file contains the following structure (found in keystone, doc/source/admin/identity-tokens.rst):

.. support_matrix:: token-support-matrix.ini

The file token-support-matrix.ini is obviously a file in INI format with this content:

# Comments
[targets]
# List of driver implementations for which we are going to track the status of
# features. This list only covers drivers that are in tree. Out of tree
# drivers should maintain their own equivalent document, and merge it with this
# when their code merges into core.
driver-impl-uuid=UUID tokens
driver-impl-fernet=Fernet tokens

[operation.create_unscoped_token]
title=Create unscoped token
status=mandatory
notes=All token providers must be capable of issuing tokens without an explicit
  scope of authorization.
cli=openstack --os-username=<username> --os-user-domain-name=<domain>
  --os-password=<password> token issue
driver-impl-uuid=complete
driver-impl-fernet=complete

# etc...

Actual behaviour

After the first conversion from RST to XML, Sphinx tries to apply some special behavior and creates a table:

<subtitle>Summary</subtitle>
<table>
   <tgroup cols="4">
      <colspec colwidth="1"/>
      <colspec colwidth="1"/>
      <colspec colwidth="1"/>
      <colspec colwidth="1"/>
      <thead>
         <row>
            <entry>
               <emphasis>Feature</emphasis>
            </entry>
            <entry>
               <emphasis>Status</emphasis>
            </entry>
            <entry>
               <strong>Fernet tokens</strong>
            </entry>
            <entry>
               <strong>UUID tokens</strong>
            </entry>
         </row>
      </thead>
      <tbody>
         <row>
            <entry>
               <inline>
                  <reference refid="operation_create_unscoped_token">
                     <strong>Create unscoped token</strong>
                  </reference>
               </inline>
            </entry>
            <entry>
               <inline classes="sp_feature_mandatory">mandatory</inline>
            </entry>
            <entry>
               <inline>
                  <reference refid="operation_create_unscoped_token_fernet">
                     <literal classes="sp_impl_summary sp_impl_complete"
                        >✔</literal>
                  </reference>
               </inline>
            </entry>
            <entry>
               <inline>
                  <reference refid="operation_create_unscoped_token_uuid">
                     <literal classes="sp_impl_summary sp_impl_complete"
                        >✔</literal>
                  </reference>
               </inline>
            </entry>
         </row>
         <!-- more rows here... -->
      </tbody>
   </tgroup>
</table>

That table is not correctly converted to DocBook.

Wrap misplaced inline elements with paragraph

Problem description

Sometimes you have this RST XML structure:

<entry>
  <inline>...</inline>
</entry>
<!-- or -->
<list_item>
   <strong>...</strong>
</list_item>

Actual behaviour

See above

Expected behaviour

Wrap any misplaced inline elements (like strong, literal, etc.) into paragraph like this:

<entry>
  <paragraph><inline>...</inline></paragraph>
</entry>
<!-- or -->
<list_item>
   <paragraph><strong>...</strong></paragraph>
</list_item>
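
The wrapping could be sketched like this. The sets INLINES and CONTAINERS are assumptions, and the sketch only handles the simple case where all children of the container are inline elements.

```python
import xml.etree.ElementTree as ET

INLINES = {"inline", "strong", "emphasis", "literal", "reference"}  # assumed
CONTAINERS = {"entry", "list_item"}  # assumed


def wrap_inlines(root):
    """Wrap the inline-only content of entry/list_item in a paragraph."""
    # Copy the iterator into a list because the tree is modified below.
    for parent in list(root.iter()):
        if parent.tag not in CONTAINERS:
            continue
        children = list(parent)
        if children and all(child.tag in INLINES for child in children):
            para = ET.Element("paragraph")
            para.text, parent.text = parent.text, None
            for child in children:
                parent.remove(child)
                para.append(child)
            parent.append(para)
    return root


item = ET.fromstring("<list_item><strong>...</strong></list_item>")
print(ET.tostring(wrap_inlines(item), encoding="unicode"))
```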

Tag <emphasis> belongs to empty namespace

Problem description

When converting with rstxml2db, the following XML structure is transformed incorrectly:

<!-- RST XML -->
<paragraph>The <emphasis>quick</emphasis>... </paragraph>

<!-- DocBook -->
<para>The <emphasis xmlns="">quick</emphasis>... </para>

Expected behaviour

The tag emphasis should NOT contain xmlns.

Support manpage macro

Problem description

The :manpage: macro is not supported. Example RST file:

to change the values of the :manpage:`bind(2)` parameters:

This creates the following content in the RST XML file:

to change the values of the <manpage classes="manpage" page="bind" path="bind(2)" section="2" xml:space="preserve">bind(2)</manpage> parameters

Actual behaviour

Files containing this macro create this warning message:

[WARNING ] - rstxml2db.xml.process - Unknown element 'manpage'

Improve test coverage

Improve code stability

Glossary doesn't match

Problem description

In openstackdoc/horizon/glossary.xml, the glossary is not transformed correctly. The result is:

Actual behaviour

glossary
  +-- title
  +-- para
  +-- para
  [...]

Should contain the following structure:

<glossary xmlns="http://docbook.org/ns/docbook"
 xmlns:xlink="http://www.w3.org/1999/xlink" version="5.1">
 <title>Glossary</title>

 <glossentry xml:id="glos.horizon">
  <glossterm>Horizon</glossterm>
  <glossdef>
   <para>The ...</para>
  </glossdef>
 </glossentry>

 <glossentry xml:id="glos.dashboard">
  <glossterm>Dashboard</glossterm>
  <glossdef>
   <para>A Python class ...</para>
  </glossdef>
 </glossentry>
</glossary>

DeprecationWarning: invalid escape sequence \s

Problem description

pytest reports a deprecation warning:

tests/test_docstrings.py:80
  /tmp/toms/rstxml2docbook/tests/test_docstrings.py:80: DeprecationWarning: invalid escape sequence \s
    m = re.search(":param\s+\w*\s*%s:" % arg, doc)

tests/test_step_transform.py:18
  /tmp/toms/rstxml2docbook/tests/test_step_transform.py:18: DeprecationWarning: invalid escape sequence \ 
    </document>'''"""

Actual behaviour

See above

Expected behaviour

No DeprecationWarning reported.
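
Both warnings come from backslashes in ordinary string literals; a raw string (or a doubled backslash) avoids them. A sketch for the regular expression from test_docstrings.py follows. The wrapper function documents_param is hypothetical, and re.escape() is an addition that the original line does not have.

```python
import re


def documents_param(doc, arg):
    """Return True if the docstring contains a ':param ... <arg>:' field."""
    # The r"" prefix keeps \s and \w as regex escapes, not string escapes.
    pattern = r":param\s+\w*\s*%s:" % re.escape(arg)
    return re.search(pattern, doc) is not None


print(documents_param(":param str name: the name", "name"))
print(documents_param(":param int count: a number", "name"))
```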

Missing content?

Neutron, admin-configconfig-qos.xml file

Looks like some content is missing:

              <itemizedlist>
                <listitem/>
              </itemizedlist>

Take 'desc' into account in rstxml2db.xsl

Problem description

For some sections, the matching IDs are not found by DAPS. They are hidden in the desc element, so we have to take the desc elements into account while converting the RST XML to DocBook.

Actual behaviour

Some IDs/IDREFs connections are missing.

Proposed solution

New template rule to match desc and convert it to the respective DocBook element.

Example:

<desc desctype="attribute" domain="py" noindex="False" objtype="function">
    <desc_signature class="" first="False" fullname="urls" ids="horizon.urls" module="horizon" names="horizon.urls">
      <desc_addname xml:space="preserve">horizon.</desc_addname>
      <desc_name xml:space="preserve">urls</desc_name>
   </desc_signature>
   <desc_content>
       <paragraph>..:</paragraph>
       <literal_block xml:space="preserve">...</literal_block>
   </desc_content>
</desc>

We could transform it into something like this:

  <variablelist>
   <varlistentry xml:id="horizon.urls">
    <term><function>horizon.urls</function></term>
    <listitem>
     <para>The auto-generated URLconf for horizon. Usage:</para>
     <screen language="python">url(r'', include(horizon.urls)),</screen>
    </listitem>
   </varlistentry>
  </variablelist>

There seem to be several possible values for the objtype attribute of desc:

| `@objtype` | DocBook 5 Element |
| --- | --- |
| `function` | `<function>` |
| `method` | `<function role="method">` |
| `class` | `<property role="class">` (`<classname>` would be more appropriate, but not allowed in GeekoDoc) |
| `attribute` | `<property>` |
| `classmethod` | `<property role="classmethod">` |
| `staticmethod` | `<property role="staticmethod">` |
| n/a | `<literal>` |
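
Expressed as a lookup table, the mapping might look like the following sketch. OBJTYPE_MAP and docbook_tag are illustrative names; in the project, the mapping would more likely live in rstxml2db.xsl.

```python
# objtype -> (DocBook element name, role attribute or None)
OBJTYPE_MAP = {
    "function": ("function", None),
    "method": ("function", "method"),
    "class": ("property", "class"),
    "attribute": ("property", None),
    "classmethod": ("property", "classmethod"),
    "staticmethod": ("property", "staticmethod"),
}


def docbook_tag(objtype):
    """Return the DocBook start tag for a desc objtype; <literal> as fallback."""
    name, role = OBJTYPE_MAP.get(objtype, ("literal", None))
    if role:
        return '<%s role="%s">' % (name, role)
    return "<%s>" % name


print(docbook_tag("method"))
print(docbook_tag("exception"))
```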