fmalina / unilex-transcript

Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.

License: GNU General Public License v3.0

Python 6.28% HTML 93.72%

unilex-transcript's Introduction

PDF to semantic HTML conversion

Transcript contains Python programs whose job is to transcribe PDFs into semantic HTML.

transcript.py: Get semantic HTML from PDFs converted by pdf2htmlEX.
ttf.py: Recover lost text from PDFs where characters are nothing more than images of themselves.
pdf2html.py: Batch process a folder full of PDFs ready for transcript.py

Read the docstrings for more information.

Example

PDF before and semantic HTML after

Install

Install Python 3 along with the latest pdf2htmlEX, e.g. with Homebrew:

brew install python3 pdf2htmlEX

A Docker install of pdf2htmlEX is also supported (the Homebrew one has started failing lately). The image below is tested and is used in the default config via DOCKER_IMG_TAG.

docker pull pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64

Install lxml for Python 3 with pip3 install lxml, or just run the following to get freetype-py too.

pip3 install -r requirements.txt

Configure

Configure your project paths in your .env file and config.py, most importantly DATA_DIR. This can be any folder, for example DATA_DIR=/path/to/unilex-transcript/tests. If you use a Docker install of pdf2htmlEX, you'll also need to set DOCKER_INSTALL=1, which mounts your data dir to the Docker path. DOCKER_IMG_TAG is configurable too. Go ahead and create your .env file and add DATA_DIR=...

If you otherwise stick with the default configuration, your DATA_DIR should end up containing three folders: PDF, HTML and HTM. Create a 'PDF' folder inside it and drop your PDFs there.

PDF is a folder where your PDFs are.
HTML is where pdf2htmlEX output (non-semantic HTML) ends up after running ./pdf2html.py, which just runs pdf2htmlEX with suitable options.
HTM is the final destination where semantic HTML gets born after running ./transcript.py.
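
The folder layout described above can be prepared with a few lines of Python. This is only a minimal sketch: the helper name ensure_data_dirs is hypothetical and not part of the project, and it assumes the default folder names PDF, HTML and HTM.

```python
from pathlib import Path


def ensure_data_dirs(data_dir, names=("PDF", "HTML", "HTM")):
    """Create the working folders under data_dir if they are missing.

    ensure_data_dirs is a hypothetical helper, not a function from this
    repository; the folder names are the defaults named in the README.
    """
    root = Path(data_dir)
    created = []
    for name in names:
        folder = root / name
        # exist_ok=True makes the call safe to repeat on every run.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling ensure_data_dirs with the same path you set as DATA_DIR would leave an empty PDF folder ready for your input files.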

Run

./pdf2html.py

./transcript.py

When you change configuration within ./transcript.py or tweak some code, you only need to rerun ./transcript.py.

Development process

Set an expected (hand-adjusted) output to aim for, then improve the codebase to bring the transcript output closer to that ideal semantic output. Make sure your changes don't make the output worse for other tests. Use flake8.

Dual Licensing

Commercial license

If you want to use Transcript to develop and run commercial projects and applications, the Commercial license is the appropriate license. With this option, your source code is kept proprietary.

Once purchased, you will be granted a commercial BSD-style license and will be all set to use Transcript in your business.

Small Team License (£1200) Small Team License for up to 8 developers

Organization License (£3200) Commercial Organization License for Unlimited developers

Open source license

If you are creating an open source application under a license compatible with the GNU GPL license v3, you may use Transcript under the terms of the GPLv3.

unilex-transcript's People

destefani m8e dogweather paddoum mars-96 airtonix pjahad alosongngu

unilex-transcript's Issues

Two issues: _ file selection, manual creation of PDF, HTML, and HTM directories

Issue 1: In the last line of pdf2html.py and transcript.py you select files based on "_", yet it's unclear (and unmentioned anywhere) why filenames must have the format _. With this setting, pdf2html.py and transcript.py were not finding all PDF files in the PDF directory. When I replaced "_" with simply "*", it did find the PDF files.
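
The effect described in Issue 1 can be reproduced in isolation. This sketch assumes the scripts use a shell-style wildcard that requires an underscore (written here as "*_*.pdf"); the exact pattern in pdf2html.py and transcript.py is not shown in the issue, and the helper name select is hypothetical.

```python
from fnmatch import fnmatch


def select(filenames, pattern):
    """Return the names matching a shell-style wildcard pattern."""
    return [name for name in filenames if fnmatch(name, pattern)]


names = ["report.pdf", "s5_100006.pdf", "vol2a_2016_test1.pdf"]

# Assumed underscore pattern: names without "_" are silently skipped.
print(select(names, "*_*.pdf"))

# Plain wildcard: every PDF is picked up.
print(select(names, "*.pdf"))
```

A file such as report.pdf is dropped by the first pattern without any warning, which matches the behaviour reported.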

Issue 2: I may have missed it, but it might help if the README clearly specified that the PDF, HTML and HTM directories have to be created manually by the user. It would seem better behavior, though, for the script to create the HTML and HTM directories automatically if they don't exist, since the script is the only one that writes to those directories.

How to use it with the pdf2htmlEX Debian package

How do I use this if I already have pdf2htmlEX set up and running from the Debian package on Ubuntu 22.04? Please let me know. I am already converting PDF to HTML with pdf2htmlEX and now want to semanticize the HTML file, since it has multiple span and div tags that should be eliminated, and tables aren't generated with tr and td tags after my conversion.

Internal Error: assembly tables at wrong place

@coolwanglu I'm working in a Linux environment (16.04 LTS) and I compiled my pdf2htmlEX from this site: https://github.com/fmalina/transcript
and ran this command: brew install python3 pdf2htmlEX, but when I run my file it keeps saying
No glyph for the key character to derive standard width and height.
For the latin script, this key character is `o' (U+006F).
Lookup 'mark' Mark Positioning lookup 7 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Internal Error: assembly tables at wrong place
Is this a FontForge error, and how can I resolve it?

Originally posted by @AroojPerzada in coolwanglu/pdf2htmlEX#333 (comment)


UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3268: ordinal not in range(128)

PDF file: s5_100006_2008_06_09__1.pdf

run in docker

#Dockerfile to build a pdf2htmlEx image

FROM ubuntu:15.10

#[wanghs@db2 debian]$ docker build -t dc/pdf2htmlex .
#[wanghs@db2 debian]$ docker run --rm --name pdf2htmlex-programming-demo -it -v $(pwd):/tmp dc/pdf2htmlex /bin/bash




ADD sources.list /etc/apt/sources.list

#
#Install git and all dependencies
#
RUN  apt-get -y update &&  apt-get install -qq git cmake autotools-dev libjpeg-dev libtiff5-dev libpng12-dev libgif-dev libxt-dev autoconf automake libtool bzip2 libxml2-dev libuninameslist-dev libspiro-dev libpango1.0-dev libcairo2-dev chrpath uuid-dev uthash-dev    python3.4   python3.4-dev python3-pip  libfreetype6 libqtcore4 libqtgui4 ttfautohint poppler-data libjpeg-dev 


#
#Clone the pdf2htmlEX fork of fontforge
#compile it
#
RUN git clone https://github.com/coolwanglu/fontforge.git fontforge.git
RUN cd fontforge.git && git checkout pdf2htmlEX && ./autogen.sh && ./configure && make V=1 &&  make install

#
#Install poppler utils
#
RUN  apt-get install -qq libpoppler-glib-dev libpoppler-private-dev

#
#Clone and install the pdf2htmlEX git repo



RUN  \
    cd /tmp \ 
    && apt-get install wget \
    && wget https://github.com/coolwanglu/pdf2htmlEX/archive/v0.14.6.tar.gz \
    && tar xvf v0.14.6.tar.gz \
    && cd pdf2htmlEX* 
RUN cd  /tmp/pdf2htmlEX*   &&  cmake . && make &&  make install



# install  transcript  
RUN apt-get install -y libxml2-dev libxslt1-dev

RUN  \
     cd /tmp \ 
     && git clone  https://github.com/fmalina/transcript \
    && cd transcript \
#    && pip3 install  -i https://pypi.tuna.tsinghua.edu.cn/simple lxml cssselect freetype-py
    && pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple  -r requirements.txt 

VOLUME /pdf
WORKDIR /pdf

CMD ["pdf2htmlEX"]

The error looks like this:

root@85f8bef09062:/tmp/transcript# ./pdf2html.py
Preprocessing: 11/11
Working: 11/11

[None]
root@85f8bef09062:/tmp/transcript# ./transcript.py
Loading fonts from: ../ttf
FangSong Regular 28562 glyphs
/tmp/transcript/test-data/HTML/s5_100006_2008_06_09__1/s5_100006_2008_06_09__1.html
'ascii' codec can't decode byte 0xee in position 3046: ordinal not in range(128)
Traceback (most recent call last):
  File "./transcript.py", line 362, in batch_process
    semanticize(path)
  File "./transcript.py", line 251, in semanticize
    dom, dimensions = prepare(doc_path)
  File "./transcript.py", line 209, in prepare
    doc = s = open(doc_path).read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 3046: ordinal not in range(128)

/tmp/transcript/test-data/HTML/3.x/3.x.html
'ascii' codec can't decode byte 0xe5 in position 817: ordinal not in range(128)
Traceback (most recent call last):
  File "./transcript.py", line 362, in batch_process
    semanticize(path)
  File "./transcript.py", line 251, in semanticize
    dom, dimensions = prepare(doc_path)
  File "./transcript.py", line 209, in prepare
    doc = s = open(doc_path).read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 817: ordinal not in range(128)

/tmp/transcript/test-data/HTML/1/1.html
'ascii' codec can't decode byte 0xe5 in position 815: ordinal not in range(128)
Traceback (most recent call last):
  File "./transcript.py", line 362, in batch_process
    semanticize(path)
  File "./transcript.py", line 251, in semanticize
    dom, dimensions = prepare(doc_path)
  File "./transcript.py", line 209, in prepare
    doc = s = open(doc_path).read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 815: ordinal not in range(128)

/tmp/transcript/test-data/HTML/2/2.html
'ascii' codec can't decode byte 0xe2 in position 3268: ordinal not in range(128)
Traceback (most recent call last):
  File "./transcript.py", line 362, in batch_process
    semanticize(path)
  File "./transcript.py", line 251, in semanticize
    dom, dimensions = prepare(doc_path)
  File "./transcript.py", line 209, in prepare
    doc = s = open(doc_path).read()
  File "/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3268: ordinal not in range(128)

root@85f8bef09062:/tmp/transcript#
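
The tracebacks above all end in open(doc_path).read(). When no encoding is passed, Python 3 falls back to the locale's preferred encoding, which is ASCII in a container with no locale configured. A minimal sketch of reading the file with an explicit encoding follows; read_html is a hypothetical helper name, and it assumes the pdf2htmlEX output is UTF-8.

```python
def read_html(path, encoding="utf-8"):
    """Read an HTML file with an explicit encoding.

    Passing encoding avoids depending on the container's locale.
    UTF-8 is an assumption about the pdf2htmlEX output.
    """
    with open(path, encoding=encoding) as handle:
        return handle.read()
```

Setting a UTF-8 locale in the image would be an alternative, but the explicit argument keeps the script independent of the environment it runs in.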

Cannot get any output.

Trying to configure config.py without success.

My PDF was generated by lualatex, and I can convert it to HTML using pdf2htmlEX.exe, so I'd assume my PDF is not the issue. However, I can't get this repo to work. I've tried converting other PDFs and the outcome is the same as described here.

I've configured the config.py like so
DATA_DIR = "/Users/Matteo Wakalaka/font-tools/transcript-data"
PDF_DIR = DATA_DIR+'PDF'
HTML_DIR = DATA_DIR+'HTML'
FULL_FONTS_PATH = "/Users/Matteo Wakalaka/font-tools/otf"

I'm on Windows, so I can't use brew to install pdf2htmlEX. I installed it from the build linked from the pdf2htmlEX download page, and added it and ttfautohint to the path in case they are needed for transcript.py or pdf2html.py. (?)

When I try to execute, I get this output in the console, and empty folders transcript-dataHTML and transcript-dataHTM are generated.
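
The folder names transcript-dataHTML and transcript-dataHTM are consistent with the configuration shown above, where DATA_DIR has no trailing separator and the subfolder name is appended with +. The following sketch contrasts that with os.path.join; it only illustrates the likely cause and is not the project's actual config.py.

```python
import os.path

DATA_DIR = "/Users/Matteo Wakalaka/font-tools/transcript-data"

# Plain concatenation glues the folder name onto the last path component.
joined_by_plus = DATA_DIR + "HTML"

# os.path.join inserts the separator, whether or not DATA_DIR ends with one.
joined_by_path = os.path.join(DATA_DIR, "HTML")

print(os.path.basename(joined_by_plus))
print(os.path.basename(joined_by_path))
```

Ending DATA_DIR with a slash would also work with the concatenation style, assuming the rest of the configuration expects that form.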

How do you use this?

I am trying to implement your script on my converted html pages, but I don't understand how to.

Do I simply place transcript.py in the directory with the converted html and run 'python transcript.py'?

Also when I run it I get errors telling me:

File "transcript.py", line 25, in <module>
from ttf import pua_content, recover_text
File "/Users/hartley9/Desktop/transcript-master/ttf.py", line 47
SyntaxError: Non-ASCII character '\xe2' in file /Users/hartley9/Desktop/transcript-master/ttf.py on line 47, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

I have followed many tutorials explaining that I need to install lxml.html. I've installed it, but it still doesn't work.

Sorry I realize these aren't very constructive issues, and maybe you don't find them worth helping me with, but any help is very much appreciated!
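
The "Non-ASCII character ... but no encoding declared" SyntaxError is raised only by Python 2, which suggests the script was started with python rather than python3. A small guard like the one below, placed before the other imports, would turn that into a clearer message. It is a sketch: require_python3 is a hypothetical name and the project contains no such check as far as this page shows.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Exit with a readable message when run under Python 2.

    Hypothetical guard, not part of transcript.py.
    """
    if version_info[0] < 3:
        raise SystemExit("Python 3 is required; run with python3 instead of python")
```

The guard has to run before ttf.py is imported, because Python 2 rejects that file at parse time.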

output files are never created if using pdf2htmlex via docker

I can't install pdf2htmlEX on Fedora, so I tried using the Docker approach.

It won't create output files for some reason.

I looked at how you're launching the image and noticed you're not aligning the uid/gid in the container with the uid/gid on the host.
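
One way to align the ids is to pass the host's uid and gid to docker run through its --user option, so files written to the mounted folder belong to the calling user. The sketch below only builds the argument list. The function name docker_command and the /pdf mount point are assumptions, and how pdf2html.py actually launches the container is not shown on this page.

```python
import os

# Image tag taken from the install instructions above.
DOCKER_IMG_TAG = "pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64"


def docker_command(data_dir, extra_args=(), uid=None, gid=None):
    """Build a docker run argument list that runs as the host user.

    Hypothetical helper; /pdf as the container mount point is an assumption.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run", "--rm",
        "--user", "{}:{}".format(uid, gid),
        "-v", "{}:/pdf".format(data_dir),
        DOCKER_IMG_TAG,
    ] + list(extra_args)
```

The resulting list can be handed to subprocess.run together with whatever pdf2htmlEX arguments the batch script normally passes.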

Requires: pip install freetype-py - even though installed

$ ./transcript.py
Requires: pip install freetype-py
Loading fonts from: /Users/f/SITES/etc/ttf
$ pip install freetype-py
Requirement already satisfied (use --upgrade to upgrade): freetype-py in /usr/local/lib/python2.7/site-packages
$ pip install freetype-py --upgrade
Requirement already up-to-date: freetype-py in /usr/local/lib/python2.7/site-packages

Any idea what could be going wrong here?

I'm using Mac OSX 10.11.5. I followed the install instructions in the README.
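
The pip output above points at a python2.7 site-packages folder, while the scripts need Python 3, so the package may simply be installed for a different interpreter. The sketch below checks which modules the running interpreter can see. missing_modules is a hypothetical helper; freetype is the import name of freetype-py.

```python
import importlib.util
import sys


def missing_modules(names=("lxml", "freetype")):
    """Return the module names the current interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Missing:", missing_modules())
```

Installing with python3 -m pip install freetype-py ties the install to the interpreter that will run the script.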

Improve table recognition (renamed)

It cannot create html tables from tables in the pdf and totally skips 'em, including the text. Not very useful.

Tested on Intel manual (see: Attached file)

vol2a_2016_test1.pdf

Example links showing not available

Recover text in semanticize

When reading the source code I came across this weird couple of lines in transcript.py:

    # recover text from embedded fonts with bad CMAPS if > 50% of characters are unicode PUA
    recover = pua_content(dom.text_content()) > 0.5
    if recover:
        print('Recovery needed, not now.')
        return
        recover_text(dom, os.path.dirname(doc_path))

There is a call to recover text there, but it is unreachable since there is a return statement right before it. What is the intended behaviour here? If "not now", when?
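
For readers unfamiliar with the check in the quoted comment, the share of Private Use Area characters can be computed as below. This is a hypothetical reimplementation for illustration only, not the project's pua_content; it counts only the Basic Multilingual Plane range U+E000 to U+F8FF.

```python
def pua_fraction(text):
    """Return the fraction of characters in the BMP Private Use Area.

    Illustrative stand-in for pua_content; the real function may count
    characters differently.
    """
    if not text:
        return 0.0
    pua = sum(1 for char in text if "\ue000" <= char <= "\uf8ff")
    return pua / len(text)
```

A value above 0.5 would correspond to the "> 50% of characters" condition in the comment.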

Running pdf2html.py

Hello,

Wanted to thank you for creating this, for I have a similar need where I need to batch convert PDFs to HTML. Having never worked in Python before, I have followed your steps in getting this all set up. When I run ./pdf2html.py, however, my terminal screen only pops up with [ ].

I am confident this is user error, and will keep tinkering around, but any suggestions would be very appreciated.

No glyph for the key character to derive standard width and height.

I'm working with Ubuntu 16.04 LTS. I ran the two commands from your README, i.e. brew install python3 pdf2htmlEX and pip3 install -r requirements.txt, and then configured Data_Path as well, but when I run my files, i.e. ./pdf2html.py and ./transcript.py, it keeps on saying
No glyph for the key character to derive standard width and height.
For the latin script, this key character is `o' (U+006F).
No glyph for the key character to derive standard width and height.
For the latin script, this key character is `o' (U+006F).
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
No glyph for the key character to derive standard width and height.
For the latin script, this key character is `o' (U+006F).
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
No glyph for the key character to derive standard width and height.
For the latin script, this key character is `o' (U+006F).
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning lookup 8 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
No glyph for the key character to derive standard width and height.
For the latin script, this key character is `o' (U+006F).
Lookup 'mark' Mark Positioning lookup 7 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Internal Error: assembly tables at wrong place
Working: 2/2
[None]

Is this a FontForge error, and how can I resolve it? Afterwards my HTML folder does contain the HTML converted from my PDFs, but without images, and it doesn't parse my HTML: my HTM folder remains empty.

my pdf2htmlEX version 0.14.6
Libraries:
poppler 0.41.0
libfontforge 20120731
Default data-dir: /usr/share/pdf2htmlEX
Supported image format: png jpg
where am I wrong?

cssselect does not seem to be installed

First of all, thanks for this complement to the pdf2htmlEX tool. It is really cool, and I've been playing a lot with the configs!

I wasn't sure about opening the issue, but maybe in the future somebody else can have this issue, so I think it would be a good thing to point out.

On my first try, I just followed the README for all the installation and configuration. After that, I tried to execute the scripts, but I hit an issue while executing transcript.py. I didn't save the traceback logs, but here is something related to this issue. Although there is a question on StackOverflow (not closely related), it is not hard to solve: just installing cssselect with pip was enough to make the script work. I was thinking of sending a pull request for this by modifying the README and the requirements, but I'm not sure about the way of contributing to the project.

Also, the lxml documentation (updated on March 17th) mentions that lxml.cssselect relies on an external module, so that may be related to this as well.