
ScienceBeam Parser

⚠️ Under new stewardship

eLife have handed over stewardship of ScienceBeam to The Coko Foundation. You can now find the updated code repository at https://gitlab.coko.foundation/sciencebeam/sciencebeam-parser and continue the conversation on Coko's Mattermost chat server: https://mattermost.coko.foundation/

For more information on why we're doing this read our latest update on our new technology direction: https://elifesciences.org/inside-elife/daf1b699/elife-latest-announcing-a-new-technology-direction

Overview


ScienceBeam Parser allows you to parse scientific documents. It started out as a partial Python variation of GROBID and allows you to re-use some of its models; however, it may deviate more in the future.

Pre-requisites

Docker containers are provided and can be used on multiple operating systems. They also serve as an example setup for Linux / Ubuntu based systems.

Otherwise, the following paragraphs list some of the prerequisites when not using Docker:

ScienceBeam Parser currently only supports Linux, due to the binaries used (pdfalto, wapiti). It may work on other platforms without Docker, provided matching binaries are configured.

For Computer Vision, PyTorch is required.

For OCR, tesseract needs to be installed. On Ubuntu the following command can be used:

apt-get install libtesseract4 tesseract-ocr-eng libtesseract-dev libleptonica-dev

The Word* to PDF conversion requires LibreOffice.

Development

Create Virtual Environment and install Dependencies

make dev-venv

Configuration

There is no implicit "grobid-home" directory. The only configuration file is the default config.yml.

Paths may point to local or remote files. Remote files are downloaded and cached locally (URLs are assumed to be versioned).

You may override config values using environment variables. Environment variables should start with SCIENCEBEAM_PARSER__. After that __ is used as a section separator. For example SCIENCEBEAM_PARSER__LOGGING__HANDLERS__LOG_FILE__LEVEL would override logging.handlers.log_file.level.
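The override logic can be sketched as follows (a hypothetical helper for illustration, not the actual ScienceBeam Parser implementation):

```python
# Sketch: apply a SCIENCEBEAM_PARSER__ environment variable to a nested
# config dict, treating "__" as the section separator.
# Hypothetical helper for illustration, not the actual implementation.
PREFIX = "SCIENCEBEAM_PARSER__"

def apply_env_override(config: dict, env_name: str, value: str) -> None:
    assert env_name.startswith(PREFIX)
    # "LOGGING__HANDLERS__LOG_FILE__LEVEL" -> ["logging", "handlers", "log_file", "level"]
    keys = [part.lower() for part in env_name[len(PREFIX):].split("__")]
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

config = {"logging": {"handlers": {"log_file": {"level": "INFO"}}}}
apply_env_override(
    config,
    "SCIENCEBEAM_PARSER__LOGGING__HANDLERS__LOG_FILE__LEVEL",
    "DEBUG"
)
# config["logging"]["handlers"]["log_file"]["level"] is now "DEBUG"
```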

Resources and models are generally loaded on demand, depending on the preload_on_startup configuration option (SCIENCEBEAM_PARSER__PRELOAD_ON_STARTUP environment variable). Setting that option to true loads the models "eagerly" at startup.

Run tests (linting, pytest, etc.)

make dev-test

Start the server

make dev-start

Run the server in debug mode (including auto-reload and debug logging):

make dev-debug

Run the server with auto reload but no debug logging:

make dev-start-no-debug-logging-auto-reload

Submit a sample document to the server

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/pdfalto"

Submit a sample document to the header model

The following output formats are supported:

  • raw_data: generated data (without using the model)
  • data: generated data with predicted labels
  • xml: simple XML elements for the predicted labels
  • json: JSON of the prediction

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/models/header?first_page=1&last_page=1&output_format=xml"

Submit a sample document to the name-header api

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/models/name-header?first_page=1&last_page=1&output_format=xml"

GROBID compatible APIs

The following APIs aim to be compatible with selected endpoints of GROBID's REST API, for common use-cases.

Submit a sample document to the header document api

The /processHeaderDocument endpoint is similar to /processFulltextDocument, but the response will only contain front matter. It still uses the same segmentation model, but it won't need to run a number of the other models.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processHeaderDocument?first_page=1&last_page=1"

The default response will be TEI XML (application/tei+xml). The Accept HTTP request header may be used to request JATS, with the mime type application/vnd.jats+xml.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processHeaderDocument?first_page=1&last_page=1"

Regardless, the returned content type will be application/xml.

(BibTeX output is currently not supported)

Submit a sample document to the full text document api

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextDocument?first_page=1&last_page=1"

The default response will be TEI XML (application/tei+xml). The Accept HTTP request header may be used to request JATS, with the mime type application/vnd.jats+xml.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextDocument?first_page=1&last_page=1"

Regardless, the returned content type will be application/xml.

Submit a sample document to the references api

The /processReferences endpoint is similar to /processFulltextDocument, but the response will only contain references. It still uses the same segmentation model, but it won't need to run a number of the other models.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processReferences?first_page=1&last_page=100"

The default response will be TEI XML (application/tei+xml). The Accept HTTP request header may be used to request JATS, with the mime type application/vnd.jats+xml.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processReferences?first_page=1&last_page=100"

Regardless, the returned content type will be application/xml.

Submit a sample document to the full text asset document api

The /processFulltextAssetDocument endpoint is like /processFulltextDocument, but instead of returning the TEI XML directly, it returns a ZIP containing the TEI XML document along with other assets, such as figure images.

curl --fail --show-error \
    --output "example-tei-xml-and-assets.zip" \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextAssetDocument?first_page=1&last_page=1"

The default response will be a ZIP containing TEI XML (application/tei+xml+zip). The Accept HTTP request header may be used to request a ZIP containing JATS, with the mime type application/vnd.jats+xml+zip.

curl --fail --show-error \
    --header 'Accept: application/vnd.jats+xml+zip' \
    --output "example-jats-xml-and-assets.zip" \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/processFulltextAssetDocument?first_page=1&last_page=1"

Regardless, the returned content type will be application/zip.

Submit a sample document to the /convert api

The /convert API aims to be a single endpoint for converting PDF documents to a semantic representation. By default it will return JATS XML.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/convert?first_page=1&last_page=1"

The following sections describe parameters that influence the response:

Using the Accept HTTP header parameter

The Accept HTTP header may be used to request a different response type, e.g. application/tei+xml for TEI XML.

curl --fail --show-error \
    --header 'Accept: application/tei+xml' \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/convert?first_page=1&last_page=1"

Regardless, the returned content type will be application/xml.

The /convert endpoint can also be used for a Word* to PDF conversion by specifying application/pdf as the desired response:

curl --fail --show-error --silent \
    --header 'Accept: application/pdf' \
    --form "file=@test-data/minimal-office-open.docx;filename=test-data/minimal-office-open.docx" \
    --output "example.pdf" \
    "http://localhost:8080/api/convert?first_page=1&last_page=1"

Using the includes request parameter

The includes request parameter may be used to specify the requested fields, in order to reduce processing time, e.g. title,abstract to request only the title and the abstract. In that case fewer models will be used. The output may still contain more fields than requested.

curl --fail --show-error \
    --form "file=@test-data/minimal-example.pdf;filename=test-data/minimal-example.pdf" \
    --silent "http://localhost:8080/api/convert?includes=title,abstract"

The currently supported fields are:

  • title
  • abstract
  • authors
  • affiliations
  • references

Passing in any other value (or no value) will behave as if no includes parameter was passed in.
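A sketch of how such an includes parameter might be interpreted (the field names are from the list above; the helper function is hypothetical, not the server's actual code):

```python
# Sketch: interpret the includes request parameter. Unsupported or empty
# values fall back to "no includes", i.e. all fields are produced.
# Hypothetical helper, not the actual server code.
SUPPORTED_FIELDS = {"title", "abstract", "authors", "affiliations", "references"}

def parse_includes(includes_param):
    requested = {
        value.strip()
        for value in (includes_param or "").split(",")
        if value.strip()
    }
    # keep only supported fields; an empty result means "no includes"
    return (requested & SUPPORTED_FIELDS) or None
```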

Word* support

All of the above APIs will also accept a Word* document instead of a PDF.

Formats that are supported:

  • .docx (media type: application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  • .dotx (media type: application/vnd.openxmlformats-officedocument.wordprocessingml.template)
  • .doc (media type: application/msword)
  • .rtf (media type: application/rtf)

The support is currently implemented by converting the document to PDF using LibreOffice.

Where no content type is provided, the content type is inferred from the file extension.

For example:

curl --fail --show-error \
    --form "file=@test-data/minimal-office-open.docx;filename=test-data/minimal-office-open.docx" \
    --silent "http://localhost:8080/api/convert?first_page=1&last_page=1"
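The inference from the file extension can be sketched with Python's standard mimetypes module (illustrative only; the server's actual lookup may differ):

```python
# Sketch: infer the media type from the file extension when no explicit
# content type is provided. Illustrative only.
import mimetypes
from typing import Optional

# ensure the .docx media type listed above is registered
mimetypes.add_type(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".docx"
)

def infer_media_type(filename: str, provided: Optional[str] = None) -> Optional[str]:
    if provided:
        return provided  # an explicit content type always wins
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type
```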

Docker Usage

docker pull elifesciences/sciencebeam-parser
docker run --rm \
    -p 8070:8070 \
    elifesciences/sciencebeam-parser

Note: Docker images with the tag suffix -cv include the dependencies required for the CV (Computer Vision) models (disabled by default).

docker run --rm \
    -p 8070:8070 \
    --env SCIENCEBEAM_PARSER__PROCESSORS__FULLTEXT__USE_CV_MODEL=true \
    --env SCIENCEBEAM_PARSER__PROCESSORS__FULLTEXT__USE_OCR_MODEL=true \
    elifesciences/sciencebeam-parser:latest-cv

Non-release builds are available with the _unstable image suffix, e.g. elifesciences/sciencebeam-parser_unstable.

sciencebeam-parser's People

Contributors

  • de-code
  • dependabot[bot]


sciencebeam-parser's Issues

Untangle dependencies (e.g. create and release sciencebeam-utils project)

@seanwiseman I think I had better tackle this sooner. It would be good if you could help me with it.

I am planning to move the code that I am using in sciencebeam and sciencebeam-judge out of sciencebeam_gym, into a new project, sciencebeam-utils.

That is mainly:

  • beam_utils
  • utils
  • and some methods from preprocess.preprocessing_utils which I will move to utils.file_path:
    • join_if_relative_path (already there actually)
    • get_output_file
    • change_ext

There is some deprecated code in sciencebeam which uses other functionality I don't want to keep; I think I will just disable the corresponding tests, either generally or when sciencebeam_gym is not available.

I would also move alignment, which I currently don't use in sciencebeam-judge but am planning to. It might make sense to use a separate project, sciencebeam-alignment (it uses Cython).

Additionally I am also planning to move the following tools (with main):

  • preprocess.find_file_pairs
  • preprocess.get_output_files
  • preprocess.check_file_list
  • preprocess.split_csv_dataset

Should they move to sciencebeam_utils.tools? Or is it better to have them at the root of sciencebeam_utils?

Test

Description

Blah

Subtasks

  • Task 1
  • Task 2
  • ...

Definition of Done

  • Item 1
  • Item 2
  • ...

Difference between GROBID and Sciencebeam-parser

First of all, sorry for opening a bug; I would like to ask some general questions about the Python GROBID and sciencebeam-parser.

Currently, sciencebeam-parser only has GROBID integration; are there any differences in the structure extraction between sciencebeam-parser and GROBID?

Is there any possibility of integrating a table/figure extraction tool into sciencebeam-parser so that it can be improved further?

I am performing a benchmark of various information extraction tools, and I would like to include sciencebeam-parser in my work if it performs differently than GROBID.

Thanks.

API: specify fields to include

API: specify fields to include (and optimise processing accordingly)

i.e. one could add ?include=title,authors,abstract to the URL. It could then process the header only (GROBID would then only process the first three pages)

Add resume flag to beam pipeline runner

If the --resume flag is set, it should only convert the files for which the corresponding output file doesn't exist yet. That check should be performed after applying the limit, i.e. if the limit is 10 and all 10 output files already exist, no items will be processed.
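The described behaviour could be sketched like this (hypothetical helper names; the actual pipeline code may look different):

```python
# Sketch of the proposed --resume behaviour: apply the limit first, then
# skip items whose corresponding output file already exists.
# Hypothetical helper, not the actual pipeline code.
import os

def select_items_to_process(input_paths, get_output_path, limit=None, resume=False):
    items = list(input_paths)[:limit] if limit is not None else list(input_paths)
    if resume:
        # only keep items whose output doesn't exist yet
        items = [p for p in items if not os.path.exists(get_output_path(p))]
    return items
```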

Release versions for ScienceBeam docker images

It seems to make sense to use release versions now that we encourage xPub to deploy their own ScienceBeam containers.

The versions could be managed like the releases of, say, sciencebeam-utils: if the version field from the package does not correspond to a git tag, perform the release; in our case, push a Docker image with that version.

Based on the discussions around the libero project, we could scope commit-versioned containers with a suffix like _unstable. A bit later, we could prune old commit versions from the current container image name.

In the future, the code could be released on PyPI as well, but it doesn't seem useful at present.

/cc @giorgiosironi

share sciencebeam-orchester

Share the sciencebeam-orchester project, which makes it easier to run conversions and evaluations across multiple datasets and tools.

processFulltextAssetDocument is not working while processFulltextDocument works fine

I am trying to convert a PDF to JATS XML with assets in a ZIP.

A 500 Internal Server Error is thrown by the latest sciencebeam Docker image when we try to process PDFs via the processFulltextAssetDocument API.

I am using this command to run my Docker container:
docker run --rm -p 8070:8070 elifesciences/sciencebeam-parser
I have also tried it with Computer Vision via this command:
docker run --rm -p 8070:8070 --env SCIENCEBEAM_PARSER__PROCESSORS__FULLTEXT__USE_CV_MODEL=true --env SCIENCEBEAM_PARSER__PROCESSORS__FULLTEXT__USE_OCR_MODEL=true elifesciences/sciencebeam-parser:latest-cv

Note:
processFulltextDocument works fine, and the convert API also works fine.
I have also already tried it with multiple sample PDFs, e.g. https://www.apa.org/pubs/journals/features/edu-edu0000214.pdf

I am sending this command via cURL:
curl --fail --show-error --header 'Accept: application/vnd.jats+xml+zip' --output "test.zip" --form "file=@/Users/mymacbookusername/Downloads/sample.pdf;filename=sample.pdf" --silent "http://localhost:8070/api/processFulltextAssetDocument?first_page=2"

Here is the traceback:
[2021-11-23 10:29:35,329] INFO in sciencebeam_parser.document.tei_document:164: generating tei document done, took: front: 0.008415s, body: 0.617844s, back: 0.455715s, total: 1.081975s
[2021-11-23 10:29:35,336] INFO in sciencebeam_parser.app.parser:286: tei to jats, took=0.007s
[2021-11-23 10:29:35,337] INFO in sciencebeam_parser.app.parser:297: serializing xml, took=0.001s
[2021-11-23 10:29:35,347] ERROR in server:1458: Exception on /api/processFulltextAssetDocument [POST]
Traceback (most recent call last):
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/venv/lib/python3.7/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/opt/sciencebeam_parser/sciencebeam_parser/service/blueprints/api.py", line 760, in process_pdf_to_tei_assets_zip
fulltext_processor_config=self.fulltext_processor_config
File "/opt/sciencebeam_parser/sciencebeam_parser/service/blueprints/api.py", line 701, in _process_pdf_to_response_media_type
response_media_type
File "/opt/sciencebeam_parser/sciencebeam_parser/app/parser.py", line 496, in get_local_file_for_response_media_type
response_media_type
File "/opt/sciencebeam_parser/sciencebeam_parser/app/parser.py", line 393, in get_local_file_for_response_media_type
response_media_type
File "/opt/sciencebeam_parser/sciencebeam_parser/app/parser.py", line 331, in get_local_file_for_response_media_type
relative_xml_filename=relative_xml_filename
File "/opt/sciencebeam_parser/sciencebeam_parser/app/parser.py", line 114, in create_asset_zip_for_semantic_document
assert semantic_graphic.relative_path
AssertionError
[2021-11-23 10:29:35,348] INFO in werkzeug:225: 172.17.0.1 - - [23/Nov/2021 10:29:35] "POST /api/processFulltextAssetDocument?first_page=2 HTTP/1.1" 500 -

500 error when converting PDF

Steps to reproduce

  • Delete all Docker containers docker rm $(docker ps -aq)
  • Delete all Docker images docker rmi $(docker images -q)
  • Clone repo git clone https://github.com/elifesciences/sciencebeam.git
  • Start containers docker-compose up
  • Download sample PDF curl "https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMzI2NzEvZWxpZmUtMzI2NzEtdjIucGRm/elife-32671-v2.pdf?_hash=nrG1HRdFl4DZPdYxrP0OOJfOcyNJrkWHhR5HiBe0O4M%3D" > elife-32671-v2.pdf
  • Send PDF to ScienceBeam curl -XPOST --data-binary @elife-32671-v2.pdf -v -H "content-type: application/octet-stream" http://localhost:8075/api/convert\?filename=elife-32671-v2.pdf

Expected

Returns JATS XML

Actual

Sometimes it returns the JATS, but sometimes it returns an HTTP 500 error. Usually the first request after starting the container succeeds and the second one fails, but this is not entirely consistent.

ScienceBeam logs for failed request
sciencebeam_1  | DEBUG:sciencebeam.server.blueprints.api:processing file: elife-32671-v2.pdf (6429323 bytes, type "application/pdf")
sciencebeam_1  | DEBUG:sciencebeam.pipeline_runners.simple_pipeline_runner:skipping step (type "application/pdf" not supported): DOC to PDF
sciencebeam_1  | DEBUG:sciencebeam.pipeline_runners.simple_pipeline_runner:executing step (with type "application/pdf"): Convert to TEI
sciencebeam_1  | INFO:sciencebeam.transformers.grobid_service:processing: elife-32671-v2.pdf (6429323) - http://grobid:8070/api/processFulltextDocument
sciencebeam_1  | DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): grobid
grobid_1       | INFO  [2018-05-16 14:10:15,878] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
grobid_1       | ERROR [2018-05-16 14:10:17,516] org.grobid.core.process.ProcessPdf2Xml: pdftoxml process finished with error code: 137. [/opt/grobid/grobid-home/pdf2xml/lin-64/pdftoxml_server, -blocks, -noImageInline, -fullFontName, -noImage, -annotation, /opt/grobid/grobid-home/tmp/origin2492405599912754936.pdf, /opt/grobid/grobid-home/tmp/JQ3qaXQRd7.lxml]
grobid_1       | ERROR [2018-05-16 14:10:17,516] org.grobid.core.process.ProcessPdf2Xml: pdftoxml return message:
grobid_1       |
grobid_1       | 172.18.0.3 - - [16/May/2018:14:10:17 +0000] "POST /api/processFulltextDocument HTTP/1.1" 400 208 "-" "python-requests/2.18.4" 1796
sciencebeam_1  | DEBUG:urllib3.connectionpool:http://grobid:8070 "POST /api/processFulltextDocument HTTP/1.1" 400 208
sciencebeam_1  | [2018-05-16 14:10:17,572] ERROR in app: Exception on /api/convert [POST]
sciencebeam_1  | Traceback (most recent call last):
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
sciencebeam_1  |     response = self.full_dispatch_request()
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1614, in full_dispatch_request
sciencebeam_1  |     rv = self.handle_user_exception(e)
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask_cors/extension.py", line 161, in wrapped_function
sciencebeam_1  |     return cors_after_request(app.make_response(f(*args, **kwargs)))
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1517, in handle_user_exception
sciencebeam_1  |     reraise(exc_type, exc_value, tb)
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
sciencebeam_1  |     rv = self.dispatch_request()
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
sciencebeam_1  |     return self.view_functions[rule.endpoint](**req.view_args)
sciencebeam_1  |   File "sciencebeam/server/blueprints/api.py", line 70, in _convert
sciencebeam_1  |     content=content, filename=filename, data_type=data_type
sciencebeam_1  |   File "sciencebeam/pipeline_runners/simple_pipeline_runner.py", line 42, in convert
sciencebeam_1  |     current_item = step(current_item)
sciencebeam_1  |   File "sciencebeam/pipelines/__init__.py", line 53, in __call__
sciencebeam_1  |     return self.fn(data)
sciencebeam_1  |   File "sciencebeam/pipelines/grobid_pipeline.py", line 62, in 
sciencebeam_1  |     pdf_content=data['content']
sciencebeam_1  |   File "sciencebeam/pipelines/grobid_pipeline.py", line 56, in 
sciencebeam_1  |     convert_to_tei = lambda pdf_filename, pdf_content: call_grobid((pdf_filename, pdf_content))[1]
sciencebeam_1  |   File "sciencebeam/transformers/grobid_service.py", line 48, in do_grobid_service
sciencebeam_1  |     response.raise_for_status()
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/requests/models.py", line 935, in raise_for_status
sciencebeam_1  |     raise HTTPError(http_error_msg, response=self)
sciencebeam_1  | HTTPError: 400 Client Error: Bad Request for url: http://grobid:8070/api/processFulltextDocument
sciencebeam_1  | INFO:werkzeug:172.19.0.1 - - [16/May/2018 14:10:17] "POST /api/convert?filename=elife-32671-v2.pdf HTTP/1.1" 500 -

RuntimeError: OSError: [Errno 2] No such file or directory [while running 'Map(<functools.partial object at 0x7f367367bdb8>)']

What am I doing wrong? I installed sciencebeam a month ago; last week my VM got deleted, and now that I have reinstalled ScienceBeam I always get this:

python2 -m sciencebeam.examples.grobid_service_pdf_to_xml --input "/home/lopo/"
INFO:main:default_values: {'runner': 'FnApiRunner'}
INFO:main:parsed_args: Namespace(autoscaling_algorithm='NONE', cloud=False, grobid_action='/processHeaderDocument', grobid_url='http://localhost:8080/api', input='/home/lopo/', max_num_workers=10, num_workers=10, output_path='/home/lopo', output_suffix='.tei-header.xml', project=None, runner='FnApiRunner', setup_file='./setup.py', start_grobid_service=True, xslt_path=None)
INFO:root:==================== <function annotate_downstream_side_inputs at 0x7f3672f040c8> ====================
INFO:root:==================== <function fix_side_input_pcoll_coders at 0x7f3672f04758> ====================
INFO:root:==================== <function lift_combiners at 0x7f3672f042a8> ====================
INFO:root:==================== <function expand_gbk at 0x7f3672f041b8> ====================
INFO:root:==================== <function sink_flattens at 0x7f3672f04140> ====================
INFO:root:==================== <function greedily_fuse at 0x7f3672f047d0> ====================
INFO:root:==================== <function sort_stages at 0x7f3672f04848> ====================
INFO:root:Running (ref_AppliedPTransform__ReadFullFile/Read_3)+((ref_AppliedPTransform_Map(<functools.partial object at 0x7f367367bdb8>)_4)+((ref_AppliedPTransform_MapKeys_5)+(ref_AppliedPTransform_WriteToFile/Map()_7)))
INFO:root:start <DoOperation WriteToFile/Map() output_tags=['out']>
INFO:root:start <DoOperation MapKeys output_tags=['out']>
INFO:root:start <DoOperation Map(<functools.partial object at 0x7f367367bdb8>) output_tags=['out']>
INFO:root:start <ReadOperation _ReadFullFile/Read source=SourceBundle(weight=1.0, source=<sciencebeam.beam_utils.fileio._ReadFullFileSource object at 0x7f3672f6d4d0>, start_position=None, stop_position=None)>
INFO:sciencebeam.transformers.grobid_service_wrapper:grobid_service_instance: None
INFO:sciencebeam.transformers.grobid_service_wrapper:command_line: java -cp "/home/lopo/stuff/sciencebeam/.temp/grobid-service/lib/" org.grobid.service.main.GrobidServiceApplication
INFO:sciencebeam.transformers.grobid_service_wrapper:args: ['java', '-cp', '/home/tlopo/stuff/sciencebeam/.temp/grobid-service/lib/
', 'org.grobid.service.main.GrobidServiceApplication']
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/lopo/stuff/sciencebeam/sciencebeam/examples/grobid_service_pdf_to_xml.py", line 190, in
run()
File "/home/lopo/stuff/sciencebeam/sciencebeam/examples/grobid_service_pdf_to_xml.py", line 182, in run
configure_pipeline(p, known_args)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 410, in exit
self.run().wait_until_finish()
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 390, in run
self.to_runner_api(), self.runner, self._options).run(False)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/pipeline.py", line 403, in run
return self.runner.run_pipeline(self)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 218, in run_pipeline
return self.run_via_runner_api(pipeline.to_runner_api())
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 221, in run_via_runner_api
return self.run_stages(*self.create_stages(pipeline_proto))
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 859, in run_stages
pcoll_buffers, safe_coders).process_bundle.metrics
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 970, in run_stage
self._progress_frequency).process_bundle(data_input, data_output)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 1174, in process_bundle
result_future = self._controller.control_handler.push(process_bundle)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/portability/fn_api_runner.py", line 1054, in push
response = self.worker.do_instruction(request)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 208, in do_instruction
request.instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py", line 230, in process_bundle
processor.process_bundle(instruction_id)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py", line 289, in process_bundle
op.start()
File "apache_beam/runners/worker/operations.py", line 243, in apache_beam.runners.worker.operations.ReadOperation.start
File "apache_beam/runners/worker/operations.py", line 244, in apache_beam.runners.worker.operations.ReadOperation.start
File "apache_beam/runners/worker/operations.py", line 253, in apache_beam.runners.worker.operations.ReadOperation.start
File "apache_beam/runners/worker/operations.py", line 175, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 85, in apache_beam.runners.worker.operations.ConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 403, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 404, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 569, in apache_beam.runners.common.DoFnRunner.receive
File "apache_beam/runners/common.py", line 577, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 618, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 575, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 353, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/core.py", line 973, in
wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
File "/home/lopo/stuff/sciencebeam/sciencebeam/transformers/grobid_service.py", line 52, in run_grobid_service
start_service_if_not_running()
File "/home/lopo/stuff/sciencebeam/sciencebeam/transformers/grobid_service.py", line 25, in start_service_if_not_running
service_wrapper.start_service_if_not_running()
File "/home/lopo/stuff/sciencebeam/sciencebeam/transformers/grobid_service_wrapper.py", line 98, in start_service_if_not_running
args, cwd=cwd, stdout=PIPE, stderr=subprocess.STDOUT
File "/usr/lib/python2.7/subprocess.py", line 394, in init
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1047, in _execute_child
raise child_exception
RuntimeError: OSError: [Errno 2] No such file or directory [while running 'Map(<functools.partial object at 0x7f367367bdb8>)']

relative file lists

Add support for relative file lists:

  • when creating a file list, allow file paths to be relative to the base path
  • when loading file lists, make file paths absolute by default (using the directory of the file list as the base path)

The advantage is that a file list can be shared and moved around. Whereas absolute file paths would still point to the original source.

The default should be to use absolute paths, to avoid issues (e.g. reading the relative paths wouldn't be useful without taking the location of the file list into account).
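The proposed loading behaviour can be sketched like this (hypothetical helper; the actual implementation may differ):

```python
# Sketch: make relative paths in a file list absolute, using the directory
# of the file list itself as the base path. Hypothetical helper.
import os

def resolve_file_list(file_list_path, lines, to_absolute=True):
    base_dir = os.path.dirname(os.path.abspath(file_list_path))
    paths = [line.strip() for line in lines if line.strip()]
    if to_absolute:
        # already-absolute paths are left untouched
        paths = [
            path if os.path.isabs(path) else os.path.join(base_dir, path)
            for path in paths
        ]
    return paths
```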

Create demo app for ScienceBeam output displayed in Texture

  • Use GROBID XSL to transform title, abstract etc. to JATS
  • Load result into Texture locally
  • Create Docker image to deploy results of transform and texture to the web
  • Deploy to cloud
  • Create benchmark transforms and web page to see them in Texture (use eLife, PLoS, Hindawi and an open BMJ article).
  • Create API to take PDF and return JATS XML using GROBID XSL
  • Create upload PDF -> Texture page
