deepfigures-open's Introduction

deepfigures-open

Figure extraction using deep neural nets.

deepfigures-open is the companion code to the paper Extracting Scientific Figures with Distantly Supervised Neural Networks. It provides code to run our model and extract figures from PDFs, as well as code for generating our training data. The generated dataset used in our paper is available for download here.

Note: This is research code and is not intended for use in production.

Setup: Running the Model

Compile pdffigures2

Deepfigures depends on pdffigures2 for caption extraction. You must compile the utility and place it into the bin/ directory:

git clone https://github.com/allenai/pdffigures2
cd pdffigures2
sbt assembly
mv target/scala-2.11/pdffigures2-assembly-0.0.12-SNAPSHOT.jar ../bin
cd ..
rm -rf pdffigures2

If the pdffigures2 jar has a different name than 'pdffigures2-assembly-0.0.12-SNAPSHOT.jar', adjust the PDFFIGURES_JAR_NAME parameter in deepfigures/settings.py accordingly.
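As a quick check before editing the setting, the following minimal sketch looks in bin/ for a pdffigures2 assembly jar and reports the name that PDFFIGURES_JAR_NAME should hold. The helper names and the glob pattern are assumptions made for illustration; they are not part of deepfigures.

```python
from pathlib import Path

# Default jar name quoted in the instructions above.
DEFAULT_JAR_NAME = "pdffigures2-assembly-0.0.12-SNAPSHOT.jar"


def find_pdffigures_jars(bin_dir):
    """Return the sorted names of pdffigures2 assembly jars found in bin_dir."""
    # Assumption: every build produces a jar named pdffigures2-assembly-*.jar.
    return sorted(p.name for p in Path(bin_dir).glob("pdffigures2-assembly-*.jar"))


def jar_name_to_configure(bin_dir, expected=DEFAULT_JAR_NAME):
    """Return the jar name to put in PDFFIGURES_JAR_NAME, or None if no jar exists.

    Hypothetical helper: prefers the expected name, otherwise falls back to
    the first jar matching the assembly naming pattern.
    """
    jars = find_pdffigures_jars(bin_dir)
    if expected in jars:
        return expected
    return jars[0] if jars else None
```

Calling jar_name_to_configure("bin") from the repository root returns None when the jar has not been moved into place yet.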

Download Weights for the Model

You have to download weights for the deepfigures model into this repository in order to run it. You can download a tarball of the weights here. Once you've downloaded the tarball, extract it and place the weights/ directory in the root of this repository.

If you choose to name the weights directory something different, be sure to update the TENSORBOX_MODEL constant in deepfigures/settings.py.

Setup: Generating Training Data

Set Arxiv Data Directories

In deepfigures/settings.py, set the ARXIV_DATA_TMP_DIR and ARXIV_DATA_OUTPUT_DIR variables to local directories on your machine. Make sure these directories have at least a few TB of free storage, since there are a lot of arXiv papers.
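One way to confirm the storage requirement before starting a long run is to query free space with the standard library. This is a minimal sketch; the 2 TB default is an illustrative threshold chosen here, not a figure from the project, and the function names are hypothetical.

```python
import shutil

TB = 1000 ** 4  # one terabyte in bytes (decimal convention)


def has_free_space(path, required_bytes):
    """Return True if the filesystem containing path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


def check_arxiv_dirs(dirs, required_tb=2):
    """Map each directory to whether it has required_tb terabytes free.

    required_tb=2 is an assumed threshold for illustration only.
    """
    return {d: has_free_space(d, required_tb * TB) for d in dirs}
```

Pass the values chosen for ARXIV_DATA_TMP_DIR and ARXIV_DATA_OUTPUT_DIR as the dirs argument; both directories must already exist.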

Set the Pubmed Data Directories

In deepfigures/settings.py, set the PUBMED_INPUT_DIR, PUBMED_INTERMEDIATE_DIR, PUBMED_DISTANT_DATA_DIR, and LOCAL_PUBMED_DISTANT_DATA_DIR variables to four different directories.

PUBMED_INPUT_DIR, PUBMED_INTERMEDIATE_DIR, and PUBMED_DISTANT_DATA_DIR can be directories in S3, but LOCAL_PUBMED_DISTANT_DATA_DIR should be a local directory.
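The two constraints above (all four directories differ, and the local one is not on S3) can be checked with a small sketch like the following. It assumes S3 locations are written as s3:// URIs, and validate_pubmed_dirs is a hypothetical helper, not a function in deepfigures.

```python
PUBMED_SETTING_NAMES = [
    "PUBMED_INPUT_DIR",
    "PUBMED_INTERMEDIATE_DIR",
    "PUBMED_DISTANT_DATA_DIR",
    "LOCAL_PUBMED_DISTANT_DATA_DIR",
]


def is_s3_path(path):
    """Assumption: S3 locations are written as s3:// URIs."""
    return path.startswith("s3://")


def validate_pubmed_dirs(settings):
    """Return a list of problems found in the four Pubmed directory settings."""
    problems = []
    # Ignore trailing slashes so "data/" and "data" count as the same directory.
    values = [settings[name].rstrip("/") for name in PUBMED_SETTING_NAMES]
    if len(set(values)) != len(values):
        problems.append("the four directories must all be different")
    if is_s3_path(settings["LOCAL_PUBMED_DISTANT_DATA_DIR"]):
        problems.append("LOCAL_PUBMED_DISTANT_DATA_DIR must be a local directory")
    return problems
```

An empty list means both constraints hold for the mapping of setting names to values that was passed in.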

Additionally, PUBMED_INPUT_DIR should have all of the Pubmed Open Access subset papers split into directories with the following structure:

xx/yy/example-pmc-data.tar.gz

Where xx and yy range from 00 to ff.
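The layout above can be expressed as a short sketch that validates a relative path and enumerates the 256 × 256 shard directories. It assumes lowercase hexadecimal digits; how papers are assigned to a given xx/yy directory is not specified here, so the sketch does not attempt it.

```python
import re

# Two lowercase hex digits per directory level, then a gzipped tarball.
# Assumption: directory names use lowercase hex, as in "00" through "ff".
SHARD_PATTERN = re.compile(r"^[0-9a-f]{2}/[0-9a-f]{2}/[^/]+\.tar\.gz$")


def is_valid_shard_path(relative_path):
    """Return True if relative_path looks like xx/yy/<name>.tar.gz."""
    return SHARD_PATTERN.match(relative_path) is not None


def all_shard_dirs():
    """Enumerate every xx/yy directory from 00/00 through ff/ff."""
    return [f"{x:02x}/{y:02x}" for x in range(256) for y in range(256)]
```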

Install Dependencies

Make sure you have Docker installed and that all of the Python requirements are installed:

pip install -r requirements.txt

AWS Integration

Much of the functionality in this code requires AWS (for example, downloading the arXiv data). If you want to use this functionality, make sure the deepfigures-local.env file is filled out with your AWS credentials. Please note that running this code with the AWS functionality will incur charges on your AWS account.

The AWS integration is used for:

  • downloading the arXiv data dump from S3 to generate the arXiv paper labels.
  • storing intermediate computations in S3 while running the pubmed data pipeline.

For most use cases, users will prefer to download the dataset directly rather than rebuilding it themselves.

Using the Library

Use the manage.py script in the root of this repository to view common commands for development. To get a list of commands, run:

python manage.py --help

You'll see something like:

$ python manage.py --help
Usage: manage.py [OPTIONS] COMMAND [ARGS]...

  A high-level interface to admin scripts for deepfigures.

Options:
  -v, --verbose        Turn on verbose logging for debugging purposes.
  -l, --log-file TEXT  Log to the provided file path instead of stdout.
  -h, --help           Show this message and exit.

Commands:
  build           Build docker images for deepfigures.
  detectfigures   Run figure extraction on the PDF at PDF_PATH.
  generatearxiv   Generate arxiv data for deepfigures.
  generatepubmed  Generate pubmed data for deepfigures.
  testunits       Run unit tests for deepfigures.

To learn more about a command, call it with the --help option.

To extract figures from a PDF, use the detectfigures command.
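A minimal sketch of driving detectfigures from Python follows. It assumes the command takes OUTPUT_DIRECTORY and then PDF_PATH, mirroring the rundetection.py usage line; confirm the order with python manage.py detectfigures --help. The helper name detectfigures_command is hypothetical.

```python
import sys
from pathlib import Path


def detectfigures_command(output_directory, pdf_path, manage_py="manage.py"):
    """Build the argument list for `python manage.py detectfigures`.

    Assumption: detectfigures takes OUTPUT_DIRECTORY first and PDF_PATH second.
    """
    pdf = Path(pdf_path)
    # Fail early with a clear message instead of inside the Docker container.
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf}")
    return [sys.executable, manage_py, "detectfigures", str(output_directory), str(pdf)]
```

From the repository root, the resulting list can be passed to subprocess.run(cmd, check=True). Note that the command builds the Docker image first, so the first invocation can take a while.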

Contact

For questions, contact the authors of the paper Extracting Scientific Figures with Distantly Supervised Neural Networks.

deepfigures-open's People

Contributors

milescrawford, nalourie-ai2, nkconnor

deepfigures-open's Issues

Error: Page 2 is an image and allow OCR is turned off

10:40:30.718 [main] DEBUG o.a.pdffigures2.GraphicsExtractor$ - Page 2 is an image and allow OCR is false, giving up
Exception in thread "main" org.allenai.pdffigures2.FigureExtractor$OcredPdfException: Page 2 is an image and allow OCR is turned off
at org.allenai.pdffigures2.GraphicsExtractor$.extractRawGraphics(GraphicsExtractor.scala:66)
at org.allenai.pdffigures2.GraphicsExtractor$.extractGraphics(GraphicsExtractor.scala:32)
at org.allenai.pdffigures2.FigureExtractor$$anonfun$7.apply(FigureExtractor.scala:133)

Did anyone get this error?

./vendor/tensorbox indicated but missing from repository

ADD ./vendor/tensorbox /work/vendor/tensorbox

Docker build fails:

Step 8/14 : ADD ./vendor/tensorbox /work/vendor/tensorbox
ADD failed: stat /var/lib/docker/tmp/docker-builder498793371/vendor/tensorbox: no such file or directory
Traceback (most recent call last):
  File "manage.py", line 70, in <module>
    manage()
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/nconnor/Repos/deepfigures-open/scripts/detectfigures.py", line 49, in detectfigures
    build.build.callback()
  File "/home/nconnor/Repos/deepfigures-open/scripts/build.py", line 34, in build
    logger)
  File "/home/nconnor/Repos/deepfigures-open/scripts/__init__.py", line 54, in execute
    cmd=command)

installation problem

Hello,
Thanks for the work on deepfigures-open.
I have an installation problem when I try to install it on a virtual machine.
The sbt assembly step returns this output:
[error] /home/opencv/pdffigures2/src/main/scala/org/allenai/pdffigures2/VisualLogger.scala:173:45: value EXIT_ON_CLOSE is not a member of object javax.swing.JFrame
[error] frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE)
[error] ^
[error] one error found
[error] (Compile / compileIncremental) Compilation failed

My system is:
Openjdk 11.0.5 2019-10-15
OpenJDK Runtime Environment (build 11.0.5+10-post-Ubuntu-0ubuntu1.118.04)
OpenJDK 64-Bit Server VM (build 11.0.5+10-post-Ubuntu-0ubuntu1.118.04, mixed mode, sharing)

Maybe you can help me figure out what's wrong.
Thanks for your help.

Server access Error

Would you be able to explain what might cause this problem during installation?

:: problems summary ::
:::: ERRORS
Server access Error: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target url=https://repo.typesafe.com/typesafe/ivy-releases/org.sonatype.oss/oss-parent/9/jars/oss-parent.jar

Thanks

Invalid value for "PDF_PATH": File "/work/host-input/paper.pdf" does not exist.

Docker build has succeeded.

Usage: rundetection.py [OPTIONS] OUTPUT_DIRECTORY PDF_PATH
Try "rundetection.py -h" for help.

Error: Invalid value for "PDF_PATH": File "/work/host-input/paper.pdf" does not exist.
Traceback (most recent call last):
  File "manage.py", line 70, in <module>
    manage()
  File "/media/amax/Masters/liupengxue/lpx/lib/python3.5/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/media/amax/Masters/liupengxue/lpx/lib/python3.5/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/media/amax/Masters/liupengxue/lpx/lib/python3.5/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/media/amax/Masters/liupengxue/lpx/lib/python3.5/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/media/amax/Masters/liupengxue/lpx/lib/python3.5/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/media/amax/Masters/liupengxue/deepfigures/deepfigures-open-master/scripts/detectfigures.py", line 79, in detectfigures
    raise_error=True)
  File "/media/amax/Masters/liupengxue/deepfigures/deepfigures-open-master/scripts/__init__.py", line 54, in execute
    cmd=command)
subprocess.CalledProcessError: Command 'docker run --rm --env-file deepfigures-local.env --volume /media/amax/Masters/liupengxue/deepfigures/deepfigures-open-master/results:/work/host-output/ --volume /media/amax/Masters/liupengxue/deepfigures/deepfigures-open-master:/work/host-input/ deepfigures-cpu:0.0.1 python3 /work/scripts/rundetection.py /work/host-output/ /work/host-input/paper.pdf' returned non-zero exit status 2

java.lang.NullPointerException while installing pdffigures2

I installed Scala and tried to install pdffigures2, but it failed:

richard@pito ~/src $ rm -rf pdffigures*
richard@pito ~/src $ git clone https://github.com/allenai/pdffigures2
Cloning into 'pdffigures2'...
remote: Enumerating objects: 392, done.
remote: Total 392 (delta 0), reused 0 (delta 0), pack-reused 392
Receiving objects: 100% (392/392), 6.43 MiB | 2.48 MiB/s, done.
Resolving deltas: 100% (111/111), done.
richard@pito ~/src $ cd pdffigures2
richard@pito ~/src/pdffigures2 (master) $ sbt assembly
[info] Loading project definition from /home/richard/src/pdffigures2/project
[info] Updating {file:/home/richard/src/pdffigures2/project/}pdffigures2-build...
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by sbt.ivyint.ErrorMessageAuthenticator$ (file:/home/richard/.sbt/boot/scala-2.10.4/org.scala-sbt/sbt/0.13.8/ivy-0.13.8.jar) to field java.net.Authenticator.theAuthenticator
WARNING: Please consider reporting this to the maintainers of sbt.ivyint.ErrorMessageAuthenticator$
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
java.lang.NullPointerException
	at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
	at java.base/java.util.regex.Matcher.reset(Matcher.java:416)
	at java.base/java.util.regex.Matcher.<init>(Matcher.java:253)
	at java.base/java.util.regex.Pattern.matcher(Pattern.java:1130)
	at java.base/java.util.regex.Pattern.split(Pattern.java:1249)
	at java.base/java.util.regex.Pattern.split(Pattern.java:1322)
	at sbt.IO$.pathSplit(IO.scala:744)
	at sbt.IO$.parseClasspath(IO.scala:859)
	at sbt.compiler.CompilerArguments.extClasspath(CompilerArguments.scala:62)
	at sbt.compiler.MixedAnalyzingCompiler$.withBootclasspath(MixedAnalyzingCompiler.scala:189)
	at sbt.compiler.MixedAnalyzingCompiler$.searchClasspathAndLookup(MixedAnalyzingCompiler.scala:167)
	at sbt.compiler.MixedAnalyzingCompiler$.apply(MixedAnalyzingCompiler.scala:177)
	at sbt.compiler.IC$.incrementalCompile(IncrementalCompiler.scala:138)
	at sbt.Compiler$.compile(Compiler.scala:128)
	at sbt.Compiler$.compile(Compiler.scala:114)
	at sbt.Defaults$.sbt$Defaults$$compileIncrementalTaskImpl(Defaults.scala:814)
	at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:805)
	at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:803)
	at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
	at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
	at sbt.std.Transform$$anon$4.work(System.scala:63)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
	at sbt.Execute.work(Execute.scala:235)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
	at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:844)
[error] (compile:compileIncremental) java.lang.NullPointerException
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? r
[info] Loading project definition from /home/richard/src/pdffigures2/project
java.lang.NullPointerException
	at java.base/java.util.regex.Matcher.getTextLength(Matcher.java:1769)
	at java.base/java.util.regex.Matcher.reset(Matcher.java:416)
	at java.base/java.util.regex.Matcher.<init>(Matcher.java:253)
	at java.base/java.util.regex.Pattern.matcher(Pattern.java:1130)
	at java.base/java.util.regex.Pattern.split(Pattern.java:1249)
	at java.base/java.util.regex.Pattern.split(Pattern.java:1322)
	at sbt.IO$.pathSplit(IO.scala:744)
	at sbt.IO$.parseClasspath(IO.scala:859)
	at sbt.compiler.CompilerArguments.extClasspath(CompilerArguments.scala:62)
	at sbt.compiler.MixedAnalyzingCompiler$.withBootclasspath(MixedAnalyzingCompiler.scala:189)
	at sbt.compiler.MixedAnalyzingCompiler$.searchClasspathAndLookup(MixedAnalyzingCompiler.scala:167)
	at sbt.compiler.MixedAnalyzingCompiler$.apply(MixedAnalyzingCompiler.scala:177)
	at sbt.compiler.IC$.incrementalCompile(IncrementalCompiler.scala:138)
	at sbt.Compiler$.compile(Compiler.scala:128)
	at sbt.Compiler$.compile(Compiler.scala:114)
	at sbt.Defaults$.sbt$Defaults$$compileIncrementalTaskImpl(Defaults.scala:814)
	at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:805)
	at sbt.Defaults$$anonfun$compileIncrementalTask$1.apply(Defaults.scala:803)
	at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
	at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
	at sbt.std.Transform$$anon$4.work(System.scala:63)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:226)
	at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
	at sbt.Execute.work(Execute.scala:235)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:226)
	at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
	at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:844)
[error] (compile:compileIncremental) java.lang.NullPointerException
Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore? i
[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: assembly
[error] assembly
[error]         ^

I also tried running sbt compile, as per allenai/pdffigures2#16 (comment).

Failed to access the docker container while running the code.

Failed to access the docker container while running the code. Kindly let me know how to resolve this error.

Error Log:

2018-12-12 17:51:48,308:INFO:scripts.build:Executing: docker build --tag deepfigures-cpu:0.0.1 --file /home/saravanan/deepfigures-open/dockerfiles/cpu/Dockerfile .
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.38/build?buildargs=%7B%7D&cachefrom=%5B%5D&cgroupparent=&cpuperiod=0&cpuquota=0&cpusetcpus=&cpusetmems=&cpushares=0&dockerfile=dockerfiles%2Fcpu%2FDockerfile&labels=%7B%7D&memory=0&memswap=0&networkmode=default&rm=1&session=o7k4ifza9sfm8p1y323f64cp9&shmsize=0&t=deepfigures-cpu%3A0.0.1&target=&ulimits=null&version=1: dial unix /var/run/docker.sock: connect: permission denied
Traceback (most recent call last):
  File "manage.py", line 70, in <module>
    manage()
  File "/home/saravanan/anaconda3/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/saravanan/anaconda3/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/saravanan/anaconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/saravanan/anaconda3/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/saravanan/anaconda3/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/saravanan/deepfigures-open/scripts/detectfigures.py", line 49, in detectfigures
    build.build.callback()
  File "/home/saravanan/deepfigures-open/scripts/build.py", line 34, in build
    logger)
  File "/home/saravanan/deepfigures-open/scripts/__init__.py", line 54, in execute
    cmd=command)
subprocess.CalledProcessError: Command 'docker build --tag deepfigures-cpu:0.0.1 --file /home/saravanan/deepfigures-open/dockerfiles/cpu/Dockerfile .' returned non-zero exit status 1.
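
The permission error in this log concerns the Docker daemon socket at /var/run/docker.sock. A minimal sketch for checking, before running manage.py, whether the current user can read and write that socket is given below; can_access_docker_socket is a hypothetical helper, and the default path is simply the one that appears in the error message.

```python
import os

# Socket path taken from the error message in the log above.
DOCKER_SOCKET = "/var/run/docker.sock"


def can_access_docker_socket(path=DOCKER_SOCKET):
    """Return True if the path exists and the current user can read and write it."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)
```

If this returns False on a machine where Docker is installed, the usual remedies are to run the command with sudo or to add the user to the docker group.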
