pdfextract's Introduction

pdf-extract

A tool and library that can extract various areas of text from a PDF, especially a scholarly article PDF. It performs structural analysis to determine column bounds, headers, footers, sections, titles and so on. It can analyse and categorise sections into reference and non-reference sections and can split reference sections into individual references.

The latest version is 0.1.1. Earlier versions are far less reliable.

pdf-extract requires Ruby 1.9.1 or above.

Quick start

Install the latest version with:

$ gem install pdf-extract

Quick examples

Extract references from a PDF:

$ pdf-extract extract --references myfile.pdf

Extract references and a title from a PDF:

$ pdf-extract extract --references --titles myfile.pdf

Mark the locations of headers, footers and columns in a new PDF:

$ pdf-extract mark --columns --headers --footers myfile.pdf

Extract regions of text from a PDF, preserving line information (offsets from region origin):

$ pdf-extract extract --regions myfile.pdf

Extract regions of text from a PDF without line information (prettier and easier to read):

$ pdf-extract extract --regions --no-lines myfile.pdf

Resolve references to DOIs and output related metadata as BibTeX:

$ pdf-extract extract-bib --resolved_references myfile.pdf

Problems

pdf-extract mistakes normal text for references when attempting to extract references.

pdf-extract attempts to identify reference sections by comparing section features to an idealised model of a reference section. Sometimes this can go wrong. If pdf-extract is producing reference output that clearly includes something that is not a reference, try reducing the reference_flex slightly:

$ pdf-extract extract --references --set reference_flex:0.18 myfile.pdf

The default for reference_flex is 0.2. Make small decrements.

pdf-extract extracts no references.

As above, but increase reference_flex a little at a time:

$ pdf-extract extract --references --set reference_flex:0.25 myfile.pdf

Keep trying with small increments to reference_flex. Note that a reference_flex of 1 means pdf-extract will identify all sections as reference sections.
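The two troubleshooting steps above amount to sweeping reference_flex in small steps and inspecting the output each time. A minimal Ruby sketch of that loop follows. The helper names (flex_values, extract_command, sweep) are hypothetical and not part of pdf-extract; the only CLI usage assumed is the --set reference_flex:VALUE form shown above, plus the assumption that a two-decimal value such as 0.20 is accepted the same way as 0.2.

```ruby
# Hypothetical helpers for stepping reference_flex; not part of pdf-extract.

# Build a list of reference_flex values from start to stop (inclusive).
# Values are given in hundredths (20 means 0.20) so the steps stay exact.
def flex_values(start_hundredths, stop_hundredths, step_hundredths)
  step = start_hundredths <= stop_hundredths ? step_hundredths : -step_hundredths
  start_hundredths.step(stop_hundredths, step).map { |h| h / 100.0 }
end

# Build the argument list for one run, mirroring the commands shown above.
def extract_command(pdf, flex)
  ["pdf-extract", "extract", "--references",
   "--set", format("reference_flex:%.2f", flex), pdf]
end

# Run pdf-extract once per value, printing each command before its output.
def sweep(pdf, values)
  values.each do |flex|
    cmd = extract_command(pdf, flex)
    puts "== #{cmd.join(' ')}"
    system(*cmd)
  end
end
```

Calling sweep("myfile.pdf", flex_values(20, 10, 2)) would step down from the default 0.2 to 0.1 when non-references are being picked up; flex_values(20, 40, 5) would step upward when nothing is extracted.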

pdf-extract is still producing weird output after fiddling with reference_flex.

Have a look at pdf-extract's settings:

$ pdf-extract settings

This command will produce a list of settings along with descriptions of what they affect. They can be set by passing a --set key:value argument to pdf-extract.

pdfextract's People

Contributors

jdherman, kjw, pwnall

pdfextract's Issues

Not all references extracted

Hello everybody!

I have tried to extract references from several papers, but 10-11 references are always missing. In each paper it is the first 10-11 references under the title "Reference" that are missing. They are either on another page or in another column of the text.

I have tried --set reference_flex:0.25, and also values from 0.1 to 0.9. It makes no difference to the extraction.

pdf-extract works perfectly otherwise.

Does anyone know what I can do to fix this?

Thanks in advance!
Robin

Corrupt on install?

Not having any luck installing this on Red Hat Linux:

Fetching: pdf-extract-0.1.1.gem (100%)
ERROR: Error installing pdf-extract:
invalid gem: package is corrupt, exception while verifying: undefined method `path2class' for #Psych::ClassLoader:0x000000020233a0 (NoMethodError) in /home/scaddenp/.gem/ruby/cache/pdf-extract-0.1.1.gem

ruby -v
ruby 2.0.0p598 (2014-11-13) [x86_64-linux]

gem list
afm (0.2.2)
Ascii85 (1.0.2)
bigdecimal (1.2.7, 1.2.0)
commander (4.4.0)
hashery (2.1.2)
highline (1.7.8)
io-console (0.4.6, 0.4.2)
json (2.0.2, 1.7.7)
libsvm-ruby-swig (0.4.0)
mini_portile2 (2.1.0)
nokogiri (1.6.8)
pdf-core (0.6.1)
pdf-extract (0.1.1)
pdf-reader (1.4.0)
pkg-config (1.1.7)
prawn (2.1.0)
psych (2.1.0, 2.0.0)
rdoc (4.2.2, 4.0.0)
ruby-rc4 (0.1.5)
rubygems-update (2.6.6)
sqlite3 (1.3.11)
ttfunk (1.4.0)

error: undefined method `ascent'

I get this error for all the PDFs I have:

amit@amit:~/projects/testapps/pdf/Untitled Folder/pdfs$ pdf-extract extract --references --titles --trace AUSTIN.pdf
/home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/font_metrics.rb:42:in `initialize': undefined method `ascent' for #<PDF::Reader::Font:0x000000030d97d0> (NoMethodError)
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/model/characters.rb:134:in `new'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/model/characters.rb:134:in `block in build_fonts'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/model/characters.rb:131:in `each'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/model/characters.rb:131:in `build_fonts'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/model/characters.rb:163:in `block (2 levels) in include_in'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf.rb:81:in `call'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf.rb:81:in `block (2 levels) in expand_listeners_to_callback_methods'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf.rb:170:in `block in invoke_calls'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf.rb:169:in `each'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf.rb:169:in `invoke_calls'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:42:in `block in parse'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:38:in `each'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:38:in `parse'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:53:in `view'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/bin/pdf-extract:115:in `block (4 levels) in <top (required)>'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/bin/pdf-extract:112:in `each'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/pdf-extract-0.1.1/bin/pdf-extract:112:in `block (3 levels) in <top (required)>'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/command.rb:178:in `call'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/command.rb:178:in `call'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/command.rb:153:in `run'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/runner.rb:444:in `run_active_command'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/runner.rb:68:in `run!'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/delegates.rb:15:in `run!'
    from /home/amit/.rvm/gems/ruby-2.2.2@magnificent/gems/commander-4.4.0/lib/commander/import.rb:5:in `block in <top (required)>'

Failed to install gem

I'm using Mac OS 10.7 with ruby 1.9.2p290.

Thiagos-MacBook-Pro:~ thiagoperes$ gem install pdf-extract
Building native extensions. This could take a while...
ERROR: Error installing pdf-extract:
ERROR: Failed to build gem native extension.

    /Users/thiagoperes/.rvm/rubies/ruby-1.9.2-p290/bin/ruby extconf.rb

checking for Ruby version >= 1.8.5... yes
checking for gcc... yes
checking for Magick-config... no
Can't install RMagick 2.13.1. Can't find Magick-config in /Users/thiagoperes/.rvm/gems/ruby-1.9.2-p290/bin:/Users/thiagoperes/.rvm/gems/ruby-1.9.2-p290@global/bin:/Users/thiagoperes/.rvm/rubies/ruby-1.9.2-p290/bin:/Users/thiagoperes/.rvm/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/local/git/bin

*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers. Check the mkmf.log file for more
details. You may need configuration options.

Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=/Users/thiagoperes/.rvm/rubies/ruby-1.9.2-p290/bin/ruby

Gem files will remain installed in /Users/thiagoperes/.rvm/gems/ruby-1.9.2-p290/gems/rmagick-2.13.1 for inspection.
Results logged to /Users/thiagoperes/.rvm/gems/ruby-1.9.2-p290/gems/rmagick-2.13.1/ext/RMagick/gem_make.out

Install Error

Fedora 22
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
gcc version 5.1.1 20150422 (Red Hat 5.1.1-1) (GCC)

I tried to build and install with gem install pdf-extract.

This is the output:

Fetching: libsvm-ruby-swig-0.4.0.gem (100%)
Building native extensions. This could take a while...
/usr/share/rubygems/rubygems/ext/builder.rb:73: warning: Insecure world writable dir /home/antonio in PATH, mode 040777
ERROR: Error installing pdf-extract:
ERROR: Failed to build gem native extension.

/usr/bin/ruby -r ./siteconf20150527-13449-o56yyi.rb extconf.rb

creating Makefile

make "DESTDIR=" clean
rm -f
rm -f libsvm.so .o *.bak mkmf.log ..time

make "DESTDIR="
g++ -I. -I/usr/include -I/usr/include/ruby/backward -I/usr/include -I. -fPIC -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -mtune=generic -m64 -o libsvm_wrap.o -c libsvm_wrap.cxx
libsvm_wrap.cxx: In function ‘void SWIG_Ruby_define_class(swig_type_info_)’:
libsvm_wrap.cxx:1487:9: warning: variable ‘klass’ set but not used [-Wunused-but-set-variable]
VALUE klass;
^
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_svm_type_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2126:136: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2131:131: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_svm_type_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2154:136: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_kernel_type_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2179:139: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2184:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_kernel_type_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2207:139: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_degree_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2232:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2237:129: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_degree_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2260:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_gamma_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2285:133: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2290:131: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_gamma_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2313:133: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_coef0_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2338:133: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2343:131: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_coef0_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2366:133: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_cache_size_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2391:138: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2396:136: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_cache_size_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2419:138: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_eps_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2444:131: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2449:129: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_eps_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2472:131: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_C_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2497:129: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2502:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_C_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2525:129: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_nr_weight_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2550:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2555:132: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_nr_weight_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2578:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_weight_label_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2603:140: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2608:133: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_weight_label_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2631:140: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_weight_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2656:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2661:130: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_weight_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2684:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_nu_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2709:130: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2714:128: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_nu_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2737:130: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_p_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2762:129: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2767:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_p_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2790:129: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_shrinking_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2815:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2820:132: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_shrinking_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2843:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_probability_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2868:139: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2873:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_parameter_probability_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2896:139: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_problem_l_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2960:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:2965:124: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_problem_l_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:2988:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_problem_y_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3013:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3018:125: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_problem_y_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3041:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_problem_x_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3066:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3071:128: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_problem_x_get(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3094:127: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_train(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3158:144: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3163:146: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_cross_validation(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3194:155: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3199:157: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3204:143: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3209:144: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_save_model(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3236:142: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3241:147: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_load_model(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3268:142: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_get_svm_type(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3294:149: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_get_nr_class(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3318:149: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_get_labels(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3343:147: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3348:135: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_get_svr_probability(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3371:156: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_predict_values(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3399:151: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3404:150: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3409:142: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_predict(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3435:144: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3440:143: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_predict_probability(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3470:156: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3475:155: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3480:147: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_destroy_model(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3502:144: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_check_parameter(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3528:154: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3533:156: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_check_probability_model(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3557:160: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_new_int(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3581:133: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_delete_int(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3603:131: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_int_getitem(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3629:132: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3634:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_int_setitem(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3662:132: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3667:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3672:134: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_new_double(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3695:136: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_delete_double(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3717:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_double_getitem(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3743:138: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3748:140: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_double_setitem(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3776:138: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3781:140: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3786:140: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_node_array(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3834:137: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_node_array_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3865:144: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3870:141: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3875:141: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3880:144: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_node_array_destroy(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3901:148: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_node_matrix(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3924:138: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_node_matrix_set(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3952:146: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3957:142: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx:3962:145: error: format not a string literal and no format arguments [-Werror=format-security]
libsvm_wrap.cxx: In function ‘VALUE wrap_svm_node_matrix_destroy(int, VALUE, VALUE)’:
libsvm_wrap.cxx:3983:150: error: format not a string literal and no format arguments [-Werror=format-security]
cc1plus: some warnings being treated as errors
Makefile:222: recipe for target 'libsvm_wrap.o' failed
make: *** [libsvm_wrap.o] Error 1

make failed, exit code 2

Gem files will remain installed in /home/antonio/.gem/ruby/gems/libsvm-ruby-swig-0.4.0 for inspection.

Results logged to /home/antonio/.gem/ruby/extensions/x86_64-linux/libsvm-ruby-swig-0.4.0/gem_make.out

font_metrics.rb:42:in `initialize': undefined method `ascent'

I installed pdf-extract using gem install and I'm getting the following error. A change in the library?

Update: downgrading to ruby-1.9.1 does not help

$ pdf-extract --trace extract --references --titles d912f50dae928909ed.pdf
/Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/font_metrics.rb:42:in `initialize': undefined method `ascent' for #<PDF::Reader::Font:0x007fc611c82650> (NoMethodError)
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/model/characters.rb:134:in `new'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/model/characters.rb:134:in `block in build_fonts'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/model/characters.rb:131:in `each'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/model/characters.rb:131:in `build_fonts'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/model/characters.rb:163:in `block (2 levels) in include_in'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf.rb:81:in `call'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf.rb:81:in `block (2 levels) in expand_listeners_to_callback_methods'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf.rb:170:in `block in invoke_calls'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf.rb:169:in `each'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf.rb:169:in `invoke_calls'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:42:in `block in parse'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:38:in `each'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:38:in `parse'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:53:in `view'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/bin/pdf-extract:115:in `block (4 levels) in <top (required)>'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/bin/pdf-extract:112:in `each'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/pdf-extract-0.1.1/bin/pdf-extract:112:in `block (3 levels) in <top (required)>'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/command.rb:180:in `call'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/command.rb:180:in `call'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/command.rb:155:in `run'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/runner.rb:402:in `run_active_command'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/runner.rb:78:in `run!'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/delegates.rb:11:in `run!'
        from /Users//.rvm/gems/ruby-1.9.3-p362/gems/commander-4.1.3/lib/commander/import.rb:10:in `block in <top (required)>'

Dupes with extract-bibs

The extracted bibtex files often seem to contain exact duplicate entries, which is causing me issues when trying to parse them.
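One possible workaround until the duplicates are fixed at the source is to filter the BibTeX output before parsing it. The following is a minimal Ruby sketch, not part of pdf-extract: split_entries and dedupe_entries are hypothetical helper names, and the sketch assumes each entry starts on a line beginning with "@" and that duplicates are character-for-character identical once surrounding whitespace is trimmed.

```ruby
# Hypothetical helpers for removing exact duplicate BibTeX entries.

# Split BibTeX text into entries. Assumes every entry begins at a line
# that starts with "@"; text before the first entry is kept as its own chunk.
def split_entries(bibtex)
  bibtex.split(/^(?=@)/).map(&:strip).reject(&:empty?)
end

# Keep the first occurrence of each entry and drop later identical copies.
def dedupe_entries(bibtex)
  split_entries(bibtex).uniq.join("\n\n") + "\n"
end
```

For a file on disk this could be applied as File.write("deduped.bib", dedupe_entries(File.read("refs.bib"))), where both file names are placeholders. Entries that differ in any character, including the citation key, are treated as distinct.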

Problem with test file after installation (NoMethodError)

After installation, trying to extract the references of a PDF produces the following error:


/Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/font_metrics.rb:42:in `initialize': undefined method `ascent' for #<PDF::Reader::Font:0x007fd8cb1692d8> (NoMethodError)
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/model/characters.rb:134:in `new'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/model/characters.rb:134:in `block in build_fonts'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/model/characters.rb:131:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/model/characters.rb:131:in `build_fonts'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/model/characters.rb:163:in `block (2 levels) in include_in'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf.rb:81:in `call'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf.rb:81:in `block (2 levels) in expand_listeners_to_callback_methods'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf.rb:170:in `block in invoke_calls'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf.rb:169:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf.rb:169:in `invoke_calls'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:42:in `block in parse'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:38:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:38:in `parse'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:53:in `view'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/bin/pdf-extract:115:in `block (4 levels) in <top (required)>'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/bin/pdf-extract:112:in `each'
    from /Library/Ruby/Gems/2.0.0/gems/pdf-extract-0.1.1/bin/pdf-extract:112:in `block (3 levels) in <top (required)>'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/command.rb:180:in `call'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/command.rb:180:in `call'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/command.rb:155:in `run'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/runner.rb:421:in `run_active_command'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/runner.rb:81:in `run!'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/delegates.rb:8:in `run!'
    from /Library/Ruby/Gems/2.0.0/gems/commander-4.2.1/lib/commander/import.rb:10:in `block in <top (required)>

Can't install pdfextract

Hi,

If I try to install via gem (gem2.0 or gem1.9), I get this error:

% sudo gem install pdf-extract ±[master]
Building native extensions. This could take a while...
ERROR: Error installing pdf-extract:
ERROR: Failed to build gem native extension.

    /usr/bin/ruby1.9.1 extconf.rb

creating Makefile

make

compiling libsvm_wrap.cxx

And if I try to build the program myself with rake, I get:

rake2.0 build ±[master]
rake aborted!
cannot load such file -- libsvm
/home/teto/pdfextract/lib/pdf/extract/references/score.rb:1:in `<top (required)>'
/home/teto/pdfextract/lib/pdf/extract/references/references.rb:3:in `require_relative'
/home/teto/pdfextract/lib/pdf/extract/references/references.rb:3:in `<top (required)>'
/home/teto/pdfextract/lib/pdf/extract.rb:10:in `require_relative'
/home/teto/pdfextract/lib/pdf/extract.rb:10:in `<top (required)>'
/home/teto/pdfextract/lib/pdf-extract.rb:1:in `require_relative'
/home/teto/pdfextract/lib/pdf-extract.rb:1:in `<top (required)>'
/home/teto/pdfextract/tasks/assign.rb:3:in `require_relative'
/home/teto/pdfextract/tasks/assign.rb:3:in `<top (required)>'
/home/teto/pdfextract/Rakefile:1:in `require_relative'
/home/teto/pdfextract/Rakefile:1:in `<top (required)>'

Any ideas?

Regards

pdf-extract extract units?

From what I see, those units are definitely not pixels (width="62.53" height="4.47" line_height="4.47": 62.53 pixels?). Is there any way to make pdf-extract show positions in pixels?
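
If the reported values are in default PDF user-space units (points, 1/72 of an inch), which is an assumption and not something the pdf-extract documentation quoted here confirms, they can be converted to pixels once a rendering resolution is chosen. A minimal sketch, with `points_to_pixels` as a hypothetical helper:

```ruby
# Assumption: pdf-extract reports sizes in PDF points (1/72 inch).
POINTS_PER_INCH = 72.0

# Hypothetical helper: convert a length in points to pixels at a given
# rendering resolution (dots per inch).
def points_to_pixels(points, dpi)
  points * dpi / POINTS_PER_INCH
end

# The width from the question, rendered at an example resolution of 144 DPI.
puts points_to_pixels(62.53, 144)
```

The result depends entirely on the resolution at which the page is rasterised, so there is no single pixel value for a given region.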

extract-bib option fails

Hi,

Executing pdf-extract as follows:

pdf-extract extract-bib --resolved_references   bibpro.pdf

fails with the following error:

error: input must be an IO-like object or a filename. Use --trace to view backtrace

I added puts input.inspect as instrumentation in object_hash.rb to the extract_io_from(input) method of the class ObjectHash, as follows:

def extract_io_from(input)
  puts input.inspect
  if input.respond_to?(:seek) && input.respond_to?(:read)
    input
  elsif File.file?(input.to_s)
    StringIO.new read_as_binary(input)
  else
    raise ArgumentError, "input must be an IO-like object or a filename"
  end
end

The output emitted was "extract-bib", suggesting that the command name itself is being misinterpreted as a file name.

Any thoughts/suggestions on the matter?

Thanks!

Errors showing up for some pdf files

pdf-extract sometimes fails on certain PDF files with errors such as:

error: undefined method `ascent' for nil:NilClass. Use --trace to view backtrace

error: undefined method `load_file' for #<PDF::Core::ObjectStore:0xcc69308 @objects={}, @identifiers=[]>. Use --trace to view backtrace

PS: I'm using Ruby v1.9.3p194.

Running pdfextract from R?

Any thoughts on running pdfextract from R? Perhaps an R package? Or is there a hack? This would make an awesome workflow if possible.

Thanks!
Scott Chamberlain

error: undefined method `ascent'

error: undefined method `ascent' for #<PDF::Reader::Font:0x007ff94483b130>. Use --trace to view backtrace

Happens for all calls of pdf-extract. This installation error may be relevant:

Parsing documentation for libsvm-ruby-swig-0.4.0
unable to convert "\xCA" from ASCII-8BIT to UTF-8 for ext/libsvm.bundle, skipping
unable to convert "\xCA" from ASCII-8BIT to UTF-8 for ext/libsvm_wrap.o, skipping
unable to convert "\xCA" from ASCII-8BIT to UTF-8 for ext/svm.o, skipping
unable to convert "\xCA" from ASCII-8BIT to UTF-8 for lib/libsvm-ruby-swig/libsvm.bundle, skipping
Installing ri documentation for libsvm-ruby-swig-0.4.0

Thanks for any help anyone can provide!

Title.rb line 30 error: undefined method `[]' for nil:NilClass

Hi!

I'm using pdf-extract to get data from some papers, and everything was working well, but for some PDFs I receive this error message: undefined method '[]' for nil:NilClass on line 30 of the file lib/analysis/titles.rb, as shown below:

titles.sort_by! { |r| -r[:line_height] }
tallest_line = titles.first[:line_height] # this is line 30 --> titles is an empty array []
title_slop = tallest_line - (tallest_line * pdf.settings[:title_slop])
titles.reject! { |r| r[:line_height] < title_slop }

Any idea how to solve this?

Thanks!
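
The quoted snippet fails because `titles.first` is `nil` when `titles` is empty. One possible guard is sketched below as a standalone function so it can be run on its own. `filter_tallest` and its arguments are hypothetical names, not pdf-extract's API, and the sketch returns a new array instead of mutating `titles` in place.

```ruby
# Hypothetical standalone version of the quoted logic, with an
# early return for the empty case. Not pdf-extract's actual code.
def filter_tallest(titles, title_slop)
  # Nothing to rank: avoid calling [] on nil from titles.first.
  return [] if titles.empty?

  sorted = titles.sort_by { |r| -r[:line_height] }
  tallest_line = sorted.first[:line_height]
  cutoff = tallest_line - (tallest_line * title_slop)
  # Keep only regions whose line height is within the slop of the tallest.
  sorted.reject { |r| r[:line_height] < cutoff }
end

regions = [{ :line_height => 5 }, { :line_height => 10 }, { :line_height => 9 }]
p filter_tallest(regions, 0.2)
p filter_tallest([], 0.2)
```

A guard like this only stops the crash. It does not explain why no title candidates were found for those PDFs in the first place.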

Problem with outdent delimited refs

Hi guys,

I just installed pdf-extract (pretty neat program, by the way), but the PDF I'm feeding it has outdent-delimited references (see http://nlabs.labs.crossref.org/pdfextract/citation_categories/). Instead of treating the outdented first line as the start of each reference, pdf-extract thinks the reference starts at the line below and ends one line too far (see image). When I lower reference_flex to 0.15, the references on the first page (which contains normal text plus the reference section) come out OK, but the references on the following page still have the same indent/outdent problem. I didn't see any setting for indent or outdent handling. Can anyone help? Thanks!

outdent_ex

error: undefined method `*' for nil:NilClass

Under Ruby 1.9.3 (with pdf-reader downgraded to 1.1.1), I get

$ pdf-extract extract --references example.pdf

error: undefined method `*' for nil:NilClass. Use --trace to view backtrace

Under Ruby 1.9.1 (again with pdf-reader at 1.1.1), I get

$ pdf-extract extract --references example.pdf

error: undefined method `sort_by!' for #<Array:0x007fa62e9cc930>. Use --trace to view backtrace

Any suggestions? I'm not sure what else to try, but am excited to get pdfextract running!
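
On the second error: as far as I know, `Array#sort_by!` was only added in Ruby 1.9.2, which would explain why it is undefined under 1.9.1. Upgrading Ruby is the simpler route. If that is not possible, a fallback along these lines might work. It is an untested-against-pdf-extract sketch, and it does nothing on any Ruby that already defines the method.

```ruby
# Sketch of a fallback for Rubies without Array#sort_by!
# (assumption: the method is missing only on versions before 1.9.2).
unless Array.method_defined?(:sort_by!)
  class Array
    # Sort in place by the block's value, mirroring the built-in method.
    def sort_by!(&block)
      replace(sort_by(&block))
    end
  end
end

heights = [4.47, 12.0, 9.5]
heights.sort_by! { |h| -h }
p heights
```

This does not address the first error (undefined method `*' for nil:NilClass), which has a different cause.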

No result output

There is no result output. The terminal only shows:

<?xml version="1.0"?>
<pdf/>

I tried several PDF files and they all show the same. Any idea how to fix it?

Extracting authors names and emails with PDFExtract

Hi! First, this is an amazing tool for researchers! Thanks.

Anyone having success extracting author names and emails using pdf-extract?

pdf-extract extract --headers my/pdf/file.pdf is returning nothing.

Thanks!

Wrong version on the Ruby repository

First I have to say that this tool is absolutely awesome to extract bib data. So much time saved with just one command line, many many thanks!

However, I had to build the gem myself to get the 0.1.1 version. The one in the repository seems to be outdated (no extract-bib, etc.), even though the package version says 0.1.1.
Just wanted to point that out.

Thanks again.

zones is missing a dependency on left_margins (RuntimeError)

Hi, I'm getting the following error message when attempting to run pdf-extract: zones is missing a dependency on left_margins (RuntimeError), which is from the stack trace given below. Any thoughts on this?

I am running this on Mac OS X 10.11.2 El Capitan, with the following Ruby configuration:

$ ruby --version
ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-darwin15]

$ gem --version
2.5.1

Stack trace:

pdf-extract --trace  extract-bib --resolved_references  Attacks\ on\ Cryptographic\ Protocols-\ A\ Survey.pdf
/usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:144:in `block (3 levels) in invoke_calls': zones is missing a dependency on left_margins (RuntimeError)
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:142:in `each'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:142:in `block (2 levels) in invoke_calls'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:141:in `each_pair'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:141:in `block in invoke_calls'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:137:in `each_pair'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract/pdf.rb:137:in `invoke_calls'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract.rb:43:in `block in parse'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract.rb:39:in `each'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract.rb:39:in `parse'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/lib/pdf/extract.rb:54:in `view'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/bin/pdf-extract:121:in `block (4 levels) in <top (required)>'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/bin/pdf-extract:118:in `each'
  from /usr/local/lib/ruby/gems/2.3.0/gems/pdf-extract-0.1.1/bin/pdf-extract:118:in `block (3 levels) in <top (required)>'
  from /usr/local/lib/ruby/gems/2.3.0/gems/commander-4.4.0/lib/commander/command.rb:178:in `call'
  from /usr/local/lib/ruby/gems/2.3.0/gems/commander-4.4.0/lib/commander/command.rb:153:in `run'
  from /usr/local/lib/ruby/gems/2.3.0/gems/commander-4.4.0/lib/commander/runner.rb:444:in `run_active_command'
  from /usr/local/lib/ruby/gems/2.3.0/gems/commander-4.4.0/lib/commander/runner.rb:68:in `run!'
  from /usr/local/lib/ruby/gems/2.3.0/gems/commander-4.4.0/lib/commander/delegates.rb:15:in `run!'
  from /usr/local/lib/ruby/gems/2.3.0/gems/commander-4.4.0/lib/commander/import.rb:5:in `block in <top (required)>'

Thanks,

Roger Alexander.

Run issue

I have the version from git and it does not work. Please help.

gems/commander-4.4.3/lib/commander/runner.rb:409:in `block in require_program': program version required (Commander::Runner::CommandError)
from /Users/xx/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/commander-4.4.3/lib/commander/runner.rb:408:in `each'
from /Users/xx/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/commander-4.4.3/lib/commander/runner.rb:408:in `require_program'
from /Users/xx/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/commander-4.4.3/lib/commander/runner.rb:52:in `run!'
from /Users/xx/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/commander-4.4.3/lib/commander/delegates.rb:15:in `run!'
from /Users/xx/.rbenv/versions/2.4.0/lib/ruby/gems/2.4.0/gems/commander-4.4.3/lib/commander/import.rb:5:in `block in <top (required)>'
/Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract/references/score.rb:11:in `<module:Score>': uninitialized constant Libsvm::Model (NameError)
Did you mean? Module
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract/references/score.rb:4:in `<module:PdfExtract>'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract/references/score.rb:3:in `<top (required)>'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract/references/references.rb:3:in `require_relative'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract/references/references.rb:3:in `<top (required)>'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract.rb:10:in `require_relative'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf/extract.rb:10:in `<top (required)>'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf-extract.rb:1:in `require_relative'
from /Users/xx/Downloads/extractor/pdfextract-master/lib/pdf-extract.rb:1:in `<top (required)>'
from ../pdfextract-master/bin/pdf-extract:5:in `require_relative'
from ../pdfextract-master/bin/pdf-extract:5:in

When I run sudo gem install pdf-extract, it fails with:
error: unknown warning option '-Werror=unused-command-line-argument-hard-error-in-future'; did you mean '-Werror=unused-command-line-argument'? [-Werror,-Wunknown-warning-option]

Undefined symbols for architecture x86_64:
"_iconv", referenced from:
_main in conftest-5b1568.o
"_iconv_open", referenced from:
_main in conftest-5b1568.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Citation for a review of Article X is recovered rather than the citation for Article X itself.

Perhaps this is already known, or else is a problem in the database this tool consults, but I figured I would record some test cases here for use in bug fixing (if the problem is on this end).

This:
[19] R. L. Graham, D. E. Knuth and T. S. Motzin, Complements and transitive
closures, Discrete Math., 21 (1972), 17–29.
[20] P. R. Halmos, Lectures on Boolean Algebras, Van Nostrand, Princeton, 1963.
[21] P. C. Hammer, Kuratowski’s Closure theorem, Nieuw Arch. Wisk., 8 (1960),
74–80.

Generated this bibtex:
@Article{Wallace_1964,
doi = {10.1126/science.144.3618.531-b},
url = {http://dx.doi.org/10.1126/science.144.3618.531-b},
year = 1964,
month = {may},
publisher = {American Association for the Advancement of Science ({AAAS})},
volume = {144},
number = {3618},
pages = {531--532},
author = {A. D. Wallace},
title = {Lectures on Boolean Algebras. Paul R. Halmos. Van Nostrand, Princeton, N.J., 1963. vi $\mathplus$ 147 pp. Illus. Paper, {\textdollar}2.95},
journal = {Science}
}

(Wallace appears nowhere in the pdf I'm extracting.) Similarly this:
[23] E. Hewitt, A problem in set-theoretic topology, Duke Math. J., 10 (1943),
309–333.
[24] G. E. Hughes and M. J. Cresswell, A New Introduction to Modal Logic, Routledge,
London 1996.

[25] M. Jackson, Closure semilattices, Algebra Universalis, 52 (2004), 1–37.

led to this:

@Article{Zakharyaschev_1997,
doi = {10.2307/2275655},
url = {http://dx.doi.org/10.2307/2275655},
year = 1997,
month = {dec},
publisher = {Cambridge University Press ({CUP})},
volume = {62},
number = {04},
pages = {1483--1484},
author = {Michael Zakharyaschev},
title = {Hughes G. E. and Cresswell M. J.. A new introduction to modal logic. Routledge, London and New York 1996, x $\mathplus$ 421 pp.},
journal = {The Journal of Symbolic Logic}
}

and this

[26] M. Jackson and T. Stokes, Semilattice pseudocomplemented semigroups,
Comm. Algebra, 32 (2004), 2895–2918.
[27] J. L. Kelley, General Topology, Van Nostrand Reinhold Co. Inc. Princeton, NJ,
1955.

[28] W. Koenen, The Kuratowski closure problem in the topology of convexity,
Amer. Math. Monthly, 73 (1966), 704–708.

led to this

@Article{Larkin_1962,
doi = {10.2307/2964144},
url = {http://dx.doi.org/10.2307/2964144},
year = 1962,
month = {jun},
publisher = {Cambridge University Press ({CUP})},
volume = {27},
number = {02},
pages = {235},
author = {Francis P. Larkin},
title = {Kelley John L.. General topology. D. van Nostrand Company, Inc., New York, Toronto, and London, 1955, xiv $\mathplus$ 298 pp.},
journal = {The Journal of Symbolic Logic}
}

and this

[31] N. Levine, On the commutativity of the closure and interior operators in topological
spaces, Amer. Math. Monthly, 68 (1961), 474–477.
[32] J. C. C. McKinsey and A. Tarski, The algebra of topology, Ann. Math., 45
(1944), 141–191.

[33] L. E. Moser, Closure, interior and union in finite topological spaces, Colloq.
Math., 38 (1977), 41–51.

led to this:
@Article{Vaughan_1944,
doi = {10.2307/2267577},
url = {http://dx.doi.org/10.2307/2267577},
year = 1944,
month = {dec},
publisher = {Cambridge University Press ({CUP})},
volume = {9},
number = {04},
pages = {96--97},
author = {H. E. Vaughan},
title = {{McKinsey} J. C. C. and Tarski Alfred. The algebra of topology. Annals of mathematics, ser. 2 vol. 45 (1944), pp. 141{\textendash}191.},
journal = {The Journal of Symbolic Logic}
}

Compatibility with current Ruby versions? (CommandError & LoadError)

Hey,

I recently tried running PDFExtract fresh off Github on a pristine new Ruby on Rails installation (Ruby 2.2.4, Rails 4.2.6), which produced the following errors:

C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/commander-4.4.0/lib/commander/runner.rb:407:in 'block in require_program': program version required (Commander::Runner::CommandError)
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/commander-4.4.0/lib/commander/runner.rb:406:in 'each'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/commander-4.4.0/lib/commander/runner.rb:406:in 'require_program'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/commander-4.4.0/lib/commander/runner.rb:52:in 'run!'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/commander-4.4.0/lib/commander/delegates.rb:15:in 'run!'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/commander-4.4.0/lib/commander/import.rb:5:in 'block in <top (required)>'
C:/RailsInstaller/Ruby2.2.0/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:69:in 'require': 126: The specified module could not be found. - C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/libsvm-ruby-swig-0.4.0/ext/libsvm.so (LoadError)
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:69:in 'require'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/libsvm-ruby-swig-0.4.0/lib/svm.rb:1:in '<top (required)>'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:69:in 'require'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:69:in 'require'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/references/score.rb:1:in '<top (required)>'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/references/references.rb:3:in 'require_relative'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/references/references.rb:3:in '<top (required)>'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:10:in 'require_relative'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:10:in '<top (required)>'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/bin/pdf-extract:5:in 'require_relative'
        from C:/RailsInstaller/Ruby2.2.0/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/bin/pdf-extract:5:in '<top (required)>'
        from C:/RailsInstaller/Ruby2.2.0/bin/pdf-extract:23:in 'load'
        from C:/RailsInstaller/Ruby2.2.0/bin/pdf-extract:23:in '<main>'

I then tried downgrading to Ruby 1.9.3, but this turned out to be incompatible with some "prawn"-dependency which requires a newer version of Ruby.

What kind of environment will PDFExtract run in?

Thanks,
Basanta

Windows Support

It seems that it is not possible to run pdf-extract on MS Windows as built. Googling around didn't turn up any promising explanation of why this is occurring. Any ideas?

System Configuration:

  • Microsoft Windows 8.1 x64
  • Ruby 2.2.3p173 (2015-08-18 revision 51636) [x64-mingw32]
  • DevKit-mingw64-64-4.7.2-20130224-1432-sfx
  • pdf-extract gem version 0.1.1

Error message:

C:\...\ > pdf-extract extract --references .\test2.mask.pdf
C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/commander-4.3.5/lib/commander/runner.rb:391:in `block in require_program': program version required (Commander::Runner::CommandError)
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/commander-4.3.5/lib/commander/runner.rb:390:in `each'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/commander-4.3.5/lib/commander/runner.rb:390:in `require_program'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/commander-4.3.5/lib/commander/runner.rb:52:in `run!'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/commander-4.3.5/lib/commander/delegates.rb:15:in `run!'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/commander-4.3.5/lib/commander/import.rb:5:in `block in <top (required)>'
C:/Ruby22-x64/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require': 126: The specified module could not be found.   - C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/libsvm-ruby-swig-0.4.0/ext/libsvm.so (LoadError)
        from C:/Ruby22-x64/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/libsvm-ruby-swig-0.4.0/lib/svm.rb:1:in `<top (required)>'
        from C:/Ruby22-x64/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from C:/Ruby22-x64/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54:in `require'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/references/score.rb:1:in `<top (required)>'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/references/references.rb:3:in `require_relative'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/references/references.rb:3:in `<top (required)>'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:10:in `require_relative'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/lib/pdf-extract.rb:10:in `<top (required)>'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/bin/pdf-extract:5:in `require_relative'
        from C:/Ruby22-x64/lib/ruby/gems/2.2.0/gems/pdf-extract-0.1.1/bin/pdf-extract:5:in `<top (required)>'
        from C:/Ruby22-x64/bin/pdf-extract:23:in `load'
        from C:/Ruby22-x64/bin/pdf-extract:23:in `<main>'

Section Breakdown

I am trying to wrap my head around this utility, but it seems unable to determine sections within a PDF. For example, let's say one wants to extract information from documents containing blocks of text or paragraphs, where each content block has a title. These sections could be marked by a title in larger text, in upper case, in italics, underlined, or any combination of those.

So what I am looking for is a way to get this utility to detect such a pattern, return the content of the document, and annotate each section with corresponding tree-pattern markers.

How would one go about this?
