## pdfindexer

Purpose

Index and search for keywords in PDF sources (files and URLs) using Apache Lucene and PDFBox. The result is written to an HTML file; the layout can be modified using a Freemarker template.

Project

How to build

Integration into development environment

Examples

See the test folder for example input and results. See Usage below for how to run pdfindexer from the command line.

Lorem Ipsum

The resulting HTML file is test/html/pdfindex.html.

Cajun project

PDF text from the University of Nottingham about how to publish journals using the then brand-new Adobe technology (written in 1993).

Usage

Directly from jar

java -jar pdfindexer.jar [options]

See the usage page below for the available options.
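A minimal sketch of a typical call, using only option names listed on the usage page. The jar location, the directory and file names, and the keywords are placeholders chosen for illustration; they are not files shipped with the project, and the usage page does not say which options are mandatory.

```shell
#!/bin/sh
# Hypothetical paths and keywords; only the option names come from the usage page.
JAR=pdfindexer.jar
SRC=docs/pdf            # --src: source url, directory or file
IDX=index/pdfindex      # --idxfile: index file
OUT=html/pdfindex.html  # --outputfile: (html) output file
KEYWORDS="Lorem,Ipsum"  # --keyWords: comma separated list of keywords

# Assemble the options as positional parameters so quoting is preserved.
set -- --src "$SRC" --idxfile "$IDX" --keyWords "$KEYWORDS" --outputfile "$OUT" --maxHits 10

# Show the command line that would be executed.
CMD="java -jar $JAR $*"
echo "$CMD"

# Run the tool only if the jar is actually present.
if [ -f "$JAR" ]; then
  java -jar "$JAR" "$@"
fi
```

Printing the command before running it makes it easy to check the option values first.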

Usage page

	Pdfindexer Version: 0.0.7
	
	 github: https://github.com/WolfgangFahl/pdfindexer.git
	
	  usage: java com.bitplan.pdfindexer.Pdfindexer
	 --title VAL                  : title to be used in html result
	 -d (--debug)                 : debug
	                                create additional debug output if this switch
	                                is used
	 -e (--autoescape)            : autoescape blanks
	                                set to off if you'd like to use lucene query
	                                syntax
	 -f (--src) VAL               : source url, directory/or file
	 -h (--help)                  : help
	                                show this usage
	 -i (--idxfile) VAL           : index file
	 -k (--keyWords) VAL          : search
	                                comma separated list of keywords to search
	 -l (--sourceFileList) VAL    : path to ascii-file with source urls,directories
	                                or file names
	                                one url/file/directory may be specified by line
	 -m (--maxHits) N             : maximum number of hits per keyword
	 -o (--outputfile) VAL        : (html) output file
	                                the output file will contain the search result
	                                with links to the pages in the pdf files that
	                                have been searched
	 -p (--templatePath) VAL      : path to Freemarker template file(s) to be used
	                                to format the output
	 -r (--root) VAL              : root
	                                if a  root is specified the paths in the
	                                sourceFileList and in the output will be
	                                considered relative to this root path
	 -s (--silent)                : stay silent
	                                do not create any output on System.out if this
	                                switch is used
	 -t (--templateName) VAL      : name of Freemarker template to be used
	 -v (--version)               : showVersion
	                                show current version if this switch is used
	 -x (--extract)               : extract text
	                                extract text content to files
	 -w (--searchKeyWordList) VAL : file with search words
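The --sourceFileList and --root options can be combined roughly as sketched here. The list file name, the PDF paths, the root directory and the keywords are made up for illustration. Whether the list file itself is also resolved relative to the root is not stated on the usage page, so treat that as an open question.

```shell
#!/bin/sh
# Hypothetical file names; only the option names come from the usage page.
LIST=sources.txt
ROOT=/data/pdf   # --root: entries in the list are considered relative to this path

# One url, file or directory per line, as described for --sourceFileList.
cat > "$LIST" <<'EOF'
manuals/intro.pdf
manuals/reference.pdf
EOF

# Assemble the options; --silent suppresses output on System.out.
set -- --root "$ROOT" --sourceFileList "$LIST" --idxfile index/pdfindex \
       --keyWords "index,search" --outputfile result.html --silent

CMD="java -jar pdfindexer.jar $*"
echo "$CMD"

# Run the tool only if the jar is actually present.
if [ -f pdfindexer.jar ]; then
  java -jar pdfindexer.jar "$@"
fi
```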

Modifying the template

	 src/main/resources/templates 

contains the default Freemarker template "defaultindex.ftl". You can modify it or create your own template and select it with the -t/--templateName option.
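One way to start a custom template is to copy the default one and point pdfindexer at the copy. This is only a sketch: the directory name "mytemplates" and the file name "myindex.ftl" are hypothetical, and it is an assumption that --templateName expects the file name including the ".ftl" extension.

```shell
#!/bin/sh
# Hypothetical directory and template names.
DEFAULT=src/main/resources/templates/defaultindex.ftl
TEMPLATE_DIR=mytemplates
TEMPLATE_NAME=myindex.ftl   # assumption: the name is given with its extension

mkdir -p "$TEMPLATE_DIR"

# Copy the default template as a starting point when run inside a checkout.
if [ -f "$DEFAULT" ]; then
  cp "$DEFAULT" "$TEMPLATE_DIR/$TEMPLATE_NAME"
fi

# Options that select the custom template; combine them with the usual
# source, keyword and output options.
set -- --templatePath "$TEMPLATE_DIR" --templateName "$TEMPLATE_NAME"
TEMPLATE_OPTS="$*"
echo "java -jar pdfindexer.jar [other options] $TEMPLATE_OPTS"
```

Keeping the modified template outside src/main/resources leaves the default untouched, so output from the two templates can be compared.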

Version history

Copyright

Copyright 2013-2015 BITPlan GmbH

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributors

wolfgangfahl, mariafahl
