pantomime's Introduction

Pantomime, a Library For Working With MIME Types In Clojure

Pantomime is a Clojure interface to Apache Tika.

Originally created as a library that deals with MIME types (Internet media types, sometimes referred to as "content types"), it now also supports extraction of document metadata and text content.

Maven Artifacts

Pantomime artifacts are released to Clojars. If you are using Maven, add the following repository definition to your pom.xml:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

Latest Release

With Leiningen:

[com.novemberain/pantomime "2.11.0"]

With Maven:

<dependency>
  <groupId>com.novemberain</groupId>
  <artifactId>pantomime</artifactId>
  <version>2.11.0</version>
</dependency>

Supported Clojure versions

Pantomime requires Clojure 1.8+. The most recent stable release is highly recommended.

Caveats

Pantomime depends on a reasonably modern version of org.apache.commons/commons-compress, which can conflict with older versions pulled in by other libraries. If you run into undefined classes, missing methods and similar errors, run lein deps :tree to check for conflicting dependencies, then exclude the conflicting one (either from the libraries that bring in older commons-compress versions or from Pantomime) as a workaround.
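As a rough sketch of that workaround in a Leiningen project.clj: the dependency some.other/library and its version are hypothetical placeholders for whichever library brings in the older commons-compress; :exclusions is the standard Leiningen key for dropping a transitive dependency.

```clojure
;; project.clj (fragment)
;; some.other/library is a hypothetical dependency that pulls in an
;; older commons-compress; replace it with the one lein deps :tree reports.
(defproject your-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.0"]
                 [com.novemberain/pantomime "2.11.0"]
                 [some.other/library "1.0.0"
                  :exclusions [org.apache.commons/commons-compress]]])
```

After changing exclusions, running lein deps :tree again should show a single commons-compress version.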

Usage

Detecting MIME type

The pantomime.mime/mime-type-of function accepts content as byte arrays, java.io.InputStream and java.net.URL instances, as well as filenames as strings and java.io.File instances. It returns the MIME type as a string, or "application/octet-stream" if detection fails.

An example:

(ns your.app.namespace
  (:require [pantomime.mime :refer [mime-type-of]])
  (:import [java.io File]
           [java.net URL]
           [java.nio.file Files]))

;; by content (as byte array); the bytes must be the file's content,
;; not the bytes of the file name string
(mime-type-of (Files/readAllBytes (.toPath (File. "filename.pdf"))))
;; by file extension
(mime-type-of "filename.pdf")
;; by file content (as java.io.File)
(mime-type-of (File. "some/file/without/extension"))
;; by content (as java.net.URL)
(mime-type-of (URL. "http://domain.com/some/url/path.pdf"))

Pantomime has a variation of the mime-type-of function suited to cases where content was fetched from the Web and HTTP headers are also available:

(ns your.app.namespace
  (:require [pantomime.web :refer [mime-type-of]]))

;; body is a string or input stream, headers is a map of lowercased headers.
;; Ring and clj-http both use this format for headers, for example.
(mime-type-of body headers)

In this case, Pantomime will try to detect the content type from the response body first (because some applications, frameworks and servers report content type incorrectly, for example, serving PDFs as text/html) and, if that fails, will fall back to the content type header.

The HTTP headers map must contain a "content-type" key for the content type header to be used. Most Clojure HTTP clients, for instance clj-http, use lowercase strings for header names, so Pantomime follows this convention.
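If your headers arrive with mixed-case names (for example "Content-Type"), one option is to normalize them before calling pantomime.web/mime-type-of. The following is a minimal sketch; lowercase-header-names is a hypothetical helper written for illustration and is not part of Pantomime.

```clojure
(require '[clojure.string :as str])

(defn lowercase-header-names
  "Returns headers with every header name converted to a lowercase string.
  Hypothetical helper for illustration; not part of Pantomime."
  [headers]
  (into {}
        ;; name accepts strings, keywords and symbols
        (map (fn [[k v]] [(str/lower-case (name k)) v]))
        headers))

(lowercase-header-names {"Content-Type" "application/json"})
;= {"content-type" "application/json"}
```

The normalized map can then be passed as the headers argument in place of the original one.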

Extension Recommendation

Pantomime can recommend an extension (one of the well known ones) for a MIME type:

(require '[pantomime.mime :as pm])

(pm/extension-for-name "application/vnd.ms-excel")
;= ".xls"
(pm/extension-for-name "image/jpeg")
;= ".jpg"
(pm/extension-for-name "application/octet-stream")
;= ".bin"

Parsing and Recognizing Media Types

(ns your.app.namespace
  (:require [pantomime.media :as mt]))

(mt/parse "application/json")

(mt/base-type "text/html; charset=UTF-8") ;; => media type of "text/html"

(mt/application? "application/json")
(mt/application? "application/xhtml+xml")
(mt/application? "application/pdf")
(mt/application? "application/vnd.ms-excel")
(mt/application? (mt/parse "application/json"))

(mt/image? "image/jpeg")
(mt/audio? "audio/mp3")
(mt/video? "video/quicktime")
(mt/text?  "text/plain")
(mt/has-parameters? "text/html; charset=UTF-8") ;; => true
(mt/has-parameters? "text/html") ;; => false
(mt/parameters-of "text/html; charset=UTF-8") ;; => {"charset" "UTF-8"}
(mt/charset-of "text/html; charset=UTF-8") ;; => "UTF-8"

Language Detection

pantomime.languages is a new namespace that provides functions for detecting natural languages:

(require '[pantomime.languages :as pl])

(pl/detect-language "this is English, it should not be hard to detect")
;= "en"

(pl/detect-language "parlez-vous Français")
;= "fr"

Note that Tika (and, in turn, Pantomime) supports detection of a limited number of languages. To get the list of supported languages, use the pantomime.languages/supported-languages var.

Metadata and Text Extraction

pantomime.extract provides two functions for extracting metadata, content, and embedded files from byte arrays, java.io.InputStream and java.net.URL instances as well as filenames as strings and java.io.File instances. The extraction functions differ in how they handle embedded documents.

pantomime.extract/parse takes as its single argument any of the types mentioned above. It returns a map containing all the metadata Tika was able to extract from the document, and the text content of the document concatenated with the text of all embedded documents, recursively.

An example:

(require '[clojure.java.io :as io]
         '[clojure.pprint :refer [pprint]]
         '[pantomime.extract :as extract])

(pprint (extract/parse "test/resources/pdf/qrl.pdf"))

;= {:producer ("GNU Ghostscript 7.05"),
;=  :pdf:pdfversion ("1.2"),
;=  :dc:title ("main.dvi"),
;=  :dc:format ("application/pdf; version=1.2"),
;=  :xmp:creatortool ("dvips(k) 5.86 Copyright 1999 Radical Eye Software"),
;=  :pdf:encrypted ("false"),
;=  ...
;=  :text "\nQuickly Reacquirable Locks∗\n\nDave Dice Mark Moir ... "
;= }

pantomime.extract/parse-extract-embedded also returns Tika-extracted metadata and document text, but it handles embedded documents differently. Instead of returning the concatenation of all embedded document text, it saves each embedded file to the filesystem and includes a vector of file names and paths in the returned data. Remember to remove those files when you're done with them!

For example, the file fileAttachment.pdf contains a single attached file, which gets saved to /tmp/pantomime-3207476364135900258-embedded:

(require '[clojure.java.io :as io]
         '[clojure.pprint :refer [pprint]]
         '[pantomime.extract :as extract])

(pprint (extract/parse-extract-embedded "test/resources/pdf/fileAttachment.pdf"))

;= {:date ("2012-11-23T14:40:50Z"),
;=  :producer ("Acrobat Distiller 9.5.2 (Windows)"),
;=  :creator ("van der Knijff"),
;=  :pdf:pdfversion ("1.7"),
;=  :dc:title ("This is a test document"),
;=  :text "\nThis is a test document. It contains a file attachment..."
;=  ...
;=  :embedded [{:path "/tmp/pantomime-3207476364135900258-embedded",
;=              :name "KSBASE.WQ2"}],
;=  ...}

Note that parse-extract-embedded creates temporary files in the JVM's default location.
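Cleaning up those temporary files could look roughly like this. The sketch assumes each entry under :embedded carries the saved file's location in :path, as in the example output above; delete-embedded-files! is a hypothetical helper, not part of Pantomime, and the result map below is a stand-in rather than real extraction output.

```clojure
(require '[clojure.java.io :as io])

(defn delete-embedded-files!
  "Deletes every file listed under :embedded in an extraction result and
  returns the number of paths processed. Hypothetical helper; not part of
  Pantomime."
  [result]
  (let [paths (map :path (:embedded result))]
    (doseq [path paths]
      ;; passing true as the second argument suppresses the exception
      ;; delete-file would otherwise throw if the file cannot be deleted
      (io/delete-file path true))
    (count paths)))

;; Stand-in for a real result: one temp file plays the embedded document.
(let [tmp    (java.io.File/createTempFile "pantomime-demo" "-embedded")
      result {:text "" :embedded [{:path (.getPath tmp) :name "KSBASE.WQ2"}]}]
  (delete-embedded-files! result)
  (.exists tmp))
;= false
```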

If extraction fails, the functions will return the following:

{:text "",
 :content-type ("application/octet-stream"),
 :x-parsed-by ("org.apache.tika.parser.EmptyParser")}
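A caller may want to tell this failure value apart from a successful result. One way to do so, sketched under the assumption that the failure map always has exactly the shape shown above, is to check :x-parsed-by; extraction-failed? is a hypothetical helper, not part of Pantomime, and the maps below are sample data.

```clojure
(defn extraction-failed?
  "True when the result reports Tika's EmptyParser as its only parser,
  which matches the failure value shown above. Hypothetical helper; not
  part of Pantomime."
  [result]
  ;; a vector compares equal to a list with the same elements
  (= ["org.apache.tika.parser.EmptyParser"] (:x-parsed-by result)))

(extraction-failed? {:text ""
                     :content-type '("application/octet-stream")
                     :x-parsed-by  '("org.apache.tika.parser.EmptyParser")})
;= true

(extraction-failed? {:text "hello"
                     :x-parsed-by '("org.apache.tika.parser.DefaultParser")})
;= false
```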

Community

Pantomime has a mailing list. Feel free to join it and ask any questions you may have.

To subscribe for announcements of releases, important changes and so on, please follow @ClojureWerkz on Twitter.

Pantomime Is a ClojureWerkz Project

Pantomime is part of the group of libraries known as ClojureWerkz, together with Monger, Langohr, Neocons, Elastisch, Quartzite and several others.

Continuous Integration

CI is hosted by travis-ci.org

Development

Pantomime uses Leiningen 2. Make sure you have it installed and then run tests against all supported Clojure versions using

lein all test

Then create a branch and make your changes on it. Once you are done with your changes and all tests pass, submit a pull request on GitHub.

License

Copyright (C) 2011-2019 Michael S. Klishin, and the ClojureWerkz team.

Distributed under the Eclipse Public License, the same as Clojure.

pantomime's People

Contributors

alexott, bitdeli-chef, bronsa, doriantaylor, dtsukiji, esb-dev, joshuathayer, lenaschoenburg, michaelklishin

pantomime's Issues

tika warning

I'm getting the following warnings. Is there any way to suppress (or fix) them?

Jun 06, 2018 6:40:54 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io for optional dependencies. 

Jun 06, 2018 6:40:54 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.  

Avoiding the consumption of InputStream when detecting mime type?

Thank you for the wonderful library that works like a charm.

While using this I did notice that when detecting the mime type from an InputStream instance,
the bytes get consumed and I cannot simply reuse the same object.

There are workarounds available (mostly Java), and that is what InputStreams do,
but it would be nice if mime-type-of were free of side effects (at least from the app developer's perspective).

Thanks!

P.S. Would greatly appreciate any recommended methods for re-using InputStream instances.
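One possible workaround, sketched here under the assumption that the content fits comfortably in memory, is to read the stream into a byte array once and hand out a fresh stream to each consumer. stream->bytes is a hypothetical helper, not part of Pantomime; it uses only clojure.java.io and java.io classes.

```clojure
(require '[clojure.java.io :as io])
(import '[java.io ByteArrayInputStream ByteArrayOutputStream InputStream])

(defn stream->bytes
  "Reads an InputStream fully into a byte array. Hypothetical helper."
  ^bytes [^InputStream in]
  (let [out (ByteArrayOutputStream.)]
    (io/copy in out)
    (.toByteArray out)))

(let [original (ByteArrayInputStream. (.getBytes "%PDF-1.4 sample" "UTF-8"))
      content  (stream->bytes original)]
  ;; content can be passed to mime-type-of, which accepts byte arrays,
  ;; while every later reader gets its own stream over the same bytes
  (slurp (ByteArrayInputStream. content)))
;= "%PDF-1.4 sample"
```

For large inputs this trades memory for convenience, so writing to a temporary file and detecting from a java.io.File may be the better fit there.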

Support latest Tika ?

Hey,

Tika turned 1.14 in October 2016. Will it be simple to bump its dependency in pantomime ?

Cheers,

Rafik

Expose type inheritance feature

I was looking for functionality similar to other shared-mime-info interfaces that will tell you if a type is a subtype of some other type, e.g. that application/vnd.openxmlformats-officedocument.spreadsheetml.sheet is an application/zip. This is exposed in org.apache.tika.mime.MediaTypeRegistry.

I have already implemented a patch. If you're cool with PRs I can fork this project and do one. I have also included a protocol for coercing objects between MimeType and MediaType.

(Why does Tika have two separate, completely unrelated classes to represent type objects, anyway?)

error in readme for "by content" example

readme has:

;; by content (as byte array)
(mime-type-of (.getBytes "filename.pdf"))

but that refers to the string instead of the content

(fs/exists? "filename.pdf")
;= false

(count (.getBytes "filename.pdf"))
;= 12

minified java.io.File

Pantomime was unable to detect the MIME type of a minified java.io.File

#<File /<your_path>/js/jquery-1.7.2.min.js>

Support for adding types

Does pantomime have support for adding new MimeTypes, or adding patterns for existing mime types?

New release?

Hi, and thanks for a fine library!

Is there any chance of a new release? I'd particularly be interested in using pantomime with #29.

Thanks and kind regards.

Error on detecting pdf

After upgrading Pantomime from 2.3.0 to 2.8.0, I found the following exception being thrown:
java.lang.NoClassDefFoundError: org/apache/commons/compress/PasswordRequiredException
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at org.apache.tika.Tika.detect(Tika.java:156)
at org.apache.tika.Tika.detect(Tika.java:287)
at pantomime.mime$eval20617$fn__20618.invoke(mime.clj:38)
at pantomime.mime$eval20596$fn__20597$G__20587__20602.invoke(mime.clj:24)

I struggled for a while; the problem occurred while I was running the following code:

(#{"application/pdf" "image/jpg" "image/jpeg" "image/tiff"}
                            (mime-type-of (io/file file)))

Reverting Pantomime from 2.8 to 2.3 fixes the problem.

Not parsing email attachments

Using (extract/parse-extract-embedded "pg1661.eml") does not parse the attachments from the test email that I am using. It leaves them as part of the :text, instead of parsing them and then providing them as temp files (with the paths provided by the :embedded key).

Here is the email that I'm testing with: test_email.zip
(It's in a zip because github won't allow me to upload a .eml file)

The email was originally produced using gmail, the files were produced using Libreoffice.
There are four different attachments to the email file. 1 - html, 1 - pdf, 1 - docx, 1 - zip

I'm going to continue to investigate this and I will post anything else that I uncover.

pantomime.web/mime-type-of won't detect json content

Hi,
I'm trying to detect whether my request body is JSON or not, but I always get the wrong answer.

Here's what I tried in the REPL:

(pantomime.web/mime-type-of "{\"name\":\"lala\"}" {"Content-Type" "application/json"})
=> "text/plain"

Am I missing anything ?

Directions in README don't work

(ns your.app.namespace
(:use [pantomime.mime]))

This is non-functional for me. I'm new to Clojure but using the latest Leiningen and Clojure 1.3.0. I had to use this instead:

(use 'pantomime.core)

I installed by updating my project.clj to include [com.novemberain/pantomime "1.0.0"] in :dependencies and it pulled it in fine. It's working great now.

Thanks!

NoSuchMethodError at ZipContainerDetector/detectArchiveFormat

Hi,
I'm getting a NoSuchMethodError when using the extract/parse function. I checked lein deps :tree to see if I have multiple versions of commons-compress, but I have not. I'm pretty new to clojure, so I might have missed something obvious. Here is the stuff I did.

The error message I got:

Execution error (NoSuchMethodError) at org.apache.tika.parser.pkg.ZipContainerDetector/detectArchiveFormat (ZipContainerDetector.java:160).
org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;

My test code:

(ns cljtest.core
  (:gen-class)
  (:require [clojure.java.io :as io]
            [pantomime.extract :as extract]))

(pprint (extract/parse "cv.pdf"))

My project.clj:

(defproject cljtest "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "EPL-2.0 OR GPL-2.0-or-later WITH Classpath-exception-2.0"
            :url "https://www.eclipse.org/legal/epl-2.0/"}
  :dependencies [[org.clojure/clojure "1.10.0"]
                 [com.novemberain/pantomime "2.11.0"]]
  :main ^:skip-aot cljtest.core
  :target-path "target/%s"
  :profiles {:uberjar {:aot :all}})

Output of lein deps :tree:

 [clojure-complete "0.2.5" :exclusions [[org.clojure/clojure]]]
 [com.novemberain/pantomime "2.11.0"]
   [org.apache.commons/commons-compress "1.18"]
   [org.apache.tika/tika-parsers "1.19.1"]
     [com.drewnoakes/metadata-extractor "2.11.0"]
       [com.adobe.xmp/xmpcore "5.1.3"]
     [com.epam/parso "2.0.9"]
     [com.fasterxml.jackson.core/jackson-annotations "2.9.6"]
     [com.fasterxml.jackson.core/jackson-core "2.9.6"]
     [com.fasterxml.jackson.core/jackson-databind "2.9.6" :exclusions [[com.fasterxml.jackson.core/jackson-annotations]]]
     [com.github.jai-imageio/jai-imageio-core "1.4.0"]
     [com.github.junrar/junrar "2.0.0" :exclusions [[commons-logging] [commons-logging/commons-logging-api] [org.apache.commons/commons-vfs2]]]
     [com.github.openjson/openjson "1.0.10"]
     [com.google.code.gson/gson "2.8.5"]
     [com.googlecode.json-simple/json-simple "1.1.1" :exclusions [[junit]]]
     [com.googlecode.juniversalchardet/juniversalchardet "1.0.3"]
     [com.googlecode.mp4parser/isoparser "1.1.22"]
     [com.healthmarketscience.jackcess/jackcess-encrypt "2.1.4" :exclusions [[org.bouncycastle/bcprov-jdk15on] [com.healthmarketscience.jackcess/jackcess]]]
     [com.healthmarketscience.jackcess/jackcess "2.1.12" :exclusions [[commons-logging]]]
       [commons-lang "2.6"]
     [com.pff/java-libpst "0.8.1"]
     [com.rometools/rome "1.5.1" :exclusions [[org.jdom/jdom]]]
       [com.rometools/rome-utils "1.5.1"]
     [commons-codec "1.11"]
     [commons-io "2.6"]
     [de.l3s.boilerpipe/boilerpipe "1.1.0"]
     [edu.ucar/cdm "4.5.5" :exclusions [[commons-logging] [org.slf4j/jcl-over-slf4j] [org.apache.httpcomponents/httpcore] [org.jdom/jdom2]]]
       [com.beust/jcommander "1.35"]
       [com.google.guava/guava "17.0"]
       [edu.ucar/udunits "4.5.5"]
       [joda-time "2.2"]
       [net.sf.ehcache/ehcache-core "2.6.2"]
       [org.quartz-scheduler/quartz "2.2.0"]
         [c3p0 "0.9.1.1"]
     [edu.ucar/grib "4.5.5" :exclusions [[edu.ucar/jj2000] [org.jsoup/jsoup] [org.jdom/jdom2]]]
       [com.google.protobuf/protobuf-java "2.5.0"]
       [org.itadaki/bzip2 "0.9.1"]
     [edu.ucar/httpservices "4.5.5" :exclusions [[commons-logging] [org.apache.httpcomponents/httpclient] [org.apache.httpcomponents/httpcore] [org.apache.httpcomponents/httpmime]]]
     [edu.ucar/netcdf4 "4.5.5" :exclusions [[commons-logging] [org.jdom/jdom2] [net.java.dev.jna/jna]]]
       [net.jcip/jcip-annotations "1.0"]
     [edu.usc.ir/sentiment-analysis-parser "0.1" :exclusions [[org.apache.tika/tika-parsers] [org.apache.tika/tika-batch] [org.apache.tika/tika-translate] [org.apache.tika/tika-langdetect] [org.apache.tika/tika-core] [org.apache.opennlp/opennlp-tools] [org.slf4j/slf4j-log4j12] [org.slf4j/jul-to-slf4j] [org.slf4j/jcl-over-slf4j] [log4j]]]
     [javax.activation/activation "1.1.1"]
     [net.java.dev.jna/jna "4.3.0"]
     [org.apache.commons/commons-csv "1.5"]
     [org.apache.commons/commons-exec "1.3"]
     [org.apache.cxf/cxf-rt-rs-client "3.2.6"]
       [org.apache.cxf/cxf-core "3.2.6"]
         [com.fasterxml.woodstox/woodstox-core "5.1.0" :exclusions [[stax/stax-api] [javax.xml.stream/stax-api]]]
           [org.codehaus.woodstox/stax2-api "4.1"]
         [org.apache.ws.xmlschema/xmlschema-core "2.2.3" :exclusions [[org.apache.bcel/bcel] [xalan]]]
       [org.apache.cxf/cxf-rt-frontend-jaxrs "3.2.6"]
         [javax.annotation/javax.annotation-api "1.3"]
         [javax.ws.rs/javax.ws.rs-api "2.1"]
       [org.apache.cxf/cxf-rt-transports-http "3.2.6"]
     [org.apache.httpcomponents/httpclient "4.5.6" :exclusions [[commons-logging] [commons-codec]]]
       [org.apache.httpcomponents/httpcore "4.4.10"]
     [org.apache.httpcomponents/httpmime "4.5.6"]
     [org.apache.james/apache-mime4j-core "0.8.2"]
     [org.apache.james/apache-mime4j-dom "0.8.2"]
     [org.apache.opennlp/opennlp-tools "1.9.0"]
     [org.apache.pdfbox/jbig2-imageio "3.0.2"]
     [org.apache.pdfbox/jempbox "1.8.16"]
     [org.apache.pdfbox/pdfbox-tools "2.0.12" :exclusions [[commons-logging] [org.apache.pdfbox/pdfbox-debugger]]]
     [org.apache.pdfbox/pdfbox "2.0.12" :exclusions [[commons-logging]]]
       [org.apache.pdfbox/fontbox "2.0.12"]
     [org.apache.poi/poi-ooxml "4.0.0" :exclusions [[stax/stax-api] [xml-apis]]]
       [com.github.virtuald/curvesapi "1.04"]
       [org.apache.poi/poi-ooxml-schemas "4.0.0"]
         [org.apache.xmlbeans/xmlbeans "3.0.1"]
     [org.apache.poi/poi-scratchpad "4.0.0"]
     [org.apache.poi/poi "4.0.0" :exclusions [[commons-codec]]]
       [org.apache.commons/commons-collections4 "4.2"]
     [org.apache.sis.core/sis-metadata "0.8"]
     [org.apache.sis.core/sis-utility "0.8"]
       [javax.measure/unit-api "1.0"]
     [org.apache.sis.storage/sis-netcdf "0.8"]
       [org.apache.sis.core/sis-referencing "0.8"]
       [org.apache.sis.storage/sis-storage "0.8"]
         [org.apache.sis.core/sis-feature "0.8"]
     [org.apache.tika/tika-core "1.19.1"]
     [org.apache.uima/uimafit-core "2.2.0" :exclusions [[org.apache.uima/uimaj-core] [commons-io] [commons-logging/commons-logging-api] [commons-logging] [org.springframework/spring-context] [org.springframework/spring-beans] [org.springframework/spring-core]]]
     [org.apache.uima/uimaj-core "2.9.0"]
     [org.bouncycastle/bcmail-jdk15on "1.60"]
       [org.bouncycastle/bcpkix-jdk15on "1.60"]
     [org.bouncycastle/bcprov-jdk15on "1.60"]
     [org.brotli/dec "0.1.2"]
     [org.ccil.cowan.tagsoup/tagsoup "1.2.1"]
     [org.codelibs/jhighlight "1.0.3" :exclusions [[commons-io]]]
     [org.gagravarr/vorbis-java-core "0.8"]
     [org.gagravarr/vorbis-java-tika "0.8" :exclusions [[org.apache.tika/tika-core]]]
     [org.glassfish.jaxb/jaxb-core "2.3.0.1"]
       [com.sun.istack/istack-commons-runtime "3.0.5"]
       [javax.xml.bind/jaxb-api "2.3.0"]
       [org.glassfish.jaxb/txw2 "2.3.0.1"]
     [org.glassfish.jaxb/jaxb-runtime "2.3.0.1"]
       [com.sun.xml.fastinfoset/FastInfoset "1.2.13" :exclusions [[javax.xml.bind/jsr173_api]]]
       [org.jvnet.staxex/stax-ex "1.7.8" :exclusions [[javax.activation/activation] [javax.xml.stream/stax-api]]]
     [org.jdom/jdom2 "2.0.6"]
     [org.jsoup/jsoup "1.11.3"]
     [org.opengis/geoapi "3.0.1"]
     [org.ow2.asm/asm "6.2"]
     [org.slf4j/jcl-over-slf4j "1.7.25"]
     [org.slf4j/jul-to-slf4j "1.7.25"]
     [org.slf4j/slf4j-api "1.7.25"]
     [org.tallison/jmatio "1.5"]
     [org.tukaani/xz "1.8"]
 [nrepl "0.6.0" :exclusions [[org.clojure/clojure]]]
 [org.clojure/clojure "1.10.0"]
   [org.clojure/core.specs.alpha "0.2.44"]
   [org.clojure/spec.alpha "0.2.176"]

Thanks!

2.10.0 (or 2.9.0) on clojars?

Hi,

Is there any reason 2.10.0 (or even 2.9.0) aren't on clojars?

I want to try out the make-config functionality; it'll be easier than messing about with the classpath and overriding things.

Thanks, -Joe

java.lang.NoClassDefFoundError: org/apache/commons/compress/PasswordRequiredException

Using

[me.raynes/fs "1.4.6"]
[com.novemberain/pantomime "2.7.0"]

This returns correctly:

(pantomime.mime/mime-type-of (me.raynes.fs/file "/path/myfile"))
"application/x-tar"

However when using 2.8.0 an exception is thrown:

ClassNotFoundException org.apache.commons.compress.PasswordRequiredException  java.net.URLClassLoader.findClass (URLClassLoader.java:381)

java.lang.NoClassDefFoundError: org/apache/commons/compress/PasswordRequiredException
ZipContainerDetector.java:131 org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat
ZipContainerDetector.java:87 org.apache.tika.parser.pkg.ZipContainerDetector.detect
CompositeDetector.java:77 org.apache.tika.detect.CompositeDetector.detect
         Tika.java:156 org.apache.tika.Tika.detect
         Tika.java:287 org.apache.tika.Tika.detect
           mime.clj:38 pantomime.mime/eval15432[fn]
           mime.clj:24 pantomime.mime/eval15411[fn]

Question: Custom file parser in Clojure

The Tika documentation recommends adding custom parsers by extending the AbstractParser Java class and then listing the new custom parser classes in the tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser file:

https://tika.apache.org/1.21/parser_guide.html#Create_your_Parser_class

Any tips for an idiomatic implementation with pantomime? Must custom parsers be compiled AOT and then loaded via pantomime.extract/make-config, or can they be written purely in Clojure and passed to Tika at runtime?

Thanks for the excellent library!

setTimeout() ?

I've been trying to figure out if there is some way to set the timeout for the tesseract process.

I occasionally receive errors such as org.apache.tika.exception.TikaException: TesseractOCRParser timeout... I'm willing to be more patient than 1200 seconds if I can figure out where to increase this.

I'm thinking maybe this has something to do with the ParseContext but haven't found exactly where/how.

Any tips appreciated, thanks.
