
book's Introduction

Taming Text, by Grant Ingersoll, Thomas Morton, and Drew Farris, is
designed to teach software engineers the basic concepts of working
with text to solve search and Natural Language Processing problems.
The book focuses on teaching with existing open source libraries such
as Apache Solr, Apache Mahout, and Apache OpenNLP to manipulate text.
To learn more, visit http://www.manning.com/ingersoll.

Getting Started
---------------

Throughout this document, TT_HOME is the directory containing the
checkout of the Taming Text code base.
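
If it is convenient to have TT_HOME available as a shell variable, it
can be set to the checkout location (the path below is only an
illustration; substitute your own):

   export TT_HOME=/path/to/taming-text
   cd $TT_HOME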

Taming Text uses Maven for building and running the code.  To get
started, you will need:

1. JDK 1.6+ 
2. Maven 3.0 or higher 
3. The OpenNLP English models, available at 
   http://maven.tamingtext.com/opennlp-models/models-1.5.

   Place the models in a directory named opennlp-models inside the
   TT_HOME directory.

   This can be done by using the following commands on UNIX from the
   TT_HOME directory:

   mkdir opennlp-models 
   cd opennlp-models 
   wget -nd -np -r http://maven.tamingtext.com/opennlp-models/models-1.5/ 
   rm index.html*
   
   
   or, on Windows, using wget (https://eternallybored.org/misc/wget/) and 7-Zip (http://www.7-zip.org/), both added to the PATH environment variable:
   
   md opennlp-models 
   cd opennlp-models
   wget -nd -np -r http://maven.tamingtext.com/opennlp-models/models-1.5/ 
   del index.htm*
   

4. Get WordNet 3.0 and place it in the TT_HOME directory.
   
   This can be done by using the following commands on UNIX from the
   TT_HOME directory:

   wget -nd -np -m http://maven.tamingtext.com/wordnet/
   rm index.html*
   tar -xf WordNet-3.0.tar.gz

   or, on Windows, using wget (https://eternallybored.org/misc/wget/) and 7-Zip (http://www.7-zip.org/), both added to the PATH environment variable:
   
   wget -nd -np -r http://maven.tamingtext.com/wordnet/
   del index.html*
   7z x WordNet-3.0.tar.gz
   7z x WordNet-3.0.tar

Building the Source
-------------------

If you have not used Maven before, it may be worth reading the
following before building, to avoid hassles later:
http://maven.apache.org/guides/getting-started/maven-in-five-minutes.html

To build the source, in TT_HOME:

   mvn clean package 
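
If the test phase fails in your environment (several of the issues
below report test failures, particularly on Windows), Maven's standard
flag for skipping tests can be used as a workaround; note that this
only skips running the tests, it does not fix the underlying failures:

   mvn clean package -DskipTests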

Running the Examples
--------------------

Many of the examples can be run via the 'tt' script in the TT_HOME/bin
directory. Running this script without arguments will display a list
of the example names.

Some of the samples are powered by pre-configured instances of
Solr. These can be started with the TT_HOME/bin/start-solr.sh script,
which takes a single argument: the name of the instance to
start. Available instances include solr-qa, solr-clustering, and
solr-tagging. For example:
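
   bin/tt
   bin/start-solr.sh solr-qa

The first command lists the available example names; the second starts
the pre-configured Solr instance used by the question answering
(solr-qa) examples.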

book's People

Contributors

abscondment, drewfarris, gsingers, hkuhn42, navyant24, tamingtext


book's Issues

Class cast exception while running Solr with answering system

Hi,
I've set up the environment with the Answer source code.
While running Solr with these parameters (-Xmx1024m -Dsolr.solr.home=c:\KMS\QA-taming\tamingText-src\apache-solr\solr-qa -Dsolr.data.dir=c:\KMS\QA-taming\tamingText-src\apache-solr\solr-qa\data -Dmodel.dir=c:\KMS\QA-taming\opennlp-models -Dwordnet.dir=c:\KMS\QA-taming\WordNet-3.0)

I'm getting the below error.

Please assist.

Regards,
Moshe

Apr 07, 2014 7:10:06 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException: com.tamingtext.texttamer.solr.SentenceTokenizerFactory cannot be cast to org.apache.solr.analysis.TokenizerFactory
at org.apache.solr.schema.IndexSchema$5.init(IndexSchema.java:966)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:148)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:986)
at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:453)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:433)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:140)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:490)
at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:123)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:481)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:335)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:165)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:96)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:653)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1239)
at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:466)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:222)
at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)

Windows: failed tests and broken sample

I get the following test failures when building on Windows 7 x64, JDK 7.0_17, MVN 3.0.5. I get the same errors regardless of whether I use the Windows command line or build from Cygwin.

https://gist.github.com/developmentalmadness/5110401
https://gist.github.com/developmentalmadness/5110276
https://gist.github.com/developmentalmadness/5110299

I wouldn't worry too much, except I can't run the first example, frankenstein.cmd, either:

C:\dev\github\tamingtextbook>"C:\Program Files\Java\jdk1.7.0_17\bin\java" -Xms512m -Xmx1024m -classpath ";.\target\test-
classes" com.tamingtext.frankenstein.Frankenstein
Exception in thread "main" java.lang.NoClassDefFoundError: opennlp/tools/sentdetect/SentenceDetector
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2451)
at java.lang.Class.getMethod0(Class.java:2694)
at java.lang.Class.getMethod(Class.java:1622)
at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: opennlp.tools.sentdetect.SentenceDetector
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 6 more

C:\dev\github\tamingtextbook>ENDLOCAL

Book errata

Page 269 lines 4-5 states:
To answer a question like "When was Einstein born?" they suggest patterns like "<NAME> was born in <LOCATION>".

From the surrounding context, it looks like "When" should be replaced with "Where".

(I'll take a free copy as payment for my amazing editorial skills. grin)

Cannot run mvn clean package

When I run "mvn clean package" I get an error message like the one below on a Windows 8 PC.

cygwin warning:
  MS-DOS style path detected: C:\apache-maven-3.0.4/boot/
  Preferred POSIX equivalent is: /cygdrive/c/apache-maven-3.0.4/boot/
  CYGWIN environment variable option "nodosfilewarning" turns off this warning.
  Consult the user's guide for more details about POSIX paths:
    http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Taming Text Source 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://repo.maven.apache.org/maven2/org/apache/maven/plugins/maven-clean-plugin/2.4.1/maven-clean-plugin-2.4.1.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.510s
[INFO] Finished at: Wed Dec 26 22:56:21 EST 2012
[INFO] Final Memory: 7M/309M
[INFO] ------------------------------------------------------------------------
[ERROR] Plugin org.apache.maven.plugins:maven-clean-plugin:2.4.1 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-clean-plugin:jar:2.4.1: Could not transfer artifact org.apache.maven.plugins:maven-clean-plugin:pom:2.4.1 from/to central (http://repo.maven.apache.org/maven2): Connection to http://repo.maven.apache.org refused: connect: Address is invalid on local machine, or port is not valid on remote machine -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException

TrainMaxent problem

Hi
I tried the TrainMaxent program.
When I execute the program, it says CategoryDataStream cannot be cast to opennlp.tools.util.ObjectStream.

Any hint about the root cause of this problem?

Cannot load SentenceTokenizerFactory class

Hello,
I am trying to run the solr-qa instance, but I get the following error message:
Error loading class 'com.tamingtext.texttamer.solr.SentenceTokenizerFactory'.

I am able to run the solr-clustering and solr-tagging instances correctly.

My system is CentOS.

Find below the output of "bin/start-solr.sh solr-qa"

https://gist.github.com/liberisp/7099468

Thank you in advance

NameFinderTest.removeConflicts(..) guard against empty list.

If the removeConflicts method from NameFinderTest is called with an empty list, an exception is thrown.

There's really no need to do anything in removeConflicts unless more than one item is passed into the method via the list argument, so exit early if the list has fewer than 2 entries.
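
A minimal sketch of the guard described above, assuming removeConflicts
operates on a list of OpenNLP Span annotations (the actual signature and
surrounding logic in NameFinderTest may differ):

   import java.util.List;
   import opennlp.tools.util.Span;

   // Hypothetical illustration of the early exit; the real method lives
   // in NameFinderTest and may use different types.
   public class RemoveConflictsGuard {
     public static void removeConflicts(List<Span> spans) {
       // With zero or one annotation there is nothing to resolve, and
       // proceeding on an empty list is what triggered the exception.
       if (spans == null || spans.size() < 2) {
         return;
       }
       // ... existing conflict-resolution logic would go here ...
     }
   }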

Incomplete sentences in answers

What is the reason for getting incomplete sentences in answers?
The sentences end at seemingly random points. How can this be solved?

windows execution error

Hi,
while reading your book and following your instructions, I found an
execution error, so I am writing to you.

The /book/bin/frankenstein.cmd file has an error.

Line 6 of your source is:

for %%i in (..\lib\*.jar) do set CLASSPATH=!CLASSPATH!;%%i

You should change it to:

for %%i in (.\target\dependency\*.jar) do set CLASSPATH=!CLASSPATH!;%%i

thanks.

Missing Wordnet 3 file verb.idx

java.io.FileNotFoundException: c:\projects\SolrWatson\TT-Home\WordNet-3.0\dict\verb.idx (The system cannot find the file specified). That error pops up in several surefire reports.

From here: http://www.shiffman.net/teaching/a2z/wordnet/
"Just found the way to fix this: rename all index.noun, index.verb... to noun.idx, verb.idx..."
That response is from 2 years ago.
I copied in the index.noun etc. files, renamed them as noun.idx, etc., and then got this failure:
FileNotFoundException: c:\projects\SolrWatson\TT-Home\WordNet-3.0\dict\verb.dat (The system cannot find the file specified)
Did the same thing to data.noun, etc.
Those tests now run fine. But, for now, I won't know whether the system works, since I'm running on Win7 64-bit with this failure, which nothing I do resolves:

Failed to set permissions of path: \tmp\hadoop-Admin\mapred\staging\Admin1270388141.staging to 0700
which is launched by the ExtractTrainingDataTest at the line: TrainClassifier.main(trainArgs);

Rerunning, even after giving that directory and its subdirectories full permissions, still fails. Perhaps org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:653) either doesn't check first, or doesn't understand Windows, or something. I'll have to wait until I replicate this experience on a *nix box.

Added later: with my WordNet changes, it builds fine on Ubuntu. I have yet to run the exercises. No clue what those changes do to WordNet's behavior.

Build failure due to test errors

Hello,

I have followed the README to build the code downloaded from the master branch yesterday. I keep getting the following errors during the test phase of the Maven build. It seems only 3 of the unit tests failed. I want to check whether these are known issues or whether there is any workaround, so that I can complete the build and test out the sample QA system.

I am using Windows 2008 R2, JDK 1.7.0_05, Maven 3.2.2

Thanks
-Jimmy


T E S T S

Running com.tamingtext.carrot2.Carrot2ExampleTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.192 sec
Running com.tamingtext.classifier.bayes.BayesUpdateRequestProcessorTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.566 sec
Running com.tamingtext.classifier.bayes.ExtractTrainingDataTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.961 sec <<< FAILURE!
Running com.tamingtext.classifier.mlt.MoreLikeThisQueryTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.565 sec
Running com.tamingtext.fuzzy.LevenshteinDistanceTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.171 sec
Running com.tamingtext.fuzzy.OverlapMeasuresTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.737 sec
Running com.tamingtext.fuzzy.TrieNodeTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.182 sec
Running com.tamingtext.mahout.VectorExamplesTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.134 sec
Running com.tamingtext.opennlp.AnswerTypeTest
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 2.956 sec <<< FAILURE!
Running com.tamingtext.opennlp.ChunkParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.611 sec
Running com.tamingtext.opennlp.NameFinderTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 26.591 sec
Running com.tamingtext.opennlp.ParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.751 sec
Running com.tamingtext.opennlp.POSTaggerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.843 sec
Running com.tamingtext.qa.PassageRankingComponentTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 50.974 sec
Running com.tamingtext.qa.QATest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 53.369 sec <<< FAILURE!
Running com.tamingtext.sentences.SentenceDetectionTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.329 sec
Running com.tamingtext.snowball.SnowballStemmerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.189 sec
Running com.tamingtext.solr.SolrJTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 15.192 sec
Running com.tamingtext.texttamer.solr.NameFilterTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.377 sec
Running com.tamingtext.texttamer.solr.SentenceTokenizerTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.457 sec
Running com.tamingtext.tika.TikaTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.889 sec
Running com.tamingtext.util.StringUtilTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.221 sec

Results :

Tests in error:

Tests run: 43, Failures: 0, Errors: 4, Skipped: 0

Building Taming Text 0.1-SNAPSHOT Error

I am getting a build failure during the mvn package command.
What I don't get about this error is that it could not resolve the Carrot2 artifact in the opennlp repository. Why would OpenNLP have Carrot2?

I am enjoying this book quite a bit (I read it yesterday cover to cover) and am looking forward to testing.

Thanks,
Eric

Error Below:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24.332s
[INFO] Finished at: Wed Jan 16 11:15:15 EST 2013
[INFO] Final Memory: 6M/81M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project taming-text: Could not resolve dependencies for project com.tamingtext:taming-text:jar:0.1-SNAPSHOT: Could not find artifact org.carrot2:carrot2-core:jar:3.6.0-SNAPSHOT in opennlp (http://opennlp.sourceforge.net/maven2/) -> [Help 1]

"mvn clean package" very slow artifact download speeds

The repositories hosting various components seem to be awfully slow. Here is a sample of some of the download speeds being reported by "mvn clean package":

Downloaded: http://repo.maven.apache.org/maven2/org/apache/ant/ant-junit/1.7.1/ant-junit-1.7.1.pom (4 KB at 0.2 KB/sec)
Downloaded: http://repo.maven.apache.org/maven2/org/apache/ant/ant-parent/1.7.1/ant-parent-1.7.1.pom (5 KB at 0.3 KB/sec)
Downloaded: http://repo.maven.apache.org/maven2/org/apache/ant/ant/1.7.1/ant-1.7.1.pom (10 KB at 0.6 KB/sec)

It's not my network; I checked. Network speeds tested using speedtest-cli:
Download: 24.57 Mbit/s
Upload: 10.10 Mbit/s

So it appears that the download speed is being throttled on the server side. Looking inside the pom.xml showed that the maven2 repo was pointing at this URL:

http://people.apache.org/maven-snapshot-repository

which probably works, but is not backed by enough hardware compared to http://repo1.maven.org/maven2/, which is the central maven2 repo according to this web page: http://www.mkyong.com/maven/where-is-maven-central-repository/.

After switching the url and rerunning "mvn clean package", the download speeds are significantly higher.

Downloaded: http://repo1.maven.org/maven2/org/codehaus/jackson/jackson-mapper-asl/1.4.0/jackson-mapper-asl-1.4.0.pom (2 KB at 15.4 KB/sec)
Downloaded: http://repo1.maven.org/maven2/com/googlecode/json-simple/json-simple/1.1/json-simple-1.1.pom (2 KB at 19.6 KB/sec)

While in the pom.xml, I also noticed the sf.net repository configured for OpenNLP. Nowadays OpenNLP is also available on Maven Central, so both repository definitions identified by the ids below can be safely removed from the pom.xml, which will result in a much faster setup:
opennlp
apache.maven2.snapshot.repository

It still takes upwards of an hour to run the command to completion. You may want to mention that in the README.

Lazy loading error - Chapter 3

When trying to run the curl samples against the Word docs from Chapter 3, a lazy loading error occurs.

curl "http://localhost:8983/solr/update/extract?&extractOnly=true" \
   -F "myfile=@src/test/resources/sample-word.doc"

This causes an error similar to the following:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 lazy loading error

org.apache.solr.common.SolrException: lazy loading error
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:260)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.extraction.ExtractingRequestHandler'
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:394)
  at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:419)
  at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:455)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:251)
  ... 21 more
Caused by: java.lang.ClassNotFoundException: solr.extraction.ExtractingRequestHandler
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:789)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:378)
  ... 24 more
</title>
</head>

Page 61: Indexing content with Apache Solr

When running the command

 curl "http://localhost:8983/solr/update/extract?&extractOnly=true"  -F "myfile=@src/test/resources/sample-word.doc"

I got a Solr exception:

C:\_Work\_git\book>curl "http://localhost:8983/solr/update/extract?&extractOnly=true" -F "myfile=@src/test/resources/sample-word.doc"

<title>Error 500 lazy loading error

org.apache.solr.common.SolrException: lazy loading error
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:260)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.extraction.ExtractingRequestHandler'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:394)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:419)
        at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:455)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:251)
        ... 21 more
Caused by: java.lang.ClassNotFoundException: solr.extraction.ExtractingRequestHandler
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Unknown Source)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:378)
        ... 24 more
</title>

HTTP ERROR 500

Problem accessing /solr/update/extract. Reason:

    lazy loading error


making things work with jdk 1.8

I checked out the code, did a mvn eclipse:eclipse, and imported the project into a workspace. Now I am running into this with Eclipse:

Unbound classpath container: 'JRE System Library [JavaSE-1.6]' in project 'taming-text'

Is it possible to build the bits with 1.8?

UnsupportedOperationException from SplitInput

Exception in thread "main" java.lang.UnsupportedOperationException
        at java.util.Collections$SingletonSet$1.remove(Collections.java:3087)
        at java.util.AbstractCollection.clear(AbstractCollection.java:396)
        at org.apache.mahout.common.IOUtils.close(IOUtils.java:137)
        at com.tamingtext.util.SplitInput.countLines(SplitInput.java:583)

The semantics of IOUtils.close(..) have changed slightly between Mahout 0.4 and 0.6. close() must be passed a mutable Collection because it removes elements from the collection as it successfully closes them. As such, Collections.singleton(writable) is no longer a valid argument and results in an UnsupportedOperationException.
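
A minimal sketch of the implied fix, assuming the call site closes a
single reader (the actual code in SplitInput.countLines(..) may differ):

   import java.io.Closeable;
   import java.io.IOException;
   import java.util.Collection;
   import java.util.HashSet;

   import org.apache.mahout.common.IOUtils;

   public class CloseExample {
     // Hypothetical illustration: Mahout 0.6's IOUtils.close(..) removes
     // each element from the collection as it closes it, so the
     // collection passed in must be mutable.
     public static void closeSafely(Closeable reader) throws IOException {
       // Fails on Mahout 0.6 with UnsupportedOperationException, because
       // a singleton set cannot have elements removed:
       //   IOUtils.close(Collections.singleton(reader));

       // Works: pass a mutable collection instead.
       Collection<Closeable> closeables = new HashSet<Closeable>();
       closeables.add(reader);
       IOUtils.close(closeables);
     }
   }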

data import command on page 149 does not complete

I followed the directions on the github main page to download the code, ran mvn package, and went to the bin directory and ran ./start-solr.sh solr-clustering &. So far so good.

But when I went to http://localhost:8983/solr/dataimport?command=full-import, the data import could not complete. The text in the status message was "Indexing failed. Rolled back all changes." In the console, I found this error message:

SEVERE: Exception thrown while getting data
java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.startribune.com/sports/index.rss2
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1626)
...

For whatever reason, the script seems unable to get to the url http://www.startribune.com/sports/index.rss2. However, I can get to the url from the browser window. Is there a known issue with getting to this page from the solr example?

I am trying to do Mahout clustering and have been getting errors when trying to cluster some other documents (the docs that came with Solr 4.6.1), so now I am trying to follow the book's examples exactly so I can be sure the clustering process runs correctly.

“NoSuchMethodErrors” due to multiple versions of commons-codec:commons-codec:jar

Issue description

Hi, there are multiple versions of commons-codec:commons-codec in book-master. As shown in the following dependency tree, according to Maven's "nearest wins" strategy only commons-codec:commons-codec:1.6 can be loaded; commons-codec:commons-codec:1.2, commons-codec:commons-codec:1.5 and commons-codec:commons-codec:1.4 will be shadowed.

However, several methods defined in the shadowed versions commons-codec:commons-codec:1.2, commons-codec:commons-codec:1.5 and commons-codec:commons-codec:1.4 are referenced by the client project via org.apache.mahout:mahout-core:0.6, org.apache.mahout:mahout-integration:0.6, org.apache.solr:solr-solrj:3.6.0, org.apache.tika:tika-parsers:0.10 and org.carrot2:carrot2-core:3.6.0, but are missing in the actually loaded version commons-codec:commons-codec:1.6.

For instance, the following missing method (defined in commons-codec:commons-codec:1.2, commons-codec:commons-codec:1.5 and commons-codec:commons-codec:1.4) is actually referenced by book-master, which will introduce a runtime error (i.e., "NoSuchMethodError") into book-master.

  1. <org.apache.commons.codec.binary.Base64: java.lang.String encodeToString(byte[])> is invoked by book-master via the following path:
path--
<com.tamingtext.util.SplitInput: void splitFile(org.apache.hadoop.fs.Path)> com.tamingtext:taming-text:0.1-SNAPSHOT;
<org.apache.hadoop.fs.FileSystem: org.apache.hadoop.fs.FSDataInputStream open(org.apache.hadoop.fs.Path)> org.apache.hadoop:hadoop-core:0.20.204.0;
<org.apache.hadoop.hdfs.HftpFileSystem: org.apache.hadoop.fs.FSDataInputStream open(org.apache.hadoop.fs.Path,int)> org.apache.hadoop:hadoop-core:0.20.204.0;
<org.apache.hadoop.hdfs.HftpFileSystem: java.net.HttpURLConnection openConnection(java.lang.String,java.lang.String)> org.apache.hadoop:hadoop-core:0.20.204.0;
<org.apache.hadoop.hdfs.HftpFileSystem: java.lang.String updateQuery(java.lang.String)> org.apache.hadoop:hadoop-core:0.20.204.0;
<org.apache.hadoop.security.token.Token: java.lang.String encodeToUrlString()> org.apache.hadoop:hadoop-core:0.20.204.0;
<org.apache.hadoop.security.token.Token: java.lang.String encodeWritable(org.apache.hadoop.io.Writable)> org.apache.hadoop:hadoop-core:0.20.204.0;
<org.apache.commons.codec.binary.Base64: java.lang.String encodeToString(byte[])>

Suggested fixes:

  1. Change the direct dependency on commons-codec:commons-codec from 1.6 to 1.4, because version 1.4 includes the missing methods above and is compatible with the other versions of commons-codec:commons-codec in the project.
  2. Use dependency management configuration to unify the version of commons-codec:commons-codec to 1.4 in the pom file (see the sketch below).
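
     A sketch of what option 2 might look like in the top-level pom.xml,
     using Maven's standard dependencyManagement section to pin the
     version for the whole build (the exact placement in the book's pom
     may differ):

        <!-- Hypothetical sketch: pin commons-codec to 1.4 everywhere. -->
        <dependencyManagement>
          <dependencies>
            <dependency>
              <groupId>commons-codec</groupId>
              <artifactId>commons-codec</artifactId>
              <version>1.4</version>
            </dependency>
          </dependencies>
        </dependencyManagement>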

Please let me know which solution you prefer; I can submit a PR to fix it.

Thank you very much for your attention.
Best regards,

Dependency tree----


[INFO] |  |  \- (commons-codec:commons-codec:jar:1.2:compile - omitted for conflict with 1.6)
[INFO] |  |  \- (commons-codec:commons-codec:jar:1.6:compile - omitted for duplicate)
[INFO] |  +- commons-codec:commons-codec:jar:1.6:compile
[INFO] |  +- (commons-codec:commons-codec:jar:1.4:compile - omitted for conflict with 1.6)
[INFO] |  |  \- (commons-codec:commons-codec:jar:1.5:compile - omitted for conflict with 1.6)
[INFO] |  |  +- (commons-codec:commons-codec:jar:1.4:compile - omitted for conflict with 1.6)
[INFO] |  |  +- (commons-codec:commons-codec:jar:1.2:compile - omitted for conflict with 1.6)
[INFO] |  +- (commons-codec:commons-codec:jar:1.4:compile - omitted for conflict with 1.6)
[INFO] |  |  \- (commons-codec:commons-codec:jar:1.4:compile - omitted for conflict with 1.6)

Unable to search queries which use synonyms.txt

After following the solr-qa application setup steps, I was able to run the application and get answers to my questions.
However, I wasn't able to get answers if the questions used synonyms instead of the original words. I populated the synonyms.txt file like this:

Profile,account
edit,change,Configure,setup,create,establish

Frankenstein run error

Hi,

I've downloaded the source code and received the following error when attempting to run the frankenstein.sh script.

Initializing Frankenstein
Exception in thread "main" java.io.FileNotFoundException: ../../opennlp-models
  at com.tamingtext.frankenstein.Frankenstein.init(Frankenstein.java:226)
  at com.tamingtext.frankenstein.Frankenstein.main(Frankenstein.java:72)
  1. opennlp-models is located in the TT_HOME folder and was downloaded as instructed.
  2. I'd look myself and see what the problem is at lines 76 and 226, but I can't seem to find the Frankenstein.java file (just the compiled .class file).

I'm really looking forward to digging into the book, and thanks in advance for your help.
