patrickfrey / strusanalyzer Goto Github PK

Library for document analysis (segmentation, tokenization, normalization, aggregation) with the goal to get a set of items that can be inserted into a strus storage. Also some functions for analysing tokens or phrases of the strus query are provided.

Home Page: http://www.project-strus.net

License: Mozilla Public License 2.0

CMake 6.54% Makefile 0.01% C++ 93.40% Python 0.05%

strusanalyzer's People

Stargazers

Watchers

strusanalyzer's Issues

C++11 dynamic exceptions warning in textwolf header files

In file included from /home/abaumann/projects/strus/strusAnalyzer/3rdParty/textwolf/include/textwolf/xmlpathselect.hpp:16:0,
                 from /home/abaumann/projects/strus/strusAnalyzer/include/private/xpathAutomaton.hpp:10,
                 from /home/abaumann/projects/strus/strusAnalyzer/src/segmenter_cjson/segmenter.hpp:13,
                 from /home/abaumann/projects/strus/strusAnalyzer/src/segmenter_cjson/libstrus_segmenter_cjson.cpp:11:
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/textwolf/include/textwolf/xmlscanner.hpp:61:30: warning: dynamic exception specifications are deprecated in C++11 [-Wdeprecated]

Seen on ArchLinux with gcc (GCC) 7.1.1 20170630

cJSON library throws some warnings

[ 87%] Building C object 3rdParty/cjson/src/CMakeFiles/cjson.dir/cJSON.c.o
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c: In function ‘cJSON_strcasecmp’:
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c:37:2: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
  if (!s1) return (s1==s2)?0:1;if (!s2) return 1;
  ^~
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c:37:31: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
  if (!s1) return (s1==s2)?0:1;if (!s2) return 1;
                               ^~
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c: In function ‘print_object’:
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c:480:3: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
   if (fmt) *ptr++='\n';*ptr=0;
   ^~
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c:480:24: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
   if (fmt) *ptr++='\n';*ptr=0;
                        ^
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c: In function ‘cJSON_DetachItemFromArray’:
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c:507:2: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
  if (c->prev) c->prev->next=c->next;if (c->next) c->next->prev=c->prev;if (c==array->child) array->child=c->next;c->prev=c->next=0;return c;}
  ^~
/home/abaumann/projects/strus/strusAnalyzer/3rdParty/cjson/src/cJSON.c:507:37: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
  if (c->prev) c->prev->next=c->next;if (c->next) c->next->prev=c->prev;if (c==array->child) array->child=c->next;c->prev=c->next=0;return c;}
                                     ^~
[ 87%] Linking C static library libcjson.a
[ 87%] Built target cjson

punctuation method, meaning of second parameter

punctuation   producing punctuation elements (end of sentence recognition). The language is specified as parameter (currently only german 'de' and english 'en' supported).

	sentence = orig punctuation("en","") /post/post/body//para();

What is the meaning of the second parameter?

Asynchronous document feeding implementation not complete

no subsequent calls of DocumentAnalyzerContext::putInput without consuming input with getNext possible. An error message is issued by the XML segmenter, but this behaviour is strange.
The interface makes you believe that it is asynchronous, but it is not. Though it was intended to be and it should be asynchronous.

build issues on RHEL-6

textwolf submodule is on git revision 2052c98a74d69353b3e2ad6a3586a44fae8ae56e when
I check out strusAnalyzer with --recursive,
current version in master of textwolf is c141f51e2a4de739134cfa68c022203125c8f8ab, should
c141f51e2a4de739134cfa68c022203125c8f8ab be the commit revision in textwolf in strusAnalyzer?

Meaning of ~ in analyzer xpath selector

    sent = empty content /doc/title~;

What does '~' mean?

remove hardcoded document properties

hardcoded document properties like attribute doclen should be replaced.
The can be substituted by meta data definitions from an accumulator.

The XML segmenter processes only UTF-8

textwolf is able to process other character set encodings too, but the strus standard segmenter has to parse the XML header to determine the character set encoding and to instrument the XML parser accordingly. This is not done. UTF-8 is assumed.

JSON segmenter crashes as sub segmenter

Program terminated with signal SIGSEGV, Segmentation fault.
#0  getTextwolfItems (itemar=..., nd=nd@entry=0x0)
    at /home/abaumann/projects/strus/strusAnalyzer/src/segmenter_cjson/segmenterContext.cpp:79
79              switch (nd->type & 0x7F)
(gdb) bt
#0  getTextwolfItems (itemar=std::vector of length 0, capacity 0, nd=nd@entry=0x0)
    at /home/abaumann/projects/strus/strusAnalyzer/src/segmenter_cjson/segmenterContext.cpp:79
#1  0xb702a7fc in getSegmenterItems (tree=0x0, resar=..., automaton=<optimized out>)
    at /home/abaumann/projects/strus/strusAnalyzer/src/segmenter_cjson/segmenterContext.cpp:198
#2  strus::SegmenterContext::getNext (this=0x9bb59c8, id=@0xbffe51ec: 268435456, pos=@0x9bb5990: 2861, 
    segment=@0xbffe51e4: 0x9bb64a0 "\n{}\n", segmentsize=@0xbffe51e8: 4)
    at /home/abaumann/projects/strus/strusAnalyzer/src/segmenter_cjson/segmenterContext.cpp:238
#3  0xb70f8165 in strus::DocumentAnalyzerContext::analyzeNext (this=<optimized out>, doc=...)
    at /home/abaumann/projects/strus/strusAnalyzer/src/analyzer/documentAnalyzerContext.cpp:152
#4  0x0804e8a5 in main (argc=<optimized out>, argv=<optimized out>)
    at /home/abaumann/projects/strus/strusUtilities/src/strusAnalyze/strusAnalyze.cpp:457

maybe a problem is the empty JSON document:

#2  strus::SegmenterContext::getNext (this=0x9bb59c8, id=@0xbffe51ec: 268435456, pos=@0x9bb5990: 2861, 
    segment=@0xbffe51e4: 0x9bb64a0 "\n{}\n", segmentsize=@0xbffe51e8: 4)

using nested segmenters and positions

I have a subsection JSON inside an XML segmenter.
How can I make sure, that indexing JSON (where fields like title have no order) have
a specific order and come before another section in the XML.

example:

<meta>
{"title":"Andreas Baumann's Personal Home Page"}
</meta>
<body>
<para>
  Using a static HTML generator now called

vs.

<meta>
{"categories":["Hardware","NAS","Linux"],"date":"2017-01-21T14:10:11+01:00","thumbnail":"/images/blog/a-nas-tale/a-nas-tale.png","title":"A NAS tale"}
</meta>
<body>
<para>
  In August 2009 I decided it was time to replace my old Pentium II

the analyzer configuration looks as follows:

[SearchIndex]
	word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/meta()/title();
	word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();
	
[ForwardIndex]
	title = orig split /posts/post/meta()/title();
	text = orig split /posts/post/body//para();

I see:

1 title 'Andreas'
2 title 'Baumann's'
3 title 'Personal'
4 title 'Home'
5 title 'Page'
6 text 'Using'
7 text 'a'
8 text 'static'

which is ok and:

1 text 'In'
2 text 'August'
3 text '2009'
4 text 'I'
5 text 'decided'
6 text 'it'
7 text 'was'
8 text 'time'
...
63 text 'the'
64 text 'job.'
65 title 'A'
66 title 'NAS'
67 title 'tale'
68 text 'Almost'
69 text 'exactly'
70 text 'a'
...

What I would like to declare is that the text in para starts at a certain offset after all fields in meta.

multi-valued attributes are not supported

I have a field 'subject' with the value:

 Passing (Identity) -- Fiction|Legal stories|Infants switched at birth -- Fiction|Missouri 
--Fiction|Trials (Murder) -- Fiction|Race relations -- Fiction|Impostors and imposture -- Fiction

When splitting it with:

	subject = orig regex("([^\|]+)") subject;

only the last subject is inserted into the index.

What is the strategy to have multi-valued fields?

Position information of overlapping annotations (attributes in XML) is not handled correctly

The start position of an annotation is bound to the tag, if the tag is selected or to the first term after the tag. Subsequent positions are counted from this base. This has the following consequences:

Elements in annotations except the first get a wrong position
When matching a structure in an annotation you might get matches covering two annotations if the annotations are close or overlapping.

Proposed solution:

All elements of an annotation get bound to the one position in the content they are bound to.
Provide special posting set operators that use a second position inside the annoation for matching structures in annotations.
Implement another type of a block for annotations that have a tuple for each position they match: The first element of the tuple is the content position and the second the position inside the annotation. The first position is used in ordinary operations, the second in the operations refering to the annotation positions.

End of sentence recognition does not work at all

See Wikipedia Collection (about 5 % of documents contain an end of sentence marker) !?

date2int metadata mapping

error handling transaction in queue: transaction commit failed, error handling transaction in queue: transaction commit failed, storage transaction with error: error in 'date2int' normalizer: unknown time format: ''

This leads to the question how unknown or illegal dates should be handled:
Either have an unknown value in the metadata or map it to a default value.

How to do conditional indexing per language?

For instance:

<DOC>
  <META>
    <LANGUAGE>en</LANGUAGE>
  </META>
  <TEXT>
    <P>This is a text.</P>
  </TEXT>
</DOC>

I would think of an analyzer confiuguration:

[SearchIndex]
  word = lc:convdia(en) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="en";
  word = lc:convdia(de) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="de";

  stem = lc:convida():stem(en) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="en";
  stem = lc:convida():stem(de) word /DOC/TEXT/P()[/DOC/META/LANGUAGE()="de";

Of course I can always transform the document before and push the language attribute
for instance into the TEXT or the P tag.

document markup does not resolve overlapping markups correctly

If you want to markup a document with matching patterns, you have either to declare the patterns as exclusive. (%MATCHER exclusive) or rely on the correct implementation of the ousting of matches with lower priority by matches of higher priority. The later mechanism implemented in

    void TokenMarkupContextInterface::putMarkup(
                    const analyzer::Position& start,
                    const analyzer::Position& end,
                    const analyzer::TokenMarkup& markup,
                    unsigned int level);

does not work. Neither are overlapping matches in the content marked up correctly, nor does the mechanism of eliminating lower level markup of areas by covering higher level areas work.

List of supported document types and segmenters

A list of currently supported document formats and segmenters as well as what is the default segmenter for a document type would be nice to have.

How to split document in tokens in a mixed tagged format

Using:

[SearchIndex]
	word = lc:convdia(en):stem(en):lc regex("([A-Za-z']+)") /posts/post/body//para();

[ForwardIndex]
	text = orig split /posts/post/body//para();

I get:

6 text 'Using'
7 text 'a'
8 text 'static'
9 text 'HTML'
10 text 'generator'
11 text 'now'
12 text 'called'
13 text 'Hugo'
14 text '.'
15 text 'Before'
16 text 'I'
17 text 'used'
18 text 'HTML'
19 text 'and'
20 text 'server-side-includes.'
23 text 'Synchronization'
24 text 'is'
25 text 'done'
26 text 'with'
27 text 'rsync'
28 text 'over'
29 text 'ssh.'

Documentation says it's a split on whitespace. Why do I get sometimes '.' and somtimes
'word.'?

Does it depend on the way I'm analyzing for the search index?

How to index the filename?

Segmenters should have options

Some segmenters like CSV need options, because their behaviour is not fully standardised and the data column descriptions may be declared at a separate place, not part of the file itself.

Some segmenters like LibXML (does not exist, but may in the future) may have options to explicitely switch off some behaviour required by the standard but not desired because of vulnerabilities. For example attacks by recursive entity declarations.

I would suggest a structure SegmenterOptions containing an array of name value pairs passed to SegmenterInterface::createInstance(). The options available are dependent on the segmenter implementation.

RandomFeed is very slow on FreeBSD and OSX

The RandomFeed test takes significantly longer on OSX and FreeBSD than on Linux.

We think it's because of the way exceptions work differently in different
C++ libraries.

What is the meaning of scheme in DocumentClass

I understand the MIME-type and the encoding.
What is scheme? The standard document class detector doesn't seem to set it?

unexcpected error message if XML header is broken

Missing the ?> marker in the XML declaration like:

<?xml version="1.0" encoding="UTF-8">

results in:


ERROR error in analyze document: error in XML document at position 30: expected tag attribute

error handling when semicolon is missing in attribute definition

[Attribute]
	no = orig content lineno
	docid = orig content id;

If executing:

strusAnalyze -D 'text/tab-separated-values' -g tsv gutenberg.ana gutenberg.tsv | & less

I just don't see 'no'. I also don't see an error message about the ';' missing.

inconsistent parameters

    sent = empty punctuation("en","") /doc/text//();
    stem = lc:convdia(en):stem(en) word /doc/title();

Why is the parameter for stem a token en and for punctuation the string "en"?

exceptions of type strus::runtime_error ignored

for example: I throw an exception in a segmenter, strusAnalyze happily continues to index the
document. The exception in this case was thrown in the defineSubSection method of the segmenter
as such:

throw strus::runtime_error(_TXT("xxx"));

When I see the definition of 'runtime_error', it seems to me that this is a function in internationatiztion.hpp/cpp!

And the code is mixed up with std::runtime_error and code throwing a strus::runtime_error which
will then never be caught. For instance in programLoader.cpp there is a catch (std::runtime_error).

Defining a function with the same name as a standard exception name is very counter-intuitive!

If I change it to std::runtime_error also nothing gets catched.

I'm really puzzled here..

tokenizer 'word' is not logical

It separates don't into the tokens don and t.
Of course, I can work around with the regex tokenizer...

Packaging fixes (unfinished)

diff --git a/dist/redhat/strus.spec b/dist/redhat/strus.spec
deleted file mode 100644
index d10b421..0000000
--- a/dist/redhat/strus.spec
+++ /dev/null
@@ -1,260 +0,0 @@
-# StrusAnalyzer spec file
-
-# Set distribution based on some OpenSuse and distribution macros
-# this is only relevant when building on https://build.opensuse.org
-###
-
-%define rhel 0
-%define rhel5 0
-%define rhel6 0
-%define rhel7 0
-%if 0%{?rhel_version} >= 500 && 0%{?rhel_version} <= 599
-%define dist rhel5
-%define rhel 1
-%define rhel5 1
-%endif
-%if 0%{?rhel_version} >= 600 && 0%{?rhel_version} <= 699
-%define dist rhel6
-%define rhel 1
-%define rhel6 1
-%endif
-%if 0%{?rhel_version} >= 700 && 0%{?rhel_version} <= 799
-%define dist rhel7
-%define rhel 1
-%define rhel7 1
-%endif
-
-%define centos 0
-%define centos5 0
-%define centos6 0
-%define centos7 0
-%if 0%{?centos_version} >= 500 && 0%{?centos_version} <= 599
-%define dist centos5
-%define centos 1
-%define centos5 1
-%endif
-%if 0%{?centos_version} >= 600 && 0%{?centos_version} <= 699
-%define dist centos6
-%define centos 1
-%define centos6 1
-%endif
-%if 0%{?centos_version} >= 700 && 0%{?centos_version} <= 799
-%define dist centos7
-%define centos 1
-%define centos7 1
-%endif
-
-%define scilin 0
-%define scilin5 0
-%define scilin6 0
-%define scilin7 0
-%if 0%{?scilin_version} >= 500 && 0%{?scilin_version} <= 599
-%define dist scilin5
-%define scilin 1
-%define scilin5 1
-%endif
-%if 0%{?scilin_version} >= 600 && 0%{?scilin_version} <= 699
-%define dist scilin6
-%define scilin 1
-%define scilin6 1
-%endif
-%if 0%{?scilin_version} >= 700 && 0%{?scilin_version} <= 799
-%define dist scilin7
-%define scilin 1
-%define scilin7 1
-%endif
-
-%define fedora 0
-%define fc20 0
-%define fc21 0
-%if 0%{?fedora_version} == 20
-%define dist fc20
-%define fc20 1
-%define fedora 1
-%endif
-%if 0%{?fedora_version} == 21
-%define dist fc21
-%define fc21 1
-%define fedora 1
-%endif
-
-%define suse 0
-%define osu122 0
-%define osu123 0
-%define osu131 0
-%define osu132 0
-%define osufactory 0
-%if 0%{?suse_version} == 1220
-%define dist osu122
-%define osu122 1
-%define suse 1
-%endif
-%if 0%{?suse_version} == 1230
-%define dist osu123
-%define osu123 1
-%define suse 1
-%endif
-%if 0%{?suse_version} == 1310
-%define dist osu131
-%define osu131 1
-%define suse 1
-%endif
-%if 0%{?suse_version} == 1320
-%define dist osu132
-%define osu132 1
-%define suse 1
-%endif
-%if 0%{?suse_version} > 1320
-%define dist osufactory
-%define osufactory 1
-%define suse 1
-%endif
-
-%define sles 0
-%define sles11 0
-%define sles12 0
-%if 0%{?suse_version} == 1110
-%define dist sle11
-%define sles11 1
-%define sles 1
-%endif
-%if 0%{?suse_version} == 1315 
-%define dist sle12
-%define sles12 1
-%define sles 1
-%endif
-
-Summary: Library implementing the document and query analysis for a text search engine
-Name: strusanalyzer
-Version: 0.0.1
-Release: 0.1
-License: GPLv3
-Group: Development/Libraries/C++
-
-Source: %{name}_%{version}.tar.gz
-
-URL: http://project-strus.net
-
-BuildRoot: %{_tmppath}/%{name}-root
-
-# Build dependencies
-###
-
-# OBS doesn't install the minimal set of build tools automatically
-BuildRequires: gcc
-BuildRequires: gcc-c++
-BuildRequires: cmake
-
-%if %{rhel} || %{centos} || %{scilin} || %{fedora}
-%if %{rhel5} || %{centos5}
-Requires: boost148 >= 1.48
-BuildRequires: boost148-devel >= 1.48
-%else
-Requires: boost >= 1.48
-Requires: boost-thread >= 1.48
-Requires: boost-system >= 1.48
-Requires: boost-date_time >= 1.48
-BuildRequires: boost-devel
-%endif
-%endif
-%if %{suse} || %{sles}
-BuildRequires: boost-devel
-%if %{osu122} || %{osu123} || %{sles11} || %{sles12}
-Requires: libboost_thread1_49_0 >= 1.49.0
-Requires: libboost_system1_49_0 >= 1.49.0
-Requires: libboost_date_time1_49_0 >= 1.49.0
-%endif
-%if %{osu131}
-Requires: libboost_thread1_53_0 >= 1.53.0
-Requires: libboost_system1_53_0 >= 1.53.0
-Requires: libboost_date_time1_53_0 >= 1.53.0
-%endif
-%endif
-
-# Check if 'Distribution' is really set by OBS (as mentioned in bacula)
-%if ! 0%{?opensuse_bs}
-Distribution: %{dist}
-%endif
-
-Packager: Patrick Frey 
-
-%description
-Library implementing the document and query analysis for a text search engine.
-
-%package devel
-Summary: strusanalyzer development files
-Group: Development/Libraries/C++
-
-%description devel
-The libraries and header files used for development with strusanalyzer.
-
-Requires: %{name} >= %{version}-%{release}
-
-%prep
-%setup
-
-%build
-
-mkdir build
-cd build
-cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -DLIB_INSTALL_DIR=%{_libdir} ..
-make %{?_smp_mflags}
-
-%install
-
-cd build
-make DESTDIR=$RPM_BUILD_ROOT install
-
-# TODO: avoid building this stuff in cmake. how?
-rm -rf $RPM_BUILD_ROOT%{_libdir}/debug
-rm -rf $RPM_BUILD_ROOT%{_prefix}/src/debug
-
-%clean
-rm -rf $RPM_BUILD_ROOT
-
-%check
-cd build
-make test
-
-%files
-%defattr( -, root, root )
-%dir %{_libdir}/%{name}
-%{_libdir}/%{name}/libstemmer.so.0.0
-%{_libdir}/%{name}/libstemmer.so.0.0.1
-%{_libdir}/%{name}/libstrus_tokenizer_word.so.0.0
-%{_libdir}/%{name}/libstrus_tokenizer_word.so.0.0.1
-%{_libdir}/%{name}/libstrus_tokenizer_punctuation.so.0.0
-%{_libdir}/%{name}/libstrus_tokenizer_punctuation.so.0.0.1
-%{_libdir}/%{name}/libstrus_segmenter_textwolf.so.0.0
-%{_libdir}/%{name}/libstrus_segmenter_textwolf.so.0.0.1
-%{_libdir}/%{name}/libstrus_analyzer.so.0.0
-%{_libdir}/%{name}/libstrus_analyzer.so.0.0.1
-%{_libdir}/%{name}/libstrus_normalizer_dictmap.so.0.0
-%{_libdir}/%{name}/libstrus_normalizer_dictmap.so.0.0.1
-%{_libdir}/%{name}/libstrus_normalizer_charconv.so.0.0
-%{_libdir}/%{name}/libstrus_normalizer_charconv.so.0.0.1
-%{_libdir}/%{name}/libstrus_textproc.so.0.0
-%{_libdir}/%{name}/libstrus_textproc.so.0.0.1
-%{_libdir}/%{name}/libstrus_normalizer_snowball.so.0.0
-%{_libdir}/%{name}/libstrus_normalizer_snowball.so.0.0.1
-
-%files devel
-%{_libdir}/%{name}/libstemmer.so
-%{_libdir}/%{name}/libstrus_tokenizer_word.so
-%{_libdir}/%{name}/libstrus_tokenizer_punctuation.so
-%{_libdir}/%{name}/libstrus_segmenter_textwolf.so
-%{_libdir}/%{name}/libstrus_analyzer.so
-%{_libdir}/%{name}/libstrus_normalizer_dictmap.so
-%{_libdir}/%{name}/libstrus_normalizer_charconv.so
-%{_libdir}/%{name}/libstrus_textproc.so
-%{_libdir}/%{name}/libstrus_normalizer_snowball.so
-%dir %{_includedir}/%{name}
-%{_includedir}/%{name}/*.hpp
-%dir %{_includedir}/%{name}/lib
-%{_includedir}/%{name}/lib/*.hpp
-%dir %{_includedir}/%{name}/private
-%{_includedir}/%{name}/private/*.hpp
-
-%changelog
-* Fri Mar 20 2015 Patrick Frey  0.0.1-0.1
-- preliminary release
diff --git a/dist/redhat/strusanalyzer.spec b/dist/redhat/strusanalyzer.spec
new file mode 100644
index 0000000..55a501e
--- /dev/null
+++ b/dist/redhat/strusanalyzer.spec
@@ -0,0 +1,262 @@
+# StrusAnalyzer spec file
+
+# Set distribution based on some OpenSuse and distribution macros
+# this is only relevant when building on https://build.opensuse.org
+###
+
+%define rhel 0
+%define rhel5 0
+%define rhel6 0
+%define rhel7 0
+%if 0%{?rhel_version} >= 500 && 0%{?rhel_version} <= 599
+%define dist rhel5
+%define rhel 1
+%define rhel5 1
+%endif
+%if 0%{?rhel_version} >= 600 && 0%{?rhel_version} <= 699
+%define dist rhel6
+%define rhel 1
+%define rhel6 1
+%endif
+%if 0%{?rhel_version} >= 700 && 0%{?rhel_version} <= 799
+%define dist rhel7
+%define rhel 1
+%define rhel7 1
+%endif
+
+%define centos 0
+%define centos5 0
+%define centos6 0
+%define centos7 0
+%if 0%{?centos_version} >= 500 && 0%{?centos_version} <= 599
+%define dist centos5
+%define centos 1
+%define centos5 1
+%endif
+%if 0%{?centos_version} >= 600 && 0%{?centos_version} <= 699
+%define dist centos6
+%define centos 1
+%define centos6 1
+%endif
+%if 0%{?centos_version} >= 700 && 0%{?centos_version} <= 799
+%define dist centos7
+%define centos 1
+%define centos7 1
+%endif
+
+%define scilin 0
+%define scilin5 0
+%define scilin6 0
+%define scilin7 0
+%if 0%{?scilin_version} >= 500 && 0%{?scilin_version} <= 599
+%define dist scilin5
+%define scilin 1
+%define scilin5 1
+%endif
+%if 0%{?scilin_version} >= 600 && 0%{?scilin_version} <= 699
+%define dist scilin6
+%define scilin 1
+%define scilin6 1
+%endif
+%if 0%{?scilin_version} >= 700 && 0%{?scilin_version} <= 799
+%define dist scilin7
+%define scilin 1
+%define scilin7 1
+%endif
+
+%define fedora 0
+%define fc20 0
+%define fc21 0
+%if 0%{?fedora_version} == 20
+%define dist fc20
+%define fc20 1
+%define fedora 1
+%endif
+%if 0%{?fedora_version} == 21
+%define dist fc21
+%define fc21 1
+%define fedora 1
+%endif
+
+%define suse 0
+%define osu122 0
+%define osu123 0
+%define osu131 0
+%define osu132 0
+%define osufactory 0
+%if 0%{?suse_version} == 1220
+%define dist osu122
+%define osu122 1
+%define suse 1
+%endif
+%if 0%{?suse_version} == 1230
+%define dist osu123
+%define osu123 1
+%define suse 1
+%endif
+%if 0%{?suse_version} == 1310
+%define dist osu131
+%define osu131 1
+%define suse 1
+%endif
+%if 0%{?suse_version} == 1320
+%define dist osu132
+%define osu132 1
+%define suse 1
+%endif
+%if 0%{?suse_version} > 1320
+%define dist osufactory
+%define osufactory 1
+%define suse 1
+%endif
+
+%define sles 0
+%define sles11 0
+%define sles12 0
+%if 0%{?suse_version} == 1110
+%define dist sle11
+%define sles11 1
+%define sles 1
+%endif
+%if 0%{?suse_version} == 1315 
+%define dist sle12
+%define sles12 1
+%define sles 1
+%endif
+
+Summary: Library implementing the document and query analysis for the strus search engine
+Name: strusanalyzer
+Version: 0.0.1
+Release: 0.1
+License: GPLv3
+Group: Development/Libraries/C++
+
+Source: %{name}_%{version}.tar.gz
+
+URL: http://project-strus.net
+
+BuildRoot: %{_tmppath}/%{name}-root
+
+# Build dependencies
+###
+
+# OBS doesn't install the minimal set of build tools automatically
+BuildRequires: gcc
+BuildRequires: gcc-c++
+BuildRequires: cmake
+
+%if %{rhel} || %{centos} || %{scilin} || %{fedora}
+%if %{rhel5} || %{centos5}
+Requires: boost148 >= 1.48
+BuildRequires: boost148-devel >= 1.48
+%else
+Requires: boost >= 1.48
+Requires: boost-thread >= 1.48
+Requires: boost-system >= 1.48
+Requires: boost-date_time >= 1.48
+BuildRequires: boost-devel
+%endif
+%endif
+%if %{suse} || %{sles}
+BuildRequires: boost-devel
+%if %{osu122} || %{osu123} || %{sles11} || %{sles12}
+Requires: libboost_thread1_49_0 >= 1.49.0
+Requires: libboost_system1_49_0 >= 1.49.0
+Requires: libboost_date_time1_49_0 >= 1.49.0
+%endif
+%if %{osu131}
+Requires: libboost_thread1_53_0 >= 1.53.0
+Requires: libboost_system1_53_0 >= 1.53.0
+Requires: libboost_date_time1_53_0 >= 1.53.0
+%endif
+%endif
+
+BuildRequires: strus-devel >= 0.0.1
+Requires: strus >= 0.0.1
+
+# Check if 'Distribution' is really set by OBS (as mentioned in bacula)
+%if ! 0%{?opensuse_bs}
+Distribution: %{dist}
+%endif
+
+Packager: Patrick Frey 
+
+%description
+Library implementing the document and query analysis for the strus search engine
+
+%package devel
+Summary: strusAnalyzer development files
+Group: Development/Libraries/C++
+
+%description devel
+The libraries and header files used for development with strusAnalyzer.
+
+Requires: %{name} >= %{version}-%{release}
+
+%prep
+%setup
+
+%build
+
+mkdir build
+cd build
+cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=Release -DLIB_INSTALL_DIR=%{_libdir} ..
+make %{?_smp_mflags}
+
+%install
+
+cd build
+make DESTDIR=$RPM_BUILD_ROOT install
+
+# TODO: avoid building this stuff in cmake. how?
+rm -rf $RPM_BUILD_ROOT%{_libdir}/debug
+rm -rf $RPM_BUILD_ROOT%{_prefix}/src/debug
+
+%clean
+rm -rf $RPM_BUILD_ROOT
+
+%check
+cd build
+make test
+
+%files
+%defattr( -, root, root )
+%dir %{_libdir}/strus
+%{_libdir}/strus/libstrus_analyzer.so.0.0.1
+%{_libdir}/strus/libstrus_analyzer.so.0.0
+%{_libdir}/strus/libstrus_normalizer_snowball.so.0.0.1
+%{_libdir}/strus/libstrus_normalizer_snowball.so.0.0
+%{_libdir}/strus/libstrus_tokenizer_punctuation.so.0.0.1
+%{_libdir}/strus/libstrus_tokenizer_punctuation.so.0.0
+%{_libdir}/strus/libstrus_segmenter_textwolf.so.0.0.1
+%{_libdir}/strus/libstrus_segmenter_textwolf.so.0.0
+%{_libdir}/strus/libstrus_normalizer_charconv.so.0.0.1
+%{_libdir}/strus/libstrus_normalizer_charconv.so.0.0
+%{_libdir}/strus/libstrus_tokenizer_word.so.0.0.1
+%{_libdir}/strus/libstrus_tokenizer_word.so.0.0
+%{_libdir}/strus/libstrus_textproc.so.0.0.1
+%{_libdir}/strus/libstrus_textproc.so.0.0
+%{_libdir}/strus/libstrus_normalizer_dictmap.so.0.0.1
+%{_libdir}/strus/libstrus_normalizer_dictmap.so.0.0
+%{_libdir}/strus/libstemmer.so
+
+%files devel
+%{_libdir}/strus/libstrus_analyzer.so
+%{_libdir}/strus/libstrus_normalizer_snowball.so
+%{_libdir}/strus/libstrus_tokenizer_punctuation.so
+%{_libdir}/strus/libstrus_segmenter_textwolf.so
+%{_libdir}/strus/libstrus_normalizer_charconv.so
+%{_libdir}/strus/libstrus_tokenizer_word.so
+%{_libdir}/strus/libstrus_textproc.so
+%{_libdir}/strus/libstrus_normalizer_dictmap.so
+%dir %{_includedir}/strus
+%{_includedir}/strus/*.hpp
+%dir %{_includedir}/strus/lib
+%{_includedir}/strus/lib/*.hpp
+%dir %{_includedir}/strus/analyzer
+%{_includedir}/strus/analyzer/*.hpp
+
+%changelog
+* Fri Mar 20 2015 Patrick Frey  0.0.1-0.1
+- preliminary release
+

OS X: Giving up for now :)

Hi Patrick,

Looks like StrusModule wants to depend on strus_private_utils, and I think I can see how to do that, but by starting to cutpaste stuff around which doesn't feel like the right approach.

If you like, I can provide you with SSH access to a 24/7 OS X box. Does that sound useful?

Thanks,

David

Required order of definitions when using sub content with segmenter switch

A sub content in the document analyzer that switches to a different segmenter has to appear before any definition using selections of this sub content. If you do not follow this order restriction then the selection will be empty.

At least an error message as hint would help a lot.

adding a segmenter requires changes across repos

Adding a segmenter TSV requires extending in src/analyzer/libstrus_analyzer_objbuild.cpp in strusAnalyzer and in src/analyzerObjectBuilder.cpp in strusModule.

patrickfrey / strusanalyzer Goto Github PK

strusanalyzer's People

Stargazers

Watchers

strusanalyzer's Issues

Recommend Projects

Recommend Topics

Recommend Org