iresearch-toolkit / iresearch

IResearch is a cross-platform, high-performance search analytics library written entirely in C++, with a focus on pluggability of different ranking/similarity models

Home Page: https://iresearch-toolkit.github.io/iresearch/

License: Other

CMake 1.41% C++ 97.25% Shell 0.19% Python 1.10% SWIG 0.05%

search-engine relevant-search tf-idf bm25 ranking analytics

iresearch's Introduction

!!! THE PROJECT IS ARCHIVED AND NO LONGER MAINTAINED !!!

IResearch search engine

Version 1.3

Overview
High level architecture and main concepts
Build
Pyresearch
Included 3rd party dependencies
External 3rd party dependencies
Query filter building blocks
Supported compilers
License

Overview

The IResearch library is meant to be treated as a standalone index that is capable of both indexing and storing individual values verbatim. Indexed data is treated on a per-version/per-revision basis, i.e. an existing data version/revision is never modified, and updates/removals are treated as new versions/revisions of that data. This allows for trivial multi-threaded read/write operations on the index. The index exposes its data processing functionality via a multi-threaded 'writer' interface that treats each document abstraction as a collection of fields to index and/or store. The index exposes its data retrieval functionality via a 'reader' interface that returns records from an index matching a specified query. The queries themselves are query trees built directly from the query building blocks available in the API. The querying infrastructure provides the capability of ordering the result set by one or more ranking/scoring implementations. The ranking/scoring implementation logic is plugin-based and lazily initialized at runtime as needed, allowing custom ranking/scoring logic to be added without recompiling the IResearch library.

High level architecture and main concepts

Index

An index consists of multiple independent parts, called segments, and index metadata. Index metadata stores information about the active index segments for a particular index version/revision. Each index segment is an index itself and consists of the following logical components:

segment metadata
field metadata
term dictionary
postings lists
list of deleted documents
stored values

Read/write access to the components is carried out via plugin-based formats. An index may contain segments created using different formats.

Document

A database record is represented as an abstraction called a document. A document is a collection of indexed/stored fields. To be processed, each field must satisfy at least one of the IndexedField or StoredField concepts.

IndexedField concept

For type T to be IndexedField, the following conditions have to be satisfied for an object m of type T:

Expression	Requires	Effects
`m.name()`	The output type must be convertible to `irs::string_ref`	A value used as the key name.
`m.get_tokens()`	The output type must be convertible to `irs::token_stream*`	A token stream used to populate the index during the inversion procedure. If the value is `nullptr`, the field is treated as non-indexed.
`m.index_features()`	The output type must be implicitly convertible to `irs::IndexFeatures`	A set of features requested for evaluation during indexing. For example, it may request processing of positions and frequencies. The evaluated information can later be used during querying and scoring.
`m.features()`	The output type must be convertible to `const irs::flags&`	A set of user-supplied features to be associated with a field. For example, it may request storing of field norms. The stored information can later be used during querying and scoring.
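
The requirements in the table above can be checked at compile time. The following is a minimal sketch only: the types in the `demo` namespace (`string_ref`, `token_stream`, `IndexFeatures`, `flags`) are simplified stand-ins for the `irs::` types named in the table, and the `is_indexed_field` trait and `title_field` class are hypothetical names introduced for illustration, not part of the IResearch API.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>

namespace demo {

// Stand-ins for the irs:: types named in the table (assumptions, not the real API).
using string_ref = std::string_view;
struct token_stream {};
enum class IndexFeatures : std::uint32_t { NONE = 0, FREQ = 1, POS = 2 };
using flags = std::set<std::string>;

// Hypothetical trait: false unless T provides all four expressions.
template <typename T, typename = void>
struct is_indexed_field : std::false_type {};

// Selected when the four expressions are well-formed; the value then
// depends on whether each result converts to the required type.
template <typename T>
struct is_indexed_field<T, std::void_t<
    decltype(std::declval<T&>().name()),
    decltype(std::declval<T&>().get_tokens()),
    decltype(std::declval<T&>().index_features()),
    decltype(std::declval<T&>().features())>>
    : std::bool_constant<
          std::is_convertible_v<decltype(std::declval<T&>().name()), string_ref> &&
          std::is_convertible_v<decltype(std::declval<T&>().get_tokens()), token_stream*> &&
          std::is_convertible_v<decltype(std::declval<T&>().index_features()), IndexFeatures> &&
          std::is_convertible_v<decltype(std::declval<T&>().features()), const flags&>> {};

// A minimal field type that satisfies the checks above.
class title_field {
 public:
  string_ref name() const { return "title"; }
  token_stream* get_tokens() { return &stream_; }  // nullptr would mean "not indexed"
  IndexFeatures index_features() const { return IndexFeatures::FREQ; }
  const flags& features() const { return features_; }

 private:
  token_stream stream_;
  flags features_;
};

// A type without the required members, used as a negative example.
struct not_a_field {};

static_assert(is_indexed_field<title_field>::value, "title_field models the concept");
static_assert(!is_indexed_field<not_a_field>::value, "not_a_field lacks the members");

}  // namespace demo
```

A trait like this turns a missing or mistyped member into a compile-time error at the point of use rather than a long template diagnostic inside the indexing code.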

StoredField concept

For type T to be StoredField, the following conditions have to be satisfied for an object m of type T:

Expression	Requires	Effects
`m.name()`	The output type must be convertible to `irs::string_ref`	A value used as the key name.
`m.write(irs::data_output& out)`	The output type must be convertible to bool.	One may write arbitrary data to the stream denoted by `out` in order to retrieve the written value later using the index_reader API. If nothing has been written but the returned value is `true`, the stored value is treated as a flag. If the returned value is `false`, nothing is stored even if something has been written to the `out` stream.
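
The three outcomes of `write()` described above (payload stored, flag stored, nothing stored) can be sketched as follows. This is an illustration under stated assumptions: `demo::data_output` is an in-memory stand-in for `irs::data_output`, and `store()` and the three field types are hypothetical helpers, not IResearch code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

namespace demo {

// Stand-in for irs::data_output (assumption): collects written bytes in memory.
struct data_output {
  std::vector<std::uint8_t> bytes;
  void write_bytes(const void* data, std::size_t size) {
    const auto* p = static_cast<const std::uint8_t*>(data);
    bytes.insert(bytes.end(), p, p + size);
  }
};

// Stores its string payload verbatim.
struct text_field {
  std::string value;
  std::string_view name() const { return "body"; }
  bool write(data_output& out) const {
    out.write_bytes(value.data(), value.size());
    return true;
  }
};

// Writes nothing but returns true: the stored value acts as a flag.
struct flag_field {
  std::string_view name() const { return "is_public"; }
  bool write(data_output&) const { return true; }
};

// Writes something but returns false: nothing must be stored.
struct rejected_field {
  std::string_view name() const { return "scratch"; }
  bool write(data_output& out) const {
    out.write_bytes("x", 1);
    return false;
  }
};

// Hypothetical helper applying the rules from the table:
// std::nullopt means nothing stored, an empty vector means a flag,
// anything else is the stored payload.
template <typename StoredField>
std::optional<std::vector<std::uint8_t>> store(const StoredField& field) {
  data_output out;
  if (!field.write(out)) {
    return std::nullopt;  // discard anything written to `out`
  }
  return std::move(out.bytes);
}

}  // namespace demo
```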

Writer

A single instance per-directory object that is used for indexing data. Data may be indexed on a per-document basis or sourced from another reader for trivial directory merge functionality. Each commit() of a writer produces a new version/revision of the view of the data in the corresponding directory. Additionally, the interface provides directory defragmentation capabilities to allow compacting multiple smaller version/revision segments into larger, more compact representations. A writer supports two-phase transactions via the begin()/commit()/rollback() methods.

Reader

A reusable/refreshable view of an index at a given point in time. Multiple readers can use the same directory and may point to different versions/revisions of data in the said directory.
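
The interplay between a writer's two-phase transactions and point-in-time readers can be modelled with immutable revisions. The sketch below is a toy in-memory model, not the IResearch implementation: the `demo::writer` class, its `insert()` and `open_reader()` members, and the `snapshot` alias are hypothetical, and only the method names begin()/commit()/rollback() come from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

namespace demo {

// One immutable revision of the data (stand-in for an index revision).
using revision = std::vector<std::string>;
using snapshot = std::shared_ptr<const revision>;

class writer {
 public:
  writer() : committed_(std::make_shared<const revision>()) {}

  // Buffer a document; it stays invisible to readers until commit().
  void insert(std::string doc) { buffered_.push_back(std::move(doc)); }

  // Phase one: build the next revision without publishing it.
  bool begin() {
    if (pending_ || buffered_.empty()) return false;
    auto next = std::make_shared<revision>(*committed_);  // copy, never modify
    next->insert(next->end(), buffered_.begin(), buffered_.end());
    buffered_.clear();
    pending_ = std::move(next);
    return true;
  }

  // Phase two: publish the prepared revision (runs begin() if needed).
  void commit() {
    if (!pending_ && !begin()) return;  // nothing to commit
    committed_ = std::move(pending_);
    pending_.reset();
  }

  // Abandon the prepared revision and any buffered documents.
  void rollback() {
    pending_.reset();
    buffered_.clear();
  }

  // Readers take a snapshot of the latest committed revision.
  snapshot open_reader() const { return committed_; }

 private:
  snapshot committed_;
  std::shared_ptr<revision> pending_;
  std::vector<std::string> buffered_;
};

}  // namespace demo
```

Because a committed revision is never modified, a reader obtained before a commit keeps seeing its own revision, while a reader opened afterwards sees the new one.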

Build prerequisites

CMake

v3.10 or later

Boost

v1.57.0 or later (headers only)

set environment

BOOST_ROOT=<path-to>/boost_1_57_0

Lz4

install (*nix)

make
make install

or point LZ4_ROOT at the source directory to build together with IResearch

install (win32)

If compiling IResearch with /MT add add_definitions("/MTd") to the end of cmake_unofficial/CMakeLists.txt since cmake will ignore the command line argument -DCMAKE_C_FLAGS=/MTd

mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=<install-path> -DBUILD_STATIC_LIBS=on -G "Visual Studio 17 2022" -Ax64 ../contrib/cmake_unofficial
cmake --build .
cmake --build . --target install

or point LZ4_ROOT at the source directory to build together with IResearch

set environment

LZ4_ROOT=<install-path>

win32 binaries also available in:

ICU

v53 or higher

install (*nix)

./configure --disable-samples --disable-tests --enable-static --srcdir="$(pwd)" --prefix=<install-path> --exec-prefix=<install-path>
make install

or point ICU_ROOT at the source directory to build together with IResearch or via the distributions' package manager: libicu

install (win32)

look for link: "ICU4C Binaries"

set environment

ICU_ROOT=<path-to-icu>

Snowball

install (*nix)

the custom CMakeLists.txt is intended to be used with snowball v2.0.0 and later versions. At least it was tested to work on commit 53739a805cfa6c77ff8496dc711dc1c106d987c1

git clone https://github.com/snowballstem/snowball.git
mkdir build && cd build
cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -G "Unix Makefiles" ..
cmake --build .
cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -G "Unix Makefiles" ..
cmake --build .

or point SNOWBALL_ROOT at the source directory to build together with IResearch or via the distributions' package manager: libstemmer

install (win32)

the custom CMakeLists.txt was based on revision 5137019d68befd633ce8b1cd48065f41e77ed43e later versions may be used at your own risk of compilation failure

git clone https://github.com/snowballstem/snowball.git
git reset --hard adc028f3ae646623bda2f99191fe9dc3287a909b
mkdir build && cd build
set PATH=%PATH%;<path-to>/build/Debug
cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -G "Visual Studio 12 2013" -Ax64 ..
cmake --build .
cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -G "Visual Studio 12 2013" -Ax64 ..
cmake --build .

or point SNOWBALL_ROOT at the source directory to build together with IResearch

For static builds:

in MSVC open: build/snowball.sln

set: stemmer -> Properties -> Configuration Properties -> C/C++ -> Code Generation -> Runtime Library = /MTd

BUILD -> Build Solution

set environment

SNOWBALL_ROOT=<path-to-snowball>

VelocyPack

point VPACK_ROOT at the source directory to build together with IResearch

Google Test

install (*nix)

mkdir build && cd build
cmake ..
make

or point GTEST_ROOT at the source directory to build together with IResearch

install (win32)

mkdir build && cd build
cmake -G "Visual Studio 12 2013" -Ax64 -Dgtest_force_shared_crt=ON -DCMAKE_DEBUG_POSTFIX="" ..
cmake --build .
mv Debug ../lib

or point GTEST_ROOT at the source directory to build together with IResearch

set environment

GTEST_ROOT=<path-to-gtest>

Stopword list (for use with analysis::text_analyzer)

download any number of lists of stopwords, e.g. from: https://github.com/snowballstem/snowball-website/tree/master/algorithms/*/stop.txt https://code.google.com/p/stop-words/

install

mkdir
for each language, (e.g. "c", "en", "es", "ru"), create a corresponding subdirectory (a directory name has 2 letters except the default locale "c" which has 1 letter)
place the files with stopwords, (utf8 encoded with one word per line, any text after the first whitespace is ignored), in the directory corresponding to its language (multiple files per language are supported and will be interpreted as a single list)

set environment

IRESEARCH_TEXT_STOPWORD_PATH=<path-to-stopword-lists>

If the variable IRESEARCH_TEXT_STOPWORD_PATH is left unset then locale specific stopword-list subdirectories are deemed to be located in the current working directory
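
The stopword file format described above (one word per line, any text after the first whitespace ignored, multiple files per language merged into one list) can be sketched as below. This is an illustrative parser, not the one used by analysis::text_analyzer; the `load_stopwords` function is a hypothetical name, and skipping lines that are empty or start with whitespace is an assumption of this sketch.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>
#include <utility>

namespace demo {

// Reads one stopword per line from `in` and adds it to `out`.
// Everything after the first whitespace character on a line is ignored.
// Lines that yield an empty word are skipped (assumption of this sketch).
inline void load_stopwords(std::istream& in, std::unordered_set<std::string>& out) {
  std::string line;
  while (std::getline(in, line)) {
    std::string word = line.substr(0, line.find_first_of(" \t\r"));
    if (!word.empty()) {
      out.insert(std::move(word));
    }
  }
}

}  // namespace demo
```

Calling `load_stopwords` once per file with the same set gives the "multiple files are interpreted as a single list" behaviour, since duplicates collapse in the set.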

Build

git clone <IResearch code repository>/iresearch.git iresearch
cd iresearch
mkdir build && cd build

generate build file (*nix):

cmake -DCMAKE_BUILD_TYPE=[Debug|Release|Coverage] -G "Unix Makefiles" ..

if some libraries are not found by the build then set the needed environment variables (e.g. BOOST_ROOT, BOOST_LIBRARYDIR, LZ4_ROOT, OPENFST_ROOT, GTEST_ROOT)

if ICU or Snowball from the distribution paths are not found, the following additional environment variables might be required: ICU_ROOT_SUFFIX=x86_64-linux-gnu SNOWBALL_ROOT_SUFFIX=x86_64-linux-gnu

generate build file (win32):

cmake -G "Visual Studio 12 2013" -Ax64 ..

If some libraries are not found by the build then set the needed environment variables (e.g. BOOST_ROOT, BOOST_LIBRARYDIR, LZ4_ROOT, OPENFST_ROOT, GTEST_ROOT)

set Build Identifier for this build (optional)

echo "<build_identifier>" > BUILD_IDENTIFIER

build library:

cmake --build .

test library:

cmake --build . --target iresearch-check

install library:

cmake --build . --target install

code coverage:

cmake --build . --target iresearch-coverage

Pyresearch

There is a Python wrapper for IResearch. The wrapper gives access to the directory reader object. For a usage example see /python/scripts

Build

To build Pyresearch, the SWIG generator must be available. Add -DUSE_PYRESEARCH=ON to the cmake command line to generate the Pyresearch targets

Install

Run target pyresearch-install

win32 install notes:

Some versions of the ICU installer seem to fail to make all ICU DLLs available through the PATH environment variable; manual adjustment may be needed.

(*nix) install notes:

Shared version of libiresearch is used. Install IResearch before running Pyresearch.

External 3rd party dependencies

External 3rd party dependencies must be made available to the IResearch library separately. They may either be installed through the distribution package management system or built from source, with the appropriate environment variables set accordingly.

Boost

v1.57.0 or later (locale system thread) used for functionality not available in the STL (excluding functionality available in ICU)

Lz4

used for compression/decompression of byte/string data

ICU

used by analyzers for parsing, transforming and tokenising string data

Snowball

used by analyzers for computing word stems (i.e. roots) for more flexible matching; matching of words from languages not supported by 'snowball' is done verbatim

Google Test

used for writing tests for the IResearch library

VelocyPack

used for JSON serialization/deserialization

Stopword list

used by analysis::text_analyzer for filtering out noise words that should not impact text ranking, e.g. for 'en' these are usually 'a', 'the', etc.

download any number of lists of stopwords, e.g. from: https://github.com/snowballstem/snowball-website/tree/master/algorithms/*/stop.txt https://code.google.com/p/stop-words/

or create a custom language-specific list of stopwords

place the files with stopwords, (utf8 encoded with one word per line, any text after the first whitespace is ignored), in the directory corresponding to its language (multiple files per language are supported and will be interpreted as a single list)

Query filter building blocks

Filter	Description
irs::by_edit_distance	for filtering of values based on Levenshtein distance
irs::by_granular_range	for faster filtering of numeric values within a given range, with the possibility of specifying open/closed ranges
irs::by_ngram_similarity	for filtering of values based on NGram model
irs::by_phrase	for word-position-sensitive filtering of values, with the possibility of skipping selected positions
irs::by_prefix	for filtering of exact value prefixes
irs::by_range	for filtering of values within a given range, with the possibility of specifying open/closed ranges
irs::by_same_position	for term-insertion-order sensitive filtering of exact values
irs::by_term	for filtering of exact values
irs::by_terms	for filtering of exact values by a set of specified terms
irs::by_wildcard	for filtering of values based on matching pattern
irs::ByNestedFilter	for filtering of documents based on matching pattern on its sub-documents
irs::And	boolean conjunction of multiple filters, influencing document ranks/scores as appropriate
irs::Or	boolean disjunction of multiple filters, influencing document ranks/scores as appropriate (including "minimum match" functionality)
irs::Not	boolean negation of multiple filters
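
As background for irs::by_edit_distance, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The sketch below is the textbook dynamic-programming formulation over bytes, shown only to make the measure concrete; it is not how IResearch evaluates the filter, and the `levenshtein` and `within_distance` functions are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string_view>
#include <utility>
#include <vector>

namespace demo {

// Textbook Levenshtein distance using two rolling rows of the DP table.
// Operates on bytes, so multi-byte UTF-8 characters count per byte.
inline std::size_t levenshtein(std::string_view a, std::string_view b) {
  std::vector<std::size_t> prev(b.size() + 1);
  std::vector<std::size_t> cur(b.size() + 1);
  std::iota(prev.begin(), prev.end(), std::size_t{0});  // distance from empty prefix of a

  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = i;  // delete the first i characters of a
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const std::size_t substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, substitution});
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}

// True when `candidate` is within `max_distance` edits of `term`.
inline bool within_distance(std::string_view term, std::string_view candidate,
                            std::size_t max_distance) {
  return levenshtein(term, candidate) <= max_distance;
}

}  // namespace demo
```

Comparing every indexed term this way would be slow on a large term dictionary, which is why a dedicated filter is useful.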

Supported compilers

GCC: 10+
MSVC: 2019+
Clang: 12+

License

This software is provided under the Apache 2.0 Software license provided in the LICENSE.md file. Licensing information for third-party products used by IResearch search engine can be found in THIRD_PARTY_README.md

iresearch's Issues

IRES-177: Modify codebase according to the coding guidelines

Jira issue originally created by user @gnusi:

IRES-27: Concurrent indexing

Jira issue originally created by user @gnusi:

IRES-26: Tasks related to "search"

Jira issue originally created by user @gnusi:

IRES-325: Use size_t where possible

Jira issue originally created by user @gnusi:

Use size_t where possible

Remove ugly static casts where possible, e.g.:

docid and docmax
writevint and size_t

Search for:
uint32_t(
static_cast<uint32_t>(

IRES-230: Jasmine Tests shell-foxx-manager-install-spec.js, shell-query-timecritical-spec.js, shell-foxx-repository-spec.js, shell-foxx-query-spec.js, shell-foxx-model-events-spec.js failed

Jira issue originally created by user belyaa2:

Jasmine test failed with multiple errors:

Main errors:

Manifest 'js/apps/_db/_system/unittest/broken/APP/manifest.json' does not provide required attribute 'name'
Manifest file 'js/apps/_db/_system/unittest/broken/APP/manifest.json' is invald: The App name can only contain a to z, A to Z, 0-9, '-' and '_'
Manifest file 'js/apps/_db/_system/unittest/broken/APP/manifest.json' is invald: The version requires the format: .., all have to be integer numbers.
JavaScript exception in file 'js/apps/_db/_system/unittest/broken/APP/broken-controller.js' at 3,8: SyntaxError: Unexpected identifier
Cannot compute Foxx application routes: [ArangoError 3006: File: broken-controller.js syntax error in script SyntaxError: Unexpected identifier
Cannot compute Foxx application routes: [ArangoError 3007: Route has to start with /
Cannot compute Foxx application routes: [ArangoError 14: file not found: js/apps/_db/_system/unittest/broken/APP/illegal/file/name/]
Cannot compute Foxx application routes: [ArangoError 3005: failed to execute script File: broken-controller.js Error: Error: This is an error from the controller.]
Setup not possible for mount '/unittest/broken': Error: This is an error from the setup.
JavaScript exception in file 'js/apps/_db/_system/unittest/broken/APP/broken-exports.js' at 3,8: SyntaxError: Unexpected identifier
JavaScript exception in file 'js/apps/_db/_system/unittest/broken/APP/broken-setup.js' at 3,8: SyntaxError: Unexpected identifier
Setup not possible for mount '/unittest/broken': SyntaxError: Unexpected identifier
Setup not possible for mount '/unittest/broken'
Cannot compute Foxx application routes: [ArangoError 14: file not found: js/apps/_db/_system/unittest/broken/APP/does-not-exist.js]
Setup not possible for mount '/unittest/broken'
JavaScript exception in file './js/server/modules/org/arangodb/arango-statement.js' at 86,45: [ArangoError 1500: query killed (while executing)

The errors produce log output like the following:

Running Jasmine Tests: ./js/common/tests/shell-foxx-manager-install-spec.js, ./js/common/tests/shell-query-timecritical-spec.js, ./js/server/tests/shell-foxx-repository-spec.js, ./js/server/tests/shell-foxx-query-spec.js, ./js/server/tests/shell-foxx-model-events-spec.js

..2015-12-24T12:11:31Z [225] ERROR Manifest 'js/apps/_db/_system/unittest/broken/APP/manifest.json' does not provide required attribute 'name'
2015-12-24T12:11:31Z [225] ERROR Manifest 'js/apps/_db/_system/unittest/broken/APP/manifest.json' does not provide required attribute 'version'
2015-12-24T12:11:31Z [225] ERROR Manifest file 'js/apps/_db/_system/unittest/broken/APP/manifest.json' is invald: missing manifest attribute
2015-12-24T12:11:31Z [225] ERROR Error:
2015-12-24T12:11:31Z [225] ERROR at checkManifest (./js/server/modules/org/arangodb/foxx/manager.js:259:13)
2015-12-24T12:11:31Z [225] ERROR at validateManifestFile (./js/server/modules/org/arangodb/foxx/manager.js:301:7)
2015-12-24T12:11:31Z [225] ERROR at appConfig (./js/server/modules/org/arangodb/foxx/manager.js:433:17)
2015-12-24T12:11:31Z [225] ERROR at createApp (./js/server/modules/org/arangodb/foxx/manager.js:447:18)
2015-12-24T12:11:31Z [225] ERROR at _scanFoxx (./js/server/modules/org/arangodb/foxx/manager.js:726:15)
2015-12-24T12:11:31Z [225] ERROR at db._executeTransaction.action (./js/server/modules/org/arangodb/foxx/manager.js:871:17)
2015-12-24T12:11:31Z [225] ERROR at [object ArangoDatabase].ArangoDatabase._executeTransaction (./js/server/modules/org/arangodb/arango-database.js:142:10)
2015-12-24T12:11:31Z [225] ERROR at _install (./js/server/modules/org/arangodb/foxx/manager.js:866:10)
2015-12-24T12:11:31Z [225] ERROR at Object.install (./js/server/modules/org/arangodb/foxx/manager.js:921:15)
2015-12-24T12:11:31Z [225] ERROR at Object. (./js/common/tests/shell-foxx-manager-install-spec.js:93:21)
2015-12-24T12:11:31Z [225] ERROR at attemptSync (./js/common/modules/jasmine/core.js:1510:12)
2015-12-24T12:11:31Z [225] ERROR at QueueRunner.run (./js/common/modules/jasmine/core.js:1498:9)
2015-12-24T12:11:31Z [225] ERROR at QueueRunner.execute (./js/common/modules/jasmine/core.js:1485:10)
2015-12-24T12:11:31Z [225] ERROR at Spec.Env.queueRunnerFactory (./js/common/modules/jasmine/core.js:518:35)
2015-12-24T12:11:31Z [225] ERROR at Spec.execute (./js/common/modules/jasmine/core.js:306:10)
2015-12-24T12:11:31Z [225] ERROR at Object. (./js/common/modules/jasmine/core.js:1708:37)
2015-12-24T12:11:31Z [225] ERROR at attemptAsync (./js/common/modules/jasmine/core.js:1520:12)
2015-12-24T12:11:31Z [225] ERROR at QueueRunner.run (./js/common/modules/jasmine/core.js:1496:16)
2015-12-24T12:11:31Z [225] ERROR at next (./js/common/modules/jasmine/core.js:1517:37)
2015-12-24T12:11:31Z [225] ERROR at complete (./js/common/modules/jasmine/core.js:333:9)
2015-12-24T12:11:31Z [225] ERROR at QueueRunner.clearStack (./js/common/modules/jasmine/core.js:506:9)
2015-12-24T12:11:31Z [225] ERROR at QueueRunner.run (./js/common/modules/jasmine/core.js:1505:12)
2015-12-24T12:11:31Z [225] ERROR at QueueRunner.execute (./js/common/modules/jasmine/core.js:1485:10)
2015-12-24T12:11:31Z [225] ERROR at Spec.Env.queueRunnerFactory (./js/common/modules/jasmine/core.js:518:35)
2015-12-24T12:11:31Z [225] ERROR at Spec.execute (./js/common/modules/jasmine/core.js:306:10)
2015-12-24T12:11:31Z [225] ERROR at Object. (./js/common/modules/jasmine/core.js:1708:37)
...

The build was marked as successful in error (I suspect because of test crashes).

See ArangoDBUnitTestsShellServer 433 log for details.

IRES-277: Custom memory allocators

Jira issue originally created by user @gnusi:

Use custom memory allocators for directories, containers, etc

IRES-349: Compact FST

Jira issue originally created by user @gnusi:

In order to reduce term dictionary index memory footprint we can use compact representation of FST

IRES-97: Support big endian platforms

Jira issue originally created by user @gnusi:

IRES-133: Near realtime search

Jira issue originally created by user @gnusi:

At the moment, data is searchable only after a commit operation, but it should be possible to query pre-inverted in-memory data too

IRES-360: Use std::unordered_map in fst_builder

Jira issue originally created by user @gnusi:

IRES-210: Reduce memory footprint of the term dictionary index

Jira issue originally created by user @gnusi:

Currently we are using VectorFst - vector based fst implementation. It looks like it will be better to use more memory efficient CompactFst

IRES-61: Advanced searching features

Jira issue originally created by user @gnusi:

IRES-364: README.md issues

Jira issue originally created by user belyaa2:

README.md contains multiple issues:

specify gcc version

add GCC dependency to the `Build prerequisites` section (point out that something like `sudo yum groupinstall 'Development Tools'` should be run, because a plain gcc installation lacks g++; see http://unix.stackexchange.com/questions/140350/linux-g-command-not-found for details)

target `iresearch-check` in command `cmake --build . --target iresearch-check` does not work, please fix corresponding issue or change command to one with `cobertura` suffix which works fine

remove `bash` command from examples because it does not work

add `sudo` command to all commands where installation of built packages is performed (lz4 and others)

change link to https://github.com/lz4/lz4 for LZ4 paragraph name because current one is not correct

add more details about lz4 installation because only make commands do not reflect real installation procedure

add git client installation as a prerequisite because the git repo needs to be cloned

lz4 requires m4, so maybe such a step should be added (but m4 installation is an lz4 installation step)

path `GTEST_ROOT=/var/lib/jenkins/tools/gtest-1.7.0/` should be changed like others (e.g. `<path to ...>` should be used instead of hard coded one)

`git clone <iResearch code repository>/iresearch.git iresearch` should be changed to actual github path

add description to use `<build_identifier>` (whether it is mandatory and what should be passed)

`LZ4_ROOT=/usr/local` should be used because of `install` step

`mkdir $GTEST_ROOT/lib` and `cp <path>/googletest-master/googletest//include/gtest/\**.h <path>/googletest-master/googletest/include/` should be added to the current gtest installation procedure (for example: `mkdir googletest/googletest/lib && cp googletest/googletest/build/libgtest* googletest/googletest/lib/ && cp googletest/googletest/include/gtest/\**.h googletest/googletest/include/`)

installation for win platform should be described as separate part of readme

full build of coverage version should be described separately (not as part of main build procedure)

steps of build procedure should be numbered because current text is hard to read

description of tested environment should be added (OS at least)

change IResearch to IReSearch in name "EMC IResearch search engine"

specify which command `cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -g "Unix Makefiles" ..` or `cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -g "Unix Makefiles" ..` should be run for snowball installation from sources (or describe the shared and static builds in separate sections)

specify command/package to install libstemmer for centos

add some detailed info about gtest installation (specify package to download, extract and show what should be build, e.g. gtest or gmock)

perhaps, we need to build icu dependencies from sources, download package http://site.icu-project.org/download/58#TOC-ICU4C-Download and read docs from the package

describe the libicu installation process step by step and point out which files are needed (for CentOS there are no statically built files in the packages, so ICU needs to be configured with the option `--enable-static` to build the necessary files)

Platforms tested so far: Ubuntu 14.04.3, gcc 4.8.4

CentOS platform testing is in progress.

IRES-312: Create and apply patch for similarity functionality (ensure tests run without failure)

Jira issue originally created by user @gnusi:

IRES-208: format::get_field_writer(...) takes a parameter

Jira issue originally created by user nabatv:

format::get_field_writer(...) takes a parameter

find a way to remove the parameter requirement to the call
parameter used when merging segments to advise that values of attribute values might change on calls to doc_itr->next()
e.g. position attributes (offset/payload) must be refreshed after doc_itr->next()

IRES-305: Move to ArangoDB 3.0

Jira issue originally created by user @gnusi:

IRES-111: Error handling and troubleshooting

Jira issue originally created by user @gnusi:

Define/inspect error codes, exceptions, etc

IRES-60: Search for "similar" documents (more like this)

Jira issue originally created by user @gnusi:

Refer to similarity templates or suitability. Template to query.

IRES-299: Remove PVS studio warnings

Jira issue originally created by user @gnusi:

IRES-302: Levenshtein distance in a phrase query (word permutations)

Jira issue originally created by user @gnusi:

IRES-43: Basic analysis

Jira issue originally created by user @gnusi:

IRES-339: Implement defragment policy in order to reduce number of segments

Jira issue originally created by user @gnusi:

IRES-326: Remove MSVC warning from iresearch-tests

Jira issue originally created by user @gnusi:

IRES-155: doc_iterator next should return doc_id value

Jira issue originally created by user @gnusi:

IRES-110: Environment

Jira issue originally created by user @gnusi:

IRES-250: make iresearch-coverage fails

Jira issue originally created by user @gnusi:

Execute command:
make iresearch-coverage

Get the following output (rest):
....
Writing data to coverage.info.cleaned
Summary coverage rate:
lines......: 65.1% (10285 of 15802 lines)
functions..: 80.7% (2632 of 3261 functions)
branches...: no data found
Reading data file coverage.info.cleaned
Found 142 entries.
Found common filename prefix "/home/sk/git/iresearch"
Writing .css and .png files.
Generating output.
Processing file build/core/CMakeFiles/iresearch.dir/iql/position.hh
genhtml: ERROR: cannot read /home/sk/git/iresearch/build/core/CMakeFiles/iresearch.dir/iql/position.hh
make[3]: ***** [CMakeFiles/iresearch-coverage] Error 2
make[2]: ***** [CMakeFiles/iresearch-coverage.dir/all] Error 2
make[1]: ***** [CMakeFiles/iresearch-coverage.dir/rule] Error 2
make: ***** [iresearch-coverage] Error 2

IRES-6: Tasks related to searching

Jira issue originally created by user @gnusi:

IRES-311: Create and apply patch for similarity functionality (similarity index hooks JS)

Jira issue originally created by user @gnusi:

IRES-279: Add flush functionality to index writer

Jira issue originally created by user @gnusi:

Add the ability to flush a segment without committing

IRES-201: Index time similarity model pluggability

Jira issue originally created by user @gnusi:

IRES-362: Code freeze - testing

Jira issue originally created by user @gnusi:

Missing tests:

TF-IDF, BM25 check document order with/without norms

IRES-147: Provide generic query optimiser implementation (SOP-based)

Jira issue originally created by user @gnusi:

IRES-227: Add support for GCC 4.9 + address sanitizer

Jira issue originally created by user @gnusi:

IRES-120: Setup Lz4 version checking into build procedure

Jira issue originally created by user belyaa2:

While setting up a Docker-enabled build environment, I installed the Lz4 library via the package manager for Ubuntu 14.04. The build fails because of an obsolete Lz4 version, but there are no warnings about the Lz4 version being incompatible with the IReSearch lib. Please add version checking to the configuration/build procedure.

Following version is OK (see lz4.h for r131)

/****************************************
*  Version
****************************************/
#define LZ4_VERSION_MAJOR    1    /** for breaking interface changes  **/
#define LZ4_VERSION_MINOR    7    /** for new (non-breaking) interface capabilities **/
#define LZ4_VERSION_RELEASE  1    /** for tweaks, bug-fixes, or development **/
#define LZ4_VERSION_NUMBER (LZ4_VERSION_MAJOR *100*100 + LZ4_VERSION_MINOR *100 + LZ4_VERSION_RELEASE)
int LZ4_versionNumber (void);
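
The version number formula quoted above packs major, minor and release into one integer, which makes a minimum-version check a single comparison. The sketch below reproduces the formula without including lz4.h so that it stands alone; `version_number`, `kMinLz4Version` and `is_supported` are hypothetical names, and treating 1.7.1 as the minimum is an assumption based only on the issue stating that this version works.

```cpp
#include <cassert>

namespace demo {

// Same arithmetic as the LZ4_VERSION_NUMBER macro quoted in the issue.
constexpr int version_number(int major, int minor, int release) {
  return major * 100 * 100 + minor * 100 + release;
}

// Assumption for illustration: 1.7.1 (r131) is taken as the oldest acceptable version.
constexpr int kMinLz4Version = version_number(1, 7, 1);

// True when a packed version number meets the assumed minimum.
constexpr bool is_supported(int packed_version) {
  return packed_version >= kMinLz4Version;
}

static_assert(version_number(1, 7, 1) == 10701, "1.7.1 packs to 10701");

}  // namespace demo
```

With the real header available, the same idea could be expressed as a `static_assert` on `LZ4_VERSION_NUMBER`, so that an outdated library fails the build with a clear message.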

IRES-69: Distributed searching

Jira issue originally created by user @gnusi:

Ability to perform search across different indexes

IRES-158: Basic query optimiser

Jira issue originally created by user @gnusi:

IRES-205: Jaccard similarity measure

Jira issue originally created by user @gnusi:

Sort & filter by Jaccard

IRES-261: Do not create dummy term for "contains" query

Jira issue originally created by user @gnusi:

Better to move such logic into meta4 by introducing new type of query

IRES-28: Ability to store index

Jira issue originally created by user @gnusi:

IRES-338: memory_index_test.profile_bulk_index_multithread_batched failed on mismatched indexed_docs_count

Jira issue originally created by user belyaa2:

[ RUN ] memory_index_test.profile_bulk_index_multithread_batched
Path to timing log: /home/jenkins/workspace/IReSearchMemLeakChecks-static_fast/88/build/bin/iresearch-tests-s_2016_09_09_06_59_08_OaN2QV/memory_index_test/profile_bulk_index_multithread_batched/profile_bulk_index.log
/home/jenkins/workspace/IReSearchMemLeakChecks-static_fast/88/tests/index/index_tests.cpp:689: Failure
Value of: indexed_docs_count
Actual: 99998
Expected: parsed_docs_count
Which is: 100000
[ FAILED ] memory_index_test.profile_bulk_index_multithread_batched (2451501 ms)

See IReSearchMemLeakChecks-static_fast 88 for full details.

IRES-295: index_writer::flush_all() may cause inconsistence of index_meta

Jira issue originally created by user @gnusi:

      SCOPED*LOCK(lock*);

      for (auto metaItr = meta.segments.begin(); metaItr != meta.segments.end();) {
        auto& seg_meta = metaItr->meta;
        document*mask docs*mask;

        read*document_mask(docs_mask, *dir_, seg*meta);

        // write docs_mask if masks added, if all docs are masked then remove segment altogether
        if (add*document_mask_modified_records(docs_mask, seg*meta)) {
          meta.gen_dirty = true;

          if (docs*mask.size() == seg_meta.docs*count) { // remove empty segments
            metaItr = meta.segments.erase(metaItr);
            continue;
          }

          write*document_mask(*dir_, seg_meta, docs*mask);

//!!!!!!!!!! in case of exception after this line, meta may become inconsistent !!!!!!!!!!

          metaItr->filename = std::move(write*segment_meta(*dir_, seg*meta)); // write with new mask
        }

        <ins></ins>metaItr;
      }
    }

    // 'flushed' and 'writers' are filled in parallel above, differing only in scope
    assert(flushed.size() == segment_ctx.size());
    auto metaItr = flushed.begin();

    // write docs_mask if !empty(), if all docs are masked then remove segment altogether
    for (auto ctxItr = segment_ctx.begin(); ctxItr != segment_ctx.end(); ++ctxItr) {
      auto& seg_meta = metaItr->meta;
      auto& seg_ctx = *ctxItr;

      // if have a writer with potential update-replacement records then check if they were seen
      if (seg_ctx.writer) {
        add_document_mask_unused_updates(
          seg_ctx.docs_mask, seg_meta, seg_ctx.writer->docs_context()
        );
      }

      if (seg_ctx.docs_mask.size() == seg_meta.docs_count) { // remove empty segments
        metaItr = flushed.segments.erase(metaItr);
      } else {
        if (!seg_ctx.docs_mask.empty()) { // write non-empty document mask
          write_document_mask(*dir_, seg_meta, seg_ctx.docs_mask);
          metaItr->filename = std::move(write_segment_meta(*dir_, seg_meta)); // write with new mask
        }

        ++metaItr;
      }
    }
      }
    }
  }
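
One common way to avoid the kind of inconsistency flagged in the snippet above is to stage all changes in a separate container and commit them with a non-throwing swap only after every write has succeeded. The following is a minimal sketch of that idea, not the actual IResearch code: `index_segment`, `segment_writer`, `flush_masks`, `failing_writer` and the `masked_counts` parameter are hypothetical simplifications, and the real `index_meta` and `segment_meta` types carry more state than shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the types used in the issue.
struct segment_meta {
  std::string name;
  std::size_t docs_count;
};

struct index_segment {
  std::string filename;
  segment_meta meta;
};

struct index_meta {
  std::vector<index_segment> segments;
  bool gen_dirty = false;
};

// Hypothetical callback: persists a segment's metadata and returns the new
// filename. It is allowed to throw.
using segment_writer = std::string (*)(const segment_meta&);

// Hypothetical writer that always fails, useful for exercising the error path.
inline std::string failing_writer(const segment_meta&) {
  throw std::runtime_error("simulated write failure");
}

// masked_counts[i] is the number of newly masked documents in segment i.
// Either all changes are applied to 'meta' or, if a write throws, none are.
bool flush_masks(index_meta& meta,
                 const std::vector<std::size_t>& masked_counts,
                 segment_writer write_segment) {
  assert(masked_counts.size() == meta.segments.size());

  std::vector<index_segment> staged;
  staged.reserve(meta.segments.size());
  bool modified = false;

  for (std::size_t i = 0; i < meta.segments.size(); ++i) {
    index_segment seg = meta.segments[i];  // copy, the original stays intact

    if (masked_counts[i] != 0) {
      modified = true;

      if (masked_counts[i] == seg.meta.docs_count) {
        continue;  // every document is masked: drop the segment
      }

      seg.filename = write_segment(seg.meta);  // may throw, 'meta' is untouched
    }

    staged.push_back(std::move(seg));
  }

  if (modified) {
    meta.segments.swap(staged);  // commit step, std::vector::swap does not throw
    meta.gen_dirty = true;
  }

  return modified;
}
```

This sketch only protects the in-memory metadata. Files already written to the directory before a failure would still need to be cleaned up or ignored by whatever reads the directory afterwards.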

iresearch's Issues

specify gcc version

add GCC dependency to `Build prerequisites` section (point out that something like `sudo yum groupinstall 'Development Tools'` should be run, because a plain gcc installation lacks g++; see http://unix.stackexchange.com/questions/140350/linux-g-command-not-found for details)

target `iresearch-check` in command `cmake --build . --target iresearch-check` does not work; please fix the corresponding issue or change the command to the one with the `cobertura` suffix, which works fine

remove `bash` command from examples because it does not work

add `sudo` to all commands that install built packages (lz4 and others)

change the link in the LZ4 paragraph heading to https://github.com/lz4/lz4, because the current one is not correct

add more details about lz4 installation, because the `make` commands alone do not reflect the real installation procedure

add git client installation to the prerequisites, because the git repo needs to be cloned

lz4 requires m4, so maybe such a step should be added (but m4 installation is an lz4 installation step)

path `GTEST_ROOT=/var/lib/jenkins/tools/gtest-1.7.0/` should be changed like the others (e.g. `<path to ...>` should be used instead of a hard-coded one)

`git clone <iResearch code repository>/iresearch.git iresearch` should be changed to the actual GitHub path

add a description of how to use `<build_identifier>` (whether it is mandatory and what should be passed)

`LZ4_ROOT=/usr/local` should be used because of the `install` step

installation for the Windows platform should be described in a separate part of the readme

full build of the coverage version should be described separately (not as part of the main build procedure)

steps of the build procedure should be numbered because the current text is hard to read

a description of the tested environment should be added (OS at least)

change IResearch to IReSearch in name "EMC IResearch search engine"

specify which command, `cmake -DENABLE_STATIC=OFF -DNO_SHARED=OFF -g "Unix Makefiles" ..` or `cmake -DENABLE_STATIC=OFF -DNO_SHARED=ON -g "Unix Makefiles" ..`, should be run for snowball installation from sources (or describe the shared and static builds in different sections)

specify the command/package to install libstemmer on CentOS

add more detailed info about gtest installation (specify the package to download and extract, and show what should be built, e.g. gtest or gmock)

perhaps we need to build the ICU dependencies from sources: download the package from http://site.icu-project.org/download/58#TOC-ICU4C-Download and read the docs in the package

describe the libicu installation process step by step and point out which files are needed (for CentOS there are no statically built files in the packages, so ICU needs to be configured with the `--enable-static` option to build the necessary files)
