
cassandra-lucene-index's Introduction

Stratio’s Cassandra Lucene Index

Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide near real-time search, similar to Elasticsearch or Solr, including full text search capabilities and free multivariable, geospatial and bitemporal search. This is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s BigData platform is based.

[Figure: architecture]

Index relevance searches allow you to retrieve the n most relevant results satisfying a search. The coordinator node sends the search to each node in the cluster, each node returns its n best results, and then the coordinator combines these partial results and returns the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses, and page over the filtered results.
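
As an illustrative sketch (table and column names here are hypothetical), a collection column and a clustering key can be mapped in the index schema just like regular columns, assuming, as the plugin documentation describes for CQL collections, that each element of the set is indexed under the column name:

CREATE TABLE posts (
   author TEXT,
   created TIMESTAMP,
   tags SET<TEXT>,
   PRIMARY KEY (author, created)
);

CREATE CUSTOM INDEX posts_index ON posts ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         author: {type: "string"},
         created: {type: "date", pattern: "yyyy/MM/dd"},
         tags: {type: "string"}
      }
   }'
};

SELECT * FROM posts WHERE expr(posts_index, '{filter: {type: "match", field: "tags", value: "cassandra"}}');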

Index filtered searches are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark. Adding Lucene filters to the job's input can dramatically reduce the amount of data to be processed, avoiding a full scan.

[Figure: Spark architecture]

The following benchmark results can give you an idea of the expected performance when combining Lucene indexes with Spark. We run successive queries requesting from 1% to 100% of the stored data. The index performs very well for queries requesting strongly filtered data, but performance decays for less restrictive queries. As the number of records returned by the query increases, we reach a point where the index becomes slower than a full scan. So, the decision to use indexes in your Spark jobs depends on query selectivity, and the trade-off between both approaches depends on the particular use case. Generally, combining Lucene indexes with Spark is recommended for jobs retrieving no more than 25% of the stored data.

[Figure: Spark performance benchmark]

This project is not intended to replace Apache Cassandra denormalized tables, inverted indexes, and/or secondary indexes. It is just a tool to perform certain kinds of queries that are really hard to address using Apache Cassandra's out-of-the-box features, filling the gap between real-time and analytics.

[Figure: OLTP vs OLAP]

More detailed information is available at Stratio’s Cassandra Lucene Index documentation.

Features

Stratio’s Cassandra Lucene Index and its integration with Lucene search technology provide:

  • Full text search (language-aware analysis, wildcard, fuzzy, regexp)
  • Boolean search (and, or, not)
  • Sorting by relevance, column value, and distance
  • Geospatial indexing (points, lines, polygons and their multiparts)
  • Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
  • Geospatial operations (intersects, contains, is within)
  • Bitemporal search (valid and transaction time durations)
  • CQL complex types (list, set, map, tuple and UDT)
  • CQL user defined functions (UDF)
  • CQL paging, even with sorted searches
  • Columns with TTL
  • Third-party CQL-based drivers compatibility
  • Spark and Hadoop compatibility

Not yet supported:

  • Thrift API
  • Legacy compact storage option
  • Indexing counter columns
  • Indexing static columns
  • Other partitioners than Murmur3

Requirements

  • Cassandra (identified by the first three numbers of the plugin version)
  • Java >= 1.8 (OpenJDK and Sun have been tested)
  • Maven >= 3.0

Build and install

Stratio’s Cassandra Lucene Index is distributed as a plugin for Apache Cassandra. Thus, you just need to build a JAR containing the plugin and add it to Cassandra’s classpath:

  • Clone the project: git clone http://github.com/Stratio/cassandra-lucene-index
  • Change to the downloaded directory: cd cassandra-lucene-index
  • Checkout a plugin version suitable for your Apache Cassandra version: git checkout A.B.C.X
  • Build the plugin with Maven: mvn clean package
  • Copy the generated JAR to the lib folder of your compatible Cassandra installation: cp plugin/target/cassandra-lucene-index-plugin-*.jar <CASSANDRA_HOME>/lib/
  • Start/restart Cassandra as usual.

Specific Cassandra Lucene Index versions are targeted at specific Apache Cassandra versions, so cassandra-lucene-index A.B.C.X is intended to be used with Apache Cassandra A.B.C, e.g. cassandra-lucene-index:3.0.7.1 for cassandra:3.0.7. Please note that production-ready releases are version tags (e.g. 3.0.6.3); don't use branch-X or master branches in production.

Alternatively, patching can also be done with a Maven profile, specifying the path of your Cassandra installation. This task also deletes previous versions of the plugin's JAR from the CASSANDRA_HOME/lib/ directory:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_HOME>

If you don’t have an installed version of Cassandra, there is also an alternative profile to let Maven download and patch the proper version of Apache Cassandra:

mvn clean package -Pdownload_and_patch -Dcassandra_home=<CASSANDRA_HOME>

Now you can run Cassandra and do some tests using the Cassandra Query Language:

<CASSANDRA_HOME>/bin/cassandra -f
<CASSANDRA_HOME>/bin/cqlsh

The Lucene index files will be stored in the same directories as Cassandra's data files. The default data directory is /var/lib/cassandra/data, and each index is placed next to the SSTables of its indexed column family.

Remember that if you use geo shape search, you need to include the JTS JAR.

For more details about Apache Cassandra please see its documentation.

Examples

We will create the following table to store tweets:

CREATE KEYSPACE demo
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
   id INT PRIMARY KEY,
   user TEXT,
   body TEXT,
   time TIMESTAMP,
   latitude FLOAT,
   longitude FLOAT
);

Now you can create a custom Lucene index on it with the following statement:

CREATE CUSTOM INDEX tweets_index ON tweets ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         id: {type: "integer"},
         user: {type: "string"},
         body: {type: "text", analyzer: "english"},
         time: {type: "date", pattern: "yyyy/MM/dd"},
         place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
      }
   }'
};

This will index all the columns in the table with the specified types, and the index will be refreshed once per second. Alternatively, you can explicitly refresh all the index shards with an empty search using consistency ALL:

CONSISTENCY ALL
SELECT * FROM tweets WHERE expr(tweets_index, '{refresh:true}');
CONSISTENCY QUORUM

Now, to search for tweets within a certain date range:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"}
}');

The same search can be performed forcing an explicit refresh of the involved index shards:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
   refresh: true
}') LIMIT 100;

Now, to search for the top 100 most relevant tweets whose body field contains the phrase “big data gives organizations” within the aforementioned date range:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1}
}') LIMIT 100;

To refine the search to get only the tweets written by users whose names start with "a":

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1}
}') LIMIT 100;

To get the 100 most recent filtered results you can use the sort option:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: {field: "time", reverse: true}
}') LIMIT 100;

The previous search can be restricted to tweets created close to a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"},
      {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: {field: "time", reverse: true}
}') LIMIT 100;

It is also possible to sort the results by distance to a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"},
      {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: [
      {field: "time", reverse: true},
      {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328}
   ]
}') LIMIT 100;

Last but not least, you can route any search to a certain token range or partition, in such a way that only a subset of the cluster nodes will be hit, saving precious resources:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"},
      {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: [
      {field: "time", reverse: true},
      {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328}
   ]
}') AND TOKEN(id) >= TOKEN(0) AND TOKEN(id) < TOKEN(10000000) LIMIT 100;

This last feature is the basis for supporting Hadoop, Spark and other MapReduce frameworks.
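
As a complementary sketch (not taken from the original documentation), the same kind of search can also be routed to a single partition by restricting the partition key of the tweets table; the id value used here is hypothetical:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"}
}') AND id = 1;

With the partition key restricted, only the replicas owning that partition are involved in the search.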

Please refer to the comprehensive Stratio’s Cassandra Lucene Index documentation.

cassandra-lucene-index's People

Contributors

adelapena, andreaspetter, fdimuccio, jay-zhuang, jcortejoso, ml0renz0, stratiocommit, talberto-zz, witokondoria


cassandra-lucene-index's Issues

bitemporal index unexpected filtering on vt

I have a simple entry (in a single table):
Table structure:

CREATE TABLE users (
         user_id int,
         name text,
         vt_from text,
         vt_to text,
         tt_from text,
         tt_to text,
         lucene text,
         PRIMARY KEY (user_id, vt_from, tt_from)
     );

     CREATE CUSTOM INDEX users_index on users(lucene)
     USING 'com.stratio.cassandra.lucene.Index'
     WITH OPTIONS = {
         'refresh_seconds' : '1',
         'schema' : '{
         fields : {
         bitemporal : {
             type : "bitemporal",
             tt_from : "tt_from",
             tt_to : "tt_to",
             vt_from : "vt_from",
             vt_to : "vt_to",
             pattern : "yyyy/MM/dd",
             now_value : "2200/12/31"}
         }
     }'};

value

INSERT INTO users (user_id,  name, vt_from, vt_to, tt_from, tt_to)
    VALUES (42, 'Bob', '2015/01/01', '2200/12/31', '2015/01/01', '2015/01/05');

The "valid" date range is from 2015-01-01 to 2200-12-31, so I should be able to read it while filtering on a sub-range: [2015-01-06, 2200-12-31]. But this does not work.

SELECT name FROM users
WHERE lucene = '{
    filter : {
        type : "bitemporal",
        field : "bitemporal",
        vt_from : "2015/01/06",
        vt_to : "2200/12/31"
    }
}';

This query does not find any value, even though the query VT range intersects the data VT range (and the TT range too, as I left it at the default).
The result is the same with a specific TT in the query filter (tt_from : "2015/01/01", tt_to : "2200/12/31").

But the same query does find data with: vt_from : "2015/01/04".
So it seems the data tt_to (2015/01/05) is somehow filtered by the query vt_from (2015/01/06)!

Why can't I query my data?
The query [tt_from, tt_to] range includes that of the data, so it should not be affected by the query VT.

SELECT COUNT statement returns the result twice

When I executed the CQL below, I got two lines of results:

fred@cqlsh:opensignal> SELECT COUNT(*) FROM os_snapshot2 WHERE ts = 0 and network_name_mapped = 'Verizon' and network_type_mapped = '4' and lucene = '{ filter : {type:"boolean", must:[ {type:"geo_bbox",  field:"place", min_latitude: 25, max_latitude: 26, min_longitude: -100, max_longitude: -97} ]} }';

 count
-------
   100

---MORE---
 count
-------
    59

So I wonder whether this output style is part of the design or some issue.
According to my schema, the CQL should only contact the coordinator and one cohort node, since all the data with ts = 0 and network_name_mapped = 'Verizon' is stored on one node (say, node-4).
After checking the CQL trace, I found that all operations were processed on the coordinator node and node-4. So I think both the coordinator and the cohort node issued the same CQL once and dumped their outputs to the CQL execution node.
Please let me know whether this is by design or some other issue.
Here is my table schema

CREATE TABLE opensignal.os_snapshot2 (
    ts timestamp,
    network_name_mapped text,
    network_type_mapped text,
    box_lat double,
    box_long double,
    created_at timeuuid,
    averaged_over double,
    download_speed double,
    lat double,
    long double,
    lucene text,
    network_id_mapped text,
    ping_time double,
    reliability double,
    rssi double,
    speed_averaged_over double,
    upload_speed double,
    PRIMARY KEY ((ts, network_name_mapped), network_type_mapped, box_lat, box_long, created_at)
) WITH CLUSTERING ORDER BY (network_type_mapped ASC, box_lat ASC, box_long ASC, created_at ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
CREATE CUSTOM INDEX os_snapshot2_spatial_index ON opensignal.os_snapshot2 (lucene) USING 'com.stratio.cassandra.lucene.Index';

And my index schema

CREATE CUSTOM INDEX os_snapshot2_spatial_index ON os_snapshot2 (lucene) 
USING 'com.stratio.cassandra.lucene.Index' 
WITH OPTIONS = {  
    'refresh_seconds' : '60',
    'schema' : '{ 
        fields : { 
            place : { type : "geo_point", latitude:"lat", longitude:"long" }, 
            network_type_mapped: { type: "string" } 
        }     
    }' 
};

Query performance compared to DSE Search

When evaluating query performance for the same schema between DSE Search with Solr and this project, we are seeing DSE to be at least 10 times faster on the same cluster. Would there be any reason why DSE performs the Solr queries much faster?

2 to 6 times the insert rate without a stratio index

Hey,

I've benchmarked insertions on a Cassandra table with and without an index using your code, and it appears that having the secondary index divides the insertion rate by 2 on a table with a few fields, and by far more on one with many fields (up to 6 times slower). Are you aware of this?
I thought secondary indexes were updated asynchronously, but they are definitely updated when the corresponding table is updated.

Cassandra warns for multiple logback.xml

I get this warning after adding the Lucene index JAR file to the Cassandra lib directory:

19:48:03,074 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.
19:48:03,074 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/cassandra/lib/cassandra-lucene-index-plugin-2.1.8.4.jar!/logback.xml]
19:48:03,074 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [file:/cassandra/conf/logback.xml]

Add a new column to an existing keyspace: how to extend the Lucene index?

If I add a new field (column) and would like to use this column in the Lucene index,
how can I add the new field to the existing Lucene index?

A possible option is to delete the custom index and create the index again with the new field.
But how can I restart the index process on the old data?
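
A minimal sketch of that approach, using the README's CQL syntax (index, table and column names here are hypothetical, and it assumes that, as with regular Cassandra secondary indexes, creating the index triggers a build over the data already stored in the table):

DROP INDEX IF EXISTS my_lucene_index;

CREATE CUSTOM INDEX my_lucene_index ON my_table ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         existing_column: {type: "string"},
         new_column: {type: "string"}
      }
   }'
};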

filter match request on map field fails for value containing "-" character

Hi,

I tried with both 2.1.6 and 2.1.8 in different environments, and in both of them I see the exception stack below when the criterion is longer than 12 characters:

Query: select * from where cassandra_lucene_index = '{filter : {type: "match", field: "", value: "<greater than 12 characters>"}}';

It works fine for criterion values that are <= 12 characters.

Traceback (most recent call last):
File "./cqlsh.py", line 1150, in perform_simple_statement
rows = future.result(self.session.default_timeout)
File "//apache-cassandra-2.2.0_i1/bin/../lib/cassandra-driver-internal-only-2.6.0c2.post.zip/cassandra-driver-2.6.0c2.post/cassandra/cluster.py", line 3296, in result
raise self._final_exception

ReadFailure: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 1 failures" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}

Please let me know if you need any additional information.

sorting is broken after using boolean type filter in query

Hi, I tried to query my example table using the Lucene index with timestamp sorting. If I use only one filter, like 'range', sorting by date works great, and I can reverse it too. But after I change my filter to the boolean type, to combine two different conditions like range and prefix, reverse sorting stops working: I cannot sort by date in reverse anymore. So this works fine (I can sort in both directions by date):

SELECT modifytime,noteid,userid FROM epodb.notes
WHERE fts_index = '{filter : {type:"range", field:"modifytime", lower:"2015/09/28 00:00:00"},
sort : {fields: [ {field:"modifytime", reverse:true}]}}'
limit 20;

But this one stops sorting in reverse by date (regardless of the reverse parameter, the results are always sorted ascending):

SELECT modifytime,noteid,userid FROM epodb.notes
WHERE fts_index = '{filter : {
type : "boolean",
must : [{type:"range", field:"modifytime", lower:"2015/09/28 00:00:00"},
{type : "prefix", field : "value", value : "a"}],
sort : {fields: [ {field:"modifytime", reverse:true}]}}}'
limit 20;
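
For comparison, here is a sketch of the same search with the sort option placed at the top level of the search object, next to the filter, as in the working single-filter query above (the non-working query nests sort inside the filter braces); whether this resolves the issue is not confirmed here:

SELECT modifytime,noteid,userid FROM epodb.notes
WHERE fts_index = '{filter : {
type : "boolean",
must : [{type:"range", field:"modifytime", lower:"2015/09/28 00:00:00"},
{type : "prefix", field : "value", value : "a"}]},
sort : {fields: [ {field:"modifytime", reverse:true}]}}'
limit 20;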

Invalid bitemporal data prevents cassandra from starting up

Invalid time ranges (where the from is later than the to) prevent cassandra from starting up while replaying the commit log. In the short term, is there a way to correct this with the cluster offline? Longer term, I would think this shouldn't prevent the cluster from even starting up.

ERROR [main] 2015-08-25 08:04:48,502 CassandraDaemon.java:541 - Exception encountered during startup
java.lang.RuntimeException: java.util.concurrent.ExecutionException: com.stratio.cassandra.lucene.IndexException: Error while indexing row java.nio.HeapByteBuffer[pos=0 lim=4 cap=4] in Lucene index onestore.bt.bt_index
        at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:403) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:392) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:463) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:119) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:148) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:128) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:352) [apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:524) [apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:613) [apache-cassandra-2.1.8.jar:2.1.8]
Caused by: java.util.concurrent.ExecutionException: com.stratio.cassandra.lucene.IndexException: Error while indexing row java.nio.HeapByteBuffer[pos=0 lim=4 cap=4] in Lucene index onestore.bt.bt_index
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.get(AbstractTracingAwareExecutorService.java:200) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:399) ~[apache-cassandra-2.1.8.jar:2.1.8]
        ... 8 common frames omitted
Caused by: com.stratio.cassandra.lucene.IndexException: Error while indexing row java.nio.HeapByteBuffer[pos=0 lim=4 cap=4] in Lucene index onestore.bt.bt_index
        at com.stratio.cassandra.lucene.Index.index(Index.java:159) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at org.apache.cassandra.db.index.SecondaryIndexManager$StandardUpdater.updateRowLevelIndexes(SecondaryIndexManager.java:834) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:229) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Memtable.put(Memtable.java:210) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1237) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:400) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:363) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
        at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) ~[apache-cassandra-2.1.8.jar:2.1.8]
        at java.lang.Thread.run(Thread.java:744) ~[na:1.7.0_51]
Caused by: java.lang.IllegalArgumentException: Wrong order: 2014-01-01T05 TO 2013-01-01T05:00:00.000
        at org.apache.lucene.spatial.prefix.tree.NumberRangePrefixTree.toRangeShape(NumberRangePrefixTree.java:109) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.schema.mapping.BitemporalMapper.makeShape(BitemporalMapper.java:282) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.schema.mapping.BitemporalMapper.addFields(BitemporalMapper.java:329) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.schema.Schema.addFields(Schema.java:185) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.service.RowMapperWide.document(RowMapperWide.java:107) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.service.RowServiceWide.documents(RowServiceWide.java:129) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.service.RowServiceWide.index(RowServiceWide.java:91) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]
        at com.stratio.cassandra.lucene.Index.index(Index.java:154) ~[cassandra-lucene-index-plugin-2.1.8.2.jar:na]

Clarification about actions taken by cassandra-lucene-index upon refresh

Hi,

Is there any resource that details how the indexes are managed by cassandra-lucene-index, i.e. the actions that the library takes during refresh?

Specifically, I am looking at the behaviour when my indexed map column is updated during the refresh period, and whether there are any limitations on the number of updates, read repairs, and the like.

Also, when I compared the performance of my UPSERTs on the indexed column between the C* secondary index and cassandra-lucene-index, I noticed as much as 20% performance degradation for my UPSERTs. Of course, the C* secondary index is not an option for us, but would you know how index management differs between the two? I can understand a cost difference during reads (because we are allowing much more there), but I could not figure out why that would be the case for my UPSERTs.

Thanks for reading.

build failed

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building cassandra-lucene-index 2.1.6.1
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ cassandra-lucene-index ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ cassandra-lucene-index ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ cassandra-lucene-index ---
[INFO] Compiling 116 source files to /home/rav/cassandra-lucene-index/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ cassandra-lucene-index ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/rav/cassandra-lucene-index/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ cassandra-lucene-index ---
[INFO] Compiling 56 source files to /home/rav/cassandra-lucene-index/target/test-classes
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ cassandra-lucene-index ---
[INFO] Surefire report directory: /home/rav/cassandra-lucene-index/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running com.stratio.cassandra.lucene.query.SearchTest
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.555 sec
Running com.stratio.cassandra.lucene.query.LuceneConditionTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.075 sec
Running com.stratio.cassandra.lucene.query.ConditionTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
Running com.stratio.cassandra.lucene.query.RegexpConditionTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.1 sec
Running com.stratio.cassandra.lucene.query.ContainsConditionTest
Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.074 sec
Running com.stratio.cassandra.lucene.query.GeoDistanceConditionTest
Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.1 sec
Running com.stratio.cassandra.lucene.query.PhraseConditionTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.049 sec
Running com.stratio.cassandra.lucene.query.FuzzyConditionTest
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.041 sec
Running com.stratio.cassandra.lucene.query.GeoBBoxConditionTest
Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.038 sec
Running com.stratio.cassandra.lucene.query.SingleFieldConditionTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running com.stratio.cassandra.lucene.query.MatchConditionTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.042 sec
Running com.stratio.cassandra.lucene.query.WildcardConditionTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.038 sec
Running com.stratio.cassandra.lucene.query.MatchAllConditionTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running com.stratio.cassandra.lucene.query.builder.SearchBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running com.stratio.cassandra.lucene.query.builder.MatchConditionBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running com.stratio.cassandra.lucene.query.builder.LuceneConditionBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
Running com.stratio.cassandra.lucene.query.builder.SortBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running com.stratio.cassandra.lucene.query.builder.MatchAllConditionBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running com.stratio.cassandra.lucene.query.builder.PrefixConditionBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running com.stratio.cassandra.lucene.query.builder.SearchBuildersTest
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.015 sec
Running com.stratio.cassandra.lucene.query.builder.FuzzyConditionBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running com.stratio.cassandra.lucene.query.builder.SortFieldBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running com.stratio.cassandra.lucene.query.builder.RangeConditionBuilderTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running com.stratio.cassandra.lucene.query.builder.ContainsConditionBuilderTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running com.stratio.cassandra.lucene.query.builder.PhraseConditionBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running com.stratio.cassandra.lucene.query.builder.RegexpConditionBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running com.stratio.cassandra.lucene.query.builder.WildcardConditionBuilderTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running com.stratio.cassandra.lucene.query.BooleanConditionTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.039 sec
Running com.stratio.cassandra.lucene.query.PrefixConditionTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.023 sec
Running com.stratio.cassandra.lucene.query.SortFieldTest
Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.053 sec
Running com.stratio.cassandra.lucene.query.DateRangeConditionTest
Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.051 sec
Running com.stratio.cassandra.lucene.query.RangeConditionTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running com.stratio.cassandra.lucene.schema.ColumnsTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running com.stratio.cassandra.lucene.schema.ColumnTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running com.stratio.cassandra.lucene.schema.mapping.DoubleMapperTest
Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.056 sec
Running com.stratio.cassandra.lucene.schema.mapping.BooleanMapperTest
Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.037 sec
Running com.stratio.cassandra.lucene.schema.mapping.DateMapperTest
Tests run: 28, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.036 sec
Running com.stratio.cassandra.lucene.schema.mapping.StringMapperTest
Tests run: 28, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.021 sec
Running com.stratio.cassandra.lucene.schema.mapping.BigIntegerMapperTest
Tests run: 60, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.043 sec
Running com.stratio.cassandra.lucene.schema.mapping.LongMapperTest
Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
Running com.stratio.cassandra.lucene.schema.mapping.InetMapperTest
Tests run: 25, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.029 sec
Running com.stratio.cassandra.lucene.schema.mapping.BigDecimalMapperTest
Tests run: 64, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.098 sec
Running com.stratio.cassandra.lucene.schema.mapping.FloatMapperTest
Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.021 sec
Running com.stratio.cassandra.lucene.schema.mapping.UUIDMapperTest
Tests run: 28, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.117 sec
Running com.stratio.cassandra.lucene.schema.mapping.BlobMapperTest
Tests run: 27, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running com.stratio.cassandra.lucene.schema.mapping.MapperTest
Tests run: 14, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.019 sec
Running com.stratio.cassandra.lucene.schema.mapping.TextMapperTest
Tests run: 29, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running com.stratio.cassandra.lucene.schema.mapping.IntegerMapperTest
Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
Running com.stratio.cassandra.lucene.schema.mapping.GeoPointMapperTest
Tests run: 37, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.963 sec
Running com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest
Tests run: 34, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 0.04 sec <<< FAILURE!
testGetStopFromStringColumnWithDefaultPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest)  Time elapsed: 0.006 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1425081723004> but was:<1425078123004>
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.failNotEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:126)
        at org.junit.Assert.assertEquals(Assert.java:470)
        at org.junit.Assert.assertEquals(Assert.java:454)
        at com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest.testGetStopFromStringColumnWithDefaultPattern(DateRangeMapperTest.java:212)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

testGetStopFromStringColumnWithCustomPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest)  Time elapsed: 0.003 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1425078000000> but was:<1425074400000>
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.failNotEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:126)
        at org.junit.Assert.assertEquals(Assert.java:470)
        at org.junit.Assert.assertEquals(Assert.java:454)
        at com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest.testGetStopFromStringColumnWithCustomPattern(DateRangeMapperTest.java:221)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

testGetStartFromStringColumnWithCustomPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest)  Time elapsed: 0.002 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1425078000000> but was:<1425074400000>
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.failNotEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:126)
        at org.junit.Assert.assertEquals(Assert.java:470)
        at org.junit.Assert.assertEquals(Assert.java:454)
        at com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest.testGetStartFromStringColumnWithCustomPattern(DateRangeMapperTest.java:150)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

testGetStartFromStringColumnWithDefaultPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest)  Time elapsed: 0.002 sec  <<< FAILURE!
java.lang.AssertionError: expected:<1425081723004> but was:<1425078123004>
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.failNotEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:126)
        at org.junit.Assert.assertEquals(Assert.java:470)
        at org.junit.Assert.assertEquals(Assert.java:454)
        at com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest.testGetStartFromStringColumnWithDefaultPattern(DateRangeMapperTest.java:141)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
        at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
        at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
        at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
        at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Running com.stratio.cassandra.lucene.schema.analysis.ClasspathAnalyzerBuilderTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
Running com.stratio.cassandra.lucene.schema.analysis.PreBuiltAnalyzersTest
Tests run: 43, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.317 sec
Running com.stratio.cassandra.lucene.schema.analysis.SnowballAnalyzerBuilderTest
Tests run: 26, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.078 sec
Running com.stratio.cassandra.lucene.schema.SchemaTest
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.11 sec
Running com.stratio.cassandra.lucene.service.LuceneIndexTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.743 sec

Results :

Failed tests:   testGetStopFromStringColumnWithDefaultPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest): expected:<1425081723004> but was:<1425078123004>
  testGetStopFromStringColumnWithCustomPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest): expected:<1425078000000> but was:<1425074400000>
  testGetStartFromStringColumnWithCustomPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest): expected:<1425078000000> but was:<1425074400000>
  testGetStartFromStringColumnWithDefaultPattern(com.stratio.cassandra.lucene.schema.mapping.DateRangeMapperTest): expected:<1425081723004> but was:<1425078123004>

Tests run: 834, Failures: 4, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21.626 s
[INFO] Finished at: 2015-06-20T10:36:33+03:00
[INFO] Final Memory: 30M/77M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project cassandra-lucene-index: There are test failures.
[ERROR]
[ERROR] Please refer to /home/rav/cassandra-lucene-index/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Timeout on multiple concurrent queries

This is probably not an issue but a question.

I have a script which bootstraps the data model and inserts some data. The inserts have some hefty logic (they insert > search > update) and run in parallel. (Just to mention, the searches use the Lucene index.)

I used to run it on a single node and recently I added one more to form a cluster (2 nodes). I'm experiencing timeouts, and I tried both the old stratio-cassandra-2.1.5.0 and the new cassandra-2.1.9 + cassandra-lucene-index-2.1.9.0. The database nodes + custom indexes were set up with pretty much default settings from the documentation.

After some investigation, it seemed that the cluster could not keep up with that concurrency, which was on the order of tens. This seems pretty strange, given that the database is benchmarked way above that.

To conclude, I'm left with a few questions:

  1. Is this slow behavior normal or am I doing something obviously wrong?
  2. If it is normal, which part is more likely to be the problem: Cassandra or the index?
  3. Which would be the first config adjustments to give better results?
  4. Is it just a concurrency problem?

lucene index plugin deployment

Just a question about deployment and use of this plugin. Suppose I have one cluster in data center 1 running Cassandra with the main DB using vnodes, and now I would like to add another cluster in data center 2, and only there I would like to keep the Lucene indexes. Is it possible to separate things like that? I saw that there are some concepts of mixed deployments with Cassandra where some nodes are used for the main DB, some for Solr, Spark, etc.
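
Not an answer from the project maintainers, but as a plain-Cassandra sketch: because the index is attached to its table and each replica indexes its own data, data-center-scoped placement is normally expressed by putting the indexed table in a keyspace replicated only to the search data center (keyspace and data center names here are hypothetical):

CREATE KEYSPACE search_dc2
WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC2': 3};

Tables and custom indexes created in such a keyspace would only have replicas, and therefore Lucene index files, on DC2 nodes, while the main tables remain in a keyspace replicated to DC1.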

Match and Phrase behavior problem (compared to stratio-cassandra)

Switching from stratio-cassandra 2.1.5 to a bare C* 2.1.6 with the cassandra-lucene-index plugin (compiled today) shows inconsistencies in Match, Phrase and Fuzzy queries, which seem to behave differently than expected in the plugin.

Match and phrase seem to have switched from single-word search to full-sentence search, which prevents using them to search for a single word or a group of words within a string (leaving wildcard for that, which is not as fast).

Wildcard

Seems to work fine, but gives a real slowdown compared to match/phrase/fuzzy :

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"wildcard", field:"adresse_1_destinataire", value:"*DANIEL*COGNAC*"}}';

I get the following 7 rows :

 ref_expediteur | adresse_1_destinataire
----------------+-------------------------
       95762968 | 15 RUE DANIEL DE COGNAC
       89162952 | 15 RUE DANIEL DE COGNAC
       94262880 | 15 RUE DANIEL DE COGNAC
       95823340 | 15 RUE DANIEL DE COGNAC
       95162706 | 15 RUE DANIEL DE COGNAC
       95042969 | 15 RUE DANIEL DE COGNAC
       94443320 | 15 RUE DANIEL DE COGNAC

Fuzzy

If I use the following query :

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"fuzzy", field:"adresse_1_destinataire", value:"15 RUE DANIEL DECOGNAC"}}';

I get the following 8 rows :

 ref_expediteur | adresse_1_destinataire
----------------+-------------------------
       94262880 | 15 RUE DANIEL DE COGNAC
       95042969 | 15 RUE DANIEL DE COGNAC
     RI36692335 | 15 RUE DANIEL DE COSNAC
       95762968 | 15 RUE DANIEL DE COGNAC
       89162952 | 15 RUE DANIEL DE COGNAC
       95823340 | 15 RUE DANIEL DE COGNAC
       95162706 | 15 RUE DANIEL DE COGNAC
       94443320 | 15 RUE DANIEL DE COGNAC

This seems like a reasonable result.
Then, if we differ a bit more from the "real" string:

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"fuzzy", field:"adresse_1_destinataire", value:"15 RUE DANIEL COGNAC"}}';

Then we get no match at all (0 rows).
This could be related to me misunderstanding the accuracy needed for fuzzy to return results.
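
For what it's worth: Lucene fuzzy queries support a maximum edit distance of 2, and with the field mapped as a non-analyzed string the whole value is a single term, so "15 RUE DANIEL COGNAC" is three edits away from the stored "15 RUE DANIEL DE COGNAC" and a miss here seems plausible. A sketch making the edit distance explicit, assuming the fuzzy condition accepts a max_edits parameter as described in the project documentation:

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"fuzzy", field:"adresse_1_destinataire", value:"15 RUE DANIEL DECOGNAC", max_edits:2}}';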

MATCH

Using the same dataset, we can use match to get results providing the full string (exactly as stored) :

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"match", field:"adresse_1_destinataire", value:"15 RUE DANIEL DE COGNAC"}}';

returns 7 rows :

 ref_expediteur | adresse_1_destinataire
----------------+-------------------------
       95762968 | 15 RUE DANIEL DE COGNAC
       89162952 | 15 RUE DANIEL DE COGNAC
       94262880 | 15 RUE DANIEL DE COGNAC
       95823340 | 15 RUE DANIEL DE COGNAC
       95162706 | 15 RUE DANIEL DE COGNAC
       95042969 | 15 RUE DANIEL DE COGNAC
       94443320 | 15 RUE DANIEL DE COGNAC

Previously, this didn't work since "match" didn't allow multiple words at once.

Now, if I restrict the match to a single word:

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"match", field:"adresse_1_destinataire", value:"COGNAC"}}';

Then I get no results at all.
Shouldn't it return all rows containing the word "COGNAC"?

PHRASE

Phrase gets it right, like match does, when the full string is searched as stored :

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"phrase", field:"adresse_1_destinataire", value:"15 RUE DANIEL DE COGNAC"}}';

returns 7 rows :

 ref_expediteur | adresse_1_destinataire
----------------+-------------------------
       95762968 | 15 RUE DANIEL DE COGNAC
       89162952 | 15 RUE DANIEL DE COGNAC
       94262880 | 15 RUE DANIEL DE COGNAC
       95823340 | 15 RUE DANIEL DE COGNAC
       95162706 | 15 RUE DANIEL DE COGNAC
       95042969 | 15 RUE DANIEL DE COGNAC
       94443320 | 15 RUE DANIEL DE COGNAC

But if I try to skip a word like "DE" and set slop like this :

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '
            {filter : {type:"range", field:"date_depot_lt", lower:"2015-01-25", upper:"2015-04-01"},  
             query  :  {type:"phrase", field:"adresse_1_destinataire", value:"15 RUE DANIEL COGNAC", slop:2}}';

Once again I get no results.

Here's the create statement for the index :

CREATE CUSTOM INDEX lt_lucene_index ON vision_dev.lt (lucene) 
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds' : '10',
   'schema' : '{
       fields : {
           adresse_1_destinataire : {type : "string"},
           adresse_1_expediteur : {type : "string"},
           adresse_2_destinataire : {type : "string"},
           adresse_2_expediteur : {type : "string"},
           date_depot_lt : {type : "date", pattern : "yyyy-MM-dd"}
       }
   }'
};

Am I misunderstanding something ?
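One guess on my side, to be confirmed : since adresse_1_destinataire is mapped as "string", the whole cell value seems to be indexed as a single, non-analyzed term, which would explain why whole-value match, phrase and wildcard searches hit while a single word like "COGNAC" does not. If that is the case, mapping the column as "text" with an analyzer (as other examples do) should make word-level matching work. A sketch, with the other fields omitted :

CREATE CUSTOM INDEX lt_lucene_index ON vision_dev.lt (lucene) 
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds' : '10',
   'schema' : '{
       fields : {
           adresse_1_destinataire : {type : "text", analyzer : "standard"},
           date_depot_lt : {type : "date", pattern : "yyyy-MM-dd"}
       }
   }'
};

select ref_expediteur, adresse_1_destinataire 
from vision_dev.lt 
where lucene = '{query : {type:"match", field:"adresse_1_destinataire", value:"COGNAC"}}';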

filter match request does not work with map field

Hi,

Using filter match request with map field does not retrieve the value correctly.

Below the steps to reproduce:

//create table
create table mytesttable (idcol int, testtextcol text, testmapcol map<text,text>, cassandra_lucene_index text, primary key (idcol));

//create index
CREATE CUSTOM INDEX IF NOT EXISTS lucene_test_idx ON mytesttable (cassandra_lucene_index) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = {'refresh_seconds' : '1', 'schema' : '{ fields : {testtextcol: {type: "string"}, testmapcol: {type: "string"}}}'};

//create test data
insert into mytesttable (idcol, testtextcol, testmapcol) values (1, 'row1', {'attb1': 'row1attb1Val', 'attb2': 'row1attb2Val', 'attb3': 'row1attb3Val'});
insert into mytesttable (idcol, testtextcol, testmapcol) values (2, 'someLongRow2Value', {'attb1': 'someLongattb1Val', 'attb2': 'attb2Val', 'attb3': 'attb3Val'});
insert into mytesttable (idcol, testtextcol, testmapcol) values (3, 'row3', {'attb1': 'row2attb1Val', 'attb2': 'row2attb2Val', 'attb3': 'row2attb3Val'});
insert into mytesttable (idcol, testtextcol, testmapcol) values (4, 'row4', {'attb2': 'row3attb2Val', 'attb3': 'row3attb3Val'});

//this query retrieves the data fine
select * from mytesttable where cassandra_lucene_index = '{filter : {type: "match", field: "testtextcol", value: "row1"}}';

//but this query does not
select * from mytesttable where cassandra_lucene_index = '{filter : {type: "match", field: "testmapcol.attb1", value: "row1attb1Val"}}';

Please let me know if you need any other details.

Improved integration with Cassandra to remove unneeded lucene re-index run

The scenario I created is an index that refreshes every second.

There is no need to run this refresh every second if there has been no access to Cassandra (or rather no dirty access, i.e. data changes) since the last build; even read-only access should not be considered dirty.
As it is now, the refresh runs unnecessarily, creating workload on the machine while nothing is changing in, or even accessing, Cassandra.
Since you run Lucene inside Cassandra itself, you could add a listener/filter to the Lucene/Cassandra refresh loop that is aware of dirtiness (or at least of the absence of any access) and skips such runs.

CREATE INDEX log_os ON sregion.logevent ( os_type );

CREATE CUSTOM INDEX ON sregion.logevent (stratio_col)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds':'1',
    'num_cached_filters':'1',
    'ram_buffer_mb':'64',
    'max_merge_mb':'5',
    'max_cached_mb':'30',
    'schema':'{
        default_analyzer:"standard",
        fields:{
            event_code:{type:"string"},
            application:{type:"string"},
            event_time:{type:"date", pattern:"yyyy/MM/dd"},
            username:{type:"string"},
            ip_address:{type:"string"},
            os_type:{type:"integer"},
            data:{type:"text", analyzer:"english"}
        }
    }'
};
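As a partial mitigation in the meantime (just a sketch, not the dirtiness-aware refresh requested above), the idle work could at least be reduced by dropping the index and recreating it with a longer refresh interval; only a couple of the mappers are shown here for brevity:

CREATE CUSTOM INDEX ON sregion.logevent (stratio_col)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds':'60',
    'schema':'{
        default_analyzer:"standard",
        fields:{
            event_code:{type:"string"},
            os_type:{type:"integer"}
        }
    }'
};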

Indexing Collections Elements

hello,

Is it possible to index the elements of collections (specifically a set)?
For example, I have a field, which is a set containing these values : "5", "7", "9"

Is it possible to search rows where "8" is contained (or not) in this set?
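To make the question concrete, here is roughly what I would like to be able to do (a sketch with made-up table and column names, assuming a collection column can be mapped by declaring its element type, like the map example in another issue here, and then queried with a contains condition) :

//create a table with a set column and a lucene column
CREATE TABLE demo.items (
    id int PRIMARY KEY,
    codes set<text>,
    lucene text
);

//map the set column by its element type
CREATE CUSTOM INDEX items_index ON demo.items (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{ fields : { codes : {type : "string"} } }'
};

//search rows whose set contains the value "8"
SELECT * FROM demo.items WHERE lucene = '{filter : {type:"contains", field:"codes", values:["8"]}}';

//and, if the boolean "not" clause applies here, the complementary search
SELECT * FROM demo.items WHERE lucene = '{filter : {type:"boolean", not:[{type:"contains", field:"codes", values:["8"]}]}}';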

Julien

Custom index and consistency clarification

Hello,

I simulated the situation of a node that is down for a "long time" and then comes back alive.

So I've set up a 3-node cluster with RF=3 replication and hinted handoff deactivated.
I insert a row such as :

{ id : 1 , user : manu, tweet : hello world } 

with an appropriate custom index

I put my node 3 offline and update the content of "tweet"

UPDATE documents SET tweet='hello paris' WHERE id=1;

I bring my node 3 online and perform this request :

SELECT * FROM documents WHERE lucene='{query : {type:"phrase", field:"tweet", value:"hello world", slop:1}}';

which returns me 0 or 1 row depending on which nodes respond.

Is there a solution to prevent behavior like that, or is it totally normal ?

new installation (for POC evaluation) questions

hi all,

excuse me if this is not the right place to post my questions, but for lack of information about a more suitable website I am posting them here :

we are investigating the different options that exist to run a relatively small 'big data' infrastructure for our customer, and are currently looking at the Stratio platform offering.
The total amount of data that we need to process is roughly 5 TB.
We have budget for 6 servers, each with 16 dual-thread cores and 128 GB of RAM.

To be sure that your solution can be implemented within the restrictions of our customer's budget, security policies, and IT-departmental policies, we have the following questions:

  • the different server types:
    You mention a "Stratio Manager" and a "Stratio Node" as server roles.
    The budget of our customer allows a maximum of 6 physical machines, installed in two datacenters (3 servers in one, 3 servers in the other datacenter).
    If we need to allocate one of these 6 physical machines to the "Stratio Manager" role only, then we lose a lot of compute power to this node role alone.
    -> is it possible to combine a "Stratio Manager" on the same physical machine as a "Stratio Node" ?
  • alternatively, would it be possible to run the "Stratio Manager" on a VMware virtual machine ?
  • does the "Stratio Manager" require internet access to download your software packages? This company (like many others) has tight restrictions on which machine can access the internet : it will be only granted via a proxy server. So would the installation script be capable to work with a proxy server? Can it be specified on the command-line which proxy server, and which credentials on that proxy server need to be used to access the internet ?
  • is the "Stratio Manager" required to run after the installation has been completed? In other words, what if the machine where this "Stratio Manager" is installed, goes broke? Is it a Single-Point-Of-Failure to continue running with the "Stratio" suite of software ? What activities cannot be done anymore with the "Stratio Nodes" when this manager is unavailable ?
  • the installation is done with superuser ("root") privileges, obviously because it installs packages and thus modifies the operating system as it comes pre-installed by our IT department. They obviously want to know what changes this Stratio installation will make to the operating system :
    packages that will be installed
    any kernel modules that will be added ?
    any operating system configuration files that would be modified (crontabs, user accounts, device files, driver configs, ...)
  • does any part of the "Stratio" platform require to run with superuser privileges, after the installation is completed ? If so, please provide details.
  • we desire geo-redundancy of our servers : one set of 3 servers running in one datacenter, another set of 3 servers running in another datacenter.
    Can these 6 servers work together as a pool to join forces ?
    What happens if an entire datacenter is lost (how is data replicated, which activities are no longer possible)?
    Can the servers of one datacenter be un-configured (removed) from the 'pool' so that we can re-install them with some other configuration that we want to evaluate ?
  • in your installation requirements, we read that the machines must be RHEL 6.5 or 6.6 ; in this company, they run standard RHEL version 6.7
    -> is this version RHEL 6.7 supported by Stratio ?

Many thanks in advance for your swift replies ! Please reply on each question, feel free to redirect us for each individual question to detailed (not generic) answers somewhere on the web.

Rob

Lucene Index does not return correct data

We have a problem with the consistency of the lucene index.
Our context :

//Create table with the schema below
CREATE TABLE keyspace_1.table_1 (
field_1 text,
field_2 int,
field_3 timestamp,
field_4 text,
field_5 text,
field_6 text,
field_7 text,
field_8 text,
field_9 text,
field_10 text,
field_11 text,
field_12 text,
field_13 int,
field_14 text,
field_15 int,
field_16 int,
field_17 text,
field_18 MAP<text, text>,
field_19 text,
field_20 text,
field_21 text,
field_22 int,
field_23 int,
field_24 int,
field_25 text,
field_26 text,
field_27 text,
field_28 text,
field_29 text,
PRIMARY KEY (field_1, field_2, field_3)
);

//Load data into the table (approximately 100 GB of data)

// Create index
alter table keyspace_1.table_1 add indextx text;

CREATE CUSTOM INDEX table_index ON keyspace_1.table_1 (indextx) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds' : '1', 'schema' : '{ fields : { field_6 : {type : "string", sorted: "false"}, field_3 : {type : "date", pattern : "yyyy-MM-dd"} } }' };

// The lucene index does not return the correct data

select field_1, field_6, field_3 from keyspace_1.table_1 WHERE indextx='{filter :{type:"contains", field:"field_6", values:["DC"]}}';
select field_1, field_6, field_3 from keyspace_1.table_1 WHERE indextx='{filter :{type:"match", field:"field_6", value:"DC"}}';

We use Cassandra 2.1.7 with plugin Stratio 2.1.7.1
JDK 1.7.0_45

Thanks,

Bitemporal queries returning different number of records each time

I'm running a simple bitemporal query using version 2.1.8.4. I run the query twice (right after each other) and get a different number of results each time. No inserts/updates/etc. are running against the database during this time. The data has been static for about 12 hours and the index was refreshed prior to running either query. If I run without the lucene condition, I always get the same results.

Query:
SELECT * FROM tab1 WHERE lucene='{filter : {
type:"boolean",
must:[
{type : "bitemporal", field : "bitemporal", vt_from : "2015/08/28 10:46:23:629", vt_to : "2015/08/28 10:46:23:629", tt_from : "2015/08/28 10:46:23:629", tt_to : "2015/08/28 10:46:23:629" }
]
}
}
' AND key='someKey';
Found: 393053.
****** finished query 2015-08-28T10:36:18.512-04:00
Found: 393052.
****** finished query 2015-08-28T10:37:57.464-04:00
Found: 393070.
****** finished query 2015-08-28T10:48:03.837-04:00
Found: 393085.
****** finished query 2015-08-28T10:49:39.656-04:00

INSERT required on table before SELECT will succeed

If you create a table and custom index like this:

CREATE KEYSPACE IF NOT EXISTS test
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE test;
CREATE TABLE IF NOT EXISTS test_update (
    pk int, k0 varchar, lucene text, primary key (pk)
);
CREATE CUSTOM INDEX test_update_index ON test_update (lucene) 
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            k0 : {type : "string"}
        }
    }'
};

Then upsert a row like this:

UPDATE test_update SET k0='foo' WHERE pk=0;

And select like this:

SELECT * FROM test_update WHERE lucene='{
    query : {type:"match", field:"k0", value:"foo"}
}';

There is no result.

If you replace the UPDATE with INSERT like this:

INSERT INTO test_update(pk,k0) VALUES(0,'foo');

and repeat the SELECT, you get the correct result.

Cassandra Lucene Index in Spark and Hadoop

It is mentioned that Cassandra Lucene Index is compatible with Spark and Hadoop. Can you please provide any Spark SQL or Hive SQL examples of creating and using a Cassandra Lucene index in Spark and Hadoop? Thank you.

Add new index without dropping existing indexes

From what I understand, to index a new field we have to drop the entire custom index and rebuild everything to include the new field. Could new fields be added without the need to drop existing indexes? This is going to be problematic for large tables.
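For illustration, what this means today (a sketch with placeholder keyspace, table and field names) is dropping and recreating the whole index just to add one mapper, which re-indexes the entire table:

DROP INDEX my_keyspace.my_lucene_index;

CREATE CUSTOM INDEX my_lucene_index ON my_keyspace.my_table (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            existing_field : {type : "string"},
            new_field : {type : "string"}
        }
    }'
};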

few questions

  1. Can you include an upgrading advice (general and per release)?
  2. Is it possible to retrieve index schema?
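Regarding question 2, what I currently do is dump the table definition from cqlsh, which prints the CREATE CUSTOM INDEX statement together with its 'schema' option (placeholder names):

DESCRIBE TABLE my_keyspace.my_table;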

Synchronous indexing clarification

Apologies if there is a better forum for this question.

Documentation specifies that indexing_threads = 0 means synchronous indexing. Does this mean synchronous with the write to the indexed partition? What implications does this have for index reads?
For my scenario, if I have:

  • A guarantee of at most one writer per partition
  • Writer performs multiple (index-based) reads then writes in quick succession

Will each index read reflect all previous writes to the partition, or is there still potential for the index to lag behind?

Thanks.
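For completeness, this is the kind of index definition I have in mind (a sketch with placeholder names, with indexing_threads set to 0 for synchronous indexing as described in the documentation):

CREATE CUSTOM INDEX my_sync_index ON my_keyspace.my_table (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'indexing_threads' : '0',
    'schema' : '{ fields : { my_field : {type : "string"} } }'
};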

sort not working with Map field

Hi,

Tried with both 2.1.6 and 2.1.8 on different environments, but sorting over a map field does not seem to work.

Please note that sorting on other C* types, as well as filtering/querying over map fields, works fine; it's just that sorting over map fields does not do anything.

Created index as:

CREATE CUSTOM INDEX IF NOT EXISTS <index_name> ON <table> (<lucene_column>) USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = {'refresh_seconds' : '60', 'schema' : '{ fields : { <map_column> : {type: "string"}}}'};

And trying to use it as:

select * from <table>
where <lucene_column> = '{sort: {fields: [{field:"<map_column>.<key>"}]}}';

Hope the usage is correct.

Error while creating index

I created this table:

CREATE TABLE IF NOT EXISTS flat3 (
id uuid,
title text,
description text,
creationDate timestamp,
lastModificationDate timestamp,
lucene TEXT,
PRIMARY KEY(id)
);

And I would like to create the following index (I'm using the 2.1.8.1 version):

CREATE CUSTOM INDEX flat3_index ON flat3 (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '10',
'schema' : '{
fields : {
id : {type : "uuid"},
title : {type : "text"},
description : {type : "text"},
lastModificationDate : {type : "date"}
}
}'
};

But I get the following error:

ErrorMessage code=2300 [Query invalid because of configuration issue] message="Error while validating Lucene index options: 'schema' is invalid : No column definition lastModificationDate for mapper lastModificationDate"

Any idea?
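Could this be CQL case-folding? Unquoted identifiers are lower-cased by Cassandra, so the column is actually stored as lastmodificationdate while the mapper is named lastModificationDate, which would explain the "No column definition" message. If so, using the lower-cased name in the schema (or quoting the column name in the table definition) should line things up:

CREATE CUSTOM INDEX flat3_index ON flat3 (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '10',
    'schema' : '{
        fields : {
            id : {type : "uuid"},
            title : {type : "text"},
            description : {type : "text"},
            lastmodificationdate : {type : "date"}
        }
    }'
};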

build failed

Running com.stratio.cassandra.lucene.schema.mapping.BitemporalMapperTest
Tests run: 85, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.92 sec <<< FAILURE! - in com.stratio.cassandra.lucene.schema.mapping.BitemporalMapperTest
testToString(com.stratio.cassandra.lucene.schema.mapping.BitemporalMapperTest)  Time elapsed: 0.011 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:<.../dd, nowValue=176644[44]00000}> but was:<.../dd, nowValue=176644[08]00000}>
        at org.junit.Assert.assertEquals(Assert.java:123)
        at org.junit.Assert.assertEquals(Assert.java:145)
        at com.stratio.cassandra.lucene.schema.mapping.BitemporalMapperTest.testToString(BitemporalMapperTest.java:1205)

I think it's related to time zone.

Supporting Cassandra stable releases

I just wanted to know what the roadmap plans are for supporting cassandra releases. I see there is an active branch for latest cassandra 2.2 support for this plugin. While cassandra 2.1.9 has also been released, is there any plan to also release a 2.1.9 supporting version of this plugin?

Will there be two separate releases supporting the latest vs the stable version of cassandra? Will the existing 2.1.8 release of this plugin also work properly with cassandra 2.1.9?

Timeout on malformed query

Hello,
I get a timeout if part of a boolean query is null, for example:

SELECT * FROM table WHERE lucene='{"query":{"type":"boolean", must:null}}'; // timeout
// OperationTimedOut: errors={}, last_host=127.0.0.1

But,

SELECT * FROM table WHERE lucene='{"query":{"type":"boolean", must:{}}}'; 
//InvalidRequest: code=2200 [Invalid query] message="Unformateable JSON search: Can not deserialize instance of java.util.ArrayList out of START_OBJECT token at [Source: java.io.StringReader@48575ddc; line: 1, column: 27] (through reference chain: com.stratio.cassandra.lucene.search.SearchBuilder["query"]->com.stratio.cassandra.lucene.search.condition.builder.BooleanConditionBuilder["must"])"

Can you add some validations?

java.lang.IllegalArgumentException: DocValuesField "y" is too large, must be <= 32766

ERROR [pool-4-thread-1] 2015-07-12 16:28:48,730 Log.java:53 - Unrecoverable error during asynchronously indexing
java.lang.IllegalArgumentException: DocValuesField "y" is too large, must be <= 32766
        at org.apache.lucene.index.SortedDocValuesWriter.addValue(SortedDocValuesWriter.java:68) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at org.apache.lucene.index.DefaultIndexingChain.indexDocValue(DefaultIndexingChain.java:434) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:376) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1350) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at com.stratio.cassandra.lucene.service.LuceneIndex.upsert(LuceneIndex.java:185) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at com.stratio.cassandra.lucene.service.RowServiceWide.doIndex(RowServiceWide.java:95) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at com.stratio.cassandra.lucene.service.RowService$2.run(RowService.java:170) ~[cassandra-lucene-index-plugin-2.1.8.0.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [na:1.8.0_45]
        at java.util.concurrent.FutureTask.run(Unknown Source) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.8.0_45]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.8.0_45]
        at java.lang.Thread.run(Unknown Source) [na:1.8.0_45]

Phrase queries appear to be broken

I have some rows in a test table:

cqlsh:lucene_test> select id, subject, user, time from emails;

 id | subject                                 | user   | time
----+-----------------------------------------+--------+--------------------------
  5 |                  this is the real thing | robbie | 2015-05-26 21:30:00-0400
  1 |                          this is a test | robbie | 2015-05-26 17:30:00-0400
  2 |                    this is another test | robbie | 2015-05-26 18:30:00-0400
  4 |                   this is a fourth test | robbie | 2015-05-26 20:30:00-0400
  7 |                              this is it | robbie | 2015-05-26 23:30:00-0400
  6 | this is even better than the real thing | robbie | 2015-05-26 22:30:00-0400
  3 |                    this is a third test | robbie | 2015-05-26 19:30:00-0400

(7 rows)

It has this index:

create custom index email_index on emails(lucene)
  using 'com.stratio.cassandra.lucene.Index'
  with options = {
    'refresh_seconds':'1',
    'schema': '{
       fields: {
          id   : {type : "integer"},
          user : {type : "string"},
          subject : {type : "text",  analyzer : "english"},
          body : {type : "text",  analyzer : "english"},
          time : {type : "date", pattern  : "yyyy-MM-dd hh:mm:ss"}
       }
   }'
};

The index works for some queries. Here's an example:

cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{
               ...     filter : {type:"range", field:"time", lower:"2015-05-26 18:29:59"},
               ...     query  : {type:"fuzzy", field:"subject", value:"thingy", max_edits:1}
               ... }';

 id | body                                             | lucene    | subject                                 | time                     | user
----+--------------------------------------------------+-----------+-----------------------------------------+--------------------------+--------
  5 | this is a test of the emergency broadcast system | 1.1713032 |                  this is the real thing | 2015-05-26 21:30:00-0400 | robbie
  6 | this is a test of the emergency broadcast system | 1.0541729 | this is even better than the real thing | 2015-05-26 22:30:00-0400 | robbie

(2 rows)

I run a variety of phrase queries, all of which fail to return results:

cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{query : {type:"phrase", field:"subject", values:["this", "is", "test"], slop: 1}}';

 id | body | lucene | subject | time | user
----+------+--------+---------+------+------

(0 rows)
cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{query : {type:"phrase", field:"subject", values:["this", "is", "test"], slop: 2}}';

 id | body | lucene | subject | time | user
----+------+--------+---------+------+------

(0 rows)
cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{query : {type:"phrase", field:"subject", values:["this", "is", "test"], slop: 5}}';

 id | body | lucene | subject | time | user
----+------+--------+---------+------+------

(0 rows)
cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{query : {type:"phrase", field:"subject", values:["this", "is", "a", "test"], slop: 5}}';

 id | body | lucene | subject | time | user
----+------+--------+---------+------+------

(0 rows)
cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{query : {type:"phrase", field:"subject", values:["this", "is", "a", "test"]}}';

 id | body | lucene | subject | time | user
----+------+--------+---------+------+------

(0 rows)
cqlsh:lucene_test> SELECT * FROM emails WHERE lucene='{query : {type:"phrase", field:"subject", values:["this"]}}';

 id | body | lucene | subject | time | user
----+------+--------+---------+------+------

(0 rows)

Contains query

Using stop words with "Contains query" will throw:
InvalidRequest: code=2200 [Invalid query] message="Value discarded by analyzer"

Shouldn't stop words be discarded without any errors? Or is it intended?

Eg:

...
analyzers : {
              my_custom_analyzer : {
                  type:"snowball",
                  language:"Spanish",
                  stopwords : "el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"
              }
},
...

SELECT * FROM <table> WHERE stratio_col='{ query : { type : "contains", field : "<field>", values : [ "el", "<other_non_stop_word_term>" ] } }';

InvalidRequest: code=2200 [Invalid query] message="Value discarded by analyzer"

OperationTimedOut when sorting on TEXT data

Cassandra version (running on Ubuntu 14.04): Cassandra 2.1.8
cassandra-lucene-index: cassandra-lucene-index-plugin-2.1.8.1.jar

Simple to reproduce. I use the example schema:

CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
id INT PRIMARY KEY,
user TEXT,
body TEXT,
time TIMESTAMP,
latitude FLOAT,
longitude FLOAT,
lucene TEXT
);

CREATE CUSTOM INDEX tweets_index ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
id : {type : "integer"},
user : {type : "string"},
body : {type : "text", analyzer : "english"},
time : {type : "date", pattern : "yyyy/MM/dd", sorted : true},
place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
}
}'
};

Insert some data:

INSERT INTO demo.tweets ( id, body, latitude, longitude, time, user ) VALUES (21, 'Utah is a great state', 79.13032, 67.2017, 1440780495620, 'Kenny');
INSERT INTO demo.tweets ( id, body, latitude, longitude, time, user ) VALUES (22, 'Indiana is a great state', 79.13232, 67.2727, 1440780995620, 'Andrea');

Then use the following query as a sanity check:

SELECT * FROM tweets WHERE lucene='{
query : {
type:"phrase",
field:"body",
value: "is a great state"}
}';

id | body | latitude | longitude | lucene | time | user
----+--------------------------+----------+-----------+--------+--------------------------+--------
22 | Indiana is a great state | 79.13232 | 67.2727 | 1.0 | 2015-08-28 16:56:35+0000 | Andrea
21 | Utah is a great state | 79.13032 | 67.2017 | 1.0 | 2015-08-28 16:48:15+0000 | Kenny

Now try the following:

SELECT * FROM tweets WHERE lucene='{
query : {
type:"phrase",
field:"body",
value: "is a great state"},
sort :
{
fields: [ {field : "user", reverse : true} ]
}
}';

You'll get the following:
OperationTimedOut: errors={}, last_host=x.x.x.x

If this is a legitimate timeout on two rows of data we are in trouble. ;)

If I sort on numeric fields I get back an immediate response. The problem only manifests itself with TEXT fields.
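One thing worth trying (a guess on my part, based on the time mapper above being declared with sorted : true): declare the field being sorted on as sorted as well, so that it gets the doc values Lucene needs for sorting, e.g.:

CREATE CUSTOM INDEX tweets_index ON tweets (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id : {type : "integer"},
            user : {type : "string", sorted : true},
            body : {type : "text", analyzer : "english"},
            time : {type : "date", pattern : "yyyy/MM/dd", sorted : true},
            place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
        }
    }'
};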

Can't index column value of size 140952 for index null

I'm loading config files into Cassandra and indexing them with Lucene.
A small column family with 4 fields and a config file of 15 MB gives no problem.
But with a column family of more than 100 fields, where one field is a config of only 271 KB, I get the error "Can't index column value of size 140952 for index null". Why?
