I'm running a simple bitemporal query using version 2.1.8.4. I run the query twice (r

Bitemporal queries returning different number of records each time about cassandra-lucene-index HOT 6 CLOSED

stratio commented on August 16, 2024

Bitemporal queries returning different number of records each time

from cassandra-lucene-index.

Comments (6)

adelapena commented on August 16, 2024

Hi,

How many Cassandra nodes are you using?
What are your read/write consistency levels and replication factor?
Are you using lightweight transactions with data revisions? I mean, are you updating data this way:

BEGIN BATCH
UPDATE tab1 SET tt_to = '2015/06/29' 
WHERE key = 'someKey' AND vt_from = '2015/01/01' 
AND tt_from = '2015/01/01' IF tt_to = '2200/12/31';
INSERT INTO tab1(key, vt_from, vt_to, tt_from, tt_to)
VALUES ('someKey', '2015/03/05', '2200/12/31', '2015/06/30', '2200/12/31');
APPLY BATCH;

Can you please show us your table schema and the create index statement?

To rule out some possible causes, can you please perform a repair and then repeat the queries with the refresh option set to true:

SELECT * FROM tab1 WHERE lucene='{filter : {
type:"boolean",
must:[
 {type : "bitemporal", 
  field : "bitemporal", 
  vt_from : "2015/08/28 10:46:23:629", 
  vt_to : "2015/08/28 10:46:23:629", 
  tt_from : "2015/08/28 10:46:23:629", 
  tt_to : "2015/08/28 10:46:23:629" }
  ]}, 
refresh:true}' 
AND key='someKey';

Thank you in advance.

harkanJ commented on August 16, 2024

16 node cluster
Replication: NetworkTopologyStrategy, 2 data centers, 3 replicas in each DC.
Not using LWT. Updates all happened a while ago.

Schema:

CREATE TABLE tab1 (
    key text,
    value blob,
    vt_from text,
    vt_to text,
    tt_from text,
    tt_to text,
    lucene text,
    PRIMARY KEY (key)
);

CREATE CUSTOM INDEX tab1_index on tab1(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            key : {type: "string"},
            bitemporal : {
                type : "bitemporal",
                tt_from : "tt_from",
                tt_to : "tt_to",
                vt_from : "vt_from",
                vt_to : "vt_to",
                pattern : "yyyy/MM/dd HH:mm:ss:SSS"}
        }
}'};

Node repair is running now but will take some time.

adelapena commented on August 16, 2024

Ok, thanks.

It would be great if we could have the contents of the searched row:

SELECT vt_from, vt_to, tt_from, tt_to  FROM tab1 WHERE key='someKey';

Obviously we don't need the blob column or any other possibly sensitive data.

Additionally, you could try to index the dates with simple mappers:

CREATE CUSTOM INDEX tab1_index on tab1(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            key : {type: "string"},
            vt_from : {type:"date", pattern : "yyyy/MM/dd HH:mm:ss:SSS"},
            vt_to : {type:"date", pattern : "yyyy/MM/dd HH:mm:ss:SSS"},
            tt_from : {type:"date", pattern : "yyyy/MM/dd HH:mm:ss:SSS"},
            tt_to : {type:"date", pattern : "yyyy/MM/dd HH:mm:ss:SSS"},
            bitemporal : {
                type : "bitemporal",
                tt_from : "tt_from",
                tt_to : "tt_to",
                vt_from : "vt_from",
                vt_to : "vt_to",
                pattern : "yyyy/MM/dd HH:mm:ss:SSS"}
        }
}'};

And then try this equivalent search:

SELECT * FROM tab1 WHERE lucene='{filter : {
type:"boolean",
must:[
 {type : "range", 
  field : "vt_from", 
  upper : "2015/08/28 10:46:23:629", 
  include_upper : true },
 {type : "range", 
  field : "vt_to", 
  lower : "2015/08/28 10:46:23:629", 
  include_lower : true},
 {type : "range", 
  field : "tt_from", 
  upper : "2015/08/28 10:46:23:629", 
  include_upper : true },
 {type : "range", 
  field : "tt_to", 
  lower : "2015/08/28 10:46:23:629", 
  include_lower : true}
  ]}, 
refresh:true}' 
AND key='someKey';

This search is much less efficient, but it could be very useful for determining whether the problem is in the bitemporal mapper or somewhere else.

harkanJ commented on August 16, 2024

Issue is still there with the bitemporal query after running a full node repair.

Running the range query you listed in your last post instead of the bitemporal one does seem to produce a consistent number of rows.

Found: 393104.
****** finished query 2015-08-31T09:34:23.392-04:00
Found: 393104.
****** finished query 2015-08-31T09:36:26.834-04:00

adelapena commented on August 16, 2024

Your posted schema is as follows:

CREATE TABLE tab1 (
    key text,
    value blob,
    vt_from text,
    vt_to text,
    tt_from text,
    tt_to text,
    lucene text,
    PRIMARY KEY (key)
);

The primary key consists only of the column key; vt_from and tt_from are not included as clustering columns. That means each time you insert a data revision for the aforementioned key='someKey' you overwrite the previous value, both in Cassandra and in Lucene, so all your writes for someKey should end up in a single row. Given that, we don't understand how Cassandra can return more than one row for the clause WHERE key='someKey', regardless of the index behavior.

Can you give us more detailed info about how the writes to Cassandra are done? I don't know how we could reproduce the issue otherwise.

harkanJ commented on August 16, 2024

Good point! Not sure how/when vt_from and tt_from got lost in the schema. That does seem to fix the issue.

It does point to a possible issue, though, with the index not updating when an overwrite changes the time columns. Probably not a big issue, as the temporal columns should be part of the key.
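
For reference, a minimal CQL sketch of the table with the temporal columns restored to the primary key. It assumes vt_from and tt_from are meant to be the clustering columns, as suggested in the previous comment; the two INSERTs and their date values are made up for illustration and are not taken from the real data.

```sql
-- CQL sketch (assumption: vt_from and tt_from are the clustering columns).
CREATE TABLE tab1 (
    key text,
    value blob,
    vt_from text,
    vt_to text,
    tt_from text,
    tt_to text,
    lucene text,
    PRIMARY KEY (key, vt_from, tt_from)
);

-- Two illustrative revisions of the same key (hypothetical values).
INSERT INTO tab1 (key, vt_from, vt_to, tt_from, tt_to)
VALUES ('someKey', '2015/01/01 00:00:00:000', '2200/12/31 00:00:00:000',
        '2015/01/01 00:00:00:000', '2200/12/31 00:00:00:000');
INSERT INTO tab1 (key, vt_from, vt_to, tt_from, tt_to)
VALUES ('someKey', '2015/03/05 00:00:00:000', '2200/12/31 00:00:00:000',
        '2015/06/30 00:00:00:000', '2200/12/31 00:00:00:000');

-- Each distinct (key, vt_from, tt_from) combination is stored as its own row.
SELECT vt_from, vt_to, tt_from, tt_to FROM tab1 WHERE key = 'someKey';
```

With PRIMARY KEY (key) alone, the second INSERT would overwrite the columns written by the first instead of adding a row.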
