Hey, I've benchmarked insertions on a cassandra table with and witho

2 to 6 times the insert rate without a stratio index about cassandra-lucene-index HOT 13 CLOSED

stratio commented on July 17, 2024

2 to 6 times the insert rate without a stratio index

from cassandra-lucene-index.

Comments (13)

adelapena commented on July 17, 2024

Each row insertion in C* involves a document write in Lucene. Lucene, like the search engines based on it, is much more expressive than C* at the cost of being significantly slower with writes. Unfortunately it makes no sense to add Lucene's search engine features keeping Cassandra's column-oriented database write throughputs. You can expect a write throughput of a few thousand rows per second per node with relatively modest hardware, which is quite similar to other search engines such as Elasticsearch and Solr.

Asynchronous indexing can improve latencies and increase parallelism for low-concurrency writers, but it's not going to increase sustained write throughput for properly parallel writers. We have had asynchronous indexing in the past, but we removed it because there wasn't a great throughput difference in our use cases, and the inability to know when a row had been indexed made us realize it was not worthwhile. Do you think that adding asynchronous indexing again would be useful for your use case?


cscetbon commented on July 17, 2024

@adelapena I really don't know. I could try it.

What do you mean by:

"Unfortunately it makes no sense to add Lucene's search engine features keeping Cassandra's column-oriented database write throughputs"

To index columns, you first need to push them to the table. Or do you mean we should do it on a special table that holds only static information, or one that is updated at no more than a few thousand rows per second?


adelapena commented on July 17, 2024

I mean that the max Lucene write throughput is a few thousand rows per second per node, so this is the max throughput that you can expect in an indexed table. Asynchronous indexing could delay/queue the indexing to allow a temporarily higher throughput but, in the long term, the sustained write throughput will be the same, or a little bit lower due to the queueing cost.
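The arithmetic behind this can be sketched as follows. The function names and the figures are illustrative assumptions, not measurements from the plugin: a buffer of fixed capacity absorbs the gap between the offered write rate and the indexing rate only until it fills up, after which writers are throttled to the indexing rate.

```python
def seconds_until_buffer_full(capacity_rows, write_rate, index_rate):
    """Return how long writes can outpace indexing before the buffer fills.

    Rates are in rows per second. Returns None when the indexer keeps up.
    """
    surplus = write_rate - index_rate
    if surplus <= 0:
        return None  # the buffer never fills, so writers are never throttled
    return capacity_rows / surplus


def sustained_write_rate(write_rate, index_rate):
    """Once the buffer is full, writers cannot go faster than the indexer."""
    return min(write_rate, index_rate)


# Hypothetical figures: 100,000-row buffer, 10,000 rows/s offered, 4,000 rows/s indexed.
print(seconds_until_buffer_full(100_000, 10_000, 4_000))
print(sustained_write_rate(10_000, 4_000))
```

With these assumed numbers the burst lasts well under a minute, and from then on the table accepts writes at the indexing rate.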


cscetbon commented on July 17, 2024

@adelapena AFAIK, Datastax does it on Cassandra nodes in a dedicated DC, and those nodes get the same write throughput as the others because they need to replicate updates.
I'm not yet familiar with your code, but I've been trying to understand it. I heard that DSE uses batch inserts. Does your code use batches too? I think not, but I see that you could use several threads that change documents in parallel using the same indexWriter, right? In that case, what about using asynchronous batch inserts?


adelapena commented on July 17, 2024

We have re-added the old asynchronous indexing feature. The changes are available in the branch feature/async_indexing; you can take a look at it.

Anyway, asynchronous indexing is based on a buffer that decouples the C* table writes from the indexing but, when the buffer is full, it blocks the C* writes. This is done to avoid endless indexing processes. This can improve the write throughput of indexed tables for low-concurrency writers by increasing the indexing parallelism. The trade-off is that writes return their response before the data is indexed, so clients are not notified about indexing failures (errors are just logged). DSE Search seems to use a similar approach.
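A minimal sketch of this buffering scheme, assuming a fixed-size in-memory queue and a small pool of indexing threads. The names (AsyncIndexer, index_row) are hypothetical and are not taken from the plugin's code; the point is only to show the writer blocking when the buffer is full and failures being logged rather than returned.

```python
import queue
import threading


class AsyncIndexer:
    """Toy model: a bounded buffer between table writes and index writes."""

    def __init__(self, index_fn, buffer_size=4, threads=2):
        self._queue = queue.Queue(maxsize=buffer_size)  # bounded on purpose
        self._index_fn = index_fn
        for _ in range(threads):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            row = self._queue.get()
            try:
                self._index_fn(row)
            except Exception as exc:
                # write() has already returned to the caller, so the failure
                # can only be logged, not reported back.
                print(f"indexing failed for {row!r}: {exc}")
            finally:
                self._queue.task_done()

    def write(self, row):
        # Blocks the caller while the buffer is full (back-pressure).
        self._queue.put(row)

    def flush(self):
        # Wait until every queued row has been handed to index_fn.
        self._queue.join()


indexed = []
lock = threading.Lock()


def index_row(row):
    with lock:
        indexed.append(row)


indexer = AsyncIndexer(index_row)
for i in range(20):
    indexer.write(i)
indexer.flush()
print(sorted(indexed) == list(range(20)))
```

The buffer size and thread count are arbitrary here; a larger buffer only lengthens the burst that can be absorbed before writers start to block.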


cscetbon commented on July 17, 2024

okay thank you @adelapena. I'll give it a shot and see if it helps to get better performance.


cscetbon commented on July 17, 2024

@adelapena it's much better, except for those times when the writes are locked because the buffer is full. Can you confirm that we could lose index data if the node is stopped (or crashes), for example, before the buffer is flushed? I would say that if that happens, even repairs won't fix it, as the data itself would not be inconsistent (only the index files). Would we be informed of that? Is there a way to rebuild only the local index?


cscetbon commented on July 17, 2024

@adelapena any news on that?


adelapena commented on July 17, 2024

Hi,

Index writes are not considered valid by C* until the method SecondaryIndex#forceBlockingFlush is called. This method forces the queued Lucene documents to be written to disk, waiting for the queue to drain if required. This way, if the node crashes with queued documents, C* will replay the writes from the commitlog. Otherwise, if the node is properly shut down, it will stop accepting new writes and the flush operation will wait for the queued documents to be indexed.
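As a rough illustration of why a crash does not lose index data, here is a toy model of that flush-and-replay cycle. It is a simplification under stated assumptions, not Cassandra's implementation: ToyNode and its methods are hypothetical, the commit log is a plain list, and indexing is an upsert by key, which is what makes replay safe to repeat.

```python
class ToyNode:
    """Toy model of commit-log replay; not Cassandra's actual implementation."""

    def __init__(self):
        self.commitlog = []  # assumed to survive a crash
        self.pending = []    # in-memory index queue, lost on a crash
        self.index = {}      # stands in for the on-disk index, keyed by primary key

    def write(self, key, value):
        self.commitlog.append((key, value))  # logged before anything else
        self.pending.append((key, value))    # queued for asynchronous indexing

    def force_blocking_flush(self):
        # Drain the queue into the index, then discard the covered log entries.
        for key, value in self.pending:
            self.index[key] = value          # upsert by key, so it is idempotent
        self.pending.clear()
        self.commitlog.clear()

    def crash_and_restart(self):
        self.pending.clear()                 # queued documents are gone
        for key, value in self.commitlog:
            self.index[key] = value          # replay; repeating an upsert is harmless


node = ToyNode()
node.write(1, "a")
node.force_blocking_flush()
node.write(2, "b")  # queued but not yet flushed
node.crash_and_restart()
print(node.index)
```

Row 2 was never flushed before the simulated crash, yet it ends up in the index because its commit-log entry had not been discarded.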

An alternative approach would be to use a persistent or off-heap queue instead of the existing buffer. This non-blocking queue would have a larger capacity and would allow a longer delay between storing and indexing. However, the computational cost of such a solution (serialization and IO) would slow down the sustained throughput, which won't be acceptable in many use cases.

Anyway, you can rebuild local indexes using nodetool.


cscetbon commented on July 17, 2024

@adelapena If I understand you correctly, you're saying that even with the current asynchronous method, we should not lose index data in any case (shutdown, C* crash).

Regarding the rebuild, I don't see any way to specify that we want to rebuild the indexes only locally.


adelapena commented on July 17, 2024

Yes, @cscetbon, it shouldn't lose data due to a shutdown or crash. You can easily test it yourself. You can set Stratio's log level to debug in <CASSANDRA_HOME>/conf/logback.xml:

<logger name="com.stratio.cassandra" level="DEBUG"/>

Then run Cassandra and do some insertions, for example:

CREATE KEYSPACE IF NOT EXISTS my_keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};

USE my_keyspace;

CREATE TABLE IF NOT EXISTS my_table(id int, name varchar, lucene TEXT, PRIMARY KEY(id));
CREATE CUSTOM INDEX my_index ON my_table (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds' : '1',
   'indexing_threads' : '8',
   'schema' : '{
      fields : {
         name : {type : "string"}
      }
   }'
};

CONSISTENCY ALL;
INSERT INTO my_table(id, name) VALUES (1, '1');

SELECT * FROM my_table WHERE lucene = '{filter: { type: "match", field: "name", value: "1" }}';

You can see the document write in the C* log file:

DEBUG 15:48:24 my_keyspace.my_table.my_index update document Document<indexed,tokenized,omitNorms,indexOptions=DOCS,numericType=LONG,numericPrecisionStep=16,docValuesType=NUMERIC<_token_murmur:-4069959284402364209> stored,indexed,omitNorms,indexOptions=DOCS<_partition_key:[0 0 0 1]> indexed,tokenized,omitNorms,indexOptions=DOCS<name:1>> with term _partition_key:

Then stop C* with kill -9 to simulate a crash, and start it again. You will see in the log file that the write, which is an idempotent operation, is replayed:

DEBUG 15:52:48 my_keyspace.my_table.my_index update document Document<indexed,tokenized,omitNorms,indexOptions=DOCS,numericType=LONG,numericPrecisionStep=16,docValuesType=NUMERIC<_token_murmur:-4069959284402364209> stored,indexed,omitNorms,indexOptions=DOCS<_partition_key:[0 0 0 1]> indexed,tokenized,omitNorms,indexOptions=DOCS<name:1>> with term _partition_key:

Regarding the index rebuild, the nodetool command has an argument to specify the address of the node whose index is going to be rebuilt.

The trade-off of asynchronous indexing is that if the index validation fails, clients will not be notified about the failure. For example, if the row to be indexed contains a column which is mapped as a date, and the column doesn't match the date pattern, the client will see a successful operation. Of course, the failure will be logged on the affected nodes. This is so because the validation is performed behind the distribution layer. Fortunately this behavior has been improved with CASSANDRA-10092, so future versions will be able to perform index validation before the write is distributed and queued.
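The difference that early validation makes can be sketched like this. It is only an illustration under assumptions: validate_row, write, the "created" column and the date pattern are all hypothetical, and none of it reflects the plugin's or CASSANDRA-10092's actual API. The idea is that a row is checked synchronously, before it is queued, so the client sees the failure instead of a logged error.

```python
from datetime import datetime


def validate_row(row, date_columns, pattern="%Y-%m-%d"):
    """Raise ValueError if a date-mapped column does not match the pattern."""
    for column in date_columns:
        value = row.get(column)
        if value is not None:
            datetime.strptime(value, pattern)  # raises ValueError on mismatch


def write(row, buffer, date_columns=("created",)):
    validate_row(row, date_columns)  # fails synchronously, before acknowledging
    buffer.append(row)               # only valid rows are queued for indexing
    return "ok"


buffer = []
print(write({"id": 1, "created": "2015-09-01"}, buffer))
try:
    write({"id": 2, "created": "not a date"}, buffer)
except ValueError as exc:
    print("rejected:", exc)
```

Only the first row reaches the buffer; the second is rejected before anything is queued, which is the behavior the client could not observe when validation ran after the write had already been acknowledged.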


cscetbon commented on July 17, 2024

Thank you for all the information you provided; it should be really useful. I'm eager to test it and to see the new validation feature added to your code, hoping that it won't affect performance. I agree that this way we could avoid issues being caught in the background without being reported to users.


adelapena commented on July 17, 2024

You're welcome :)
