
Comments (13)

adelapena commented on July 17, 2024

Each row insertion in C* involves a document write in Lucene. Lucene, like the search engines based on it, is much more expressive than C*, at the cost of being significantly slower with writes. Unfortunately it makes no sense to add Lucene's search engine features keeping Cassandra's column-oriented database write throughputs. You can expect a write throughput of a few thousand rows per second per node with relatively modest hardware, which is quite similar to other search engines such as Elasticsearch and Solr.

Asynchronous indexing can improve latencies and increase parallelism for low-concurrency writers, but it's not going to increase sustained write throughput for properly parallel writers. We have had asynchronous indexing in the past, but we removed it because there wasn't a great throughput difference in our use cases, and the inability to know when a row had been indexed made us realize it was not worthwhile. Do you think that adding asynchronous indexing again would be useful for your use case?

from cassandra-lucene-index.

cscetbon commented on July 17, 2024

@adelapena I really don't know. I could try it.

what do you mean by

Unfortunately it makes no sense to add Lucene's search engine features keeping Cassandra's column-oriented database write throughputs

To index columns you first need to push them to the table. Or do you mean we should do it on a special table that holds only static information, or that is updated at a few thousand rows per second at most?

adelapena commented on July 17, 2024

I mean that the max Lucene write throughput is a few thousand rows per second per node, so this is the max throughput that you can expect in an indexed table. Asynchronous indexing could delay/queue the indexing to allow a temporarily higher throughput but, in the long term, the sustained write throughput will be the same, or a little bit lower due to the queueing cost.
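To make the point concrete, here is a rough back-of-the-envelope model in Python, not taken from the plugin. The rates and buffer size are made-up figures, and the function name `simulate` is hypothetical. It only shows that a bounded queue lets writers burst for a while and that accepted writes then settle at the indexer's rate; it ignores the extra queueing cost mentioned above.

```python
def simulate(write_rate, index_rate, buffer_size, seconds):
    """Rows accepted per second when writes pass through a bounded buffer.

    write_rate:  rows/s the clients try to write (illustrative figure)
    index_rate:  rows/s the indexer can drain (illustrative figure)
    buffer_size: maximum number of rows waiting to be indexed
    """
    queued = 0
    accepted_per_second = []
    for _ in range(seconds):
        # The indexer drains what it can, then clients fill the free room.
        queued -= min(queued, index_rate)
        accepted = min(write_rate, buffer_size - queued)
        queued += accepted
        accepted_per_second.append(accepted)
    return accepted_per_second


rates = simulate(write_rate=10_000, index_rate=3_000,
                 buffer_size=20_000, seconds=10)
print(rates)
```

With these figures the first seconds accept the full client rate, and once the buffer is full every later second accepts only what the indexer drained.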

cscetbon commented on July 17, 2024

@adelapena AFAIK, Datastax does it on Cassandra nodes in a dedicated DC, and they get the same write throughput as the others, as they need to replicate updates.
I'm not yet familiar with your code but I've been trying to understand it. I heard that DSE uses batch inserts. Does your code use batches too? I think not, but I see that you could use several threads that change documents in parallel using the same indexWriter, right? In that case, what about using asynchronous batch inserts?

adelapena commented on July 17, 2024

We have re-added the old asynchronous indexing feature. The changes are available in the branch feature/async_indexing if you want to take a look.

Anyway, asynchronous indexing is based on a buffer that decouples the C* table writes from the indexing but, when the buffer is full, it blocks the C* writes. This is done to avoid endless indexing processes. It can improve the write throughput of indexed tables for low-concurrency writers by increasing the indexing parallelism. The trade-off is that writes will return their response before the data is indexed, so clients will not be notified about indexing failures (errors are just logged). DSE Search seems to use a similar approach.
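As an illustration of that design, here is a minimal Python sketch of a bounded buffer drained by worker threads. It is not the plugin's code (which is Java). The class `AsyncIndexer`, the `index_row` callback and the sample rows are all hypothetical. It shows the two properties described above: `submit` blocks when the buffer is full, and indexing errors are logged instead of being returned to the writer.

```python
import logging
import queue
import threading

log = logging.getLogger("async_index_sketch")


class AsyncIndexer:
    """Bounded buffer between writers and a pool of indexing threads."""

    def __init__(self, index_fn, capacity, threads):
        self._index_fn = index_fn
        self._buffer = queue.Queue(maxsize=capacity)
        self._lock = threading.Lock()
        self.failures = 0
        for _ in range(threads):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, row):
        # Returns as soon as the row is queued; blocks while the buffer is full.
        self._buffer.put(row)

    def flush(self):
        # Wait until every queued row has been processed.
        self._buffer.join()

    def _worker(self):
        while True:
            row = self._buffer.get()
            try:
                self._index_fn(row)
            except Exception:
                # The writer already got its response, so we can only log.
                log.exception("indexing failed for row %r", row)
                with self._lock:
                    self.failures += 1
            finally:
                self._buffer.task_done()


indexed = []


def index_row(row):
    # Stand-in for the real indexing step; rejects rows without a name.
    if row["name"] is None:
        raise ValueError("name is required")
    indexed.append(row["id"])


indexer = AsyncIndexer(index_row, capacity=2, threads=2)
for i in range(5):
    indexer.submit({"id": i, "name": None if i == 3 else str(i)})
indexer.flush()
```

Note that the loop submitting rows never sees the failure for row 3; only the log and the `failures` counter record it.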

cscetbon commented on July 17, 2024

okay thank you @adelapena. I'll give it a shot and see if it helps to get better performance.

cscetbon commented on July 17, 2024

@adelapena it's really better, except for those times when the writes are blocked because the buffer is full. Can you confirm that we could lose index data if the node is stopped (or crashes) before the buffer is flushed, for example? I would say that if that happens, even repairs won't fix it, as the data itself would not be inconsistent (only the index files). Would we be informed of that? Is there a way to rebuild only the local index?

cscetbon commented on July 17, 2024

@adelapena any news on that ?

adelapena commented on July 17, 2024

Hi,

Index writes are not considered valid by C* until the method SecondaryIndex#forceBlockingFlush is called. This method forces the queued Lucene documents to be written to disk, waiting for the queue to drain if required. This way, if the node crashes with queued documents, C* will replay the writes from the commitlog. Otherwise, if the node is shut down properly, it will stop accepting new writes and the flush operation will wait for the queued documents to be indexed.
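The following Python sketch is a deliberately simplified model of that recovery behavior, not Cassandra's actual implementation. The class `Node` and its fields are hypothetical names, and a dict stands in for the on-disk index. It assumes only what is stated above: queued documents are lost on a crash, the commit log survives, and writes after the last flush are replayed.

```python
class Node:
    """Toy model: durable commit log and index, volatile indexing queue."""

    def __init__(self, commitlog=None, index=None, flushed_upto=0):
        self.commitlog = commitlog if commitlog is not None else []  # durable
        self.index = index if index is not None else {}              # durable
        self.flushed_upto = flushed_upto  # commit log position covered by a flush
        self.pending = []                 # in-memory queue, lost on a crash

    def write(self, key, value):
        self.commitlog.append((key, value))
        self.pending.append((key, value))

    def flush(self):
        # Plays the role of the blocking flush: drain the queue to "disk".
        for key, value in self.pending:
            self.index[key] = value
        self.pending.clear()
        self.flushed_upto = len(self.commitlog)

    def replay(self):
        # Re-apply writes that were never flushed; upserts by key are idempotent.
        for key, value in self.commitlog[self.flushed_upto:]:
            self.index[key] = value
        self.flushed_upto = len(self.commitlog)


node = Node()
node.write(1, "a")
node.flush()
node.write(2, "b")
node.write(1, "c")

# Simulate a crash: only the durable state survives, the queue is gone.
restarted = Node(node.commitlog, dict(node.index), node.flushed_upto)
index_after_crash = dict(restarted.index)
restarted.replay()
```

Because each write is an upsert keyed by the row, replaying a write that had in fact already reached the index would leave the index unchanged.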

An alternative approach would be to use a persistent or off-heap queue instead of the existing buffer. Such a non-blocking queue would have a larger capacity and would allow a longer delay between storing and indexing. However, the computational cost of such a solution (serialization and IO) would reduce the sustained throughput, which won't be acceptable in many use cases.

Anyway, you can rebuild local indexes using nodetool.

cscetbon commented on July 17, 2024

@adelapena If I understand you correctly, you're saying that even with the current asynchronous method, we should not lose index data in either case (shutdown or C* crash).

Regarding the rebuild, I don't see any way to specify that we want to rebuild them only locally

adelapena commented on July 17, 2024

Yes, @cscetbon, it shouldn't lose data due to a shutdown or crash. You can easily test it yourself. You can set Stratio's log level to debug in <CASSANDRA_HOME>/conf/logback.xml:

<logger name="com.stratio.cassandra" level="DEBUG"/>

Then run Cassandra and do some insertions, for example:

CREATE KEYSPACE IF NOT EXISTS my_keyspace WITH replication = {'class':'SimpleStrategy', 'replication_factor':1};

USE my_keyspace;

CREATE TABLE IF NOT EXISTS my_table(id int, name varchar, lucene TEXT, PRIMARY KEY(id));
CREATE CUSTOM INDEX my_index ON my_table (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds' : '1',
   'indexing_threads' : '8',
   'schema' : '{
      fields : {
         name : {type : "string"}
      }
   }'
};

CONSISTENCY ALL;
INSERT INTO my_table(id, name) VALUES (1, '1');

SELECT * FROM my_table WHERE lucene = '{filter: { type: "match", field: "name", value: "1" }}';

You can see the document write in the C* log file:

DEBUG 15:48:24 my_keyspace.my_table.my_index update document Document<indexed,tokenized,omitNorms,indexOptions=DOCS,numericType=LONG,numericPrecisionStep=16,docValuesType=NUMERIC<_token_murmur:-4069959284402364209> stored,indexed,omitNorms,indexOptions=DOCS<_partition_key:[0 0 0 1]> indexed,tokenized,omitNorms,indexOptions=DOCS<name:1>> with term _partition_key:

Then stop C* with kill -9 to simulate a crash, and start it again. You will see in the log file that the write, which is an idempotent operation, is replayed:

DEBUG 15:52:48 my_keyspace.my_table.my_index update document Document<indexed,tokenized,omitNorms,indexOptions=DOCS,numericType=LONG,numericPrecisionStep=16,docValuesType=NUMERIC<_token_murmur:-4069959284402364209> stored,indexed,omitNorms,indexOptions=DOCS<_partition_key:[0 0 0 1]> indexed,tokenized,omitNorms,indexOptions=DOCS<name:1>> with term _partition_key:

Regarding the index rebuild, the nodetool command has an argument to specify the address of the specific node whose index is going to be rebuilt.

The trade-off of asynchronous indexing is that if the index validation fails, clients will not be notified about the failure. For example, if the row to be indexed contains a column which is mapped as a date, and the column doesn't match the date pattern, the client will see a successful operation. Of course, the failure will be logged in the affected nodes. This happens because the validation is performed behind the distribution layer. Fortunately this behavior has been improved with CASSANDRA-10092, so future versions will be able to perform index validation before the write is distributed and queued.
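As a rough illustration of what validating up front means, here is a small Python sketch. It is neither the plugin's validation code nor the mechanism introduced by CASSANDRA-10092. The function `validate_row`, the column names and the date pattern are all assumptions made for the example. The idea is simply that the check runs before the write is accepted, so the client gets the error instead of a log entry on some node.

```python
from datetime import datetime


def validate_row(row, date_columns, pattern="%Y/%m/%d"):
    """Raise ValueError if any mapped date column does not match the pattern."""
    for column in date_columns:
        value = row.get(column)
        if value is None:
            continue  # nothing to index for this column
        try:
            datetime.strptime(value, pattern)
        except ValueError as exc:
            raise ValueError(
                f"column {column!r} does not match {pattern!r}: {value!r}"
            ) from exc


# Accepted: the value matches the (hypothetical) mapping pattern.
validate_row({"id": 1, "born": "2015/08/24"}, date_columns=["born"])

# Rejected before anything is queued, so the caller sees the failure.
try:
    validate_row({"id": 2, "born": "24-08-2015"}, date_columns=["born"])
    rejected = False
except ValueError:
    rejected = True
```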

cscetbon commented on July 17, 2024

Thank you for all the information you provided; it should be really useful. I'm eager to test it and to see the new validation feature added to your code, hoping that it won't affect performance. I agree that this way we could avoid having issues that are caught in the background and never returned to users.

adelapena commented on July 17, 2024

You're welcome :)
