
nrtSearch

A high-performance gRPC server, with optional REST APIs, built on the Apache Lucene 8.x source, exposing Lucene's core functionality over a simple gRPC-based API.

Documentation is available at readthedocs.

Features

  • Relies on Lucene's near-real-time segment replication for data replication. A dedicated primary/writer node handles indexing and expensive operations like segment merges, so the replicas' system resources can be dedicated entirely to serving search queries. This is in contrast to the document replication approach taken by some other popular Lucene-based search engines, such as Elasticsearch, where every node is both a writer and a reader.
  • Supports concurrent query execution, another feature missing from popular Lucene-based search engines such as Elasticsearch.
  • Can be deployed as a "stateless microservice". Indexes are backed up to S3, and clients can choose to commit data outside of this system once their backup is complete. Upon restart, e.g. when bringing up a new container, clients can choose to bootstrap indexes from their previously backed-up state. The ability to deploy in a stateless manner allows for easy scaling using container tools like Kubernetes, Mesos, etc.
  • Provides gRPC streaming APIs for indexing and searching, and also supports REST APIs.

Design

The design goals are largely the same as those of the Lucene Server project. This project uses ideas and code from luceneserver and builds on them.

A single node can index a stream of documents and run near-real-time searches via a parsed query string, including "scrolled" searches, sorting, index-time sorting, etc.

Fields must first be registered with the registerFields command, where you declare whether each field will be searched, sorted, etc.; documents can then be indexed with those fields.

There is no transaction log, so you must call commit yourself periodically to make recent changes durable on disk. This means that if a node crashes, all documents indexed since the last commit are lost.
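
For example, a client might schedule a periodic commit over gRPC. Below is a minimal sketch; the CommitRequest message and blocking-stub names are assumptions based on the endpoints this README describes.

import com.yelp.nrtsearch.server.grpc.CommitRequest;
import com.yelp.nrtsearch.server.grpc.LuceneServerGrpc;

// Hypothetical helper: ask the server to make everything indexed so far durable.
static void commitIndex(LuceneServerGrpc.LuceneServerBlockingStub stub, String indexName) {
  stub.commit(CommitRequest.newBuilder().setIndexName(indexName).build());
}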

Indexing a stream of documents

NrtSearch supports client-side gRPC streaming for its addDocuments endpoint. This means the server API accepts a stream of documents, and the client can choose to stream the documents however it wishes. The example nrtSearch client implemented here reads a CSV file and streams documents from it over to the server. The server indexes chunks of documents, whose size is configurable, as the client continues to send more documents over its stream. gRPC enables this with minimal application code and yields higher performance compared to JSON. TODO: add performance numbers of stream-based indexing for some datasets.
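
To illustrate, a client-side streaming call might look like the sketch below. The generated stub and message names (LuceneServerGrpc.LuceneServerStub, AddDocumentRequest, AddDocumentResponse) are assumed from the proto files; treat this as a sketch rather than the shipped client.

import com.yelp.nrtsearch.server.grpc.AddDocumentRequest;
import com.yelp.nrtsearch.server.grpc.AddDocumentResponse;
import com.yelp.nrtsearch.server.grpc.LuceneServerGrpc;
import io.grpc.stub.StreamObserver;

// Stream documents to the server, which indexes them in configurable-size
// chunks while the client keeps sending.
static void streamDocuments(LuceneServerGrpc.LuceneServerStub stub,
                            Iterable<AddDocumentRequest> docs,
                            StreamObserver<AddDocumentResponse> responseObserver) {
  StreamObserver<AddDocumentRequest> requestObserver = stub.addDocuments(responseObserver);
  try {
    for (AddDocumentRequest doc : docs) {
      requestObserver.onNext(doc);
    }
    requestObserver.onCompleted();  // signal end of stream; server acks via responseObserver
  } catch (RuntimeException e) {
    requestObserver.onError(e);     // abort the stream on client-side failure
    throw e;
  }
}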

Near-real-time replication

This requirement was one of the primary reasons for creating this project. Near-real-time replication seems a good alternative to document-based replication when it comes to the costs of maintaining large clusters: scaling document-replicated clusters up or down in a timely manner can be slow, due to data migration between nodes on top of the cost of reindexing on all nodes.

Below is a depiction of how the system works with regard to near-real-time (NRT) replication and durability.

  • The primary node comes up with no index, reads segments from disk, or restores an index from remote storage if the restore option is specified by the client on the startIndex command. This node accepts indexing requests from clients. It also periodically issues publishNrtUpdate to the replicas, giving them a chance to catch up with the latest indexing changes on the primary.
  • Replica nodes are also started using the startIndex command. They sync with the current primary and update their indexes using Lucene's NRT APIs. They can also restore the index from remote storage and then receive the updates made since the last backup. These nodes serve clients' search queries.
  • Each time the client invokes commit on the primary, it saves its current index state and related metadata, e.g. schemas and settings, to disk. Clients should use the ack from this endpoint to commit the data in their channel, e.g. Kafka.
  • The client can invoke backupIndex on the primary to back up the index to remote storage.
  • If a replica crashes, a new one can be brought up and will re-sync with the current primary (optionally restoring the index from remote storage first). It registers itself with the primary once it is up.
  • If a primary crashes, a new one can be brought up with the restore option on the startIndex command to regain the state previously stored in the cloud; but since primaries don't serve search requests, they can also use network-attached storage, e.g. Amazon EBS, to persist data across restarts. The replicas then re-sync their indexes with the new primary.

Build Server and Client

From the project's home directory:

./gradlew clean installDist test

Note: this code has been tested on Java 17.

Run gRPC Server

./build/install/nrtsearch/bin/lucene-server

Build gRPC Gateway

./gradlew buildGrpcGateway

Run REST Server (use the appropriate binary for your platform, e.g. for macOS)

./build/install/nrtsearch/bin/http_wrapper-darwin-amd64 <gRPC_PORT> <REST_PORT>

Example to run some basic client commands

Create Index

./build/install/nrtsearch/bin/lucene-client createIndex --indexName testIdx
curl -XPOST localhost:<REST_PORT>/v1/create_index -d '{"indexName": "testIdx"}'

Update Settings

./build/install/nrtsearch/bin/lucene-client settings -f settings.json
cat settings.json
{
  "indexName": "testIdx",
  "directory": "MMapDirectory",
  "nrtCachingDirectoryMaxSizeMB": 0.0,
  "indexMergeSchedulerAutoThrottle": false,
  "concurrentMergeSchedulerMaxMergeCount": 16,
  "concurrentMergeSchedulerMaxThreadCount": 8
}

Start Index

./build/install/nrtsearch/bin/lucene-client startIndex -f startIndex.json
cat startIndex.json
{
  "indexName" : "testIdx"
}

RegisterFields

./build/install/nrtsearch/bin/lucene-client registerFields -f registerFields.json
cat registerFields.json
{
  "indexName": "testIdx",
  "field": [
    { "name": "doc_id", "type": "ATOM", "storeDocValues": true },
    { "name": "vendor_name", "type": "TEXT", "search": true, "store": true, "tokenize": true },
    { "name": "license_no", "type": "INT", "multiValued": true, "storeDocValues": true }
  ]
}

Add Documents

./build/install/nrtsearch/bin/lucene-client addDocuments -i testIdx -f docs.csv -t csv
cat docs.csv
doc_id,vendor_name,license_no
0,first vendor,100;200
1,second vendor,111;222

Search

./build/install/nrtsearch/bin/lucene-client search -f search.json
cat search.json
{
  "indexName": "testIdx",
  "startHit": 0,
  "topHits": 100,
  "retrieveFields": ["doc_id", "license_no", "vendor_name"],
  "queryText": "vendor_name:first vendor"
}

API documentation

The build uses the protoc-gen-doc program, run inside a Docker container, to generate HTML (or markdown) documentation from the proto files. The gradle task to generate this documentation is as follows.

./gradlew buildDocs

This should create a src/main/docs/index.html file that can be viewed in your local browser.

Yelp Indexing tool

Reviews

This tool indexes Yelp reviews available from the Yelp dataset challenge. By default it runs with only 1k reviews from reviews.json, or you can download the Yelp dataset, place review.json in the user.home dir, and the tool will use that instead. The complete review.json should have close to 7 million reviews. The tool runs multi-threaded indexing and, in parallel, a search thread that reports totalHits. Command to run this specific test:

./gradlew clean installDist :test -PincludePerfTests=* --tests "com.yelp.nrtsearch.server.YelpReviewsTest.runYelpReviews" --info

Suggestions

This test indexes businesses, creates an infix suggester, and fetches suggestions. It requires a host, a port, and a writeable directory on a standalone nrtSearch server.

./gradlew :test -DsuggestTmp=remoteServerDir -DsuggestHost=yourStandaloneServerHost -DsuggestPort=yourStandaloneServerPort --tests "com.yelp.nrtsearch.server.YelpSuggestTest"


nrtsearch's Issues

buildDocs broken for Java 13

buildDocs is broken after the Java 13 upgrade. Stacktrace here. The dependency plugin pseudomuto/protoc-gen-doc probably does not work with Java 13.

Workaround: buildDocs still works with Java 12. Until this is fixed, ./gradlew buildDocs will work if your JAVA_HOME is set to Java 12. The output of buildDocs is an index.html file which can be rendered in the browser.

Add custom similarity support

Currently, the only supported Lucene similarity implementations for a field are classic and BM25. We should create a plugin interface to allow registration of custom similarity implementations.

Replica doesn't receive new nrt point until a document is indexed to a primary

The primary currently seems to send a new NRT point to replicas only after it creates new segments. This can be problematic when indexing QPS is very low, since the replica may not see new segments for a long time. One solution is to send a new NRT point to the replica as soon as a document is added on the primary.

Use IndexWriter.updateDocuments() in AddDocumentHandler

Currently we don't provide an API to update documents in NRTSearch; instead we have the addDocuments request. It is up to the clients to send delete and add document requests externally, which has the following problems:

  1. It can lead to inconsistencies, since refreshes can happen between these requests.
  2. If there are two add documents requests, they end up as two separate documents rather than one document being updated.

It would be good to have the updates happen atomically in a single request. We could use updateDocuments() from Lucene for this, as sketched after the steps below:

  1. Change addDocuments in AddDocumentsHandler to use updateDocuments()
  2. Add a notion of doc_id to FieldDefs which will be used with the TermNode for deletion.
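
A minimal sketch of step 1, using Lucene's atomic delete-then-add; the doc_id field name follows step 2, and the helper itself is hypothetical.

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Atomically delete all documents matching the doc_id term and add the new
// versions; no refresh can observe the delete without the adds.
static void upsert(IndexWriter writer, String docId, Iterable<Document> docs) throws IOException {
  writer.updateDocuments(new Term("doc_id", docId), docs);
}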

Add Term query support for Boolean fields

Currently, Boolean fields are not usable in Term queries. BooleanFieldDef should be modified to implement the TermQueryable interface, so it can match against 'true' and 'false'.
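
A hedged sketch of what the TermQueryable conversion might look like; the indexed token values ("true"/"false") are assumptions about the field's stored representation.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Normalize the query text to a boolean, then build a plain TermQuery against
// the assumed indexed token.
static Query booleanTermQuery(String field, String text) {
  boolean value = Boolean.parseBoolean(text);
  return new TermQuery(new Term(field, value ? "true" : "false"));
}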

Enable suggestion lookup to filter on contexts which are fields on the indexed documents

We need to be able to filter on contexts during suggest lookup. This functionality currently works only when we build the suggest index from a localSource (i.e. a local file). However, we do not set up the DocumentDictionary with contexts when using a nonLocalSource, i.e. when mapping a context to a field already indexed.

We will need to update the DocumentDictionary construction during buildSuggest and update the lookup API to support this.

Support parallel segment search

Lucene enabled parallel segment search in LUCENE-6294.

Another fix, for early termination when numHits is met, was backported to 8.3: LUCENE-8939.

This functionality is currently missing from our ES deployments, since Elasticsearch (at least the version we have in production) does not support it.

Being able to search over segments in parallel should be a crucial performance win (w.r.t. latency).

Note: the "leaves"/"segments" are divided into "leafSlices" in Lucene, and each LeafSlice is then searched by its own thread. A better segment-to-thread mapping algorithm was merged in Lucene 9.0 (master): LUCENE-8757. It might be good to borrow some of that code so we can more easily customize how we slice the leaves for individual threads. A sketch of the executor-based API follows.

StartHit + topHits behavior

Say a query has 4 total hits, and we set startHit to 1 and topHits to 3. This returns 2 hits, since 3 top hits were received from the collector and the first was skipped because startHit was 1. Should the behavior instead be that if 3 top hits are requested, we return 3 top hits starting from startHit?

Clearer exception: "Field \'" + name + "\' cannot be used in an expression: it was not registered with sort=true"

This error in FieldDefBindings, throw new IllegalArgumentException("Field \'" + name + "\' cannot be used in an expression: it was not registered with sort=true");, is thrown when the field is neither virtual nor a field (including numeric) with doc values type DocValuesType.NUMERIC. So the error can be thrown if the field is not numeric, is numeric but multivalued, or is numeric but its doc values were not stored; adding sort: true for the field in such cases will still throw this error. We should change the exception to reflect the actual requirement, i.e. that the field must be virtual or have NUMERIC doc values.
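
A possible rewording, shown only as a sketch of the message (the exact phrasing is a suggestion):

throw new IllegalArgumentException(
    "Field '" + name + "' cannot be used in an expression: "
        + "it must be a virtual field or have single-valued NUMERIC doc values");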

remove rootDir param from createIndex

Although #43 removed the need to pass an explicit indexDir name on each createIndex call, we still send this in. We would need to remove it from the supporting clients as well.

If Primary down when Replica coming up, do we choose to use existing replica index or fail replica startup?

When sendNewReplica() fails, for example when the primary is not available, the code fails and prevents the replica from starting up.

if (t.getMessage().startsWith("replica cannot start") == false) {
  message("exc on start:");
  t.printStackTrace(printStream);
} else {
  dir.close();
}
throw IOUtils.rethrowAlways(t);

Choosing consistency:
We could choose to not bring the replica up when the primary is not up, which is what happens right now.
Choosing availability:
Since the replica downloads data from S3 and has data up to the last commit point, we could alternatively consume the exception within sendNewReplica() and choose not to throw it. This would mean the replica is not connected to the primary, but it can still serve search on the existing frozen index.
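
A sketch of the availability option, assuming the exception can be consumed where sendNewReplica() is called (the method placement and logger are hypothetical):

try {
  sendNewReplica();  // register this replica with the primary
} catch (Exception e) {
  // Primary unreachable: stay up and serve the frozen index restored from S3.
  // The replica remains disconnected from the primary until a later re-sync.
  logger.warn("primary unavailable at startup; serving last-committed index", e);
}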

add blessing functionality to snapshots

Each time we back up an index we upload it to S3, and upon restore we download it from S3. We need to be able to handle versioning of these resources as well; for example, we need to upload some metadata to ensure this:

To bless a version we could do the following

        s3.putObject(BUCKET_NAME, "testservice/_version/testresource/1", "abcdef");
        s3.putObject(BUCKET_NAME, "testservice/_version/testresource/_latest_version", "1");

where abcdef is the resource hash we generate in the upload command that is already implemented:

String versionHash = upload("testservice", "testresource", path);
bless("testservice", "testresource", versionHash);

The download code then already ensures that we download the latest version of the backed-up data.
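
A minimal sketch of the bless() helper above, built from the putObject calls already shown (the method placement is hypothetical):

import com.amazonaws.services.s3.AmazonS3;

// Record the resource hash under the new version key, then advance the
// _latest_version pointer so subsequent downloads pick up this version.
static void bless(AmazonS3 s3, String bucket, String service, String resource,
                  String version, String resourceHash) {
  s3.putObject(bucket, service + "/_version/" + resource + "/" + version, resourceHash);
  s3.putObject(bucket, service + "/_version/" + resource + "/_latest_version", version);
}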

totalHits missing in response when hits are zero

I noticed that the totalHits field in searchResponse was completely absent when there were no hits, instead of being set to 0. When I set totalHits to 100 instead of the actual hit count, the field appeared in the response with value 100. This happened with the LuceneServerClient as well as when using the REST gateway.
My guess is that this has something to do with 0 being the default value for the long type, and with either how gRPC encodes the message or how the grpc-gateway and the LuceneServerClient convert the gRPC response into JSON. Note that we probably won't see this issue when using gRPC only, since gRPC clients should handle defaults properly across platforms.
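
If the proto3-to-JSON conversion is indeed the cause, one hedged fix is to print default-valued fields explicitly on the Java side:

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageOrBuilder;
import com.google.protobuf.util.JsonFormat;

// Proto3 JSON printers skip fields at their default value (0 for a long)
// unless told otherwise; this keeps totalHits present even when it is 0.
static String toJsonWithDefaults(MessageOrBuilder response) throws InvalidProtocolBufferException {
  return JsonFormat.printer().includingDefaultValueFields().print(response);
}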

Restoring replica on same host as primary fails since backup contains stateDir Path of Primary

When we back up data, we back up the stateDir and indexes of the primary.
The stateDir contains a file that maps index names to indexDir paths:

{
 "test_index" : "/path/to/indexDir/of/primary"
}

Bug repro steps

  1. Run primary and replica on same host
  2. Backup primary
  3. Down replica
  4. Restore replica using backed up resource info on same host as primary

In the current code we re-use the primary's state file from above, which points to the primary indexDir instead of the replica indexDir.

One solution is for the replica to simply upload its own stateDir under a different namespace; upon restore, since the server's Mode is known to be replica, we can download the correct stateDir, which in turn points to the old (existing) indexDir path.

This issue does not arise if we run primary and replica on separate hosts but with the same stateDir path.

Start index error masked due to null printstream

Lucene uses a PrintStream for various logs, but we set this to null in https://github.com/Yelp/nrtsearch/blob/master/src/main/java/com/yelp/nrtsearch/server/luceneserver/ShardState.java#L783 when we set verbose to false. So when an exception is caught here https://github.com/apache/lucene-solr/blob/master/lucene/replicator/src/java/org/apache/lucene/replicator/nrt/ReplicaNode.java#L303, an NPE due to the null PrintStream is thrown, masking the original error.
The PrintStream is also used by Lucene for NRT replication logs, which can be quite frequent and which we mostly don't need in production; that is why we don't provide it. We may have to look into a better way to avoid the NRT logs while still providing the stream, or raise an issue in Lucene to rethrow the original exception if the PrintStream is null.
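
One hedged alternative to passing null: hand Lucene a PrintStream that discards everything, so the NPE cannot occur while the NRT logs stay suppressed (OutputStream.nullOutputStream() requires Java 11+).

import java.io.OutputStream;
import java.io.PrintStream;

// Writes are accepted and dropped, so ReplicaNode can log without an NPE.
PrintStream discarding = new PrintStream(OutputStream.nullOutputStream(), false);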

Create archiver directory automatically if it's missing

An exception is thrown if the archiver directory provided for the server does not exist:
java.lang.RuntimeException: java.io.IOException: Archiver directory doesn't exist: /nail/home/sarthakn/nrtsearch_talk/archiver
    at org.apache.platypus.server.luceneserver.StartIndexHandler.downloadArtifact(StartIndexHandler.java:133)
    at org.apache.platypus.server.grpc.LuceneServer$LuceneServerImpl.startIndex(LuceneServer.java:302)
    at org.apache.platypus.server.grpc.LuceneServerGrpc$MethodHandlers.invoke(LuceneServerGrpc.java:2109)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:172)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:331)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:817)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: java.io.IOException: Archiver directory doesn't exist: /nail/home/sarthakn/nrtsearch_talk/archiver
    at org.apache.platypus.server.utils.ArchiverImpl.download(ArchiverImpl.java:92)
    at org.apache.platypus.server.luceneserver.StartIndexHandler.downloadArtifact(StartIndexHandler.java:131)
    ... 12 more

We can just create the directory if it's missing instead of throwing an exception.
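
A minimal sketch of the proposed fix; the archiverDir variable stands in for whatever path the server is configured with.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Create the archiver directory (and any missing parents) instead of failing.
static void ensureArchiverDir(Path archiverDir) throws IOException {
  if (!Files.exists(archiverDir)) {
    Files.createDirectories(archiverDir);
  }
}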

Swagger operation IDs modified after protobuf version update

Due to grpc-ecosystem/grpc-gateway#1193, we have LuceneServer_ prepended to all operation IDs. The PR mentions that the maintainers will add a flag to restore the previous behavior, and we should update our protobuf dependency when that flag gets added.

Until then, we can run sed -i 's/LuceneServer_//g' luceneserver.swagger.json to get back the previous operation IDs. We can either run it manually or add it to the generation step (risky if there is some other change in swagger generation and we replace something other than an operation ID).

Restore broken if running primary/replica on same host

When we run both primary and replica on the same node, back them up, and try to restore them on the same node, the restored indexDir (index data) paths end up pointing to the same directory on disk.

Steps to reproduce

  1. Backup primary (and replica state)
  2. Down primary
  3. Down secondary
  4. Bring up new primary and restore it using resourceName in 1
  5. Bring up new secondary on same host as 4 and restore it using resourceName in 2

Root cause
We back up index data from the primary via the backupIndex command.
The stateDir itself is also backed up and has to be restored before the actual data.
Upon restoring the actual data we reuse the indexDirectory specified in the stateDir.
Then we download the data to a new location.
Finally, we simply create a symlink from the indexDir specified in the stateDir to the downloaded location of the index data.

The indexDir path looks like below:
serverPrimary@ -> /var/folders/xl/8yyfg7s93k95g3vtr04b8z9jpd3vss/T/junit15797558079443604738/archiver/testresource_data/current/serverPrimary

Issue is when we run a replica/secondary on the same host we create another symlink to the same location of data:
serverSecondary@ -> /var/folders/xl/8yyfg7s93k95g3vtr04b8z9jpd3vss/T/junit15797558079443604738/archiver/testresource_data/current/serverPrimary

Thus both primary and secondary/replica now use the same underlying files. This is not typically an issue, since we generally spin up different containers for restoring and running primary and replica.

Broadcasting settings and field changes

We currently have to apply new settings and field changes to every primary and replica node individually.

Could we broadcast these changes without explicitly setting them on every node?

Fix flaky build

The builds seem to be getting stalled/blocked due to some tests blocking on a future searcherVersion that never becomes available to the replica:

 Feb 04, 2020 9:42:57 PM org.apache.platypus.server.luceneserver.SearchHandler getSearcherAndTaxonomy
    INFO: SearchHandler: now await version=5 vs currentVersion=0
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.tr

Consider merging search and storeDocValues in Field

Field is the proto object used in FieldDefRequest. Search requires doc values, and I don't think there is any other use of doc values. We can merge search and storeDocValues in Field and keep only the search option; when creating a field definition, we would just replace storeDocValues usages with search. Having a single option in Field would be simpler for users, especially novices who might not know that doc values are needed to be able to search.

Support multiple contexts in one query

We should be able to filter by more than one context. For example, a context could be a list or a map<string, List>, and a client could add two geo contexts:

geocontext1: { precision: 5, boost: 3 }
geocontext2: { precision: 4, boost: 2 }

Another context, using the Yelp use case as an example, would be the category of a biz:

catcontext1: { names: [restaurant, cafes], boost: 3 }

More on this in a previous review request: https://github.com/Yelp/platypus/pull/19/files/24ed6bb3dca8e4501811d96a1962b79bf74c3b8a#diff-4e19185040990a7739472e27f3ef1846

UnknownHostException for us-east-1

When using a us-east-1 bucket, the process fails to bootstrap with the following exception:

Exception in thread "main" com.amazonaws.SdkClientException: Unable to execute HTTP request: yelp-service-data-us-east-1.s3.US.amazonaws.com
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5052)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4992)
    at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:895)
    at com.yelp.nrtsearch.server.utils.ArchiverImpl.getResources(ArchiverImpl.java:128)
    at com.yelp.nrtsearch.server.luceneserver.RestoreStateHandler.restore(RestoreStateHandler.java:48)
    at com.yelp.nrtsearch.server.grpc.LuceneServer.start(LuceneServer.java:179)
    at com.yelp.nrtsearch.server.grpc.LuceneServer.main(LuceneServer.java:212)
Caused by: java.net.UnknownHostException: yelp-service-data-us-east-1.s3.US.amazonaws.com
    at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:800)
    at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1495)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1354)
    at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1288)
    at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27)
    at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy11.connect(Unknown Source)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)

add endpoint to startAllIndexes

This is needed to get back all the indexes that were up before the server went down. Especially in case of a crash, a k8s operator could issue this call once the server is back up to restore its state and data from the last commit point in S3.

Different behaviors for start index in primary and replica

The following are the results when nrtsearch is started with restored state and start index is called:

  1. Primary: start index fails with an "index not saved or committed" message in the exception (correction: no segments file found); a subsequent start index with restore also fails, since the directories were already created.
  2. Replica: start index works and the index is started with 0 segments. It also didn't seem like the replica was retrieving segments from the primary after this.

Add match and match phrase queries

Match and MatchPhrase are Elasticsearch query types that analyze the query text to create a boolean query consisting of term or phrase queries. We can create equivalent queries using MultiFieldQueryParser in Lucene; a sketch follows.
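
A hedged sketch using MultiFieldQueryParser; quoting the text is one way to get the phrase form, assuming the text itself contains no quotes.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;

// "match": analyze the text and OR the resulting term queries across fields.
static Query match(String[] fields, Analyzer analyzer, String text) throws ParseException {
  return new MultiFieldQueryParser(fields, analyzer).parse(text);
}

// "match_phrase": quoting makes the parser build phrase queries instead.
static Query matchPhrase(String[] fields, Analyzer analyzer, String text) throws ParseException {
  return new MultiFieldQueryParser(fields, analyzer).parse("\"" + text + "\"");
}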

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.