
vespa-engine / vespa

5.3K stars · 158 watchers · 559 forks · 402.23 MB

AI + Data, online. https://vespa.ai

Home Page: https://vespa.ai

License: Apache License 2.0

CMake 0.87% Java 53.74% Shell 0.39% Perl 0.03% C++ 43.45% C 0.08% ANTLR 0.01% HTML 0.09% Objective-C 0.02% Emacs Lisp 0.01% Python 0.08% Makefile 0.01% GAP 0.01% Ruby 0.02% Roff 0.02% LLVM 0.01% Yacc 0.02% JavaScript 0.11% Go 1.04% Lex 0.02%
vespa search-engine big-data ai serving serving-recommendation machine-learning server tensorflow java

vespa's Introduction

Vespa

Search, make inferences in, and organize vectors, tensors, text and structured data, at serving time and any scale.

This repository contains all the code required to build and run all of Vespa yourself, and is where you can see all development as it happens. All the content in this repository is licensed under the Apache 2.0 license.

A new release of Vespa is made from this repository's master branch every morning CET, Monday through Thursday.

Background

Use cases such as search, recommendation and personalization need to select a subset of data in a large corpus, evaluate machine-learned models over the selected data, organize and aggregate it and return it, typically in less than 100 milliseconds, all while the data corpus is continuously changing.

This is hard to do, especially with large data sets that need to be distributed over multiple nodes and evaluated in parallel. Vespa is a platform which performs these operations for you with high availability and performance. It has been in development for many years and is used on a number of large internet services and apps which serve hundreds of thousands of queries from Vespa per second.

Install

Deploy your Vespa applications to the cloud service: https://cloud.vespa.ai, or run your own Vespa instance: https://docs.vespa.ai/en/getting-started.html

Usage

  • The applications created in the getting started guides linked above are fully functional and production-ready, but you may want to add more nodes for redundancy.
  • See developing applications for how to add your own Java components to your Vespa application.
  • See Vespa APIs to understand how to interface with Vespa.
  • Explore the sample applications
  • Follow the Vespa Blog for feature updates / use cases

Full documentation is at https://docs.vespa.ai.

Contribute

We welcome contributions! See CONTRIBUTING.md to learn how to contribute.

If you want to contribute to the documentation, see https://github.com/vespa-engine/documentation

Building

You do not need to build Vespa to use it, but if you want to contribute you need to be able to build the code. This section explains how to build and test Vespa. To understand where to make changes, see Code-map.md. Some suggested improvements with pointers to code are in TODO.md.

Development environment

C++ and Java building is supported on AlmaLinux 8. The Java source can also be built on any platform having Java 17 and Maven installed. Use the following guide to set up a complete development environment using Docker for building Vespa, running unit tests and running system tests: Vespa development on AlmaLinux 8.

Build Java modules

export MAVEN_OPTS="-Xms128m -Xmx1024m"
./bootstrap.sh java
mvn install --threads 1C

Use this if you only need to build the Java modules, otherwise follow the complete development guide above.

License

Code licensed under the Apache 2.0 license. See LICENSE for terms.

vespa's People

Contributors

andreer, aressem, arnej27959, baldersheim, bjormel, bjorncs, bratseth, dybis, ethnas, freva, geirst, gjoranv, hakonhall, havardpe, henrhoi, hmusum, jonmv, kkraune, ldalves, lesters, mpolden, olaaun, oyving, renovate[bot], smorgrav, thigm85, tokle, toregge, vekterli, yngveaasheim

vespa's Issues

Documents ignored due to content selection filter are accepted with isSuccess = true

If you have a selection filter in services.xml and your input document does not evaluate to true, the document operation is still considered a success.

<document mode='index' type='foo' selection="foo.timestamp < now()"/>

The callback when feeding one such document using the Java HTTP Vespa client will give you isSuccess() equals true. This is probably also the case for the other synchronous REST document APIs.

@Override
public void onCompletion(String docId, Result documentResult) {
    documentResult.isSuccess(); // true even for ignored documents
}

Slow document GC starves moving buckets to ideal nodes

Maintenance operations are generated with a fixed internal priority depending on their assumed importance. An operation's priority affects how many of its type may be pending at any given point in time. Higher priority operations, if present, effectively pre-empt lower priority operations.

Currently, bucket GC has priority LOW, while a move-only merge has priority VERY_LOW. This means that if a cluster is falling behind on its GC duties (usually due to a GC period configured too short or an expensive GC expression), move-only merges will end up being entirely preempted. This can cause stalls in moving documents away from retired nodes and/or onto a newly introduced node.
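To make the preemption mechanism concrete, here is a minimal sketch (in Java, with made-up names and numbers; the actual distributor is C++ and differs in detail): operations are started strictly in priority order under a bounded pending window, so a steady stream of LOW-priority GC ops can fully preempt VERY_LOW move-only merges.

import java.util.PriorityQueue;

class MaintenanceSchedulerSketch {
    enum Priority { HIGH, MEDIUM, LOW, VERY_LOW } // ordinal order = importance

    record Op(Priority priority, Runnable work) {}

    private final PriorityQueue<Op> generated =
            new PriorityQueue<>((a, b) -> a.priority().compareTo(b.priority()));
    private int pending = 0;
    private static final int MAX_PENDING = 20; // illustrative bound

    void generate(Op op) { generated.add(op); }

    void schedule() {
        // GC refilling the queue at LOW keeps the window full, so VERY_LOW
        // ops at the tail of the queue never get started.
        while (pending < MAX_PENDING && !generated.isEmpty()) {
            pending++;
            generated.poll().work().run();
        }
    }

    void operationCompleted() { pending--; }
}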

I'm tempted to either flip the two priorities, or bring GC down to VERY_LOW.

Vespa Installation Issue on Ubuntu 16.04

Hi,

How do I install Vespa on Ubuntu?

I went through the README and executed "docker/build-vespa.sh", but it downloads CentOS files, so I am not able to use Vespa. I searched Google but could not find a source build of Vespa for Ubuntu, nor any other Vespa packages for Ubuntu.

Kindly help me resolve this issue, or provide a URL where I can find an installable version of Vespa for Ubuntu.

Thanks.

Modeling Social Network Relationships

Hi guys! We've been exploring Vespa in an effort to replace our existing Elasticsearch datastore. Our primary use-case is indexing social profiles at a high throughput, while maintaining a very low query response time. Our indices contain hundreds of millions of profiles, per social network.

Given these basic profile examples:

[
  {
    "handle": "BigBob",
    "name": "Bob Stevens",
    "follows": [
      {
        "handle": "AliceInWonderland",
        "position": 1
      },
      {
        "handle": "JWhite",
        "position": 1
      }
    ]
  },
  {
    "handle": "AliceInWonderland",
    "name": "Alice Richards",
    "follows": [
      {
        "handle": "BigBob",
        "position": 1
      },
      {
        "handle": "JWhite",
        "position": 2
      }
    ]
  },
  {
    "handle": "Josh White",
    "name": "JWhite",
    "follows": [
      {
        "handle": "BigBob",
        "position": 2
      }
    ]
  }
]

In order to get BigBob's followers, sorted by when they began to follow Bob (follows.position), we'd like to write a query along the lines of:

SELECT * FROM profile
WHERE follows.handle = "BigBob"
ORDER BY follows.position WHERE follows.handle="BigBob"

There are 2 problems here. First, dot notation isn't supported, and while we tried using the search fieldset work-around, it simply didn't work (is it supposed to work for an array?). Second, is it even possible to sort by a key nested in an array?

We'd love to hear any thoughts or ideas around this use case...

Thanks!

Running on Kubernetes

This is more of a question but I think it would also be a nice addition to the documentation.

I am interested in running Vespa on a Kubernetes cluster. In the documentation, there is a list of Vespa processes and which nodes they should run on. There are 4 types of nodes in the list.

However, in the Vespa start / stop section, there is a command to start all services and a command to start the config server, giving me the impression that there are only 2 types of nodes.

In addition, there is the vespa/vespa image for running vespa in docker.

  1. Is the vespa/vespa docker image suitable for production?
  2. If I am running vespa on kubernetes, can I use the vespa/vespa image to deploy multiple instances?
  3. Is it not recommended to run processes listed under different hosts on the same host? For example, running the container service on a content cluster host?

mvn generate-resources fails to generate resources

The task is to replicate a sample app in Eclipse. The config file /src/resources/configdefinitions/test-processor.cfg reads:

package=org.topicquests.vespa.test

message string default=""

There are two Java definitions in the org.topicquests.vespa.test directory. mvn generate-resources compiles both of those Java classes into /target/classes, and creates /target/test-classes/ which is empty, but does not generate the Config class I need to complete the Processor. The pom.xml in play is here.

pom.xml from basic-search-java fails in Eclipse Neon

It's a lifecycle issue, something about m2e connectors, but Eclipse is not able to resolve it. Eclipse did download the dependencies, but there is still an error in the pom. Eclipse says something about "Resolve Later". The error message is this:
Plugin execution not covered by lifecycle configuration: com.yahoo.vespa:bundle-plugin:6.145.90:generate-bundle-classpath-mappings (execution: default-generate-bundle-classpath-mappings, phase: process-test-resources)

I did install one m2e connector. Nothing changed.
What am I missing?

Search definition creation in runtime

Hi all,

First of all thank you for this amazing platform. I have 2 questions about Search definition.

Search definition creation in runtime

Is there any API which I can use to create a search definition (sd) at runtime?

This is a requirement, because the documents that I will index depend on the user input in my front-end application.

If it is not possible, what is the best way for me to implement or work around this feature?

How many search definitions can I have in a Vespa cluster?

Let's say that I am creating 100,000 search definitions. Is this possible? What is the impact in terms of performance?

Dynamic Partitioning support

Hey,

I searched the docs and didn't find anything. Think about the logs use case in Elasticsearch, where you create a new index dynamically for whatever you want to partition on. Looking at #4092, dynamic partitioning is not available.

While I can use a "special field" to filter, it would still add overhead when grouping, faceting and sorting.

Is my best option to just pre-create about 250 search definitions (what I think I'll need in total)?

Backend metrics snapshotting is not compatible with semantics expected by Prometheus

Our current backend metric aggregation implementations are built around explicit snapshotting every N minutes, where each such snapshot effectively resets the tracked metric value internally. In other words, a counter, when observed externally, is not monotonically increasing over time. It will only be monotonically increasing within a particular snapshot period.

Although this simplifies tracking of minimum and maximum values within a snapshot period, it does not match the semantics of Prometheus metrics (aside from Gauge-style metrics, which obviously cannot have monotonic properties).

Ideally, we should introduce a new metric implementation in our backends that has support for the following:

  • Counters (monotonic)
  • Gauges
  • Histograms (monotonic per bucket). Possibly also Summaries for certain latency metrics; could use HdrHistogram implementation.
  • Dimensions ("labels" in the Prometheus data model). We already support this in our current implementations.
  • Prometheus exposition, at least in text format

To support legacy metric aggregators that expect pre-derived values, we should also support some form of snapshotting behind the scenes. Note that snapshotting of monotonically increasing values should be vastly simpler than what is currently done in the backend.
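A minimal sketch of the proposed counter semantics (names are hypothetical): externally observed values stay monotonic, as Prometheus expects, while snapshot deltas for legacy aggregators are derived on read.

import java.util.concurrent.atomic.AtomicLong;

final class MonotonicCounter {
    private final AtomicLong value = new AtomicLong();
    private long lastSnapshot = 0;

    void add(long n) { value.addAndGet(n); }

    // Raw monotonically increasing value, suitable for Prometheus exposition.
    long current() { return value.get(); }

    // Delta since the previous snapshot, for legacy aggregators that expect
    // pre-derived per-period values. Much simpler than resetting the metric.
    synchronized long snapshotDelta() {
        long now = value.get();
        long delta = now - lastSnapshot;
        lastSnapshot = now;
        return delta;
    }
}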

Cyclic dependency edge case between bucket merge and activation logic causing neither to take place as expected

The following is an excerpt from a system that was unable to converge to zero merge pending buckets:

BucketId(0xb800004000000066): merge: [Synchronizing buckets with different checksums
node(idx=2,crc=0x35003985,docs=574/574,bytes=192635/192635,trusted=false,active=false,ready=true), 
node(idx=7,crc=0x35003985,docs=574/574,bytes=193073/193073,trusted=false,active=false,ready=true), 
node(idx=8,crc=0x35003985,docs=574/574,bytes=583509/583509,trusted=false,active=false,ready=false), 
node(idx=1,crc=0x7324ca74,docs=1/1,bytes=1413/1413,trusted=true,active=true,ready=true)]

Note that only the last replica is marked as trusted and active.

What's happening:

  • Merge is scheduled to bring first 3 replicas in sync with the last 1 (index 1). Replica on index 1 marked as source-only due to being in non-ideal location.
  • Merge completes, but since we currently don't dare auto-delete active replicas the replica on index 1 remains.
  • Activation logic does not activate any of the ideal replicas (or deactivate the non-ideal replica), as the replica on index 1 is marked trusted (despite having only 1 document). Presumably, some form of partition occurred where all replicas disappeared and the temporary replica (created from feed) was the only known one. Once the old replicas came back up, the system had no way of knowing which ones were authoritative and chose to cling to the replica it itself created.
  • Trusted status of ideal replicas does not change, as the checksums do not converge to the same value as that of the only trusted (non-ideal) replica. This is inherent to how source-only merging works, as they only function as document sources and not sinks. I.e. the merge does not cause the system to move closer to ideal state.
  • New merges are scheduled until heat death of universe, or until the distributor is restarted, whichever comes first. Restarting the distributor clears the transient trusted flags of all replicas, allowing expected activation to take place. Heat death of universe presumably also implicitly clears trusted flags and/or makes this less of an issue altogether.

Both ideal state checkers assume that the other will fix things up so that the system will converge. This assumption does not hold, so convergence is not reached.

I'm tempted to remove the limitation that we don't auto-delete active source-only replicas. This should happen very rarely, and it should be safe to assume that the merge operation has resulted in a complete, ready replica on one of the ideal nodes. Deleting the active replica will create a minor window of recall loss until the new ready replica is activated. In the case above, deleting the active replica will have a direct benefit, as it holds only a tiny fraction of the documents that the non-active replicas hold.

Starvation of bucket GC messages on content nodes causes starvation of bucket merge ops on distributors

If a content node is so loaded that lower priority bucket GC messages cannot be processed, this may today also transitively inhibit merge operations from being sent by the distributor. This is the case even if the bucket sets for GC and merge ops are entirely non-overlapping.

Background: a typical production system may have hundreds of thousands or even millions of buckets per distributor. Changes in the cluster may require a large subset of these to undergo some form of maintenance operation (e.g. merging for fixing out of sync issues). Distributors throttle the number of pending such operations based on a configurable value to avoid swamping the system.

The distributor today has an internal fixed set of priority classes to which maintenance operations are assigned when generated. This class is dependent on the "importance" of the maintenance operation, and the number of pending operations allowed increases with the priority.

The issue here is that GC ops are generated with priority class MEDIUM, which is the same class as that generated by merges for out-of-sync buckets. But the actual messages sent by the GC ops have a lower priority than both feed and merges. Reducing the GC operation priority class to LOW will allow merges to go through even though GCs are stalled, as they will have a bigger pending window.

This change will cause some minor reduction in GC parallelism, but I don't think this will matter in practice.

search-definition: ignore fields

Is there some way to tell vespa-deploy to ignore fields not mapped in the search-definition file?

e.g. (in an arbitrary dataset)

{
  "ignore_field": null,
  "usable_field_1": "sd_fl_1"
}

and in search-definition

arbitdataset.sd

search simplex {
    document simplex {
        field usable_field_1 type string {
            indexing: summary
        }
    ....
    }
}

throws:

Detail resultType=FATAL_ERROR exception='Could not get field "ignore_field" in the structure of type "simplex"'

Please point me to any document explaining this.

ref:
http://docs.vespa.ai/documentation/reference/search-definitions-reference.html#match

Add support for returning docs from more than one bucket per request when visiting via Document V1 API

Visiting via the container Document V1 REST API returns documents from only 1 bucket at a time, giving back a continuation token that lets the client continue visiting from the successor bucket in its next request. There is currently no way of specifying you want the request to cover more than a single bucket.

In systems with a lot of data, this returns a chunk of 500-1000 docs per request, which is a nicely manageable amount. However, in systems with few documents each request ends up returning only a handful of documents. For example, the corpus used in the Vespa tutorial with 10 documents returns 1 document per request, requiring 10 individual requests.

I suggest adding a query parameter for overriding the minimum number of documents returned, which allows for batching up multiple buckets and reducing the number of roundtrips. Default behavior should remain 1 bucket at a time, as this is the most common production scenario. We should also enforce an upper bound on the parameter to avoid OOM-ing the container during response buffering.
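For context, a minimal client-side visiting loop over this API looks roughly like the sketch below (Java; the continuation mechanics are as described above, while the JSON handling is deliberately simplified and real requests typically also carry selection/cluster parameters). With the proposed parameter, each iteration could cover several buckets instead of one.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class VisitAll {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://localhost:8080/document/v1/music/music/docid";
        String continuation = null;
        do {
            // Each request returns the documents of one bucket plus, possibly,
            // a continuation token pointing at the successor bucket.
            String url = base + (continuation == null ? "" : "?continuation=" + continuation);
            HttpResponse<String> response = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            String body = response.body();
            // ... process the "documents" array in body ...
            continuation = extractContinuation(body); // null once visiting is complete
        } while (continuation != null);
    }

    // Simplified extraction of the "continuation" field; use a real JSON
    // parser in practice.
    static String extractContinuation(String json) {
        int i = json.indexOf("\"continuation\"");
        if (i < 0) return null;
        int start = json.indexOf('"', json.indexOf(':', i) + 1) + 1;
        return json.substring(start, json.indexOf('"', start));
    }
}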

Multi-application or multi-tenant

Hi guys. I am searching the documentation for clues on how to run multiple applications on the same set of nodes. How should this be modelled: multiple applications in the same tenant, or multiple tenants?

Please point me in the right direction.

Thanks

--Øyvind

ApplicationSuspensionResourceTest triggers JVM OoM

This test was the root cause of the build instability on Travis-CI (#3568). Application.fromApplicationPackage eats up all memory on the host and gets killed by the kernel during the second execution of the @Before initializer. The unit test is currently disabled to keep the build from failing.

JDisc injects null config objects

When a config parameter with no default value is not given a value in services.xml, the config system throws an exception with info on how to fix the issue. JDisc should propagate this exception, but is instead injecting a null config object. Of course, this causes a NullPointerException when component code uses the injected config object.
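A minimal sketch of the symptom; ExampleConfig here stands in for a generated config class and is not a real Vespa class:

interface ExampleConfig { String message(); } // stand-in for a generated config class

public class MyComponent {
    private final String message;

    public MyComponent(ExampleConfig config) {
        // JDisc currently injects config == null instead of propagating the
        // config system's exception, so the NPE surfaces here in user code.
        this.message = config.message();
    }
}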

Fail to build using docker

Hi,

It's just for personal studying so I did not file a JIRA ticket and opened an issue here instead.

I was trying to build Vespa on my own box using docker.
The build environment was ubuntu 16.04 + docker 1.13.0 and I'm on the current master (e0b5fe3)

$ uname -r
4.4.0-59-generic
$ docker --version
Docker version 1.13.0, build 49bf474

I used docker/build-vespa.sh

$ cd docker; ./build-vespa.sh 6

and it failed on the install part, showing something like this:

......
[ 99%] Built target vespamalloc_testgraph_app                                                                                                                     
[ 99%] Built target vespamalloc_racemanythreads_test_app
[ 99%] Built target vespamalloc_thread_test_app
[100%] Built target vsm_charbuffer_test_app
[100%] Built target vsm_docsum_test_app
[100%] Built target vsm_document_test_app
[100%] Built target vsm_searcher_test_app
[100%] Built target vsm_textutil_test_app
Install the project...
-- Install configuration: ""
-- Installing: /root/rpmbuild/BUILDROOT/vespa-6-1.el7.centos.x86_64/opt/yahoo/vespa/lib/jars/config-model-fat.jar
-- Installing: /root/rpmbuild/BUILDROOT/vespa-6-1.el7.centos.x86_64/opt/yahoo/vespa/lib/jars/document.jar
-- Installing: /root/rpmbuild/BUILDROOT/vespa-6-1.el7.centos.x86_64/opt/yahoo/vespa/lib/jars/jdisc_jetty.jar
-- Up-to-date: /root/rpmbuild/BUILDROOT/vespa-6-1.el7.centos.x86_64/opt/yahoo/vespa/lib/jars

......

-- Installing: /root/rpmbuild/BUILDROOT/vespa-6-1.el7.centos.x86_64/opt/yahoo/vespa/var/db/vespa/config_server/serverdb/classes/messagebus.def
-- Installing: /root/rpmbuild/BUILDROOT/vespa-6-1.el7.centos.x86_64/opt/yahoo/vespa/var/db/vespa/config_server/serverdb/classes/metricsmanager.def
CMake Error at cmake_install.cmake:239 (file):
  file INSTALL cannot find
  "/root/rpmbuild/BUILD/vespa-6/node-repository/src/main/resources/configdefinitions/node-repository.def".


make: *** [install] Error
error: Bad exit status from /var/tmp/rpm-tmp.1Lgovc (%install)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.1Lgovc (%install)

I suspected that it's related to 0501d0d, so I tried checking out 30cd111, and that builds successfully.

Introduce batch mode for SetBucketState to avoid O(|changed buckets|) generated and enqueued operations

Bucket (de-)activation is today done on a per-bucket granularity, with a message sent and received per affected bucket. If we instead allow SetBucketState commands to include sets of buckets to activate or deactivate, we can avoid significant overhead when lots of buckets change state at the same time. In particular, node up/down edges introduce many bucket state changes due to shuffling of bucket ideal states.

A caveat: if we're moving activation from one replica to another, we always activate the new replica before deactivating the old to avoid coverage loss (at the expense of some result duplication, which will be de-duped in the container). This transient effect will be amplified if we move to batch activation. We can limit its impact by setting an upper bound on the number of buckets allowed to be included in any activation message.
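A sketch of the batching idea, with a hypothetical message type and an illustrative bound on batch size (the bound is what limits the transient double-activation window described above):

import java.util.ArrayList;
import java.util.List;

// Hypothetical batched command: one message carries many bucket IDs.
record SetBucketStateBatch(List<Long> bucketIds, boolean active) {}

class BucketStateBatcher {
    private static final int MAX_BUCKETS_PER_MESSAGE = 128; // illustrative cap

    // Chunk a large state change into bounded batches, one message each,
    // instead of one message per bucket.
    List<SetBucketStateBatch> batch(List<Long> bucketIds, boolean active) {
        List<SetBucketStateBatch> batches = new ArrayList<>();
        for (int i = 0; i < bucketIds.size(); i += MAX_BUCKETS_PER_MESSAGE) {
            int end = Math.min(i + MAX_BUCKETS_PER_MESSAGE, bucketIds.size());
            batches.add(new SetBucketStateBatch(new ArrayList<>(bucketIds.subList(i, end)), active));
        }
        return batches;
    }
}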

Proton: Custom bucketing & Query

References:

id scheme

Format: id:<namespace>:<document-type>:<key/value-pairs>:<user-specified>

http://docs.vespa.ai/documentation/content/buckets.html
http://docs.vespa.ai/documentation/content/idealstate.html

It is possible to structure data with user-defined bucketing logic by using the 32 LSBs of the document-id format (n / g selections).

However, the query logic isn't very clear on how to route queries to a specific bucket range based on a decision taken in advance.

E.g., it is possible to split data into a time range (start-time/end-time) if I define n (a number) encoding the range; all documents tagged this way will end up in the same bucket (which will then follow its configured course of splitting on document count / size).

However, how do I write a search query on data indexed in this manner? Is it possible to tell the processor to choose a specific bucket, or a range of buckets (in case the distribution algorithm has moved buckets)?

SO question here:
https://stackoverflow.com/questions/46681642/vespa-proton-custom-bucketing-query

I guess I am still seeking the right community for posting general questions, so any pointers to that are welcome as well.

Windows Environment

Hi,

Is it possible to run Vespa on Windows? If yes, how? Are there any tutorials?

Unstable Travis build

The Travis build (both master and pull requests) seems to have become unstable over the past few days.

A few examples, all failing in the orchestrator module:

PR builds have been very stable so far, but now they fail approximately 1/3 of the time, in my experience.

Additionally, having the "build failing" badge appear on our GitHub front page is unfortunate now that the project is attracting a lot of attention.

FYI @bjorncs @hakonhall

Quickstart doesn't work on macOS

Quickstart (http://docs.vespa.ai/documentation/vespa-quick-start.html), from Step 4:

osx:vespa wiradikusuma$ docker exec vespa bash -c '/opt/vespa/bin/vespa-deploy prepare /vespa-sample-apps/basic-search/src/main/application/ && \ 
>     /opt/vespa/bin/vespa-deploy activate'

	Uploading application '/vespa-sample-apps/basic-search/src/main/application/' using http://localhost:19071/application/v2/tenant/default/session?name=application
	Session 2 for tenant 'default' created.
	Preparing session 2 using http://localhost:19071/application/v2/tenant/default/session/2/prepared
	Session 2 for tenant 'default' prepared.
	bash:  : command not found
	Activating session 2 using http://localhost:19071/application/v2/tenant/default/session/2/active
	Session 2 for tenant 'default' activated.
	Checksum:   9af875d625b18c84d3dd6daa748c9484
	Timestamp:  1506523079488
	Generation: 2

osx:vespa wiradikusuma$ curl -s --head http://localhost:8080/ApplicationStatus

	HTTP/1.1 200 OK
	Date: Wed, 27 Sep 2017 14:38:24 GMT
	Content-Type: application/json
	Transfer-Encoding: chunked

osx:vespa wiradikusuma$ curl -s -X POST --data-binary @${VESPA_SAMPLE_APPS}/basic-search/music-data-1.json \
>     http://localhost:8080/document/v1/music/music/docid/1 | python -m json.tool

	{
	    "id": "id:music:music::1",
	    "pathId": "/document/v1/music/music/docid/1"
	}

osx:vespa wiradikusuma$ curl -s -X POST --data-binary @${VESPA_SAMPLE_APPS}/basic-search/music-data-2.json \
>     http://localhost:8080/document/v1/music/music/docid/2 | python -m json.tool

	{
	    "id": "id:music:music::2",
	    "pathId": "/document/v1/music/music/docid/2"
	}

osx:vespa wiradikusuma$ curl -s http://localhost:8080/search/?query=bad | python -m json.tool

	No JSON object could be decoded

osx:vespa wiradikusuma$ curl -s http://localhost:8080/document/v1/music/music/docid/2 | python -m json.tool

	No JSON object could be decoded

Introduce priority queueing for incoming client operations on distributor

We have priority queues in several places in the content layer today:

  • Communication managers
  • Visitor scheduling
  • Persistence threads

But when a set of messages is handed off to the distributor main thread for processing, it all happens entirely in FIFO order. This means that if you have a combination of small, high-priority operations and large, expensive, lower-priority bulk operations, the former may be starved by the latter.

By introducing priority queueing for operations arriving from external clients, we let high priority operations sneak past, potentially greatly lowering their latency (of course, with the usual starvation caveats that are inherent to priorities). Note that we should only prioritize client requests; internal API protocol requests and responses must not be reordered, lest we break core assumptions of how operations are ordered relative to each other.
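As an illustrative sketch (not the distributor's actual data structures), client operations could drain from a priority-ordered queue while internal protocol messages keep their FIFO queue:

import java.util.ArrayDeque;
import java.util.PriorityQueue;
import java.util.Queue;

class DistributorInboxSketch {
    record ClientOp(int priority, Runnable work) {} // lower value = more important

    // External client operations may be reordered by priority...
    private final PriorityQueue<ClientOp> clientOps =
            new PriorityQueue<>((a, b) -> Integer.compare(a.priority(), b.priority()));
    // ...but internal protocol requests/responses must stay FIFO, as core
    // assumptions depend on their relative ordering.
    private final Queue<Runnable> internalMessages = new ArrayDeque<>();

    void enqueueClientOp(ClientOp op) { clientOps.add(op); }
    void enqueueInternal(Runnable message) { internalMessages.add(message); }

    void tick() {
        Runnable internal = internalMessages.poll();
        if (internal != null) internal.run();
        ClientOp op = clientOps.poll(); // high-priority ops sneak past bulk ops
        if (op != null) op.work().run();
    }
}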

Document Retrieval: how to integrate with Vespa from external apps

Hello Vespa!

I am looking for an overview of what is required and how to connect with Vespa for retrieving indexed data at scale.

I've run stress tests on the Vespa document RESTful API, and as suggested in the documentation, it has an upper bound.

http://docs.vespa.ai/documentation/document-api-guide.html indicates the way forward but assumes a head start on the subject matter.

I can find my way to MessageBusDocumentAccess and related classes, and MessageBusDocumentApiTestCase is also a good pointer, but it's quite large to put together quickly.

The trouble is I can't find any guide, if one exists, that clearly explains how to invoke Vespa from an external system, or, if that's not possible, that clarifies it's only a fat client / has to be run as an embedded client, and explains how it talks to the Vespa cluster.

Please point me to such an overview if it exists.

Is DocumentRetriever.java the way forward? What other choices does one have?

thanks!

Expanded spatial fields/search

Currently, only points are supported, with fairly basic search based on those points. To be positioned competitively with Elasticsearch and other competitors, a broader spectrum of geospatial data types and querying capabilities would be needed.

Additional spatial data types like

  • polygons
  • multi-polygons
  • lines
  • Basically anything in the GeoJSON spec

Querying should support

  • intersects
  • within
  • radius
  • buffer
  • etc.

Metrics API Improvement

I would like to get the average metric value over the past 5 minutes, but Vespa does not support this.
Could you add SUM values to the metrics API?

I am planning to calculate the metric value over the past 5 minutes from the SUM value by using Prometheus.

Thanks,

Adding <document-api> to basic-search-java causes error when running mvn package

<jdisc version="1.0">
  <document-api/>
  <processing>
    <chain id="default">
      <processor id="com.mydomain.example.ExampleProcessor" bundle="basic-search-java">
        <config name="com.mydomain.example.example-processor">
          <message>Hello, services!</message>
        </config>
      </processor>
    </chain>
  </processing>
  <nodes>
    <node hostalias="node1"/>
  </nodes>
</jdisc>

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.27 sec <<< FAILURE! - in com.mydomain.example.ApplicationTest
requireThatResultContainsHelloWorld(com.mydomain.example.ApplicationTest) Time elapsed: 1.27 sec <<< ERROR!
java.lang.IllegalArgumentException: Could not create a component with id 'com.yahoo.document.restapi.resource.RestApi'. Tried to load class directly, since no bundle was found for spec: vespaclient-container-plugin. If a bundle with the same name is installed, there is a either a version mismatch or the installed bundle's version contains a qualifier string.
at com.mydomain.example.ApplicationTest.requireThatResultContainsHelloWorld(ApplicationTest.java:21)

Reduce common case resource overhead of update operation read-repairs by introducing metadata-only read phase

tl;dr: let us be able to trade off an extra roundtrip during updates to out-of-sync buckets for reduced network, I/O and CPU usage in the common case.

Today, when an update operation arrives on a distributor it may enter one of two code paths:

  1. If bucket replicas are in sync, the update is sent directly to the replicas for execution directly against the backend content nodes. This is known as the "fast path".
  2. Otherwise, there may be diverging versions of the document. If we send the update directly to the individual replicas we might introduce inconsistencies by applying partial updates to different versions. In this case we perform a read-repair where the document is fetched from all mutually diverging replicas and the update is performed on the distributor against the most recent version. A put operation with the result is sent to all replicas to force convergence to a shared version. This is known as the "safe/slow path".

The slow path, aside from being slower as its name implies, is highly susceptible to false positives. Since the distributor operates on a bucket-level granularity, it's enough for 1 out of 1000 docs in a bucket to be divergent for the entire bucket to be marked out of sync. Updates to the 999 other documents in the bucket will therefore trigger a slow path unnecessarily (but the distributor cannot know this a priori; for all it knows every single document in the bucket is divergent).

Today we unconditionally perform the update operation on the distributor when executing a slow path update. This because we've already expended the effort to read the document from disk, so we might as well use it instead of incurring further IO on the content nodes themselves. This works fine for small documents and/or updates, but breaks down when documents and/or updates are large. The single-threaded execution model of the distributor limits the number of operations it can perform per second, whereas the content nodes can run with arbitrarily large thread pools.

I suggest the following changes:

  1. Fast-path update handling remains unchanged
  2. Current two-phase update scheme is extended to three phases. Initial phase only fetches the document versions instead of the documents themselves. Iff all versions match, trigger fast path update. Otherwise, proceed with original two-phase slow path for replicas that have diverging document versions.

Document versions (timestamps) are kept in-memory in Proton and are therefore "free" to read. We still get an extra distributor<->content node roundtrip, but only in the case where buckets are out of sync.
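A sketch of the proposed three-phase flow (hypothetical names; timestamps stand in for document versions):

import java.util.List;

class ThreePhaseUpdateSketch {
    interface Replica {
        long fetchTimestamp(String documentId); // phase 1: metadata only, cheap
        void applyUpdate(Update update);
    }

    record Update(String documentId) {}

    void handleUpdate(Update update, List<Replica> replicas) {
        // Phase 1: read only the per-replica document timestamps.
        List<Long> timestamps = replicas.stream()
                .map(r -> r.fetchTimestamp(update.documentId()))
                .toList();
        if (timestamps.stream().distinct().count() <= 1) {
            // Phase 2a: versions agree -> fast path, no document reads needed.
            replicas.forEach(r -> r.applyUpdate(update));
        } else {
            // Phase 2b/3: versions diverge -> original two-phase read-repair,
            // now only paid for by genuinely divergent documents.
            readRepair(update, replicas);
        }
    }

    void readRepair(Update update, List<Replica> replicas) {
        // fetch full documents from mutually diverging replicas, apply the
        // update to the most recent version, and put the result to all replicas
    }
}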

yum install vespa error

Okay... my English is poor, so I hope you can understand what I describe below.
I did this:
yum -y install yum-utils epel-release
yum-config-manager --add-repo https://copr.fedorainfracloud.org/coprs/g/vespa/vespa/repo/epel-7/group_vespa-vespa-epel-7.repo
yum -y install vespa

and an error comes out like this:

Error: Package: vespa-6.149.44-1.el7.centos.x86_64 (group_vespa-vespa)
       Requires: libboost_system-gcc62-mt-1_59.so.1.59.0()(64bit)
Error: Package: vespa-6.149.44-1.el7.centos.x86_64 (group_vespa-vespa)
       Requires: libboost_program_options-gcc62-mt-1_59.so.1.59.0()(64bit)
Error: Package: vespa-6.149.44-1.el7.centos.x86_64 (group_vespa-vespa)
       Requires: libboost_filesystem-gcc62-mt-1_59.so.1.59.0()(64bit)
Error: Package: vespa-6.149.44-1.el7.centos.x86_64 (group_vespa-vespa)
       Requires: libboost_thread-gcc62-mt-1_59.so.1.59.0()(64bit)

I tried installing "Development Tools", and the error is the same.

And if I do this:

yum-config-manager --add-repo http://repo.enetres.net/enetres.repo
yum install boost-devel

Something like this comes out:

--> Finished Dependency Resolution
Error: Package: libboost_log1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicuuc.so.42()(64bit)
Error: Package: libboost_locale1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicudata.so.42()(64bit)
Error: Package: libboost_regex1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicudata.so.42()(64bit)
Error: Package: libboost_locale1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicui18n.so.42()(64bit)
Error: Package: libboost_log1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicudata.so.42()(64bit)
Error: Package: libboost_graph1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicuuc.so.42()(64bit)
Error: Package: libboost_regex1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicuuc.so.42()(64bit)
Error: Package: libboost_graph1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicui18n.so.42()(64bit)
Error: Package: libboost_locale1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicuuc.so.42()(64bit)
Error: Package: libboost_regex1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicui18n.so.42()(64bit)
Error: Package: libboost_log1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicui18n.so.42()(64bit)
Error: Package: libboost_graph1_59_0-1.59.0-1.x86_64 (enetres)
       Requires: libicudata.so.42()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest

So, what should I do? I have no idea...

Hoping for help...

Oh, and the OS is CentOS 7.0.

Quick-start tutorial runs with internal timeouts

Gist with log files, ps output, netstat output, Kubernetes manifest.

I don't know if this is an actual error or not, but I get lots of log entries about not being able to connect to the config server at port 19070, lines like this:

1507078186.060	vespa-0.vespa.default.svc.cluster.local	544/1	configproxy	configproxy.com.yahoo.vespa.config.proxy.RpcConfigSourceClient	info	Could not connect to config source at tcp/localhost:19070
1507078186.061	vespa-0.vespa.default.svc.cluster.local	544/1	configproxy	configproxy.com.yahoo.vespa.config.proxy.RpcConfigSourceClient	info	Could not connect to any config source in set [tcp/localhost:19070], please make sure config server(s) are running.

I'm starting Vespa under Kubernetes, using the exact commands here. One difference is that Kubernetes is setting the host name, but I don't see why that matters. Using curl, I can confirm that connections to port 19070 hang and eventually time out.

From what I can tell, the config server has been started. The config sentinel (PID 833 in the ps output), however, keeps restarting.

Using netstat, I see that the config server process is indeed listening on port 19070.

Port 19070 is operational.

build java error

The Java build fails with the errors below:

[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/root). Please verify you invoked Maven from the correct directory. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException
Building Vespa Maven plugins.
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR] The project (/root/maven-plugins/pom.xml) has 1 error
[ERROR]   Non-readable POM /root/maven-plugins/pom.xml: /root/maven-plugins/pom.xml (No such file or directory)
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException

The build was started just by doing the following:

cd vespa && ./bootstrap.sh java

Only send fields specified in YQL up from content nodes

As seen in VESPA-9043 (and numerous other times), it can be unnecessarily expensive (primarily network-wise) when the content nodes return the full default summary class (all fields) even though only a subset of the fields was requested in the YQL query.

While I understand the technical reason for this, it's not logical to the user that they need to create a separate summary class in addition to specifying the fields in the YQL select statement. Is it possible for us to send the list of fields down along with the query and get an "automatic summary class" returned? A sketch of the current workaround follows below.
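For reference, the current workaround is a dedicated document-summary class in the search definition, selected per query (field and class names here are illustrative):

search profile {
    document profile {
        field title type string {
            indexing: summary | index
        }
        field body type string {
            indexing: summary | index
        }
    }
    document-summary titles {
        summary title type string {}
    }
}

Passing presentation.summary=titles (summary=titles on older versions) with the query then makes the content nodes fill only the title field, which is what an "automatic summary class" derived from the YQL field list would do implicitly.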
