
pyvespa's People

Contributors

andreer, aressem, ausnews, baldersheim, bjormel, bjorncs, bratseth, dependabot[bot], filippo82, freva, frodelu, hmusum, jeanextreme002, jobergum, johans1, jonmv, kevinhu, kkraune, lesters, maxice8, msminhas93, rejasupotaro, renovate[bot], sephiartlist, thigm85, thomasht86, tkaessmann, tmaregge, tremamiguel, whiteh4tdude

pyvespa's Issues

During batch feed, OSError: [Errno 24] Too many open files

Hi 👋
When I try to ingest data into Vespa Cloud, I get this error - OSError: [Errno 24] Too many open files.

When I select only the first few documents in my dataset, the feed works. If I use the whole dataset, I get that error. I don't see a way to reset the connections or close the files, so pyvespa won't let me upload any more data unless I quit the Python session and start over. Synchronous batch feed works, but it is too slow for my use case.
Code:

# works
app.feed_batch(schema="myschema", batch=batch_data[:1000], batch_size=1000, total_timeout=200, asynchronous=True)
# fails
app.feed_batch(schema="myschema", batch=batch_data, batch_size=1000, total_timeout=200, asynchronous=True)
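Until the connection leak is fixed, a possible workaround is to feed in bounded mini-batches so only a limited number of connections is open at any time. The sketch below reuses the `feed_batch` call from the snippet above; `chunked` and `feed_in_mini_batches` are hypothetical helpers, not pyvespa API.

```python
def chunked(items, size):
    """Split a list into consecutive mini-batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def feed_in_mini_batches(app, schema, batch_data, mini_batch_size=1000):
    """Feed one mini-batch at a time so only a bounded number of
    connections/file descriptors is held open at any moment."""
    results = []
    for mini_batch in chunked(batch_data, mini_batch_size):
        results.extend(
            app.feed_batch(
                schema=schema,
                batch=mini_batch,
                batch_size=len(mini_batch),
                total_timeout=200,
                asynchronous=True,
            )
        )
    return results
```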

Error when using onnx model on Windows

Hi,

I have a problem when trying to deploy an app with an onnx model to Docker on Windows 10.

When deploying the app on Docker, I get the following error:

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
----------------------------------------------------------

RuntimeError: ["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session", "Session 15 for tenant 'default' created.",
 'Preparing session 15 using http://localhost:19071/application/v2/tenant/default/session/15/prepared',
 'Request failed. HTTP status code: 400', 'Invalid application package: Error loading default.default: Could not parse schema file \'crossencoder.sd\':
 Unknown symbol: Lexical error at line -1, column 356.  Encountered: "\\\\" (92), after : ""', '']

The application package has worked on another OS, so the definition of it should not be the problem.

Return specialized Vespa connection depending on the ApplicationPackage inheritance

When deploying a specialized application package such as:

from vespa.gallery import TextSearch 

app_package = TextSearch(id_field="id", text_fields=["title", "body"])

the instance returned by the deployment method could be specific to this specialized application, such as VespaTextSearch, inheriting from the base class Vespa.

Different app use cases have different needs and pyvespa currently lacks a pattern to encode those needs. Just to give another example, this would be useful to natively support TextImageSearch and similar use cases.

Retrieving a running container with different VespaDocker instance will lead to deploy failure

It seems that previously defined volumes are not preserved. To reproduce, deploy the same application with two different instances of VespaDocker. The first deployment creates the container. The second deployment retrieves the already existing container, but deployment fails with

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

Allow pyvespa to set rank-features when defining rank-profile.

Example:

rank-profile collect_rank_features inherits default {
	first-phase {
	    expression: random
	}
	ignore-default-rank-features
	rank-features {
	    bm25(title)
	    bm25(body)
	    nativeRank(title)
	    nativeRank(body)
	}
}

This is important when collecting training data for example.
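As a minimal sketch of what pyvespa could generate, assuming a hypothetical `rank_features` argument, the function below just renders the schema text above with plain string templating:

```python
def render_rank_profile(name, expression, rank_features, inherits="default"):
    """Render a rank-profile block with explicit rank-features.

    A string-templating sketch of what a pyvespa RankProfile with a
    hypothetical `rank_features` argument could emit; not pyvespa API.
    """
    features = "\n".join("        {}".format(f) for f in rank_features)
    return (
        "rank-profile {} inherits {} {{\n".format(name, inherits)
        + "    first-phase {\n"
        + "        expression: {}\n".format(expression)
        + "    }\n"
        + "    ignore-default-rank-features\n"
        + "    rank-features {\n"
        + features + "\n"
        + "    }\n"
        + "}"
    )

print(render_rank_profile(
    "collect_rank_features",
    "random",
    ["bm25(title)", "bm25(body)", "nativeRank(title)", "nativeRank(body)"],
))
```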

Query with an embedding and field

I'm trying to run a query using nearest neighbour search while limiting it to records with a certain value, but I get errors when I try to combine the fields in a single yql query.

Using the example from Image Search, I can run:

response = app.query(body={
    "yql": 'select * from sources * where ([{"targetNumHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.features.query(embedding_text)": [0.632, -0.987, ..., 0.534],
    "ranking.profile": "embedding_similarity"
})

and return results.
My records also have numerical attributes "value" and "cost". I can filter on a specific value of "value" or "cost" individually, or on both together, e.g.

response = app.query(body={
    "yql": 'select * from sources * where (value=100 and cost=10);',
    "hits": 100
})

but when I try to combine the embedding search while filtering to a value, I get an error

response = app.query(body={
    "yql": 'select * from sources * where (value=100 and [{"targetNumHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.features.query(embedding_text)": [0.632, -0.987, ..., 0.534],
    "ranking.profile": "embedding_similarity"
})

The error is

mismatched input 'nearestNeighbor' expecting {<EOF>, 'select', ';'}
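One thing worth trying, as an assumption rather than a confirmed fix, is to parenthesize each operand of the `and` so the annotated `nearestNeighbor` is parsed as a single term. The query body below reuses the field names from the snippets above with a truncated placeholder vector:

```python
# Hypothetical query body; whether this parenthesization satisfies the YQL
# parser is an assumption, not a verified fix.
embedding = [0.632, -0.987, 0.534]  # truncated placeholder vector
body = {
    "yql": (
        "select * from sources * where "
        '(value = 100) and ([{"targetNumHits":100}]'
        "nearestNeighbor(embedding_image, embedding_text));"
    ),
    "hits": 100,
    "ranking.features.query(embedding_text)": embedding,
    "ranking.profile": "embedding_similarity",
}
```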

[DOCS] "enclidean" distance metric in Field example

Field class distance_metric parameter should be "euclidean" instead of "enclidean". Example from code:

>>> Field(name="tensor_field",
...     type="tensor<float>(x[128])",
...     indexing=["attribute"],
...     ann=HNSW(
...         distance_metric="enclidean",
...         max_links_per_node=16,
...         neighbors_to_explore_at_insert=200,
...     ),
... )

deploy timeout

For auto testing, it is useful to be able to set a deploy timeout

11:50:14 docs/sphinx/source/deploy-docker.ipynb
11:50:14 /usr/local/lib/python3.8/site-packages/runnb/runnb.py:28: DeprecationWarning: The notebook is NOT trusted.
11:50:14   warnings.warn('The notebook is NOT trusted.', DeprecationWarning)
11:50:47 Waiting for configuration server.
11:50:53 Waiting for configuration server.
11:50:58 Waiting for configuration server.
11:51:03 Waiting for configuration server.
11:51:08 Waiting for configuration server.
11:51:13 Waiting for configuration server.
11:51:18 Waiting for configuration server.

In https://docs.vespa.ai/en/vespa-quick-start.html we do:

vespa status deploy --wait 300

I suggest we add an optional wait parameter to the deploy-command

Verbose pyvespa

Debugging pyvespa pipeline problems is easier in a verbose mode where responses from selected operations are output to stdout (results from requests, etc.).

Maybe an environment variable or similar to enable it.

get_data and update_data are missing namespace

Hi
In the Document v1 API guide, the put operation (which is used under the hood by batch_update) separates the namespace from the schema,
e.g.
http://hostname:8080/document/v1/namespace/music/docid/1
In pyvespa, the code for get_data and update_data does not separate the namespace from the schema,
e.g.

end_point = "{}/document/v1/{}/{}/docid/{}?create={}".format(
            self.app.end_point, schema, schema, str(data_id), str(create).lower()
)

This causes a bug when updating documents with pyvespa: either the document is not updated, or a duplicate is created if the create parameter is set to True.
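A sketch of how the endpoint could be built with a distinct namespace, defaulting it to the schema name for backwards compatibility (`document_endpoint` is a hypothetical helper, not pyvespa API):

```python
def document_endpoint(app_endpoint, schema, data_id, create=False, namespace=None):
    """Build a /document/v1 URL with the namespace separated from the schema.

    The namespace defaults to the schema name, matching pyvespa's current
    behavior, but callers can override it.
    """
    namespace = namespace or schema
    return "{}/document/v1/{}/{}/docid/{}?create={}".format(
        app_endpoint, namespace, schema, data_id, str(create).lower()
    )

print(document_endpoint("http://hostname:8080", "music", 1, namespace="mynamespace"))
```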

Support for document-summaries and struct datatypes

Is there a way to create a Struct data type within a schema? I have an array that I’m trying to use as an imported field, but I can’t find any reference for the struct field.

Also is there any built-in support for creating document summaries?

Misleading situation when using the recall keyword

Hi,
I think there is a misleading situation when using the recall keyword, regarding the number of results returned from app.query.

When running

query_results = app.query(query=query, 
            query_model=query_model, 
            recall=recall_docs,
            )
query_results.get_hits()

The number of results is 10 (default length of hits)

I think the default hits should be the number of docs in recall_docs:

query_results = app.query(query=query, 
            query_model=query_model, 
            recall=recall_docs,
            hits=len(recall_docs)
            )
query_results.get_hits()

If the number of recall docs is less than 10, I would expect to get fewer than 10 results; likewise, when the number of docs in the recall list is more than 10, I would expect to get all the hits for the docs in the recall list.

Does this sound reasonable?

API

Since pyvespa is still under active development:
Please consider using python3 non-blocking coroutines, i.e. the async / await keywords.

Many modern python frameworks ( fastapi, sanic, quart, starlette, ...) are based on async.
Integrating blocking code into an async application is burdensome (you have to delegate to threads or processes).
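Until native coroutines exist, the blocking call can be delegated to a worker thread; the sketch below uses `asyncio.to_thread` (Python 3.9+) with a stand-in function in place of the real blocking pyvespa call:

```python
import asyncio
import time

def blocking_query(payload):
    # Stand-in for a blocking pyvespa call such as app.query(...).
    time.sleep(0.01)
    return {"echo": payload}

async def query_async(payload):
    # Delegate the blocking call to a worker thread so it does not
    # stall the event loop of an async framework (fastapi, sanic, ...).
    return await asyncio.to_thread(blocking_query, payload)

result = asyncio.run(query_async({"q": "test"}))
print(result)
```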

Implement Tokenize operator

Expected query example:

select * from sources * where ({"grammar": "tokenize", "targetHits": 100, "defaultIndex": "default"}userInput("this is a test"));

Expanding the evaluation example to include the recall argument

Hi,
In the Evaluation application documentation example, there is no mention of the option to use recall for specific documents.
Since this option can be very useful in an evaluation process, I suggest adding an example with the recall argument:

top_ids = [...]

query_evaluation = app.evaluate_query(
    eval_metrics = eval_metrics,
    query_model = query_model,
    query_id = query_data["query_id"],
    query = query_data["query"],
    id_field = "id",
    relevant_docs = query_data["relevant_docs"],
    default_score = 0,
    recall = ("id", top_ids[1:3]),
)

Reutilize pyvespa integration test vespa cloud instance.

We currently create a Vespa Cloud instance each time the pyvespa integration tests are run. We should reuse the integration test instance instead, because a new TLS certificate is created every time we create a new instance on Vespa Cloud.

Overwriting application files break Docker volume mount

We create new application files every time we redeploy an application package.

Creating new files instead of modifying them in place breaks the bind-mount (see this issue), and changes in application files on the host are not propagated to the container, leading to new changes not being deployed.

Restarting the container re-establishes the bind-mount and should solve the problem.

M1 app deployment issues.

Hi, I am having some issues deploying the app locally on an M1 Mac. I was wondering if this is a known issue?

Versions:

  • python: 3.9.7
  • pyvespa: 0.14
  • docker: 4.3.2
  • macOS 12.0.1

This is what I see in docker desktop:

[screenshot of Docker Desktop]

When I try to create an app deployment, the application hangs as follows.

from vespa.gallery import QuestionAnswering
from vespa.deployment import VespaDocker

app_package = QuestionAnswering()
vespa_docker = VespaDocker(port=8089)
app = vespa_docker.deploy(application_package=app_package)
Output

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
...

Thanks for all the work in creating a python wrapper for vespa!

Create VespaDocker instance from running container name

vespa_docker = VespaDocker(
    port=8080, 
    disk_folder="/User/username/sample_app", 
    container_memory="8G"
)
vespa_docker.deploy(application_package=app_package)

After deploying in a Docker container like the above, we might want to continue to use the same running container in a future python session. For that, we need a way to instantiate VespaDocker from the container name.

VespaDocker.from_container_name(app_package.name)

Make batch operations more resilient with retry and mini batches by default

Example from a use case:

from collections import Counter
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1), stop=stop_after_attempt(10))
def send_feed_batch(self, feed_batch, total_timeout=10000):
    feed_results = self.app.feed_batch(
        batch=feed_batch, total_timeout=total_timeout
    )
    return feed_results

def index(self, corpus: Dict[str, Dict[str, str]], batch_size=1000):
    batch_feed = [
        {
            "id": idx,
            "fields": {
                "id": idx,
                "title": corpus[idx].get("title", None),
                "body": corpus[idx].get("text", None),
            },
        }
        for idx in list(corpus.keys())
    ]
    mini_batches = [
        batch_feed[i : i + batch_size]
        for i in range(0, len(batch_feed), batch_size)
    ]
    for idx, feed_batch in enumerate(mini_batches):
        feed_results = self.send_feed_batch(feed_batch=feed_batch)
        status_code_summary = Counter([x.status_code for x in feed_results])
        print(
            "Successful documents fed: {}/{}.\nBatch progress: {}/{}.".format(
                status_code_summary[200], len(feed_batch), idx + 1, len(mini_batches)
            )
        )
    return 0

Remove data plane certificate and key management from pyvespa

pyvespa currently generates a data plane certificate and key and stores it in a file every time a deployment is made. This behavior conflicts with the workflow of using the vespa-cli to generate API key and dataplane certificate and key.

I suggest we remove this functionality from pyvespa and rely solely on vespa-cli to set up certificates and keys for Vespa Cloud interaction.

pyvespa fails when the app Docker container exists but is not running

pyvespa will fail if we deploy an app with VespaDocker.deploy and then try to redeploy after the container has been stopped externally. We need to add a check to see whether the app container exists and is running. If it exists but is not running, we need to start it again.

Expose pool_maxsize parameter for HTTPAdapter

On older versions of pyvespa (I was previously using 0.5.0) one had to manually create the HTTPAdapter. Now this is handled inside the vespa library (I like the design choice), but the maximum pool size for the HTTPAdapter's connections is no longer exposed. This hurts multithreaded use of Vespa; the parameter should be exposed through the constructor:

adapter = HTTPAdapter(max_retries=retry_strategy)
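A sketch of the requested configuration using the documented `requests` HTTPAdapter parameters; the pool sizes here are illustrative values, to be matched to the number of worker threads in use:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Mount an adapter whose connection pool is large enough for the
# number of concurrent feeding/querying threads.
retry_strategy = Retry(total=3, backoff_factor=1)
adapter = HTTPAdapter(
    max_retries=retry_strategy,
    pool_connections=32,  # number of host pools to cache
    pool_maxsize=32,      # connections kept per pool
)
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
```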

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

When running documentation example from how-to https://pyvespa.readthedocs.io/en/latest/howto/deploy_app_package/deploy-docker.html#Deploy-application-package-created-with-pyvespa the script fails with error:

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name="my_package")
from vespa.package import Field
app_package.schema.add_fields(
    Field(name = "cord_uid", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "abstract", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)

from vespa.package import FieldSet
app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)

from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title) + bm25(abstract)")
)

import os
from vespa.deployment import VespaDocker

disk_folder = "sample_application" # specify your desired absolute path here
vespa_docker = VespaDocker(
    port=8083,
    disk_folder=disk_folder
)

app = vespa_docker.deploy(
    application_package = app_package,
)

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_285/3496665072.py in <module>
      8 )
      9 
---> 10 app = vespa_docker.deploy(
     11     application_package = app_package,
     12 )

/.env/lib/python3.8/site-packages/vespa/deployment.py in deploy(self, application_package)
    261         self.export_application_package(application_package=application_package)
    262 
--> 263         return self._execute_deployment(
    264             application_name=application_package.name,
    265             disk_folder=self.disk_folder,

/.env/lib/python3.8/site-packages/vespa/deployment.py in _execute_deployment(self, application_name, disk_folder, container_memory, application_folder, application_package)
    232 
    233         if not any(re.match("Generation: [0-9]+", line) for line in deployment_message):
--> 234             raise RuntimeError(deployment_message)
    235 
    236         app = Vespa(

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

pyvespa deploy using POST

Ref question on the public Slack:

I am trying out pyvespa and I want to deploy my application to Docker in the cloud, where I only have endpoint access and not full Docker daemon access. I wasn't able to find a way in pyvespa to deploy without access to Docker itself. I found a way using a curl command:

curl --header Content-Type:application/zip --data-binary @application.zip  localhost:19071/application/v2/tenant/default/prepareandactivate

Is there any way to do endpoint deployment using pyvespa?
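The curl command maps to a plain HTTP POST, so this could be sketched in Python as below. The `zip_application` helper and the file mapping are hypothetical; only the `/prepareandactivate` path comes from the command above.

```python
import io
import zipfile

import requests

def zip_application(files):
    """Zip application files (archive path -> content) in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for path, content in files.items():
            zf.writestr(path, content)
    return buffer.getvalue()

def deploy_application(config_server, files):
    """POST the zipped package to the config server, mirroring the curl call."""
    return requests.post(
        "{}/application/v2/tenant/default/prepareandactivate".format(config_server),
        headers={"Content-Type": "application/zip"},
        data=zip_application(files),
    )
```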

Confusing naming in pyvespa: query.RankProfile and package.RankProfile

We have a class named RankProfile in the vespa.package module to create rank-profiles in the application package. However, we also have a class named RankProfile in the vespa.query module to define which rank-profile should be used in the query model.

This is unnecessarily confusing. I suggest we use vespa.query.Ranking instead of vespa.query.RankProfile to clarify the different use cases.

Accessing deployed docker container from another IP?

I've deployed my Vespa app using pyvespa VespaDocker, which I can connect to on localhost on the same machine, but trying to connect to it from another machine results in a timeout. Do we need to run the application on 0.0.0.0 (as with flask, for example), to enable connection from external machines? If so, looking at the source code "localhost" is hardcoded in several places, so I guess it's not currently possible?

Move container config arguments from deploy to VespaDocker

Current usage:

vespa_docker = VespaDocker(port=8080)
app = vespa_docker.deploy(
    application_package=app_package, 
    container_memory="8G", 
    disk_folder="/Users/username/app_folder" 
)

Suggested usage:

vespa_docker = VespaDocker(
    port=8080, 
    container_memory="8G", 
    disk_folder="/Users/username/app_folder" 
)
app = vespa_docker.deploy(
    application_package=app_package, 
)

Reason: Container config parameters belong to the initialization method as we should only specify them once. The deploy method will be called every time we need to redeploy our application package and it makes no sense to repeat container config args such as container_memory every time we redeploy to the same container.

Adding a field should overwrite existing field content

Currently, the code below adds two fields named title to the application package schema instead of updating the existing title field.

from vespa.package import ApplicationPackage, Field

app_package = ApplicationPackage(name="news")
#
# Add title field  
#
app_package.schema.add_fields(  
    Field(name="title", type="string", indexing=["index", "summary"])
)
#
# Update title field
#
app_package.schema.add_fields(  
    Field(name="title", type="string", indexing=["index", "summary"], index=["enable-bm25"])
)

pyvespa model integration

Are there any plans to make it simpler for users to utilize any type of ml model (cnn, rnn, gnn, etc.) they want with Vespa during inference?

pyvespa is a useful tool for integrating ml models with the Vespa engine, but it still feels limited. For example, in sequence-classification-task-with-vespa-cloud it can only load huggingface text models. Are there any plans to create a wrapper that lets users implement their own models in pyvespa, with customizability for pre/post-processing? I say this because having ml developers rewrite pre/post-processing in Java is not a fun experience. Could this also be possible when the model runs inference within the content cluster? I find that loading embeddings directly into Vespa is painless, but trying to load models into Vespa causes some pains.

Thanks.

Add linkcheck to our CI

Current documentation is built by readthedocs once a PR gets merged to master. However, we should add link checking and verify that the documentation builds in our CI, to catch documentation errors before they hit the master branch.

Instructions on deploying to Vespa Cloud Prod

It seems like the process for deploying to Prod is different from dev deployments. The Prod deployment page asks for two zip files. I downloaded the application zip file from Vespa Cloud, and I see that it's also possible to generate it with the to_files call. However, I'm not sure how to generate the test package. Please help me understand the process.
