
pyvespa's People

Contributors

andreer, aressem, ausnews, baldersheim, bjormel, bjorncs, bratseth, dependabot[bot], filippo82, freva, frodelu, hmusum, jeanextreme002, jobergum, johans1, jonmv, kevinhu, kkraune, lesters, maxice8, msminhas93, rejasupotaro, renovate[bot], sephiartlist, thigm85, thomasht86, tkaessmann, tmaregge, tremamiguel, whiteh4tdude

pyvespa's Issues

During batch feed, OSError: [Errno 24] Too many open files

Hi 👋
When I try to ingest data into Vespa Cloud, I get this error - OSError: [Errno 24] Too many open files.

When I select only the first few documents in my dataset, the feed works. If I use the whole dataset, I get that error. I don't see a way to reset the connections or close the files, so pyvespa won't let me upload any more data unless I quit the Python session and start over. Synchronous batch feed works, but it is too slow for my use case.
Code:

# works
app.feed_batch(schema="myschema", batch=batch_data[:1000], batch_size=1000, total_timeout=200, asynchronous=True)
# fails
app.feed_batch(schema="myschema", batch=batch_data, batch_size=1000, total_timeout=200, asynchronous=True)
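Until the connection leak is fixed, a possible workaround is to feed in bounded mini-batches so only a limited number of connections is open at any time. The sketch below reuses the `feed_batch` call from the snippet above; `chunked` and `feed_in_mini_batches` are hypothetical helpers, not pyvespa API.

```python
def chunked(items, size):
    """Split a list into consecutive mini-batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def feed_in_mini_batches(app, schema, batch_data, mini_batch_size=1000):
    """Feed one mini-batch at a time so only a bounded number of
    connections/file descriptors is held open at any moment."""
    results = []
    for mini_batch in chunked(batch_data, mini_batch_size):
        results.extend(
            app.feed_batch(
                schema=schema,
                batch=mini_batch,
                batch_size=len(mini_batch),
                total_timeout=200,
                asynchronous=True,
            )
        )
    return results
```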

Error when using onnx model on Windows

Hi,

I have a problem when trying to deploy an app with an onnx model to Docker on Windows 10.

When deploying the app on Docker, I get the following error:

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
----------------------------------------------------------

RuntimeError: ["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session", "Session 15 for tenant 'default' created.",
 'Preparing session 15 using http://localhost:19071/application/v2/tenant/default/session/15/prepared',
 'Request failed. HTTP status code: 400', 'Invalid application package: Error loading default.default: Could not parse schema file \'crossencoder.sd\':
 Unknown symbol: Lexical error at line -1, column 356.  Encountered: "\\\\" (92), after : ""', '']

The application package has worked on another OS, so the definition of it should not be the problem.

Return specialized Vespa connection depending on the ApplicationPackage inheritance

When deploying a specialized application package such as:

from vespa.gallery import TextSearch 

app_package = TextSearch(id_field="id", text_fields=["title", "body"])

the instance returned by the deployment method could be specific to this specialized application, such as VespaTextSearch, inheriting from the base class Vespa.

Different app use cases have different needs and pyvespa currently lacks a pattern to encode those needs. Just to give another example, this would be useful to natively support TextImageSearch and similar use cases.

Retrieving a running container with different VespaDocker instance will lead to deploy failure

It seems that previously defined volumes are not preserved. To reproduce, deploy the same application with two different instances of VespaDocker. The first deployment creates the container. The second deployment retrieves the already existing container, but deployment fails with

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

Allow pyvespa to set rank-features when defining rank-profile.

Example:

rank-profile collect_rank_features inherits default {
	first-phase {
	    expression: random
	}
	ignore-default-rank-features
	rank-features {
	    bm25(title)
	    bm25(body)
	    nativeRank(title)
	    nativeRank(body)
	}
}

This is important when collecting training data for example.
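As a minimal sketch of what pyvespa could generate, assuming a hypothetical `rank_features` argument, the function below just renders the schema text above with plain string templating:

```python
def render_rank_profile(name, expression, rank_features, inherits="default"):
    """Render a rank-profile block with explicit rank-features.

    A string-templating sketch of what a pyvespa RankProfile with a
    hypothetical `rank_features` argument could emit; not pyvespa API.
    """
    features = "\n".join("        {}".format(f) for f in rank_features)
    return (
        "rank-profile {} inherits {} {{\n".format(name, inherits)
        + "    first-phase {\n"
        + "        expression: {}\n".format(expression)
        + "    }\n"
        + "    ignore-default-rank-features\n"
        + "    rank-features {\n"
        + features + "\n"
        + "    }\n"
        + "}"
    )

print(render_rank_profile(
    "collect_rank_features",
    "random",
    ["bm25(title)", "bm25(body)", "nativeRank(title)", "nativeRank(body)"],
))
```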

Query with an embedding and field

I'm trying to run a query using nearest neighbour search while limiting it to records with a certain value, but I get errors when I try to combine the fields in a single yql query.

Using the example from Image Search, I can run:

response = app.query(body={
    "yql": 'select * from sources * where ([{"targetNumHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.features.query(embedding_text)": [0.632, -0.987, ..., 0.534],
    "ranking.profile": "embedding_similarity"
})

and return results.
My records also have numerical attributes "value" and "cost". I can filter on a specific value of "value" or "cost" individually, or on both together, e.g.

response = app.query(body={
    "yql": 'select * from sources * where (value=100 and cost=10);',
    "hits": 100
})

but when I try to combine the embedding search while filtering to a value, I get an error

response = app.query(body={
    "yql": 'select * from sources * where (value=100 and [{"targetNumHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.features.query(embedding_text)": [0.632, -0.987, ..., 0.534],
    "ranking.profile": "embedding_similarity"
})

The error is

mismatched input 'nearestNeighbor' expecting {<EOF>, 'select', ';'}
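One thing worth trying, as an assumption rather than a confirmed fix, is to parenthesize each operand of the `and` so the annotated `nearestNeighbor` is parsed as a single term. The query body below reuses the field names from the snippets above with a truncated placeholder vector:

```python
# Hypothetical query body; whether this parenthesization satisfies the YQL
# parser is an assumption, not a verified fix.
embedding = [0.632, -0.987, 0.534]  # truncated placeholder vector
body = {
    "yql": (
        "select * from sources * where "
        '(value = 100) and ([{"targetNumHits":100}]'
        "nearestNeighbor(embedding_image, embedding_text));"
    ),
    "hits": 100,
    "ranking.features.query(embedding_text)": embedding,
    "ranking.profile": "embedding_similarity",
}
```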

[DOCS] "enclidean" distance metric in Field example

Field class distance_metric parameter should be "euclidean" instead of "enclidean". Example from code:

>>> Field(name="tensor_field",
...     type="tensor<float>(x[128])",
...     indexing=["attribute"],
...     ann=HNSW(
...         distance_metric="enclidean",
...         max_links_per_node=16,
...         neighbors_to_explore_at_insert=200,
...     ),
... )

deploy timeout

For auto testing, it is useful to be able to set a deploy timeout

11:50:14 docs/sphinx/source/deploy-docker.ipynb
11:50:14 /usr/local/lib/python3.8/site-packages/runnb/runnb.py:28: DeprecationWarning: The notebook is NOT trusted.
11:50:14   warnings.warn('The notebook is NOT trusted.', DeprecationWarning)
11:50:47 Waiting for configuration server.
11:50:53 Waiting for configuration server.
11:50:58 Waiting for configuration server.
11:51:03 Waiting for configuration server.
11:51:08 Waiting for configuration server.
11:51:13 Waiting for configuration server.
11:51:18 Waiting for configuration server.

In https://docs.vespa.ai/en/vespa-quick-start.html we do:

vespa status deploy --wait 300

I suggest we add an optional wait parameter to the deploy-command

Verbose pyvespa

Debugging pyvespa pipeline problems is easier in a verbose mode where responses from selected operations are output to stdout (results from requests, etc.).

Maybe an environment variable or similar to enable it.

get_data and update_data are missing namespace

Hi
In the Document v1 API guide, the put operation (which is used under the hood by batch_update) separates the namespace from the schema,
e.g.
http://hostname:8080/document/v1/namespace/music/docid/1
In pyvespa, the code for get_data and update_data does not separate the namespace from the schema,
e.g.

end_point = "{}/document/v1/{}/{}/docid/{}?create={}".format(
            self.app.end_point, schema, schema, str(data_id), str(create).lower()
)

This causes a bug when updating documents with pyvespa: either the document is not updated, or a duplicate is created if the create parameter is set to True.
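A sketch of how the endpoint could be built with a distinct namespace, defaulting it to the schema name for backwards compatibility (`document_endpoint` is a hypothetical helper, not pyvespa API):

```python
def document_endpoint(app_endpoint, schema, data_id, create=False, namespace=None):
    """Build a /document/v1 URL with the namespace separated from the schema.

    The namespace defaults to the schema name, matching pyvespa's current
    behavior, but callers can override it.
    """
    namespace = namespace or schema
    return "{}/document/v1/{}/{}/docid/{}?create={}".format(
        app_endpoint, namespace, schema, data_id, str(create).lower()
    )

print(document_endpoint("http://hostname:8080", "music", 1, namespace="mynamespace"))
```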

Support for document-summaries and struct datatypes

Is there a way to create a Struct data type within a schema? I have an array that I’m trying to use as an imported field, but I can’t find any reference for the struct field.

Also is there any built-in support for creating document summaries?

Misleading situation when using the recall keyword

Hi,
I think there is a misleading situation when using the recall keyword, regarding the number of results returned from app.query.

When running

query_results = app.query(query=query, 
            query_model=query_model, 
            recall=recall_docs,
            )
query_results.get_hits()

The number of results is 10 (default length of hits)

I think the default hits should be the number of docs in recall_docs:

query_results = app.query(query=query, 
            query_model=query_model, 
            recall=recall_docs,
            hits=len(recall_docs)
            )
query_results.get_hits()

If the number of recall docs is less than 10, I would expect to get fewer than 10 results; likewise, when the number of docs in the recall list is more than 10, I would expect to get all the hits for the docs in the recall list.

Does this sound reasonable?

API

Since pyvespa is still under active development:
Please consider using python3 non-blocking coroutines, i.e. the async / await keywords.

Many modern python frameworks ( fastapi, sanic, quart, starlette, ...) are based on async.
Integrating blocking code into an async application is burdensome (you have to delegate to threads or processes).
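Until native coroutines exist, the blocking call can be delegated to a worker thread; the sketch below uses `asyncio.to_thread` (Python 3.9+) with a stand-in function in place of the real blocking pyvespa call:

```python
import asyncio
import time

def blocking_query(payload):
    # Stand-in for a blocking pyvespa call such as app.query(...).
    time.sleep(0.01)
    return {"echo": payload}

async def query_async(payload):
    # Delegate the blocking call to a worker thread so it does not
    # stall the event loop of an async framework (fastapi, sanic, ...).
    return await asyncio.to_thread(blocking_query, payload)

result = asyncio.run(query_async({"q": "test"}))
print(result)
```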

Implement Tokenize operator

Expected query example:

select * from sources * where ({"grammar": "tokenize", "targetHits": 100, "defaultIndex": "default"}userInput("this is a test"));

Expanding the evaluation example to include the recall argument

Hi,
In the Evaluation application documentation example, there is no mention of the option to use recall for specific documents.
Since this option can be very useful in an evaluation process, I suggest adding an example with the recall argument:

top_ids = [...]

query_evaluation = app.evaluate_query(
    eval_metrics = eval_metrics,
    query_model = query_model,
    query_id = query_data["query_id"],
    query = query_data["query"],
    id_field = "id",
    relevant_docs = query_data["relevant_docs"],
    default_score = 0,
    recall = ("id", top_ids[1:3]),
)

Reutilize pyvespa integration test vespa cloud instance.

We currently create a Vespa Cloud instance each time the pyvespa integration tests are run. We should reuse the integration test instance instead, because a new TLS certificate is created every time we create a new instance on Vespa Cloud.

Overwriting application files break Docker volume mount

We create new application files every time we redeploy an application package.

Creating new files instead of modifying them in place breaks the bind-mount (see this issue), and changes in application files on the host are not propagated to the container, leading to new changes not being deployed.

Restarting the container re-establishes the bind-mount and should solve the problem.

M1 app deployment issues.

Hi, I am having some issues deploying the app locally on an M1 Mac. I was wondering if this is a known issue?

Versions:

  • python: 3.9.7
  • pyvespa: 0.14
  • docker: 4.3.2
  • macOS 12.0.1

This is what I see in docker desktop:

[screenshot of Docker Desktop]

When I try to create an app deployment, the application hangs as follows.

from vespa.gallery import QuestionAnswering
from vespa.deployment import VespaDocker

app_package = QuestionAnswering()
vespa_docker = VespaDocker(port=8089)
app = vespa_docker.deploy(application_package=app_package)
Output

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
...

Thanks for all the work in creating a python wrapper for vespa!

Create VespaDocker instance from running container name

vespa_docker = VespaDocker(
    port=8080, 
    disk_folder="/User/username/sample_app", 
    container_memory="8G"
)
vespa_docker.deploy(application_package=app_package)

After deploying in a Docker container like the above, we might want to continue to use the same running container in a future python session. For that, we need a way to instantiate VespaDocker from the container name.

VespaDocker.from_container_name(app_package.name)

Make batch operations more resilient with retry and mini batches by default

Example from a use case:

from collections import Counter
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1), stop=stop_after_attempt(10))
def send_feed_batch(self, feed_batch, total_timeout=10000):
    feed_results = self.app.feed_batch(
        batch=feed_batch, total_timeout=total_timeout
    )
    return feed_results

def index(self, corpus: Dict[str, Dict[str, str]], batch_size=1000):
    batch_feed = [
        {
            "id": idx,
            "fields": {
                "id": idx,
                "title": corpus[idx].get("title", None),
                "body": corpus[idx].get("text", None),
            },
        }
        for idx in list(corpus.keys())
    ]
    mini_batches = [
        batch_feed[i : i + batch_size]
        for i in range(0, len(batch_feed), batch_size)
    ]
    for idx, feed_batch in enumerate(mini_batches):
        feed_results = self.send_feed_batch(feed_batch=feed_batch)
        status_code_summary = Counter([x.status_code for x in feed_results])
        print(
            "Successful documents fed: {}/{}.\nBatch progress: {}/{}.".format(
                status_code_summary[200], len(feed_batch), idx + 1, len(mini_batches)
            )
        )
    return 0

Remove data plane certificate and key management from pyvespa

pyvespa currently generates a data plane certificate and key and stores it in a file every time a deployment is made. This behavior conflicts with the workflow of using the vespa-cli to generate API key and dataplane certificate and key.

I suggest we remove this functionality from pyvespa and rely solely on vespa-cli to set up certificates and keys for Vespa Cloud interaction.

pyvespa fails when the app Docker container exists but is not running

pyvespa will fail if we deploy an app with VespaDocker.deploy and then try to redeploy after the container has been stopped externally. We need to add a check to see whether the app container exists and is running. If it exists but is not running, we need to start it again.

Expose pool_maxsize parameter for HTTPAdapter

On older versions of pyvespa (I was previously using 0.5.0) one had to manually create the HTTPAdapter. Now this is handled inside the vespa library (I like the design choice), but the maximum pool size for the HTTPAdapter's connections is no longer exposed. This hurts multithreaded use of Vespa; the parameter should be exposed through the constructor:

adapter = HTTPAdapter(max_retries=retry_strategy)
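A sketch of the requested configuration using the documented `requests` HTTPAdapter parameters; the pool sizes here are illustrative values, to be matched to the number of worker threads in use:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Mount an adapter whose connection pool is large enough for the
# number of concurrent feeding/querying threads.
retry_strategy = Retry(total=3, backoff_factor=1)
adapter = HTTPAdapter(
    max_retries=retry_strategy,
    pool_connections=32,  # number of host pools to cache
    pool_maxsize=32,      # connections kept per pool
)
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
```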

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

When running documentation example from how-to https://pyvespa.readthedocs.io/en/latest/howto/deploy_app_package/deploy-docker.html#Deploy-application-package-created-with-pyvespa the script fails with error:

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

from vespa.package import ApplicationPackage
app_package = ApplicationPackage(name="my_package")
from vespa.package import Field
app_package.schema.add_fields(
    Field(name = "cord_uid", type = "string", indexing = ["attribute", "summary"]),
    Field(name = "title", type = "string", indexing = ["index", "summary"], index = "enable-bm25"),
    Field(name = "abstract", type = "string", indexing = ["index", "summary"], index = "enable-bm25")
)

from vespa.package import FieldSet
app_package.schema.add_field_set(
    FieldSet(name = "default", fields = ["title", "abstract"])
)

from vespa.package import RankProfile
app_package.schema.add_rank_profile(
    RankProfile(name = "bm25", first_phase = "bm25(title) + bm25(abstract)")
)

import os
from vespa.deployment import VespaDocker

disk_folder = "sample_application" # specify your desired absolute path here
vespa_docker = VespaDocker(
    port=8083,
    disk_folder=disk_folder
)

app = vespa_docker.deploy(
    application_package = app_package,
)

Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_285/3496665072.py in <module>
      8 )
      9 
---> 10 app = vespa_docker.deploy(
     11     application_package = app_package,
     12 )

/.env/lib/python3.8/site-packages/vespa/deployment.py in deploy(self, application_package)
    261         self.export_application_package(application_package=application_package)
    262 
--> 263         return self._execute_deployment(
    264             application_name=application_package.name,
    265             disk_folder=self.disk_folder,

/.env/lib/python3.8/site-packages/vespa/deployment.py in _execute_deployment(self, application_name, disk_folder, container_memory, application_folder, application_package)
    232 
    233         if not any(re.match("Generation: [0-9]+", line) for line in deployment_message):
--> 234             raise RuntimeError(deployment_message)
    235 
    236         app = Vespa(

RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']

pyvespa deploy using POST

Ref question on the public Slack:

I am trying out pyvespa and I want to deploy my application to Docker in the cloud, where I only have endpoint access and not full Docker daemon access. I wasn't able to find a way in pyvespa to deploy without access to Docker itself. I found a way using a curl command:

curl --header Content-Type:application/zip --data-binary @application.zip  localhost:19071/application/v2/tenant/default/prepareandactivate

Is there any way to do endpoint deployment using pyvespa?
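The curl command maps to a plain HTTP POST, so this could be sketched in Python as below. The `zip_application` helper and the file mapping are hypothetical; only the `/prepareandactivate` path comes from the command above.

```python
import io
import zipfile

import requests

def zip_application(files):
    """Zip application files (archive path -> content) in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for path, content in files.items():
            zf.writestr(path, content)
    return buffer.getvalue()

def deploy_application(config_server, files):
    """POST the zipped package to the config server, mirroring the curl call."""
    return requests.post(
        "{}/application/v2/tenant/default/prepareandactivate".format(config_server),
        headers={"Content-Type": "application/zip"},
        data=zip_application(files),
    )
```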

Confusing naming in pyvespa: query.RankProfile and package.RankProfile

We have a class named RankProfile in the vespa.package module to create rank-profiles in the application package. However, we also have a class named RankProfile in the vespa.query module to define which rank-profile should be used in the query model.

This is unnecessarily confusing. I suggest we use vespa.query.Ranking instead of vespa.query.RankProfile to clarify the different use cases.

Accessing deployed docker container from another IP?

I've deployed my Vespa app using pyvespa VespaDocker, which I can connect to on localhost on the same machine, but trying to connect to it from another machine results in a timeout. Do we need to run the application on 0.0.0.0 (as with flask, for example), to enable connection from external machines? If so, looking at the source code "localhost" is hardcoded in several places, so I guess it's not currently possible?

Move container config arguments from deploy to VespaDocker

Current usage:

vespa_docker = VespaDocker(port=8080)
app = vespa_docker.deploy(
    application_package=app_package, 
    container_memory="8G", 
    disk_folder="/Users/username/app_folder" 
)

Suggested usage:

vespa_docker = VespaDocker(
    port=8080, 
    container_memory="8G", 
    disk_folder="/Users/username/app_folder" 
)
app = vespa_docker.deploy(
    application_package=app_package, 
)

Reason: Container config parameters belong to the initialization method as we should only specify them once. The deploy method will be called every time we need to redeploy our application package and it makes no sense to repeat container config args such as container_memory every time we redeploy to the same container.

Adding a field should overwrite existing field content

Currently, the code below adds two fields named title to the application package schema instead of updating the existing title field.

from vespa.package import ApplicationPackage, Field

app_package = ApplicationPackage(name="news")
#
# Add title field  
#
app_package.schema.add_fields(  
    Field(name="title", type="string", indexing=["index", "summary"])
)
#
# Update title field
#
app_package.schema.add_fields(  
    Field(name="title", type="string", indexing=["index", "summary"], index=["enable-bm25"])
)

pyvespa model integration

Are there any plans to make it simpler for users to utilize any type of ml model (cnn, rnn, gnn, etc.) they want with Vespa during inference?

pyvespa is a useful tool for integrating ml models with the Vespa engine, but it still feels limited. For example, in sequence-classification-task-with-vespa-cloud it can only load huggingface text models. Are there any plans to create a wrapper that lets users implement their own models in pyvespa, with customizability for pre/post-processing? I say this because having ml developers rewrite pre/post-processing in Java is not a fun experience. Could this also be possible when the model runs inference within the content cluster? I find that loading embeddings directly into Vespa is painless, but trying to load models into Vespa causes some pains.

Thanks.

Add linkcheck to our CI

Current documentation is built by readthedocs once a PR gets merged to master. However, we should add link checking and verify that the documentation builds in our CI, to catch documentation errors before they hit the master branch.

Instructions on deploying to Vespa Cloud Prod

It seems like the process for deploying to Prod is different from dev deployments. The Prod deployment page asks for two zip files. I downloaded the application zip file from Vespa Cloud, and I see that it's also possible to generate it with the to_files call. However, I'm not sure how to generate the test package. Please help me understand the process.
