Background
The request most likely to run into a timeout is a batch request, as it is essentially a collection of import requests. With some requests from the data-classification repo (cc @michaverhagen, @antas-marcin) we have seen that a batch size of 1000 can lead to timeouts, depending on the data set. (Note that "batch size" in this case refers to the number of objects in a POST v1/batch/objects request, not to be confused with the classification batching that the repo does.)
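To make the terminology concrete: the batch size in question is simply the number of entries in the "objects" array of the request body. A minimal sketch of such a payload, with invented class and property names:

```python
# Sketch of a v1/batch/objects request body. The class name "Document" and
# the "text" property are made up for illustration; real payloads depend on
# your schema. "Batch size" as used here is just len(payload["objects"]).
objects = [
    {"class": "Document", "properties": {"text": f"document {i}"}}
    for i in range(1000)
]
payload = {"objects": objects}

batch_size = len(payload["objects"])  # 1000 objects in a single POST
```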
If such a timeout occurs, the reason for it (which in my opinion is a batch size that is too large) is very opaque to the user. I think we should achieve the following goals; I don't know how feasible they are in the client, hence this discussion.
Goals
1. Handle gracefully, make user aware
The current output when a timeout occurs looks something like this. Note that I have shortened the snippet to remove potentially sensitive data:
.... shortened to remove sensitive stuff ....
File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 102, in create_objects
return self.create(
File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 64, in create
response = self._connection.run_rest(
File "/usr/local/lib/python3.9/site-packages/weaviate/connect/connection.py", line 221, in run_rest
response = requests.post(
File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 119, in post
return request('post', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='REMOVED.semi.network', port=443): Read timed out. (read timeout=20)
Technically, all of the information is present in the above. However, it is very hard to find the relevant part, and there is no suggestion for how the user can fix it. I think it would be nicer if a clear error message (if possible without any kind of stack trace?) were printed, such as
The request was cancelled because it took longer than the configured timeout of 20s. Try reducing the load or increasing the timeout if the load cannot be reduced any further.
Not sure if we should encourage the user to increase the timeout. Long-running HTTP requests are problematic, because at some point we will run into hard-coded (or default) limits in load balancers and proxies; from my experience these are usually around the 1min mark. Since the default is already 20s - which is a lot - I don't think we should encourage the user to simply increase the timeout further, which brings me to the next point.
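In the client, goal 1 could amount to catching the ReadTimeout near the request call and re-raising it with the proposed message. A sketch under assumed names (BatchTimeoutError and the injectable `post` parameter are not client API, just illustration):

```python
import requests


class BatchTimeoutError(Exception):
    """Hypothetical user-facing exception replacing the raw stack trace."""


def post_with_clear_timeout(url, payload, timeout=20, post=requests.post):
    # Sketch: translate requests' low-level ReadTimeout into a single,
    # actionable message. `post` is injectable only to keep this testable.
    try:
        return post(url, json=payload, timeout=timeout)
    except requests.exceptions.ReadTimeout:
        raise BatchTimeoutError(
            f"The request was cancelled because it took longer than the "
            f"configured timeout of {timeout}s. Try reducing the load or "
            f"increasing the timeout if the load cannot be reduced any "
            f"further."
        ) from None  # hide the internal traceback from the user-facing error
```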
2. Print current batch size and suggest reducing it
I have no idea whether this is possible in the client at all, but it would be really nice if we could detect that the timed-out request was a batch request and make the user aware, such as
The batch request was cancelled because it took longer than the configured timeout of 20s. Try reducing the batch size (currently 1000) to a lower value. Aim to complete batch requests, on average, within less than half (10s) of the currently configured timeout (20s).
The most important part is about reducing the batch size. The second sentence encourages the user to think about monitoring their requests. The idea behind that is that you could be in a very dangerous situation where your requests take 19s on average. Then you change a tiny thing (maybe add slightly longer descriptions) and suddenly all requests run into timeouts - seemingly without having changed anything significant.
Feedback
Happy to hear your feedback on the UX suggestions and the feasibility of implementing both 1 and 2 in the client.
I think we should also implement something like this in the other clients, but given that the Python client receives 99% of the usage we're aware of at the moment, it's where we should start :)