weaviate-python-client's Introduction

Weaviate python client

A native Python client for easy interaction with a Weaviate instance.

The client is tested for Python 3.8 and higher.

Visit the official Weaviate website for more information about Weaviate and how to use it in production.

Articles

Here are some articles on Weaviate:

Documentation

Support

  • Use our Forum for support or any other question.
  • Use our Slack Channel for discussions or any other question.
  • Use the weaviate tag on StackOverflow for questions.
  • For bugs or problems, submit a GitHub issue.

Contributing

To contribute, read How to Contribute.

weaviate-python-client's People

Contributors

abhishek-compro, aliszka, antas-marcin, apetresc, bobvanluijt, cdpierse, dandv, databyjp, daveatweaviate, dennis-ge, dependabot[bot], dirkkul, dudanogueira, dvanderrijst, edugonza, etiennedi, fefi42, fzowl, halilbilgin, hsm207, iangregson, jfrancoa, nilskulawiak, parkerduckworth, redouan-rhazouani, samos123, sky-2002, stefanbogdan, trengrj, tsmith023

weaviate-python-client's Issues

Querying weaviate using GQL pattern discussion

Currently the client allows a very simple query that is basically just a POST of a graphql query to weaviate. Users might want to build more complex GQL queries and might not want to use just (hard coded) strings. Creating string based queries also has the risk of creating queries vulnerable to injections.

The problem here is that while the REST API is very easy to use in Python, GQL is not. That's because JSON can be represented seamlessly as native Python objects while GQL cannot. I would like to have a discussion on how to approach this.
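To make the injection concern concrete, here is a minimal sketch of a query builder that validates identifiers and escapes string values instead of concatenating raw input. The helper names (build_get_query, quote_string) and the exact query shape are assumptions made for illustration; they are not part of the client.

```python
import json
import re

# GraphQL names: letters, digits and underscores, not starting with a digit.
_NAME_RE = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def _check_name(name):
    """Reject anything that is not a plain GraphQL identifier."""
    if not isinstance(name, str) or not _NAME_RE.fullmatch(name):
        raise ValueError(f"Invalid GraphQL name: {name!r}")
    return name


def quote_string(value):
    """Escape a Python string for use as a string literal in a query.

    Assumption: JSON string escaping is used here as a stand-in for
    GraphQL string escaping.
    """
    return json.dumps(value)


def build_get_query(class_name, properties, limit=None):
    """Build a simple Get query (hypothetical helper, illustrative shape)."""
    _check_name(class_name)
    fields = " ".join(_check_name(p) for p in properties)
    args = f"(limit: {int(limit)})" if limit is not None else ""
    return f"{{ Get {{ {class_name}{args} {{ {fields} }} }} }}"
```

Because every name is checked and every string value is escaped, user input cannot close a brace or a quote early and smuggle extra fields into the query.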

One suggestion would be to recommend a GQL Python client to the users and add support for that in our client. This would spare us from implementing any GQL-specific standards ourselves.

Any opinions about this? @bobvanluijt @etiennedi @laura-ham

Input checking of weaviate url

A trailing slash in the URL leads to the instance not being reachable through the client.

client = weaviate.Client("https://my-instance.com/")
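One possible fix is to normalize the URL once, before it is stored. The sketch below is only an illustration; normalize_url is a hypothetical helper name, not an existing client function.

```python
def normalize_url(url):
    """Strip surrounding whitespace and any trailing slashes from a base URL."""
    if not isinstance(url, str):
        raise TypeError(f"url must be a str, got {type(url).__name__}")
    return url.strip().rstrip("/")
```

With this in place, "https://my-instance.com/" and "https://my-instance.com" would address the same instance.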

Add proper code documentation

All functions in the client class Weaviate should be documented using one consistent scheme:

""" Description

:param thing: Thing to be added
:type thing: dict
:return: value of ...
:raises: Exception
"""

Refactor client class

The client is defined in one file in one class. This is getting messy. It might make sense to break the client up more into different concerns like schema and data.

add_reference_to_thing doesn't need URL while batching

Currently, a batch is done like this: client.add_reference_to_thing("2db436b5-0557-5016-9c5f-531412adf9c6", "members", "b36268d4-a6b5-5274-985f-45f13ce0c642")

However, in all other situations, a beacon (or CREF) is always set as a URL which might be confusing.

The suggestion is to add the function: add_reference() which behaves the exact same way but with the URLs: client.add_reference("weaviate://localhost/things/2db436b5-0557-5016-9c5f-531412adf9c6", "members", "weaviate://localhost/things/b36268d4-a6b5-5274-985f-45f13ce0c642")

In the documentation, I would then also like to focus on add_reference() for consistency.
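To support both calling styles, the client would need to convert between plain UUIDs and beacons. The following is a minimal sketch under the assumption that a beacon always has the form weaviate://<host>/<semantic type>/<uuid>, as in the example above; parse_beacon and to_beacon are hypothetical helper names.

```python
import uuid
from urllib.parse import urlparse


def parse_beacon(beacon):
    """Split a beacon into (host, semantic_type, uuid).

    Assumed beacon form: weaviate://<host>/<semantic type>/<uuid>
    """
    parsed = urlparse(beacon)
    if parsed.scheme != "weaviate":
        raise ValueError(f"Not a weaviate beacon: {beacon!r}")
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError(f"Unexpected beacon path: {parsed.path!r}")
    semantic_type, object_id = parts
    # uuid.UUID raises ValueError if the id is not a valid UUID.
    return parsed.netloc, semantic_type, str(uuid.UUID(object_id))


def to_beacon(object_id, semantic_type="things", host="localhost"):
    """Build a beacon from a plain UUID string."""
    return f"weaviate://{host}/{semantic_type}/{uuid.UUID(object_id)}"
```

The same parser also yields the semantic type, so it could back both the UUID-based and the URL-based function.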

REMOVE: ClientConfig class

The ClientConfig class should be removed since it holds only a tuple of ints and does not provide any methods. This looks like something that should be an attribute of Connection. It is also very annoying to create a whole class just to pass the tuple configuration, which could be passed to the Client and Connection directly. This requires changing Client.__init__ and Connection.__init__.

create schema doesn't work

I tried to create schema with

client = weaviate.Client(WEAVIATE_URL)
r = client.create_schema(body)

It returned r as None, but later, when I tried to get the schema and add anything under the defined class, it threw an error saying

weaviate.exceptions.UnexpectedStatusCodeException: Creating thing {'error': [{'message': "invalid thing: class 'Articlev1' not present in schema"}]}

API review

The client has gained quite a lot of functionality. It is now important to review its API to ensure a consistent UX.
The goal is a road map to an API that can be published as version 1.

Some things to review:

  • Naming of functions and arguments
  • Coherence with our other products
  • Error handling and exceptions
  • Documentation within code (see #9)
  • Documentation at the website?
  • accepted types (e.g. url, file, dict)

The main focus is the user-facing classes: mainly Weaviate and helper classes, e.g. those used for batching.

Please also link all issues that result from this review in #3

Allow for basic CRUD and DDL functionality to work along example or tutorial.

https://travis-ci.org/github/WolfgangFahl/DgraphAndWeaviateTest has test code for Dgraph and Weaviate. While it was relatively straight forward to get the Dgraph unit tests at https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testDgraph.py working based on the tutorial https://dgraph.io/tour/ and sample code https://github.com/dgraph-io/pydgraph/blob/master/examples/simple/simple.py the unit tests in:

https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/tests/testWeaviate.py

do not work out of the box yet. The port will have to be modified since it's also used by Dgraph, the Docker image will have to be pulled and started, and there is a need for a drop_all functionality as in the Dgraph example. Otherwise it should be possible to write a simple Weaviate wrapper, just as https://github.com/WolfgangFahl/DgraphAndWeaviateTest/blob/master/dg/simple.py was derived from the https://github.com/dgraph-io/pydgraph/blob/master/examples/simple/simple.py code.

For issues with startup via Docker, see also https://stackoverflow.com/questions/63260073/starting-zero-alpha-and-ratel-in-a-single-command-e-g-in-macosx-and-other-envir

The check whether classification is still running sometimes throws an exception. See the message below

{'basedOnProperties': ['shortDescription'], 'class': 'Product', 'classifyProperties': ['ofFlavour'], 'id': 'e4ad5c2e-3e3e-470f-b942-93df9cd26d0e', 'k': 7, 'meta': {'completed': '0001-01-01T00:00:00.000Z', 'started': '2020-12-11T08:41:16.188Z'}, 'status': 'running', 'type': 'knn'}
Classifying All.Traceback (most recent call last):
File "./validate.py", line 28, in <module>
main()
File "./validate.py", line 19, in main
validateClassification(config, data)
File "PATH/modules/API.py", line 96, in validateClassification
executeClassification(client, config)
File "PATH/modules/ClassifyData.py", line 83, in executeClassification
print_classification_status(client, status, "All")
File "PATH/modules/ClassifyData.py", line 16, in print_classification_status
while client.classification.is_running(status['id']):
File "PATH/Envs/env1/lib/python3.8/site-packages/weaviate/classification/classify.py", line 67, in is_running
return self._check_status(classification_uuid, "running")
File "PATH/Envs/env1/lib/python3.8/site-packages/weaviate/classification/classify.py", line 71, in _check_status
response = self.get(classification_uuid)
File "PATH/Envs/env1/lib/python3.8/site-packages/weaviate/classification/classify.py", line 45, in get
raise UnexpectedStatusCodeException("Get classification status", response)
weaviate.exceptions.UnexpectedStatusCodeException: Unexpected status code: 502, with response body: None
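Until the root cause of the 502 is known, a caller can make the polling loop tolerant of occasional failures. The sketch below is one possible workaround, not a fix in the client; wait_while_running and its parameters are hypothetical names.

```python
import time


def wait_while_running(is_running, poll_interval=1.0,
                       max_consecutive_errors=3,
                       transient_errors=(Exception,)):
    """Poll is_running() until it returns False.

    Up to max_consecutive_errors failures in a row are tolerated;
    one more consecutive failure re-raises the last error.
    """
    errors = 0
    while True:
        try:
            running = is_running()
        except transient_errors:
            errors += 1
            if errors > max_consecutive_errors:
                raise
        else:
            errors = 0  # a successful check resets the error counter
            if not running:
                return
        time.sleep(poll_interval)
```

In the script from the traceback this could be called as wait_while_running(lambda: client.classification.is_running(status['id'])), ideally with transient_errors narrowed to UnexpectedStatusCodeException.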

Accept url to json schema parameter in "schema.contains()"

schema.create() accepts a dict as well as a URL to a JSON file. The similar function schema.contains() does not support a URL to a JSON schema file as a parameter. The suggestion is to also accept a URL in schema.contains(), for consistency and convenience. For example:

import weaviate
w = weaviate.Client("http://localhost:8080")

contains_people_schema = w.schema.contains("https://raw.githubusercontent.com/semi-technologies/weaviate-python-client/master/documentation/getting_started/people_schema.json")
print(contains_people_schema)

# similar to
w.schema.create("https://raw.githubusercontent.com/semi-technologies/weaviate-python-client/master/documentation/getting_started/people_schema.json")
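One way to get this consistency is to let both functions share a single loader that resolves the argument to a dict first. The following is a minimal sketch using only the standard library; load_schema is a hypothetical helper name and does not reflect how the client is actually implemented.

```python
import json
import os
from urllib.request import urlopen


def load_schema(source):
    """Return a schema dict from a dict, a URL, or a path to a JSON file."""
    if isinstance(source, dict):
        return source
    if not isinstance(source, str):
        raise TypeError("schema must be a dict, a URL or a file path")
    if source.startswith(("http://", "https://")):
        # Download and decode the JSON document behind the URL.
        with urlopen(source) as response:
            return json.loads(response.read().decode("utf-8"))
    if os.path.isfile(source):
        with open(source, encoding="utf-8") as schema_file:
            return json.load(schema_file)
    raise ValueError(f"No schema found at {source!r}")
```

Both create() and contains() could then call load_schema() on their argument and work on the resulting dict.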

Create schema import for rdf triples

Allow the import of RDF triples in the form of Subject Predicate Object. The function should not be part of the client itself but rather a separate sub package, like batch, where the user gets some extended functionality to handle common data representations.

FIX: Python Client's TIMEOUT docstring and validation

The current implementation of the Weaviate Python client makes a wrong assumption about the requests library's timeout parameter. It currently accepts only tuples of ints, interpreted as (retries, timeout seconds), which is wrong. The timeout parameter can be a float or a tuple of two floats (float, float). The difference is described in the requests documentation on the official readthedocs page.
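In requests, a single number applies to both the connect and the read timeout, while a tuple is read as (connect timeout, read timeout). A validation along the following lines would match that behaviour; validate_timeout is a hypothetical helper name, and requiring positive values is an assumption of this sketch.

```python
from numbers import Real


def _as_positive_float(value):
    """Convert a single timeout value to float, rejecting bad input."""
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(f"timeout values must be numbers, got {value!r}")
    if value <= 0:
        raise ValueError(f"timeout values must be positive, got {value!r}")
    return float(value)


def validate_timeout(timeout):
    """Return the timeout as a float or a (connect, read) tuple of floats."""
    if isinstance(timeout, (tuple, list)):
        if len(timeout) != 2:
            raise ValueError("timeout tuple must be (connect, read)")
        return (_as_positive_float(timeout[0]), _as_positive_float(timeout[1]))
    return _as_positive_float(timeout)
```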

MILESTONE: Get to version 1

  • Have all discussed API features implemented (@michaverhagen, please add list)
  • Have all API features discussed with @laura-ham for UX quality
  • Unit tests for all features
  • CI pipeline for project
  • Publish version 1.0.0 on GitHub
  • Publish version 1.0.0 on pip

Possible backward incompatibility

If I try to import this dataset with python client 2.4.0 I get the following error:

ValueError: Not valid 'uuid' or 'uuid' can not be extracted from value

However, if I downgrade to the client version specified in the requirements.txt (2.0.2), the import works fine.

I don't know if this is an issue with the client or with the repository. Thanks for investigating.

EDIT: tests

Make the code of the unit tests and integration tests more readable and more consistent. For unit tests, each function/method should have its own test function and each class should have a separate unittest class. For integration tests, make it clearer what is being tested.

Discussion: Handling Timeouts more gracefully

Background

The most likely request to ever run into a timeout is a batch request, as that is basically a collection of import requests. With some requests from the data-classification repo (cc @michaverhagen, @antas-marcin), we have seen that a batch size of 1000 can lead to timeouts depending on the data set. (Note that "batch size" in this case refers to the number of objects in a POST v1/batch/objects request, not to be confused with the classification batching which the repo does.)

If such a timeout occurs, the reason for the timeout (which in my opinion is the batch size which is too large) is very opaque to the user. I think we should achieve the following goals. I don't know how feasible they are in the client, hence this discussion.

Goals

1. Handle gracefully, make user aware

The current output when a timeout occurs looks something like this. Note that I have shortened the snippet to remove potentially sensitive data:

.... shortened to remove sensitive stuff ....
 File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 102, in create_objects
    return self.create(
  File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 64, in create
    response = self._connection.run_rest(
  File "/usr/local/lib/python3.9/site-packages/weaviate/connect/connection.py", line 221, in run_rest
    response = requests.post(
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='REMOVED.semi.network', port=443): Read timed out. (read timeout=20)

Technically, all of the info is present in the above. However, it's very hard to find the correct info and there is no suggestion how the user can fix it. I think it would be nicer if a clear error message (if possible without any kind of stack trace?) is printed such as

The request was cancelled because it took longer than the configured timeout of 20s. Try reducing the load or increasing the timeout if the load cannot be reduced any further.

Not sure if we should encourage the user to try to increase the timeout. Long-running HTTP requests are problematic, because at some point we will run into hard-coded (or default) limits in load balancers and proxies. From my experience these are usually around the 1min mark. Since the default is already 20s - which is a lot - I don't think we should encourage the user to just increase the timeout further, which brings me to the next point

2. Print current batch size and suggest reducing it

I have no idea if this is possible in the client at all, but it would be really nice if we could catch the fact that the timed out request was a batch request and make the user aware, such as

The batch request was cancelled because it took longer than the configured timeout of 20s. Try reducing the batch size (currently 1000) to a lower value. Aim to complete batch requests, on average, in less than half (10s) of the currently configured timeout (20s).

The most important part is about reducing the batch size. The second sentence encourages the user to think about monitoring their requests. The idea behind that is that you could be in a very dangerous situation where your requests take 19s on average. Then you change a tiny thing (maybe add slightly longer descriptions) and suddenly all requests run into timeouts, seemingly without changing anything considerable.
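As a feasibility check for goals 1 and 2, the sketch below wraps the sending of a batch and turns a timeout into one clear message that includes the batch size. Everything here is hypothetical: BatchTimeoutError, send_batch and the send callable are illustrative names, and the built-in TimeoutError stands in for requests.exceptions.ReadTimeout so the example has no third-party dependency.

```python
class BatchTimeoutError(Exception):
    """Hypothetical error type carrying a user-friendly timeout message."""


def send_batch(send, objects, timeout, timeout_errors=(TimeoutError,)):
    """Call send(objects, timeout) and translate timeouts into a clear error.

    With requests, timeout_errors would be
    (requests.exceptions.ReadTimeout,) instead of the built-in TimeoutError.
    """
    try:
        return send(objects, timeout)
    except timeout_errors as error:
        raise BatchTimeoutError(
            "The batch request was cancelled because it took longer than "
            f"the configured timeout of {timeout}s. Try reducing the batch "
            f"size (currently {len(objects)}) to a lower value."
        ) from error
```

Since the wrapper sees both the batch and the timeout, it has everything it needs to produce the message suggested above; whether the stack trace can be suppressed entirely depends on how the caller handles the exception.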

Feedback

Happy to hear your feedback on the UX suggestions and the feasibility of implementing both 1 and 2 in the client.

I think we should also implement something like this in the other clients, but given that the python client receives 99% of the usage we're aware of at the moment, it's where we should start :)

Transform common RDF based schemas into a corresponding weaviate schema

Schema definitions like RDFS and OWL should be transformed into a comprehensive Weaviate schema. This requires creating things like containers and collections as Weaviate classes. Property classes should be transformed into primitive properties if that is what they represent in the source schema. rdfs:comment or rdfs:label may be used as class descriptions instead of, or in addition to, Weaviate class properties.

Replace references

So far, references are only appended using POST; however, they can also be replaced using PUT.

Missing cardinality key gives error even if the field is not needed

The client gives an error if the "cardinality" field isn't set on all fields. However, the cardinality field only has to be set for references.

Traceback (most recent call last):
  File "./import.py", line 12, in <module>
    CLIENT.create_schema('schema.json')
  File "/usr/local/lib/python3.7/site-packages/weaviate/client.py", line 444, in create_schema
    loaded_schema[SCHEMA_CLASS_TYPE_THINGS]["classes"])
  File "/usr/local/lib/python3.7/site-packages/weaviate/client.py", line 484, in _create_class_with_primitives
    schema_class["properties"] = self._get_primitive_properties(weaviate_class["properties"])
  File "/usr/local/lib/python3.7/site-packages/weaviate/client.py", line 514, in _get_primitive_properties
    "cardinality": property_["cardinality"],
KeyError: 'cardinality'
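The KeyError comes from indexing property_["cardinality"] unconditionally. A possible fix is to copy the key only when it is present. The sketch below is an assumption about the shape of the function, since the original source is not shown here; only the "cardinality" lookup is taken from the traceback, and the other field names are illustrative.

```python
def get_primitive_properties(properties):
    """Copy property definitions, adding 'cardinality' only when it is set."""
    result = []
    for property_ in properties:
        prop = {
            "name": property_["name"],
            "dataType": property_["dataType"],
        }
        # Only reference properties need a cardinality, so it is optional.
        if "cardinality" in property_:
            prop["cardinality"] = property_["cardinality"]
        result.append(prop)
    return result
```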

ConnectionError is misleading

ConnectionError is documented as the error that can be thrown by the client. However, it is not well documented that this is actually requests.exceptions.ConnectionError and not builtins.ConnectionError. Furthermore, ConnectionError does not include any timeout errors.

Replace ConnectionError with RequestException as the more general exception.

Allow setting custom vector in single and batch create object

Background

With the changes for v1 we have introduced the module system in Weaviate, which also allows for setting the vectorizer on a per-class basis. For example, with the app-wide config DEFAULT_VECTORIZER_MODULE=text2vec-contextionary, which is currently present in all docker-compose files we publish, the vectorizer is always set to text2vec-contextionary. However, it can be overridden.

It is also possible to explicitly set the vectorizer to none. In this case, Weaviate doesn't do any vectorization, but expects the user to provide the vectors. This is one way of using Weaviate with vectors coming from any ML model - without having to write a full module for it.

Goals

  • The user can specify a custom vector when creating a single or batch object.
    Note: Weaviate will only accept the vector if the vectorizer is set to none. However, the client does not need to validate this explicitly and can just pass through any errors coming from Weaviate

Nice to Have

  • Support typical python standards
    I think there are a few constructs in some of the common libraries for vectors, such as torch.Tensor and an equivalent in numpy. It would be cool if the client could support the most common ones out of the box.

Weaviate API

You can simply specify the vector at the root level. Here's an example with curl:

curl -s localhost:8080/v1/schema -H 'content-type: application/json' \
  -d '{"class": "TestClass", "vectorizer":"none", "properties":[{"name": "name", "dataType": ["string"]}]}' 

curl -s localhost:8080/v1/objects -H 'content-type: application/json' \
  -d '{"class": "TestClass", "vector": [0,1], "properties": {"name":"hello world"}}' 

If we then check the object with the ?include=vector option, the custom vector is present:

image
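On the client side, the second curl call above could be mirrored by a small payload builder. This is a minimal sketch, not the client's API: build_object_payload and to_vector_list are hypothetical names. It relies on the fact that both numpy arrays and torch tensors offer a tolist() method, so neither library needs to be imported.

```python
def to_vector_list(vector):
    """Convert a list-like vector to a plain list of floats.

    Objects exposing tolist() (e.g. numpy.ndarray, torch.Tensor) are
    converted through it; anything else is treated as an iterable.
    """
    if hasattr(vector, "tolist"):
        vector = vector.tolist()
    return [float(x) for x in vector]


def build_object_payload(class_name, properties, vector=None):
    """Build the JSON body for creating one object, with an optional vector."""
    payload = {"class": class_name, "properties": properties}
    if vector is not None:
        # The vector sits at the root level, as in the curl example.
        payload["vector"] = to_vector_list(vector)
    return payload
```

As stated in the goals, the sketch does not check whether the class uses the none vectorizer; any resulting error would simply come back from Weaviate.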

Create primitive weaviate schema from RDF triples.

Add a convenience utility to transform RDF triples into a Weaviate schema. The schema should consider every Subject as a class, every Predicate as a property and every Object as the referenced class.

It should not be part of the core client but rather part of an extra sub package.
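The mapping described above can be sketched in a few lines. The keys "class", "properties", "name" and "dataType" follow the schema JSON used elsewhere on this page; the surrounding {"classes": [...]} wrapper, the function name schema_from_triples, and the handling of several target classes per property are assumptions. Real RDF terms are IRIs and would first need to be mapped to valid class and property names, which this sketch does not do.

```python
def schema_from_triples(triples):
    """Derive a primitive schema from (subject, predicate, object) triples.

    Every subject and object becomes a class; every predicate becomes a
    property on the subject class that references the object class.
    """
    classes = {}
    for subject, predicate, obj in triples:
        for name in (subject, obj):
            classes.setdefault(name, {"class": name, "properties": []})
        properties = classes[subject]["properties"]
        existing = next((p for p in properties if p["name"] == predicate), None)
        if existing is None:
            properties.append({"name": predicate, "dataType": [obj]})
        elif obj not in existing["dataType"]:
            # Same predicate pointing at another class: add a second target.
            existing["dataType"].append(obj)
    return {"classes": list(classes.values())}
```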

Add references refactor

Adding references is not tested to the degree it should be.
Adding references could also use some more convenience functions, especially one that extracts the semantic type directly from the URL.
