Background
The request most likely to run into a timeout is a batch request, as it is essentially a collection of import requests. With some requests from the data-classification repo (cc @michaverhagen, @antas-marcin) we have seen that a batch size of 1000 can lead to timeouts, depending on the data set. (Note that "batch size" in this case refers to the number of objects in a POST v1/batch/objects request, not to be confused with the classification batching that the repo does.)
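To make the terminology concrete: the batch size in question is simply the number of entries in the "objects" array of the request body. A minimal sketch of such a payload, with invented class and property names:

```python
# Sketch of a v1/batch/objects request body. The class name "Document" and
# the "text" property are made up for illustration; real payloads depend on
# your schema. "Batch size" as used here is just len(payload["objects"]).
objects = [
    {"class": "Document", "properties": {"text": f"document {i}"}}
    for i in range(1000)
]
payload = {"objects": objects}

batch_size = len(payload["objects"])  # 1000 objects in a single POST
```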
If such a timeout occurs, the reason for it (which in my opinion is a batch size that is too large) is very opaque to the user. I think we should achieve the following goals; I don't know how feasible they are in the client, hence this discussion.
Goals
1. Handle gracefully, make user aware
The current output when a timeout occurs looks something like this. Note that I have shortened the snippet to remove potentially sensitive data:
.... shortened to remove sensitive stuff ....
File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 102, in create_objects
return self.create(
File "/usr/local/lib/python3.9/site-packages/weaviate/batch/crud_batch.py", line 64, in create
response = self._connection.run_rest(
File "/usr/local/lib/python3.9/site-packages/weaviate/connect/connection.py", line 221, in run_rest
response = requests.post(
File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 119, in post
return request('post', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='REMOVED.semi.network', port=443): Read timed out. (read timeout=20)
Technically, all of the information is present in the above. However, it is very hard to find the relevant part, and there is no suggestion for how the user can fix it. I think it would be nicer if a clear error message (if possible without any kind of stack trace?) were printed, such as
The request was cancelled because it took longer than the configured timeout of 20s. Try reducing the load or increasing the timeout if the load cannot be reduced any further.
Not sure if we should encourage the user to increase the timeout. Long-running HTTP requests are problematic, because at some point we will run into hard-coded (or default) limits in load balancers and proxies; from my experience these are usually around the 1min mark. Since the default is already 20s - which is a lot - I don't think we should encourage the user to simply increase the timeout further, which brings me to the next point.
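In the client, goal 1 could amount to catching the ReadTimeout near the request call and re-raising it with the proposed message. A sketch under assumed names (BatchTimeoutError and the injectable `post` parameter are not client API, just illustration):

```python
import requests


class BatchTimeoutError(Exception):
    """Hypothetical user-facing exception replacing the raw stack trace."""


def post_with_clear_timeout(url, payload, timeout=20, post=requests.post):
    # Sketch: translate requests' low-level ReadTimeout into a single,
    # actionable message. `post` is injectable only to keep this testable.
    try:
        return post(url, json=payload, timeout=timeout)
    except requests.exceptions.ReadTimeout:
        raise BatchTimeoutError(
            f"The request was cancelled because it took longer than the "
            f"configured timeout of {timeout}s. Try reducing the load or "
            f"increasing the timeout if the load cannot be reduced any "
            f"further."
        ) from None  # hide the internal traceback from the user-facing error
```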
2. Print current batch size and suggest reducing it
I have no idea whether this is possible in the client at all, but it would be really nice if we could detect that the timed-out request was a batch request and make the user aware, such as
The batch request was cancelled because it took longer than the configured timeout of 20s. Try reducing the batch size (currently 1000) to a lower value. Aim to complete batch requests, on average, within less than half (10s) of the currently configured timeout (20s).
The most important part is about reducing the batch size. The second sentence encourages the user to think about monitoring their requests. The idea behind that is that you could be in a very dangerous situation where your requests take 19s on average. Then you change a tiny thing (maybe add slightly longer descriptions) and suddenly all requests run into timeouts - seemingly without having changed anything significant.
Feedback
Happy to hear your feedback on the UX suggestions and the feasibility of implementing both 1 and 2 in the client.
I think we should also implement something like this in the other clients, but given that the Python client receives 99% of the usage we're aware of at the moment, it's where we should start :)