
Comments (19)

willydouhard commented on June 14, 2024

You are correct, we are not leveraging async implementations at the moment. The main reason is that I feel most code bases are not written in the async paradigm, and it is quite hard, and not always possible, to transition from sync to async.

To mitigate this, we currently run agents in different threads, so a single agent will not block the whole app.
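
Roughly the idea (just a sketch of the thread-per-agent approach, not Chainlit's actual internals; agent.run and on_done are placeholders):

from concurrent.futures import ThreadPoolExecutor

# Illustrative only: each blocking agent call runs in its own worker thread
# so one slow completion does not block the other sessions.
executor = ThreadPoolExecutor(max_workers=8)

def handle_message(agent, user_input, on_done):
    # agent.run is a synchronous, potentially long-running call
    future = executor.submit(agent.run, user_input)
    # on_done is called with the result once the agent finishes
    future.add_done_callback(lambda f: on_done(f.result()))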

As we move forward I would love to see Chainlit support async implementations :)


willydouhard commented on June 14, 2024

For streaming, Chainlit already supports streaming with openai, langchain and any python code. See https://docs.chainlit.io/concepts/streaming/python :)
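
The Python case from those docs boils down to something like this (a minimal sketch based on the docs; the decorator and Message method names may differ between Chainlit versions):

import chainlit as cl

@cl.on_message
async def main(message):
    msg = cl.Message(content="")
    # the tokens would normally come from your LLM call; hard-coded here
    for token in ["streamed", " ", "token", " ", "by", " ", "token"]:
        await msg.stream_token(token)  # appears in the UI as soon as it is sent
    await msg.send()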


willydouhard commented on June 14, 2024

Thank you for the link @hooman-bayer, I pretty much agree with you. We also want to see where the community wants Chainlit to go: staying a rapid prototyping tool, or deploying to production and scaling.

As for streaming final responses in LangChain, @segevtomer, I found this interesting issue: hwchase17/langchain#2483. I'll dig more into it!


willydouhard commented on June 14, 2024

@segevtomer For clarity, all the intermediary steps are already streamed, including the last one, which is the final response. The final response is then sent as a standalone message (not an intermediary step) in the UI without any overhead, so the user can see it.

What I am saying here is that the only improvement we can make is to stream the last tokens after your stop token (usually Final Answer:) directly, without waiting for the completion to end. This is what hwchase17/langchain#2483 does.
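
The rough shape of that approach with a LangChain callback handler (just a sketch, not the implementation from that issue; send_token stands in for whatever pushes a token to the UI):

from typing import Any
from langchain.callbacks.base import BaseCallbackHandler

class FinalAnswerStreamingHandler(BaseCallbackHandler):
    """Buffer tokens until the stop phrase appears, then stream the rest."""

    def __init__(self, send_token, stop_phrase="Final Answer:"):
        self.send_token = send_token  # hypothetical: pushes one token to the UI
        self.stop_phrase = stop_phrase
        self.buffer = ""
        self.streaming = False

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        if self.streaming:
            self.send_token(token)
            return
        self.buffer += token
        if self.stop_phrase in self.buffer:
            # Stop phrase seen: emit whatever already follows it, then stream live.
            self.streaming = True
            tail = self.buffer.split(self.stop_phrase, 1)[1]
            if tail:
                self.send_token(tail)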

While this would be a win for the user experience, the actual time gain will be very limited, since it only impacts a few tokens at the very end of the whole process.


hooman-bayer commented on June 14, 2024

@willydouhard happy to contribute at some point in the near future in case it becomes part of your roadmap. I have been building an app that fully extends langchain to async, including tools (using their class signature that offers arun). But you are 100% right that most libraries only offer sync APIs.
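
For context, an async-capable tool looks roughly like this (a sketch assuming LangChain's BaseTool interface; blocking_search and async_search are placeholders):

from langchain.tools import BaseTool

class SearchTool(BaseTool):
    name = "search"
    description = "Look something up in an external service."

    def _run(self, query: str) -> str:
        # synchronous path, used by tool.run
        return blocking_search(query)

    async def _arun(self, query: str) -> str:
        # non-blocking path, used by tool.arun
        return await async_search(query)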


hooman-bayer commented on June 14, 2024

For streaming, Chainlit already supports streaming with openai, langchain and any python code. See https://docs.chainlit.io/concepts/streaming/python :)

Correct, I saw it, but it's again kind of faking it :) and the client still needs to wait until the response is completed from the OpenAI endpoint, which might not be desired. For instance, openai.ChatCompletion.acreate creates an SSE stream and passes the response to the client token by token as it is generated by ChatGPT, so the latency is way smaller.
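
Roughly (a sketch against the pre-1.0 openai Python SDK; send_token is a placeholder for whatever forwards a token to the client):

import openai

async def stream_chat(messages, send_token):
    response = await openai.ChatCompletion.acreate(
        model="gpt-3.5-turbo",
        messages=messages,
        stream=True,
    )
    async for chunk in response:
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            await send_token(delta)  # forwarded as soon as OpenAI emits it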

Also, in the case of your WebSocket for action agents, this could bring a much better experience for the user.


willydouhard commented on June 14, 2024

Interesting, in my understanding openai.ChatCompletion.create does not wait for the whole response to be generated before it starts streaming tokens. Do you happen to have a link to a resource covering that in more detail?


segevtomer commented on June 14, 2024

To add to the conversation, I tried a few different chain classes and couldn't get streaming to work with any of them (the screen only updated once the response was complete).


willydouhard commented on June 14, 2024

For LangChain, only the intermediary steps are streamed at the moment. If you configured your LLM with streaming=True, you should see the intermediary steps being streamed when you unfold them in the UI (click on the Using... button).

I will take a look at how to also stream the final response!
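
For reference, the streaming=True configuration looks something like this (a sketch using the langchain_factory pattern from this thread; the agent and tools are just examples):

import chainlit as cl
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

@cl.langchain_factory
def factory():
    # streaming=True makes the LLM emit tokens as they are generated,
    # so the intermediary steps update live in the UI.
    llm = ChatOpenAI(temperature=0, streaming=True)
    tools = load_tools(["llm-math"], llm=llm)
    return initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)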


hooman-bayer commented on June 14, 2024

@willydouhard see this issue from the openai Python SDK for more details. In general, if you want to keep the tool as a simple POC for a handful of users, I think it is great as is with sync, but what if we want to scale to 100 users or so? I don't think running all of this on different threads is realistic or modern (the user experience with sync also won't be great); async is probably the way to go.


hooman-bayer commented on June 14, 2024

@willydouhard that is using the AsyncCallbackHandler I mentioned above. With it you get access to on_llm_new_token(self, token: str, **kwargs: Any) -> None, which you could then customize to return the output token by token to the client.
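
Something along these lines (a sketch; send_token is a stand-in for however the server pushes tokens to the client over the WebSocket):

from typing import Any
from langchain.callbacks.base import AsyncCallbackHandler

class TokenForwardingHandler(AsyncCallbackHandler):
    def __init__(self, send_token):
        self.send_token = send_token  # hypothetical async sender, e.g. a websocket send

    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        # Called for every token as the LLM generates it.
        await self.send_token(token)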


segevtomer commented on June 14, 2024

Thanks for the update @willydouhard.
I wouldn't say it's "very limited". I agree that there will still be a delay, because we have to wait for the final prompt in the chain to occur; however, it is still very valuable to stream it. Say the final response is over 1k tokens long: streaming it will still make a significant difference for UX.


Banbury commented on June 14, 2024

I have two problems with this:

  • There is absolutely no indication that the LLM is doing something, not even a wait cursor.
  • If the text generation runs longer than a few seconds, the UI loses connection to the server, and the message is never displayed.

So what I would need is either streaming of the final result, or a configurable timeout before the UI loses connection to the server plus some spinner to indicate that something is happening. Preferably both.


willydouhard commented on June 14, 2024

@Banbury what is your setup? Are you running open source models like gpt4all locally, or are you using the openai API?


Banbury commented on June 14, 2024

I have been trying to run Vicuna locally with langchain. It does work more or less, but only for short texts.


willydouhard commented on June 14, 2024

I have seen issues with local models and we are investigating them. For API-based models everything should work fine. It would be helpful if you could share a code snippet so I can try to reproduce.


Banbury commented on June 14, 2024

This is the code I have been working on. It's simple enough.

import chainlit as cl
from llama_cpp import Llama
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain import PromptTemplate, LLMChain

# Local Vicuna model served through llama-cpp-python, with streaming enabled
llm = LlamaCpp(model_path="Wizard-Vicuna-13B-Uncensored.ggmlv3.q5_0.bin", seed=0, n_ctx=2048, max_tokens=512, temperature=0.1, streaming=True)

template = """
### Instruction:
{message}
### Response:
"""

@cl.langchain_factory
def factory():
    # Chainlit runs the returned chain for each incoming message
    prompt = PromptTemplate(template=template, input_variables=["message"])
    llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)

    return llm_chain

I have been using the same model with llama.cpp and llama-cpp-python without problems.


willydouhard commented on June 14, 2024

Thank you, I am going to prioritize this!


willydouhard commented on June 14, 2024

Here is the proposal to move Chainlit to async by default: #40. Feedback wanted!

