This repo contains end-to-end examples of GenAI/LLM applications and evaluation pipelines set up using continuous-eval.
Check out the continuous-eval repo and documentation for more information.
| Example Name | App Framework | Eval Framework | Description |
|---|---|---|---|
| Simple RAG | Langchain | continuous-eval | Simple QA chatbot over select Paul Graham essays |
| Complex RAG | Langchain | continuous-eval | Complex QA chatbot over select Paul Graham essays |
| Simple Tools | LlamaIndex | continuous-eval | Math question solver using simple tools |
| Context Augmentation Agent | LlamaIndex | continuous-eval | QA over Uber financial dataset using agents |
| Sentiment Classification | LlamaIndex | continuous-eval | Single-label classification of sentence sentiment |
```bash
git clone https://github.com/relari-ai/examples.git && cd examples
poetry install
```
Add LLM API keys in `.env` (reference `.env.example`) for select applications; a sketch for checking that they load follows the list:

- `COHERE_API_KEY` for Cohere Rerankers in RAG examples
- `GOOGLE_API_KEY` for all LLM calls
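As a quick sanity check that the keys are picked up, here is a minimal sketch assuming `python-dotenv` is used to load `.env` (any dotenv loader works the same way):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Pull variables from .env in the repo root into the process environment.
load_dotenv()

# COHERE_API_KEY is only needed for the RAG examples with Cohere Rerankers.
for key in ("GOOGLE_API_KEY", "COHERE_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```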
In each application folder (`examples/[langchain|llamaindex]/APP_NAME/`), here are the files:

- `pipeline.py` defines the application pipeline and the evaluation metrics / tests.
- `app.py` contains the LLM application. Run the application to get the outputs (saved as `results.jsonl`; a sketch for inspecting them follows this list).
- `eval.py` runs the metrics / tests defined by `pipeline.py` (results saved as `metrics_results.json` and `test_results.json`).
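Because `results.jsonl` is plain JSON Lines, the application outputs can be inspected with a few lines of Python (the path is relative to the application folder; the fields vary by app, so the snippet just lists them):

```python
import json
from pathlib import Path

# Each line of results.jsonl is one JSON record of application outputs.
with Path("results.jsonl").open() as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record.keys()))  # see which fields the app produced
```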
Depending on the application, the source data for the application (documents and embeddings in a Chroma vectorstore) and for the evaluation (the golden dataset) are also provided. Note that the evaluation golden dataset always consists of two files; a loading sketch follows the list:

- `dataset.jsonl` contains the inputs (questions) and reference module outputs (ground truths).
- `manifest.yaml` defines the structure of the dataset for the evaluators.
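To peek at a golden dataset, something like the following works (assuming PyYAML for the manifest; the `data` directory name is illustrative and varies per example):

```python
import json
from pathlib import Path

import yaml  # assumes PyYAML is available

data_dir = Path("data")  # illustrative; each example keeps its own copy

# manifest.yaml describes the dataset fields the evaluators expect.
manifest = yaml.safe_load((data_dir / "manifest.yaml").read_text())
print(manifest)

# dataset.jsonl holds one question + ground-truth record per line.
with (data_dir / "dataset.jsonl").open() as f:
    first_record = json.loads(f.readline())
print(sorted(first_record.keys()))
```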
Tweak the metrics and tests in `pipeline.py` to try out different metrics.
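When experimenting, continuous-eval metrics can also be run standalone on a single datum before wiring them into a pipeline. The sketch below follows the pattern in the continuous-eval documentation; the datum values are made up, and the exact import path may differ across library versions:

```python
from continuous_eval.metrics.retrieval import PrecisionRecallF1

# One datum in the shape the golden dataset provides.
datum = {
    "retrieved_context": ["Paul Graham is an essayist and investor."],
    "ground_truth_context": ["Paul Graham wrote the essays."],
}

metric = PrecisionRecallF1()
print(metric(**datum))  # chunk-level precision / recall / F1
```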