
notion-qa's Introduction

Notion Question-Answering

🤖Ask questions to your Notion database in natural language🤖

💪 Built with LangChain

🌲 Environment Setup

In order to set your environment up to run the code here, first install all requirements:

pip install -r requirements.txt

Then set your OpenAI API key (if you don't have one, you can get one from OpenAI's website):

export OPENAI_API_KEY=....

📄 What is in here?

  • Example data from Blendle
  • Python script to query Notion with a question
  • Code to deploy on Streamlit
  • Instructions for ingesting your own dataset

📊 Example Data

This repo uses the Blendle Employee Handbook as an example. It was downloaded on October 18th, so it may have changed slightly since then!

💬 Ask a question

In order to ask a question, run a command like:

python qa.py "is there food in the office?"

You can switch out "is there food in the office?" for any question of your liking!

This exposes a chat interface for interacting with a Notion database. IMO, this is a more natural and convenient interface for getting information.

🚀 Code to deploy on Streamlit

The code to run the Streamlit app is in main.py. Note that when setting up your Streamlit app, you should make sure to add OPENAI_API_KEY as a secret environment variable.
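If it helps, here is a minimal sketch of how the app might read that key once the secret is set; get_openai_api_key is a hypothetical helper for illustration, not code from this repo:

```python
import os

def get_openai_api_key() -> str:
    """Read the OpenAI API key from the environment.

    Secrets added in the Streamlit app settings are typically exposed
    to the app as environment variables (and via st.secrets), so this
    pattern works both locally and when deployed.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it as a Streamlit secret "
            "or export it in your shell."
        )
    return key
```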

🧑 Instructions for ingesting your own dataset

Export your dataset from Notion. You can do this by clicking the three dots in the upper right-hand corner and then clicking Export.

(screenshot: the Export option in the Notion menu)

When exporting, make sure to select the Markdown & CSV format option.

(screenshot: selecting the Markdown & CSV export format)

This will produce a .zip file in your Downloads folder. Move the .zip file into this repository.

Run the following command to unzip the zip file (replace the Export... with your own file name as needed).

unzip Export-d3adfe0f-3131-4bf3-8987-a52017fc1bae.zip -d Notion_DB

Run the following command to ingest the data.

python ingest.py

Boom! Now you're done, and you can ask it questions like:

python qa.py "is there food in the office?"

notion-qa's People

Contributors

hwchase17


notion-qa's Issues

Ingest - Longer than the specified 1500

Is this a problem with ingesting?

❯ python3 ingest.py
Created a chunk of size 1738, which is longer than the specified 1500
Created a chunk of size 1698, which is longer than the specified 1500
Created a chunk of size 1568, which is longer than the specified 1500
Created a chunk of size 2845, which is longer than the specified 1500
Created a chunk of size 2181, which is longer than the specified 1500
[... dozens of similar warnings omitted; the largest chunks reached 97940 and 53894 characters ...]
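These warnings typically come from a separator-based splitter: it only cuts at the separator, so a single unbroken unit (a long paragraph or an embedded table) larger than chunk_size is emitted as an oversized chunk with a warning rather than being cut mid-unit. A minimal sketch of that behavior (split_on_separator is a hypothetical stand-in, not LangChain's actual splitter):

```python
def split_on_separator(text: str, chunk_size: int = 1500, sep: str = "\n\n"):
    """Greedily pack separator-delimited parts into chunks of at most
    chunk_size characters. A single part longer than chunk_size becomes
    an oversized chunk: the splitter will not cut inside it."""
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks

# A 3000-character run with no separator inside produces a 3000-char chunk.
doc = "short paragraph\n\n" + "x" * 3000 + "\n\nanother short one"
sizes = [len(c) for c in split_on_separator(doc)]
```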

NotImplementedError: cannot instantiate 'PosixPath' on your system

Traceback (most recent call last):
  File "notion-qa-hwchase17\main.py", line 13, in <module>
    store = pickle.load(f)
  File "..\AppData\Local\Continuum\anaconda3\envs\langchain\lib\pathlib.py", line 1084, in __new__
    raise NotImplementedError("cannot instantiate %r on your system"
NotImplementedError: cannot instantiate 'PosixPath' on your system

Understanding embeddings

Can we add general info to the readme about how embeddings work here? Or if you can point me to something I can read up on, I'm happy to update it and PR.
Questions coming to mind:

  • Do we hit the OpenAI embeddings API with our data, and that's why it incurs a usage charge at OpenAI?
  • The files that are output as a result (docs.index and faiss_store.pkl): what does each file represent? Do we know if OpenAI keeps a copy of all embeddings made? Or are those effectively "things you own locally"?
  • Is there a way to use multiple vector embeddings in a QA prompt? And the system would decide where to look?

Thanks for any insight.
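For intuition, retrieval over embeddings can be sketched like this: ingest sends your text to the embeddings endpoint (that is the usage charge), stores the resulting vectors locally, and at query time returns the documents whose vectors are most similar to the question's vector. A toy sketch with made-up 3-dimensional vectors (real embeddings have on the order of 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings": nearby meanings get nearby vectors.
query = [1.0, 0.0, 1.0]
doc_about_food = [0.9, 0.1, 0.8]
doc_about_hr = [0.0, 1.0, 0.1]

# The vector store returns the documents most similar to the query vector.
best = max([doc_about_food, doc_about_hr],
           key=lambda d: cosine_similarity(query, d))
```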

How to split document in a smarter way

Hello,

I understand that we need to split the document into smaller pieces because OpenAI cannot take the whole text as input.

However, my challenge is to cut the text in a smart way so that it does not break in the middle of a sentence.

Any luck with that?
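One approach is to detect sentence boundaries first and then pack whole sentences into chunks. A rough sketch, using a naive regex for sentence boundaries (libraries such as nltk or spaCy detect them more robustly); split_sentences_into_chunks is a hypothetical helper, not from this repo:

```python
import re

def split_sentences_into_chunks(text: str, chunk_size: int = 1500):
    """Split text into chunks of at most chunk_size characters without
    cutting inside a sentence. Sentence boundaries are approximated by
    punctuation followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

LangChain also ships splitters (e.g. RecursiveCharacterTextSplitter) that fall back through a list of separators for a similar effect.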

Removal of sources

I am brand new to FAISS, but is there a way we could add an option to remove items that were previously ingested?

faiss.loader: Could not load library with AVX2 support due to

Hi All,

I have encountered an issue when running the code and was wondering if someone already has an answer.

Using Visual Studio Code with Python.

Any help will be greatly appreciated.

  • I have pip-installed faiss-cpu

2023-04-07 19:01:39.162 INFO faiss.loader: Loading faiss with AVX2 support.
2023-04-07 19:01:39.162 INFO faiss.loader: Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2023-04-07 19:01:39.163 INFO faiss.loader: Loading faiss.
2023-04-07 19:01:39.179 INFO faiss.loader: Successfully loaded faiss.
Traceback (most recent call last):
File "e:\notion-qa\main.py", line 15, in <module>
store = pickle.load(f)
^^^^^^^^^^^^^^
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.752.0_x64__qbz5n2kfra8p0\Lib\pathlib.py", line 873, in __new__
raise NotImplementedError("cannot instantiate %r on your system"
NotImplementedError: cannot instantiate 'PosixPath' on your system

Error when deploying to Streamlit Cloud

I was trying to deploy the app to Streamlit Cloud.
It works fine on localhost, but raised an error when trying to call the LangChain vector database with sources:

Traceback (most recent call last):
  File "/home/appuser/venv/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/app/chatrtc/main.py", line 50, in <module>
    result = chain({"question": user_input})
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 146, in __call__
    raise e
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 142, in __call__
    outputs = self._call(inputs)
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/qa_with_sources/base.py", line 96, in _call
    docs = self._get_docs(inputs)
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/qa_with_sources/vector_db.py", line 20, in _get_docs
    return self.vectorstore.similarity_search(question, k=self.k)
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/vectorstores/faiss.py", line 91, in similarity_search
    _, indices = self.index.search(np.array([embedding], dtype=np.float32), k)
TypeError: search() missing 3 required positional arguments: 'k', 'distances', and 'labels'

I believe this is some kind of problem with the cloud server trying to load the pickle file on Linux, which is different from my local Windows machine. Anyone know how to solve this?

Error when trying to deploy app on Streamlit (pydantic.error_wrappers.ValidationError)

Getting the following error when trying to deploy the app on Streamlit:

pydantic.error_wrappers.ValidationError

Traceback:
File "/home/appuser/venv/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/app/notion-qa/main.py", line 16, in <module>
chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=store)
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__

Anyone know how to resolve this?

Rate Limit problem while running python ingest.py

Hi!

When trying to ingest my own Notion page I am facing a rate-limit problem. I know that I can deal with it outside the system, but is there a way to limit the rate inside ingest.py in order to solve it internally?
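One way to throttle inside ingest.py is to space out the embedding calls. A sketch, assuming a limit of 60 requests per minute; throttled is a hypothetical helper, not existing code in this repo:

```python
import time

def throttled(items, min_interval: float = 1.1):
    """Yield items no faster than one per min_interval seconds.

    With min_interval just over 1s, at most ~54 items pass per minute,
    which stays under a 60-requests/minute limit with some headroom.
    """
    last = 0.0
    for item in items:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item
```

You would wrap whatever loop issues the embedding requests, e.g. `for batch in throttled(batches): ...`.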

Other language support

Is it possible to interact with the chatbot in a language other than English, with a Notion DB of posts in that language?
I tried Chinese and it won't find any answers.

Unable to clone complete repository

Cloning into 'notion-qa'...
remote: Enumerating objects: 160, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 160 (delta 11), reused 14 (delta 11), pack-reused 134
Receiving objects: 100% (160/160), 48.35 MiB | 1.18 MiB/s, done.
Resolving deltas: 100% (13/13), done.
error: unable to create file Notion_DB/Blendle's Employee Handbook a834d55573614857a48a9ce9ec4194e3/Blendle Social Code 1178dd7f18cf49cea04bf9efcb2d84b2/General disciplinary measures f8396878804a4eab9277ba5d2cd4cd08.md: Filename too long
fatal: cannot create directory at 'Notion_DB/Blendle's Employee Handbook a834d55573614857a48a9ce9ec4194e3/Diversity and Inclusion b8d3907631944696a4e76c2a41e757a5/#letstalkaboutstress 26c5eee20f854bb9ac6921f35c9a50b3': Filename too long
warning: Clone succeeded, but checkout failed.

The result returned by the query of Chinese data is English

I ran ingest.py on my Chinese data and the vector database was updated.
When asking questions, I also use Chinese, but the results returned by the chain are in English, although the answers are correct.

error when ingesting

Getting this error when ingesting:

[''] is not valid under any of the given schemas
Traceback (most recent call last):
  File "/Users/xingfanxia/projects/notion-qa/ingest.py", line 32, in <module>
    store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 168, in from_texts
    embeddings = embedding.embed_documents(texts)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/embeddings/openai.py", line 87, in embed_documents
    responses = [
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/embeddings/openai.py", line 88, in <listcomp>
    self._embedding_func(text, engine=self.document_model_name)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/embeddings/openai.py", line 76, in _embedding_func
    return self.client.create(input=[text], engine=engine)["data"][0]["embedding"]
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: [''] is not valid under any of the given schemas - 'input'
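The [''] in the error suggests an empty document made it into the batch sent to the embeddings endpoint, which rejects empty strings. A hedged workaround is to filter out empty texts (keeping metadata aligned) before calling FAISS.from_texts; drop_empty_docs is a hypothetical helper, not code from this repo:

```python
def drop_empty_docs(docs, metadatas):
    """Remove empty or whitespace-only texts before embedding, keeping
    each text paired with its metadata entry."""
    pairs = [(d, m) for d, m in zip(docs, metadatas) if d and d.strip()]
    if not pairs:
        return [], []
    kept_docs, kept_meta = zip(*pairs)
    return list(kept_docs), list(kept_meta)
```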

Add tiktoken to requirements.txt

On a fresh container with minimal prereqs, first call results in:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/langchain/llms/openai.py", line 233, in get_num_tokens
import tiktoken
ModuleNotFoundError: No module named 'tiktoken'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jovyan/work/notion-qa/qa.py", line 20, in <module>
result = chain({"question": args.question})
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/base.py", line 146, in __call__
raise e
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/base.py", line 142, in __call__
outputs = self._call(inputs)
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/qa_with_sources/base.py", line 97, in _call
answer, _ = self.combine_document_chain.combine_docs(docs, **inputs)
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/combine_documents/map_reduce.py", line 150, in combine_docs
num_tokens = length_func(result_docs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 79, in prompt_length
return self.llm_chain.llm.get_num_tokens(prompt)
File "/opt/conda/lib/python3.10/site-packages/langchain/llms/openai.py", line 235, in get_num_tokens
raise ValueError(
ValueError: Could not import tiktoken python package. This is needed in order to calculate get_num_tokens. Please it install it with `pip install tiktoken`

Installing it fixes the problem. It should probably be part of requirements.txt.

Iterating over LLM models does not work in LangChain

Can LLMChain objects be stored and iterated over?

llms = [{'name': 'OpenAI', 'model': OpenAI(temperature=0)},
        {'name': 'Flan', 'model':  HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 1e-10})}]

for llm_dict in llms:
    llm_name = llm_dict['name']
    llm_model = llm_dict['model']
    chain = LLMChain(llm=llm_model, prompt=prompt)

The first LLM model runs well, but the second iteration gives the following error:

    chain = LLMChain(llm=llm_model, prompt=prompt)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LLMChain
prompt
  value is not a valid dict (type=type_error.dict)

Am I missing something in the dictionary declarations?

More details at https://stackoverflow.com/questions/76110329/iterating-over-llm-models-does-not-work-in-langchain

Ingesting your own dataset

Hi, when ingesting my own dataset, I understand that I must remove/replace Export*.zip and Notion_DB.
Should I also remove faiss_store.pkl? (I've never used faiss)
Are there other things that need to be removed/reset first?
Thanks!

Understanding the question/answering process and its costs

Can someone explain to me what the process is behind the scenes when calling the OpenAI API?

I understand how embedding works (#1). But how much text from the embeddings is included in the following requests? And why are there, for example, 2 requests for one question, or even 5 when using ChatOpenAI?

Example:

I tried a simple question (in Czech, because my embeddings are in Czech): "How old must the camp leader be at least?". The chain made two API calls with 5565 tokens in total. And the response was "The minimum age for the camp leader is 18 according to Junák – český skaut." It's not very cost-effective when using text-davinci. For one simple question I pay around 0.11 USD.

(screenshots: request log and cost breakdown)

I simply tried replacing OpenAI() with ChatOpenAI(), which uses gpt-3.5-turbo-0301. The chain made 5 requests (4,643 prompt + 278 completion = 4,921 tokens). The price is 10x lower and fewer tokens are used.

chain = VectorDBQAWithSourcesChain.from_llm(llm=ChatOpenAI(temperature=0), vectorstore=store)

Is it possible to control how much embedded text is included in the request?

Thanks for any information.

openai.error.RateLimitError: Rate limit reached for default-global-with-image-limits in organization org-8Ch0kYQaMzstRX0szWKe1GLd on requests per min. Limit: 60 / min. Please try again in 1s. Contact [email protected] if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method.

What should I do? How can my program avoid these rate-limit exceptions, for example by sleeping 1s between requests?

Do you have a better suggestion?
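A common pattern is exponential backoff with jitter around the failing call, rather than a fixed 1s sleep. with_backoff below is a hypothetical helper sketch, not code from this repo; with the real client you would pass the specific rate-limit exception class as retryable:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retryable=(Exception,)):
    """Call fn(), retrying on retryable errors with exponential backoff
    plus a little random jitter. Re-raises after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```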
