
notion-qa's Introduction

Notion Question-Answering

🤖Ask questions to your Notion database in natural language🤖

💪 Built with LangChain

🌲 Environment Setup

In order to set your environment up to run the code here, first install all requirements:

pip install -r requirements.txt

Then set your OpenAI API key (if you don't have one, you can get one from OpenAI's website):

export OPENAI_API_KEY=....

📄 What is in here?

  • Example data from Blendle
  • Python script to query Notion with a question
  • Code to deploy on Streamlit
  • Instructions for ingesting your own dataset

📊 Example Data

This repo uses the Blendle Employee Handbook as an example. It was downloaded on October 18th, so it may have changed slightly since then!

💬 Ask a question

In order to ask a question, run a command like:

python qa.py "is there food in the office?"

You can switch out "is there food in the office?" for any question of your liking!

This exposes a chat interface for interacting with a Notion database. IMO, this is a more natural and convenient interface for getting information.

🚀 Code to deploy on Streamlit

The code to run the Streamlit app is in main.py. Note that when setting up your Streamlit app, you should make sure to add OPENAI_API_KEY as a secret environment variable.
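If it helps, here is a minimal sketch of how the app might read that key once the secret is set; get_openai_api_key is a hypothetical helper for illustration, not code from this repo:

```python
import os

def get_openai_api_key() -> str:
    """Read the OpenAI API key from the environment.

    Secrets added in the Streamlit app settings are typically exposed
    to the app as environment variables (and via st.secrets), so this
    pattern works both locally and when deployed.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it as a Streamlit secret "
            "or export it in your shell."
        )
    return key
```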

🧑 Instructions for ingesting your own dataset

Export your dataset from Notion. You can do this by clicking the three dots in the upper right-hand corner and then clicking Export.

(screenshot: the Export option in the Notion menu)

When exporting, make sure to select the Markdown & CSV format option.

(screenshot: selecting the Markdown & CSV export format)

This will produce a .zip file in your Downloads folder. Move the .zip file into this repository.

Run the following command to unzip the zip file (replace the Export... with your own file name as needed).

unzip Export-d3adfe0f-3131-4bf3-8987-a52017fc1bae.zip -d Notion_DB

Run the following command to ingest the data.

python ingest.py

Boom! Now you're done, and you can ask it questions like:

python qa.py "is there food in the office?"

notion-qa's People

Contributors

hwchase17


notion-qa's Issues

Ingest - Longer than the specified 1500

Is this a problem with ingesting?

❯ python3 ingest.py
Created a chunk of size 1738, which is longer than the specified 1500
Created a chunk of size 1698, which is longer than the specified 1500
Created a chunk of size 1568, which is longer than the specified 1500
Created a chunk of size 2845, which is longer than the specified 1500
Created a chunk of size 2181, which is longer than the specified 1500
[... dozens of similar warnings omitted; the largest chunks reached 97940 and 53894 characters ...]
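These warnings typically come from a separator-based splitter: it only cuts at the separator, so a single unbroken unit (a long paragraph or an embedded table) larger than chunk_size is emitted as an oversized chunk with a warning rather than being cut mid-unit. A minimal sketch of that behavior (split_on_separator is a hypothetical stand-in, not LangChain's actual splitter):

```python
def split_on_separator(text: str, chunk_size: int = 1500, sep: str = "\n\n"):
    """Greedily pack separator-delimited parts into chunks of at most
    chunk_size characters. A single part longer than chunk_size becomes
    an oversized chunk: the splitter will not cut inside it."""
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part  # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks

# A 3000-character run with no separator inside produces a 3000-char chunk.
doc = "short paragraph\n\n" + "x" * 3000 + "\n\nanother short one"
sizes = [len(c) for c in split_on_separator(doc)]
```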

NotImplementedError: cannot instantiate 'PosixPath' on your system

Traceback (most recent call last):
  File "notion-qa-hwchase17\main.py", line 13, in <module>
    store = pickle.load(f)
  File "..\AppData\Local\Continuum\anaconda3\envs\langchain\lib\pathlib.py", line 1084, in __new__
    raise NotImplementedError("cannot instantiate %r on your system"
NotImplementedError: cannot instantiate 'PosixPath' on your system

Understanding embeddings

Can we add general info to the readme about how embeddings work here? Or if you can point me to something I can read up on, I'm happy to update it and PR.
Questions coming to mind:

  • Do we hit the OpenAI embeddings API with our data, and that's why it incurs a usage charge at OpenAI?
  • The files that are output as a result (docs.index and faiss_store.pkl): what does each file represent? Do we know if OpenAI keeps a copy of all embeddings made? Or are those effectively "things you own locally"?
  • Is there a way to use multiple vector embeddings in a QA prompt? And the system would decide where to look?

Thanks for any insight.
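For intuition, retrieval over embeddings can be sketched like this: ingest sends your text to the embeddings endpoint (that is the usage charge), stores the resulting vectors locally, and at query time returns the documents whose vectors are most similar to the question's vector. A toy sketch with made-up 3-dimensional vectors (real embeddings have on the order of 1536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings": nearby meanings get nearby vectors.
query = [1.0, 0.0, 1.0]
doc_about_food = [0.9, 0.1, 0.8]
doc_about_hr = [0.0, 1.0, 0.1]

# The vector store returns the documents most similar to the query vector.
best = max([doc_about_food, doc_about_hr],
           key=lambda d: cosine_similarity(query, d))
```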

How to split document in a smarter way

Hello,

I understand that we need to split the document into smaller pieces because OpenAI cannot take the whole text as input.

However, my challenge is to cut the text in a smart way so that it does not break in the middle of a sentence.

Any luck with that?
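One approach is to detect sentence boundaries first and then pack whole sentences into chunks. A rough sketch, using a naive regex for sentence boundaries (libraries such as nltk or spaCy detect them more robustly); split_sentences_into_chunks is a hypothetical helper, not from this repo:

```python
import re

def split_sentences_into_chunks(text: str, chunk_size: int = 1500):
    """Split text into chunks of at most chunk_size characters without
    cutting inside a sentence. Sentence boundaries are approximated by
    punctuation followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

LangChain also ships splitters (e.g. RecursiveCharacterTextSplitter) that fall back through a list of separators for a similar effect.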

Removal of sources

I am brand new to FAISS, but is there a way we could add an option to remove items that were previously ingested?

faiss.loader: Could not load library with AVX2 support due to

Hi All,

I have encountered an issue when running the code and was wondering if someone already has an answer.

Using Visual Studio Code with Python.

Any help will be greatly appreciated.

  • I have pip-installed faiss-cpu

2023-04-07 19:01:39.162 INFO faiss.loader: Loading faiss with AVX2 support.
2023-04-07 19:01:39.162 INFO faiss.loader: Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2023-04-07 19:01:39.163 INFO faiss.loader: Loading faiss.
2023-04-07 19:01:39.179 INFO faiss.loader: Successfully loaded faiss.
Traceback (most recent call last):
File "e:\notion-qa\main.py", line 15, in <module>
store = pickle.load(f)
^^^^^^^^^^^^^^
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.752.0_x64__qbz5n2kfra8p0\Lib\pathlib.py", line 873, in __new__
raise NotImplementedError("cannot instantiate %r on your system"
NotImplementedError: cannot instantiate 'PosixPath' on your system

Error when deploying to Streamlit Cloud

I was trying to deploy the app to Streamlit Cloud.
It works fine on localhost, but raised an error when trying to call the LangChain vector database with sources:

Traceback (most recent call last):
  File "/home/appuser/venv/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "/app/chatrtc/main.py", line 50, in <module>
    result = chain({"question": user_input})
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 146, in __call__
    raise e
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/base.py", line 142, in __call__
    outputs = self._call(inputs)
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/qa_with_sources/base.py", line 96, in _call
    docs = self._get_docs(inputs)
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/chains/qa_with_sources/vector_db.py", line 20, in _get_docs
    return self.vectorstore.similarity_search(question, k=self.k)
  File "/home/appuser/venv/lib/python3.9/site-packages/langchain/vectorstores/faiss.py", line 91, in similarity_search
    _, indices = self.index.search(np.array([embedding], dtype=np.float32), k)
TypeError: search() missing 3 required positional arguments: 'k', 'distances', and 'labels'

I believe this is some kind of problem with the cloud server trying to load the pickle file on Linux, which is different from my local Windows machine. Anyone know how to solve this?

Error when trying to deploy app on Streamlit (pydantic.error_wrappers.ValidationError)

Getting the following error when trying to deploy the app on Streamlit:

pydantic.error_wrappers.ValidationError

Traceback:
File "/home/appuser/venv/lib/python3.9/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 552, in _run_script
exec(code, module.__dict__)
File "/app/notion-qa/main.py", line 16, in <module>
chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=store)
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__

Anyone know how to resolve this?

Rate Limit problem while running python ingest.py

Hi!

When trying to ingest my own Notion page I am facing a rate-limit problem. I know that I can deal with it outside the system, but is there a way to limit the rate inside ingest.py in order to solve it internally?
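One way to throttle inside ingest.py is to space out the embedding calls. A sketch, assuming a limit of 60 requests per minute; throttled is a hypothetical helper, not existing code in this repo:

```python
import time

def throttled(items, min_interval: float = 1.1):
    """Yield items no faster than one per min_interval seconds.

    With min_interval just over 1s, at most ~54 items pass per minute,
    which stays under a 60-requests/minute limit with some headroom.
    """
    last = 0.0
    for item in items:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item
```

You would wrap whatever loop issues the embedding requests, e.g. `for batch in throttled(batches): ...`.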

Other language support

Is it possible to interact with the chatbot in a language other than English, with a Notion DB of posts in that language?
I tried Chinese and it won't find any answers.

Unable to clone complete repository

Cloning into 'notion-qa'...
remote: Enumerating objects: 160, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 160 (delta 11), reused 14 (delta 11), pack-reused 134
Receiving objects: 100% (160/160), 48.35 MiB | 1.18 MiB/s, done.
Resolving deltas: 100% (13/13), done.
error: unable to create file Notion_DB/Blendle's Employee Handbook a834d55573614857a48a9ce9ec4194e3/Blendle Social Code 1178dd7f18cf49cea04bf9efcb2d84b2/General disciplinary measures f8396878804a4eab9277ba5d2cd4cd08.md: Filename too long
fatal: cannot create directory at 'Notion_DB/Blendle's Employee Handbook a834d55573614857a48a9ce9ec4194e3/Diversity and Inclusion b8d3907631944696a4e76c2a41e757a5/#letstalkaboutstress 26c5eee20f854bb9ac6921f35c9a50b3': Filename too long
warning: Clone succeeded, but checkout failed.

The result returned by the query of Chinese data is English

I ran ingest.py on my Chinese data and the vector database was updated.
When asking questions, I also use Chinese, but the results returned by the chain are in English, although the answers are correct.

error when ingesting

Getting this error when ingesting:

[''] is not valid under any of the given schemas
Traceback (most recent call last):
  File "/Users/xingfanxia/projects/notion-qa/ingest.py", line 32, in <module>
    store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/vectorstores/faiss.py", line 168, in from_texts
    embeddings = embedding.embed_documents(texts)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/embeddings/openai.py", line 87, in embed_documents
    responses = [
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/embeddings/openai.py", line 88, in <listcomp>
    self._embedding_func(text, engine=self.document_model_name)
  File "/opt/homebrew/lib/python3.10/site-packages/langchain/embeddings/openai.py", line 76, in _embedding_func
    return self.client.create(input=[text], engine=engine)["data"][0]["embedding"]
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_requestor.py", line 226, in request
    resp, got_stream = self._interpret_response(result, stream)
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_requestor.py", line 619, in _interpret_response
    self._interpret_response_line(
  File "/opt/homebrew/lib/python3.10/site-packages/openai/api_requestor.py", line 682, in _interpret_response_line
    raise self.handle_error_response(
openai.error.InvalidRequestError: [''] is not valid under any of the given schemas - 'input'
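The [''] in the error suggests an empty document made it into the batch sent to the embeddings endpoint, which rejects empty strings. A hedged workaround is to filter out empty texts (keeping metadata aligned) before calling FAISS.from_texts; drop_empty_docs is a hypothetical helper, not code from this repo:

```python
def drop_empty_docs(docs, metadatas):
    """Remove empty or whitespace-only texts before embedding, keeping
    each text paired with its metadata entry."""
    pairs = [(d, m) for d, m in zip(docs, metadatas) if d and d.strip()]
    if not pairs:
        return [], []
    kept_docs, kept_meta = zip(*pairs)
    return list(kept_docs), list(kept_meta)
```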

Add tiktoken to requirements.txt

On a fresh container with minimal prereqs, first call results in:

Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/langchain/llms/openai.py", line 233, in get_num_tokens
import tiktoken
ModuleNotFoundError: No module named 'tiktoken'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/jovyan/work/notion-qa/qa.py", line 20, in <module>
result = chain({"question": args.question})
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/base.py", line 146, in __call__
raise e
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/base.py", line 142, in __call__
outputs = self._call(inputs)
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/qa_with_sources/base.py", line 97, in _call
answer, _ = self.combine_document_chain.combine_docs(docs, **inputs)
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/combine_documents/map_reduce.py", line 150, in combine_docs
num_tokens = length_func(result_docs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 79, in prompt_length
return self.llm_chain.llm.get_num_tokens(prompt)
File "/opt/conda/lib/python3.10/site-packages/langchain/llms/openai.py", line 235, in get_num_tokens
raise ValueError(
ValueError: Could not import tiktoken python package. This is needed in order to calculate get_num_tokens. Please it install it with `pip install tiktoken`

Installing it fixes the problem. It should probably be part of requirements.txt.

Iterating over LLM models does not work in LangChain

Can LLMChain objects be stored and iterated over?

llms = [{'name': 'OpenAI', 'model': OpenAI(temperature=0)},
        {'name': 'Flan', 'model':  HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 1e-10})}]

for llm_dict in llms:
    llm_name = llm_dict['name']
    llm_model = llm_dict['model']
    chain = LLMChain(llm=llm_model, prompt=prompt)

The first LLM model runs well, but the second iteration gives the following error:

    chain = LLMChain(llm=llm_model, prompt=prompt)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pydantic\main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for LLMChain
prompt
  value is not a valid dict (type=type_error.dict)

Am I missing something in the dictionary declarations?

More details at https://stackoverflow.com/questions/76110329/iterating-over-llm-models-does-not-work-in-langchain

Ingesting your own dataset

Hi, when ingesting my own dataset, I understand that I must remove/replace Export*.zip and Notion_DB.
Should I also remove faiss_store.pkl? (I've never used faiss)
Are there other things that need to be removed/reset first?
Thanks!

Understanding the question/answering process and its costs

Can someone explain to me what the process is behind the scenes when calling the OpenAI API?

I understand how embedding works (#1). But how much text from the embeddings is included in the following requests? And why are there, for example, 2 requests for one question, or even 5 when using ChatOpenAI?

Example:

I tried a simple question (in Czech, because my embeddings are in Czech): "How old must the camp leader be at least?". The chain made two API calls with 5565 tokens in total. And the response was "The minimum age for the camp leader is 18 according to Junák – český skaut." It's not very cost-effective when using text-davinci. For one simple question I pay around 0.11 USD.

(screenshots: request log and cost breakdown)

I simply tried replacing OpenAI() with ChatOpenAI(), which uses gpt-3.5-turbo-0301. The chain made 5 requests (4,643 prompt + 278 completion = 4,921 tokens). The price is 10x lower and fewer tokens are used.

chain = VectorDBQAWithSourcesChain.from_llm(llm=ChatOpenAI(temperature=0), vectorstore=store)

Is it possible to control how much embedded text is included in the request?

Thanks for any information.

openai.error.RateLimitError: Rate limit reached for default-global-with-image-limits in organization org-8Ch0kYQaMzstRX0szWKe1GLd on requests per min. Limit: 60 / min. Please try again in 1s. Contact [email protected] if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method.

What should I do? How can my program avoid these rate-limit exceptions, for example by sleeping 1s between requests?

Do you have a better suggestion?
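A common pattern is exponential backoff with jitter around the failing call, rather than a fixed 1s sleep. with_backoff below is a hypothetical helper sketch, not code from this repo; with the real client you would pass the specific rate-limit exception class as retryable:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retryable=(Exception,)):
    """Call fn(), retrying on retryable errors with exponential backoff
    plus a little random jitter. Re-raises after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```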
