
codeqai's Introduction

  • 💻 I am a Software Engineer from Hamburg.
  • 👨‍💻 The mouse is lava, so I code with vim.

codeqai's People

Contributors

bhargavnova, dependabot[bot], fynnfluegge, nisarg1112, shreyahegde18, yenif


codeqai's Issues

Indexing Error with codeqai on Conda Environment: Continuous Indexing Without Completion

While using the codeqai tool within a conda environment, I encountered an issue during the indexing process where it continuously attempts to index without completion. This problem occurred when I tried to utilize codeqai's search functionality in my project directory. Specifically, the error IndexError: list index out of range was thrown, indicating an issue with handling the document vector indexing. Below are the detailed steps to reproduce, along with the specific environment setup.

Steps to Reproduce:

  1. Installed codeqai using pip within a conda environment.
  2. Ran codeqai configure and configured the tool with the following settings:
    • Selected "y" for using local embedding models.
    • Chose "Instructor-Large" for the local embedding model.
    • Selected "N" for using local chat models and chose "OpenAI" with "gpt-4" as the remote LLM.
  3. Attempted to start the codeqai search by navigating to my project directory (2-006), which contains .m, .mat, and .txt files, and running codeqai search in the terminal.
  4. Received a message indicating no vector store was found for 2-006 and that initial indexing may take a few minutes. Shortly after, the indexing process started but then failed with an IndexError: list index out of range.

Expected Behavior:

The indexing process should be completed, allowing for subsequent searches within the codebase using codeqai.

Actual Behavior:

The application failed to complete the indexing process due to an IndexError in the vector indexing step, specifically indicating a problem with handling the document vectors.

Environment:

  • codeqai version: 0.0.14
  • langchain-community version: 0.0.17
  • sentence-transformers version: 2.3.1
  • Python version: 3.11
  • Conda version: 4.12.0
  • Operating System: Windows (with Conda environment)

Full Terminal Output and Error

{GenericDirectory>}conda activate condaqai-env

(condaqai-env) {GenericDirectory>}codeqai search
Not a git repository. Exiting.

(condaqai-env) {GenericDirectory>}ls
'ls' is not recognized as an internal or external command,
operable program or batch file.

(condaqai-env) {GenericDirectory>}cd 2-006

(condaqai-env) {GenericDirectory}\2-006>codeqai search
No vector store found for 2-006. Initial indexing may take a few minutes.
⠋ 💾 Indexing vector store...Traceback (most recent call last):
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\Scripts\codeqai.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\__main__.py", line 5, in main
    app.run()
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\app.py", line 146, in run
    vector_store.index_documents(documents)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\codeqai\vector_store.py", line 34, in index_documents
    self.db = FAISS.from_documents(documents, self.embeddings)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_core\vectorstores.py", line 508, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_community\vectorstores\faiss.py", line 960, in from_texts
    return cls.__from(
  File "C:\Users\Edge\anaconda3\envs\condaqai-env\lib\site-packages\langchain_community\vectorstores\faiss.py", line 919, in __from
    index = faiss.IndexFlatL2(len(embeddings[0]))
IndexError: list index out of range
⠴ 💾 Indexing vector store...

Additional Context:

This issue seems to stem from the vector indexing process within the langchain-community package, possibly due to an empty or malformed document set being processed for vectorization. Given the configuration steps and the use of a conda environment, there might be specific dependencies or configurations that contribute to this problem.
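
For reference, a minimal sketch of a guard that could surface the real cause earlier; it mirrors the index_documents call in codeqai's vector_store.py but is written as a standalone function, and the error message wording is mine, not codeqai's:

from langchain_community.vectorstores import FAISS

def index_documents(documents, embeddings):
    # Fail with an explanatory error instead of the downstream IndexError from faiss.IndexFlatL2
    if not documents:
        raise ValueError(
            "No documents were parsed from the repository; "
            "check that the directory contains supported source files."
        )
    return FAISS.from_documents(documents, embeddings)

With a check like this, a directory containing only unsupported file types (such as .m/.mat files) would likely fail with an explanatory message rather than the IndexError inside faiss.IndexFlatL2(len(embeddings[0])).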

Various issues and fixes on Windows

Hi, I tried to run your project using Windows 11 PowerShell 7.4 and ran into various issues. I was able to debug some of them, so I thought I'd jot down the steps I took:

1) pipx run --spec codeqai codeqai configure

This didn't work for me: the setup launched, but codeqai was unavailable afterwards. (My understanding of pipx run is that it's a temporary, run-once sandbox venv only.)

pipx install codeqai, followed by codeqai configure, worked instead.

2) UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1137: character maps to

Whenever you're using open(), I believe you should add encoding='utf-8'; this solves the issue.
For example: with open(env_path, "w", encoding='utf-8') as env_f: in app.py.

3) Command '['C:\\Users\\<USER>\\.local\\pipx\\venvs\\codeqai\\Scripts\\python.exe', '-m', 'pip', 'install', 'faiss-gpu (Only if your system supports CUDA))']' returned non-zero exit status 1.

Unless I'm mistaken, on line 170 of vector_store.py, you're passing the literal string faiss-gpu (Only if your system supports CUDA) to pip install. You'd want just faiss-gpu instead. However, I still couldn't install faiss-gpu, as it returned a "no compatible packages" error. faiss-cpu worked fine.
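
For illustration, a hedged sketch of how the install call could pass only the package name to pip while keeping the explanatory text in the menu label; the variable names here are assumptions, not codeqai's actual code:

import subprocess
import sys

choice = "faiss-gpu (Only if your system supports CUDA)"  # label shown in the prompt
package = choice.split()[0]  # strip the parenthetical -> "faiss-gpu"
subprocess.run([sys.executable, "-m", "pip", "install", package], check=True)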

4) When I reran codeqai search/sync/etc., I got IndexError: list index out of range in C:\Users\<USER>\.local\pipx\venvs\codeqai\lib\site-packages\codeqai\vector_store.py, line 34.

This seems to be because documents is empty. Going back to app.py, the files variable holds an array of docs after files = repo.load_files(), but documents is empty after documents = codeparser.parse_code_files(files).

After some debugging, this seems to be because treesitterNodes in codeparser.py (around line 36) is empty. However, programming_language has content (Language.JAVASCRIPT \n Language.JAVASCRIPT), TreesitterMethodNode holds <codeqai.treesitter.treesitter_js.TreesitterJavascript object at ...> (x2), and file_bytes also contains the expected file data.

I'm too unfamiliar with Treesitter to debug any further why treesitter_parser.parse(file_bytes) returns an empty array in this case.
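
As a debugging aid, here is a rough sketch for parsing one of the files with tree-sitter directly, assuming the third-party tree_sitter_languages helper package (not part of codeqai), with example.js as a placeholder file name. As far as I know, tree-sitter's parse() returns a Tree object rather than a list, so the empty list reported above most likely comes from codeqai's method-node extraction rather than from parsing itself:

from tree_sitter_languages import get_parser

parser = get_parser("javascript")
with open("example.js", "rb") as f:  # placeholder: one of the affected source files
    file_bytes = f.read()

tree = parser.parse(file_bytes)
root = tree.root_node
print(root.sexp()[:200])  # a non-empty s-expression means parsing itself works
print([child.type for child in root.children])  # top-level node types, e.g. function declarations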

Hope this can help.

P.S.
Didn't include this in the list as it may be my local env, but for some reason I was unable to run codeqai via pipx on Python 3.10.5. It repeatedly wanted to use pyenv-win 3.9.6, even though that version was nowhere on my system. I had to install 3.9.6 to be able to continue. This may be a local env issue from an old installation, however.

Move `pytest` to dev dependency group

pytest is currently defined in the default dependency group:

[tool.poetry.dependencies]
python = "^3.9"
tiktoken = "^0.4.0"
yaspin = "^3.0.0"
pytest = "^7.4.0"

It should be moved to a separate dev dependency group to exclude it from the build.
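
A sketch of the corresponding pyproject.toml change, assuming Poetry 1.2+ group syntax:

[tool.poetry.dependencies]
python = "^3.9"
tiktoken = "^0.4.0"
yaspin = "^3.0.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4.0"

Poetry excludes group dependencies from the built package metadata, so pytest would no longer be pulled in by pip install codeqai.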

add python-dotenv

The OpenAI and Azure API keys should be isolated in the pipx installation. The user should be prompted to enter the necessary environment variables if they are not available in the isolated environment. These variables should be stored in a .env file.
Use https://github.com/theskumar/python-dotenv
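
A rough sketch of the intended flow, assuming python-dotenv; the .env location and variable names below are placeholders, not codeqai's actual layout:

import os
from pathlib import Path
from dotenv import load_dotenv

env_path = Path.home() / ".config" / "codeqai" / ".env"  # assumed location
load_dotenv(dotenv_path=env_path)

if not os.getenv("OPENAI_API_KEY"):
    key = input("Enter your OpenAI API key: ").strip()
    env_path.parent.mkdir(parents=True, exist_ok=True)
    with open(env_path, "a", encoding="utf-8") as env_f:
        env_f.write(f"OPENAI_API_KEY={key}\n")
    os.environ["OPENAI_API_KEY"] = key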

Assertion error in faiss

Very cool project! Trying to get it to work, I'm facing an issue:

Using local embeddings (INSTRUCTOR_Transformer) and a llamacpp model, any search/chat ends up in the following assertion error:

load INSTRUCTOR_Transformer
max_seq_length  512
🔎 Enter a search pattern: preprocessing
⠹ 🤖 Processing...Traceback (most recent call last):
  File "/Users/dinari/.local/bin/codeqai", line 10, in <module>
    sys.exit(main())
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/codeqai/__main__.py", line 5, in main
    app.run()
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/codeqai/app.py", line 177, in run
    similarity_result = vector_store.similarity_search(search_pattern)
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/codeqai/vector_store.py", line 131, in similarity_search
    return self.db.similarity_search(query, k=4)
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 544, in similarity_search
    docs_and_scores = self.similarity_search_with_score(
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 417, in similarity_search_with_score
    docs = self.similarity_search_with_score_by_vector(
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 302, in similarity_search_with_score_by_vector
    scores, indices = self.index.search(vector, k if filter is None else fetch_k)
  File "/Users/dinari/Library/Application Support/pipx/venvs/codeqai/lib/python3.10/site-packages/faiss/class_wrappers.py", line 329, in replacement_search
    assert d == self.d
AssertionError

Running on Apple Silicon (arm64).
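
The failing assertion compares the query vector dimensionality (d) with the dimensionality the FAISS index was built with (self.d), so a mismatch usually means the cached index was created with a different embedding model than the one answering the query. A small diagnostic sketch, with variable names assumed rather than taken from codeqai:

def check_dimensions(vector_store, embeddings, query="preprocessing"):
    # Compare the cached FAISS index dimension with the current embedding dimension.
    query_vector = embeddings.embed_query(query)
    index_dim = vector_store.db.index.d
    print(f"embedding dimension: {len(query_vector)}, index dimension: {index_dim}")
    return len(query_vector) == index_dim

If the two numbers differ, deleting the cached index and re-indexing with the current embedding model should resolve the assertion (see the faiss index issue further down).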

Wrong line numbers in Semantic search result

The code snippets displayed as the result of the semantic search have wrong line numbers. The line numbers should match those in the corresponding file. Currently, line numbers always start at 1.

[Screenshot 2023-10-08 at 10 12 48: search result showing a snippet whose line numbers start at 1]

This can be fixed by finding the occurrence of the code snippet in the file. The file name is present in the metadata of the vector search result.
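
A minimal sketch of that approach, assuming the search result is a LangChain Document with the snippet in page_content and the file path stored in its metadata (the exact metadata key is an assumption):

def find_start_line(file_path, snippet):
    with open(file_path, "r", encoding="utf-8") as f:
        source = f.read()
    index = source.find(snippet)
    if index == -1:
        return 1  # fall back if the snippet is not found verbatim
    return source.count("\n", 0, index) + 1  # 1-based line number of the snippet's first line

# Usage (hypothetical metadata key): start = find_start_line(doc.metadata["filename"], doc.page_content)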

Error running `codeqai search`, `app`, and `chat`: Unexpected keyword argument `token` in `INSTRUCTOR._load_sbert_model()`

Description

When attempting to run the codeqai app command on my project directory, I encountered a TypeError related to an unexpected keyword argument 'token' in the INSTRUCTOR._load_sbert_model() method. This occurred after configuring codeqai to use local embedding models (Instructor-Large) and selecting gpt-4 as the remote LLM for chat functionalities.

Steps to Reproduce

  1. Installed codeqai using pip.
  2. Ran codeqai configure and configured the tool as follows:
    • Selected "y" for using local embedding models.
    • Chose "Instructor-Large" for the local embeddings model.
    • Selected "N" for using local chat models and chose "OpenAI" with "gpt-4" as the remote LLM.
  3. Attempted to start the codeqai search by running codeqai search in the terminal.
  4. Encountered the following error:
  Traceback (most recent call last):
    File "/usr/local/bin/codeqai", line 8, in <module>
      sys.exit(main())
               ^^^^^^
    File "/usr/local/lib/python3.11/site-packages/codeqai/__main__.py", line 5, in main
      app.run()
    File "/usr/local/lib/python3.11/site-packages/codeqai/app.py", line 121, in run
      embeddings_model = Embeddings(
                         ^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/codeqai/embeddings.py", line 42, in __init__
      self.embeddings = HuggingFaceInstructEmbeddings()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/langchain_community/embeddings/huggingface.py", line 149, in __init__
      self.client = INSTRUCTOR(
                    ^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/sentence_transformers/SentenceTransformer.py", line 194, in __init__
      modules = self._load_sbert_model(
                ^^^^^^^^^^^^^^^^^^^^^^^
  TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'

Expected Behavior

I expected the codeqai search to launch successfully and allow me to interact with my codebase through the bash terminal.

Actual Behavior

The application failed to start due to a TypeError in the INSTRUCTOR._load_sbert_model() method.

Environment

  • codeqai version: 0.0.14
  • langchain-community version: 0.0.17
  • sentence-transformers version: 2.3.1
  • Python version: 3.11
  • Operating System: Linux c9b1c6e240f6 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64 GNU/Linux (Docker Container)

Additional Context

The issue seems to be related to the integration between codeqai, the langchain-community package, and sentence-transformers. Given that all components are up to date, it appears there might be an incompatibility or a bug in the way codeqai is utilizing the sentence-transformers library, specifically with the INSTRUCTOR model configuration.
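
As a hedged diagnostic, printing the installed versions helps confirm the suspected mismatch. My reading of the traceback (not a confirmed root cause) is that sentence-transformers 2.3 changed the _load_sbert_model signature, while the INSTRUCTOR subclass from the InstructorEmbedding package still overrides the old one:

from importlib.metadata import version

print("sentence-transformers:", version("sentence-transformers"))
print("InstructorEmbedding:", version("InstructorEmbedding"))

If sentence-transformers reports 2.3.x, downgrading to a 2.2.x release is a commonly reported workaround until the two packages are compatible again.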


Unknown Issue after entering API Key

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Scripts\codeqai.exe\__main__.py", line 7, in <module>
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\codeqai\__main__.py", line 5, in main
    app.run()
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\codeqai\app.py", line 108, in run
    repo_name = repo.get_git_root(os.getcwd()).split("/")[-1]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\codeqai\repo.py", line 8, in get_git_root
    git_repo = Repo(path, search_parent_directories=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\pierr\AppData\Local\Programs\Python\Python311\Lib\site-packages\git\repo\base.py", line 276, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: C:\Users\pierr
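
The traceback shows codeqai being run from C:\Users\pierr, which is not a git repository. A hedged sketch of handling this the way the search command already does ("Not a git repository. Exiting."); the names mirror codeqai's repo.py but this is not its actual code:

import sys
from git import Repo
from git.exc import InvalidGitRepositoryError

def get_git_root(path):
    try:
        repo = Repo(path, search_parent_directories=True)
    except InvalidGitRepositoryError:
        print("Not a git repository. Exiting.")
        sys.exit(1)
    return repo.git.rev_parse("--show-toplevel")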

Azure Open API Issue

Hi,

My Setup is as follows regarding Embedding and Chat LLM:

[?] Which local embeddings model do you want to use?:
Instructor-Large
[?] Do you want to use local chat models? (y/N): N
[?] Which remote LLM do you want to use?:
Azure-OpenAI

With this setup, simple questions about the codebase give responses like the one below:

I'm sorry, I cannot determine the answer to your question as there is not enough context provided to identify a specific codebase.
Can you provide more information or code snippets?

Is there anything wrong with the above setup?

Throwing Error When Reading .venv

Installed and tried to run on my current VS Code project, but since I have a virtual environment set up, it threw an error. I was able to set up and run with a new version of the same repository without issues. Is there a way I can skip the .venv or other files/folders on ingestion?
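
A possible shape for such an exclusion, sketched here with os.walk and a hypothetical ignore set (codeqai's actual file loading may differ):

import os

EXCLUDED_DIRS = {".venv", "venv", ".git", "node_modules", "__pycache__"}

def iter_source_files(root):
    for dirpath, dirnames, filenames in os.walk(root):
        # prune excluded directories in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in filenames:
            yield os.path.join(dirpath, name)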

Changing embeddings model should delete faiss index

If a new embedding model is configured after the current repo has already been indexed with faiss, the related faiss index should be deleted from the .cache/ folder automatically. Afterwards it can be recreated with the newly configured model.
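
A minimal sketch of the invalidation step, assuming the index is cached per repository as .faiss/.pkl files under the .cache/ folder (the exact layout is an assumption, not verified against codeqai):

import os
from pathlib import Path

def invalidate_index(repo_name, cache_dir=Path(".cache")):
    # Remove the stale index so the next run re-indexes with the new embedding model.
    for suffix in (".faiss", ".pkl"):
        index_file = cache_dir / f"{repo_name}{suffix}"
        if index_file.exists():
            os.remove(index_file)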
