
codequestion

Semantic search for developers



codequestion is a semantic search application for developer questions.


Developers typically keep a web browser window open while they work and run web searches as questions arise. With codequestion, those searches can be run locally: the application executes similarity queries to find questions similar to the input query.

The default model for codequestion is built from the Stack Exchange Dumps on archive.org. Once a model is installed, codequestion runs entirely locally; no network connection is required.

(architecture diagram)

codequestion is built with Python 3.8+ and txtai.

Installation

The easiest way to install is via pip and PyPI:

pip install codequestion

Python 3.8+ is supported. Using a Python virtual environment is recommended.

codequestion can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/codequestion

See this link for environment-specific troubleshooting.

Download a model

Once codequestion is installed, a model needs to be downloaded.

python -m codequestion.download

The model will be stored in ~/.codequestion/

The model can also be installed manually if the machine doesn't have direct internet access. The default model is pulled from the GitHub release page:

unzip cqmodel.zip -d ~/.codequestion

Search

Start up a codequestion shell to get started.

codequestion

A prompt will appear. Queries can be typed into the console. Type help to see all available commands.


Topics

The latest release integrates txtai 5.0, which has support for semantic graphs.

Semantic graphs add support for topic modeling and path traversal. Topics organize questions into groups with similar concepts. Path traversal uses the semantic graph to show how two potentially disparate entries are connected. An example covering both topic and path traversal is shown below.
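
Path traversal can be illustrated with a toy sketch. This is a conceptual illustration only, not txtai's semantic graph implementation; the graph, node ids, and edges below are made up.

```python
from collections import deque

# Hypothetical graph: nodes are question topics, edges connect similar entries
graph = {
    "sqlite-python": ["sqlite-cli", "python-db-api"],
    "python-db-api": ["sqlalchemy-intro"],
    "sqlite-cli": [],
    "sqlalchemy-intro": ["postgres-python"],
    "postgres-python": [],
}

def path(graph, start, end):
    """Breadth-first search for the shortest path between two entries."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        current = queue.popleft()
        node = current[-1]
        if node == end:
            return current
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(current + [neighbor])
    return None

# Show how two potentially disparate entries are connected
print(path(graph, "sqlite-python", "postgres-python"))
```

The returned path lists the intermediate questions linking the two entries, which is the same idea the semantic graph uses at scale.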

(topic and path traversal example)

VS Code

A codequestion prompt can be started within Visual Studio Code. This enables asking coding questions right from your IDE.

Press Ctrl+` to open a new terminal, then type codequestion.

(VS Code demo)

API service

codequestion builds a standard txtai embeddings index. As such, it supports hosting the index via a txtai API service.

First, create app.yml with the path to the index:

app.yml

path: /home/user/.codequestion/models/stackexchange/
embeddings:

Then install the API extra, start the service and test it:

# Install API extra
pip install txtai[api]

# Start API
CONFIG=app.yml uvicorn "txtai.api:app"

# Test API
curl "http://127.0.0.1:8000/search?query=python+query+sqlite&limit=1"

Outputs:

[{
    "id":"616429",
    "text":"How to fetch data from sqlite using python? stackoverflow python sqlite",
    "score":0.8401689529418945
}]

Additional metadata fields can be pulled back with SQL statements.

curl \
    --get \
    --data-urlencode "query=select id, date, tags, question, score from txtai where similar('python query sqlite')" \
    --data-urlencode "limit=1" \
    "http://127.0.0.1:8000/search"

[{
    "id":"616429",
    "date":"2022-05-23T10:45:40.397",
    "tags":"python sqlite",
    "question":"How to fetch data from sqlite using python?",
    "score":0.8401689529418945
}]

Tech overview

The following is an overview covering how this project works.

Process the raw data dumps

The raw 7z XML dumps from Stack Exchange are processed through a series of steps (see building a model). Only highly scored questions with accepted answers are retrieved for storage in the model. Questions and answers are consolidated into a single SQLite file called questions.db. The schema for questions.db is below.

questions.db schema

Id INTEGER PRIMARY KEY
Source TEXT
SourceId INTEGER
Date DATETIME
Tags TEXT
Question TEXT
QuestionUser TEXT
Answer TEXT
AnswerUser TEXT
Reference TEXT
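
As a sketch, the schema above can be created and queried with Python's built-in sqlite3 module. The row values below are made-up examples, not actual questions.db contents.

```python
import sqlite3

# In-memory database using the questions.db schema described above
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE questions (
        Id INTEGER PRIMARY KEY,
        Source TEXT,
        SourceId INTEGER,
        Date DATETIME,
        Tags TEXT,
        Question TEXT,
        QuestionUser TEXT,
        Answer TEXT,
        AnswerUser TEXT,
        Reference TEXT
    )
""")

# Insert a made-up example row
con.execute(
    "INSERT INTO questions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "stackoverflow", 616429, "2022-05-23T10:45:40.397", "python sqlite",
     "How to fetch data from sqlite using python?", "user1",
     "Use the sqlite3 module...", "user2", "https://stackoverflow.com/q/616429"),
)

for row in con.execute("SELECT Id, Tags, Question FROM questions"):
    print(row)
```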

Index

codequestion builds a txtai embeddings index for questions.db. Each question in the questions.db schema is vectorized with a sentence-transformers model. Once questions.db is converted to a collection of sentence embeddings, the embeddings are normalized and stored in Faiss, which enables fast similarity searches.

Query

codequestion tokenizes each query using the same method as during indexing. Those tokens are used to build a sentence embedding. That embedding is queried against the Faiss index to find the most similar questions.
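
Because the stored embeddings are normalized to unit length, the inner product of two vectors equals their cosine similarity. A pure-Python toy illustration of this search step (the vectors and questions below are fabricated; real sentence embeddings have hundreds of dimensions):

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity(a, b):
    """Inner product of two unit vectors equals cosine similarity."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy "index" of question embeddings
index = {
    "How to fetch data from sqlite using python?": [0.9, 0.1, 0.2],
    "How do I center a div in CSS?": [0.1, 0.8, 0.3],
}

# Find the indexed question most similar to the query embedding
query = [0.85, 0.15, 0.25]
best = max(index, key=lambda question: similarity(query, index[question]))
print(best)
```

Faiss performs this same nearest-neighbor comparison, but with approximate search structures that scale to millions of vectors.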

Build a model

The following steps show how to build a codequestion model using Stack Exchange archives.

This is not necessary if you are using the default model from the GitHub release page.

1.) Download files from Stack Exchange: https://archive.org/details/stackexchange

2.) Place the selected files into a directory structure as shown below (the current process requires all of these files).

  • stackexchange/ai/ai.stackexchange.com.7z
  • stackexchange/android/android.stackexchange.com.7z
  • stackexchange/apple/apple.stackexchange.com.7z
  • stackexchange/arduino/arduino.stackexchange.com.7z
  • stackexchange/askubuntu/askubuntu.com.7z
  • stackexchange/avp/avp.stackexchange.com.7z
  • stackexchange/codereview/codereview.stackexchange.com.7z
  • stackexchange/cs/cs.stackexchange.com.7z
  • stackexchange/datascience/datascience.stackexchange.com.7z
  • stackexchange/dba/dba.stackexchange.com.7z
  • stackexchange/devops/devops.stackexchange.com.7z
  • stackexchange/dsp/dsp.stackexchange.com.7z
  • stackexchange/raspberrypi/raspberrypi.stackexchange.com.7z
  • stackexchange/reverseengineering/reverseengineering.stackexchange.com.7z
  • stackexchange/scicomp/scicomp.stackexchange.com.7z
  • stackexchange/security/security.stackexchange.com.7z
  • stackexchange/serverfault/serverfault.com.7z
  • stackexchange/stackoverflow/stackoverflow.com-Posts.7z
  • stackexchange/stats/stats.stackexchange.com.7z
  • stackexchange/superuser/superuser.com.7z
  • stackexchange/unix/unix.stackexchange.com.7z
  • stackexchange/vi/vi.stackexchange.com.7z
  • stackexchange/wordpress/wordpress.stackexchange.com.7z

3.) Run the ETL process

python -m codequestion.etl.stackexchange.execute stackexchange

This will create the file stackexchange/questions.db

4.) OPTIONAL: Build word vectors. This step is only necessary when using a word vectors model; if so, make sure to run pip install txtai[similarity] first.

python -m codequestion.vectors stackexchange/questions.db

This will create the file ~/.codequestion/vectors/stackexchange-300d.magnitude

5.) Build embeddings index

python -m codequestion.index index.yml stackexchange/questions.db

The default index.yml file is found on GitHub. Settings can be changed to customize how the index is built.

After this step, the index is created and all necessary files are ready to query.

Model accuracy

The following sections show test results for codequestion v2 and codequestion v1 using the latest Stack Exchange dumps. Version 2 uses a sentence-transformers model. Version 1 uses a word vectors model with BM25 weighting. BM25 and TF-IDF are shown to establish a baseline score.

StackExchange Query

Models are scored using Mean Reciprocal Rank (MRR).

Model               MRR
all-MiniLM-L6-v2    85.0
SE 300d - BM25      77.1
BM25                67.7
TF-IDF              61.7
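
Mean Reciprocal Rank averages the reciprocal of the rank at which the first correct result appears, across all test queries. A minimal sketch of the computation (the ranks below are made up, not the test data behind the scores above):

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct result per query; None = not found."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Example: correct result ranked 1st, 2nd and 1st across three queries
print(round(mean_reciprocal_rank([1, 2, 1]), 4))  # (1 + 0.5 + 1) / 3 = 0.8333
```

The scores in the table are MRR multiplied by 100.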

STS Benchmark

Models are scored using Pearson Correlation. Note that the word vectors model is only trained on Stack Exchange data, so it isn't expected to generalize as well against the STS dataset.

Model               Supervision   Dev    Test
all-MiniLM-L6-v2    Train         87.0   82.7
SE 300d - BM25      Train         74.0   67.4
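
Pearson correlation measures the linear agreement between the model's predicted similarity scores and the gold labels. A minimal stdlib sketch (the score/label pairs below are toy data, not the STS benchmark):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: model similarity scores vs gold labels
print(round(pearson([0.1, 0.4, 0.8, 0.9], [0.0, 0.5, 0.7, 1.0]), 3))
```

A value of 1.0 would mean the model's scores are a perfect linear function of the gold labels; the table values are the coefficient multiplied by 100.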

Tests

To reproduce the tests above, run the following. Substitute $TEST_PATH with any local path.

mkdir -p $TEST_PATH
wget https://raw.githubusercontent.com/neuml/codequestion/master/test/stackexchange/query.txt -P $TEST_PATH/stackexchange
wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
tar -C $TEST_PATH -xvzf Stsbenchmark.tar.gz
python -m codequestion.evaluate -s test -p $TEST_PATH


codequestion's People

Contributors

davidmezzetti


codequestion's Issues

Vector model file not found (cord19-300d.magnitude)

Hi,

I get the following error when running python -m paperai.index

raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\\Users\\x\\.cord19\\vectors\\cord19-300d.magnitude'

PS. I am quite new to all this; so, apologies if the mistake is on my end.

ImportError: Faiss library is not installed

Trying to configure on Windows 10, I seem to have gotten everything installed but get this traceback when I run it:

(keras-gpu-2) C:\Users\bbate>codequestion
The system cannot find the path specified.
2020-09-15 13:57:44.515137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Loading model from C:\Users\bbate\.codequestion\models\stackexchange
Traceback (most recent call last):
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\bbate\Miniconda3\envs\keras-gpu-2\Scripts\codequestion.exe\__main__.py", line 7, in <module>
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 48, in main
    Shell().cmdloop()
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\cmd.py", line 105, in cmdloop
    self.preloop()
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 22, in preloop
    self.embeddings, self.db = Query.load()
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\query.py", line 127, in load
    embeddings.load(path)
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\embeddings.py", line 258, in load
    self.embeddings = ANN.create(self.config)
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\ann.py", line 51, in create
    raise ImportError("Faiss library is not installed")
ImportError: Faiss library is not installed

52177 segmentation fault codequestion

➜ python3.10 -m pip install codequestion

sformers, torch, txtai, codequestion
Successfully installed MarkupSafe-2.1.2 codequestion-2.0.0 faiss-cpu-1.7.3 html2markdown-0.1.7 huggingface-hub-0.13.4 jinja2-3.1.2 mpmath-1.3.0 networkx-3.1 python-louvain-0.16 scipy-1.10.1 sympy-1.11.1 tokenizers-0.13.3 torch-2.0.0 transformers-4.28.1 txtai-5.5.0

➜ python3.10 -m codequestion.download

Downloading model from https://github.com/neuml/codequestion/releases/download/v2.0.0/cqmodel.zip to /var/folders/4b/fykz7dvx2fj550ml_6t5qkww0000gn/T/cqmodel.zip
100%|
Decompressing model to /Users/tonis/.codequestion
Download complete

➜ codequestion

Loading model from /Users/tonis/.codequestion/models/stackexchange
[1]    58256 segmentation fault  codequestion

I also tried in a venv, but I'm not a Python expert.

file not found /home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude

root@0497bd526f2b:/# codequestion
Loading model from /root/.codequestion/models/stackexchange
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.22.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/bin/codequestion", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 35, in main
    Shell().cmdloop()
  File "/usr/local/lib/python3.8/cmd.py", line 105, in cmdloop
    self.preloop()
  File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 21, in preloop
    self.embeddings, self.db = Query.load()
  File "/usr/local/lib/python3.8/site-packages/codequestion/query.py", line 127, in load
    embeddings.load(path)
  File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 332, in load
    self.vectors = self.loadVectors(self.config["path"])
  File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 104, in loadVectors
    raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: '/home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude'

pip install results No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)

System: Windows 10 (x64) running Python 3.8.1 and pip 20.2.3.

ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)
(env) D:\code\codequestion>python -m pip install --upgrade pip
Collecting pip
  Using cached https://files.pythonhosted.org/packages/4e/5f/528232275f6509b1fff703c9280e58951a81abe24640905de621c9f81839/pip-20.2.3-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 19.2.3
    Uninstalling pip-19.2.3:
      Successfully uninstalled pip-19.2.3
Successfully installed pip-20.2.3

Here's what the full run looks like.

(env) D:\code\codequestion>pip install codequestion
Collecting codequestion
  Using cached codequestion-1.1.0-py3-none-any.whl (17 kB)
Collecting tqdm==4.48.0
  Using cached tqdm-4.48.0-py2.py3-none-any.whl (67 kB)
Collecting txtai>=1.2.0
  Using cached txtai-1.2.0-py3-none-any.whl (20 kB)
Collecting mdv>=1.7.4
  Using cached mdv-1.7.4.tar.gz (54 kB)
Collecting html2text>=2020.1.16
  Using cached html2text-2020.1.16-py3-none-any.whl (32 kB)
Collecting numpy>=1.18.4
  Downloading numpy-1.19.2-cp38-cp38-win_amd64.whl (13.0 MB)
     |████████████████████████████████| 13.0 MB 6.4 MB/s
Collecting annoy>=1.16.3
  Downloading annoy-1.16.3.tar.gz (644 kB)
     |████████████████████████████████| 644 kB 6.4 MB/s
Collecting pymagnitude-lite>=0.1.43
  Downloading pymagnitude_lite-0.1.143-py3-none-any.whl (34 kB)
Collecting nltk>=3.5
  Using cached nltk-3.5.zip (1.4 MB)
Collecting sentence-transformers>=0.3.3
  Using cached sentence-transformers-0.3.6.tar.gz (62 kB)
Collecting fasttext>=0.9.2
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 4.8 MB/s
Collecting hnswlib>=0.4.0
  Downloading hnswlib-0.4.0.tar.gz (17 kB)
Collecting scikit-learn>=0.23.1
  Downloading scikit_learn-0.23.2-cp38-cp38-win_amd64.whl (6.8 MB)
     |████████████████████████████████| 6.8 MB 3.3 MB/s
Collecting regex>=2020.5.14
  Using cached regex-2020.7.14-cp38-cp38-win_amd64.whl (264 kB)
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
     |████████████████████████████████| 769 kB 6.4 MB/s
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)

Vector model file not found

Hello,

Thank you very much for the project. But I have one small issue, right now it seems that when you run python -m codequestion.download it downloads a configuration file that will be used by codequestion to load the model.

The path to the model seems hardcoded to /home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude
How can we specify to codequestion to use our home or modify the config file?

Best

Add code quality checks

Add the following standard processes and procedures.

  • Unit tests
  • Test coverage
  • GitHub actions workflow
  • Pre-commit code quality checks

Migrate from word vector models to sentence transformers models

Since the original release in January 2020 there has been a lot of progress! sentence-transformers models now perform better than the models currently in codequestion with similar speed (even on CPUs!).

Models in codequestion 2.0 should move to sentence-transformers.

UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.23.1 when using version 0.23.2

~/miniconda3/envs/deepl/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(

Received the warning above when launching the codequestion query shell after a fresh install.

System details:

  • Ubuntu 18.04
  • Miniconda Python 3.8
  • CPU-only pytorch 1.6

test requires to specify source

The example in the readme simply has python -m codequestion.evaluate, but that gives an error about a missing -s {SOME SOURCE} or --source {SOME SOURCE} argument.

I was able to run it with python -m codequestion.evaluate -s test (assuming it ran after following the rest of the steps).

Upgrade to txtai 5.x

txtai 5.0 was recently released and much has happened since the last version of codequestion!

The next release of codequestion should replace questions.db with storing content directly in the index. Topics and path traversal should also be added via semantic graphs.

Upgrade to txtai 6.0

This change will update the minimum dependency for codequestion to txtai 6.0.

The main code change needed here is with the scoring package. With the addition of term indexing, checks need to be added to determine if a scoring index is for term indexing or word vectors weighting.
