
codequestion

Semantic search for developers



codequestion is a semantic search application for developer questions.


Developers typically keep a web browser window open while they work and run web searches as questions arise. With codequestion, those searches can be run locally: the application executes similarity queries to find questions similar to the input query.

The default model for codequestion is built from the Stack Exchange Dumps on archive.org. Once a model is installed, codequestion runs entirely locally; no network connection is required.

(architecture diagram)

codequestion is built with Python 3.8+ and txtai.

Installation

The easiest way to install is via pip and PyPI:

pip install codequestion

Python 3.8+ is supported. Using a Python virtual environment is recommended.

codequestion can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/codequestion

See this link for environment-specific troubleshooting.

Download a model

Once codequestion is installed, a model needs to be downloaded.

python -m codequestion.download

The model will be stored in ~/.codequestion/

The model can also be installed manually if the machine doesn't have direct internet access. The default model is pulled from the GitHub release page:

unzip cqmodel.zip -d ~/.codequestion

Search

Start up a codequestion shell to get started.

codequestion

A prompt will appear. Queries can be typed into the console. Type help to see all available commands.


Topics

The latest release integrates txtai 5.0, which has support for semantic graphs.

Semantic graphs add support for topic modeling and path traversal. Topics organize questions into groups with similar concepts. Path traversal uses the semantic graph to show how two potentially disparate entries are connected. An example covering both topic and path traversal is shown below.
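
Path traversal can be illustrated with a toy sketch. This is a conceptual illustration only, not txtai's semantic graph implementation; the graph, node ids, and edges below are made up.

```python
from collections import deque

# Hypothetical graph: nodes are question topics, edges connect similar entries
graph = {
    "sqlite-python": ["sqlite-cli", "python-db-api"],
    "python-db-api": ["sqlalchemy-intro"],
    "sqlite-cli": [],
    "sqlalchemy-intro": ["postgres-python"],
    "postgres-python": [],
}

def path(graph, start, end):
    """Breadth-first search for the shortest path between two entries."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        current = queue.popleft()
        node = current[-1]
        if node == end:
            return current
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(current + [neighbor])
    return None

# Show how two potentially disparate entries are connected
print(path(graph, "sqlite-python", "postgres-python"))
```

The returned path lists the intermediate questions linking the two entries, which is the same idea the semantic graph uses at scale.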

(topic and path traversal example)

VS Code

A codequestion prompt can be started within Visual Studio Code. This enables asking coding questions right from your IDE.

Press Ctrl+` to open a new terminal, then type codequestion.

(VS Code demo)

API service

codequestion builds a standard txtai embeddings index. As such, it supports hosting the index via a txtai API service.

First, create app.yml with the path to the index:

app.yml

path: /home/user/.codequestion/models/stackexchange/
embeddings:

Then install the API extra, start the service and test it:

# Install API extra
pip install txtai[api]

# Start API
CONFIG=app.yml uvicorn "txtai.api:app"

# Test API
curl "http://127.0.0.1:8000/search?query=python+query+sqlite&limit=1"

Outputs:

[{
    "id":"616429",
    "text":"How to fetch data from sqlite using python? stackoverflow python sqlite",
    "score":0.8401689529418945
}]

Additional metadata fields can be pulled back with SQL statements.

curl \
    --get \
    --data-urlencode "query=select id, date, tags, question, score from txtai where similar('python query sqlite')" \
    --data-urlencode "limit=1" \
    "http://127.0.0.1:8000/search"

[{
    "id":"616429",
    "date":"2022-05-23T10:45:40.397",
    "tags":"python sqlite",
    "question":"How to fetch data from sqlite using python?",
    "score":0.8401689529418945
}]

Tech overview

The following is an overview covering how this project works.

Process the raw data dumps

The raw 7z XML dumps from Stack Exchange are processed through a series of steps (see building a model). Only highly scored questions with accepted answers are retrieved for storage in the model. Questions and answers are consolidated into a single SQLite file called questions.db. The schema for questions.db is below.

questions.db schema

Id INTEGER PRIMARY KEY
Source TEXT
SourceId INTEGER
Date DATETIME
Tags TEXT
Question TEXT
QuestionUser TEXT
Answer TEXT
AnswerUser TEXT
Reference TEXT
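
As a sketch, the schema above can be created and queried with Python's built-in sqlite3 module. The row values below are made-up examples, not actual questions.db contents.

```python
import sqlite3

# In-memory database using the questions.db schema described above
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE questions (
        Id INTEGER PRIMARY KEY,
        Source TEXT,
        SourceId INTEGER,
        Date DATETIME,
        Tags TEXT,
        Question TEXT,
        QuestionUser TEXT,
        Answer TEXT,
        AnswerUser TEXT,
        Reference TEXT
    )
""")

# Insert a made-up example row
con.execute(
    "INSERT INTO questions VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "stackoverflow", 616429, "2022-05-23T10:45:40.397", "python sqlite",
     "How to fetch data from sqlite using python?", "user1",
     "Use the sqlite3 module...", "user2", "https://stackoverflow.com/q/616429"),
)

for row in con.execute("SELECT Id, Tags, Question FROM questions"):
    print(row)
```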

Index

codequestion builds a txtai embeddings index for questions.db. Each question in the questions.db schema is vectorized with a sentence-transformers model. Once questions.db is converted to a collection of sentence embeddings, the embeddings are normalized and stored in Faiss, which enables fast similarity searches.

Query

codequestion tokenizes each query using the same method as during indexing. Those tokens are used to build a sentence embedding. That embedding is queried against the Faiss index to find the most similar questions.
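
Because the stored embeddings are normalized to unit length, the inner product of two vectors equals their cosine similarity. A pure-Python toy illustration of this search step (the vectors and questions below are fabricated; real sentence embeddings have hundreds of dimensions):

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity(a, b):
    """Inner product of two unit vectors equals cosine similarity."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Toy "index" of question embeddings
index = {
    "How to fetch data from sqlite using python?": [0.9, 0.1, 0.2],
    "How do I center a div in CSS?": [0.1, 0.8, 0.3],
}

# Find the indexed question most similar to the query embedding
query = [0.85, 0.15, 0.25]
best = max(index, key=lambda question: similarity(query, index[question]))
print(best)
```

Faiss performs this same nearest-neighbor comparison, but with approximate search structures that scale to millions of vectors.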

Build a model

The following steps show how to build a codequestion model using Stack Exchange archives.

This is not necessary if you are using the default model from the GitHub release page.

1.) Download files from Stack Exchange: https://archive.org/details/stackexchange

2.) Place the selected files into a directory structure as shown below (the current process requires all of these files).

  • stackexchange/ai/ai.stackexchange.com.7z
  • stackexchange/android/android.stackexchange.com.7z
  • stackexchange/apple/apple.stackexchange.com.7z
  • stackexchange/arduino/arduino.stackexchange.com.7z
  • stackexchange/askubuntu/askubuntu.com.7z
  • stackexchange/avp/avp.stackexchange.com.7z
  • stackexchange/codereview/codereview.stackexchange.com.7z
  • stackexchange/cs/cs.stackexchange.com.7z
  • stackexchange/datascience/datascience.stackexchange.com.7z
  • stackexchange/dba/dba.stackexchange.com.7z
  • stackexchange/devops/devops.stackexchange.com.7z
  • stackexchange/dsp/dsp.stackexchange.com.7z
  • stackexchange/raspberrypi/raspberrypi.stackexchange.com.7z
  • stackexchange/reverseengineering/reverseengineering.stackexchange.com.7z
  • stackexchange/scicomp/scicomp.stackexchange.com.7z
  • stackexchange/security/security.stackexchange.com.7z
  • stackexchange/serverfault/serverfault.com.7z
  • stackexchange/stackoverflow/stackoverflow.com-Posts.7z
  • stackexchange/stats/stats.stackexchange.com.7z
  • stackexchange/superuser/superuser.com.7z
  • stackexchange/unix/unix.stackexchange.com.7z
  • stackexchange/vi/vi.stackexchange.com.7z
  • stackexchange/wordpress/wordpress.stackexchange.com.7z

3.) Run the ETL process

python -m codequestion.etl.stackexchange.execute stackexchange

This will create the file stackexchange/questions.db

4.) OPTIONAL: Build word vectors. This step is only necessary when using a word vectors model; if so, make sure to run pip install txtai[similarity] first.

python -m codequestion.vectors stackexchange/questions.db

This will create the file ~/.codequestion/vectors/stackexchange-300d.magnitude

5.) Build embeddings index

python -m codequestion.index index.yml stackexchange/questions.db

The default index.yml file is found on GitHub. Settings can be changed to customize how the index is built.

After this step, the index is created and all necessary files are ready to query.

Model accuracy

The following sections show test results for codequestion v2 and codequestion v1 using the latest Stack Exchange dumps. Version 2 uses a sentence-transformers model. Version 1 uses a word vectors model with BM25 weighting. BM25 and TF-IDF are shown to establish a baseline score.

StackExchange Query

Models are scored using Mean Reciprocal Rank (MRR).

Model               MRR
all-MiniLM-L6-v2    85.0
SE 300d - BM25      77.1
BM25                67.7
TF-IDF              61.7
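
Mean Reciprocal Rank averages the reciprocal of the rank at which the first correct result appears, across all test queries. A minimal sketch of the computation (the ranks below are made up, not the test data behind the scores above):

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct result per query; None = not found."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Example: correct result ranked 1st, 2nd and 1st across three queries
print(round(mean_reciprocal_rank([1, 2, 1]), 4))  # (1 + 0.5 + 1) / 3 = 0.8333
```

The scores in the table are MRR multiplied by 100.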

STS Benchmark

Models are scored using Pearson Correlation. Note that the word vectors model is only trained on Stack Exchange data, so it isn't expected to generalize as well against the STS dataset.

Model               Supervision   Dev    Test
all-MiniLM-L6-v2    Train         87.0   82.7
SE 300d - BM25      Train         74.0   67.4
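
Pearson correlation measures the linear agreement between the model's predicted similarity scores and the gold labels. A minimal stdlib sketch (the score/label pairs below are toy data, not the STS benchmark):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: model similarity scores vs gold labels
print(round(pearson([0.1, 0.4, 0.8, 0.9], [0.0, 0.5, 0.7, 1.0]), 3))
```

A value of 1.0 would mean the model's scores are a perfect linear function of the gold labels; the table values are the coefficient multiplied by 100.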

Tests

To reproduce the tests above, run the following. Substitute $TEST_PATH with any local path.

mkdir -p $TEST_PATH
wget https://raw.githubusercontent.com/neuml/codequestion/master/test/stackexchange/query.txt -P $TEST_PATH/stackexchange
wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
tar -C $TEST_PATH -xvzf Stsbenchmark.tar.gz
python -m codequestion.evaluate -s test -p $TEST_PATH


codequestion's People

Contributors

davidmezzetti


codequestion's Issues

Vector model file not found (cord19-300d.magnitude)

Hi,

I get the following error when running python -m paperai.index

raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\\Users\\x\\.cord19\\vectors\\cord19-300d.magnitude'

PS. I am quite new to all this; so, apologies if the mistake is on my end.

ImportError: Faiss library is not installed

Trying to configure on Windows 10, I seem to have gotten everything installed but get this traceback when I run it:

(keras-gpu-2) C:\Users\bbate>codequestion
The system cannot find the path specified.
2020-09-15 13:57:44.515137: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Loading model from C:\Users\bbate\.codequestion\models\stackexchange
Traceback (most recent call last):
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\bbate\Miniconda3\envs\keras-gpu-2\Scripts\codequestion.exe\__main__.py", line 7, in <module>
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 48, in main
    Shell().cmdloop()
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\cmd.py", line 105, in cmdloop
    self.preloop()
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\shell.py", line 22, in preloop
    self.embeddings, self.db = Query.load()
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\codequestion\query.py", line 127, in load
    embeddings.load(path)
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\embeddings.py", line 258, in load
    self.embeddings = ANN.create(self.config)
  File "c:\users\bbate\miniconda3\envs\keras-gpu-2\lib\site-packages\txtai\ann.py", line 51, in create
    raise ImportError("Faiss library is not installed")
ImportError: Faiss library is not installed

52177 segmentation fault codequestion

➜ python3.10 -m pip install codequestion

sformers, torch, txtai, codequestion
Successfully installed MarkupSafe-2.1.2 codequestion-2.0.0 faiss-cpu-1.7.3 html2markdown-0.1.7 huggingface-hub-0.13.4 jinja2-3.1.2 mpmath-1.3.0 networkx-3.1 python-louvain-0.16 scipy-1.10.1 sympy-1.11.1 tokenizers-0.13.3 torch-2.0.0 transformers-4.28.1 txtai-5.5.0

➜ python3.10 -m codequestion.download

Downloading model from https://github.com/neuml/codequestion/releases/download/v2.0.0/cqmodel.zip to /var/folders/4b/fykz7dvx2fj550ml_6t5qkww0000gn/T/cqmodel.zip
100%|
Decompressing model to /Users/tonis/.codequestion
Download complete

➜ codequestion

Loading model from /Users/tonis/.codequestion/models/stackexchange
[1]    58256 segmentation fault  codequestion

I also tried in a venv, but I'm not a Python expert.

file not found /home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude

root@0497bd526f2b:/# codequestion
Loading model from /root/.codequestion/models/stackexchange
/usr/local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.22.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/bin/codequestion", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 35, in main
    Shell().cmdloop()
  File "/usr/local/lib/python3.8/cmd.py", line 105, in cmdloop
    self.preloop()
  File "/usr/local/lib/python3.8/site-packages/codequestion/shell.py", line 21, in preloop
    self.embeddings, self.db = Query.load()
  File "/usr/local/lib/python3.8/site-packages/codequestion/query.py", line 127, in load
    embeddings.load(path)
  File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 332, in load
    self.vectors = self.loadVectors(self.config["path"])
  File "/usr/local/lib/python3.8/site-packages/codequestion/embeddings.py", line 104, in loadVectors
    raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: '/home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude'

pip install results No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)

System: Windows 10 (x64) running Python 3.8.1 and pip 20.2.3.

ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)
(env) D:\code\codequestion>python -m pip install --upgrade pip
Collecting pip
  Using cached https://files.pythonhosted.org/packages/4e/5f/528232275f6509b1fff703c9280e58951a81abe24640905de621c9f81839/pip-20.2.3-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 19.2.3
    Uninstalling pip-19.2.3:
      Successfully uninstalled pip-19.2.3
Successfully installed pip-20.2.3

Here's what the full run looks like.

(env) D:\code\codequestion>pip install codequestion
Collecting codequestion
  Using cached codequestion-1.1.0-py3-none-any.whl (17 kB)
Collecting tqdm==4.48.0
  Using cached tqdm-4.48.0-py2.py3-none-any.whl (67 kB)
Collecting txtai>=1.2.0
  Using cached txtai-1.2.0-py3-none-any.whl (20 kB)
Collecting mdv>=1.7.4
  Using cached mdv-1.7.4.tar.gz (54 kB)
Collecting html2text>=2020.1.16
  Using cached html2text-2020.1.16-py3-none-any.whl (32 kB)
Collecting numpy>=1.18.4
  Downloading numpy-1.19.2-cp38-cp38-win_amd64.whl (13.0 MB)
     |████████████████████████████████| 13.0 MB 6.4 MB/s
Collecting annoy>=1.16.3
  Downloading annoy-1.16.3.tar.gz (644 kB)
     |████████████████████████████████| 644 kB 6.4 MB/s
Collecting pymagnitude-lite>=0.1.43
  Downloading pymagnitude_lite-0.1.143-py3-none-any.whl (34 kB)
Collecting nltk>=3.5
  Using cached nltk-3.5.zip (1.4 MB)
Collecting sentence-transformers>=0.3.3
  Using cached sentence-transformers-0.3.6.tar.gz (62 kB)
Collecting fasttext>=0.9.2
  Downloading fasttext-0.9.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 4.8 MB/s
Collecting hnswlib>=0.4.0
  Downloading hnswlib-0.4.0.tar.gz (17 kB)
Collecting scikit-learn>=0.23.1
  Downloading scikit_learn-0.23.2-cp38-cp38-win_amd64.whl (6.8 MB)
     |████████████████████████████████| 6.8 MB 3.3 MB/s
Collecting regex>=2020.5.14
  Using cached regex-2020.7.14-cp38-cp38-win_amd64.whl (264 kB)
Collecting transformers==3.0.2
  Downloading transformers-3.0.2-py3-none-any.whl (769 kB)
     |████████████████████████████████| 769 kB 6.4 MB/s
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from txtai>=1.2.0->codequestion) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from txtai>=1.2.0->codequestion)

Vector model file not found

Hello,

Thank you very much for the project. But I have one small issue, right now it seems that when you run python -m codequestion.download it downloads a configuration file that will be used by codequestion to load the model.

The path to the model seems hardcoded to /home/dmezzett/.codequestion/vectors/stackexchange-300d.magnitude
How can we specify to codequestion to use our home or modify the config file?

Best

Add code quality checks

Add the following standard processes and procedures.

  • Unit tests
  • Test coverage
  • GitHub actions workflow
  • Pre-commit code quality checks

Migrate from word vector models to sentence transformers models

Since the original release in January 2020 there has been a lot of progress! sentence-transformers models now perform better than the models currently in codequestion with similar speed (even on CPUs!).

Models in codequestion 2.0 should move to sentence-transformers.

UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.23.1 when using version 0.23.2

~/miniconda3/envs/deepl/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TruncatedSVD from version 0.23.1 when using version 0.23.2. This might lead to breaking code or invalid results. Use at your own risk.
  warnings.warn(

Received the warning above when launching the codequestion query shell after a fresh install.

System details:

  • Ubuntu 18.04
  • Miniconda Python 3.8
  • CPU-only pytorch 1.6

test requires to specify source

The example in the readme simply has python -m codequestion.evaluate, but that gives an error about a missing -s {SOME SOURCE} or --source {SOME SOURCE} argument.

I was able to run it with python -m codequestion.evaluate -s test (assuming it ran after following the rest of the steps).

Upgrade to txtai 5.x

txtai 5.0 was recently released and much has happened since the last version of codequestion!

The next release of codequestion should replace questions.db with storing content directly in the index. Topics and path traversal should also be added via semantic graphs.

Upgrade to txtai 6.0

This change will update the minimum dependency for codequestion to txtai 6.0.

The main code change needed here is with the scoring package. With the addition of term indexing, checks need to be added to determine if a scoring index is for term indexing or word vectors weighting.
