db-hub-fastapi's Introduction

DBHub

Boilerplate for async ingestion and querying of DBs

This repo aims to provide working code and reproducible setups for bulk data ingestion and querying from numerous databases via their Python clients. Wherever possible, async database client APIs are utilized for data ingestion. The query interface to the data is exposed via async FastAPI endpoints. To enable reproducibility across environments, Dockerfiles are provided as well.
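
As a rough illustration of the async ingestion pattern described above, the following sketch batches records and sends the batches concurrently with `asyncio.gather`. `FakeAsyncClient`, `add_documents`, `chunked` and `ingest` are hypothetical names invented for this example; they stand in for whichever async database client is used and are not taken from the repo.

```python
import asyncio
from typing import Iterator


class FakeAsyncClient:
    """Hypothetical stand-in for an async database client (not a real library)."""

    def __init__(self) -> None:
        self.stored: list[dict] = []

    async def add_documents(self, docs: list[dict]) -> int:
        # Simulate the network latency of a real bulk-insert call
        await asyncio.sleep(0.01)
        self.stored.extend(docs)
        return len(docs)


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


async def ingest(client: FakeAsyncClient, records: list[dict], batch_size: int = 100) -> int:
    """Send all batches concurrently and return the number of records ingested."""
    tasks = [client.add_documents(batch) for batch in chunked(records, batch_size)]
    counts = await asyncio.gather(*tasks)
    return sum(counts)


if __name__ == "__main__":
    # Made-up sample records, for illustration only
    records = [{"id": i, "title": f"wine {i}"} for i in range(250)]
    client = FakeAsyncClient()
    total = asyncio.run(ingest(client, records, batch_size=100))
    print(f"Ingested {total} records in batches")
```

With a real client, the batch size and the number of concurrent requests would likely need tuning to what the database server can handle.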

The docker-compose.yml does the following:

  1. Set up a local DB server in a container
  2. Set up local volume mounts to persist the data
  3. Set up a FastAPI server in another container
  4. Set up a network bridge such that the DB server can be accessed from the FastAPI server
  5. Tear down all the containers once development and testing are complete

Currently implemented

  • Neo4j
  • Elasticsearch
  • Meilisearch
  • Qdrant
  • Weaviate
  • LanceDB

Goals

The main goals of this repo are explained as follows.

  1. Ease of setup: There are tons of databases and client APIs out there, so it's useful to have a clean, efficient and reproducible workflow to experiment with a range of datasets, as well as databases for the problem at hand.

  2. Ease of distribution: We may want to expose (potentially sensitive) data to downstream client applications, so building an API on top of the database is a useful way to share the data in a controlled manner.

  3. Ease of testing advanced use cases: Search databases (either full-text keyword search or vector DBs) can be important "sources of truth" for contextual querying via LLMs like ChatGPT, allowing us to ground our model's results with factual data.

Pre-requisites

  • Python 3.10+
  • Docker
  • A passion to learn more about and experiment with databases!

db-hub-fastapi's People

Contributors

lint-action, prrao87, sanders41

db-hub-fastapi's Issues

Meilisearch with Pydantic v2

I saw you updated Neo4j to use Pydantic 2. meilisearch-python-async has also been updated to support Pydantic 2. I have been meaning to try that out and see what difference it makes here, but just haven't gotten to it yet.

Refactor to separate API and ETL schemas

Move FastAPI Pydantic schemas to the api directory, and move the ETL Pydantic schemas to the scripts directory for each database. This keeps things cleaner, as the schema requirements for upstream/downstream processes are rather different, so they don't need to be together (as I originally thought).

FileNotFoundError when running bulk_import.py

Following the README instructions, I ran python bulk_import.py and got an error.

Traceback (most recent call last):
  File "/home/paul/development/python/async-db-fastapi/dbs/meilisearch/scripts/bulk_index.py", line 191, in <module>
    files = get_json_files("winemag-data", FILE_PATH)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paul/development/python/async-db-fastapi/dbs/meilisearch/scripts/bulk_index.py", line 42, in get_json_files
    raise FileNotFoundError(
FileNotFoundError: No .jsonl files with prefix `winemag-data` found in `/home/paul/development/python/async-db-fastapi/data/winemag-data-130k-v2-jsonl`

This happens because the file is zipped and needs to be extracted first.
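
One possible way to guard against this is to unpack any matching archive before looking for the .jsonl files. The sketch below is not the repo's `get_json_files` implementation; `get_jsonl_files` is a hypothetical helper, and it assumes the archive sits in the data directory as a `.zip` with the same prefix and holds the `.jsonl` files at its top level.

```python
import zipfile
from pathlib import Path


def get_jsonl_files(prefix: str, data_dir: Path) -> list[Path]:
    """Return sorted .jsonl files with the given prefix, unzipping archives first if needed."""
    files = sorted(data_dir.glob(f"{prefix}*.jsonl"))
    if not files:
        # No extracted files yet: unpack any matching .zip archives in place
        # (assumes the archive holds the .jsonl files at its top level)
        for archive in sorted(data_dir.glob(f"{prefix}*.zip")):
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(data_dir)
        files = sorted(data_dir.glob(f"{prefix}*.jsonl"))
    if not files:
        raise FileNotFoundError(
            f"No .jsonl files with prefix `{prefix}` found in `{data_dir}`"
        )
    return files
```

If the archive nests its contents in a subdirectory, the second glob would need to search recursively instead.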

Rename filenames for schemas/routers in APIs

To improve clarity, it makes sense to reduce the verbosity of the router names and make the filenames more specific:

  • The filename wine.py is vague -- it's better to rename the file to be more specific as to what it does, such as retriever.py
  • wine_router can be renamed to simply router, and called using retriever.router from main.py in each FastAPI section

Fix missing column value in LanceDB

Currently, once the data is read into LanceDB via Arrow conversion, the price column is missing from the table. Maybe something is happening during type coercion that causes the column to be ignored? In any case, price is an important variable and needs to be included in the table for downstream filtering, so this needs to be diagnosed.
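
A first diagnostic step, assuming the type-coercion guess is worth checking, could be to count which Python types the price field takes in the source records before any Arrow conversion. The helper `price_type_counts` and the sample records below are hypothetical and use only the standard library; they do not touch LanceDB or reproduce the bug itself.

```python
import json
from collections import Counter
from typing import Iterable


def price_type_counts(lines: Iterable[str], column: str = "price") -> Counter:
    """Count the Python type of `column` across JSON-lines records."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if column not in record:
            counts["<missing key>"] += 1
        else:
            counts[type(record[column]).__name__] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records, for illustration only
    sample = [
        '{"id": 1, "price": 15.0}',
        '{"id": 2, "price": null}',
        '{"id": 3, "price": 20}',
        '{"id": 4}',
    ]
    print(price_type_counts(sample))
```

A mix of types such as float, int and null in the same column would be a reasonable place to start looking, since schema inference has to reconcile them into a single column type.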
