sfu-db / connector-x
Fastest library to load data from DB to DataFrames in Rust and Python
Home Page: https://sfu-db.github.io/connector-x/intro.html
License: MIT License
Currently the pandas writer can be handed a wrong schema with a different number of columns, but the program still runs instead of failing.
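A cheap guard, sketched here in Python with hypothetical names (the real writer is in Rust), would make the mismatch fail fast instead of running silently:

```python
# Sketch (hypothetical helper): fail fast when the number of values in a row
# disagrees with the number of columns in the destination schema.
def check_schema(columns, rows):
    for i, row in enumerate(rows):
        if len(row) != len(columns):
            raise ValueError(
                f"row {i} has {len(row)} values but the schema has {len(columns)} columns"
            )

check_schema(["id", "name"], [(1, "a"), (2, "b")])  # passes silently
try:
    check_schema(["id", "name"], [(1, "a", "extra")])
except ValueError as e:
    print(e)  # row 0 has 3 values but the schema has 2 columns
```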
Simple cache support for query results, to speed up loading the same query multiple times. The design should be extensible to more sophisticated implementations.
def read_sql(..., enable_cache=True, force_query=False)

- enable_cache (bool or str, default True): Whether to cache the result data. If False, do not cache the result. If True, cache the result to /tmp, using the connection and SQL as the name. If a string is set, cache the result to the corresponding path.
- force_query (bool, default False): Whether to force downloading the data from the database regardless of whether a cache exists. If True, also update the local cache.

pub trait Cache {
    fn init(conn: &str) -> Result<()>; // init cache source, init metadata if not exists
    fn query_match(query: &str) -> Result<(Vec<String>, Vec<String>)>; // look up metadata, split the query into a probe query and a remainder query, and partition each
    fn post_execute(dests: Vec<Box<dyn Destination>>) -> Result<Box<dyn Destination>>; // produce the final result
}
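To make the proposal concrete, here is a minimal Python sketch of the caching behavior described above. The file-naming scheme (hash of connection string plus SQL under the temp directory) and the pickle serialization are assumptions for illustration, not the actual design:

```python
# Minimal sketch of the proposed query-result cache (assumed scheme: the cache
# file name is derived from the connection string and the SQL text).
import hashlib
import os
import pickle
import tempfile

def cache_path(conn, sql):
    key = hashlib.sha256(f"{conn}\n{sql}".encode()).hexdigest()
    return os.path.join(tempfile.gettempdir(), f"cx_cache_{key}.pkl")

def read_sql_cached(conn, sql, fetch, enable_cache=True, force_query=False):
    """fetch(conn, sql) performs the real download; we memoize its result."""
    path = cache_path(conn, sql)
    if enable_cache and not force_query and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cache hit: skip the database entirely
    result = fetch(conn, sql)
    if enable_cache:
        with open(path, "wb") as f:
            pickle.dump(result, f)  # write-through so later calls hit the cache
    return result
```

Exact-match lookup like this corresponds to the first version described below; a fancier Cache implementation would split the query into a cached part and a remainder.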
Left: current implementation, Right: implementation supporting cache
Either cache_queries or db_queries will be empty in the first version, which only supports exact matches.
Some related works:
- Supported queries and partition algorithm
- Baselines: Dask, Modin (Dask), Pandas
- Parameters: number of partitions, network bandwidth (200Mbps, 10Gbps), pandas with chunking
- Metrics: time, peak memory
Currently, Produce<T> does not carry a lifetime. This means it can only produce owned types, not borrowed types. Producing borrowed types is important for the zero-copy goal.
We are currently fine even for strings because Postgres uses Bytes for its internal buffer, and cloning a Bytes just clones the Arc under it. However, this might not hold for arbitrary data sources.
I made an initial attempt on the borrowed-produce branch, but some work still needs to be done to make it compile.
It would be nice if we could get logs from Rust over the wire for debugging purposes, preferably configurable from the Python client.
I have a supposedly Postgres-compatible source that fails due to
RuntimeError: Cannot get metadata for the queries, last error: Some(Error { kind: UnexpectedMessage, cause: None })
In the meantime, any tips for debugging this?
File "D:\BOT\PostgreSQLConnection\venv\lib\site-packages\connectorx\__init__.py", line 98, in read_sql
    partition_query=partition_query,
RuntimeError: TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got list)
OR
File "D:\BOT\PostgreSQLConnection\venv\lib\site-packages\connectorx\__init__.py", line 98, in read_sql
    partition_query=partition_query,
RuntimeError: TypeError: Argument 'placement' has incorrect type (expected pandas._libs.internals.BlockPlacement, got int)
Maybe we should add a method called headers to the source. Currently, the csv parser gives you an empty string.
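A possible shape for such a headers accessor, sketched in Python for a text CSV source (the name headers comes from the suggestion above; everything else is an assumption):

```python
# Sketch of a hypothetical `headers` accessor for a CSV source: peek at the
# first row so the writer can build a schema, then rewind so the data pass
# still sees every row.
import csv
import io

def headers(src):
    pos = src.tell()
    row = next(csv.reader(src), [])  # [] instead of an empty string when the file is empty
    src.seek(pos)
    return row

buf = io.StringIO("id,name\n1,alice\n")
print(headers(buf))  # ['id', 'name']
```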
Currently it takes 30s to allocate the tpch lineitem x10 table. For comparison, Rust takes 10us (maybe an unfair comparison) and pure numpy takes 4 seconds.
Make pandas dataframe allocation support pandas 1.3
With the new Arrow support, I think connector-x has value as a Rust crate as well. However, as a crate owner writing a library, I would not want to depend on connector-x, because it pulls in a lot of dependencies I don't need, which increases compile times.
I think it would be really valuable if the dependencies could be opt-in, activated with feature gates.
Currently no TLS implementation is included, so a connection like the following fails if the database mandates an SSL/TLS connection:
df = cx.read_sql(f"postgresql://{username}:{password}@host:35432/schema?sslmode=require", 'select 1;', return_type='pandas')
Error:
RuntimeError: timed out waiting for connection: error performing TLS handshake: no TLS implementation configured
Is this a planned feature? It seems to be supported upstream in rust-postgres: https://docs.rs/postgres-native-tls/0.5.0/postgres_native_tls/
The lack of SSL support (even with sslmode=require) makes it difficult to use connector-x in enterprise environments.
import polars as pl
import pandas as pd
from sqlalchemy import create_engine
import pyarrow

print(pl.__version__)
# 0.8.20
print(pd.__version__)
# 1.3.0
# pip list | grep connectorx
# connectorx 0.2.0
print(pyarrow.__version__)
# 4.0.1

# pandas first
sql = '''select ORDER_ID from tables'''
engine = create_engine('mysql+pymysql://root:***@*.*.*.*:*')
df = pd.read_sql_query(sql, engine)
print(df.dtypes)
# ORDER_ID    int64

# polars second
conn = "mysql://root:***@*.*.*.*:*"
pdf = pl.read_sql(sql, conn)
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
<timed exec> in <module>
~/miniconda3/envs/test/lib/python3.8/site-packages/polars/io.py in read_sql(sql, connection_uri, partition_on, partition_range, partition_num)
556 """
557 if _WITH_CX:
--> 558 tbl = cx.read_sql(
559 conn=connection_uri,
560 query=sql,
~/miniconda3/envs/test/lib/python3.8/site-packages/connectorx/__init__.py in read_sql(conn, query, return_type, protocol, partition_on, partition_range, partition_num)
126 raise ValueError("You need to install pyarrow first")
127
--> 128 result = _read_sql(
129 conn,
130 "arrow",
PanicException: Could not retrieve i64 from Value
Use AWS machines.
And add tests
The proposed syntax would be date(%Y-%m-%d)
For now we convert uuid values to string objects in pandas; make them uuid objects instead (to align with pandas).
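Until that lands, a user-side workaround (plain pandas, nothing connector-x specific) is to map the string column through uuid.UUID:

```python
# Workaround sketch: convert a string UUID column (what read_sql returns today)
# into uuid.UUID objects, which is what the issue asks for natively.
import uuid
import pandas as pd

df = pd.DataFrame({"id": ["12345678-1234-5678-1234-567812345678"]})
df["id"] = df["id"].map(uuid.UUID)
print(type(df["id"][0]))  # <class 'uuid.UUID'>
```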
Databases are one of the most common data sources that data scientists fetch data from. However, loading data from a database and converting it into dataframes for further analysis is usually a heavy-weight process. The read_sql function aims to speed up this process through the following features:
read_sql(sql, conn, cache=True, force_download=False, par_column=None, par_min=None, par_max=None, par_num=None, dask=False)

- conn: Connection string, e.g. postgresql://username:password@host:port/dbname
- cache (bool or str, default True): Whether to cache the result data. If False, do not cache the result. If True, cache the result to /tmp, using the connection and SQL as the name. If a string is set, cache the result to the corresponding path.
- force_download (bool, default False): Whether to force downloading the data from the database regardless of whether a cache exists.
- par_column (str, default None): Name of the column used to partition the query (must be an integer column). If None, do not partition.
- par_min (int, default None): The minimum value to be requested from the partition column. If None, do not partition.
- par_max (int, default None): The maximum value to be requested from the partition column. If None, do not partition.
- par_num (int, default None): Number of queries to split into. If None, do not partition.
- dask (bool, default False): Whether to return a Dask DataFrame.

Returns: Pandas/Dask DataFrame
We were given the task in the attachment below.
We have written the query below, but it is not working, i.e. it does not fetch the desired result:
With New Dept_id as (select x.dpt_code from department x where x.dpt_code not in
(select e.dpt_id from emplyee e join
employee e where e.dep_id=d.dpt_code))
With New Dept_name as (select x.dpt_name from department x )
SELECT employee.emp_id, employee.emp_name, employee.hire_date, employee.jon_name, employee.dept_id, New Dept_id, New Dept_name from employee JOIN department on (employee.dep_id=department.dpt_code) JOIN
Hi,
I tried to use ConnectorX to load a dataframe from a PostgreSQL database with the following code:
import connectorx as cx
tableName = "semantic_search"
dataFrame = cx.read_sql('postgresql://postgres:postgres@localhost:5432/embeddings_sts_tf', "select * from " + tableName)
but got the following error:
thread '<unnamed>' panicked at 'not implemented: _float8', connectorx/src/sources/postgres/typesystem.rs:78:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "/home/matthieu/Code/Python/postgreSQL_creation.py", line 142, in <module>
dataFrame = cx.read_sql('postgresql://{user}:{pw}@{host}:5432/{db}'.format(host=conn_info["host"], db=conn_info["database"], user=conn_info["user"], pw=conn_info["password"]), "select * from " + tableName)
File "/home/matthieu/anaconda3/envs/sts-transformers-gpu-fresh/lib/python3.8/site-packages/connectorx/__init__.py", line 99, in read_sql
result = _read_sql(
pyo3_runtime.PanicException: not implemented: _float8
Thanks for helping!
so that we can allocate a block of memory instead of an extension array (fewer cache misses).
i16, f32/i32, interval, numeric, enum, bytes, char, time, timetz, uuid,
and add tests
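The contiguous-block idea above can be sketched with numpy (an assumed layout for illustration; the actual destination code is in Rust and differs in detail):

```python
# Sketch: one contiguous float64 block holding several columns, versus a
# Python-object (extension-array-like) layout. The block keeps values packed,
# which is friendlier to the CPU cache when scanning a column.
import numpy as np

n_cols, n_rows = 3, 1_000
block = np.empty((n_cols, n_rows), dtype=np.float64)  # single allocation
block[0, :] = 1.0                                     # column 0 is one slice of the block
print(block.flags["C_CONTIGUOUS"])  # True
```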