fast_ds

Data science at high speed

0. How to set up git repository

Go to GitHub, press + and create a new repository
Copy the SSH clone URL
In a terminal, run git clone <repository> fast_ds/
Open a new VS Code window on the fast_ds directory itself: select the folder in the open dialog, BUT DO NOT NAVIGATE INTO IT.
Make a fast_ds directory inside fast_ds for future imports
Create an __init__.py file in EVERY subfolder (it can be empty)
If you use conda, you should be fine, and you can always create a dedicated conda environment if you like. If you use pip, set up a virtual environment with python -m venv venv/ and then activate it with source venv/bin/activate
Add a .vscode directory containing launch.json and settings.json. Copy everything, but you can skip the default interpreter path, which is specific to my computer.

1. Create models

Structure in five functions for reuse

2. Groupby-apply + Github

Create a new repository and add your model creation code
Create categorical variables in your data
Use groupby-apply in 3 ways: Series vs. DataFrame, as a "reduce". In the case of map, you can just run groupby().apply(np.mean)
- https://realpython.com/pandas-groupby/
- https://www.youtube.com/watch?v=qy0fDqoMJx8
Test out the idea of a class
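
A minimal sketch of the groupby-apply variants, on made-up toy data. The column names (group, x, y) and the weighted-mean function are illustrative assumptions, and the split into three cases is one reading of the list above rather than the only one.

```python
import numpy as np
import pandas as pd

# Made-up toy data; "group" plays the role of the categorical variable.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1.0, 3.0, 2.0, 4.0, 6.0],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# 1. Series in, scalar out (a "reduce"): one number per group.
mean_x = df.groupby("group")["x"].apply(np.mean)

# 2. Series in, Series out: one value per input row, centered within its group.
centered_x = df.groupby("group")["x"].apply(lambda s: s - s.mean())

# 3. DataFrame in, scalar out: the function sees several columns of the group.
def weighted_mean(g):
    """Mean of x weighted by y (hypothetical example function)."""
    return (g["x"] * g["y"]).sum() / g["y"].sum()

weighted = df.groupby("group")[["x", "y"]].apply(weighted_mean)

print(mean_x)
print(centered_x)
print(weighted)
```

Selecting the columns before calling apply keeps the grouping column out of the frame passed to the function, which sidesteps behavior that differs between pandas versions.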

SQL

Interact with SQL in three ways:

Using Beekeeper Studio https://github.com/beekeeper-studio/beekeeper-studio/releases/tag/v3.9.9 and directly calling SQL
Calling SQL queries using sqlite connections and cursors
Using Pandas via SQLAlchemy
Take an existing dataset, fda_data.csv, and separate it into a relational model of companies (applicant in the data) and drugs (both proper_name and proprietary_name).

Can you do it both via Python and SQL?
Which approach is better for a large-scale database?
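
One possible shape for the Python route, sketched with pandas. The rows are made up and stand in for fda_data.csv; only the column names applicant, proper_name, and proprietary_name come from the description above. The table names companies and drugs and the company_id column are illustrative choices.

```python
import pandas as pd

# Made-up rows standing in for fda_data.csv.
raw = pd.DataFrame({
    "applicant": ["Acme Pharma", "Acme Pharma", "Beta Bio"],
    "proper_name": ["drug-a", "drug-b", "drug-c"],
    "proprietary_name": ["Alpha", "Bravo", "Charlie"],
})

# Companies: one row per distinct applicant, with a surrogate integer id.
companies = (
    raw[["applicant"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .rename(columns={"applicant": "name"})
)
companies["id"] = companies.index + 1

# Drugs: replace the applicant name with the id of the matching company.
drugs = (
    raw.merge(companies, left_on="applicant", right_on="name")
    .rename(columns={"id": "company_id"})
    [["proper_name", "proprietary_name", "company_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
drugs["id"] = drugs.index + 1

print(companies)
print(drugs)
```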

Directly calling SQL

For SQLite, use INSERT OR IGNORE INTO rather than ON CONFLICT IGNORE or ON CONFLICT DO NOTHING.

You can also declare a foreign key inline with table_id INTEGER REFERENCES table(id).
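
A small sketch of both ideas using the standard-library sqlite3 module with a connection and cursor. The table names, column names, and rows are made up for illustration. Note that SQLite only enforces REFERENCES when foreign keys are switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is set.
conn.execute("PRAGMA foreign_keys = ON")
cur = conn.cursor()

cur.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute(
    "CREATE TABLE drugs ("
    " id INTEGER PRIMARY KEY,"
    " proper_name TEXT,"
    " company_id INTEGER REFERENCES companies(id))"
)

# The repeated name violates the UNIQUE constraint and is silently skipped.
for name in ["Acme Pharma", "Beta Bio", "Acme Pharma"]:
    cur.execute("INSERT OR IGNORE INTO companies (name) VALUES (?)", (name,))

# Look up the company id, then insert a drug that references it.
cur.execute("SELECT id FROM companies WHERE name = ?", ("Acme Pharma",))
acme_id = cur.fetchone()[0]
cur.execute(
    "INSERT INTO drugs (proper_name, company_id) VALUES (?, ?)",
    ("drug-a", acme_id),
)
conn.commit()

company_count = cur.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
print(company_count)
```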

Using SQLAlchemy/Pandas

Note that some combinations of pandas and SQLAlchemy versions do not work together, so sometimes you cannot pass the engine directly to to_sql.

Rather than:

df.to_sql('raw', engine, index=False, if_exists='append')

Use:

with engine.connect() as conn:
   df.to_sql('raw', conn, index=False, if_exists='append')

Entity resolution/deduplication

Strings can overlap in the following ways (amongst many):

One string is a substring. If Janssen is there, and Janssen Pharmaceuticals, that's a comfortable overlap.
Fraction of words that overlap
Number of changes to get from one string to another (Levenshtein distance, or my preferred one, Jaro-Winkler similarity/distance). Install with conda install jarowinkler
Consider fuzzywuzzy
XGBoost?
How do we generate features and labels using only a list of companies?
Consider sets
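
A rough sketch of how some of the overlap measures above could be turned into features for a pair of names, using only the standard library. The function names are hypothetical, the word overlap is computed as a Jaccard ratio over word sets (one choice among several), and Jaro-Winkler is left out to keep the example dependency-free.

```python
def is_substring(a: str, b: str) -> bool:
    """True if either name is contained in the other (case-insensitive)."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


def word_overlap(a: str, b: str) -> float:
    """Shared words divided by all distinct words (Jaccard ratio on sets)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # delete char_a
                cur[j - 1] + 1,                    # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        prev = cur
    return prev[-1]


def pair_features(a: str, b: str) -> dict:
    """Feature dictionary for one candidate pair of names."""
    return {
        "is_substring": is_substring(a, b),
        "word_overlap": word_overlap(a, b),
        "levenshtein": levenshtein(a.lower(), b.lower()),
    }


print(pair_features("Janssen", "Janssen Pharmaceuticals"))
```

Each pair of names becomes one row of numeric features, which is the kind of input a classifier such as XGBoost expects.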

Output

A bridge table, which connects one id to another. It's just a list of pairs of ids.

Connect one id from a company table to another id.
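
As an illustration, a bridge table could look like the following in SQLite. The table name company_matches, its column names, and the third company are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);

-- Bridge table: only a pair of ids, each pointing into companies.
CREATE TABLE company_matches (
    company_id_1 INTEGER REFERENCES companies(id),
    company_id_2 INTEGER REFERENCES companies(id)
);
""")

conn.executemany(
    "INSERT INTO companies (id, name) VALUES (?, ?)",
    [(1, "Janssen"), (2, "Janssen Pharmaceuticals"), (3, "Acme Pharma")],
)
# Record that companies 1 and 2 refer to the same entity.
conn.execute("INSERT INTO company_matches VALUES (?, ?)", (1, 2))

# Join through the bridge table to read the matched names back.
rows = conn.execute("""
    SELECT a.name, b.name
    FROM company_matches AS m
    JOIN companies AS a ON a.id = m.company_id_1
    JOIN companies AS b ON b.id = m.company_id_2
""").fetchall()
print(rows)
```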

Creating labels

1. Manually get a set of easy matches and nonmatches
   - Can even make them up (10-30 examples)
   - Three columns: name1, name2, ismatch
2. Calculate features from the names in 1
3. Train an xgboost model from the output of 2
4. Run the xgboost model on a big batch of data (100?, 1000?, 10_000?), predicting proba
5. Sample 3 examples from each chunk 0-0.1, 0.1-0.2, 0.2-0.3...
   - This is making it so that we actually find the hardest examples, but keep a few easy ones
6. Hand-check the output of 5 to create the final xgboost training set
7. Use the output of 6 to create the final xgboost entity resolution model
8. Run the xgboost model on everything, saving ismatch where it is true, and save the ids of the two names that did match
Bridge table is the one place without primary key id, because it's not necessary, but it instead references two other tables to describe pairs
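
The sampling step (3 examples from each 0.1-wide chunk of predicted probability) might be sketched as below. The function name, its parameters, and the (name1, name2, proba) tuple layout are assumptions for illustration; the probabilities would come from the model's predictions.

```python
import random


def sample_per_bin(scored_pairs, per_bin=3, n_bins=10, seed=0):
    """Pick up to per_bin pairs from each equal-width probability bin.

    scored_pairs is a list of (name1, name2, proba) tuples with proba in [0, 1].
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for pair in scored_pairs:
        proba = pair[2]
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(proba * n_bins), n_bins - 1)
        bins[index].append(pair)

    sampled = []
    for members in bins:
        # Bins with fewer than per_bin pairs contribute everything they have.
        sampled.extend(rng.sample(members, min(per_bin, len(members))))
    return sampled


# Made-up scores: 100 pairs spread evenly over [0, 1).
scored = [(f"name{i}", f"other{i}", (i + 0.5) / 100) for i in range(100)]
to_check = sample_per_bin(scored)
print(len(to_check))
```

Sampling evenly across the probability range keeps the uncertain middle well represented in the hand-checked set while still including a few easy cases from both ends.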
