clinical-trials's Introduction

clinical-trials

Edits for database project

  • v0.2: Postgres + Flask back-end. The d3 front-end displays a summary of clinical trials based on a query. The PubMed and clinical trials databases are kept offline on a Linux PC.

  • Roadmap: v0.3 - logic to classify trials and assess a research pipeline. v0.4 - major bugs fixed. v0.5 - login system.

Scripts:

scripts/unzip-setup.py

Summary: Download the bulk zip file and unzip its contents

  • Bulk download (zip file) from clinicaltrials.gov
  • Unzip file
  • Save all xml files in a single folder
  • Cleanup folder
  • Time:
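
The steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the `BULK_URL` value and the function names `download_zip`, `extract_xml_flat`, and `cleanup` are assumptions introduced here, and the URL should be checked against clinicaltrials.gov before use.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# Assumed bulk-download URL; verify the current location on clinicaltrials.gov.
BULK_URL = "https://clinicaltrials.gov/AllPublicXML.zip"


def download_zip(url: str, zip_path: Path) -> Path:
    """Stream the bulk archive to disk without loading it into memory."""
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(zip_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return zip_path


def extract_xml_flat(zip_path: Path, target_dir: Path) -> int:
    """Copy every .xml member into one flat folder and return the file count."""
    target_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir() or not member.filename.lower().endswith(".xml"):
                continue
            # Drop any sub-folder prefix so all files land in target_dir.
            destination = target_dir / Path(member.filename).name
            with archive.open(member) as src, open(destination, "wb") as dst:
                shutil.copyfileobj(src, dst)
            count += 1
    return count


def cleanup(zip_path: Path) -> None:
    """Remove the downloaded archive once extraction has finished."""
    zip_path.unlink(missing_ok=True)
```

A run would call `download_zip(BULK_URL, Path("data/bulk.zip"))`, then `extract_xml_flat`, then `cleanup`. The paths are placeholders.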

to-do:

  • Remove old files before downloading a new zip file
  • Save log of basic data: download date, url, time required to run script

scripts/parse-all-trials.py

Summary: Parse XML files, export all data as a single JSON file

  • Parse files
  • Extract tags with single values
  • Add tags with several possible values
  • Load the dictionary into a dataframe
  • Dump results in a JSON file
  • Time: [real 166m53.614s]
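
A rough sketch of the parsing steps, using only the standard library. The tag lists `SINGLE_VALUE_TAGS` and `MULTI_VALUE_TAGS` and both function names are hypothetical (the tags the project actually extracts are not listed here), and the dataframe step is skipped so the records go straight to JSON.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical tag selections, for illustration only.
SINGLE_VALUE_TAGS = ["brief_title", "overall_status", "phase"]
MULTI_VALUE_TAGS = ["condition", "keyword"]


def parse_trial(xml_text: str) -> dict:
    """Turn one trial XML document into a flat dictionary."""
    root = ET.fromstring(xml_text)
    record = {}
    for tag in SINGLE_VALUE_TAGS:
        # findtext returns None when the tag is missing.
        record[tag] = root.findtext(tag)
    for tag in MULTI_VALUE_TAGS:
        # Tags that may repeat are collected into a list.
        record[tag] = [el.text for el in root.findall(tag) if el.text]
    return record


def parse_folder(xml_dir: Path, json_path: Path) -> int:
    """Parse every XML file in a folder and dump the records to one JSON file."""
    records = []
    for xml_file in sorted(xml_dir.glob("*.xml")):
        record = parse_trial(xml_file.read_text(encoding="utf-8"))
        record["source_file"] = xml_file.name
        records.append(record)
    json_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)
```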

scripts/db setup

Summary: Import JSON file, preprocess data, export all data to sqlite as working db

  • Import JSON file with all parsed data
  • Data cleaning and formatting
  • Create new column with all_text.
  • Create new column with recruiting label
  • Export df to sqlite
  • Create an index on all_text for speed
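
One way the export and indexing steps might look, assuming pandas and the standard `sqlite3` module. The column list `TEXT_COLUMNS`, the table name `trials`, and the function name are assumptions for illustration. The recruiting label is left out here.

```python
import sqlite3

import pandas as pd

# Hypothetical text columns; the project's real schema is not shown here.
TEXT_COLUMNS = ["brief_title", "condition", "keyword"]


def build_working_db(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    """Add an all_text column, export the dataframe, and index all_text."""
    df = df.copy()
    # Join the searchable fields into one lower-cased string per row.
    joined = df[TEXT_COLUMNS].fillna("").agg(" ".join, axis=1)
    df["all_text"] = joined.str.lower().str.strip()
    df.to_sql("trials", conn, if_exists="replace", index=False)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_trials_all_text ON trials (all_text)"
    )
    conn.commit()
```

Note that a regular SQLite index helps equality and prefix lookups. A `LIKE '%term%'` search with a leading wildcard cannot use it, so timing the real queries is worthwhile.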

Notebooks:

  • Notebook 4: Create PostgresDB, pre-defined searches.
  • Notebook 5: Pyspark + pubmed database

clinical-trials's People

Contributors

cmdelaserna

clinical-trials's Issues

Create DB 1/2

  • Download the full clinicaltrials.gov repository
  • Parse data
  • Create working db (sqlite)

Review files names and structure

  1. Bulk download and folders setup
  2. Parse all xml files. XML_to_JSON_File.
  3. JSON file to CSV for dataframe exploration
  4. ML Pipeline

Corpus preprocessing: extract entities, etc

Function to extract entities from files. Decide whether this runs in preprocessing or in the back-end. It should be applicable to any dataframe (clinical trials, PubMed citations).

Videos: https://www.youtube.com/playlist?list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc

Libraries:

Add MeSH terms to dataframe [notebook 3, dataframe setup]

  • Context: Medical Subject Headings (MeSH) terms are assigned by an algorithm to every record on clinicaltrials.gov.
  • Test ways to extract this information, which spans a variable number (n) of fields per record, using a notebook and a limited number of random files.
  • Why: key information to map diseases and networks.
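
A small sketch of one extraction approach that copes with a variable number of MeSH fields per record. The element paths in `MESH_PATHS` are assumptions about the XML layout and should be confirmed against a few real files. The function and column names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed element paths for MeSH terms; confirm against real records.
MESH_PATHS = {
    "condition_mesh": "condition_browse/mesh_term",
    "intervention_mesh": "intervention_browse/mesh_term",
}


def extract_mesh_terms(xml_text: str) -> dict:
    """Return one list of MeSH terms per category, empty when none exist."""
    root = ET.fromstring(xml_text)
    return {
        column: [el.text.strip() for el in root.findall(path) if el.text]
        for column, path in MESH_PATHS.items()
    }
```

Storing each category as a list keeps records with zero, one, or many terms in the same dataframe column.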

Check d3 chart passing data with column as index

Branch: column_index

App.py line 91:
df_timeline = df_timeline.set_index('year_submitted')
instead of: df_timeline = df_timeline.set_index('year_submitted').reset_index()

JS data variables in Result.html line 103...
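
To see what the chart would receive in each case, the two variants can be compared on a toy dataframe. This sketch assumes the back-end serializes with `DataFrame.to_json`; how App.py actually passes data to the template is not shown here, and the `trials` column is made up.

```python
import json

import pandas as pd

# Toy data; only year_submitted comes from the issue text.
df_timeline = pd.DataFrame({"year_submitted": [2017, 2018], "trials": [12, 30]})

indexed = df_timeline.set_index("year_submitted")
flat = indexed.reset_index()

# With orient="records" the index is dropped, so the year disappears.
indexed_records = json.loads(indexed.to_json(orient="records"))
# After reset_index the year is an ordinary column and survives.
flat_records = json.loads(flat.to_json(orient="records"))
# With orient="index" the year becomes the (string) key of each entry.
indexed_by_key = json.loads(indexed.to_json(orient="index"))

print(indexed_records)
print(flat_records)
print(indexed_by_key)
```

If the d3 code reads the year from each record, the indexed variant needs `orient="index"` (or a similar orientation that keeps the index) to avoid losing it.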

Review script for generating json file

  • Add progress bar
  • Add main function at the end
  • Add dialogue/clear feedback
  • Revise the parsed tags [add more information on conditions, sponsors, and centers involved]
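
The first three items could take a shape like the skeleton below. It uses only the standard library. The function names and the placeholder handler (`str.upper`) are hypothetical and stand in for the real parsing logic.

```python
import sys


def report_progress(done, total, stream=sys.stderr):
    """Rewrite a single status line so the user sees progress."""
    percent = 100 * done // total if total else 100
    stream.write(f"\rParsed {done}/{total} files ({percent}%)")
    stream.flush()


def process_all(items, handler, stream=sys.stderr):
    """Apply handler to every item, reporting progress after each one."""
    results = []
    total = len(items)
    for position, item in enumerate(items, start=1):
        results.append(handler(item))
        report_progress(position, total, stream)
    stream.write("\n")
    return results


def main(argv=None):
    """Entry point: items come from the command line in this sketch."""
    args = sys.argv[1:] if argv is None else argv
    results = process_all(args, str.upper)  # placeholder handler
    sys.stderr.write(f"Done: {len(results)} items processed.\n")
    return 0


if __name__ == "__main__":
    main()
```

Progress and feedback go to stderr here so that stdout stays free for data output.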

Check how recruiting labels are calculated

Current state:
The issue might be in notebook script 3, at: df['verification_year'] = df['verification_date'].dt.year

Check cells under ## Change date types, extract years

Background:

  • When creating the DB, a trial whose status is 'recruiting' and that was updated within the last 3 years is labeled as "recruiting". Otherwise it is considered not recruiting.

  • Results from the sample db conflict with this rule: trials more than 3 years old are labeled as recruiting. Numbers also don't match between the timeline and phase datasets.

  • Check results in full database and review process if needed.
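
The rule described in the background can be written as a small function, which makes it easier to test edge cases. This is a sketch under assumptions: the function name, the label strings, and the calendar-year reading of "within the last 3 years" are all choices made here, not confirmed project behavior.

```python
from datetime import date

RECRUITING_WINDOW_YEARS = 3  # assumed meaning of "within the last 3 years"


def recruiting_label(overall_status, verification_year, today=None):
    """Label a trial as recruiting only if its status and recency both agree."""
    today = today or date.today()
    # A missing date gives None, or NaN after .dt.year; NaN never equals itself.
    if verification_year is None or verification_year != verification_year:
        return "not recruiting"
    is_recent = today.year - int(verification_year) <= RECRUITING_WINDOW_YEARS
    is_recruiting = (overall_status or "").strip().lower() == "recruiting"
    return "recruiting" if is_recruiting and is_recent else "not recruiting"
```

Missing verification dates are worth checking first, since `.dt.year` yields NaN for them and any comparison against NaN is false.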

Related: #68

3. New ML Pipeline

Transformations
Standardized transformations

ML classification
Test several models for unsupervised classification using condition labels and keywords?

Extract value from data
Pending

Ranking of pre-defined searches

Test it with a selection of trials (see use cases #43)

First: implement TF-IDF similarity
https://www.datacamp.com/community/tutorials/recommender-systems-python
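
As a starting point, TF-IDF similarity can be prototyped without any dependency. The sketch below is a from-scratch version with a smoothed IDF, not the approach from the linked tutorial and not a library implementation. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Build one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_index, documents):
    """Rank all other documents by similarity to documents[query_index]."""
    vectors = tfidf_vectors(documents)
    scores = [
        (i, cosine_similarity(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```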

https://www.scikit-yb.org/en/latest/api/text/tsne.html

Use KNN to classify search results into groups, using several indicators (e.g., TF-IDF score, phase, year, citations?, subsequent studies with the same component?)
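
For reference, a k-nearest-neighbours vote fits in a few lines. This sketch assumes each search result has already been turned into a numeric feature tuple and that some results carry known group labels. The features and labels in the usage note are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

For example, with points given as (tf-idf score, scaled phase) pairs, `knn_predict([(0.9, 0.2), (0.8, 0.3), (0.1, 0.9)], ["early", "early", "late"], (0.85, 0.25), k=3)` selects the label held by most of the three nearest points. Indicators on very different scales, such as year versus TF-IDF score, would need rescaling first, otherwise the largest one dominates the distance.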

An introduction to clustering: https://towardsdatascience.com/an-introduction-to-clustering-algorithms-in-python-123438574097

Analyzing time series
https://towardsdatascience.com/analyzing-time-series-data-in-pandas-be3887fdd621

Refactor how back-end serves data for charting

The back-end returns two JSON files:

  • Phase data with all results and recruiting only results
    Reformat data in app.py
    Connect phase chart to new columns

  • Timeline data with all results and recruiting only results
    Reformat data in app.py
    Connect timeline chart to new columns
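
One possible shape for the two payloads, where each row carries both the count over all results and the recruiting-only count. The field names (`phase`, `year_submitted`, `recruiting`, `all`) and the function name are assumptions. The real reformatting in app.py may differ.

```python
import json
from collections import Counter


def count_by(trials, key):
    """Count trials per value of key, for all results and recruiting only."""
    all_counts = Counter(trial[key] for trial in trials)
    recruiting_counts = Counter(
        trial[key] for trial in trials if trial["recruiting"]
    )
    return [
        {key: value, "all": all_counts[value], "recruiting": recruiting_counts[value]}
        for value in sorted(all_counts)
    ]


def build_chart_payloads(trials):
    """Return the phase and timeline payloads as two JSON strings."""
    phase_json = json.dumps(count_by(trials, "phase"))
    timeline_json = json.dumps(count_by(trials, "year_submitted"))
    return phase_json, timeline_json
```

Because both charts are built from the same filtered list of trials, their totals should agree by construction.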

Script 3: dataframe setup

  • Import JSON file, clean and rename columns
  • Add date columns
  • Select data since 2008
  • Export data as csv
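
A minimal pandas sketch of these four steps. The source column `study_first_submitted`, the derived `year_submitted` column, and the function name are assumptions. The real script may use other columns.

```python
import pandas as pd


def setup_dataframe(records, since_year=2008):
    """Clean column names, add a year column, and keep recent trials."""
    df = pd.DataFrame(records)
    # Normalize names such as "Study First Submitted" to snake_case.
    df = df.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))
    # Unparseable dates become NaT instead of raising an error.
    df["study_first_submitted"] = pd.to_datetime(
        df["study_first_submitted"], errors="coerce"
    )
    df["year_submitted"] = df["study_first_submitted"].dt.year
    # Rows with a missing year fail the comparison and are dropped.
    return df[df["year_submitted"] >= since_year].reset_index(drop=True)
```

The result can then be exported with `df.to_csv("trials.csv", index=False)`, where the file name is a placeholder.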
