Giter VIP home page Giter VIP logo

synthnotes's Introduction

Overview

This is the repository for the beta version of SynthNotes, a clinical note generation tool for researchers seeking to make their own datasets for studying natural language processing. This is currently a beta release; a prototype for development and proof of concept.

There are two options for installation and running: either locally only, or distributed. Regardless, note generation happens locally and for the time being will be constrained by your local machine's RAM. For now, it is suggested to install locally and not rely on the distributed tools like HDFS and RabbitMQ.

Installation

Clone the repository:

$ git clone https://github.com/ebegoli/SynthNotes.git

Python Environment

SynthNotes uses Python 3.6. Make sure you have a compatible version of Python installed. It is also recommended to use some kind of virtual environment tool such as virtualenv or anaconda.

$ pip install -e .

Data Files

SynthNotes beta is built off of clinical notes in the Mimic-III database. In order to run SynthNotes, you will need to get a license for Mimic and install it somewhere locally. To get access to mimic data see you may find the steps for that process here. It isn't necessary to build the entire Postgres database, but you at least need the note events table. It is enough to just download that csv file. For running cTAKES you will need to extract out the note text which is easily done using pandas [1]. For preprocessing, the software expects a parquet file. Again, this can easily be done with pandas by first reading the csv file and then writing it out using parquet[2]. The supplemental script extra/csv_to_parquet.py can perform this task as well.

Parsing also needs to know where the cTAKES output xml files are. This filepath is passed as a command line argument. See the help page for information on those variables. Note that the directory containing the cTAKES output must contain ONLY these xml files.

$ synthnotes parse --help

The various scripts can take specific directory targets as arguments, but have default values throughout the entire pipeline. If you place your XML files in data/xml_files/ and the parquet file in data/note-events.parquet, you can run the pipeline with no filepath arguments. If you do, generated documents will be placed in data/generated_notes/.

Running

The pipeline runs as follows;

  • parse
  • preprocess
  • cluster
  • generate

Once clustering is complete, generation can be run as many times as desired without rerunning any prior stages. To avoid overwriting previous outputs, it is recommended that you specify different output directories for subsequent runs.

As mentioned above, cTAKES must first be run on the mimic notes data and that output stored in a place where the python program can access it. That filepath should be passed to the command line arguments. Alternatively, the cTAKES output can be placed in the default input directory data/xml_files/. You may pass --help to the command line program to see more information on those variables.

Once cTAKES has been run and the xml assembled into a single directory, then the parser must be run.

$ synthnotes parse --xml_dir path/to/ctakes/xml/files \
--output path/to/store/extracted/parquet/data \
--fs [local | hdfs]

Once the parsing is complete, a set of parquet files containing the extracted xml information will be available in the --output directory. Those parquet files will be used for preprocessing the data for clustering and generation.

To run preprocessing run:

$ synthnotes preprocess --pq_files parquet/files/path \
--output path/to/output/dir \
--mimic_notes path/to/mimic-notes/parquet/file

To perform clustering run:

$ synthnotes cluster --pq_files path/to/preprocessed/parquet/files \
--output path/to/output/dir

--pq_files for clustering should be the --output from the preprocessing stage.

Finally, note generation can be done with:

$ synthnotes generate -n 100 \
--pq_files path/to/clustered/parquet/files \
--output path/to/output/dir

It is recommended that you begin by generating a small number of notes, such as 2 or 5, to get a feel for how long this process takes on your machine.

synthnotes's People

Contributors

ebegoli avatar groppcw avatar kbrown42 avatar krisb88 avatar sudarshan85 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

synthnotes's Issues

No Predicates Output Folder from Parsing

The preprocessing python script fails since there is no predicates folder from the parser output. From the preprocessing.py file it looks like there used to be one.

Test the version 1.0

Test the version 1.0 for the quality of the content, performance and volume.

Assess the quality of generated notes, an ability to generate significant volume (100Ks+), and how long it takes and that it completes successfully (no memory leaks, etc.).

Write unit tests for this. They can be simple.

Release the version 1.0

Release the version 1.0 (templated, as published in a paper), and with completed test and runtime documentation.

Run all tests.

Complete and check documentation.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.