nih_reporter

Using NIH RePORTER data as a machine learning playground for Databricks, NLP, Azure tools, and collaborative development

Stream Labels

This repo is intended to contain multiple streams (sub-projects or research ideas). Each stream has a unique label, which is used as a directory name so that a stream's files can be matched across directories. The label "shared" is reserved for code and features common to all streams.

Stream labels should also be used as branch names to aid code management.

Key directories

  doc/                          - documentation
  src/                          - source code
    |_  pipelines/[stream]/     - data / ML pipelines
    |_  notesbooks/[stream]/    - exploratory / experimental notebooks
    |_  utils                   - utility scripts
  test/                         - code for unit or regression testing
    |_  [stream]/               - organised by stream
  out/[stream]/                 - small output files (e.g. plots) generated by the code
  data/[stream]/                - small resources or files used by your program
  models/[stream]/              - saved models for deployment
  README.md
  requirements.txt              - use if applicable

Note: Large files (say, > 1 MB) should reside in an external file system such as Databricks DBFS or OneDrive.
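
As an illustration of the layout above, the per-stream directories for a new stream could be created with a small helper. This is a minimal sketch, not part of the repo: the function name scaffold_stream and the STREAM_DIRS list are hypothetical, and the directory names are copied as they appear in the listing above.

```python
from pathlib import Path
import tempfile

# Per-stream directories, copied from the "Key directories" listing.
# (Hypothetical constant; the repo itself does not define it.)
STREAM_DIRS = [
    "src/pipelines/{stream}",
    "src/notesbooks/{stream}",
    "test/{stream}",
    "out/{stream}",
    "data/{stream}",
    "models/{stream}",
]


def scaffold_stream(repo_root, stream):
    """Create every per-stream directory under repo_root and return the paths."""
    created = []
    for template in STREAM_DIRS:
        path = Path(repo_root) / template.format(stream=stream)
        # parents=True builds intermediate directories; exist_ok=True makes
        # the call safe to repeat for a stream that already exists.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstrate against a throwaway directory rather than a real checkout.
with tempfile.TemporaryDirectory() as repo_root:
    created = scaffold_stream(repo_root, "databricks_ELT")
    relative_paths = [p.relative_to(repo_root).as_posix() for p in created]
    all_exist = all(p.is_dir() for p in created)

print(relative_paths)
```

Git does not track empty directories, so in practice each new directory would need at least one file before it shows up in a commit.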

Notes for contributors

  1. FORK: Create a fork from the main repo [jtjli/nih_reporter] unless you want to develop on top of an existing fork.
  2. BRANCH: Use a branch name that is representative of your development, such as a Stream Label. Avoid developing on the main branch.
  3. Create a Pull Request when your code is ready for merging into the main repo.
  4. Wherever appropriate, use Stream Labels as section headings in files such as .gitignore, the global requirements.txt, and README.md.

[stream] databricks_ELT

Databricks notebooks for ingesting data into Delta tables. Concepts include SQL, Spark DataFrames, schemas, PySpark, pandas, upserts, and the Databricks CLI.

See:

  src/pipelines/databricks_ELT
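
For readers new to the upsert concept listed above, the sketch below shows the semantics: insert rows whose key is new, update rows whose key already exists. It is not the code in this stream. It uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere, whereas on Databricks the same idea is normally expressed against a Delta table with a MERGE INTO statement. The table name projects and the columns project_id and title are hypothetical.

```python
import sqlite3


def upsert_rows(conn, rows):
    """Insert new rows and update existing ones, matched on project_id."""
    conn.executemany(
        """
        INSERT INTO projects (project_id, title)
        VALUES (?, ?)
        ON CONFLICT(project_id) DO UPDATE SET title = excluded.title
        """,
        rows,
    )
    conn.commit()


# In-memory database standing in for the target table (assumption:
# the real pipeline writes to a Delta table instead).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (project_id TEXT PRIMARY KEY, title TEXT)")

# First load: both rows are new, so both are inserted.
upsert_rows(conn, [("P1", "Old title"), ("P2", "Second project")])

# Second load: P1 already exists and is updated; P3 is new and is inserted.
upsert_rows(conn, [("P1", "New title"), ("P3", "Third project")])

result = conn.execute(
    "SELECT project_id, title FROM projects ORDER BY project_id"
).fetchall()
print(result)
```

The ON CONFLICT clause requires SQLite 3.24 or later. Rerunning the same load leaves the table unchanged, which is the property that makes upserts convenient for repeated ingestion.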
