
carpentries-incubator / julia-data-workflow


Learn Julia workflows for data-intensive research

Home Page: https://carpentries-incubator.github.io/julia-data-workflow/

License: Other

Ruby 0.50% Makefile 3.95% R 5.18% Shell 0.37% Python 35.30% HTML 42.40% SCSS 7.77% CSS 3.26% JavaScript 1.27%
lesson pre-alpha julia-language english carpentries-incubator julia

julia-data-workflow's Introduction

Data-intensive research workflows in Julia

Lesson

Learn how to use Julia to enable your data-intensive scientific research.

It can be a challenge to know where to start when developing a scalable and reproducible workflow for your data-intensive computations. The Julia programming language is notable for enabling researchers and analysts in diverse domains to get a handle on this challenge. Use this lesson to learn how to start implementing effective scientific computing workflows using Julia.

Contributing

We welcome all contributions to improve the lesson! Maintainers will do their best to help you if you have any questions, concerns, or experience any difficulties along the way.

We'd like to ask you to familiarize yourself with our Contribution Guide and have a look at the more detailed guidelines on proper formatting, ways to render the lesson locally, and even how to write new episodes.

Please see the current list of issues for ideas for contributing to this repository. Look for the tag good_first_issue, which indicates that the maintainers will welcome a pull request fixing that issue. For contributions, we use the GitHub flow, which is nicely explained in the chapter Contributing to a Project in Pro Git by Scott Chacon.

Maintainer(s)

Current maintainers of this lesson are

Authors

A list of contributors to the lesson can be found in AUTHORS

Citation

To cite this lesson, please consult CITATION

julia-data-workflow's People

Contributors

jd-foster, tobyhodges, zkamvar


julia-data-workflow's Issues

Learning objectives

Moving a discussion from PM with @jd-foster on the Julia Discourse to here. We seem to be in agreement that the actual domain is less important than the specific learning objectives. I wrote (partly paraphrasing @jd-foster in places):

In very general terms (and this is somewhat reiterating what you already said), I think

  • I/O for a standard data format like JSON or CSV, and I/O for some other non-standard data
  • Organizing data in tabular (DataFrame) form, and as arrays or Dicts. In my field, I often use some custom structs, though I don’t know how universal that is
  • Some basic stats and descriptions of data
  • Some basic visualisations

Here's a first draft of formalizing these into concrete (though high-level) learning objectives:

After completing this course, students will be able to...

  1. Utilize existing Julia libraries to read files in standard data formats (e.g. CSV and JSON) into suitable data structures, and to write those structures back out to files
  2. Make use of general I/O and string-manipulation utilities to read data in non-standard formats into basic data structures such as arrays and dictionaries.
  3. View, describe, and manipulate numerical and text data in a tabular format using DataFrames.jl
  4. Calculate statistical summaries of numerical data
  5. Generate visual representations of data using an existing plotting library.
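
As a rough illustration of what objectives 2 and 4 might look like in practice, here is a minimal sketch that uses only Julia's standard library. The `key=value` record format, the sample values, and the `parse_record` helper are all hypothetical, invented for this example; they are not part of the lesson. Objectives 1, 3 and 5 would rely on external packages and are not shown.

```julia
using Statistics  # standard library: mean, median

# Hypothetical records in a non-standard "key=value; key=value" format.
raw = """
site=A; temp=21.5; humidity=40
site=B; temp=19.0; humidity=55
site=C; temp=23.5; humidity=35
"""

# Parse one line into a dictionary mapping field names to raw strings.
function parse_record(line::AbstractString)
    record = Dict{String,String}()
    for field in split(line, ';')
        key, value = split(strip(field), '=')
        record[String(key)] = String(value)
    end
    return record
end

# One Dict per non-empty line of input.
records = [parse_record(line) for line in split(strip(raw), '\n')]

# Pull one numeric field out of every record as a Vector{Float64}.
temps = [parse(Float64, r["temp"]) for r in records]

println("n      = ", length(temps))
println("mean   = ", mean(temps))
println("median = ", median(temps))
```

A vector of dictionaries is used here only because it needs no extra packages; a tabular structure would be the more natural target once DataFrames.jl is introduced.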

Select software tools

While lower in priority than #1 or #2, we should decide on the software mode of delivery for the lesson. Each complete SC or DC lesson requires some setup from the learner in terms of an interface (GUI or otherwise), browser or manager to facilitate the lesson ideas. This should not be (but sometimes is) a major barrier to entry for the learner, so the advantages and drawbacks of the required software should be weighed carefully against getting started easily and eventually enabling good practices.

Here are a few options to consider:

  1. Julia REPL via terminal: most direct and immediate start, can be intimidating as "just a prompt".
  2. Jupyter notebook: harder to set up, but enables a better learner feedback loop in the process. Adds another layer of "cognitive load" for a new learner, who has to learn the Jupyter interface and Julia at the same time.
  3. VS Code with Julia extension: again, much more to set up initially than 1 or 2, but may pay off in terms of automation of code editing, documentation and plot integration. Might be more natural to those used to a browser/app mode of interface.

Any I've missed?

Select domain / dataset

Though (in my opinion) less important than #1, we nevertheless need to identify what knowledge domain we will use as a backbone for the lesson. In PM, I wrote:

I think we should spend the bulk of our energy on solidifying what we want to teach, the actual domain is almost irrelevant. That said, I think features of the data should be:

  • Accessible, by which I mean almost everyone can understand what the data is with minimal context. Even if it’s domain-specific, some things are more understandable to outsiders than others
  • Inclusive, which is related to the first point, but distinct. E.g. everyone can understand wins/losses of sports teams, but not everyone is into sports. We should try to find something that has broad appeal
  • Evergreen - one thought I had was that doing something with coronavirus data would be cool, but this may feel a bit less engaging 5 years from now

Despite that last point, a coronavirus-based project could definitely fulfill the first 2, especially since it could include both epidemiological and biological (sequencing) data types. I'm a little biased here though, as this is my field and I'm already developing some other materials along this line that I could double-dip :-D

Other possibilities:

  • There are lots of potential public-health type datasets. I was recently engaging with police violence datasets compiled by various organizations, though this may be considered too political and/or triggering. There are also lead exposure / air quality datasets I'm aware of
  • Climate / weather data is very relatable and evergreen.
  • Lots of potential ideas from https://www.wikidata.org
