go's Introduction

Note from the Editor: Take Two

In the old days of 2013, the OSDSM was born. Then, there were "little to no Data Scientists with 5 years experience, because the job simply did not exist." (David Hardtke, Nov 2012) Since then, history has witnessed many things, including:

• Data Scientists working across industries and the world
• social media manipulation disrupting many elections
• BLM and #metoo and Extinction Rebellion and many other social movements
• machine learning beginning to fall under the engineering domain
• a pandemic
• climate change disasters becoming very frequent while the climate warms faster than predicted
• remote work becoming common
• multiple global recession shocks

In that decade, Data Science has seen growth of jobs, shortfall of goals, success in many industries, abject failure in others, and nefarious use cases. In particular, adverse consequences and complications of learning from data appear in too many examples: elections undermined by psychographics, dismal gender (Men=74%) and BIPOC diversity in the AI field, a revived eugenics, an explainability crisis, facial recognition used to identify people and systematically detain them, "aggression" detection microphones in schools, and many others. It has never been more clear that we need to talk about the real world impacts of our work, and consider how our creations are used. As you consider this, read a prescient novel that grapples with the consequences of birthing, of creation, of technology.

Like any tool, data-driven technologies are indifferent to the morality of their ends. Perhaps the greatest risk of all is leaving this tool in the hands of the few expensively-educated people who cannot possibly represent all of us. To balance this, open source movements seek to lower the barriers to education for everyone. Data science and data literacy must be widespread, accessible, and leveraged for building our collective future. More than ever, we need that future to be built by members of society who are diverse and focused on generative, sustainable, resilient, emergent solutions. After all, the things we build are mirrors of ourselves (seriously, read Shelley's Frankenstein).

Computers reflect the biases and belief systems of the people programming them -@alicegoldfuss

The OSDSM is built with the belief that open source education makes a diverse, collective, generative future-building possible. I hope that you are one of the next people -- whether you call yourself a Data Scientist or not -- to help make better decisions with the scientific process, critical thinking, and everything else your unique perspective brings to the table. This rewritten curriculum focuses on what is needed to be successful in the entry-level role, but that is just a generic outline; truly, I hope where you take it extends far beyond that.


Start here 👇

The Open Source Data Science Masters

The open-source curriculum for learning to be a Data Scientist. Curriculum resources from both universities and working Data Scientists focus on foundational theory and applied skills. The OSDSM is collectively maintained and open to PRs.

The goal of this curriculum is to prepare the student for an entry-level Data Scientist role, using open source materials, at no cost but with the same caliber of materials found in the most reputable paid programs. Books not offered for free are often available through a public library, and are indicated here with their current list price. The Masters is self-guided and self-accredited. To better support credibility, the structure now includes a Capstone project intended to demonstrate the student's problem solving approach, skills in execution, and communication. Upon completion, students can award themselves a Credential on LinkedIn from the Open Source Data Science Masters. As with all things, the OSDSM is best played as a team sport (try finding people on r/learndatascience).

This is called a "Masters" because it is primarily concerned with "upper-level" college course material in mathematics, programming, economics, or related disciplines. Come as you are!

  1. 📖 The Core - This is a critical foundation for what is to come; don't skip the foundational lessons.
  2. ❄️ Specialty - Choose what is most interesting to you, or most relevant to the work you plan to do.
  3. 🤝 Doing Data Science - Learn about how doing science with others and for businesses can work.
  4. 🧑‍💻 Capstone Project - Choose a meaningful project or dataset to demonstrate what you've learned.

📖 The Core

This is a critical foundation for what is to come; don't skip!

What is Data Science?

One could argue that "Data Science" is a recent term for an already existing information analysis discipline. Humans instinctually search for patterns, a purpose we also see in this more digitized discipline. Read different sources (and search beyond this list) about the uses of data science.

  • The Signal and The Noise / Nate Silver Book $18 -- Narrated cases of Data Science at play in the real world.
  • Dataclysm: Who We Are (When We Think No One's Looking) / Christian Rudder Book $17 -- From the inside of OKCupid, real examples of how data science can illustrate human behavior.
  • Informatics of the Oppressed / Rodrigo Ochigame Logic Magazine -- Algorithms of oppression have been around for a long time. So have radical projects to dismantle them and build emancipatory alternatives.

Foundations of Data Science

Problem Solving

When there are no answers in the back of the book, how do you proceed? Breaking down problems is a skill, one that can and should be learned. Follow Pólya's process, and for extra credit, seek out resources on computer science decomposition.

The Scientific Process & Experimentation

It is crucial as a Data Scientist that you show integrity in and transparency of scientific process. Even if you've been here before, review and draw out the process diagram for the scientific method.

Querying Data

Get familiar and comfortable with manipulating data in a database with a common relational querying language. There are diverse query languages, but SQL is a widely used foundation.
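
As a small taste of relational querying, here is a minimal sketch that runs SQL from Python using the standard library's `sqlite3` module. The `orders` table, its columns, and its rows are invented for illustration; they do not come from any resource in this curriculum.

```python
import sqlite3

# In-memory database; the table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 20.0), ("ben", 40.0), ("ana", 15.0), ("cy", 5.0)],
)

# Total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

for customer, total in totals:
    print(customer, total)
```

`GROUP BY` collapses the rows for each customer into one, and `SUM` aggregates within each group. The same pattern (filter, group, aggregate, sort) covers a large share of everyday analytical queries.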

Math & Statistics

Calculus

Linear Algebra

The foundational mathematics for working with large samples of data. Spend time in exercises until you feel highly confident in the key topics of Linear Algebra. It will serve you well.
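
Two operations you will meet constantly are the dot product and the matrix-vector product. The sketch below writes them out in plain Python so the arithmetic is visible; the function names `dot` and `matvec` are illustrative, and in practice you would likely reach for a numerical library instead.

```python
def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: one dot product per row."""
    return [dot(row, vector) for row in matrix]


A = [[2, 0],
     [1, 3]]
x = [4, 5]

y = matvec(A, x)
print(y)  # [8, 19]
```

Each entry of the result is the dot product of one row of `A` with `x`: 2*4 + 0*5 = 8 and 1*4 + 3*5 = 19.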

Statistics

How can we answer questions with data? Everywhere you look, you'll see methods from statistics. Spend a lot of time here!
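
To make "answering questions with data" concrete, here is a minimal sketch of an exact permutation test: it asks how often a difference in group means at least as large as the observed one would appear if the group labels were assigned at random. The two small samples are invented for illustration, and a permutation test is only one of many methods you will encounter.

```python
from itertools import combinations
from statistics import mean

# Made-up measurements for two groups (illustrative only).
group_a = [12, 15, 14, 16]
group_b = [10, 9, 11, 13]
observed = mean(group_a) - mean(group_b)

pooled = group_a + group_b
n_a = len(group_a)
indices = range(len(pooled))

extreme = 0  # relabelings with a difference at least as large as observed
total = 0    # all possible relabelings
for chosen in combinations(indices, n_a):
    chosen_set = set(chosen)
    relabeled_a = [pooled[i] for i in chosen_set]
    relabeled_b = [pooled[i] for i in indices if i not in chosen_set]
    total += 1
    if mean(relabeled_a) - mean(relabeled_b) >= observed:
        extreme += 1

# One-sided p-value: share of relabelings at least as extreme as the data.
p_value = extreme / total
print(f"observed difference = {observed}, p = {p_value:.3f}")
```

Enumerating every relabeling is only feasible for tiny samples; with larger data, the usual approach is to sample random relabelings instead.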

Working in Python

Learn Python

If you're starting from scratch with Python, start with this series.

Environment & Libraries

Set up your computer to use tools locally.

Data Analysis

Get familiar with using tools to do data analysis. Pro tip: Write out what you're going to do before you do it! When you hit a snag, return to your plan and rechart as necessary.

Python Programming + Algorithms

How does a computer know what to do? Algorithms are instructions with a fancy name. Learn how instructions are encoded, how to think about structuring those instructions, and patterns for making it work in code.
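
As one example of instructions structured into an algorithm, here is a minimal sketch of binary search, which finds a value in a sorted list by repeatedly halving the range it has to look at. The function name and sample data are illustrative.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or None if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return None


numbers = [1, 3, 5, 7, 9, 11]
print(binary_search(numbers, 7))  # 3
print(binary_search(numbers, 4))  # None
```

Because the search range halves on every step, the number of comparisons grows with the logarithm of the list length rather than with the length itself.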

Survey Courses

Courses with many of the topics above included. Be sure you fill in any gaps!

  • Intro to Data Science / University of Washington Lectures
  • (Short Survey) Doing Data Science: Straight Talk from the Frontline O'Reilly / Book $50

❄️ Specialty: Choose 2

Choose what is most interesting to you, or most relevant to the work you plan to do.

Causation

A branch of statistics that uses graphical models and specialized statistics to describe and model cause and effect.

Natural Language Processing

The imperfect and immensely useful art (science?) of transforming human language into data.
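
One of the simplest ways to turn language into data is a bag-of-words representation: count how often each word appears in each document. The sketch below uses only the standard library; the two sentences and the deliberately crude tokenizer are assumptions for illustration, and real pipelines handle punctuation, casing, and morphology far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


docs = ["The cat sat on the mat.", "The dog sat."]
counts = [Counter(tokenize(doc)) for doc in docs]

# Shared vocabulary, sorted so every document uses the same column order.
vocab = sorted(set().union(*counts))

# One count vector per document; missing words count as 0.
vectors = [[count[word] for word in vocab] for count in counts]

print(vocab)
for vector in vectors:
    print(vector)
```

Word order is thrown away entirely, which is exactly the kind of imperfection the description above hints at, yet these count vectors are often enough for tasks such as simple document classification.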

Graph Analysis

Human relationships can be modeled as a network or graph. Many other things suit this model, too. Working with graphs
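
To illustrate modeling relationships as a graph, here is a minimal sketch that stores a friendship network as an adjacency dictionary and uses breadth-first search to find the shortest chain of introductions between two people. The names and connections are invented, and dedicated graph libraries offer this and much more out of the box.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already visited
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from goal to start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical friendship network (undirected: each edge is listed both ways).
friends = {
    "ana": ["ben", "cy"],
    "ben": ["ana", "dee"],
    "cy": ["ana"],
    "dee": ["ben", "eli"],
    "eli": ["dee"],
    "flo": [],
}

print(shortest_path(friends, "ana", "eli"))  # ['ana', 'ben', 'dee', 'eli']
print(shortest_path(friends, "ana", "flo"))  # None
```

Breadth-first search explores the network one "degree of separation" at a time, so the first path it finds to the goal is guaranteed to use the fewest edges.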

Machine Learning

This is a huge space with infinite things to learn. For advanced statistical foundation, see The Elements of Statistical Learning.

Visualization

The most persuasive data stories are ones you can see with your own eyes. Make it visual!

Courses

Books

Linear Programming + Convex Optimization

If you have interest in operations management, manufacturing, supply chains, or other real world queuing problems, dig in here.

Deep Learning / Neural Networks

🤝 Doing Data Science

Learn about how doing science with others and for businesses can work.

What is the job?

In ideal terms, a Data Scientist advises strategic decision-making using data-backed analysis and tested hypotheses. YMMV as this depends on the company needs and the team being supported.

Communication and Teamwork

For a Data Scientist's work to be impactful, they must be effective at communicating their work and findings. In any setting, clear logic and effective business writing are crucial to reaching your audience. And of course, doing Data Science with a team over Zoom is different from being in person in an office. There is much more written communication and asynchronous consumption of content in the remote office environment. More than ever, writing and communication skills are crucial to being an effective Data Scientist for yourself and your team.

  • LEADERSHIP LAB: The Craft of Writing Effectively UChicago / Video. Recommend watching this twice and taking notes.

The Data Scientist works in a Team

In the modern organization, it is very rare that a Data Scientist works in isolation. Communicating the value of the work being done is crucial to getting buy-in from partners whose decisions and operations depend on your work. Those partners might be:

  • Product Management
  • Engineering
  • Design (User Experience, Research, Product)
  • Operations (Project Management, Customer Service Agents, Data Management)
  • Marketing
  • Finance Operations
  • etc.

Typically, the more clearly you are able to communicate the "why", the value of what you are doing, the more these teams will be able to support you and your work in conversations you may not be a part of. Even if others don't understand "how" you do your work (which is very important to you and your manager!), they will be able to understand and repeat a well-communicated "why". This is why we write Specs, to get buy-in and allow for questions or input, before the work starts.

The Spec

A document conveying the motives, direction, investment, and expected value of the work.

  • Goal / "Why" -- What is the point of this work? What decision is the organization trying to make?
  • Impact -- What decisions might be made differently as a result of this work? What is the expected value?
  • Data -- What evidence will this draw on?
  • Assumptions -- What evidence does not exist? What assumptions are necessary or agreed upon?
  • Methods / "How" -- Overview methods expected to be used. Analysis, with what tools? Experimentation, with what methodology?
  • Results -- (to be filled in as completed)

Results Presentation

A slide deck or document with the goal of conveying the results of the work and how the findings support an important decision(s).

Best appended to the Spec, and summarized in a slide deck for easy consumption. Depending on the culture of the group, slides or a short document may be easier to look through to understand the results of the work. In the remote work era, think about how your work will be passed around and make sure your "above the fold" is easy to understand and clearly conveys the "why" and results in particular.

Example: A particularly polished presentation of map quality study results showing higher data quality in US maps on OSM than commercially available alternatives. The impact of this work was a) increased confidence in service reliability for the company and b) enabled the company to decide against buying a commercially available annual license costing millions of dollars annually.

🧑‍💻 Capstone Project

Choose a meaningful project or dataset to demonstrate what you've learned.

Pick a dataset that you care about

Formulate a Hypothesis & Write a Spec

Review the earlier reading on The Scientific Process. Formulate a clear, concise hypothesis. This is the headline of your Spec; flesh out the rest of the Spec around it.

Show your work + Explain why you chose this project

Show the process you used to disprove your hypothesis, preferably in a Jupyter notebook. See examples to get a taste of how you can showcase your work.

Graduate!

  1. Create a document or GitHub repo showcasing the list of courses and materials you completed. Include your project materials. Also recommended: include a personal statement about why you chose this course of study and what you seek to do with it.
  2. Award yourself a Credential on LinkedIn from The Open Source Data Science Masters, with a link to the documentation you created.
  3. Congratulations! 🎉

So Extra "Extracurriculars"


Take Two Change Log

  1. Restructured à la the 2022 Plan.
  2. Pruned broken links. It's been a while, and some of these resources have moved -- or worse -- been taken down.
  3. Pared down links to a more opinionated list.
  4. Proceeds. Bookshop.org links for all books, which supports independent bookshops with commissions. Since the first commits in 2014, I have donated any related commissions to Planned Parenthood, which was one of the few healthcare providers in my community growing up and is the largest single provider of reproductive health services in the US. Though donations should flow to independent bookshops from now on, my personal commitment to PP remains.

Please Contribute; this is Open Source!

Fearless Maintainer: @clarecorthell

RIP v1.0 commit

go's People

Contributors

aaronjbecker, acquayefrank, byrnenick, clarecorthell, dawny33, florianbuetow, gnperdue, harjotsinghparmar, kressaty, lgeorge, mikezawitkowski, mm-, mminar, nathanepstein, nathantypanski, niangaotuantuan, omnipresent, phuongdoan13, ptwobrussell, rajeshwerkushwaha, scdavis50, seakun, shaunmccarthy, siyaoxu, srinify, ssaeger, stefsy, stevenmaude, tonyfischetti, westurner

go's Issues

Brilliant UW Coursera course has gone!

I was working my way through the first course which was brilliant! It was an excellent level for beginners but now the link points to a new intermediate course which is unfortunately not free.

Recommendation and clarification on getting started

I have gone through this curriculum and found it very very helpful. I really appreciate your effort of gathering all these for everyone.
Looks like a lot to cover, but I believe I can scale through it all. Everything seems very clear from the middle; mostly after the Maths section. My main issue is on getting started. On the first part (start here), are you to pick one of the 3 options to go on? Or you go through all the courses there and focus on the particular topics provided? Or everything generally?
Same thing applies to the Maths sections. Pick one or two books or go through them all?

A clear description or recommendation on how to go about it could be of better help, so as to focus on the important things and not just beating around the bush.

The computing section is very clear as there is an average of about 2 resources to go through on each aspect of computing. Just the intro seems a little confusing, as each one of them offers almost the same thing.

Resourceful

I'm glad to finally find real-life application of my math skill

New link to Hardtke quote source

On the home page of http://datasciencemasters.org/ there is a quote followed by a citation

David Hardtke How To Hire A Data Scientist 13 Nov 2012

And that citation has a bitly link embedded in it...

http://bit.ly/howtohireadatascientist

...and that bitly link resolves to...

http://blog.bright.com/2012/11/13/how-to-hire-a-data-scientist/

..which is broken.

(FWIW I see that this link was removed entirely from the README.md page)

The link is not lost, however, and the new link to that same article is here:

https://brightemployers.wordpress.com/2012/11/13/how-to-hire-a-data-scientist/

I'd recommend updating the README.md page, and also the datasciencemasters.org page so that the quote points to the new link, but the less attractive alternative would be to remove the broken link from datasciencemasters.org

Two lab-heavy data classes on github

(Full disclosure: I was one of the creators of each of these courses)

This is a class that we taught at MIT called "From ASCII to Answers: Advanced Topics in Data Processing". The course website is http://db.csail.mit.edu/6.885/, but the entirety of the lectures and 8 labs are available on github: https://github.com/mitdbg/asciiclass/

Another class that Eugene Wu and I taught on Data Literacy: http://dataiap.github.io/dataiap/. Content is here: https://github.com/dataiap/dataiap. This class was more introductory, and gives students six three-hour labs to walk them through data cleaning/visualization/statistics/text processing/mapreduce in Python.

If these can be of any help here, let us know!

Are you still maintaining this repo?

This is a wonderful place to look for resources about data science education.
I really hope that the author is still active because some links are broken in the README. There are quite a number of pull requests that have not been addressed, and I wonder if I should contribute to this repo.

Harvard Videos Dead Now Too

The Harvard videos won't work for me at all--they give error messages on trying to watch. The slides are still ok, but the videos don't work. Are there system specific issues with this? I was only able to test them on OS X.

Two more graph libraries

Hi!
These two links are really recommended libraries for graph processing as you already mentioned NetworkX. We worked with both of them with graphs over 5M nodes without any problem.

Graph tool (fast and efficient Python library):
https://graph-tool.skewed.de/

iGraph (with many algorithms implemented, and available in C, python and R):
http://igraph.org/

Regards

On Data to Practice With

Thanks for putting this together! 3 quick thoughts on helping people find cool data to get started with:

Another category could be "dataset newsletters", as Jeremy Singer-Vine's weekly newsletter features new ones every week. http://tinyletter.com/data-is-plural/archive .

What happens after the OSDSM?

I get emails from people thanking me for the OSDSM. Many also ask what to do next, or what career they can choose after studying pieces of the curriculum.

Let's open the conversation:

  • What do you want to work on? (not job title, but the work itself)
  • What projects helped you learn?
  • What does the OSDSM lack?

Disappointing affiliate linking

It's unfortunate that a project touting "ebooks... [that are] all free and open" links almost exclusively to for-purchase texts using Amazon affiliate linking, rather than actually linking to the many free and open texts out there.

Create DataScience Toolbox

Hi,

Instead of listing all the Python, R, and other packages to install, why not create a bunch of scripts (dotfiles) that install all the packages at once on your system (or in a Vagrant box / Python virtualenv) [1]? I think it is a good idea for newbies.

See for example these two repos

Update: https://yhathq.com/products/sciencebox (instructions at https://docs.yhathq.com/sb/setup)

PS : I am currently doing this on my dotfile repo.

[1] http://datasciencetoolbox.org/

Feature Request: Contributing.md ?

I like that you note at the bottom of this that others should contribute. Perhaps we can expand upon that a little further, and add a Contributing.md file with instructions?

I'm curious about whether contributions are preferred in a certain format, say forking the repo, then making a change, and submitting it as a pull request? Or is it more open than that?

More importantly, what kind of contributions are welcome or needed? Are there areas that have been identified as needing further development? I don't necessarily agree with all of the items here, and I have come across my own books and courses that I think are better suited, but is this purely a big list that we add to, and not replace or substitute existing items?

Thanks for putting this repo together, it is very valuable and I refer to it often :)

Dead Link in README.md

Hi, there is a dead link in README.md for Differential Equations in Data Science "Python Tutorial".

It takes the user to an online jupyter notebook link with a 404 Error.

I would love to contribute. . .

Hi!

I'm a third-year university student majoring in neuroscience with a focus on the pathophysiology of neurodegenerative diseases.
I've recently begun teaching myself python and have gotten fairly comfortable with it - and I would love to contribute to datasciencemasters as it looks like something I would be interested working on!

Only issue is, I've never contributed to anything on github - I know enough git to help myself, but am a fair beginner outside of that.
Could someone point me in the right direction as to a) what needs work and b) in what manner?

General advice is also more than welcome!

Cheers =)

Facilitating team capstone projects

Hi Clare and all.

As I mentioned against another issue - I think that we could do a better job in facilitating team capstone projects. We would just need a system for proposing a project / raising a request for joining one.

Perhaps this could be done through the wiki or via issues?

Any other thoughts?

Add Cubes

http://cubes.databrewery.org/

Light-weight Python framework and OLAP HTTP server for easy development of reporting applications and aggregate browsing of multi-dimensionally modeled data.
