Book on Julia for Data Science

Home Page: https://juliadatascience.io

License: Other

Julia 100.00%

julia data-science book julia-language data-visualization data-manipulation data

juliadatascience's Introduction

Julia Data Science

Open source and open access book for data science in Julia.

You can read the full book on https://juliadatascience.io.

This book was once available on Amazon, but due to an absurd reason, our publishing account was terminated, and our book was removed. Interestingly, all Amazon customer accounts remain functional. It seems that our removal may have been a result of not selling enough books. Nevertheless, the book and its PDF version can still be accessed in their entirety on this website.

LICENSE

This book is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Also using an image from the Love pack by Freepik at Flaticon.

juliadatascience's Issues

makie.jl: ERROR: LoadError: UndefVarError: Downloads not defined

Hi,

I get an

ERROR: LoadError: UndefVarError: Downloads not defined
Stacktrace:
 [1] top-level scope
   @ o:\Julia\makie.jl:604
in expression starting at o:\Julia\makie.jl:604

Commented out.

And on execution for all demo functions:

ERROR: UndefVarError: Options not defined
Stacktrace:
 [1] custom_plot()
   @ Main o:\Julia\makie.jl:16
 [2] top-level scope
   @ REPL[12]:1

Commented out.

On execution,

custom_plot()

no error, no plot appears.

Add TestImages.jl to GLMakie section

This is a spinoff of the discussion in #84

Add Section "About the Authors"

We should add somewhere a section "About the Authors" with a picture and a brief bio.

It helps to establish a nice connection of the reader to the book and the authors.

Different repo for version 2?

Guys,
it is probably too early, but I have some material using Flux that I would like to start putting up. No revision is necessary yet, it is just starting material (more like notes), but it will definitely involve breaking changes. Could we have a different repo within the organisation for that?

Typo at page 69 of the PDF version

As in the web version, where it's written:

DataFrame(σ = ["a", "a", "a"], δ = [π, π/2, π/3])

The PDF version reads:

DataFrameσ( = ["a", "a", "a"], δ = π[, π/2, π/3])

page 69 Chapter 4 DataFrames.jl

Add Revise.jl somewhere

In the past, I've been verbal about the fact that Revise.jl isn't promoted enough. So, I should write a bit about the REPL workflow and some pointers to Pluto.jl etc.

Spelling: Mixmatch and Usecases

Fix two spelling confusions:

mixmatch to mix and match according to en.wiktionary.org/wiki/mixmatch
usecases to use cases according to en.wikipedia.org/wiki/Use_case

Thanks @rikhuijzer!

Add Lazaro Alonso as a Coauthor

[dataframes_performance] Speed and Memory Allocation

Here are some thoughts on this section.

The first sentence of this section shows the word "fast" in bold. Speed is also the first thing I think of in the context of performance. Unfortunately, "only" comparisons of memory allocations are shown. Of course this is also an important point (when the data are so big that there is not enough memory available to hold them), but even more important would be speeeeeeeeed.

Is there a particular reason why you "only" show @allocated and not (also) @time and/or @benchmark from BenchmarkTools.jl (didn't try that package so far)?

A test using @time instead of @allocated for the given examples showed that there was little to no improvement, or that the "better" code (needing less memory) even took longer. Maybe this is (only) due to the very small sample data used in the examples?

From VBA (the only language I know a little bit, if you consider that a programming language) I heard that using "non-optimal" types can even take longer, although they allocate less memory. To be concrete, that is said to be the case when using Integer compared to Long (corresponding to Int16 and Int32 in Julia). Long should be most efficient, because (modern) processors on current platforms perform operations in that precision. Thus, internally every other type is first converted to the "optimal" type and, after the operations, back to the input type. If this is true, the two additional conversion steps would obviously take extra time.

From what I have tested so far about this topic in VBA, on my computer I couldn't really find a (big/significant) difference between the calculation times. Any insights regarding that topic in Julia?
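
One way to look at both questions at once is to measure time and allocations side by side. The following is a minimal sketch that uses only the Base macros `@elapsed` and `@allocated`; the helper names `double`, `double!`, and `measure` are made up for illustration and do not come from the book. For serious measurements, `@benchmark` from BenchmarkTools.jl is likely the better tool.

```julia
# Hypothetical helpers: the same operation, allocating vs. in place.
double(v) = v .* 2            # returns a new vector
double!(v) = (v .*= 2; v)     # overwrites `v`

# Call `f` once so compilation is not measured, then return
# the elapsed seconds and the bytes allocated by one call each.
function measure(f, v)
    f(copy(v))                        # warm-up call
    w = copy(v)
    seconds = @elapsed f(w)
    w = copy(v)
    bytes = @allocated f(w)
    return (; seconds, bytes)
end

v = rand(10^6)
alloc = measure(double, v)
inplace = measure(double!, v)
println("double:  ", alloc)
println("double!: ", inplace)
```

On very small inputs the timing difference can disappear in the noise, which would be consistent with the observation above.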

Introduce `...` splat operator in Julia 101 Section

In response to the comment in #29: I think that it is important enough to introduce in the intro.
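
As a starting point for such an introduction, here is a small sketch of the two roles of `...`. The function names `add3` and `count_args` are invented for this example.

```julia
add3(a, b, c) = a + b + c

args = (1, 2, 3)
add3(args...)            # splat: same as add3(1, 2, 3)

# In a function definition, `...` collects ("slurps") the extra
# positional arguments into a tuple.
count_args(xs...) = length(xs)
count_args(args...)      # called with three arguments
```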

Add split-apply-combine

For example, to calculate the score per group for various columns.
To make the example come alive, maybe come up with something about fields with plants.
Note that the dot in the combine is a bit tricky to figure out.

julia> df = DataFrame(group = [:A, :A, :B, :B], X = [1, 2, 3, 4], Y = [5, 6, 7, 8])
4×3 DataFrame
 Row │ group   X      Y
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ A           1      5
   2 │ A           2      6
   3 │ B           3      7
   4 │ B           4      8

julia> gdf = groupby(df, :group)
GroupedDataFrame with 2 groups based on key: group
First Group (2 rows): group = :A
 Row │ group   X      Y
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ A           1      5
   2 │ A           2      6
⋮
Last Group (2 rows): group = :B
 Row │ group   X      Y
     │ Symbol  Int64  Int64
─────┼──────────────────────
   1 │ B           3      7
   2 │ B           4      8

julia> combine(gdf, [:X, :Y] .=> mean; renamecols=false)
2×3 DataFrame
 Row │ group   X        Y
     │ Symbol  Float64  Float64
─────┼──────────────────────────
   1 │ A           1.5      5.5
   2 │ B           3.5      7.5

Justified Text

I think the book should have justified text, right now the text is left-aligned. See pic below for a clarification:

improve typographical stuff

As I already stated in e.g. e8e1d11, it would be nice to make some sections unnumbered when there is no second section on that level.

If I have seen this correctly, this should be possible by appending {-} to the section entry, as can be seen e.g. in

JuliaDataScience/contents/index.md

Line 1 in 56151f0

# Welcome {-}

So if you think it would be nice to have, I'll redo the suggestions in a new PR, especially if it is that easy to do.

PS:
I am not sure if my newest comments in #217 have been noticed by one of you, since I have added them after the PR was merged. Thank you for your comments!

Packages index

It might be a good idea to add a packages index at the back of the book. That way, people who quickly want to know how CSV.jl works, for example, can look in the index and jump to the right page.

This is a test issue

Please do not reopen

Issues in the DataFrames section

filter doesn't "remove rows"; only filter! does.
Introduce column selectors and ?select.
CategoricalArrays.jl should have a section for itself.
transform(df, ...; renamecols=false)

example of both filter and subset with multiple conditions, as in:

 filter(row -> row.col1 >= something1 && row.col2 <= something2, df)
 # and:
 subset(df, :col1 => ByRow(>=(something1)), :col2 => ByRow(<=(something2)))

cover vector of Symbols as a col selector in the dataframes_select.md
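
To make the first and last points concrete, here is a minimal sketch with DataFrames.jl. The table and its column names are hypothetical and not taken from the book.

```julia
using DataFrames

# Hypothetical data, not taken from the book.
df = DataFrame(id = 1:4, col1 = [1, 5, 8, 4], col2 = [2, 3, 9, 1])

kept = filter(:col1 => >=(4), df)    # returns a new DataFrame, `df` is unchanged
nrow(df), nrow(kept)

selected = select(df, [:id, :col2])  # a Vector of Symbols as column selector

filter!(:col1 => >=(4), df)          # removes the rows from `df` itself
nrow(df)
```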

inversion judea and noam

in the file why_julia.md, line 250, there is an inversion between Judea and Noam

That was not what judea as a Linguist type would say to noam, a ComputerScientist type.

Judea is the computer scientist and Noam the linguist. Moreover, according to the line before, it is Noam who speaks to Judea, so I guess the line should be:

That was not what noam as a Linguist type would say to judea, a ComputerScientist type.

Replace Disney Copyright Meme in Why Julia

We need a new meme for "Data, Data Everywhere"

Statistics Chapter

A final chapter with basic statistics using both DataFrames.jl and Makie.jl. For each concept, we should give the intuition behind it and also get into the mathematical details.

We should cover:

PDF has johndoe in the footer

Notation discussion points

Notation discussion points from #20:

Always using : before the start of a code block.
Mentioning functions like DataFrame as DataFrame() or DataFrame(...).
My suggestion: only Julia objects between backticks and filenames and extension names between quotation marks (like Julia's strings).

Incorrect statement on keyword arguments

From the webpage:

But defining a function with a keyword argument without default values is no problem:

julia> f(x; y) = x^2+y
f (generic function with 1 method)

julia> f(2, y=10)
14

Am I misunderstanding, or is the statement from the book simply wrong?
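
For reference, a small sketch of how a keyword argument without a default behaves in Julia 1.x: the definition is accepted, and the keyword becomes required at the call site. The variable `err` exists only for this illustration.

```julia
f(x; y) = x^2 + y      # `y` has no default, so it is a required keyword

f(2, y=10)             # works because `y` is given

# Leaving `y` out throws an UndefKeywordError.
err = try
    f(2)
catch e
    e
end
err isa UndefKeywordError
```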

[dataframes_select] not the same selection

The lines

JuliaDataScience/contents/dataframes_select.md

Lines 48 to 56 in b5582d2

 Note how `q5` is now the first column in the `DataFrame` returned by `select`. 

 There is a more clever way to achieve the same using `:`. 

 The colon `:` can be thought of as "all the columns that we didn't include yet". 

 For example: 

 ```jl 

 s = """select(responses(), :q5, :)""" 

 sco(s, process=without_caption_label) 

 ```

don't give the same result as the previous example (where :id is not shown).

I think you need to rephrase the text.
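
The difference can be reproduced with a tiny table. This is only a sketch: the `DataFrame` below is a hypothetical stand-in for `responses()`, whose real columns may differ.

```julia
using DataFrames

# Hypothetical stand-in for `responses()`; the real columns may differ.
df = DataFrame(id = [1, 2], q1 = [28, 61], q5 = [3, 5])

with_colon = select(df, :q5, :)                # q5 first, then every remaining column
without_id = select(df, :q5, Not([:id, :q5]))  # q5 first, and `id` is dropped

names(with_colon), names(without_id)
```

So `:` brings `id` back, while an explicit `Not` keeps it out.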

is creating functions to avoid multiple variables really a good choice?

Hello there!

In the 4th chapter we find the comment below:

NOTE: This works, but there is one thing that we need to change straight away. In this example, we defined the variables name, grade_2020 and df in global scope. This means that these variables can be accessed and edited from anywhere. If we would continue writing the book like this, we would have a few hundred variables at the end of the book even though the data that we put into the variable name should only be accessed via DataFrame! The variables name and grade_2020 were never meant to be kept for long! Now, imagine that we would change the contents of grade_2020 a few times in this book. Given only the book as PDF, it would be near impossible to figure out the contents of the variable by the end. We can solve this very easily by using functions.

Is that really necessary? I think it's way more readable to leave what is being explained outside a function. I think it would be better for each chapter to have a single Jupyter Notebook, for example Ch4_dataframes.ipynb, where every variable would be confined to that chapter and could not be accessed from anywhere else.
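
For readers comparing the two styles, this is roughly what the function-based approach looks like. It is a dependency-free sketch under assumptions: the data are made up, and the function returns a NamedTuple where the book would return a `DataFrame`.

```julia
# Hypothetical example data; a NamedTuple stands in for a DataFrame.
function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    return (; name, grade_2020)
end

data = grades_2020()
data.name               # the columns are reachable through the return value

# `grade_2020` was local to the function, so no global with that
# name exists afterwards.
@isdefined(grade_2020)
```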

Usage of `limits!`

Hi,

This is a great introduction book! The Makie part is also the first systematic introduction to the plotting library I've seen so far.

I have a question about the usage of limits! in the examples. I've seen many places where the limits are set without the axis type, e.g. the code for Figure 8

fig, ax, pltobj = scatter(xyvals[:, 1], xyvals[:, 2]; color=xyvals[:, 3],
    label="Bubbles", colormap=:plasma, markersize=15 * abs.(xyvals[:, 3]),
    figure=(; resolution=(600, 400)), axis=(; aspect=DataAspect()))
limits!(-3, 3, -3, 3)
Legend(fig[1, 2], ax, valign=:top)
Colorbar(fig[1, 2], pltobj, height=Relative(3 / 4))

When I tried to run the code locally, I got

julia> limits!(-3., 3., -3., 3.)
ERROR: MethodError: no method matching limits!(::Float64, ::Float64, ::Float64, ::Float64)

and it hinted to me to add the axis type as the 1st argument as

limits!(ax, -3, 3, -3, 3)

I used GLMakie rather than CairoMakie to run the code. Is this due to the library version I'm using? GLMakie v0.1.30
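
A minimal sketch of the axis-first form that the error message points to. It uses CairoMakie with random data as an assumption; the same calls should be available from GLMakie.

```julia
using CairoMakie   # GLMakie exposes the same calls

fig = Figure()
ax = Axis(fig[1, 1]; aspect = DataAspect())
scatter!(ax, randn(100), randn(100))   # random data, for illustration only

# Pass the axis explicitly instead of relying on a "current axis".
limits!(ax, -3, 3, -3, 3)
fig
```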

Proof-reading

Hi there! I enjoyed reading the PDF of this book, and I think you've done a great job, and it looks impressive.

As I read through, I picked up some typos and grammatical awkwardnesses. Would you prefer me to list them separately in an issue, or make a PR to the source Markdown files? (I don't know whether you can selectively accept and reject various edits in a PR, so I'm not sure what's best...)

[DataFrames] Performance

Allocations: functions with a shebang ! versus without (e.g., filter! versus filter)
Copying vs not copying: df[!, col] versus df[:, col] (link)
CSV.jl reading a lot of .csv files with the new version 0.9 and passing a vector of Strings, details here
CSV.File versus CSV.read. Details in the index of the CSV.jl documentation
CategoricalArrays.jl compression with compress=true
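
The copying versus non-copying point can be sketched in a few lines. The table is hypothetical.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3])   # hypothetical data

no_copy = df[!, :a]   # the column vector stored in `df`
copied  = df[:, :a]   # a fresh copy of that vector

no_copy === df[!, :a]   # true: the same object every time
copied === df[!, :a]    # false: an independent copy

no_copy[1] = 100        # visible through `df`
copied[2] = -1          # not visible through `df`
df.a
```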

Clarify our future plans to readers

From some valuable suggestions by Evgeny Pogrebnyak on Twitter (https://twitter.com/PogrebnyakE/status/1435707529137463302). Evgeny argues that a book title "Julia Data Science" should contain more than just plotting and dataframes transformations. We need to clarify to readers that for version 2, we plan to add:

Statistics (for example, GLM.jl and HypothesisTests.jl)
AlgebraOfGraphics.jl
Machine Learning (MLJ.jl)
maybe other things?

Text inconsistent with code

Hi. @rikhuijzer .

I found text that does not match the code that follows it.

JuliaDataScience/contents/dataframes_transform.md

Lines 91 to 99 in b5582d2

 With this, we can add a column saying whether someone was approved by the criterion that all of their grades were above 5.5: 

 ```jl 

 s = """ 

  pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)] 

  transform(leftjoined, [:grade_2020, :grade_2021] => pass; renamecols=false) 

  """ 

 sco(s; process=without_caption_label) 

 ```

More specifically, the text says all of their grades, but the code uses the boolean OR operator ||.
My suggestion is the smallest possible change, namely all -> any.
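
To show the difference between the two readings, here is a short sketch with made-up grades. `pass_any` mirrors the `pass` function quoted above; `pass_all` is a hypothetical variant that uses `&&`.

```julia
# Same logic as the quoted `pass`: true if at least one grade is above 5.5.
pass_any(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]

# Hypothetical variant: true only if both grades are above 5.5.
pass_all(A, B) = [5.5 < a && 5.5 < b for (a, b) in zip(A, B)]

grade_2020 = [5, 8]   # made-up grades
grade_2021 = [6, 9]

pass_any(grade_2020, grade_2021)
pass_all(grade_2020, grade_2021)
```

The first student passes under "any" but not under "all".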

Add an amazon review banner

Hey folks, you all should add a big banner on the book page asking folks to leave a review on Amazon. This will help increase the visibility of the book significantly!

Fix how the YouTube video is embedded

It should be displaying something different for the HTML and PDF instead of instructions to the reader on what to do.

Issues with definitions in the statistics chapter

Details at https://discourse.julialang.org/t/corrections-for-juliadatascience-book/69204.

Remove the definition of a $P(A)$ as the "set"
Fix Kolmogorov axioms
Double-check the rest

Google Analytics for Book Site

Just something we should have in mind.

Announcements and Publicity

Official Julia here: https://julialang.org/learning/books/
Extra-Official here: https://github.com/svaksha/Julia.jl/blob/master/Resources.md#books
Make an announcement at Discord [ANN] Books.jl and JuliaDataScience
HackerNews

[dataframes_indexing] Not an anonymous function

From what I have learned so far there is no "anonymous function" in

JuliaDataScience/contents/dataframes_indexing.md

Lines 160 to 177 in b5582d2

 Now, to show **why anonymous functions are so powerful**, we can come up with a slightly more complex filter. 

 In this filter, we want to have the people whose names start with A or B **and** have a grade above 6: 

 ```jl 

 s = """ 

  function complex_filter(name, grade)::Bool 

  interesting_name = startswith(name, 'A') || startswith(name, 'B') 

  interesting_grade = 6 < grade 

  interesting_name && interesting_grade 

  end 

  """ 

 sc(s) 

 ``` 

 ```jl 

 s = "filter([:name, :grade_2020] => complex_filter, grades_2020())" 

 sco(s; process=without_caption_label) 

 ```

If I am correct, you declare a named function that is then used in filter. The => should be the Pair operator, while -> is the operator to define an anonymous function, right?

If I am correct, please rephrase the text or adapt the example code.
If I am wrong, shame on me and I am not a good "student" and sorry for the noise.
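
For comparison, a small Base-only sketch of the three things involved. The names `above_six`, `anon`, and `p` and the data are invented for this example.

```julia
grades = [5.0, 6.5, 9.0]        # hypothetical data

# Named function, like `complex_filter` in the quoted snippet.
above_six(grade) = 6 < grade

# Anonymous function: created with `->`, no name attached.
anon = grade -> 6 < grade

filter(above_six, grades) == filter(anon, grades)

# `=>` builds a Pair; it only links a column name to a function.
p = :grade_2020 => above_six
p isa Pair
```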

Launch the Printed Book at Amazon.com

We need to:

Normalize functions and syntax to the Blue code style, also in line with the DataFrames, Plots, Makie, and AoG docs
Create a PDF with syntax highlighting (#15)
Create an Author Profile at Amazon.com (@storopoli, @rikhuijzer, and @lazarusA)
Get an ISBN (Amazon provides a free one)
Adapt the book to Amazon.com format
Create cover (#148)
Write a summary
Tag a "first edition" release on this repository

Chapter 7 Link broken to Makie Docs

There is a broken link in Chapter 7 datavisMakie.md:

The link in "See Makie’s documentation for more." redirects to http://makie.juliaplots.org/stable/backends_and_output.html#Backends-and-Output, which is broken.

cc @lazarusA

Text outside margins in PDF

On page 165:

On page 184:

This probably requires a fix in the PDF template.

Finish README.md

Make a README.md for the first edition

Add some text from juliadatascience.io homepage without copy-pasting (automated, fool-proof thing)
Book cover
Link to Amazon.com Book
Cheatsheets that @lazarusA created (link the README to the image to the book)

Section on AlgebraOfGraphics.jl?

Any plans for something on https://github.com/JuliaPlots/AlgebraOfGraphics.jl ?

Citation Information

We should provide citation information in index.md and also on README.md with biblatex and APA (psychology, social sciences etc.), IEEE (computer science).

Overview of Plots.jl Chapter

Here is an opinionated version. Feel free to criticize:

Subsections for Plots

Brief overview of the JuliaPlots ecosystem

JuliaPlots Organization
Plots.jl (what is it good for and what are its limitations)
Makie.jl (what is it good for and what are its limitations)
AlgebraOfGraphics.jl (what is it good for and what are its limitations) (version 2.0)

What is Plots.jl?
- plot vs plot!
- input data
- Series Types and functions(e.g. line, line!, heatmap, heatmap!)
- How to save a plot
Attributes
- Overview (What are attributes, and the whole symbol e.g. :xticks system)
- Series attributes
- Plot Attributes
- Footnote about the extra_kwargs stuff
- Examples using the most common things that you want to do in a visualization. This could be inserted right after you introduce a specific attribute.
Color and Palettes
I think we should cover colorbrewer and the ones from matplotlib (inferno, viridis, magma).
We should cover some stuff from the Claus Wilke Fundamentals of Data Visualization book (Chapters 4 and 19). Also we should cover the three types of color usage:
1. sequential: continuous stuff, e.g. :blues (only blue)
2. diverging: continuous stuff, e.g. :RdBu (from red to blue)
3. distinguishable: discrete stuff, e.g. :Set1_5
I have a very strong positive bias towards colorbrewer Sets (e.g. palette=:Set1_5).

We should also mention that the reader should use a colorblind-friendly palette or colors. Maybe we should include an official statistic on the prevalence of any sort of colorblindness or color difficulties in the population. I remember seeing somewhere that it was around 5% of people.
Layouts
- Overview on several ways to do layouts
- the layout argument, also cover the grid
- the @layout macro
- specific measures with the Plots.PlotMeasures submodule
- adding subplots incrementally. Define p1, p2, p3; then do a plot(p1, p2, p3; layout=l)

Rewrite Preface Introduction to the Book

Section 1.1.1 is still a little off.

Why Julia reference is not found

No typo, check here: https://github.com/JuliaDataScience/JuliaDataScience/blob/main/contents/why_julia.md

And then check here: https://github.com/JuliaDataScience/JuliaDataScience/blob/main/contents/preface.md

leftjoin! inplace

We need to include leftjoin!.

It will be added in the 1.3 release of DataFrames.jl and is already in the main branch.

EDIT: We only cover non-allocating functions (the bang ! ones) in performance:

Performance Section

Source code here: https://github.com/JuliaData/DataFrames.jl/blob/main/src/join/inplace.jl

PS: Thanks for the warning @bkamins!
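
A minimal sketch of the difference between `leftjoin` and `leftjoin!`, assuming DataFrames.jl 1.3 or newer. The two tables are hypothetical.

```julia
using DataFrames   # `leftjoin!` needs DataFrames.jl 1.3 or newer

# Hypothetical data, not taken from the book.
grades_2020 = DataFrame(name = ["Sally", "Bob"], grade_2020 = [1.0, 5.0])
grades_2021 = DataFrame(name = ["Bob", "Hank"], grade_2021 = [9.5, 6.0])

leftjoin(grades_2020, grades_2021; on = :name)    # allocates a new DataFrame
leftjoin!(grades_2020, grades_2021; on = :name)   # adds `grade_2021` to the left table

names(grades_2020)
```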

Show roadmap to reader

See #110.

It should be on the front page and somewhere in the PDF which is hidden for the Amazon version.

It should not be in the Amazon book.

DataFrames regex indexing

Add DataFrames regex. From Kevin Bonham on Slack:

I've done this ❤️ before, but Regex column selection in DataFrames... I'm just doing some data wrangling on a table with hundreds of columns with really stupid and inconsistent names... being able to do df[!, r"pattern"i] to just ignore different capitalization and find multiple things saves SO much time
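
A short sketch of what this looks like in practice. The table and its column names are made up to mimic inconsistent capitalization.

```julia
using DataFrames

# Hypothetical table with inconsistent column names.
df = DataFrame(ID = 1:2, Temp_A = [20.1, 21.3], temp_b = [19.8, 20.0], humidity = [40, 42])

names(df, r"temp"i)   # case-insensitive match on column names
df[!, r"temp"i]       # DataFrame with only the matching columns
```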

embed mp4 video

These lines of code work. You can embed a video in markdown (html). If one uncomments these lines, it will work locally, but it will fail in CI and in the PDF generation.
Ideally, it would be good to allow the animation online and print the first frame in the PDF version (probably too much to ask).

Implement cormullion's suggestions for the front cover

In #148 (comment), cormullion suggested

In #148, this was partially implemented. To implement it further, the font could be changed to Barlow (#148 (comment)), which involves ensuring that the font is available locally and in CI; a solution is also required for properly placing text. I personally think that Makie isn't the best tool for text placement, but we might get it right with lots of fiddling.

Logo for Julia Data Science Organization

We need a logo. I will talk to someone at UNINOVE who can do that for me. Any thoughts @rikhuijzer? We should move away from anything stats/Bayesian so as not to confuse it with future endeavors.

Maybe something with Tabular Data or Line Plots. We should definitely use Julia colors.

Update JuliaDataScience GitHub Organization Logo
Update JuliaDataScience Book favicon site icon
