
datascience's People

Contributors

adityakuppa26, adnanhemani, alvinwan, athossampayo, chengeaa, choldgraf, chrispyles, codtan, davidwagner, deculler, ericz82, henryem, hmstepanek, jr-42, khsu2000, krisstee, leman-kg, maxwelljweinstein, mdibyo, papajohn, peterasujan, prad06, preethamsura, rameshvs, samlau95, sinchana-kumbale, stefanv, taylorkmw, vrii14, yuvipanda

datascience's Issues

Switch to Bokeh for charts?

Bokeh now supports Python 3. The charts that are rendered support zooming and look nice. (Their mapping functionality looks less useful than folium, though.)

Table help is incorrect

help(Table) gives the following example of how to use a Table:

 |  >>> letters = ['a', 'b', 'c', 'z']
 |  >>> counts = [9, 3, 3, 1]
 |  >>> points = [1, 2, 2, 10]
 |  >>> t = Table([('letter', letters), ('count', counts), ('points', points)])

However, copy-pasting that into Jupyter gives the following error message:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
...
/usr/local/lib/python3.4/dist-packages/datascience/table.py in __init__(self, columns, labels)
---> 46         assert labels is not None, 'Labels are required'
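
For reference, matching the constructor signature described elsewhere in this tracker (columns first, then labels as a separate argument), a version of the example that should run is:

letters = ['a', 'b', 'c', 'z']
counts = [9, 3, 3, 1]
points = [1, 2, 2, 10]
t = Table([letters, counts, points], ['letter', 'count', 'points'])  # columns, then labels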

Where method doesn't handle invert or conjunction well

We should consider adding one of these signatures, because where currently can't express negation or conjunction cleanly.

.where(column_label, fn)
.where(column_label, value, compare_fn)
.where(column_label, value, not_equal=True)

Top one seems most general, so that's probably the way to go, but it requires understanding higher-order functions.

The problem is that if I want to say n != 2 and m < 4 for columns n and m, neither of the following works.

t.where(t['n'] != 2).where(t['m'] < 4) fails because t['m'] is unfiltered and has the wrong length

t.where(t['n'] != 2 and t['m'] < 4) fails because and doesn't work with numpy arrays

The only working solutions currently are both ugly:

t.where(numpy.logical_and(t['n'] != 2, t['m'] < 4))

u = t.where(t['n'] != 2)
u.where(u['m'] < 4)
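
For what it's worth, here is a minimal sketch of the fn-based signature proposed above, built on the boolean-mask form of where that already works; the helper name is illustrative:

import numpy as np

def where_fn(t, column_label, fn):
    # Build a boolean mask by applying the predicate to each column value,
    # then reuse the existing mask-based where.
    mask = np.array([fn(v) for v in t[column_label]])
    return t.where(mask)

# Chaining then composes without length mismatches:
# u = where_fn(t, 'n', lambda n: n != 2)
# v = where_fn(u, 'm', lambda m: m < 4)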

Minor documentation bug in Tables summary of methods

The summary of methods lists the first constructor as:

Table([columns, labels, formatter])

This has a typo: it should be

Table(columns, labels, formatter)

You need to pass three (or two) separate arguments, not a single argument that is a list of length 3.

Should we use Zenhub?

What do you guys think about using Zenhub for our workflow? I've noticed that it's a little challenging to manage multiple issues in different states of progress because they're all in one big list. Zenhub adds a layer of functionality on top of Github while preserving existing functionality for those who don't want to use it.

Here are a couple of problems and how Zenhub handles them.

It's hard to see which issues are important (we should work on immediately) and which we're putting off for later.
Zenhub has multiple columns (called pipelines) to file issues under called "Backlog", "In progress", "New Issues", etc.

We have multiple repos being worked on, but each one has its own issue list.
You can use one Zenhub board to view and manage issues from multiple repos.

Personally, I've used Zenhub before for a bunch of projects and liked it a lot more than the simple issue list GitHub provides. But there's always the overhead of getting used to a new technology.

@papajohn @alvinwan

Histogram from values (or intervals) and counts

From Ani:

Is it possible to have hist draw a histogram based on a distribution table?

E.g. the inputs are intervals and the proportions in each interval (adding up to 100%). Output is a histogram.

At the moment hist takes the raw data as its input. We could simply generate the right number of values at the center of each interval, and provide that as the dataset.

I'm asking because it will be very helpful when students find bad histograms in the newspaper or journal articles and try to fix them. They won't have the raw data. They'll just have the distribution, badly represented. To fix the representation, they could work with the distribution by hand as in Stat 2/20/21, but could we do better in our course?

See the following for an example; scroll down till you see the bar graph.
http://www.cdc.gov/mmwr/preview/mmwrhtml/rr58e0821a1.htm
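
A rough sketch of the bin-center workaround described above, with illustrative names; the bins passed to hist should match the original intervals so the bar heights come out right:

import numpy as np

def synthetic_values(edges, proportions, total=10000):
    # Place proportions[i] * total copies of each interval's midpoint.
    edges = np.array(edges, dtype=float)
    centers = (edges[:-1] + edges[1:]) / 2
    counts = np.round(np.array(proportions) * total).astype(int)
    return np.repeat(centers, counts)

values = synthetic_values([0, 10, 20, 50], [0.2, 0.5, 0.3])
# Table([values], ['value']).hist('value', bins=[0, 10, 20, 50])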

Implement a percentile function

The pth percentile of a list is the smallest number that is at least as large as p% of the numbers in the list.

That means: sort the list from low to high and go p% of the way up from the bottom. If that lands exactly on an entry, take its value; otherwise take the next entry up.
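
A direct implementation of that definition, as a sketch:

import math

def percentile(p, values):
    # Sort, then take the element at rank ceil(p/100 * n), counting from 1.
    sorted_vals = sorted(values)
    rank = max(math.ceil(p / 100 * len(sorted_vals)), 1)
    return sorted_vals[rank - 1]

percentile(50, [1, 7, 3, 5])  # 3: the smallest value >= 50% of the list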

Feature request: way to create a heat map from a Table of (lat,long) points

It would be nice if the Map class provided a way to overlay a heat map on top of a map: given a large collection of (lat,long) points (e.g., Markers), construct a heat map making it easy to visualize where the points are most concentrated.

Maybe there's an easy way by using the options provided, but I couldn't figure out how to do that from the public documentation. Maybe provide an API for this, or document how to do it? I bet this will be a useful thing for students -- and I'd like to use it for Friday lecture too.
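
Until there is first-class support, here is a sketch that drops down to folium (which Map wraps), assuming a folium version that ships the HeatMap plugin:

import folium
from folium.plugins import HeatMap

points = [(37.87, -122.27), (37.87, -122.26), (37.88, -122.26)]  # (lat, long) pairs
m = folium.Map(location=[37.87, -122.26], zoom_start=13)
HeatMap(points).add_to(m)  # shades the map by the local density of points
m  # render inline in a notebook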

Table does not seem to support more than 255 columns

Here's a piece of code that triggers this issue:

lyricsTable = Table.read_table('http://eecs.berkeley.edu/~xinghao/ds10data/lyricsTable.csv')
lyricsTable
gives an error of

  File "<string>", line 12
SyntaxError: more than 255 arguments

(CPython versions before 3.7 reject any call with more than 255 explicit arguments, so any code path that expands one argument per column will hit this limit.)

If I use a smaller table (with 104 columns instead of 5004), the error goes away:

lyricsTable = Table.read_table('http://eecs.berkeley.edu/~xinghao/ds10data/lyricsTable_part.csv')
lyricsTable

Pivot needs to support multiple rows

I already ran into an example where I needed multiple rows in a pivot. I had to join and then split a column to get what I wanted, which was gross. We need to support something like

t.pivot(column, [row0, row1], value, ...)
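
Until that exists, a sketch of the join-then-split workaround mentioned above, assuming columns can be assigned with t[label] = values (names illustrative):

t['row0|row1'] = ['{}|{}'.format(a, b) for a, b in zip(t['row0'], t['row1'])]
p = t.pivot('column', 'row0|row1', 'value', sum)  # pivot on the single fused row label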

No clean way to count duplicate items in a Table, with Table API

Goal: Given a table T and a column C, build a new table that has one row for each unique value in T.C along with a count of the number of times that value appears in T.C.

I was not able to find any clean way to do this within the Table API. Should this be doable using Tables, without leaving Table space and going back to arrays and raw Python?

Here is the solution I came up with:

from collections import Counter
c = Counter(origtbl['column_label'])
t = Table.from_rows(c.items(), ['column_label', 'count'])

Not so terrible if you know Python idioms, but also probably not so discoverable for students. Should there be an API in Table that's accessible to students that allows performing this kind of task? Or some suitable generalized primitive, which is enough to solve this problem?
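
It looks like group with no aggregation function covers exactly this in later versions of the library:

t = origtbl.group('column_label')  # one row per unique value, plus a 'count' column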

Resampling-based Hypothesis Testing

@mijordan3:

It would be good to have both bootstrap-based hypothesis testing (where column A is resampled n times with replacement and column B is independently sampled n times with replacement), and permutation testing (where column A and column B are put together in a single long column, a random permutation is made of that column, it is split in the middle into two new columns, and the statistic is computed on those two columns).

@SamLau95 I can expand sample functionality. Would the following syntax work?

def table(self, k, with_replacement=False, columns=[], random_permutation=False)
    # columns is a list of column names; if set, with_replacement must be a
    # list of the same length OR random_permutation must be True.
    # If columns and with_replacement are both lists and random_permutation is
    # True, raise an assertion error (don't know what to do).

This would cover:

  • independent column sampling
  • permutation sampling
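
Independent of the final Table syntax, a minimal sketch of the permutation scheme described above, in plain numpy:

import numpy as np

def permuted_stat(a, b, statistic):
    # Pool the two columns, randomly permute, and split in the middle.
    pooled = np.random.permutation(np.concatenate([a, b]))
    return statistic(pooled[:len(a)], pooled[len(a):])

a = np.random.normal(0.0, 1.0, 100)
b = np.random.normal(0.5, 1.0, 100)
null_dist = [permuted_stat(a, b, lambda x, y: x.mean() - y.mean())
             for _ in range(1000)]  # null distribution of the statistic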

Marker.show() throws error

I tried to use the Maps demo from text/demos/MapsDemonstration.ipynb. I can't make it work for me.

When I try to display a Marker, I get an error. In particular, I can do

m = Marker(37.78, -122.42, 'San Francisco')

with no error, but then when I do

m.show()

I get a traceback:

TypeError: simple_marker() got an unexpected keyword argument 'popup_on'

Same if I type just

m

into an input cell.

Drop row

The current Table API supports take (for rows), but appears to lack the functionality to drop rows. While it is possible to work around this by constructing my own complement, it would be more elegant to directly support these operations.

This is in the context of doing cross-validation, where we typically drop a small number of rows to construct the test set.

[This issue should be labeled as enhancement, but I can't seem to figure out how to do that.]
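
In the meantime, a sketch of the complement workaround, assuming the table exposes num_rows and that take accepts an index array (the helper name is illustrative):

import numpy as np

def drop_rows(t, indices):
    # take() the complement of the given row indices.
    keep = np.setdiff1d(np.arange(t.num_rows), indices)
    return t.take(keep)

# e.g. for cross-validation:
# test = t.take([3, 1, 4]); train = drop_rows(t, [3, 1, 4])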

Color palette needs more colors

We currently cycle only through blue, yellow, green, and red; dark variants of these would be nicer than magenta and white. See the _visualize method in table.py.

[moved] Debug errors with append, group, etc.

  • append doesn't work
  • zip should be used to construct columns in from_rows
  • group by a zipped column should do the right thing, but currently expands the contents of tuples
  • group should not introduce a new column if a column is passed in

Cannot pip install version 0.2.0

I tried to create a new release, but it appears I was unsuccessful. Any suggestions?

$ pip install datascience==0.2.0
Collecting datascience==0.2.0
  Could not find a version that satisfies the requirement datascience==0.2.0 (from versions: 0.1.0, 0.1, 0.1.1)
No matching distribution found for datascience==0.2.0

I created a datascience release called 0.2.0, updated the setup.py file, and ran python setup.py sdist upload -r pypi

Methods need better documentation

When working with the datascience package, I spend a lot of time trying to figure out how the methods work, since the docstrings aren't super helpful: some methods require the table to be a certain shape, others require the table values to be numbers, others strings. None of these details are mentioned for some important methods like hist and barh.

In addition, to find out whether the package has the functionality I want (e.g., whether I can group a table of years by decade), I have to browse the methods one by one, trying to keep a lot of things in my head about what methods are available.

I imagine I'm running into a majority of these issues because most of this code wasn't written by me. However, this will be the case for our students so IMO the earlier we can work on this the better.

It'd be very helpful to 1. Improve the docstrings and 2. Have easily navigable documentation (probably generated from docstrings using something like Sphinx).

A great place to start would be the plotting functions, since those seem to be the most finicky and most commonly used.

Mapping through a table

In many cases it would be good to build up a little table and use it to augment another table. A common case is to map through a table. For example, you have a table of Parcels. You have categorized them. Now you want to map categories to colors. So you build a little table.

color_map = Table.from_rows([["Residential", '#f1eef6'],
                             ["Commercial",  '#d0d1e6'],
                             ["Industrial",  '#a6bddb'],
                             ["Apartment",   '#74a9cf'],
                             ["Public",      '#2b8cbe'],
                             ["Other",       '#045a8d']],
                            ("Category", "Color"))

The old join did a left outer join, so this would just work. The new one does an inner join: you only get one row per match. You can build it up with indexed_by, but that is pretty ugly because it has to deal with the possibility of multiple matching entries. If we implement lookup we can cover this case. Or do we want to offer a richer join?
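
For what it's worth, in the common case where every Category in the parcels table also appears in color_map, the current inner join should still work:

parcels_with_colors = parcels.join('Category', color_map)  # adds a Color per parcel

The left-outer behavior only matters for categories missing from color_map, whose rows would silently drop out.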

Table.read_table() could be smarter about auto-detecting the file format

Try this:

Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD')

Table.read_table() fails to recognize the columns; it stuffs everything into one column.

Compare to

Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv')

which does recognize that there are three columns.

Perhaps it is looking at the URL and trying to parse out the filename extension, and then using that to decide how to decode the data. If so, maybe it should be smarter about how to parse URLs (to remove fragments and parameters), or maybe it should ignore the URL/filename and have smarter format detection (e.g., auto-detect it as CSV based on the contents of the data rather than the filename).
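
As a stopgap, a sketch that lets pandas do the parsing and then wraps the result, assuming a version of the library that provides Table.from_df:

import pandas as pd
from datascience import Table

url = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'
t = Table.from_df(pd.read_csv(url))  # pandas parses the content as CSV regardless of the query string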

CurrencyFormatter broken?

Minimal (non-)working example:

from datascience import *
%matplotlib inline
import numpy as np

tab = Table(labels=["money"], columns=[[1.,2.,3.]])
tab.set_format("money", CurrencyFormatter)

Table vs numpy.matrix speed

Below is a benchmark for comparing Table vs numpy.matrix on an access pattern that I expect is very common in data science applications. Essentially, all I'm doing is treating each row as a vector, and attempting to compute pairwise distances between rows / vectors by iterating over all their values.

from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(numDatapoints)] for j in range(numFeatures)],
                    [str(j) for j in range(numFeatures)])

countsMatrix = countsTable.matrix().transpose()

t0 = time.clock()
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s

t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s

t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s

Running this code shows that iterating via numpy.matrix is roughly 60x faster than iterating over Table.columns, which in turn is roughly 70x faster than using Table.rows.
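
For comparison, a fully vectorized version of the same computation avoids the Python-level loops entirely:

import numpy as np

# L1 distance from the first row to every row, with no explicit iteration.
arr = np.asarray(countsMatrix)
l1 = np.abs(arr - arr[0]).sum(axis=1)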
