fivethirtyeight / data

16.6K 16.6K 11.0K 155.81 MB

Data and code behind the articles and graphics at FiveThirtyEight

Home Page: https://data.fivethirtyeight.com/

License: Creative Commons Attribution 4.0 International

R 15.53% Python 13.22% Jupyter Notebook 71.25%

data

data's People

mattjanicki grancier tretelny gavinator123 spencerbingol gentleben187 dataorigami biggels ercunningham12 ellisonbg jtravisnorton swetz6 jtslaven miguelcarvajal travelinreid jinhyeokjang juicymailbox pragerd rmswartz jinningz ranjani185 tjwds jasontodd1 asalkever patrick7033 tbrown32 feeneya tkruml cie247 tmbronk michaelnyee saulshanabrook stevephelps dlacross scogell dino-fire raghunt mattyrobin8 dantintle greg-dubrow kirstywilliams amitkamra1 vikashranjan datadeng finiterank calderas danconder duynguyenuw lslangley fatigatti rubthemtogether msarano soar2hi sankashshankar marian1987 adieprestone eli-berkow tjenkins benkhsieh dsplaisance hokieedc joeljw elizabethricks cferejohn robnotbob zehual italiansinfuga tplaza3 prenden2 marc00l bstancil andy-olstad jacosan salandopee mhoyer05 jdm5056 mogismog scottball sonicrick baconstarvation gilatmat hlwiencko eliransapir is-noop yuwenmemon sanchitarora ankamahali prodigeni augustog gerardlopez2001 caraya silky coleww gdtm86 atleonhardt nactos suneel0101 cponeill rockcop hassanchatila

data's Issues

Include reference to Creative Commons in LICENSE.md

The readme includes the text:

We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License.

LICENSE.md only includes a reference to MIT.

When looking for licensing information, people (and scripts) tend to look at LICENSE.md and assume all the info is there.

It looks like someone has already made the mistake of thinking everything was under MIT.
https://www.kaggle.com/fivethirtyeight/fivethirtyeight

Show your work/Data Hosting

Also, data hosting becomes interesting once your dataset gets past 100 MB (GitHub's max file upload size) or 1 GB per repo (GitHub's max repo size), and Git itself becomes slow at 10 MB files and 100 MB repos. I've used Amazon S3 and post-pull hooks to create a /data directory listed in .gitignore to avoid this issue in the past as part of work at DSSG. Anyway, you might need a bigger solution, but if not:

Are any of the voting system data available?

...that was used in this article mentioning Verified Voting?

https://fivethirtyeight.com/features/demographics-not-hacking-explain-the-election-results/

Duplicate in CSV File

The Grand Illusion by Styx is duplicated in data/classic-rock/classic-rock-song-list.csv
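A check like the one below could surface duplicates of this kind. It is only a sketch: the column names `song` and `artist` and the sample rows are hypothetical stand-ins, not the actual header or contents of classic-rock-song-list.csv.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_text, key_fields):
    """Return {key: count} for every key that appears more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        # Normalize case and whitespace so near-identical rows collide.
        tuple(row[field].strip().lower() for field in key_fields)
        for row in reader
    )
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows; the real file's columns may differ.
sample = """song,artist
The Grand Illusion,Styx
Renegade,Styx
The Grand Illusion,Styx
"""

duplicates = find_duplicate_rows(sample, ["song", "artist"])
print(duplicates)
```

Keying on both song and artist avoids flagging different songs that happen to share a title.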

"Measuring the Effect of the Economy on Elections"

If I could access the dataset related to this article, I'd really appreciate it! Working on undergrad research pertinent to Mr. Silver's work.

Link: https://fivethirtyeight.blogs.nytimes.com/2012/07/05/measuring-the-effect-of-the-economy-on-elections/

Calculations for diverse/segregated cities?

It'd be great to see the calculations for http://fivethirtyeight.com/features/the-most-diverse-cities-are-often-the-most-segregated/

Particularly for calculating the integration-segregation index (and, I suppose by necessity, the trend line the index is based on). The diversity index calcs we can get from footnotes 5 and 6, and the data from, as the article notes, US Census tract data. This gives readers enough to calculate diversity indices for any community (e.g. smaller towns and suburbs), but not enough to create an integration index.

Issues with data in the college-majors repo

Hi all,

I've been working with the data in the college majors repo over the weekend: https://github.com/fivethirtyeight/data/tree/master/college-majors

I think something went wrong during the data processing, since if you compare the gender-related data in recent-grads.csv and women-stem.csv, they don't match up. recent-grads.csv also indicates that 56% of all computer science majors are female, which is way off.

I'm looking into the data right now to try to find out what happened, but I'd appreciate a second look at this data set to make sure everything is in order.

Cheers,
Randy
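One way to pin down which majors disagree between the two files is to join them on a shared key and compare the share of women. The sketch below assumes both files carry `Major_code` and `ShareWomen` columns (an assumption, not confirmed here) and uses made-up codes and values purely for illustration.

```python
def share_mismatches(recent_rows, stem_rows, tolerance=0.01):
    """List (major_code, recent_share, stem_share) where the files disagree."""
    stem_by_code = {
        row["Major_code"]: float(row["ShareWomen"]) for row in stem_rows
    }
    mismatches = []
    for row in recent_rows:
        code = row["Major_code"]
        if code not in stem_by_code:
            continue  # major only present in one file; nothing to compare
        share = float(row["ShareWomen"])
        if abs(share - stem_by_code[code]) > tolerance:
            mismatches.append((code, share, stem_by_code[code]))
    return mismatches


# Made-up rows for illustration; codes and shares are not real data.
recent = [
    {"Major_code": "A1", "ShareWomen": "0.56"},
    {"Major_code": "B2", "ShareWomen": "0.30"},
]
stem = [
    {"Major_code": "A1", "ShareWomen": "0.18"},
    {"Major_code": "B2", "ShareWomen": "0.30"},
]

mismatches = share_mismatches(recent, stem)
print(mismatches)
```

With real files, the row lists could come from `csv.DictReader`.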

Update with The Hateful Eight data

A hearing-impaired subtitles file can provide word data: https://pastebin.com/raw/BwaWPw5e

Bracket file numbers should have leading zeros

You know, for readability in the repo.

The order gets goofy:

https://github.com/fivethirtyeight/data/tree/master/march-madness-predictions
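To illustrate why the order gets goofy: file listings sort names as text, so "10" lands before "2". A minimal sketch, using hypothetical file names rather than the ones in the repo:

```python
import re


def pad_numbers(filename, width=2):
    """Left-pad every run of digits in the name with zeros to `width`."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), filename)


# Hypothetical file names, not the actual ones in the repo.
names = ["bracket-10.tsv", "bracket-2.tsv", "bracket-1.tsv"]

plain_order = sorted(names)
padded_order = sorted(pad_numbers(n) for n in names)

print(plain_order)
print(padded_order)
```

Note that this pads every digit run in the name, so names that also contain dates or other numbers would need a narrower pattern.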

nfl_elo incorrect playoff game dates

I believe that the nfl_elo dataset may have the 2015 season wild-card and divisional round playoff games on the wrong days:

According to Wikipedia (and other sources that I followed up on), the wild-card games were on 1/9 and 1/10, with the divisional round on 1/16 and 1/17.

https://en.wikipedia.org/wiki/2015%E2%80%9316_NFL_playoffs#NFL_playoff_schedule

Add license

The repo contains no license file or link to license information that I could find.

Trump approval ratings API or raw data endpoint?

FR: Is there any endpoint to the raw data similar to this: #72

But for the trump approval ratings? https://projects.fivethirtyeight.com/trump-approval-ratings/

Thanks

Out of date

Can you please update this with data from http://projects.fivethirtyeight.com/2016-endorsement-primary/? Or make that data otherwise exportable?

shiny app link is 404

oops, referring to this repo

Data for Global Club Soccer Rankings

https://projects.fivethirtyeight.com/global-club-soccer-rankings/

Are there any plans to make this data available through Github or any other mechanism? I'm building an application that requires a ranked list of soccer clubs and would love to use your dataset. If this list of ranked teams could be uploaded as a .csv here that would be fantastic.

Soccer SPI data

Dear 538 team,

I am curious about the preparation of your soccer data, here: https://github.com/fivethirtyeight/data/tree/master/soccer-spi. There appears to be some missing data: there are 467 unique teams in the matches csv, but only 453 teams in the ranks csv. Is it possible to obtain a complete version of this dataset? Thanks in advance for your help!

Best,
Stephanie
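A set difference would show exactly which teams appear in the matches file but not in the ranks file. This sketch assumes the matches rows have `team1` and `team2` columns and the ranks rows have a `name` column; those names and the sample teams are assumptions for illustration.

```python
def teams_missing_from_ranks(match_rows, rank_rows):
    """Return sorted team names present in matches but absent from ranks."""
    match_teams = {row["team1"] for row in match_rows} | {
        row["team2"] for row in match_rows
    }
    ranked_teams = {row["name"] for row in rank_rows}
    return sorted(match_teams - ranked_teams)


# Hypothetical rows; real team names and columns may differ.
matches = [
    {"team1": "Team A", "team2": "Team B"},
    {"team1": "Team C", "team2": "Team A"},
]
ranks = [{"name": "Team A"}, {"name": "Team B"}]

missing = teams_missing_from_ranks(matches, ranks)
print(missing)
```

Small spelling differences between the two files would also show up here, so the result is worth scanning by eye before concluding that a team is truly unranked.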

Oakland, California numbers

Oakland is listed as having 1530 police. I doubt it has ever had that many. http://www.nytimes.com/2012/03/25/us/oakland-police-try-to-fill-the-ranks-but-keep-falling-behind.html says 837 in November, 2008, and 636 at the time of the article in March, 2012.

The graphic in http://fivethirtyeight.com/datalab/most-police-dont-live-in-the-cities-they-serve/ says the numbers are 2010 with source U.S. Census. Is this census data source online? I did some naive searches on census.gov and don't see anything obvious.

What encoding format is the "Riddler-castles" data in?

Oscars data available?

Hey 538 team,

Wondering if you guys could upload your Oscars data here? Or if there's a good public db/API that houses the data? Thanks!

Sworn vs all staff

This is actually for all police force staff and not for "Officers", right? The totals are completely wrong otherwise. Oakland's force is around 700, so the 1,530 has to be all OPD staff, not just cop cops. Can you clarify?

Method for normalization

Hi,

Could you please share the method that you used for normalizing the ratings and user count?

Thanks.

Suggestion on repo structure

I see that you're grouping all data sets in one repo. While there's some convenience to organizing things that way, I think it's going to make it more difficult for curious readers to sort through once you've published hundreds of data sets. It would probably be better in the long term to do one repo for each story or data set, and then link to that individual repo from the story.

Given that there are only three data sets posted so far, this will be easier to re-organize now than later.

Data and code for Margarita Clustering

http://fivethirtyeight.com/features/we-got-drunk-on-margaritas-for-science/

Add dataset for population and race percentages by county

I have compiled the following spreadsheet, which I think would be a good fit for this data repository. All data comes from 2010 census figures, and its advantage over said tables is that it doesn't require going through a 5,000-line cross reference to figure out what each column means, and all figures are in the same sheet instead of split across 20 different ones over 3 files.

I understand that this repository is meant for datasets featured in 538 articles, but it seems likely that this sort of data will be used in the future, and I can't think of a better repository for this to live in.

Race % and Pop by County.xlsx

Data Dictionary for the datasets

Hi @BenCasselman, thanks for uploading these useful datasets. Can we also get a data dictionary explaining the columns? For example, in the grad_students dataset, the columns Grad_employed and Grad_unemployed don't add up to Grad_total, so a dictionary would go a long way to help. Also, how was the unemployment rate computed?
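The gap described above can be measured per row. A minimal sketch, assuming the columns `Grad_total`, `Grad_employed` and `Grad_unemployed` named in this issue plus a hypothetical `Major` label column; the sample values are made up.

```python
def employment_gap(rows):
    """Return (major, total - employed - unemployed) for rows that don't add up."""
    gaps = []
    for row in rows:
        total = int(row["Grad_total"])
        accounted = int(row["Grad_employed"]) + int(row["Grad_unemployed"])
        if accounted != total:
            gaps.append((row["Major"], total - accounted))
    return gaps


# Made-up rows for illustration only.
rows = [
    {"Major": "Example Major", "Grad_total": "100",
     "Grad_employed": "70", "Grad_unemployed": "5"},
    {"Major": "Other Major", "Grad_total": "50",
     "Grad_employed": "45", "Grad_unemployed": "5"},
]

gaps = employment_gap(rows)
print(gaps)
```

The size of the gap per major is the number a data dictionary would need to explain.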

It would be great if the code were also shared

I like the graphics on FiveThirtyEight and hope to learn how to make them. The internet needs more graphics like these. Thanks.

Louisville Beat St. Louis 66-51

Currently, at FiveThirtyEight, St. Louis is listed as the winner, but according to the NCAA, Louisville is the winner.

Chance of Winning data over time

Can someone please help me get the data behind the chance-of-winning forecast for Clinton and Trump? I tried using data scraping tools, but they wouldn't work on the graphs.

http://projects.fivethirtyeight.com/2016-election-forecast/?ex_cid=rrpromo
It's under How the forecast has changed, and the latest data as of 10/20 is Clinton for 87.2% and Trump 12.7%. Thank you in advance!

Elo Code

Do you guys have any plans to publish your Elo code? Specifically NBA Elo https://github.com/fivethirtyeight/data/tree/master/nba-elo

Stream

Can you post code required to get the CSV data from the Twitter API? I would like to create and host a streaming / realtime version of this. It would be a cool addition to your site IMO.

[Closed]

College majors repo does not link to story

Can be changed in the readme by adding

[FiveThirtyEight's story on earnings of college majors](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/)

as a link in README.md

Burrito Bracket data missing

Thank you for making your data available.

Any chance I can get my hands on some delicious, delicious burrito data?

Movie boxoffice data

Any chance of data getting posted for this article? http://fivethirtyeight.com/datalab/scary-movies-are-the-best-investment-in-hollywood

I guess this is a... feature request?

xxx

Update 2016 Potential Candidates?

Looks like there's a new article updating the potential candidates but the data set doesn't reflect the new updates. This could be a lot of fun to have access to as it gets updated.

Line endings are CR instead of LF or CR/LF

Hello!

First of all, thanks for posting the data to the stories making it easier to follow the described methodology in each article. But, please, could you upload the csv files with correct line endings? As of now, most of the CSV files have CR line endings, instead of the more "canonical" LF or CR/LF. The following articles from StackOverflow provide some guidelines:

http://stackoverflow.com/questions/2332349/best-practices-for-cross-platform-git-config/
http://stackoverflow.com/questions/10491564/git-and-cr-vs-lf-but-not-crlf

Thanks!
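Until the files are re-uploaded, the line endings can be converted locally. A minimal sketch that rewrites CR and CR/LF to LF on raw bytes; the function name and sample content are illustrative.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so the CR in each pair is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


cr_only = b"name,value\ra,1\rb,2\r"
fixed = normalize_newlines(cr_only)
print(fixed)
```

This also rewrites any CR inside quoted fields, so it assumes no field contains an embedded line break.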

Race (Unknown-White) in Biopics CSV

I noticed that in the CSV of the biopics, every single line in which the race_known column was Unknown had the subject_race column as White; was it that White was the default race, or that you would guess (based on the subject's name and appearance, but without confirmation based on ancestry or self-reporting) that all 197 of the subjects were White? I realize that this wouldn't change the displays, because "Unknown" makes the race column meaningless, but it is a bit curious.
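The observation can be reproduced by tallying `subject_race` over the rows where `race_known` is Unknown. The column names come from this issue; the sample rows below are made up for illustration.

```python
from collections import Counter


def race_values_when_unknown(rows):
    """Count subject_race values among rows whose race_known is 'Unknown'."""
    return Counter(
        row["subject_race"] for row in rows if row["race_known"] == "Unknown"
    )


# Made-up rows for illustration; "Example" is a placeholder value.
rows = [
    {"race_known": "Unknown", "subject_race": "White"},
    {"race_known": "Known", "subject_race": "Example"},
    {"race_known": "Unknown", "subject_race": "White"},
]

counts = race_values_when_unknown(rows)
print(counts)
```

If the counter holds a single key, every Unknown row shares the same `subject_race` value.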

Making the Data Available in Earlier Stages

First, thanks FiveThirtyEight for making your data available--this is very cool to see, and interesting to be able to replicate results.

One request I have, which may or may not be tenable, is to make data available from the earlier stages of variable construction. For instance, in Nate's recent piece on airline safety, the data we have access to is the number of incidents, fatalities, etc from 1985 to 1999, and again from 2000-2014. While the data is interesting to see, the same data by year would be even more interesting, as would a list of all incidents and how they are coded.

For example, it's easy to imagine analysis that could be done by combining the by-year incidents with airline sales in subsequent years (something Nate alludes to). However, this isn't something we're able to do, given that we can only see the data in 15-year chunks. The earlier the data is available to us, the more flexibility we'll have in using your data to develop our own theories and test them, which makes it of greater use to us.

I can appreciate the reasons why this might not be a good idea for FiveThirtyEight (in particular, it means any assumptions and decisions made in cleaning are potentially open to criticism) but to the extent possible, making the earlier stages of your data publicly available would be very appreciated.

Irreproducible Research

No code or data has been shared for the following articles:

http://fivethirtyeight.com/datalab/the-return-of-mlbs-youth/
http://fivethirtyeight.com/datalab/what-to-expect-from-baseball-americas-top-100-prospects/
http://fivethirtyeight.com/features/the-hidden-value-of-the-nba-steal/

Some issues that could be addressed by publicly sharing data and code:

A steal is worth 9.1 times a point, but the article makes no mention of confidence intervals.
This conclusion is drawn from a sample of players who have missed at least 20 games and played at least 20 games in a season. Is this a representative sample? We have no way of assessing this because the underlying dataset has not been shared publicly.

Polls Data up until now

Hi, not really a data issue but more like a FR.

Could you share the data you use for building this model?
http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/

It seems you share the poll ratings but not the poll results for this year, am I right?

Congressional generic ballot link update

Hi @dmil: it looks like the Congressional generic ballot data link on this repo points to the Trump approval rating data instead. Is there a link to the CSV behind https://projects.fivethirtyeight.com/congress-generic-ballot-polls/?

2016 Election Forecast data feed

(Feature request rather than data issue.)
Will you have a data feed for the 2016 Election Forecasts that can be used in 3rd party apps?
I've been writing voice applications for the Amazon Echo, including "Tweet Poll" which used "IBM Insights for Twitter" for state-by-state sentiment analysis of candidates during the primary season.
I'm interested in writing an interface that surfaces the daily 538 forecast. I would love to know if there is going to be a data feed for that, and what Terms and Conditions would apply to it. (Or who to talk to about setting up an app specific feed.)

Broken Links in data/college-majors

The census links are broken. Is there a way to find the data you started with?

Thanks!

Transition matrix calculation in pew-religions

In https://github.com/fivethirtyeight/data/tree/master/pew-religions it is not clear how the transition data was obtained. In the full Pew report I could not find anything at that level of detail.

[Closed]

Separate TSVs for all March Madness updates?

@dmil Is there any reason that you're adding a new TSV file for every change? Git makes this unnecessary/undesirable. Viewing (or restoring) previous versions of files is easily done with Git (and GitHub), and keeping everything in the same file would make it easier to view what changed in each commit. As Git is designed to track changes to individual files, separating everything defeats some of its purpose and functionality.

If you still want to have multiple files, I would suggest no more than one per round. But the Round of 64 hasn't even started and we already have 7 separate files; so at this rate there will be a lot of clutter soon.

Is the shoot out data available?

I teach high school computer science with a math teacher who also teaches AP Stats. The raw 269,000 match preferences would be an awesome data set for us to play with. Is that available? I only see your dataset, which contains the reduction of that.

Separate into different repos

24MB is a big chunk, when someone may only want to get the code for one story.