fivethirtyeight / data
Data and code behind the articles and graphics at FiveThirtyEight
Home Page: https://data.fivethirtyeight.com/
License: Creative Commons Attribution 4.0 International
The readme includes the text:
We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License.
LICENSE.md only includes a reference to MIT.
When looking for the licensing information people (and scripts) tend to look at the LICENSE.md and assume all the info is there.
It looks like someone has already made the mistake of thinking everything was under MIT.
https://www.kaggle.com/fivethirtyeight/fivethirtyeight
Also, data hosting becomes interesting once your dataset gets past 100 MB (GitHub's max file upload size) or 1 GB in a repo (GitHub's max repo size), and git itself becomes slow with 10 MB files and 100 MB repos. I've used Amazon S3 and post-pull hooks to create a /data directory listed in the .gitignore to avoid this issue in the past as part of work at dssg. Anyway, you might need a bigger solution, but if not:
...that was used in this article mentioning Verified Voting?
https://fivethirtyeight.com/features/demographics-not-hacking-explain-the-election-results/
The Grand Illusion by Styx, is duplicated in data/classic-rock/classic-rock-song-list.csv
If I could access the dataset related to this article, I'd really appreciate it! Working on undergrad research pertinent to Mr. Silver's work.
It'd be great to see the calculations for http://fivethirtyeight.com/features/the-most-diverse-cities-are-often-the-most-segregated/
Particularly for calculating the integration-segregation index (and, I suppose by necessity, the trend line the index is based on). The diversity index calcs we can get from footnotes 5 and 6, and the data from, as the article notes, US Census tract data. This gives readers enough to calculate diversity indices for any community (e.g. smaller towns and suburbs), but not enough to create an integration index.
Hi all,
I've been working with the data in the college majors repo over the weekend: https://github.com/fivethirtyeight/data/tree/master/college-majors
I think something went wrong during the data processing, since if you compare the gender-related data in recent-grads.csv and women-stem.csv, they don't match up. recent-grads.csv also indicates that 56% of all computer science majors are female, which is way off.
I'm looking into the data right now to try to find out what happened, but I'd appreciate a second look at this data set to make sure everything is in order.
Cheers,
Randy
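A rough way to run the comparison described above is to compute the share of women per major in each file and list the majors where the two disagree. This is only a sketch: the column names `Major_code`, `Total` and `Women` are assumptions about the two CSVs, and the rows below are made up for illustration rather than taken from the repository.

```python
import csv
import io

def share_women_by_major(csv_text, key="Major_code", women="Women", total="Total"):
    """Return {major_code: women / total} for rows with usable counts."""
    shares = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            n_total = float(row[total])
            n_women = float(row[women])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric counts
        if n_total > 0:
            shares[row[key]] = n_women / n_total
    return shares

def mismatches(a, b, tol=0.01):
    """List majors present in both mappings whose shares differ by more than tol."""
    return sorted(k for k in a.keys() & b.keys() if abs(a[k] - b[k]) > tol)

# Made-up rows for illustration only (hypothetical values and column names).
recent = "Major_code,Total,Women\n1,100,56\n2,200,50\n"
stem = "Major_code,Total,Women\n1,100,20\n2,200,50\n"

differing = mismatches(share_women_by_major(recent), share_women_by_major(stem))
print(differing)
```

With the real files, the text arguments would be replaced by the contents of recent-grads.csv and women-stem.csv, and the tolerance adjusted to taste.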
Impaired hearing subtitles file can provide word data: https://pastebin.com/raw/BwaWPw5e
You know, for readability in the repo.
https://github.com/fivethirtyeight/data/tree/master/march-madness-predictions
I believe that the nfl_elo dataset may have the 2015 season wild card and division round playoff games on the wrong days:
According to Wikipedia (and other sources that I followed up on), the wild-card games were on 1/9 and 1/10, with the divisional round on 1/16 and 1/17.
https://en.wikipedia.org/wiki/2015%E2%80%9316_NFL_playoffs#NFL_playoff_schedule
The repo contains no license file or link to license information that I could find.
FR: Is there any endpoint to the raw data similar to this: #72
But for the trump approval ratings? https://projects.fivethirtyeight.com/trump-approval-ratings/
Thanks
Can you please update this with data from http://projects.fivethirtyeight.com/2016-endorsement-primary/? Or make that data otherwise exportable?
oops, referring to this repo
https://projects.fivethirtyeight.com/global-club-soccer-rankings/
Are there any plans to make this data available through Github or any other mechanism? I'm building an application that requires a ranked list of soccer clubs and would love to use your dataset. If this list of ranked teams could be uploaded as a .csv here that would be fantastic.
Dear 538 team,
I am curious about the preparation of your soccer data, here: https://github.com/fivethirtyeight/data/tree/master/soccer-spi. There appears to be some missing data: there are 467 unique teams in the matches csv, but only 453 teams in the ranks csv. Is it possible to obtain a complete version of this dataset? Thanks in advance for your help!
Best,
Stephanie
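For anyone wanting to reproduce the count above, a set difference between the team names in the two files shows exactly which teams are missing. This is a minimal sketch: the column names `team1`, `team2` and `name` are assumptions about the matches and ranks CSVs, and the sample rows are invented.

```python
import csv
import io

def unique_values(csv_text, columns):
    """Collect the distinct non-empty values found in the given columns."""
    values = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in columns:
            name = (row.get(col) or "").strip()
            if name:
                values.add(name)
    return values

# Invented sample data; the column names are assumptions, not confirmed.
matches_csv = "team1,team2\nA,B\nB,C\nD,A\n"
ranks_csv = "name\nA\nB\nC\n"

match_teams = unique_values(matches_csv, ["team1", "team2"])
ranked_teams = unique_values(ranks_csv, ["name"])

# Teams that appear in matches but have no row in the rankings.
missing = sorted(match_teams - ranked_teams)
print(missing)
```

Names that differ only by spelling or whitespace would show up as missing too, so the result is worth checking by eye.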
Oakland is listed as having 1530 police. I doubt it has ever had that many. http://www.nytimes.com/2012/03/25/us/oakland-police-try-to-fill-the-ranks-but-keep-falling-behind.html says 837 in November, 2008, and 636 at the time of the article in March, 2012.
The graphic in http://fivethirtyeight.com/datalab/most-police-dont-live-in-the-cities-they-serve/ says the numbers are 2010 with source U.S. Census. Is this census data source online? I did some naive searches on census.gov and don't see anything obvious.
Hey 538 team,
Wondering if you guys could upload your Oscars data here? Or if there's a good public db/API that houses the data? Thanks!
This is actually for all police force staff and not for "Officers", right? The totals are completely wrong otherwise: Oakland's force is around 700, so the 1,530 has to be all OPD staff, not just cop cops. Can you clarify?
Hi,
Could you please share the method that you used for normalizing the ratings and user count?
Thanks.
I see that you're grouping all data sets in one repo. While there's some convenience to organizing things that way, I think it's going to make it more difficult for curious readers to sort through once you've published hundreds of data sets. It would probably be better in the long term to do one repo for each story or data set, and then link to that individual repo from the story.
Given that there are only three data sets posted so far, this will be easier to re-organize now than later.
I have compiled the following spreadsheet, which I think would be a good fit for this data repository. All data comes from 2010 census figures, and its advantage over said tables is that it doesn't require going through a 5000-line cross reference to figure out what each column means, and all figures are in the same sheet instead of being split across 20 different ones over 3 files.
I understand that this repository is meant for datasets featured in 538 articles, but it seems likely that this sort of data will be used in the future, and I can't think of a better repository for this to live in.
Hi @BenCasselman, thanks for uploading these useful datasets. Can we also get a data dictionary explaining the columns? For example, in the grad_students dataset, the columns Grad_employed and Grad_unemployed don't add up to Grad_total, so a dictionary would go a long way. Also, how was the unemployment rate computed?
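Until a data dictionary exists, one way to probe the question is to compute the gap between the total and the employed-plus-unemployed sum, and to try the conventional labor-force definition of the unemployment rate (unemployed divided by employed plus unemployed). Whether the dataset actually uses that definition is an assumption, as is the guess that the gap is people outside the labor force. The numbers below are invented.

```python
def labor_force_check(total, employed, unemployed):
    """Return (gap, rate) under the conventional labor-force definition.

    gap  : total minus (employed + unemployed); possibly people not in the
           labor force, which is an assumption, not something documented.
    rate : unemployed / (employed + unemployed), or None if that sum is zero.
    """
    labor_force = employed + unemployed
    gap = total - labor_force
    rate = unemployed / labor_force if labor_force else None
    return gap, rate

# Invented counts standing in for Grad_total, Grad_employed, Grad_unemployed.
gap, rate = labor_force_check(total=1000, employed=760, unemployed=40)
print(gap, rate)
```

If the rate computed this way matches the published column for a few rows, that would suggest (but not prove) the same definition was used.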
I like the graphics on FiveThirtyEight and hope to learn how to make them. The internet needs more graphics like these. Thanks.
Currently, at FiveThirtyEight, St. Louis is listed as the winner, but according to NCAA, Louisville is the winner.
Can someone please help me with getting the data behind the chance of winning forecast for Clinton and Trump? I tried using data scraping tools, but they wouldn't work on the graphs.
http://projects.fivethirtyeight.com/2016-election-forecast/?ex_cid=rrpromo
It's under How the forecast has changed, and the latest data as of 10/20 is Clinton for 87.2% and Trump 12.7%. Thank you in advance!
Do you guys have any plans to publish your Elo code? Specifically NBA Elo https://github.com/fivethirtyeight/data/tree/master/nba-elo
Can you post code required to get the CSV data from the Twitter API? I would like to create and host a streaming / realtime version of this. It would be a cool addition to your site IMO.
Can be changed in the readme by adding
[FiveThirtyEight's story on earnings of college majors](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/)
as a link in README.md
Thank you for making your data available.
Any chance I can get my hands on some delicious, delicious burrito data?
Any chance of data getting posted for this article? http://fivethirtyeight.com/datalab/scary-movies-are-the-best-investment-in-hollywood
I guess this is a... feature request?
Looks like there's a new article updating the potential candidates but the data set doesn't reflect the new updates. This could be a lot of fun to have access to as it gets updated.
Hello!
First of all, thanks for posting the data to the stories making it easier to follow the described methodology in each article. But, please, could you upload the csv files with correct line endings? As of now, most of the CSV files have CR line endings, instead of the more "canonical" LF or CR/LF. The following articles from StackOverflow provide some guidelines:
http://stackoverflow.com/questions/2332349/best-practices-for-cross-platform-git-config/
http://stackoverflow.com/questions/10491564/git-and-cr-vs-lf-but-not-crlf
Thanks!
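As a stopgap on the reader's side, the CR line endings can be converted to LF before parsing. The sketch below works on raw bytes; the function name is made up, and the simple replacement would also rewrite any CR characters that sit inside quoted fields, so it assumes the files have none.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Handle CRLF first so it does not turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Tiny CR-terminated sample standing in for one of the CSV files.
sample = b"a,b\r1,2\r3,4\r"
converted = normalize_newlines(sample)
print(converted)
```

Python's text-mode `open()` with its default `newline=None` also translates CR and CRLF to LF on read, which may be enough if the goal is only to load the data.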
I noticed that in the CSV of the biopics, every single line in which the race_known column was Unknown had the subject_race column as White; was it that White was the default race, or that you would guess (based on the subject's name and appearance, but without confirmation based on ancestry or self-reporting) that all 197 of the subjects were White? I realize that this wouldn't change the displays, because "Unknown" makes the race column meaningless, but it is a bit curious.
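The observation above can be checked with a small cross-tabulation of the two columns. This sketch uses the `race_known` and `subject_race` column names mentioned in the issue; the sample rows, including the "Known" and "Black" values, are invented for illustration.

```python
import csv
import io
from collections import Counter

def crosstab(csv_text, col_a, col_b):
    """Count how often each (col_a, col_b) value pair occurs."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[(row[col_a], row[col_b])] += 1
    return counts

# Invented rows; only the column names come from the issue text.
sample = (
    "race_known,subject_race\n"
    "Unknown,White\n"
    "Known,Black\n"
    "Unknown,White\n"
    "Known,White\n"
)

table = crosstab(sample, "race_known", "subject_race")
# Distinct subject_race values among rows where race_known is Unknown.
unknown_values = {race for (known, race) in table if known == "Unknown"}
print(table[("Unknown", "White")], unknown_values)
```

On the real file, a single-element set for the Unknown rows would confirm the pattern described.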
First, thanks FiveThirtyEight for making your data available--this is very cool to see, and interesting to be able to replicate results.
One request I have, which may or may not be tenable, is to make data available from the earlier stages of variable construction. For instance, in Nate's recent piece on airline safety, the data we have access to is the number of incidents, fatalities, etc from 1985 to 1999, and again from 2000-2014. While the data is interesting to see, the same data by year would be even more interesting, as would a list of all incidents and how they are coded.
For example, it's easy to imagine analysis that could be done by combining the by-year incidents with airline sales in subsequent years (something Nate alludes to). However, this isn't something we're able to do, given that we can only see the data in 15-year chunks. The earlier the data is available to us, the more flexibility we'll have in using your data to develop our own theories and test them, which makes it of greater use to us.
I can appreciate the reasons why this might not be a good idea for FiveThirtyEight (in particular, it means any assumptions and decisions made in cleaning are potentially open to criticism) but to the extent possible, making the earlier stages of your data publicly available would be very appreciated.
No code or data has been shared for the following articles:
http://fivethirtyeight.com/datalab/the-return-of-mlbs-youth/
http://fivethirtyeight.com/datalab/what-to-expect-from-baseball-americas-top-100-prospects/
http://fivethirtyeight.com/features/the-hidden-value-of-the-nba-steal/
Some issues that could be addressed by publicly sharing data and code:
Hi, not really a data issue but more like a FR.
Could you share the data you use for building this model?
http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/
It seems you share the polls rating but not the polls results for this year, am I right?
Hi @dmil: it looks like the Congressional generic ballot data link on this repo points to the Trump approval rating data instead. Is there a link to the CSV behind https://projects.fivethirtyeight.com/congress-generic-ballot-polls/?
(Feature request rather than data issue.)
Will you have a data feed for the 2016 Election Forecasts that can be used in 3rd party apps?
I've been writing voice applications for the Amazon Echo, including "Tweet Poll" which used "IBM Insights for Twitter" for state-by-state sentiment analysis of candidates during the primary season.
I'm interested in writing an interface that surfaces the daily 538 forecast. I would love to know if there is going to be a data feed for that, and what Terms and Conditions would apply to it. (Or who to talk to about setting up an app specific feed.)
The census links are broken. Is there a way to find the data you started with?
Thanks!
In https://github.com/fivethirtyeight/data/tree/master/pew-religions it is not clear how the transition data was obtained. In the full Pew report I could not find anything at that level of detail.
@dmil Is there any reason that you're adding a new TSV file for every change? Git makes this unnecessary/undesirable. Viewing (or restoring) previous versions of files is easily done with Git (and GitHub), and keeping everything in the same file would make it easier to view what changed in each commit. As Git is designed to track changes to individual files, separating everything defeats some of its purpose and functionality.
If you still want to have multiple files, I would suggest no more than one per round. But the Round of 64 hasn't even started and we already have 7 separate files; so at this rate there will be a lot of clutter soon.
I teach high school computer science with a math teacher who also teaches AP Stats. The raw 269,000 match preferences would be an awesome data set for us to play with. Is that available? I only see your dataset, which contains the reduction of that.
24MB is a big chunk, when someone may only want to get the code for one story.