

datasets

U.S. Census Bureau Data: https://www.census.gov/

World Values Survey: https://www.worldvaluessurvey.org/

Pew Research Center Data: https://www.pewresearch.org/

Human Rights Data Analysis Group (HRDAG): https://hrdag.org/

Global Terrorism Database: https://www.start.umd.edu/gtd/

National Crime Victimization Survey: https://www.bjs.gov/ncvs/

Twitter API: https://developer.twitter.com/en/docs/twitter-api

https://www.openml.org/search?type=data

winequalityN comes from: https://www.kaggle.com/datasets/shelvigarg/wine-quality-dataset

data.csv comes from: https://www.kaggle.com/datasets/shree1992/housedata

cars.csv comes from: https://www.kaggle.com/datasets/abineshkumark/carsdata

Spotify Classifier: https://www.kaggle.com/datasets/geomack/spotifyclassification

https://sports-statistics.com/sports-data/sports-data-sets-for-data-modeling-visualization-predictions-machine-learning/
Michael Jordan and Shaquille O'Neal Career Stats: Classification (Win is the Target Variable)
NBA shot logs: Data on shots taken during the 2014-2015 season, which player took the shot, where on the floor was the shot taken from, who was the nearest defender, how far away was the nearest defender, time on the shot clock, and much more.
Diamonds: Multiclass Classification and/or Regression
This is the diamonds dataset. It is ideal in length for practice (+50k samples) and has multiple targets you can predict as a regression or a multi-class classification task.
🎯 Targets: ‘carat’ or ‘price’
🔗 Link: Kaggle
📦Dimensions: (53940, 10)
⚙Missing values: No


Abalone Dataset: Classification / Regression (Male/Female or Age)
This is a unique dataset from the field of zoology. The task is to predict the age of abalone (a type of mollusk) using several physical measurements. Traditionally, their age is found by cutting the shell through the cone, staining it, and counting the number of rings under a microscope.
🎯 Target: ‘Rings’
🔗 Link: Kaggle
📦Dimensions: (4177, 9)
⚙Missing values: No


King County Real Estate Dataset:
This is the dataset for those who are still interested in real estate and house price regression.
🎯 Target: ‘price’
🔗 Link: Kaggle
📦Dimensions: (21613, 17)
⚙Missing values: Yes

Cancer death rate Dataset
This dataset challenges you to predict the cancer mortality rate per capita (per 100,000) using several demographic variables. These data were aggregated from a number of sources including the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. Most of the data preparation process can be viewed here.
🎯 Target: ‘TARGET_deathRate’
🔗 Link: Data.world
📦Dimensions: (3047, 33)
⚙Missing values: Yes

Life Expectancy (WHO)
How long will a person live? This is one of the hardest unanswered questions in science. Several studies have been undertaken to understand human life and longevity, and this dataset provided by WHO (World Health Organization) is one of them.
🎯 Target: ‘Life expectancy.’
🔗 Link: Kaggle
📦Dimensions: (2938, 21)
⚙Missing values: Yes


Car prices
The title says it all: predict car prices using variables like mileage, fuel type, transmission, and several domain-specific features. This is also an excellent dataset for flexing your feature engineering muscles.
🎯 Target: ‘selling_price’
🔗 Link: Kaggle
📦Dimensions: (8128, 12)
⚙Missing values: Yes

Binary classification

NBA rookie stats
The first binary classification dataset in the list requires you to predict whether a rookie basketball player will last more than 5 years in the league.
🎯 Target: ‘TARGET_5Yrs’
🔗 Link: Data.world
📦Dimensions: (8128, 12)
⚙Missing values: Yes

Stroke prediction
Another medical dataset asks you to predict whether a patient will have a stroke or not, based on their history and several interesting features.
🎯 Target: ‘stroke’
🔗 Link: Kaggle
📦Dimensions: (5110, 11)
⚙Missing values: Yes

Water potability
Safe drinking water is the most basic human right and a major influence on health. Using this dataset, you should classify water bodies into potable (drinkable) and not potable using several chemical properties.
🎯 Target: ‘Potability’
🔗 Link: Kaggle
📦Dimensions: (3276, 10)
⚙Missing values: Yes

Smart grid stability
This is an augmented version of the “Electrical Grid Stability Simulated Dataset” created by Vadim Arzamasov. It was donated to UCI and made available on Kaggle. You will be predicting the stability of 4-node smart grid systems (whatever that means).
🎯 Target: ‘stabf’
🔗 Link: Kaggle
📦Dimensions: (60000, 13)
⚙Missing values: No

IBM HR analytics & employee attrition
This fictional dataset, created by IBM data scientists, tasks you with uncovering which factors lead to employee attrition (whether employees will leave their role).
🎯 Target: ‘Attrition’
🔗 Link: Kaggle
📦Dimensions: (1470, 35)
⚙Missing values: No

Can I eat this mushroom?
Another one-of-a-kind dataset asks you to classify mushrooms into edible and poisonous. It also presents a unique challenge: all features are categorical.
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (8124, 23)
⚙Missing values: Yes

Banknote authentication
Even though this dataset has very few features, I wanted to include it because the task is really interesting: using physical attributes of banknotes, you should classify them into forged or original.
🎯 Target: ‘class’
🔗 Link: Kaggle
📦Dimensions: (1372, 5)
⚙Missing values: No

Adult income dataset
Predict whether a person will end up earning more than 50k using factors like age, education, background, gender, marital status, etc.
🎯 Target: ‘income’
🔗 Link: Kaggle
📦Dimensions: (48842, 15)
⚙Missing values: Yes

Multi-class classification datasets

Yeast classification
This dataset will give you a small taste of the world of microbiology. You are tasked to classify a fungus called yeast into species.
🎯 Target: ‘class_protein_localization’
🔗 Link: OpenML
📦Dimensions: (1484, 9)
⚙Missing values: No
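
The Dimensions and Missing values entries listed for the datasets above can be checked once a file is downloaded. The following is a minimal sketch using only the Python standard library. The helper name `summarize_csv`, the set of strings treated as missing, and the three sample rows are all assumptions made for illustration and are not taken from any of the datasets.

```python
import csv
import io

# Assumption: cells holding one of these strings (case-insensitive) count as missing.
MISSING_MARKERS = {"", "na", "nan", "null"}


def summarize_csv(text_stream):
    """Return ((n_rows, n_columns), {column: missing_count}) for a CSV stream."""
    reader = csv.reader(text_stream)
    header = next(reader)  # first row is assumed to hold the column names
    n_rows = 0
    missing = {name: 0 for name in header}
    for row in reader:
        n_rows += 1
        for name, value in zip(header, row):
            if value.strip().lower() in MISSING_MARKERS:
                missing[name] += 1
    # Report only the columns that actually contain missing cells.
    return (n_rows, len(header)), {k: v for k, v in missing.items() if v}


# Hypothetical sample data standing in for a downloaded file.
sample = io.StringIO(
    "carat,cut,price\n"
    "0.23,Ideal,326\n"
    "0.21,,326\n"
    "0.29,Premium,NA\n"
)
shape, missing = summarize_csv(sample)
print(shape, missing)
```

For a real file, pass `open("cars.csv", newline="")` instead of the in-memory sample. The markers a given dataset uses for missing values may differ, so adjust `MISSING_MARKERS` after inspecting the data.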


  • mlb_salaries_2014.csv Salaries of players in Major League Baseball at the start of the 2014 season, from the Lahman Baseball Database.

  • disease_democ.csv Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries.

  • gdp_pc.csv World Bank data on 2014 Gross Domestic Product (GDP) per capita for the world’s nations, in current international dollars, corrected for purchasing power in different territories.

  • nations.csv Data from the World Bank Indicators portal, which is an incredibly rich resource. Contains the following fields: iso2c and iso3c, the two- and three-letter codes for each country, assigned by the International Organization for Standardization.

  • oil_production.csv Data on oil production by world region from 2000 to 2014, in thousands of barrels per day, from the U.S. Energy Information Administration.

  • ucb_stanford_2014.csv Data on federal government grants to UC Berkeley and Stanford University in 2014, downloaded from USASpending.gov.

  • urls.xls A spreadsheet that we’ll use in webscraping.

Data used in reporting this story, which revealed that some of the doctors paid as “experts” by the drug company Pfizer had troubling disciplinary records:

  • pfizer.csv Payments made by Pfizer to doctors across the United States in the second half of 2009.

  • fda.csv Data on warning letters sent to doctors by the U.S. Food and Drug Administration, because of problems in the way in which they ran clinical trials testing experimental treatments. Contains the following variables:

  • food_stamps.csv U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015.

  • kindergarten.csv Data from the California Department of Public Health, documenting enrollment and the number of children with complete immunizations at entry into kindergartens in California from 2001 to 2015.

  • gdp_pc.csv, gdp_pc.csvt CSV file with World Bank data on GDP per capita for the world’s nations in 2014, plus an ancillary file that lets QGIS understand the data types for each field.

  • warming.csv NASA data on the annual average global temperature, from 1880 to 2015, compared to the average from 1951-1980.

Global Terrorism Database

Maintained by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) at the University of Maryland in College Park, the Global Terrorism Database contains information on more than 150,000 terrorist attacks from 1970 to 2015. It is a rich source of information on terrorist groups across the globe, and the attacks they are responsible for.

You can download the data from here: https://gtd.terrorismdata.com/, selecting the Download full GTD dataset option. An extensive codebook details all of the fields in the data.

The data is provided as a series of spreadsheets in .xlsx format. I suggest that you import this data into OpenRefine before processing any further, and create a new field giving the date of each event in standard YYYY-MM-DD format. This can be done from the eventid field.
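
For readers who prefer a script to OpenRefine, the date derivation can be sketched in Python. This assumes that each eventid is a 12-digit number whose first eight digits encode the date as yyyymmdd, followed by a four-digit sequence number, and that an unknown month or day is stored as 00. Check both assumptions against the GTD codebook before relying on them. The function name `eventid_to_iso_date` is hypothetical.

```python
from datetime import date
from typing import Optional


def eventid_to_iso_date(eventid) -> Optional[str]:
    """Return 'YYYY-MM-DD' parsed from a GTD eventid, or None if the date is incomplete.

    Assumption: the first 8 of the 12 digits are yyyymmdd.
    """
    # int() also handles ids that a spreadsheet reader returned as floats or strings.
    digits = str(int(eventid))
    if len(digits) != 12:
        raise ValueError(f"expected a 12-digit eventid, got {eventid!r}")
    year, month, day = int(digits[:4]), int(digits[4:6]), int(digits[6:8])
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Month or day of 00 (unknown) or otherwise not a valid calendar date.
        return None


print(eventid_to_iso_date(197001020003))
print(eventid_to_iso_date(197000000001))
```

Returning `None` for incomplete dates keeps those rows visible, so they can be reviewed rather than silently assigned a made-up day.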

Do take care to read the Terms of Use and instructions for citing the source of the GTD data.
