
data-512-homework_1: Professionalism & Reproducibility

This repository contains the assignment for DATA-512 - Human Centered Design, Homework 1.

Goal

The project aims to acquire, construct, analyze, and publish a dataset and analysis based on a subset of Wikipedia pages. The goal is to follow best practices for scientific research in both the code and the documentation, which means building a reproducible workflow as described in the articles "Assessing Reproducibility" and "The Basic Reproducible Workflow Template" from The Practice of Reproducible Research.

Datasource Information

The data for this project comes from the Pageviews API. The Pageviews API (documentation, endpoint) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through the previous complete month. Using the API, I collected pageview counts for a subset of Wikipedia article pages. The subset of the English Wikipedia used here, which covers a large number of dinosaur-related articles, is referenced below.
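As a rough illustration of how a per-article request to the Pageviews API is addressed, the sketch below builds a request URL. The helper name `build_pageviews_url`, the default date range, and the `en.wikipedia.org` project string are assumptions chosen for illustration; the notebook in this repository may construct its requests differently.

```python
from urllib.parse import quote

# Per-article route of the Wikimedia Pageviews REST API.
PAGEVIEWS_ENDPOINT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(article, access, start="2015070100", end="2022093000",
                        project="en.wikipedia.org", agent="user",
                        granularity="monthly"):
    """Return the request URL for one article and one access type.

    The default start/end values are assumptions matching the
    201507-202209 range in the output filenames.
    """
    # Titles use underscores instead of spaces and must be fully URL-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([PAGEVIEWS_ENDPOINT, project, access, agent, title,
                     granularity, start, end])


print(build_pageviews_url("Coelosaurus antiquus", "desktop"))
```

Encoding the title with `safe=""` matters for titles that contain a slash or other reserved characters, which would otherwise be read as extra path segments.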

Dataset Source and License

Relevant APIs

API Terms

Issues and Special Considerations

The code handles exceptions raised during data manipulation operations. If a URL or article title does not exist, the code does not break, so the workflow remains reproducible. At present everything runs without known issues and all the relevant information is collected.
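One way to keep a long acquisition run from stopping on a missing title is sketched below. This is a minimal sketch, not the notebook's actual code: `fetch_all`, `fetch_one`, and `fake_fetch` are hypothetical names, and the fetcher is a stand-in so the example runs without network access.

```python
def fetch_all(articles, fetch_one):
    """Collect records per article, skipping titles whose request fails."""
    records, failed = [], []
    for title in articles:
        try:
            records.extend(fetch_one(title))
        except Exception as err:  # e.g. an HTTP 404 for a missing title
            # Record the failure instead of aborting the whole run.
            failed.append((title, str(err)))
    return records, failed


# Hypothetical stand-in for a real API call: it fails on one title.
def fake_fetch(title):
    if title == "No_such_page":
        raise LookupError("404: article not found")
    return [{"article": title, "views": 1}]


records, failed = fetch_all(["Coelosaurus_antiquus", "No_such_page"], fake_fetch)
print(len(records), failed)
```

Keeping the list of failed titles makes it possible to report or retry them later rather than silently dropping them.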

Repository Structure

Here are the main files and folders in the data-512-homework_1 GitHub repository:

├── DATA512-Homework_1_RohitLokwani.ipynb
├── README.md
├── LICENSE
├── input_data
│   └── dinosaur_genera.cleaned.SEPT.2022 - dinosaur_genera.cleaned.SEPT.2022.csv.csv
├── intermediate_outputs
│   └── dinosaurs_data.csv
├── json_outputs
│   ├── dino_monthly_cumulative_201507-202209.json
│   ├── dino_monthly_desktop_201507-202209.json
│   └── dino_monthly_mobile_201507-202209.json
└── plotted_graphs
    ├── top_max_min_average_views.png
    ├── top_ten_peak_page_views.png
    └── articles_with_fewest_data_monthly.png

JSON Outputs

The following files are the output of the data acquisition step. Because acquisition is time-consuming, they are saved so that the remaining steps can be rerun from these files.

dino_monthly_desktop_201507-202209.json - JSON file that contains the page view data of all articles for the desktop access type.
dino_monthly_mobile_201507-202209.json - JSON file that contains the page view data of all articles for the mobile access type.
dino_monthly_cumulative_201507-202209.json - JSON file that contains the cumulative page views of all articles for both the desktop and mobile access types.
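The relationship between the three files can be sketched as follows, assuming that "cumulative" means the sum of desktop and mobile views for the same article and month. That reading is an assumption, as is the helper name `combine_views`; the view counts in the usage example are made up.

```python
from collections import defaultdict


def combine_views(desktop, mobile):
    """Sum desktop and mobile views per (article, timestamp) pair."""
    totals = defaultdict(int)
    for record in list(desktop) + list(mobile):
        totals[(record["article"], record["timestamp"])] += record["views"]
    # Emit one combined record per article and month, in sorted order.
    return [
        {"article": article, "timestamp": ts, "views": views}
        for (article, ts), views in sorted(totals.items())
    ]


# Illustrative records only; these are not values from the dataset.
desktop = [{"article": "A", "timestamp": 2015070100, "views": 3}]
mobile = [{"article": "A", "timestamp": 2015070100, "views": 4}]
print(combine_views(desktop, mobile))
```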

Data Fields

project - Source project of the data.
article - Title of the article.
granularity - Time period over which views are aggregated (monthly in these files).
timestamp - Start of the period the record covers, in YYYYMMDDHH format.
agent - Type of agent that generated the views.
views - Number of views between two consecutive timestamps at the specified granularity.
Note: The access type is indicated in each filename.

Sample

{ "project": "en.wikipedia", "article": "Coelosaurus_antiquus", "granularity": "monthly", "timestamp": 2015070100, "agent": "user", "views": 79 }
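A short sketch of reading a record like the one above: it parses the JSON and converts the `timestamp` field into a `datetime`. The variable names are illustrative only.

```python
import json
from datetime import datetime

# The sample record shown above, as a JSON string.
sample = ('{ "project": "en.wikipedia", "article": "Coelosaurus_antiquus", '
          '"granularity": "monthly", "timestamp": 2015070100, '
          '"agent": "user", "views": 79 }')

record = json.loads(sample)

# The timestamp is stored as YYYYMMDDHH; convert it to text before parsing.
month = datetime.strptime(str(record["timestamp"]), "%Y%m%d%H")

print(record["article"], month.strftime("%Y-%m"), record["views"])
```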

Languages used: Python

Other References

Rokem, Marwick, and Staneva. Assessing Reproducibility in Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.

Kitzes. The Basic Reproducible Workflow Template in Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.

