Giter VIP home page Giter VIP logo

data-science-study's Introduction

Data Science Study (CMSE428 EMU)

  • People create, recieve and generate data.

  • Data is collected by companies to influence our decisions by sending matching commercials

  • Intelligent algorithms can help you decide which movie to which, which book to read, which place to visit...

  • In biometric sensors, algorithms may help understand our emotions and support making better decisions. Help in health as well (detecting cancer cells and start of several diseases)

  • Big Tech companies apply Attension Marketing business model to collect data (they provide free information, services, intertainment and collect data about us)

  • Data should be processed to extract information from it.

  • Main Data Science Concepts:

    • Data Collection
    • Data Visualization
    • Data Exploration
    • Data Modeling
    • Models Evaluation
  • Programming languages used in data science: python & R

  • Data science can be defined as Engineering of Data.

  • Data Scientist : is capable of making decisions about what data to collect, and why. The need to be able to formulate questions and hypotheses then make a plan for how the problem will be attacked.

  • Components Fields of Data Science

    • Statistics
    • Algorithms
    • Scientific Methods
    • Machine Learning
    • Data Analysis
    • Models

Data Engineering

  • Steps of data engineering:

    • Ask question what data needs to be recorded or collected?
    • After row data is collected (emails, logs, medical records, surveys, blood drawn, ...)
    • Data Processing (Pipelines, web-scraping, cleaning, munging, joining, wrangling)
    • Finally data is cleaned and ready (clean, outliers, missing values, debugging, tables)
  • Data is never clean!

  • The exploratory data analysis combines visualization and data sense => crucial part. (scientist will find patterns, build models, algorithms to serve as prototypes)

To Implement a DataScience Project:

  • Discovery: first phase of data science

    • learn and ivestigate the problem
    • develop context and understanding
    • learn about data source needed and available for the project
    • team formulates initial hypotheses that can be tested later with data.
    • Identify main objective of project
    • Identify what needs to be achieved in business terms
    • Identify what must be done to meet needs
    • consider success creteria
    • Identify kinds of data team will need to solve problem, consider volume, type, time span of the data needed to test the hypothese.
    • in most cases, team will need the row data to avoid introducing bias(by domain experts) for the analysis.
    • Develop set of hypotheses
      • Form ideas that can be tested
      • Come up with few primary hypothese to test.
      • Hypothesis form the basis of the analytical tests the team will use in later phases and serve as foundation for findings to be shared
    • Make a list of candidate data sources team may need.
    • make inventory of datasets currently available.
    • Begin understanding the interdepndencies among data attributes and become familliar with content of data, its quality and limitations
    • Evaluate data structures and tools needed.
  • Data Preperation: includes steps to explore, preprocess data - team needs to create robust environment in which it can explore data that is separate from production environment. - > critical aspect of data science project is to become familiear with the data itself - Data Conditioning = cleaning data, normalizing datasets, performing transformations on data. - join data - merge data - Visualization to gain overview of data.

  • Model Planning: - team identifies candidate models to apply (clustering, classifying, finding relationships in data depending on goal of project) - the objective of data exploration in this phase is to understand the relationships among the variables to inform selection of variables and methods and to understand problem domain - better variable sets lead to better models. (use tools to perform data visualizations) - team main goal is to choose an analytical technique based on end goal of the project.

  • Model Building: training and testing the model - gain training data and test data. - model is fit on the training data - model is evaluated againest the test data - model planning and model building phases can overlap, in practice we can iterate between the two phases for a while. - Modeling tools: - R: strong statistical language - Octave: free software programming language for computational modeling, has some of MATLAB functionalities. - WEKA: free data mining software package (can be executed within java code. - Python: provides toolkits for ML and analytics such as scikit-learn, numpy, scipy, pandas, matplotlib..

  • Communicate Results:

    • team compares outcomes of modeling to the criteria
    • team must be rigorous enough to prove or disprove the hypotheses outlined in discovery phase.
    • determine if results are statistically significant and valid.
  • Operationalize:

    • team communicates benefits of project more broadly.
    • deploy work in a controlled way before broadening work to a full enterprise or ecosystem of users.
    • team learns about performance and related constraints of the model production environment on a small scale then make adjustments before a full deployment.
    • create mechanisim for performing ongoing monitoring of model accuracy, and if accuracy degrades, find a way to retrain model.

Case Study (Development of a Daibetes Detection App Based on non-invasive Biomarkers)

  1. Discovery

    • form a team involving diabetes experts.
    • know how long will it take to collect data? how to identify participants? which hospital?
    • know which measuring device will be needed(to measure weight, height..)
    • know where to store data
    • know who will be responsible for data collection
    • know what will be the main objective of the project
    • know what should be the main characteristics of the app?
    • know about the performance of the model in terms of True Positive Rate and False Positive Rate
    • know what will be the hypothesis?
  2. Data preperation

    • know if there are missing values?
    • normalization, transformation needed or not?
    • cleaning?
    • Labeling?
    • Visualization for deeper understaning?
    • Number of participants?
  3. Model planning

    • which model to use?
    • should we apply backward elimination to choose the best variable set?
    • are other methods of variable selection needed?
  4. Model building

    • split data to two parts: training and testing data
    • you can use 10-fold cross validation for this purpose
    • compute detection performance of the models developed
  5. Communicate results

    • implement the model as a mobile app
    • present the expected performance of the developed system in practice to the experts in domain
    • present the findings in terms of strengths and weaknesses
    • present the resultant system to the experts in the domain and explain how to use it
  6. Operationalize

    • set up a pilot project to evaluate the app
    • identify the characteristics of the people for whom the model fails

data-science-study's People

Contributors

alshubati99 avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.