Giter VIP home page Giter VIP logo

capstone2_sb's Introduction

Genomic Coverage Data Analysis Summary

Data Exploration and Visualization

The dataset comprised genomic coverage data, which typically represents the number of reads aligning to specific regions of the genome.

  • Descriptive Statistics: Calculations of mean, median, and variance provided insights into the central tendency and variability of the coverage data.
  • Histograms: Plots were generated to visualize the frequency distribution of coverage values, which is crucial for understanding the range and common patterns within the data.

Statistical Distribution Fitting

Statistical distributions were fitted to model the coverage data, which is essential for understanding the underlying biological processes and for further predictive modeling.

Poisson Distribution

  • The Poisson distribution fitting is appropriate for count data, assuming the mean and variance are equal. The fit was assessed visually against the empirical data.

Normal Distribution

  • The Normal distribution fitting was assessed due to the Central Limit Theorem, which would apply if the coverage data were the mean of many independent count processes.

Negative Binomial Distribution

  • The Negative Binomial distribution was considered because it allows for the mean and variance to be different, which is common in biological data where overdispersion (variance greater than the mean) is present.

Model Fit Evaluation Metrics

Kullback-Leibler (KL) Divergence

  • KL Divergence measures how one probability distribution diverges from a second, expected probability distribution. Lower values indicate the fitted distribution is closer to the true data distribution.
    • For Poisson: The value was extremely low, indicating a close fit, but this might be misleading if overdispersion is present.

Regression Analysis for Predicting Copy Number Variations

Negative Binomial Regression

  • A regression model was built to predict the copy number variation from the coverage data. This approach is suitable for count data and can handle overdispersion.

Model Metrics

  • Log-Likelihood: Represents the log of the likelihood function for the model, with higher values indicating a better fit.
  • Pseudo R-squared: Provides an estimation of the variance explained by the model, similar to the R-squared value in linear regression, but adapted for models that do not have a true R-squared measure. Values closer to 1 suggest a better fit.
  • Coefficients: Indicate the expected change in the response variable for a one-unit change in the predictor. A significant p-value (typically < 0.05) suggests that changes in the predictor are associated with changes in the response variable.

Predictive Performance Metric

  • Mean Squared Error (MSE): Represents the average squared difference between the observed actual outcomes and the outcomes predicted by the model. Lower MSE values indicate a model that predicts more accurately.

Summary

The analysis indicated that the Negative Binomial distribution was a suitable model for the genomic coverage data, accounting for the overdispersion often seen in such data. The resulting regression model was significant and provided a reasonable fit to the data, offering a potential tool for predicting copy number variations from new coverage data.

Recommendations for Further Research

  • Incorporate additional biological features relevant to coverage and copy number variation to potentially enhance model predictions.
  • Perform cross-validation to evaluate the model's predictive performance on independent data subsets, ensuring the robustness of the model.
  • Validate the model's predictions against an independent dataset or through a biological experiment to confirm the model's utility in practical applications.

capstone2_sb's People

Contributors

redndgreen8 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.