The dataset comprised genomic coverage data, which typically represents the number of reads aligning to specific regions of the genome.
- Descriptive Statistics: Calculations of mean, median, and variance provided insights into the central tendency and variability of the coverage data.
- Histograms: Plots were generated to visualize the frequency distribution of coverage values, which is crucial for understanding the range and common patterns within the data.
Statistical distributions were fitted to model the coverage data, which is essential for understanding the underlying biological processes and for further predictive modeling.
- The Poisson distribution fitting is appropriate for count data, assuming the mean and variance are equal. The fit was assessed visually against the empirical data.
- The Normal distribution fitting was assessed due to the Central Limit Theorem, which would apply if the coverage data were the mean of many independent count processes.
- The Negative Binomial distribution was considered because it allows for the mean and variance to be different, which is common in biological data where overdispersion (variance greater than the mean) is present.
- KL Divergence measures how one probability distribution diverges from a second, expected probability distribution. Lower values indicate the fitted distribution is closer to the true data distribution.
- For Poisson: The value was extremely low, indicating a close fit, but this might be misleading if overdispersion is present.
- A regression model was built to predict the copy number variation from the coverage data. This approach is suitable for count data and can handle overdispersion.
- Log-Likelihood: Represents the log of the likelihood function for the model, with higher values indicating a better fit.
- Pseudo R-squared: Provides an estimation of the variance explained by the model, similar to the R-squared value in linear regression, but adapted for models that do not have a true R-squared measure. Values closer to 1 suggest a better fit.
- Coefficients: Indicate the expected change in the response variable for a one-unit change in the predictor. A significant p-value (typically < 0.05) suggests that changes in the predictor are associated with changes in the response variable.
- Mean Squared Error (MSE): Represents the average squared difference between the observed actual outcomes and the outcomes predicted by the model. Lower MSE values indicate a model that predicts more accurately.
The analysis indicated that the Negative Binomial distribution was a suitable model for the genomic coverage data, accounting for the overdispersion often seen in such data. The resulting regression model was significant and provided a reasonable fit to the data, offering a potential tool for predicting copy number variations from new coverage data.
- Incorporate additional biological features relevant to coverage and copy number variation to potentially enhance model predictions.
- Perform cross-validation to evaluate the model's predictive performance on independent data subsets, ensuring the robustness of the model.
- Validate the model's predictions against an independent dataset or through a biological experiment to confirm the model's utility in practical applications.