
datar's Introduction

The PCA App

The PCA app performs a Principal Component Analysis (PCA) on the Golub dataset of leukemia patients. PCA is a tool for reducing the dimensionality of multidimensional data. In the input panel on the left, you can select the number of genes included in the analysis and the Principal Components to display. You can also change the maximum number of gene names shown in the Loading Plot; the genes displayed are those with the highest variance. In the main panel on the right, you can select the plot type and view some basic information about the analysis. For more information about the plot types, see the explanation below.

The Golub Data

The Golub data by Todd Golub cover the gene expression of 27 patients with acute lymphoblastic leukemia (ALL) and 11 patients with acute myeloid leukemia (AML). In total, 7129 genes were measured. The app performs a PCA to find differences in gene expression between these two groups of patients.

Installation

The PCA App can be used locally on your computer after downloading the source code. A full ZIP download of all resources is available via the green Clone or Download button in the top right corner.

These R packages need to be installed for the app to work:

To start the app, open your R console, set the working directory to the downloaded folder, and run the following code:

library(shiny)
runApp("app.R")

Testing was done with R 3.6.1 and RStudio 1.2.5001 on Windows 10.

What is Principal Component Analysis?

The gene expression of organisms is a very complex system. A change in conditions alters the expression of many genes, which may be positively or negatively correlated with one another. Looking at several genes at the same time therefore yields much more information. However, while the gene expression of a cell depends on thousands of genes (multidimensional data), at most three dimensions can be visualised at the same time.

Here we use the dimension reduction method Principal Component Analysis to reduce the number of dimensions of the data while keeping as much information as possible.

A three dimensional approach

To explain the idea of dimension reduction with PCA, we first visualise three-dimensional data in two dimensions and increase the number of dimensions afterwards. As a visual example, we look at randomly generated expression data of three genes x, y and z from 100 patients. We can visualise these data in three dimensions, where each point in the plot represents one patient.

But how does the expression differ? Can we divide the patients into groups? To answer these questions we can look at the data from different sides: imagine turning the plot in 3D space and taking pictures from different views. Which picture has the best angle for grouping the patients or revealing a difference in gene expression? Principal Component Analysis can compute this "view" of the data for us. Given the data, the method finds the axis through the three-dimensional data along which the variance is maximal; this axis is called "Principal Component 1 (PC1)".

Orthogonal to that axis, it constructs a second Principal Component that captures as much of the remaining variance as possible. With three-dimensional data we get at most three Principal Components, which together form a new orthogonal coordinate system for looking at the data. By construction, the data show the largest variance along PC1, the second largest along PC2, and so on. To get the "best" view of the data we therefore plot PC1 against PC2.
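The steps described so far can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data are simulated here (they are not read from PCA_test_data.txt), and the variable names and standard deviations are made up for the example rather than taken from PCA_example.R.

```r
# Simulated expression of three genes for 100 patients (illustrative values only).
set.seed(1)
n <- 100
expr <- data.frame(
  x = rnorm(n, mean = 0, sd = 3),
  y = rnorm(n, mean = 0, sd = 0.5),
  z = rnorm(n, mean = 0, sd = 1.5)
)

# prcomp() centres each gene; scale. = FALSE keeps the original units.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# pca$x holds the scores: the coordinates of each patient in the new PC system.
scores <- pca$x

# Score Plot: PC1 against PC2.
plot(scores[, "PC1"], scores[, "PC2"],
     xlab = "PC1", ylab = "PC2", main = "Score Plot")
```

The standard deviations in pca$sdev are returned in decreasing order, which reflects that PC1 carries the largest variance, PC2 the second largest, and so on.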

In this Score Plot we can see a projection of the original data onto the plane spanned by the first two Principal Components. We can now work with a plot of reduced dimension (only two) and can be sure that the new coordinate system covers as much of the variance as two dimensions can.

But what exactly do the Principal Components look like? As said before, in this three-dimensional case each Principal Component is a straight line through the data. Each Principal Component is therefore a linear combination of the original variables (here: genes).

In the PCA result, the coefficients of these linear combinations are called Loadings and can be visualised in a Loading Plot. The following Loading Plot shows the actual composition of the first two PCs.

As we can see, PC1 is made up mainly of the original gene x, while PC2 is mainly characterised by gene z. Gene y seems to explain little of the variance captured by the first Principal Components.
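In R, the Loadings returned by prcomp() are stored in the rotation element. The following sketch again uses simulated data (an assumption for illustration, not the data from PCA_test_data.txt) in which gene x varies most and gene y least, and draws a simple Loading Plot with base graphics.

```r
# Simulated data in which x has the largest and y the smallest spread.
set.seed(1)
n <- 100
expr <- data.frame(
  x = rnorm(n, sd = 3),
  y = rnorm(n, sd = 0.5),
  z = rnorm(n, sd = 1.5)
)
pca <- prcomp(expr)

# One column per PC; each column holds the coefficients of the original genes.
loadings <- pca$rotation

# The scores are the centred data multiplied by the loadings.
manual_scores <- scale(expr, center = TRUE, scale = FALSE) %*% loadings

# Loading Plot for PC1 and PC2: one arrow per gene.
plot(0, 0, type = "n", xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "PC1", ylab = "PC2", main = "Loading Plot")
arrows(0, 0, loadings[, "PC1"], loadings[, "PC2"], length = 0.1)
text(loadings[, "PC1"], loadings[, "PC2"], labels = rownames(loadings), pos = 3)
```

Note that the sign of a Principal Component is arbitrary, so an arrow may point in the opposite direction on another run or platform without changing the interpretation.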

Because we are working with three-dimensional data, we can still look at the original data. This time we rotate the plot to look in the x/z direction, as suggested by the PCA.

Looking back at the Score Plot, we can clearly see that this is nearly the view calculated by the PCA. The exact view includes a small rotation involving the y axis, which can be seen in the Loading Plot. In addition, this view still has a three-dimensional perspective, whereas the Score Plot is projected onto the plane described by the Principal Components.

Multi Dimensional Data

In our three-dimensional example the "best view" can clearly also be found by rotating the original data, but with multidimensional data this is impractical because the data cannot be visualised directly. The Principal Components can still be computed and visualised. However, the main part of the variance may be spread over more than two PCs. So how many PCs do we have to look at?

The following Scree Plot shows the variance for each Principal Component. It gives a visual indication of how many Principal Components we have to look at.
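The variance per Principal Component can be derived from the sdev element of a prcomp() result. The sketch below uses simulated ten-dimensional data, and the 80 % threshold is an arbitrary assumption chosen for illustration; it is not a rule used by the app.

```r
# Simulated data: 50 patients, 10 genes with decreasing spread.
set.seed(2)
expr <- sapply(seq(3, 0.3, length.out = 10), function(s) rnorm(50, sd = s))
pca <- prcomp(expr)

# Variance of each PC and its share of the total variance.
variance   <- pca$sdev^2
explained  <- variance / sum(variance)
cumulative <- cumsum(explained)

# Scree Plot as a bar chart of the explained proportions.
barplot(explained, names.arg = paste0("PC", seq_along(explained)),
        ylab = "Proportion of variance", main = "Scree Plot")

# Smallest number of PCs whose cumulative share reaches the assumed threshold.
n_pcs <- which(cumulative >= 0.8)[1]
```

Base R also provides screeplot(pca), which draws a similar plot directly from the prcomp object.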

The results of a PCA are often visualised with a Biplot, which combines the Score and Loading Plots shown above.

The source code used to generate the explanatory plots is available in the file PCA_example.R. The randomly generated three-dimensional data are also available for recalculation in the file PCA_test_data.txt.
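For a quick look, base R can draw a Biplot directly from a prcomp() result. The data in this sketch are simulated for illustration only, and the app itself may build its plots differently.

```r
# Simulated three-gene data for 100 patients (illustrative values only).
set.seed(4)
expr <- data.frame(
  x = rnorm(100, sd = 3),
  y = rnorm(100, sd = 0.5),
  z = rnorm(100, sd = 1.5)
)
pca <- prcomp(expr)

# Points show the patients (scores), arrows show the genes (loadings).
biplot(pca, choices = 1:2, main = "Biplot")
```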

Used Functions

  • data("Golub_Train"), exprs(Golub_Train): loading the data from the dataset
  • replace(): removing unwanted data
  • log2(): log2 transformation of the expression values
  • rownames(): changing the row names to the leukemia type of the patient
  • fluidPage(): UI definition
  • titlePanel(): title setup of the app
  • sidebarLayout(): sidebar UI
  • sidebarPanel(): initialising the input panel on the left side
  • mainPanel(): initialising the main panel with the plots
  • sliderInput(), numericInput(): getting user input
  • tabsetPanel(), tabPanel(): initialising the panel setup of the plots
  • recalcPCA(): recalculation of the PCA when the number of genes is changed
  • sort(): sorting the data
  • prcomp(): the Principal Component Analysis
  • server(), renderPlot(), renderText(): rendering dynamic outputs
  • plot(), points(), arrows(), text(): functions for creating the PCA plots
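How the data-handling functions in this list might fit together can be sketched as follows. This is not the app's recalcPCA(): the function name top_variance_pca, the threshold of 1 used with replace(), and the simulated matrix are all assumptions made for illustration.

```r
# Simulated expression matrix: genes in rows, patients in columns
# (27 ALL and 11 AML patients, as in the Golub data; the values are made up).
set.seed(3)
n_genes <- 200
labels <- rep(c("ALL", "AML"), times = c(27, 11))
expr <- matrix(rlnorm(n_genes * length(labels), meanlog = 6, sdlog = 1),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), labels))
expr[sample(length(expr), 50)] <- -5  # a few unusable, non-positive values

# Hypothetical helper: PCA on the n_top genes with the highest variance.
top_variance_pca <- function(expr, n_top) {
  expr <- replace(expr, expr < 1, 1)         # assumed floor so log2() is defined
  log_expr <- log2(expr)
  gene_var <- apply(log_expr, 1, var)        # variance of each gene
  top <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_top)]
  prcomp(t(log_expr[top, , drop = FALSE]))   # patients as rows, genes as columns
}

pca <- top_variance_pca(expr, n_top = 50)
```

Transposing the matrix before calling prcomp() matters: prcomp() treats rows as observations, so the patients must be in the rows for the scores to describe patients and the loadings to describe genes.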

The results

The Principal Component Analysis can reveal how exactly the patients vary in their gene expression. The Score Plot is a good starting point for seeing whether there is a clear difference. To understand this difference, the Loading Plot shows the genes that make up the view seen in the Score Plot. Strongly correlated genes appear as arrows lying very close together. The displayed gene names can be used to look up more information about the genes in databases. In the leukemia example, the PCA can reveal how exactly the two groups of patients vary in their gene expression and how different the two types are.

Further Information

For further information about Principal Component Analysis, please have a look at these resources:
