Using ADMIXTURE, a program customed for SNP datasets only, we aimed to define the genetic units present in two fish species: Sebastes faciatus and S. mentella.
ADMIXTURE estimates individual ancestries by efficiently computing maximum likelihood estimates in a parametric model. To better understand this program, read the manual page Alexander 2011.
Sebastes faciatus and S. mentella are curently fished in some regions of the Atlantic Canada while in othe regions the stocks are endangered. Using a wide dataset of SNPs markers may help us to better delineate the stock structure than with the use of microsatellites.
** Prepare a .bed file** from your vcf file. To do so, first use VCFTOOLS in the terminal.
vcftools --vcf nameofyourfile.vcf --plink-tped --out nameofyourfile
Then, use and PLINK.
plink --tped nameofyourfile.tped --tfam nameofyourfile.tfam --make-bed --out nameofyourfile
The commend --make-bed
will produce three files:
- a binary ped file (*.bed)
- the pedigree/phenotype information file (*.fam)
- an extended MAP file (*.bim) that contains information about the allele names, which would otherwise be lost in the .bed file
Run Admixture in bash. Go into the folder where your bed file is. Run admixture on your .bed file in your terminal by typing:
for K in echo $(seq 29) ; do admixture --cv=10 -B2000 -j8 nameofyourfile.bed $K | tee log${K}.out; done
Usually the maximum number of K - to test as a first step - is selected based on the number of sampling locations that you have. Here we had 28 sampling locations (n = 28), so we first tested a K max = n + 1 = 29.
Collect the cross validation information obtained from the all the log files.
grep -h CV log*.out>cross_validation.txt
done
Extract the right order for individual id from your vcf file, using the tfam file information. To do so, you need to use VCFTOOLS with the following command:
cut -f 1 nameofyourfile.tfam > id_admixture.txt
done
cut -f 1 nameofyourfile.tfam > id_admixture.txt
done
Using the R environment, first check the percent of error due to the number of genetic cluster inferred. Several methods used for defining the optimal number of clusters.
Keep in mind not over interpreting the number o clusters defined by this analysis and to complement ADMIXTURE approch with a Principal Component Analysis
or a Discriminant Principal Component Analysis
.
Indeed ADMIXTURE,which is similar to STRUCTURE, is based on model assumptions that do not always follow the biological reality of your dataset.
See [Lawson et al] (https://www.nature.com/articles/s41467-018-05257-7) for careful advices on how to interpret ADMIXTURE results.
In R, download libraries:
library(stringr)
library(ggplot2)
library(dplyr)
Download the cross-validation results you have previously created via bash command.
cv <- read.table("cross_validation.txt")
Analyze the cross-validation results Then, add a K-cluster column indicating the number of K you test and select only two columns of interest, CV and K.
cv$K <-gsub("[\\(\\)]", "", regmatches(cv$V3, gregexpr("\\(.*?\\)", cv$V3)))
CV <- select(cv, V4,K)
Rename your two columns CV and K-cluster
colnames(CV) <- c("CV","K")
Do a graph showing the cross validation results. Then select the optimal number of clusters regarding :
- the lowest cross validation error
- when the cross-validation error decrease the most
graph_title="Cross-Validation plot"
x_title="K"
y_title="Cross-validation error"
graph_1<-ggplot(CV,aes(x=K,y=CV))
graph_1+geom_line()+scale_x_continuous(breaks=c(1,2,3,4,5))+
labs(title=graph_title)+
labs(x=x_title)+
labs(y=y_title)+
theme(axis.text.x=element_text(colour="black"))+
theme(legend.title=element_blank())+
theme(axis.text.y=element_text(colour="black",size=12))+
theme(axis.text.x=element_text(colour="black",size=12))+
theme(panel.border = element_rect(colour="black", fill=NA, size=3),
axis.title=element_text(size=18,colour="black",family="Helvetica",face="bold"))
Save the graph
ggsave("Admixture_cross-validation.pdf",width=7,height=5,dpi=600)
dev.off()
In this section, the goal is to delineate how is distributed the percent of ancestry to each genetic cluster found, and this for each individual. Assignment of clusters can be sometimes trivial while any clear population structure patterns tend to emerge or even when the vas majority of the individuals tends to show less 50% of ancestry to one genetic cluster (suggesting admixed populations). One common mistake is to have an unbalanced number of samples per population. This brings out a high number of indivdiuals that tend to belong to this large population instead of other populations, as this population has
First download libraries:
library(reshape2)
library(plyr)
library(stringr)
library(ggplot2)
library(tidyverse)
library(RColorBrewer)
Read file.Q with the Q = the optimal K
admixture <- read.table("24603snps_860ind.2.Q")
Add one column with the individuals names and the population they belong to.
id <- read.table("860ind_sebastes.txt",header=TRUE)
admixture <- cbind(id,admixture)
Rename columns.
colnames(admixture) <- c("IND","POP","K1","K2")
Transform the admixture object into a long format.
admixture_long <- melt(admixture,id.vars=c("IND","POP"),variable.name="ANCESTRY",value.name="PERC")
names(admixture_long)
class(admixture_long$ANCESTRY)
levels(admixture_long$ANCESTRY)
Subset only the individuals showing more than 50% of ancestry with one genetic cluster
admixture_long_50 <- subset(admixture_long, subset=admixture_long$PERC>=0.50)
Give a palett color
col2 <- c('blue',"red")
Make a graph with ADMIXTURE results.
graph_title="Stacked barplot of Admixture analysis in species"
x_title="Individuals"
y_title="Ancestry"
graph_1<-ggplot(admixture_long_50,aes(x=POP,y=PERC,fill=ANCESTRY))
graph_1+geom_bar(stat="identity")+
scale_fill_manual(values=col2, name= "K", labels=c("K1","K2"))+
labs(y=y_title)+
labs(x=x_title)+
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour="grey", linetype="dashed"),
axis.title.x=element_text(size=14,family="Helvetica",face="bold"),
axis.text.x=element_text(size=6,family="Helvetica",face="bold", angle=90, hjust=0, vjust=0.5),
axis.title.y=element_text(size=14,family="Helvetica",face="bold"),
axis.text.y=element_text(size=14,family="Helvetica",face="bold"))
Save the graph.
ggsave("Sebastes_cercles.pdf",width=15,height=10,dpi=600,units="cm",useDingbats=F)
Check the percent of ancestry per sampling locations(POP) and per individuals (IND).
aggregate(admixture[, 2:6], list(admixture$POP), mean)
aggregate(admixture[, 2:6], list(admixture$IND), max)
Estimate at which each genetic group belong each indivdiual regarding its maximum % of ancestry.
admixture[, "max"] <- apply(admixture[, 2:6], 1, max)
summary(admixture)
Check the indivdiuals that could not be clearly attributed to one genetic cluster.
admixture_subset <- subset(admixture, subset=admixture$max >= 0.5)
Report how many individuals per cluster.
admixture_cluster <- select(admixture, K1,K2,K3,K4,K5)
admixture %>%
gather(POP,cnt, K1:K5) %>%
group_by(POP) %>%
slice(which.max(cnt))
Create a pop map regarding the cluster found.
admixture_subset$CLUSTER <- colnames(admixture_subset)[apply(admixture_subset,1,which.max)]
admixture_results <- select(admixture_subset, IND, CLUSTER)
Save the files.
write.table(admixture_results, 'Admixture_results_K2.txt',quote=FALSE, row.names=FALSE, sep="\t", dec=".")
table(admixture_subset$CLUSTER, admixture_subset$POP)
Calculate the number of individuals with ancestry bellow 50%.
quantile(admixture$max)
group_unknown <- subset(admixture, subset=admixture$max<0.5)
group_unknown$POP <- substr(group_unknown$IND,1,5)
table(group_unknown$POP)
Install the dependency packages and libraries required.
install.packages(c("Cairo","devtools","ggplot2","gridExtra","gtable","tidyr"),dependencies=T)
library(pophelper)
Check pophelper
version.
packageDescription("pophelper", fields="Version")
Install the current version of pophelper
.
devtools::install_github('royfrancis/pophelper', force=TRUE)
Load population map.
popmap = read.delim("444ind_admixture.txt",header=FALSE,stringsAsFactors=F)
pop_order = sort(unique(popmap$V1))
Load admixture files.
sfiles <- list.files(pattern = "*.Q", full.names=T)
Import the .Q files.
slist <- readQ(files=sfiles, filetype = "basic")
Qlist attributes
attributes(slist)
Import labels for ADMIXTURE runs.
labset <- read.table("860ind_pop.txt",header=TRUE,stringsAsFactors=F)
Verify that the length of labels is equal to number of individuals.
nrow(labset)
Check if labels are a character data type.
sapply(labset, is.character)
class(labset)
Give a palett color.
col2 <- c('blue',"red")
Create a qplot for K = 2 considering two species.
slist1 <- alignK(slist[1])
plotQ(slist1, clustercol= col2,
,showsp=FALSE,grplab = labset,ordergrp=T,imgtype="pdf",
showlegend=T, legendpos="right", legendkeysize = 6, legendtextsize = 6)