biostars / biostar-handbook
Issue tracker for the Biostar Handbook
A reader reports that links in the
https://read.biostarhandbook.com/ontology/gene-ontology.html
chapter are non-functional.
In slide #15 of lecture 14, there is code like:
update_blastdb.pl --showall | head
16SMicrobial
cdd_delta
env_nr
When I ran update_blastdb.pl, this showed up:
update_blastdb.pl --showall | head
Connected to NCBI
update_blastdb.pl --decompress 16SMicrobial
Connected to NCBI
16SMicrobial not found, skipping.
Nothing like 16SMicrobial appears after the phrase "Connected to NCBI". (I'm using Bash on Ubuntu on Windows.)
Should I get the data first? I downloaded "16SMicrobial.tar.gz" with
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz
I couldn't find out how to get or store the data. Could you explain the detailed procedure?
Thank you in advance.
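For what it's worth, one manual route that may work is to unpack the archive you already fetched and point BLAST at it. This is a sketch: the ~/blastdb location and the use of the BLASTDB environment variable are my assumptions, not the lecture's exact instructions.

```shell
# Unpack the downloaded archive into a database directory and tell
# BLAST where to look via the BLASTDB environment variable:
mkdir -p ~/blastdb
tar -xzvf 16SMicrobial.tar.gz -C ~/blastdb
export BLASTDB=~/blastdb
# After this, e.g. `blastn -db 16SMicrobial -query query.fa` should
# be able to resolve the database by name.
```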
How can I uncompress a compressed src file?
As mentioned in:
I recommend including GeneSCF as a tool (http://genescf.kandurilab.org). It's an excellent gene ontology tool that is easy to use and works better than DAVID, as you can see here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1250-z
I have personally used this tool in my analysis and got much better results than other GO tools.
Keep up the good work!
I generated lecture11.sh based on page 10 of the lecture 11 PPT slides and executed "bash lecture11.sh".
However, I got the errors below. I think the "wget" in lecture11.sh didn't work, even though wget works fine when executed directly. I am using Windows 10 bash. Are there any hints?
~/unix$ bash lecture11.sh
lecture11.sh: line 2: $'\r': command not found
lecture11.sh: line 5: $'\r': command not found
--2017-09-26 14:30:56-- http://data.biostarhandbook.com/data/sequencing-platform-data.tar.gz%0D
Resolving data.biostarhandbook.com (data.biostarhandbook.com)... 198.74.58.207
Connecting to data.biostarhandbook.com (data.biostarhandbook.com)|198.74.58.207|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-09-26 14:30:57 ERROR 404: Not Found.
lecture11.sh: line 8: $'\r': command not found
tar (child): sequencing-platform-data.tar.gz\r: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
lecture11.sh: line 11: $'\r': command not found
' which didn't exist, or couldn't be read
lecture11.sh: line 14: $'\r': command not found
TrimmomaticSE: Started with arguments:
illumina.fq better.fq SLIDINGDOWN:4:30
Automatically using 4 threads
Exception in thread "main" java.lang.RuntimeException: Unknown trimmer: SLIDINGDOWN
at org.usadellab.trimmomatic.trim.TrimmerFactory.makeTrimmer(TrimmerFactory.java:70)
at org.usadellab.trimmomatic.Trimmomatic.createTrimmers(Trimmomatic.java:59)
at org.usadellab.trimmomatic.TrimmomaticSE.run(TrimmomaticSE.java:303)
at org.usadellab.trimmomatic.Trimmomatic.main(Trimmomatic.java:85)
lecture11.sh: line 17: $'\r': command not found
' which didn't exist, or couldn't be read
lecture11.sh: line 20: $'\r': command not found
<lecture11.sh>
# Comments start with the # sign.
# Work at the command line.
# If it works, copy it into this file.
# Get the example dataset for lecture 11.
wget http://data.biostarhandbook.com/data/sequencing-platform-data.tar.gz
# The file unpacks into 4 sequencing datasets.
tar zxvf sequencing-platform-data.tar.gz
# Quality plots before trimming.
fastqc illumina.fq
# Trim back by quality.
trimmomatic SE illumina.fq better.fq SLIDINGDOWN:4:30
# Quality plots after trimming.
fastqc better.fq
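A likely diagnosis (my reading of the log above, not a confirmed fix): the `$'\r': command not found` messages mean lecture11.sh was saved with Windows (CRLF) line endings, which also corrupts the wget URL (note the trailing %0D) and the tar filename. Separately, Trimmomatic's sliding-window trimmer is spelled SLIDINGWINDOW, not SLIDINGDOWN, which explains the "Unknown trimmer" exception. A sketch of the line-ending fix:

```shell
# The $'\r' errors come from Windows (CRLF) line endings in the script;
# stripping the carriage returns produces a file bash can run:
tr -d '\r' < lecture11.sh > lecture11-unix.sh
bash lecture11-unix.sh
```

Tools like dos2unix do the same job; the tr form works anywhere without installing anything.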
After I downloaded the perfect_coverage.py file, I ran the following command:
$ cat refs/AF086833.fa | python perfect_coverage.py
it complains:
Traceback (most recent call last):
File "perfect_coverage.py", line 55, in
read1, read2 = file('R1.fq', 'wt'), open('R2.fq', 'wt')
NameError: name 'file' is not defined
(bioinfo)
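A likely fix, assuming the rest of the script is Python 3 compatible: the builtin file() existed only in Python 2, and in Python 3 both handles should use open(). Line 55 of the script would become:

```python
# Python 3 removed the builtin file(); open() is the replacement:
read1, read2 = open('R1.fq', 'wt'), open('R2.fq', 'wt')
```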
Hello,
I am reading lecture 1 and got to running the doctor.py script (Windows 10), which doesn't work :/
Here is what my terminal shows:
"bash: /root/bin/doctor.py: /usr/bin/python: bad interpreter: No such file or directory"
So I did: "ls" which gave "Miniconda3-latest-Linux-x86_64.sh bash_profile bin miniconda3"
When I go into bin ("cd bin" then "ls"), it shows: "doctor.py"
...
Any idea what could be happening with the doctor script?
Many thanks in advance!
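A possible workaround, based on my reading of the "bad interpreter" message (which says /usr/bin/python is absent on this system): invoke the script through whichever Python is on the PATH instead of relying on its shebang line.

```shell
# The shebang in doctor.py points at /usr/bin/python, which does not
# exist here; calling an interpreter explicitly sidesteps the shebang:
python3 ~/bin/doctor.py   # or `python`, whichever conda provides
```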
I think it is appropriate, since many scientists still use microarrays (often for reasons of cost), to have sections on microarray experiments and analysis. There is a lot of lively discussion on technologies, normalization techniques, etc. that I think is important for bioinformaticians to know.
Additionally, have you considered adding any information about using R? I noticed that your BioMart section is a TODO, and R has a good package to handle this.
With that said, thanks for the amazing book. I'm enjoying it.
Do the GO annotations change?
The latest GO data download produces a very different report than the first list. Notably, HLA genes are no longer in the top ten. This reinforces the ephemeral nature of annotations: as new information arrives or different filtering parameters are applied, the information content shifts drastically. Why did the HLA genes disappear? Was that an error? Probably... but what does it mean for the publications that were based on research completed during the time the error was present? Was anyone ever publicly notified about the error? We discovered this discrepancy by accident. This demonstrates that not everything is going well in the GO world.
The following is a description of a discovery process that may not even apply anymore, but it is educational to learn about.
As we can see, the most annotated protein is P04637, a cellular tumor antigen p53, which acts as a tumor suppressor in many tumor types. It corresponds to the gene TP53.
The most annotated gene was HLA-B. As it happens, this is a gene that we were unable to find right away when searching for it on the official gene name nomenclature (HGNC).
It is one of those &$!*&@!@ moments that every bioinformatician will have to learn to deal with:
The most annotated gene in the GO database seemingly cannot be found when searching for it in HGNC, the repository that is supposed to be the authoritative resource for the domain!
The cause is both simple and infuriating: the search for HLA-B finds over a thousand entries related to HLA-B but the gene that is actually named HLA-B turns up only on the second page of hits, buried among entries distantly related to it.
Situations like this, where seemingly mundane tasks suddenly require problem solving, are very common in bioinformatics. Keep that in mind the next time something completely surreal seems to happen.
An alternative and much better maintained resource, GeneCards, allows you to locate the gene by name right away.
How complete is the GO?
We may wonder: how complete are these GO annotations? Would it be possible to estimate what percent of all functional annotations have been discovered so far?
While that question is too abstract for our liking, we could tabulate the dates assigned to each piece of evidence and see how many pieces of evidence are produced per year.
It is "reasonably simple" to extract and tabulate the date information from the two files that we downloaded to get the rate of information growth in the GO:
cat assoc.txt |
    cut -f 14 |
    awk '{ print substr($1, 1, 4) }' |
    sort |
    uniq -c |
    sort -k 2,2 -n |
    awk '{ print $2 "\t" $1 }'
This produces the number of annotations verified in each year:
...
2008 9682
2009 13322
2010 15026
2011 23490
2012 16428
2013 60555
2014 34925
2015 33096
2016 235077
This result is quite surprising, and we don't quite know what to make of it. It seems that most evidence is assigned to the latest year, 2016. Or maybe there is some quirk in the system that assigns the latest date to every piece of evidence that is re-observed? We don't know.
There is also a surprising dip in the number of annotations in 2012, followed by an increase in 2013 that cannot be easily explained.
After following the instructions to set up my computer and running the doctor script, I get the following 2 errors:
Thanks for any advice.
I'm working from the Biostar Handbook and trying to do the alignment with tophat. I'm on page 480 for reference.
I was able to successfully run
$ bowtie2-build $REF $IDX
However, then it says that I need to change the invocation of the aligner to
$ tophat -G $GTF -o tophat_hbr1 $IDX $R1 $R2
bash: /usr/bin/tophat: No such file or directory
(bioinfo)
And that's my output. If I do which tophat, nothing comes up, so it seems like it's just not there. So, I tried to install tophat with
$ conda install tophat
Fetching package metadata .................
Solving package specifications: .
UnsatisfiableError: The following specifications were found to be in conflict:
- python 3.6*
- tophat -> python 2.7*
Use "conda info <package>" to see the dependencies for each package.
(bioinfo)
I saw some solutions online for changing a line of code in tophat to do this, but I don't even know where that file is. I can downgrade python, but I'm not sure if that will work, and I'm worried that I'll break other things if I do. Also, when I do
$ ls /usr/bin | grep python
dh_python2
dh_python3
python
python-config
python2
python2-config
python2.7
python2.7-config
python3
python3.4
python3.4m
python3m
x86_64-linux-gnu-python-config
x86_64-linux-gnu-python2.7-config
(bioinfo)
moltres@moltres-ao ~/biostar/Sequencing/griffith
$ python -V
Python 3.6.3
(bioinfo)
I see a bunch of different python versions, and I'm not sure if I should switch between them or what.
It doesn't even say to install tophat anywhere in this section. I'm pretty lost here.
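One workaround worth trying, offered as an assumption on my part rather than something this section of the book prescribes: since bioconda's tophat package is built against Python 2.7, give TopHat its own conda environment so it does not conflict with the Python 3.6 bioinfo environment.

```shell
# Create a dedicated Python 2.7 environment for tophat (the environment
# name "tophat2" is illustrative), leaving bioinfo untouched:
conda create -y --name tophat2 python=2.7 tophat
source activate tophat2
tophat --version
```

There is no need to downgrade Python inside bioinfo; switch environments only when you need TopHat.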
Hi,
I installed the latest BLAST according to the instructions for Linux (https://www.biostarhandbook.com/tools/align/blast.html)
After the successful installation, I could see the latest BLAST (blastn 2.6.0), and it worked well (e.g., update_blastdb.pl).
which blastn
/home/flyark/src/ncbi-blast-2.6.0+/bin/blastn
blastn -version
blastn: 2.6.0+
Package: blast 2.6.0, build Dec 7 2016 14:50:34
However, the blastn version reverted to 2.2.31 when I restarted bash on Windows.
which blastn
/usr/bin/blastn
blastn -version
blastn: 2.2.31+
Package: blast 2.2.31, build Jan 7 2016 23:17:17
I think the old BLAST, which had been installed automatically according to the introduction, masks the new BLAST when bash is restarted.
Could you tell me how to completely replace the old BLAST with the new one?
p.s. the present ncbi-blast version is 2.6.0.
The 2.5.0 on the installation page needs to be updated to 2.6.0.
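A sketch of the PATH-based fix (the install path is taken from the `which blastn` output above): directories earlier in PATH win, so putting the 2.6.0 bin directory first makes it shadow /usr/bin/blastn.

```shell
# Prepend the new BLAST's bin directory so it is found before /usr/bin:
export PATH="$HOME/src/ncbi-blast-2.6.0+/bin:$PATH"
# Add the same line to ~/.bashrc so the change survives restarting bash.
```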
Looks like the 'How do I use Homebrew?' section for setting up a MacOS computer may need updating:
To use brew for bioinformatics you will need to "tap" the "science formulas":
brew tap homebrew/science
The above command gives an error:
Error: homebrew/science was deprecated. This tap is now empty as all its formulae were migrated.
The last command in the chapter "BLAST use cases", shown below,
efetch -db nucleotide -id AKC37152 -format fasta > AKC37152.fa
should be "protein" instead of "nucleotide".
Thanks.
I am trying to create a conda environment called bioinfo, but every time I go to run the command...
conda create -y --name bioinfo python=3.7
... I get the following notice:
Solving environment: failed
CondaHTTPError: HTTP 404 NOT FOUND for url <https://conda.anaconda.org/r/noarch/repodata.json>
Elapsed: 00:00.038179
CF-RAY: 45cf07930a8b2585-ORD
The remote server could not find the noarch directory for the
requested channel with url: https://conda.anaconda.org/r
As of conda 4.3, a valid channel must contain a `noarch/repodata.json` and
associated `noarch/repodata.json.bz2` file, even if `noarch/repodata.json` is
empty. please request that the channel administrator create
`noarch/repodata.json` and associated `noarch/repodata.json.bz2` files.
$ mkdir noarch
$ echo '{}' > noarch/repodata.json
$ bzip2 -k noarch/repodata.json
You will need to adjust your conda configuration to proceed.
Use `conda config --show channels` to view your configuration's current state.
Further configuration help can be found at <https://conda.io/docs/config.html>.
Any suggestions?
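One remedy worth trying, offered as a guess based on the error rather than an official instruction: the 404 comes from the https://conda.anaconda.org/r channel, which no longer serves repodata, so dropping it from the channel list lets the solver proceed.

```shell
# Remove the defunct "r" channel from the conda configuration, then
# confirm what remains:
conda config --remove channels r
conda config --show channels
```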
conda install -y bwa
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
Current channels:
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
I'm trying to do the Zika RNA-Seq automated alignments, but I noticed that when I got to the differentially expressed genes analysis, all my numbers were off. I'm trying to trace back what might be wrong, and I suspect that it's something with my bam files produced by the automated script on page 496. This is the script I ran using bash:
set -ueo pipefail
mkdir -p bam
CPUS=4
IDX=refs/grch38/genome
RUNLOG=runlog.txt
echo "Run started by `whoami` on `date`" > $RUNLOG
for SAMPLE in $(cat paired_ids.txt)
do
R1=reads/${SAMPLE}_1.fastq
R2=reads/${SAMPLE}_2.fastq
BAM=bam/${SAMPLE}.bam
SUMMARY=bam/${SAMPLE}_summary.txt
echo "Running HISAT2 on paired end $SAMPLE"
hisat2 -p $CPUS -x $IDX -1 $R1 -2 $R2 | samtools sort > $BAM 2> $RUNLOG
samtools index $BAM
done
for SAMPLE in $(cat single_ids.txt)
do
R1=reads/${SAMPLE}.fastq
BAM=bam/${SAMPLE}.bam
SUMMARY=bam/${SAMPLE}_summary.txt
echo "Running Hisat2 on single end: $SAMPLE"
hisat2 -p $CPUS -x $IDX -U $R1 | samtools sort > $BAM 2> $RUNLOG
samtools index $BAM
done
However, my runlog.txt file is completely empty, and I don't get a bam/${SAMPLE}_summary.txt file.
I also don't understand what IDX=refs/grch38/genome is referencing, since I don't have a file by precisely that name.
Is this a glitch, or am I doing something wrong here? Thanks.
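Two observations that may explain the symptoms (assumptions on my part, not the book's errata). First, in `hisat2 ... | samtools sort > $BAM 2> $RUNLOG` the `2>` applies to samtools sort, not hisat2, so hisat2's alignment summary never reaches the log, and `>` would truncate it on every iteration anyway. Second, IDX=refs/grch38/genome is the prefix of the HISAT2 index files (genome.1.ht2, genome.2.ht2, ...), not a file that exists under that exact name. A sketch of the redirection fix:

```shell
# Variables ($CPUS, $IDX, $R1, $R2, $BAM, $RUNLOG) come from the script
# above. Redirect hisat2's own stderr, and append (>>) so each sample's
# summary is kept rather than overwritten:
hisat2 -p "$CPUS" -x "$IDX" -1 "$R1" -2 "$R2" 2>> "$RUNLOG" \
    | samtools sort > "$BAM"
samtools index "$BAM"
```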
In the last echo, "*** Calling variants from all runs: samples.vcf", there is no samples.vcf, only combined.vcf. I hope this is fixed in the next update.
Hi, I'm working through the environment setup stage.
I entered the bioinfo environment and executed curl http://data.biostarhandbook.com/install/conda.txt | xargs conda install -y
to automatically install all the tools needed. However, I got stuck downloading openjdk and mkl: the speed was too slow and the connection kept getting lost.
I tried to manually install openjdk 8 by executing sudo apt-get install openjdk-8-jre
, but I was informed that:
openjdk-8-jre is already the newest version (8u162-b12-0ubuntu0.16.04.2).
openjdk-8-jre set to manually installed.
Is there any workaround?
Hi there,
I am very new to bioinformatics and to this book.
I am trying to set-up my Mac as per the instructions in the book.
When I ran the command:
curl http://data.biostarhandbook.com/install/conda.txt | xargs conda install -y
I get an error saying:
"xargs: conda: No such file or directory".
Can anyone tell me why this is coming up?
Cheers, Zak
Hi,
I was using Bash on Ubuntu on Windows, which worked well. At that time, the Ubuntu version was 15.
Recently, I reformatted and reinstalled Windows as well as Bash on Ubuntu on Windows (version 16).
I tried to install doctor.py but experienced the following error.
mkdir -p ~/bin
curl http://data.biostarhandbook.com/install/doctor.py > ~/bin/doctor.py
chmod +x ~/bin/doctor.py
flyark:~$ ~/bin/doctor.py
bash: /home/flyark/bin/doctor.py: /usr/bin/python: bad interpreter: No such file or directory
In addition, I installed wonderdump, which showed the following error.
mkdir -p ~/bin
curl http://data.biostarhandbook.com/scripts/wonderdump.sh > ~/bin/wonderdump
chmod +x ~/bin/wonderdump
flyark:~$ ~/bin/wonderdump
/home/flyark/bin/wonderdump: line 19: $1: unbound variable
Could you help me fix these problems?
I didn't experience them when I used the previous version of Bash on Ubuntu on Windows.
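On the wonderdump side, the "line 19: $1: unbound variable" error simply means the script was invoked without the argument it expects as $1. A sketch of a correct invocation, where the SRA accession is only an illustrative example:

```shell
# wonderdump expects an accession as its first argument; running it
# bare trips `set -u` inside the script. SRR1553607 is an example
# accession used elsewhere in the handbook:
~/bin/wonderdump SRR1553607
```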
As reported in:
https://www.biostars.org/p/225812/#242326
In the section How to visualize genomic variation (What would realistic and good data look like?) it says that the script simulate-experimental-data.sh will generate a file called results.bam. It actually generates a file called align.bam.
In the section Variant effect prediction (How do I use snpEff?) the link http://data.biostarhandbook.com/variant/find-ebola-variants.sh results in a file not found error. Please check, thanks.
Hello everyone,
I would like to access the files from both Linux and Windows Explorer, following the instructions from "How do I set up the filesystem with Ubuntu on Windows?" in the Bash on Ubuntu terminal.
So I typed:
mkdir -p '/mnt/c/Users/Lpain/Desktop/unix'
then
ln -s '/mnt/c/Users/Lpain/Desktop/unix' ~/unix
I have a folder named "unix" inside the Linux terminal, but it is not visible on my Desktop under Windows.
Does anybody have an idea or suggestion for how I can fix this?
Many thanks in advance
Hi, I'm running bash on windows 10 and I followed the instructions to download miniconda. Then when I close the terminal, open it back up and type conda it says:
conda: command not found
I have a suspicion it's not located in the right place, but I'm not sure where it should be located.
Thanks,
Gabe
Conda fails with:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/biostar/linux-64/repodata.json
Elapsed: -
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
SSLError(MaxRetryError('HTTPSConnectionPool(host=conda.anaconda.org, port=443): Max retries exceeded with url: /bioconda/linux-64/repodata.json (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))'))
This question is about the whole genome classification section, specifically the subsection called "What are the expected abundances of the data?".
It shows a portion of the XLS file and says that
Psychrobacter cryohalolentis K5 corresponds to accession numbers NC_007969 and will...
Should "numbers" be singular?
Then there is code that searches for accession numbers NC_007969 and NC_007968. Where did the second number come from? It doesn't seem to appear in the XLS file.
First and foremost, thank you for putting together this wonderful guide. I am a complete novice when it comes to programming/bioinformatics and this guide has really helped me learn a lot. I'm very close to being able to start to run some analyses on my own.
My suggestion pertains to the section RNA-Seq: Griffith Test Data -> Analyzing the control samples -> Did our RNA-Seq analysis reproduce the expected outcomes?
The main issue I have is with the final command
paste table1 table2 > compare.txt
If opened with "less compare.txt", it shows:
ERCC-00002 0.5 -1^M ERCC-00002 0.587081192626208 -0.768368054756608
ERCC-00003 0.5 -1^M ERCC-00003 0.680961534372993 -0.554354788186953
ERCC-00004 4 2^M ERCC-00004 5.88286106939268 2.55651796573194
ERCC-00009 1 0^M ERCC-00009 1.21972406601959 0.286554808762869
ERCC-00012 0.67 -0.58^M ERCC-00012 Inf Inf
ERCC-00013 0.5 -1^M ERCC-00013 Inf Inf
ERCC-00014 0.5 -1^M ERCC-00014 0.487600858570454 -1.0362274286211
ERCC-00016 0.67 -0.58^M ERCC-00016 NA NA
ERCC-00017 4 2^M ERCC-00017 Inf Inf
ERCC-00019 4 2^M ERCC-00019 2.7835815473997 1.4769423488067
If this is opened in Excel, it shows a very ugly output.
The main issue underlying this is the ^M generated by the paste command. Now, I have no idea why the paste command generated this ^M, and I have verified that none of the previous files contain '^M'.
$ head ERCC-datasheet.csv
ERCC ID,subgroup,concentration in Mix 1 (attomoles/ul),concentration in Mix 2 (attomoles/ul),expected fold-change ratio,log2(Mix 1/Mix 2)
ERCC-00130,A,30000,7500,4,2
ERCC-00004,A,7500,1875,4,2
ERCC-00136,A,1875,468.75,4,2
ERCC-00108,A,937.5,234.375,4,2
ERCC-00116,A,468.75,117.1875,4,2
ERCC-00092,A,234.375,58.59375,4,2
ERCC-00095,A,117.1875,29.296875,4,2
ERCC-00131,A,117.1875,29.296875,4,2
ERCC-00062,A,58.59375,14.6484375,4,2
$ head results.txt
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
ERCC-00130 29681.8244237545 10455.9218232761 48907.7270242329 4.67751460376822 2.22574215774208 1.16729711209905e-88 9.10491747437256e-87
ERCC-00108 808.597670575459 264.877838024487 1352.31750312643 5.10543846632202 2.35203486825767 2.40956154792488e-62 9.39729003690704e-61
ERCC-00136 1898.3382995277 615.744918976546 3180.93168007886 5.16598932779828 2.36904466305553 2.80841619396485e-58 7.3018821043086e-57
ERCC-00116 952.57953992746 337.704944218003 1567.45413563692 4.64149004174798 2.21458802318734 1.72224091670519e-45 3.35836978757511e-44
ERCC-00092 310.791194556933 96.697066636053 524.885322477813 5.42814110849266 2.44045822515553 2.44705874688655e-40 3.81741164514302e-39
ERCC-00004 3918.98719921685 1138.76690513024 6699.20749330347 5.88286106939268 2.55651796573194 8.29322966465066e-38 1.07811985640459e-36
ERCC-00095 141.487857460492 52.8817320556433 230.093982865341 4.35110526680992 2.12138192059427 2.46928414480309e-19 2.7514880470663e-18
ERCC-00062 77.4886630526754 22.8781484767386 132.099177628612 5.7740327091121 2.52957928026876 7.80930857468792e-17 7.61407586032072e-16
ERCC-00131 134.742367710255 55.9465732085198 213.53816221199 3.81682290738535 1.93237225006829 4.75964190578523e-16 4.12502298501386e-15
$ head table1
ERCC-00002 0.5 -1
ERCC-00003 0.5 -1
ERCC-00004 4 2
ERCC-00009 1 0
ERCC-00012 0.67 -0.58
ERCC-00013 0.5 -1
ERCC-00014 0.5 -1
ERCC-00016 0.67 -0.58
ERCC-00017 4 2
ERCC-00019 4 2
$ head table2
ERCC-00002 0.587081192626208 -0.768368054756608
ERCC-00003 0.680961534372993 -0.554354788186953
ERCC-00004 5.88286106939268 2.55651796573194
ERCC-00009 1.21972406601959 0.286554808762869
ERCC-00012 Inf Inf
ERCC-00013 Inf Inf
ERCC-00014 0.487600858570454 -1.0362274286211
ERCC-00016 NA NA
ERCC-00017 Inf Inf
ERCC-00019 2.7835815473997 1.4769423488067
Thus, for some bizarre reason, ^M is being added by the paste command, which completely breaks downstream data analysis.
Of course, it is easy to remove ^M with a couple of commands and fix the file. However, for folks like myself who follow the guide very closely, it is not clear what is going on. The guide simply glosses over the intermediate steps used to analyze the data.
At the very least, it should be mentioned that additional steps were taken to organize the data as shown in the final data table. Otherwise, it is extremely frustrating to figure out whether something went wrong in the intermediate steps. Ideally, the output of compare.txt should be shown before the data table so the user is reassured that they did the steps correctly.
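For readers hitting the same wall, a sketch of a cleanup step (my interpretation of the symptom, not the guide's official procedure): the ^M is a carriage return most likely already present in table1, inherited from the CRLF-terminated ERCC CSV, rather than something paste adds. Stripping it before pasting yields a clean file.

```shell
# Remove carriage returns from table1, then paste as before:
tr -d '\r' < table1 > table1.unix
paste table1.unix table2 > compare.txt
```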
Thanks!
What are nucleotides?
Nucleotides are the building blocks of nucleic acids (DNA and RNA; we'll get to RNA later on). In DNA, there are four types of nucleotide: Adenine, Cytosine, Guanine, and Thymine. Because the order in which they occur encodes the information biologists try to understand, we refer to them by their first letters: A, C, G, and T, respectively.
A Adenine
G Guanine
C Cytosine
T Thymine
It's distracting to have the ordering be alphabetical twice in the text and then grouped by purine/pyrimidine in the list format.
Hello
I am having an extremely difficult time getting the efetch/esearch commands to work.
As per a previous discussion, I tried to uninstall perlbrew, but I couldn't find an option to install Perl without perlbrew. Any link/command line for installing Perl without perlbrew?
This has become a huge issue because I simply cannot move forward, as this module is used often.
Thank you
[Hi. I used to have a fork to the original version of the handbook that I would read and then send suggestions to ialbert for changes using a pull-request. After I finished the first version I wasn't using the handbook anymore. Now there are additional topics I'm learning about in the newer version. Since I don't have a fork, I just cut and pasted the following sections and added the possible changes to them. Unfortunately it is hard to see where I made changes, so if possible I used bold around the last word I left before cutting text (this doesn't really work at the beginning of sentences so I just put ** **), and I used italics for any suggested rewording. Best, Paige]
Are there different ways to compute ORA analyses?
If you followed the section on Gene Ontology, you know that the GO files have a relatively simple format. Moreover, the data structure is that of a tree (network). Building a program that assigns counts to each node and can traverse the tree is a standard problem that requires only moderately advanced programming skill.
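Since that counting-and-traversal task is the heart of every ORA tool, here is a minimal sketch of it (toy data and my own illustration, not the book's code): propagate each direct annotation count up a GO-like tree so every ancestor's total includes its descendants'.

```python
from collections import defaultdict

# Hypothetical toy ontology: child -> parent edges, plus direct
# annotation counts per term.
parent = {"GO:b": "GO:a", "GO:c": "GO:a", "GO:d": "GO:b"}
direct = {"GO:b": 2, "GO:c": 1, "GO:d": 5}

total = defaultdict(int)
for node, count in direct.items():
    # Walk from the annotated node up to the root, adding the count
    # to every term along the way.
    while True:
        total[node] += count
        if node not in parent:
            break
        node = parent[node]

print(dict(total))  # the root GO:a accumulates every descendant's count
```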
Data interpretation is such an acute problem, however, that there always seems to be a shortage of tools that can perform the ORA analysis in one specific way. A cottage industry of hundreds of ORA tools now exists; the vast majority have been abandoned and fail when run, or even worse, tacitly give the wrong answer.
Moreover, the accuracy of a method critically depends on integrating the most up-to-date information. As a cautionary tale, we note that the DAVID: Functional Annotation Tool was not updated from 2010 to 2016! Over this time it was unclear to users whether the data that the server operated on included the most up-to-date information. New functions and annotations are added to functional annotation databases on an almost weekly basis, and over many years this widened the gap between DAVID's data and the available data considerably. We believe that by the end of 2016 DAVID operated on a mere 20% of the total information available. Nevertheless, it gained thousands of citations every single year after the 2010 update.
A user reports:
I'm trying to install all the tools required for the book by using:
curl http://data.biostarhandbook.com/install/conda.txt | xargs conda install -y
but I get the following UnsatisfiableError:
The following specifications were found to be in conflict:
- python 3.6*
- tophat -> python 2.7*
Hi,
First, great job on the book. I like to take a look at it whenever I have time. Just wanted to report that one of your links here, which was supposed to lead to a Biostar post, actually goes to the Picard manual for MarkDuplicates.
Nothing major, but I wanted to report it as I know it is not what you meant to add :)
Where could I download a FASTA reference genome for human?
Currently working through the Biostar Handbook on Windows 10 using Bash on Ubuntu as a bench scientist. The guide is incredibly useful, but it's taking some time to troubleshoot the various issues I've encountered along the way due to running Ubuntu through Windows.
I'm also making a thorough enough how-to manual that anybody else in my lab would be able to copy it line by line to successfully install the required components. I've started writing scripts to automate the install process for Windows users so they can avoid potential problems caused by an inexperienced user attempting to set up the necessary tools. I'd like to contribute what I've found so far, and troubleshoot a couple of resilient errors moving forward.
Hello,
When running the command
esearch -db sra -query PRJNA257197
I get this issue:
501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=sra&term=PRJNA257197&retmax=0&usehistory=y&edirect_os=darwin&edirect=7.70&tool=edirect&[email protected]'
Result of do_post http request is
$VAR1 = bless( {
'_headers' => bless( {
'client-date' => 'Sat, 27 Jan 2018 14:18:01 GMT',
'::std_case' => {
'client-warning' => 'Client-Warning',
'client-date' => 'Client-Date'
},
'client-warning' => 'Internal response',
'content-type' => 'text/plain'
}, 'HTTP::Headers' ),
'_rc' => 501,
'_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
'_msg' => 'Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)',
'_request' => bless( {
'_headers' => bless( {
'content-type' => 'application/x-www-form-urlencoded',
'user-agent' => 'libwww-perl/6.31'
}, 'HTTP::Headers' ),
'_content' => 'db=sra&term=PRJNA257197&retmax=0&usehistory=y&edirect_os=darwin&edirect=7.70&tool=edirect&email=[email protected]',
'_method' => 'POST',
'_uri' => bless( do{(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi')}, 'URI::https' )
}, 'HTTP::Request' )
}, 'HTTP::Response' );
WebEnv value not found in search output - WebEnv1
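A workaround that often resolves this class of error: install the Perl HTTPS module the message names as missing into the same environment. The conda package name below is my assumption, based on Bioconda's perl-* naming convention.

```shell
# LWP::Protocol::https is the module the 501 error says is missing;
# the bioconda build can be installed alongside entrez-direct:
conda install -y perl-lwp-protocol-https
```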
After paired-end sequencing on an Illumina machine, I have two FASTQ files for my TC31A sample: TC31A_S189_R1.fastq.dsrc2 and TC31A_S189_R2.fastq.dsrc2.
My question is: how can I evaluate the size of my sample and the number of reads that compose it?
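A sketch of one way to answer this, assuming the DSRC 2 tool (`dsrc`) is installed; the filenames are taken from the question, and the read count relies on each FASTQ record spanning exactly 4 lines.

```shell
# Decompress the DSRC2 archive back to plain FASTQ:
dsrc d TC31A_S189_R1.fastq.dsrc2 TC31A_S189_R1.fastq
# File size on disk:
ls -lh TC31A_S189_R1.fastq
# Each FASTQ record is 4 lines, so the read count is lines / 4:
echo $(( $(wc -l < TC31A_S189_R1.fastq) / 4 ))
```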
Hi, I created a fork and tried to use it to suggest some edits in a pull request, but I couldn't access the text in the handbook to add the edits.
At the beginning of the Unix bootcamp, you introduce ls:
ls
Applications Desktop Documents Downloads
You then make 4 points about the output from that command. The third point says
The output of the ls command lists two things. In this case, they are directories, but they could also be files. We'll learn how to tell them apart later on. These directories were created as part of a specific course that used this bootcamp material. You will therefore probably see something very different on your own computer.
I'm not sure how to interpret this, because four directories are listed. Maybe "two" refers to two types of things, but my best guess is that it refers to the number of directories and is out of date.
I downloaded:
efetch -db=nuccore -format=gb -id=AF086833 > AF086833.gb
and then typed
samtools view -c bwa.bam AF086833:470-2689
but it showed:
[main_samview] region "AF086833:470-2689" specifies an unknown reference name. Continue anyway.
Could anyone help me figure out why this error occurs? Thanks!
Hi
I have been going through the handbook every now and then and I have seen typos at places.
What is the best method and platform to report them? By "method", I mean a format for reporting which typo occurs in which section on which page.
Thanks
Vijay
On the page https://www.biostarhandbook.com/ is the sentence "The book is available in over the web, as a PFD, an eBOOK and in Kindle formats.". Did you mean PDF instead of PFD?
/Users/ialbert/
is identical to specifying the full path:
cd /Users/ialbert/edu/tmp
Presumably, there should be a ~/edu/tmp first.
For first-time learners of Unix, it would be helpful if this section (page 420) contained an explanation of the commands/pipelines used to sort the list of genes and grab the top ten most highly annotated genes and proteins. Particularly, the last step, sort -k1,1nr, is unclear. Are we not meant to understand this yet?
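For what it's worth, `sort -k1,1nr` unpacks as: sort on field 1 only (-k1,1), numerically (n), in reverse order (r), so the largest count lands on top. A toy illustration with made-up data:

```shell
# Annotation counts in column 1, gene names in column 2; the line with
# the largest count sorts first:
printf '3 geneA\n10 geneB\n7 geneC\n' | sort -k1,1nr
# prints:
# 10 geneB
# 7 geneC
# 3 geneA
```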
conda create -y --name sequana_env python=3.6
Solving environment: failed
$ /home/stanislas/miniconda3/bin/conda create -y --name sequana_env python=3.6
environment variables:
CIO_TEST=
CONDA_ROOT=/home/stanislas/miniconda3
CONDA_SHLVL=0
DEFAULTS_PATH=/usr/share/gconf/ubuntu.default.path
MANDATORY_PATH=/usr/share/gconf/ubuntu.mandatory.path
PATH=/home/stanislas/src/edirect:/home/stanislas/miniconda3/bin:/home/stani
slas/bin:/home/stanislas/.local/bin:/usr/local/sbin:/usr/local/bin:/us
r/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
REQUESTS_CA_BUNDLE=
SSL_CERT_FILE=
WINDOWPATH=2
active environment : None
shell level : 0
user config file : /home/stanislas/.condarc
populated config files : /home/stanislas/.condarc
conda version : 4.5.4
conda-build version : not installed
python version : 3.6.5.final.0
base environment : /home/stanislas/miniconda3 (writable)
channel URLs : https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/r/linux-64
https://conda.anaconda.org/r/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/pro/linux-64
https://repo.anaconda.com/pkgs/pro/noarch
package cache : /home/stanislas/miniconda3/pkgs
/home/stanislas/.conda/pkgs
envs directories : /home/stanislas/miniconda3/envs
/home/stanislas/.conda/envs
platform : linux-64
user-agent : conda/4.5.4 requests/2.18.4 CPython/3.6.5 Linux/4.15.0-34-generic ubuntu/18.04 glibc/2.27
UID:GID : 1000:1000
netrc file : None
offline mode : False
V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V
CondaHTTPError: HTTP 404 NOT FOUND for url https://conda.anaconda.org/r/noarch/repodata.json
Elapsed: 00:00.232108
CF-RAY: 45d3af7ffe1fc03b-MRS
The remote server could not find the noarch directory for the requested channel with url: https://conda.anaconda.org/r
As of conda 4.3, a valid channel must contain a noarch/repodata.json and associated noarch/repodata.json.bz2 file, even if noarch/repodata.json is empty. Please request that the channel administrator create noarch/repodata.json and associated noarch/repodata.json.bz2 files.
$ mkdir noarch
$ echo '{}' > noarch/repodata.json
$ bzip2 -k noarch/repodata.json
You will need to adjust your conda configuration to proceed.
Use conda config --show channels to view your configuration's current state.
Further configuration help can be found at https://conda.io/docs/config.html.
A reportable application error has occurred. Conda has prepared the above report.
Upload successful.
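For what it's worth, the 404 above comes from the r channel (https://conda.anaconda.org/r), which no longer serves packages. One way to proceed, assuming the channel setup commonly recommended for bioconda, is a ~/.condarc that simply omits that channel:

```yaml
# ~/.condarc — channel list without the defunct "r" channel
channels:
  - conda-forge
  - bioconda
  - defaults
```

Alternatively, conda config --remove channels r removes just that entry from the existing configuration.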
How can I uncompress a compressed file in DSRC format?
In the section Accessing the Short Read Archive (SRA), when I type the command as listed:
fastq-dump SRR1553607
it shows:
2017-03-31T03:39:48 fastq-dump.2.3.5 err: error unexpected while resolving tree within virtual file system module - failed to resolve accession 'SRR1553607' - Obsolete software. See https://github.com/ncbi/sra-tools/wiki ( 406 )
Redirected!!!
2017-03-31T03:39:48 fastq-dump.2.3.5 err: name incorrect while evaluating path within network system module - Scheme is 'https'
2017-03-31T03:39:48 fastq-dump.2.3.5 err: item not found while constructing within virtual database module - the path 'SRR1553607' cannot be opened as database or table
Has anybody successfully installed all necessary components from Conda as described in the biostar handbook?
It seems there is an issue with Perl running on Windows 10. It is possible to fix this, but that would mean installing the programs from source instead of through conda. The particular issue with conda installing Perl can be found here.
I'm afraid I haven't found an easy way around this. I'm using the Anniversary build of Windows 10. From the first link, it seems Microsoft is aware of the issue and has fixed it (along with adding additional improvements) in the Creators Update, rumored to be released around April 17th.
My question is: has anybody encountered and surmounted this obstacle with conda? If not, it would probably be better to hold off on creating a solution until after the Creators Update, when they can apply a patch.
As reported in:
https://www.biostars.org/p/225812/#242326
In the section How to visualize genomic variation (What would realistic and good data look like?) it says that the script simulate-experimental-data.sh will generate a file called results.bam. It actually generates a file called align.bam.
In the section Variant effect prediction (How do I use snpEff?) the link http://data.biostarhandbook.com/variant/find-ebola-variants.sh results in a file not found error. Please check, thanks.
This concerns the Zika example of kallisto quantification.
Where do the fragment length of 187 bp and standard deviation of 70 bp come from? It would be helpful to the reader to know how to get this information.
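Regarding the question above: with paired-end reads kallisto estimates these values itself, but in single-end mode they must be supplied via -l and -s. One common way to obtain them, sketched below with a hypothetical file name aligned.sam, is to take them from a paired-end alignment of the same library, where the template length (TLEN, SAM column 9) records the fragment size:

```shell
# Mean and standard deviation of fragment lengths from a SAM file,
# using TLEN (column 9); positive values only, so each pair counts once.
# "aligned.sam" is a placeholder name, not a file from the handbook.
awk '$9 > 0 { s += $9; ss += $9 * $9; n++ }
     END { m = s / n; printf "mean=%.1f sd=%.1f\n", m, sqrt(ss / n - m * m) }' aligned.sam
```

The resulting numbers could then be passed to kallisto as, e.g., kallisto quant --single -l 187 -s 70. A Bioanalyzer trace of the library, or the methods section of the original paper, are other common sources for these values.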