biostar-handbook's Issues

update_blastdb.pl --showall | head

In slide #15 of lecture 14, there is a command like

update_blastdb.pl --showall | head

16SMicrobial
cdd_delta
env_nr

When I ran update_blastdb.pl, this showed up.

update_blastdb.pl --showall | head
Connected to NCBI

update_blastdb.pl --decompress 16SMicrobial
Connected to NCBI
16SMicrobial not found, skipping.

Nothing (no 16SMicrobial, etc.) is listed after the phrase "Connected to NCBI". (I'm using Bash on Ubuntu on Windows.)

Should I get the data first? I downloaded "16SMicrobial.tar.gz" by
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz

I couldn't find out how to get or store the data. Could you explain the procedure in detail?

Thank you in advance.
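If update_blastdb.pl cannot list the databases, a manually downloaded archive can still be used: unpack it into a directory and point the BLASTDB environment variable at that directory so blastn can find the database by name. A minimal sketch with a synthetic archive (the .nin file below is only a stand-in for the real database files):

```shell
# Simulate a downloaded database archive with a dummy index file.
printf 'dummy\n' > 16SMicrobial.nin          # stand-in for a real BLAST index file
tar czf 16SMicrobial.tar.gz 16SMicrobial.nin

# Unpack into a dedicated directory and point BLASTDB at it.
mkdir -p blastdb
tar xzf 16SMicrobial.tar.gz -C blastdb
export BLASTDB="$PWD/blastdb"
ls "$BLASTDB"
```

With the real 16SMicrobial.tar.gz from the NCBI FTP site, the same unpack-and-export steps should let blastn -db 16SMicrobial resolve the database without a full path.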

help

How can I uncompress a compressed source file?

lecture11.sh failed

I generated lecture11.sh from page 10 of the lecture 11 slides and executed "bash lecture11.sh".
However, I got the errors below. I think the "wget" inside lecture11.sh didn't work, even though wget runs fine when executed directly. I am using Windows 10 bash. Is there any hint?

~/unix$ bash lecture11.sh
lecture11.sh: line 2: $'\r': command not found
lecture11.sh: line 5: $'\r': command not found
--2017-09-26 14:30:56--  http://data.biostarhandbook.com/data/sequencing-platform-data.tar.gz%0D
Resolving data.biostarhandbook.com (data.biostarhandbook.com)... 198.74.58.207
Connecting to data.biostarhandbook.com (data.biostarhandbook.com)|198.74.58.207|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-09-26 14:30:57 ERROR 404: Not Found.

lecture11.sh: line 8: $'\r': command not found
tar (child): sequencing-platform-data.tar.gz\r: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
lecture11.sh: line 11: $'\r': command not found
' which didn't exist, or couldn't be read
lecture11.sh: line 14: $'\r': command not found
TrimmomaticSE: Started with arguments:
 illumina.fq better.fq SLIDINGDOWN:4:30
Automatically using 4 threads
Exception in thread "main" java.lang.RuntimeException: Unknown trimmer: SLIDINGDOWN
        at org.usadellab.trimmomatic.trim.TrimmerFactory.makeTrimmer(TrimmerFactory.java:70)
        at org.usadellab.trimmomatic.Trimmomatic.createTrimmers(Trimmomatic.java:59)
        at org.usadellab.trimmomatic.TrimmomaticSE.run(TrimmomaticSE.java:303)
        at org.usadellab.trimmomatic.Trimmomatic.main(Trimmomatic.java:85)
lecture11.sh: line 17: $'\r': command not found
' which didn't exist, or couldn't be read
lecture11.sh: line 20: $'\r': command not found

<lecture11.sh>

# Comments start with the # sign.

# Work at the command line. 
# If it works, copy it into this file. 


# Get the example dataset for lecture 11.
wget http://data.biostarhandbook.com/data/sequencing-platform-data.tar.gz

# The file unpacks into 4 sequencing datasets. 
tar zxvf sequencing-platform-data.tar.gz

# Quality plots before trimming.
fastqc illumina.fq

# Trim back by quality. 
trimmomatic SE illumina.fq better.fq SLIDINGDOWN:4:30

# Quality plots after trimming. 
fastqc better.fq
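The $'\r': command not found errors indicate the script was saved with Windows (CRLF) line endings; the trailing \r also corrupts the wget URL (note the %0D in the 404) and the tar filename. Stripping the carriage returns should fix those failures. (Separately, the Trimmomatic step fails because SLIDINGDOWN is not a trimmer; the documented step name is SLIDINGWINDOW.) The CRLF fix in miniature:

```shell
# Simulate a script saved with Windows line endings.
printf 'echo hello\r\n' > lecture11_crlf.sh

# Running it directly trips over the stray carriage return;
# stripping \r produces a clean Unix-style script.
tr -d '\r' < lecture11_crlf.sh > lecture11_fixed.sh
bash lecture11_fixed.sh        # prints: hello
```

Saving the file with Unix line endings in the editor (or running dos2unix, where available) achieves the same thing.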

python version issue in section "Visualizing genomic variations"

After downloading the perfect_coverage.py file, I ran the following command:

$ cat refs/AF086833.fa | python perfect_coverage.py

it complains:

Traceback (most recent call last):
File "perfect_coverage.py", line 55, in
read1, read2 = file('R1.fq', 'wt'), open('R2.fq', 'wt')
NameError: name 'file' is not defined
(bioinfo)
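The traceback means the script was written for Python 2, where the built-in file() was an alias for open(); Python 3 removed file(). Replacing the call with open() should resolve it. A sketch of the offending line, rewritten (the FASTQ records below are only illustrative):

```python
# Python 2 allowed: read1, read2 = file('R1.fq', 'wt'), open('R2.fq', 'wt')
# In Python 3, file() no longer exists; use open() for both handles.
read1, read2 = open('R1.fq', 'wt'), open('R2.fq', 'wt')

read1.write('@read1\nACGT\n+\nIIII\n')   # illustrative FASTQ record
read2.write('@read2\nTGCA\n+\nIIII\n')

read1.close()
read2.close()
```

Alternatively, running the unmodified script under a Python 2 interpreter would also work, but editing the one call is the smaller change.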

Lecture 1-Biostar Handbook- Doctor.py issue

Hello,

I am working through lecture 1 up to running the doctor.py script (Windows 10), and it doesn't work :/
Here is what my terminal shows:

"bash: /root/bin/doctor.py: /usr/bin/python: bad interpreter: No such file or directory"

So I did: "ls" which gave "Miniconda3-latest-Linux-x86_64.sh bash_profile bin miniconda3"

When I choose bin ("cd bin" then "ls"), it gave me: "doctor.py"

...

Any idea what could be happening with the doctor script?

Many thanks in advance!
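The "bad interpreter" error likely means the shebang on doctor.py's first line points at /usr/bin/python, which does not exist on systems that ship only python3. One possible fix is rewriting the shebang to locate the interpreter via env; a sketch on a throwaway copy (not the real doctor.py):

```shell
# A throwaway script whose shebang points at a possibly missing interpreter.
printf '#!/usr/bin/python\nprint("ok")\n' > doctor_copy.py

# Rewrite line 1 to locate whichever python3 is on the PATH.
sed -i '1s|.*|#!/usr/bin/env python3|' doctor_copy.py
chmod +x doctor_copy.py
head -n 1 doctor_copy.py       # shows the new interpreter line
./doctor_copy.py               # prints: ok
```

Installing a python-to-python3 symlink (on Ubuntu, the python-is-python3 package provides one) is an alternative that avoids editing the script.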

Suggestion (Topic expansion): Microarrays & R

I think it is appropriate, since many scientists still use microarrays (often for reasons of cost), to have sections on microarray experiments and analysis. There is a lot of avid discussion on technologies, normalization techniques, etc. that I think is important for bioinformaticians to know.

Additionally, have you considered adding any information about using R? I noticed that your BioMart section is a TODO, and R has a good package to handle this.

With that said, thanks for the amazing book. I'm enjoying it.

More suggested edits:

Do the GO annotations change?
The latest GO data download produces a very different report than the first list. Notably, HLA genes are no longer in the top ten. This reinforces the ephemeral nature of annotations: as new information or different filtering parameters are applied, the information content will shift drastically. Why did the HLA genes disappear? Was that an error? Probably... but what does it mean for the publications that were based on research completed during the time the error was present? Was anyone ever publicly notified about the error? We discovered this discrepancy by accident... This demonstrates that not everything is going well in the GO world.

The following is a description of a discovery process that may not even apply anymore but it is educational to learn about.

As we can see, the most annotated protein is P04637, a cellular tumor antigen p53, which acts as a tumor suppressor in many tumor types. It corresponds to the gene TP53.

The most annotated gene was HLA-B. As it happens, this is a gene that we were unable to find right away when searching for it on the official gene name nomenclature (HGNC).

It is one of those &$!*&@!@ moments that every bioinformatician will have to learn to deal with:

The most annotated gene in the GO database seemingly cannot be found when searching for it in HGNC, the repository that is supposed to be the authoritative resource for the domain!

The cause is both simple and infuriating: the search for HLA-B finds over a thousand entries related to HLA-B but the gene that is actually named HLA-B turns up only on the second page of hits, buried among entries distantly related to it.

Situations like this, where seemingly mundane tasks suddenly require problem solving, are very common in bioinformatics. Keep that in mind the next time something completely surreal seems to happen.

An alternative and much better maintained resource, *GeneCards*, allows you to locate the gene by name right away.

How complete is the GO?
We may wonder: how complete are these GO annotations? Would it be possible to estimate what percent of all functional annotations have been discovered so far?

While that question is too abstract for our liking, we could tabulate the dates assigned to each piece of evidence and see how many pieces of evidence are produced per year.

It is "reasonably simple" to extract and tabulate the date information from the two files that we downloaded to get the rate of information growth in the GO:

Don't worry about this command if you don't get it yet.

We'll cover it later but we just want to demonstrate some results.

cat assoc.txt \
    | cut -f 14 \
    | awk '{ print substr($1, 1, 4) }' \
    | sort \
    | uniq -c \
    | sort -k 2,2 -n \
    | awk '{ print $2 "\t" $1 }'
Produces the number of annotations verified in a given year.

...
2008 9682
2009 13322
2010 15026
2011 23490
2012 16428
2013 60555
2014 34925
2015 33096
2016 235077
This result is quite surprising, and we don't quite know what to make of it. It seems that most evidence is assigned to the latest year, 2016. Or maybe there is some sort of quirk in the system that assigns the latest date to every piece of evidence that is re-observed? We don't know.

There is also a surprising dip in the number of annotations in 2012, followed by an increase in 2013 that cannot be easily explained.
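The pipeline's behavior can be sanity-checked on a tiny synthetic file in which column 14 holds a YYYYMMDD date, just as in the real association file (mini_assoc.txt below is made up for illustration):

```shell
# Build a miniature association file: 13 filler columns, date in column 14.
for d in 20150101 20150315 20160101; do
    printf 'f\tf\tf\tf\tf\tf\tf\tf\tf\tf\tf\tf\tf\t%s\n' "$d"
done > mini_assoc.txt

cat mini_assoc.txt \
    | cut -f 14 \
    | awk '{ print substr($1, 1, 4) }' \
    | sort \
    | uniq -c \
    | sort -k 2,2 -n \
    | awk '{ print $2 "\t" $1 }'
# prints:
# 2015    2
# 2016    1
```

Each stage is easy to inspect in isolation: cut pulls the date column, substr keeps the year, uniq -c counts, and the final awk swaps the columns into year-count order.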

Version 5.60 mismatch

After following the instructions to setup my computer and running the doctor, I get the following 2 errors:

Version 5.60 mismatch for: efetch -version

Version 5.60 mismatch for: esearch -version

Thanks for any advice

Tophat not installing with Biostar Handbook

I'm working from the Biostar Handbook and trying to do the alignment with tophat. I'm on page 480 for reference.

I was able to successfully run

$ bowtie2-build $REF $IDX

However, then it says that I need to change the invocation of the aligner to

$ tophat -G $GTF -o tophat_hbr1 $IDX $R1 $R2
bash: /usr/bin/tophat: No such file or directory
(bioinfo) 

And that's my output. If I do which tophat, nothing comes up, so it seems like it's just not there. So, I tried to install tophat with

$ conda install tophat
Fetching package metadata .................
Solving package specifications: .

UnsatisfiableError: The following specifications were found to be in conflict:
  - python 3.6*
  - tophat -> python 2.7*
Use "conda info <package>" to see the dependencies for each package.

(bioinfo) 

I saw some solutions online for changing a line of code in tophat to do this, but I don't even know where that file is. I can downgrade python, but I'm not sure if that will work, and I'm worried that I'll break other things if I do. Also, when I do

$ ls /usr/bin | grep python
dh_python2
dh_python3
python
python-config
python2
python2-config
python2.7
python2.7-config
python3
python3.4
python3.4m
python3m
x86_64-linux-gnu-python-config
x86_64-linux-gnu-python2.7-config
(bioinfo) 
moltres@moltres-ao ~/biostar/Sequencing/griffith
$ python -V
Python 3.6.3
(bioinfo) 

I see a bunch of different python versions, and I'm not sure if I should switch between them or what.

It doesn't even say to install tophat anywhere in this section. I'm pretty lost here.

BLAST version (2.6.0 -> 2.2.31)

Hi,

I installed the latest BLAST according to the instruction for LINUX (https://www.biostarhandbook.com/tools/align/blast.html)

After the successful installation, I could see the latest BLAST (blastn 2.6.0), and it worked well (e.g., update_blastdb.pl).

which blastn
/home/flyark/src/ncbi-blast-2.6.0+/bin/blastn
blastn -version
blastn: 2.6.0+
 Package: blast 2.6.0, build Dec  7 2016 14:50:34

However, the blastn version reverted to 2.2.31 when I restarted bash on Windows.

which blastn
/usr/bin/blastn
blastn -version
blastn: 2.2.31+
Package: blast 2.2.31, build Jan  7 2016 23:17:17

I think the old BLAST, which had been installed automatically during the introduction, masks the new BLAST when bash is restarted.

Could you tell me how to completely replace the old BLAST with the new one?

p.s. The present ncbi-blast version is 2.6.0; the 2.5.0 on the installation page needs to be updated to 2.6.0.

homebrew

Looks like the 'How do I use Homebrew?' section for setting up a MacOS computer may need updating:

To use brew for bioinformatics you will need to "tap" the "science formulas":

This is used to "tap" formulas used in science

Needs to be done only once.

brew tap homebrew/science

The above command gives an error:
Error: homebrew/science was deprecated. This tap is now empty as all its formulae were migrated.

Wrong command in "BLAST use cases"

The last command in the chapter "BLAST use cases" shown underneath

Get the sequence

efetch -db nucleotide -id AKC37152 -format fasta > AKC37152.fa

should be "protein" instead of "nucleotide".

Thanks.

Unable to create conda environment

I am trying to create a conda environment called bioinfo, but every time I go to run the command...

conda create -y --name bioinfo python=3.7

... I get the following notice:

Solving environment: failed

CondaHTTPError: HTTP 404 NOT FOUND for url <https://conda.anaconda.org/r/noarch/repodata.json>
Elapsed: 00:00.038179
CF-RAY: 45cf07930a8b2585-ORD

The remote server could not find the noarch directory for the
requested channel with url: https://conda.anaconda.org/r

As of conda 4.3, a valid channel must contain a `noarch/repodata.json` and
associated `noarch/repodata.json.bz2` file, even if `noarch/repodata.json` is
empty. please request that the channel administrator create
`noarch/repodata.json` and associated `noarch/repodata.json.bz2` files.
$ mkdir noarch
$ echo '{}' > noarch/repodata.json
$ bzip2 -k noarch/repodata.json

You will need to adjust your conda configuration to proceed.
Use `conda config --show channels` to view your configuration's current state.
Further configuration help can be found at <https://conda.io/docs/config.html>.

Any suggestions?

help

conda install -y bwa
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  • bwa

Current channels:

To search for alternate channels that may provide the conda package you're
looking for, navigate to

https://anaconda.org

and use the search bar at the top of the page.

Automated HISAT2 not generating SAMPLE_summary.txt or runlog.txt (p. 496)

I'm trying to do the Zika RNA-Seq automated alignments, but I noticed that when I got to the differentially expressed genes analysis, all my numbers were off. I'm trying to trace back what might be wrong, and I suspect that it's something with my bam files produced by the automated script on page 496. This is the script I ran using bash:

set -ueo pipefail

mkdir -p bam

CPUS=4

IDX=refs/grch38/genome

RUNLOG=runlog.txt

echo "Run started by `whoami` on `date`" > $RUNLOG

for SAMPLE in $(cat paired_ids.txt)
do
	R1=reads/${SAMPLE}_1.fastq
	R2=reads/${SAMPLE}_2.fastq
	BAM=bam/${SAMPLE}.bam
	SUMMARY=bam/${SAMPLE}_summary.txt

	echo "Running HISAT2 on paired end $SAMPLE"
	hisat2 -p $CPUS -x $IDX -1 $R1 -2 $R2 | samtools sort > $BAM 2> $RUNLOG
	samtools index $BAM
done

for SAMPLE in $(cat single_ids.txt)
do
	R1=reads/${SAMPLE}.fastq
	BAM=bam/${SAMPLE}.bam
	SUMMARY=bam/${SAMPLE}_summary.txt

	echo "Running Hisat2 on single end: $SAMPLE"
	hisat2 -p $CPUS -x $IDX -U $R1 | samtools sort > $BAM 2> $RUNLOG
	samtools index $BAM

done

However, my runlog.txt file is completely empty, and I don't get a bam/${SAMPLE}_summary.txt file.

I also don't understand what IDX=refs/grch38/genome refers to, since I don't have a file by precisely that name.

Is this a glitch, or am I doing something wrong here? Thanks.
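Two quirks of the quoted script would explain the empty log. In hisat2 ... | samtools sort > $BAM 2> $RUNLOG, the 2> redirection applies to samtools sort, not to HISAT2, so the alignment summary still goes to the terminal; and a plain > would truncate the log on every loop iteration anyway (2>> appends). The SUMMARY variable is assigned but never used, which is why no _summary.txt appears; redirecting hisat2's own stderr into it would capture the summary. The redirection rule in miniature (the aligner function below is only a stand-in, not a real tool):

```shell
# Stand-in for hisat2: writes data to stdout and a summary to stderr.
aligner() { echo alignments; echo 'summary line' >&2; }

# 2> on the right side of the pipe captures the *consumer's* stderr:
aligner | cat > out1.txt 2> log1.txt     # log1.txt stays empty

# Redirecting the producer's stderr captures the summary:
aligner 2> log2.txt | cat > out2.txt     # log2.txt holds "summary line"
```

Applied to the script, something like hisat2 ... 2> $SUMMARY | samtools sort > $BAM (with 2>> $RUNLOG where a cumulative log is wanted) should produce the per-sample summaries.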

bash find-variants.sh

In the last line, echo "*** Calling variants from all runs: samples.vcf", the script actually produces combined.vcf, not samples.vcf. I hope this gets fixed in the next update.

Unable to download openjdk and mkl packages

Hi, I'm working through the environment setup stage.

I entered the bioinfo environment and executed curl http://data.biostarhandbook.com/install/conda.txt | xargs conda install -y to install all the required tools automatically. However, I got stuck downloading openjdk and mkl: the speed is too slow and the connection keeps dropping.
I tried to install openjdk 8 manually by executing sudo apt-get install openjdk-8-jre, but I was informed that:

openjdk-8-jre is already the newest version (8u162-b12-0ubuntu0.16.04.2).
openjdk-8-jre set to manually installed.

Is there any workaround?

doctor.py and wonderdump errors

Hi,

I was using bash on ubuntu on windows, which worked well. At this time, ubuntu version was 15.

Recently, I did format and re-installed windows as well as bash on ubuntu on windows (version 16).

I tried to install doctor.py, but experienced the following error.

mkdir -p ~/bin
curl http://data.biostarhandbook.com/install/doctor.py > ~/bin/doctor.py
chmod +x ~/bin/doctor.py

flyark:~$ ~/bin/doctor.py
bash: /home/flyark/bin/doctor.py: /usr/bin/python: bad interpreter: No such file or directory

In addition, I installed wonderdump, which showed the following error.
mkdir -p ~/bin
curl http://data.biostarhandbook.com/scripts/wonderdump.sh > ~/bin/wonderdump
chmod +x ~/bin/wonderdump

flyark:~$ ~/bin/wonderdump
/home/flyark/bin/wonderdump: line 19: $1: unbound variable

Could you help me fix these problems?
I didn't have these problems with the previous version of Bash on Ubuntu on Windows.

Missing file: find-ebola-variants.sh

As reported in:

https://www.biostars.org/p/225812/#242326

In the section How to visualize genomic variation (What would realistic and good data look like?) it says that the script simulate-experimental-data.sh will generate a file called results.bam. It actually generates a file called align.bam.

In the section Variant effect prediction (How do I use snpEff?) the link http://data.biostarhandbook.com/variant/find-ebola-variants.sh results in a file not found error. Please check, thanks.

Access the files from both Linux and with Windows Explorer

Hello everyone,

I would like to access the files from both Linux and Windows Explorer, following the instructions from "How do I set up the filesystem with Ubuntu on Windows?" in the Bash on Ubuntu terminal.

So I typed:
mkdir -p '/mnt/c/Users/Lpain/Desktop/unix'

then

ln -s '/mnt/c/Users/Lpain/Desktop/unix' ~/unix

I have a folder named "unix" inside the Linux terminal, but it is not visible on my Desktop under Windows.

Does somebody have any idea or suggestion for how I can fix it?

Many thanks in advance

conda: command not found

Hi, I'm running bash on windows 10 and I followed the instructions to download miniconda. Then when I close the terminal, open it back up and type conda it says:
conda: command not found

I have a suspicion it's not located in the right place, but I'm not sure where it should be.
Thanks,
Gabe

CondaHTTPError: HTTP 000 CONNECTION FAILED

Conda fails with:

 CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://conda.anaconda.org/biostar/linux-64/repodata.json
    Elapsed: -
    
An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
SSLError(MaxRetryError('HTTPSConnectionPool(host='conda.anaconda.org', port=443): Max retries exceeded with url: /bioconda/linux-64/repodata.json (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available.",))',),)

Accession numbers in whole genome classification section

This question is about the whole genome classification section, specifically the subsection called "What are the expected abundances of the data?".

It shows a portion of the XLS file and says that

Psychrobacter cryohalolentis K5 corresponds to accession numbers NC_007969 and will...

Should "numbers" be singular?

Then there is code that searches for accession numbers NC_007969 and NC_007968. Where did the second number come from? It doesn't seem to appear in the XLS file.

Possible improvement of RNA-Seq: Griffith Test Data

First and foremost, thank you for putting together this wonderful guide. I am a complete novice when it comes to programming/bioinformatics and this guide has really helped me learn a lot. I'm very close to being able to start to run some analyses on my own.

My suggestion pertains to the section RNA-Seq: Griffith Test Data -> Analyzing the control samples -> Did our RNA-Seq analysis reproduce the expected outcomes?

The main issue I have is with the final command

paste table1 table2 > compare.txt

If opened with $ less compare.txt, it shows:

ERCC-00002      0.5     -1^M    ERCC-00002      0.587081192626208       -0.768368054756608
ERCC-00003      0.5     -1^M    ERCC-00003      0.680961534372993       -0.554354788186953
ERCC-00004      4       2^M     ERCC-00004      5.88286106939268        2.55651796573194
ERCC-00009      1       0^M     ERCC-00009      1.21972406601959        0.286554808762869
ERCC-00012      0.67    -0.58^M ERCC-00012      Inf     Inf
ERCC-00013      0.5     -1^M    ERCC-00013      Inf     Inf
ERCC-00014      0.5     -1^M    ERCC-00014      0.487600858570454       -1.0362274286211
ERCC-00016      0.67    -0.58^M ERCC-00016      NA      NA
ERCC-00017      4       2^M     ERCC-00017      Inf     Inf
ERCC-00019      4       2^M     ERCC-00019      2.7835815473997 1.4769423488067

If this is opened in Excel, it shows a very ugly output
[screenshot]

The main underlying issue is the ^M generated at the paste step. Now, I have no idea why this ^M appears, and I have verified that none of the previous files appear to contain '^M'.

$ head ERCC-datasheet.csv
ERCC ID,subgroup,concentration in Mix 1 (attomoles/ul),concentration in Mix 2 (attomoles/ul),expected fold-change ratio,log2(Mix 1/Mix 2)
ERCC-00130,A,30000,7500,4,2
ERCC-00004,A,7500,1875,4,2
ERCC-00136,A,1875,468.75,4,2
ERCC-00108,A,937.5,234.375,4,2
ERCC-00116,A,468.75,117.1875,4,2
ERCC-00092,A,234.375,58.59375,4,2
ERCC-00095,A,117.1875,29.296875,4,2
ERCC-00131,A,117.1875,29.296875,4,2
ERCC-00062,A,58.59375,14.6484375,4,2
$ head results.txt
id      baseMean        baseMeanA       baseMeanB       foldChange      log2FoldChange  pval    padj
ERCC-00130      29681.8244237545        10455.9218232761        48907.7270242329        4.67751460376822        2.22574215774208        1.16729711209905e-88       9.10491747437256e-87
ERCC-00108      808.597670575459        264.877838024487        1352.31750312643        5.10543846632202        2.35203486825767        2.40956154792488e-62       9.39729003690704e-61
ERCC-00136      1898.3382995277 615.744918976546        3180.93168007886        5.16598932779828        2.36904466305553        2.80841619396485e-58       7.3018821043086e-57
ERCC-00116      952.57953992746 337.704944218003        1567.45413563692        4.64149004174798        2.21458802318734        1.72224091670519e-45       3.35836978757511e-44
ERCC-00092      310.791194556933        96.697066636053 524.885322477813        5.42814110849266        2.44045822515553        2.44705874688655e-40       3.81741164514302e-39
ERCC-00004      3918.98719921685        1138.76690513024        6699.20749330347        5.88286106939268        2.55651796573194        8.29322966465066e-38       1.07811985640459e-36
ERCC-00095      141.487857460492        52.8817320556433        230.093982865341        4.35110526680992        2.12138192059427        2.46928414480309e-19       2.7514880470663e-18
ERCC-00062      77.4886630526754        22.8781484767386        132.099177628612        5.7740327091121 2.52957928026876        7.80930857468792e-17       7.61407586032072e-16
ERCC-00131      134.742367710255        55.9465732085198        213.53816221199 3.81682290738535        1.93237225006829        4.75964190578523e-16       4.12502298501386e-15
$ head table1
ERCC-00002      0.5     -1
ERCC-00003      0.5     -1
ERCC-00004      4       2
ERCC-00009      1       0
ERCC-00012      0.67    -0.58
ERCC-00013      0.5     -1
ERCC-00014      0.5     -1
ERCC-00016      0.67    -0.58
ERCC-00017      4       2
ERCC-00019      4       2
$ head table2
ERCC-00002      0.587081192626208       -0.768368054756608
ERCC-00003      0.680961534372993       -0.554354788186953
ERCC-00004      5.88286106939268        2.55651796573194
ERCC-00009      1.21972406601959        0.286554808762869
ERCC-00012      Inf     Inf
ERCC-00013      Inf     Inf
ERCC-00014      0.487600858570454       -1.0362274286211
ERCC-00016      NA      NA
ERCC-00017      Inf     Inf
ERCC-00019      2.7835815473997 1.4769423488067

Thus, for some bizarre reason, ^M is being added by the paste command, which completely breaks downstream data analysis.

Of course, it is easy to remove the ^M with a couple of commands and fix the file. However, for folks like myself who follow the guide very closely, it is not clear what is going on. The guide simply glosses over the intermediate steps needed to analyze the data:


[screenshot]


At the very least, it should be mentioned that additional steps were taken to organize the data shown in the final table. Otherwise, it is extremely frustrating to figure out whether something went wrong in the intermediate steps. Ideally, the output of compare.txt should be shown before the final data table so the user is reassured that they did the steps correctly.

Thanks!
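For what it's worth, paste itself does not insert carriage returns; the ^M almost certainly rode in on table1, whose values trace back to the Excel-style ERCC datasheet (files exported from Windows tools end lines with \r\n, and less displays the \r as ^M at the join). Stripping carriage returns from the inputs before pasting yields a clean table; a sketch on made-up two-line inputs:

```shell
# table1 carries Windows line endings (\r\n); table2 is clean.
printf 'ERCC-00002\t0.5\t-1\r\n'           > table1
printf 'ERCC-00002\t0.587081\t-0.768368\n' > table2

# Remove carriage returns first, then paste.
tr -d '\r' < table1 > table1.unix
paste table1.unix table2 > compare.txt

cat -v compare.txt    # no ^M markers remain
```

Running file table1 or cat -v table1 before pasting is a quick way to spot hidden CRLF endings in the real data.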

Inconsistent nucleotide ordering

What are nucleotides?

Nucleotides are the building blocks of nucleic acids (DNA and RNA–we'll get to that one later on). In DNA, there are four types of nucleotide: Adenine, Cytosine, Guanine, and Thymine. Because the order in which they occur encodes the information biologists try to understand, we refer to them by their first letter, A, C, G and T, respectively.

A Adenine
G Guanine
C Cytosine
T Thymine

It's distracting to have the ordering be alphabetical twice in the text and then grouped by purine/pyrimidine in the list.

Perl without perlbrew

Hello
I am having an extremely difficult time getting the efetch/esearch commands to work.

As per a previous discussion, I tried to uninstall perlbrew, but I couldn't find an option to install Perl without perlbrew. Any link/command line for installing Perl without perlbrew?

This has become a huge issue because I simply cannot move forward, as this module is used often.

Thank you

Suggested edits to some typos and possibly confusing wording in the Functional analysis section

[Hi. I used to have a fork to the original version of the handbook that I would read and then send suggestions to ialbert for changes using a pull-request. After I finished the first version I wasn't using the handbook anymore. Now there are additional topics I'm learning about in the newer version. Since I don't have a fork, I just cut and pasted the following sections and added the possible changes to them. Unfortunately it is hard to see where I made changes, so if possible I used bold around the last word I left before cutting text (this doesn't really work at the beginning of sentences so I just put ** **), and I used italics for any suggested rewording. Best, Paige]

Are there different ways to compute ORA analyses?
If you followed the section on Gene Ontology, you know that the GO files have a relatively simple format. Moreover, the data structure is that of a tree (network). To build a program that assigns counts to each node and that can traverse the tree* , * is a standard problem that requires only moderately advanced programming skill.

** Data interpretation is such an acute problem, however, there always seems to be a shortage of tools that can perform the ORA analysis in one specific way. ** A cottage industry of hundreds of ORA tools now ** exists- the vast majority have been abandoned, and when run fail ** or even worse* ,* tacitly give the wrong answer** **.

Moreover, the accuracy of a method will critically depend **on ** integrating the most up to date information. As a tale of caution, we note that the DAVID: Functional Annotation Tool was not updated from 2010 to 2016! Over this time it was unclear to users whether the data that the server operated on included the most up to date information. New functions and annotations are added on an almost weekly basis to functional annotation databases and over many years this increased the difference between DAVID data and the available data ** ** considerably. We believe that by the end of 2016 DAVID operated on a mere 20% of the total ** information ** available. Nevertheless, it had gained thousands of citations every single year after the 2010 update.

Fix link

Hi,

First, great job on the book. I like to take a look at it whenever I have time. Just wanted to report that one of your links, which was supposed to lead to a Biostar post, actually goes to the Picard manual for MarkDuplicates.

Nothing major, but I wanted to report it as I know it is not what you meant to add :)

What's the best way to get involved in contributing to the guide?

Currently working through the Biostar Handbook on Windows 10, using Bash on Ubuntu, as a bench scientist. The guide is incredibly useful, but it's taking some time to troubleshoot the various issues I've encountered along the way due to running Ubuntu through Windows.

I'm also making a thorough enough how-to manual that anybody else in my lab would be able to copy it line by line to successfully install the required components. I've started writing scripts to automate the install process for Windows users so they can avoid potential problems caused by an inexperienced user attempting to set up the necessary tools. I'd like to contribute what I've found so far, and troubleshoot a couple of resilient errors moving forward.

esearch -db sra -query PRJNA257197

Hello,

When running the command
esearch -db sra -query PRJNA257197

I get this issue:

501 Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)
No do_post output returned from 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=sra&term=PRJNA257197&retmax=0&usehistory=y&edirect_os=darwin&edirect=7.70&tool=edirect&[email protected]'
Result of do_post http request is
$VAR1 = bless( {
'_headers' => bless( {
'client-date' => 'Sat, 27 Jan 2018 14:18:01 GMT',
'::std_case' => {
'client-warning' => 'Client-Warning',
'client-date' => 'Client-Date'
},
'client-warning' => 'Internal response',
'content-type' => 'text/plain'
}, 'HTTP::Headers' ),
'_rc' => 501,
'_content' => 'LWP will support https URLs if the LWP::Protocol::https module
is installed.
',
'_msg' => 'Protocol scheme 'https' is not supported (LWP::Protocol::https not installed)',
'_request' => bless( {
'_headers' => bless( {
'content-type' => 'application/x-www-form-urlencoded',
'user-agent' => 'libwww-perl/6.31'
}, 'HTTP::Headers' ),
'_content' => 'db=sra&term=PRJNA257197&retmax=0&usehistory=y&edirect_os=darwin&edirect=7.70&tool=edirect&email=[email protected]',
'_method' => 'POST',
'_uri' => bless( do{(my $o = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi')}, 'URI::https' )
}, 'HTTP::Request' )
}, 'HTTP::Response' );

WebEnv value not found in search output - WebEnv1

help

After paired-end Illumina sequencing, I have two FASTQ files for my TC31A sample: TC31A_S189_R1.fastq.dsrc2 and TC31A_S189_R2.fastq.dsrc2.
My question is: how can I evaluate the size of my sample and the number of reads it contains?
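Once the .dsrc2 files are decompressed back to plain FASTQ (the DSRC tool that produced them has a decompress mode), each read occupies exactly four lines, so the read count is simply the line count divided by four, and ls -lh gives the file size. A sketch on a tiny made-up file:

```shell
# A miniature FASTQ file with two reads; each record is exactly 4 lines.
printf '@r1\nACGT\n+\nIIII\n@r2\nTGCA\n+\nIIII\n' > sample.fq

# Reads = lines / 4.
echo $(( $(wc -l < sample.fq) / 4 ))    # prints: 2
```

The same arithmetic applies to each of the R1 and R2 files; for properly paired data the two counts should match.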

Trouble assigning a role to the number two in point 3 about the ls output.

At the beginning of the Unix bootcamp, you introduce ls:

ls
Applications Desktop Documents Downloads

You then make 4 points about the output from that command. The third point says

The output of the ls command lists two things. In this case, they are directories, but they could also be files. We'll learn how to tell them apart later on. These directories were created as part of a specific course that used this bootcamp material. You will therefore probably see something very different on your own computer.

I'm not sure how to interpret this, because four directories are listed. Maybe "two" refers to two types of things, but my best guess is that it refers to the number of directories and is out of date.

need help for "Analyzing SAM files"

I downloaded:
efetch -db=nuccore -format=gb -id=AF086833 > AF086833.gb

and then typed
samtools view -c bwa.bam AF086833:470-2689

but it showed:
[main_samview] region "AF086833:470-2689" specifies an unknown reference name. Continue anyway.

Could anyone help me figure out why this error occurs? Thanks!

what is the best way to report typos in the handbook?

Hi

I have been going through the handbook every now and then and I have seen typos at places.

What is the best method and platform to report them? By method, I mean a "format" to report what typo occurs in which section at which page number.

Thanks
Vijay

the ebola project PRJNA257197 update

It seems that this project has been updated in the past 2-3 days; I can't see the protein and nucleotide data any more, only the SRA data was kept. However, the esearch and efetch tools still work :)


Typo?

On the page https://www.biostarhandbook.com/ is the sentence "The book is available in over the web, as a PFD, an eBOOK and in Kindle formats.". Did you mean PDF instead of PFD?

Most annotated human genes and proteins

For first time learners of Unix, it would be helpful if this section (page 420) contained an explanation of the commands/pipelines called to sort the list of genes and grab the top ten most highly annotated genes and proteins. Particularly the last step sort -k1,1nr is unclear. Are we not meant to understand this yet?
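For what it's worth, the flags unpack as: -k1,1 restricts the sort key to the first field only, n compares numerically, and r reverses the order so the largest counts come first. A small demonstration with made-up counts:

```shell
# Numeric reverse sort on field 1: 10 outranks 3, which outranks 2.
# A plain lexical sort would wrongly place "10" before "2".
printf '3 geneA\n10 geneB\n2 geneC\n' | sort -k1,1nr
```

The first line of the output is "10 geneB", which is why piping into head after this sort yields the most highly annotated entries.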

help

conda create -y --name sequana_env python=3.6
Solving environment: failed

>>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

$ /home/stanislas/miniconda3/bin/conda create -y --name sequana_env python=3.6

environment variables:
CIO_TEST=
CONDA_ROOT=/home/stanislas/miniconda3
CONDA_SHLVL=0
DEFAULTS_PATH=/usr/share/gconf/ubuntu.default.path
MANDATORY_PATH=/usr/share/gconf/ubuntu.mandatory.path
PATH=/home/stanislas/src/edirect:/home/stanislas/miniconda3/bin:/home/stani
slas/bin:/home/stanislas/.local/bin:/usr/local/sbin:/usr/local/bin:/us
r/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
REQUESTS_CA_BUNDLE=
SSL_CERT_FILE=
WINDOWPATH=2

 active environment : None
        shell level : 0
   user config file : /home/stanislas/.condarc

populated config files : /home/stanislas/.condarc
conda version : 4.5.4
conda-build version : not installed
python version : 3.6.5.final.0
base environment : /home/stanislas/miniconda3 (writable)
channel URLs : https://conda.anaconda.org/bioconda/linux-64
https://conda.anaconda.org/bioconda/noarch
https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
https://conda.anaconda.org/r/linux-64
https://conda.anaconda.org/r/noarch
https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/pro/linux-64
https://repo.anaconda.com/pkgs/pro/noarch
package cache : /home/stanislas/miniconda3/pkgs
/home/stanislas/.conda/pkgs
envs directories : /home/stanislas/miniconda3/envs
/home/stanislas/.conda/envs
platform : linux-64
user-agent : conda/4.5.4 requests/2.18.4 CPython/3.6.5 Linux/4.15.0-34-generic ubuntu/18.04 glibc/2.27
UID:GID : 1000:1000
netrc file : None
offline mode : False

V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V

CondaHTTPError: HTTP 404 NOT FOUND for url https://conda.anaconda.org/r/noarch/repodata.json
Elapsed: 00:00.232108
CF-RAY: 45d3af7ffe1fc03b-MRS

The remote server could not find the noarch directory for the
requested channel with url: https://conda.anaconda.org/r

As of conda 4.3, a valid channel must contain a noarch/repodata.json and
associated noarch/repodata.json.bz2 file, even if noarch/repodata.json is
empty. please request that the channel administrator create
noarch/repodata.json and associated noarch/repodata.json.bz2 files.
$ mkdir noarch
$ echo '{}' > noarch/repodata.json
$ bzip2 -k noarch/repodata.json

You will need to adjust your conda configuration to proceed.
Use conda config --show channels to view your configuration's current state.
Further configuration help can be found at https://conda.io/docs/config.html.

A reportable application error has occurred. Conda has prepared the above report.
Upload successful.
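The 404 points at the r channel URL, which no longer serves repodata. A possible workaround, assuming the channel is simply a stale entry in ~/.condarc and is not otherwise needed:

```shell
conda config --show channels      # confirm "r" is in the channel list
conda config --remove channels r  # drop the channel whose URL 404s
conda create -y --name sequana_env python=3.6
```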

help

How can I decompress a file that is in DSRC format?
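A sketch, assuming the DSRC 2 binary (dsrc) is installed and follows its documented dsrc <mode> <input> <output> convention; the file name is taken from the earlier post:

```shell
# d = decompress mode (c would compress):
dsrc d TC31A_S189_R1.fastq.dsrc2 TC31A_S189_R1.fastq
```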

command does not work: fastq-dump SRR1553607

in section: Accessing the Short Read Archive (SRA)
when I type this command as listed:
fastq-dump SRR1553607

it shows:

2017-03-31T03:39:48 fastq-dump.2.3.5 err: error unexpected while resolving tree within virtual file system module - failed to resolve accession 'SRR1553607' - Obsolete software. See https://github.com/ncbi/sra-tools/wiki ( 406 )

Redirected!!!

2017-03-31T03:39:48 fastq-dump.2.3.5 err: name incorrect while evaluating path within network system module - Scheme is 'https'
2017-03-31T03:39:48 fastq-dump.2.3.5 err: item not found while constructing within virtual database module - the path 'SRR1553607' cannot be opened as database or table
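The "Obsolete software" message suggests the installed fastq-dump (2.3.5) predates NCBI's switch to https downloads. A possible fix, assuming sra-tools is managed through bioconda as elsewhere in the handbook:

```shell
conda install -y sra-tools   # pull a current sra-tools build
fastq-dump SRR1553607        # retry with the updated tool
```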

Linux Bash Shell on Windows 10- Known Perl Installation Issues

Has anybody successfully installed all necessary components from Conda as described in the biostar handbook?

It seems there is an issue with perl running on windows 10. It's possible to fix this issue but that would mean installing the programs from source instead of conda. The particular issue with conda installing perl can be found here.

I'm afraid I haven't found an easy way around this. I'm using the Anniversary build of Windows 10. From the first link, it seems Microsoft is aware of the issue and has fixed it (along with additional improvements) in the Creators Update, rumored to be released around April 17th.

My question is has anybody encountered and surmounted this obstacle in conda? If not, it would probably be better to hold off on creating a solution until after the creator update when they can apply a patch.

Missing file: find-ebola-variants.sh

As reported in:

https://www.biostars.org/p/225812/#242326

In the section How to visualize genomic variation (What would realistic and good data look like?) it says that the script simulate-experimental-data.sh will generate a file called results.bam. It actually generates a file called align.bam.

In the section Variant effect prediction (How do I use snpEff?) the link http://data.biostarhandbook.com/variant/find-ebola-variants.sh results in a file not found error. Please check, thanks.

Fragment length and SD in Zika example

This concerns the Zika example of kallisto quantification.

Where do the fragment length of 187 bp and standard deviation of 70 bp come from? It would be helpful to the reader to know how to get this information.
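One way to estimate such values yourself, sketched under the assumption that a paired-end alignment of the same library is available (align.bam is a placeholder name): samtools stats reports the insert-size mean and standard deviation in its summary section.

```shell
# The SN section of samtools stats includes
# "insert size average" and "insert size standard deviation":
samtools stats align.bam | grep '^SN' | grep 'insert size'
```

For purely single-end data these numbers typically have to come from the library-prep QC instead (e.g. a fragment-analyzer trace).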
