scharch / sonar Goto Github PK

View Code? Open in Web Editor NEW

17.0 17.0 10.0 52.42 MB

Software for Ontogenic aNalysis of Antibody Repertoires

License: GNU General Public License v3.0

Python 66.74% Perl 23.37% R 8.37% Dockerfile 0.59% Shell 0.93%

sonar's People

Contributors

Stargazers

Watchers

Forkers

shengzizhang gyc2016 longerham bopopescu wangsiling123 madisonberry shawhahnlab semir2 menchant asjureka

sonar's Issues

X11 error

When operate the code "sonar get_island output/tables/cap256-week34H_goodVJ_unique_id-div.tab --mab CAP256-VRC26.01 --mab CAP256-VRC26.08", error is displayed as below. I run docker as "docker run -it -e DISPLAY=$DISPLAY -v ~:/host/home scharch/sonar".

Loading required package: misc3d
Error in .External2(C_X11, d$display, d$width, d$height, d$pointsize, :
unable to start device X11
Calls: getIsland -> X11
In addition: Warning message:
In X11(type = "Xlib", family = "misc") :
unable to open connection to X11 display ''

Vignette breaks down with lack of islandSeqs.txt page 3

Hello! Thank you for your wonderful program. I am running through the vignette and I ran into a problem. The generation of the islandSeqs.txt file from the processed data from SRR1057705 and SRR1057707 was not discussed. It is necessary to proceed with the tutorial. Also the sonar command 'getfasta' must be changed to getFastaFromList due to conflicts with other getfasta

sonar getfasta -l output/tables/islandSeqs.txt -f output/sequences/nucleotide/cap256-week34H_goodVJ_unique.fa -o output/sequences/nucleotide/cap256-week34H_islandSeq.fa
Input program 'getfasta' is unclear. Did you mean one of the following?
	getFastaFromAIRR.py
	getFastaFromList.py
	getReadsByAnnotation.py
	getListFromFasta.py
	2.2-get_island_interactive.R
root@a714d8cfed7e:/host/home/cap256-week34H# sonar getFastaFromList -l output/tables/islandSeqs.txt -f output/sequences/nucleotide/cap256-week34H_goodVJ_unique.fa -o output/sequences/nucleotide/cap256-week34H_islandSeq.fa

Mon Mar  9 20:49:01 2020 -- SONAR v4.2-29-g2c920bf-dirty run with command:
	/SONAR/utilities/getFastaFromList.py -l output/tables/islandSeqs.txt -f output/sequences/nucleotide/cap256-week34H_goodVJ_unique.fa -o output/sequences/nucleotide/cap256-week34H_islandSeq.fa

Traceback (most recent call last):
  File "/SONAR/utilities/getFastaFromList.py", line 92, in <module>
    main()
  File "/SONAR/utilities/getFastaFromList.py", line 67, in main
    for line in fileinput.input(arguments['-l']):
  File "/usr/lib/python3.6/fileinput.py", line 250, in __next__
    line = self._readline()
  File "/usr/lib/python3.6/fileinput.py", line 362, in _readline
    self._file = open(self._filename, self._mode)
FileNotFoundError: [Errno 2] No such file or directory: 'output/tables/islandSeqs.txt'

Could you provide the necessary commands to produce the islandSeqs.txt file?

Errors with vignette during module 3.2

I'm stepping through the example dataset both using runVignette.sh in Docker and running the commands myself as an exercise, and all goes well up to the longitudinal part. I'm having trouble when it reaches the igphyml step. It looks like there are a few different things happening but I'm not sure if they're related.

First in phylogeny/3.2-run_IgPhyML.py line 381, it checks arugments[--natives], finds it is not None, and continues, but for some reason that argument is the literal string "False":

loading sequence info from /SONAR/germDB/IgHV.fa...
loading sequence info from False...
Traceback (most recent call last):
  File "/SONAR/phylogeny/3.2-run_IgPhyML.py", line 380, in <module>
    natives = load_fastas( arguments['--natives'] )
  File "/SONAR/__init__.py", line 277, in load_fastas
    reader, result = SeqIO.parse(open(f, "r"), "fasta"), dict()
FileNotFoundError: [Errno 2] No such file or directory: 'False'

I got past that by replacing this:

if arguments['--natives'] is not None:

With this, assuming that it should be equivalent to None in this case:

if arguments['--natives'] is not None and arguments['--natives'] != "False":

It continues, but later on this happens:

stats:
Error running 'perl -I /SONAR/third-party /SONAR/third-party/ancReconstructHLP17.pl /host/home/cap256-longitudinal/work/phylo/ar.config':
Use of uninitialized value $statsfile in concatenation (.) or string at /SONAR/third-party/ancReconstructHLP17.pl line 131.
kappa needs to be specified at /SONAR/third-party/ancReconstructHLP17.pl line 269.
Exit code 255

Looks like that's inside main() around line 300 where it sets up that config file and passes it to the perl script:

            with open("%s/ar.config"%prj_tree.phylo, "w") as handle:
                    handle.write( "length\t%d\n" % (align_len/3) )
                    handle.write( "rooted\t1\noutdir\t%s\n" % prj_tree.phylo )
                    handle.write( "seqfile\t%s/infile\n" % prj_tree.phylo )
                    handle.write( "rootid\t%s\n" % germ_id )
                    handle.write( "igphyml\t%s/%s\n" % (SCRIPT_FOLDER, "third-party") )
                    handle.write( "stats\t%s/infile_igphyml_stats.txt_hlp17\n" % prj_tree.phylo )
                    handle.write( "tree\t%s/infile_igphyml_tree.txt_hlp17\n" % prj_tree.phylo )
                    handle.write( "ambigfile\t%s/ambigfile.txt\n" % prj_tree.phylo )
                    handle.write( "stem\t%s\n" % prj_name )

                    s = subprocess.Popen( ["perl", "-I", "%s/third-party"%SCRIPT_FOLDER, reconstruct, "%s/ar.config"%prj_tree.phylo], universal_newlines=True, stderr=subprocess.PIPE )
                    o,e = s.communicate()

It looks like kappa can come from the config or stats file, and I do have a infile_igphyml_stats.txt_hlp17 file with the line ". Transition/transversion ratio: 1.74581" that looks like it should do the trick but it must not be getting set as kappa in the Perl script for some reason.

Any ideas? This happens when I set up my own environment with dependencies and run the steps manually, or if I run from scratch with Docker using runVignette.sh. Maybe something has changed since version 4.0 and I should give that a try, too; I just realized the Docker setup goes with the latest master branch so I haven't tried using the latest stable release instead for any of this. Thanks!

No identity/divergence plot file with only one antibody selected

I noticed I wasn't getting the islandSeqs.png output for a particular id-div.tab file and tracked it down to the fact that there's only one antibody column in the input file. If I use a table with multiple antibody columns but only select one, the plot image also isn't saved. It looks like this is because plotting the final selections is skipped if there's only one antibody and that part includes the ggsave call to save the image. Is that behavior intentional? If I disable the check I get a png with just the one plot, which seems to work fine.

ID/DIV plots don't shown sequences for certain edge cases

Hi Chaim,

I've found some edge cases where no tile is drawn in the ID/DIV plots even though there's a sequence there.

If I use this example id-div.tab file:

sequence_id v_gene    germ_div mab 
1           IGHV1-AFS 0        80  
2           IGHV1-AFS 5        85  
3           IGHV1-AFS 10       90  
4           IGHV1-AFS 15       95  
5           IGHV1-AFS 20       100

I then get tiles drawn for sequences 2 and 4, but not 1, 3, or 5:

1 and 5 are apparently just getting removed by ggplot's X/Y scaling. The docs say:

Note that setting limits on positional scales will remove data outside of the limits. If the purpose is to zoom, use the limit argument in the coordinate system (see coord_cartesian()).

(I guess they define the limits as an open interval but just don't say so?) If I nudge the limits a little or switch to an equivalent coord_cartesian in plot_all, it then shows those two points:

Sequence 3 is due to a weirder issue. I think it's because the MASS::kde2d call the plot function uses splits the contribution of that one point perfectly between two bins in X and two bins in Y, so it smears it across four bins. Then the color scaling in scale_fill_gradientn misses it, since the maximum value around that region is just a tad too low to apparently count as one sequence's worth of stuff. If I tweak the .../2 on this line I can make the tile get drawn, but I'm not sure about the details of these variables:

b     <- (sum(g$z) / length(data$germ_div))/2

If I instead do .../5 it shows up:

..but that throws off the definition of what counts as one sequence, so I'm not sure what the right approach would be.

Since I'm not familiar with these kernel density estimations here's a quick demo of what I'm trying to describe, with one observation on a 10x10 grid positioned either "just wrong" (split evenly across four bins) or "just right" (centered on one bin):

g <- MASS::kde2d(0, 0, h = 1, n = 10, lims = c(-1, 1, -1, 1))
pheatmap::pheatmap(g$z, cluster_cols = F, cluster_rows = F)
g <- MASS::kde2d(1/9, 1/9, h = 1, n = 10, lims = c(-1, 1, -1, 1))
pheatmap::pheatmap(g$z, cluster_cols = F, cluster_rows = F)

In practical terms this is less of a problem than that clipping issue above, though, since the individual point is still there and can be selected in the interactive plot (it gets drawn when the plot adds the individual sequences for the selected island). In contrast I don't think I can access sequences 1 and 5 during island selection unless the limits of the plot area are expanded a little.

Cluster number for each sequence id

Dear SONAR team,
After running the "cluster_Sequences" command, where can I find the corresponding cluster number for each sequence id?
Thanks!

igphyml wrapper crashes when given custom V library

I get this exception when trying to use the --lib option to sonar igphyml:

Traceback (most recent call last):
  File "/SONAR/phylogeny/3.2-run_IgPhyML.py", line 358, in <module>
    if not os.path.isfile(arguments['-lib']):
KeyError: '-lib'

I think there's a missing second dash in a few places for the lib argument:

if arguments['--lib'] is not None:
    if not os.path.isfile(arguments['-lib']):
        sys.exit("Can't find germline file %s" % argument['-lib'])

It works if I replace that with:

if arguments['--lib'] is not None:
    if not os.path.isfile(arguments['--lib']):
        sys.exit("Can't find germline file %s" % argument['--lib'])

(This is in https://github.com/scharch/SONAR/blob/master/phylogeny/3.2-run_IgPhyML.py#L365)

Assigned germline V gene not necessarily used as root in 3.2-run_IgPhyML.py

The docstring for 3.2-run_IgPhyML.py says that -v is the "assigned germline V gene of known antibodes, for use in rooting the trees," but I'm running into some instance where it doesn't use this sequence ID for the root. I think this is because it figures out the sequence ID for the root of the tree based on a regular expression, and it can inadvertently pick up a different sequence depending on the full set of sequence IDs. The steps I see in 3.2-run_IgPhyML.py are:

germ_seq is defined via -v argument
germ_seq is written into the to-align file, along with the collected and native sequences
germ_id defined by regex-matching each sequence ID from the aligned file
germ_id is passed to igphyml as --root

In my case I have a "_LightSeq" suffix on each sequence in my natives.fa and re.search("(IG|VH|VK|VL|HV|KV|LV)", seq.id, re.I) matches the "ig" in each of those, overwriting the correct "IGLV..." sequence ID matched earlier in the file.

I can't override this by adding --root, either, since it's mutually-exclusive with -v. Would there be any downside to automatically setting arguments['--root'] = arguments['-v'] for the arguments['-v'] is not None case? Then that would get used as germ_id and passed to igphyml as the correct root.

SONAR with single cell sequencing??

Hello! I was wondering if you could separate each barcoded cells reads into a new fastq file, 1000 to 10,000 fastq files per 10x genomics experiment. Do you think SONAR would be able to process them and draw useful comparisons? What would be the major limitations of doing this? I am sure that using the framework regions of the antibodies would allow only barcoded cells that contained them to be analyzed. Also, barcoded cells with framework regions could be merged into a in-silico "bulk" RNA-seq data set. Has anyone done this? I expect that the "LIBRA-seq" 10X genomics is going to be proprietary software, so not as useful to the public research community.

Long processing time on 1.5_single_cell_statistics.py

Hi,

I was recently running the SONAR workflow on our HPC cluster, and I noticed that it was taking quite a long time to finish that step. When I went through the script, it appears that this particular script isn't threaded (or not obviously, please correct me if I'm wrong). Would it be possible to add threading to this script to help it utilize available processing power more efficiently?

Thank you!

cp sonar ~/bin. No such file

Excellent work. I was trying to run it locally in my Mac OS10.14.5

Following the installation instructions, when I ran
cp sonar ~/bin
It showed
cp: sonar: No such file or directory

Could you please help me? Thanks.

FYI: Biopython 1.74 breaks backwards compatibility in NcbiblastnCommandline

This is more just a heads up than a request/bug report. I found that the blastn steps fail when I have the newest Biopython installed, apparently because they changed how arguments to NcbiblastnCommandline are parsed. If I have version 1.74, I need to remove the extra single quotes wrapping the outfmt string in blastProcess to make it work. With 1.73 or earlier (looks like 1.69 in a SONAR docker container?) it works as-is, though.

biopython/biopython#2055
biopython/biopython#2071

Tools in third-party dir cannot execute

Excellent works! Thanks for sharing and updating.
The issue is that, when I tried the updated SONAR, the tools in third-party dir cannot execute, including muscle and vsearch. I haven't tried others till now. Maybe the problem is that, I am using MacOS but the tools are in binary code.
The Error looks like: 'Bio.Application.ApplicationError: Non-zero return code 126 from './sonar/third-party/muscle -in ****N3/work/lineage/align/align000001_temp.fa -out ****N3/work/lineage/align/align000001_temp.aln', message '/bin/sh: ./sonar/third-party/muscle: cannot execute binary file' and 'OSError: [Errno 8] Exec format error: './sonar/third-party/vsearch'

I can find muscle for MacOS from elsewhere and replace the file in third-party dir, but I can't find 'vsearch'. Would you please tell where is the third-party tool vsearch for MacOS, or add tools in the new version of sonar?

By the way, it cause problems after the dir name 'sonar' was changed to 'SONAR'. Every time using 'import sonar', it can not find the module called sonar. It can be fixed when I changed the dir name back.
Thanks for your great work!

id-div script now looks for project tree dir before it's defined

Small bug I think introduced in c8c19f2: If I now call sonar id-div with neither -g nor --species given, it looks for the gene_locus file, but does so before defining the project dir farther down the script.

Traceback (most recent call last):
  File "/SONAR/lineage/2.1-calculate_id-div.py", line 328, in <module>
    if os.path.exists(f"{prj_tree.internal}/gene_locus.txt"):
NameError: name 'prj_tree' is not defined

I think you can probably just shift that assignment around to fix it.

Warning message: In fun(libname, pkgname) : couldn't connect to display "localhost:10.0" while "sonar get_island"

Hi there! I am now following the vignette to run the example 'cap256-week34H'. However, it stops with the warning message above. You can see the whole message and the command below.

root@059beae7bb41:/cap256-week34H$ sonar get_island output/tables/cap256-week34H_goodVJ_unique_id-div.tab --mab CAP256-VRC26.01 --mab CAP256-VRC26.08
Loading required package: misc3d
Warning message:
In fun(libname, pkgname) : couldn't connect to display "localhost:10.0"
Error in .External2(C_X11, d$display, d$width, d$height, d$pointsize, :
unable to start device X11
Calls: getIsland -> X11
In addition: Warning message:
In X11(type = "Xlib", family = "misc") :
unable to open connection to X11 display ''

I tried some advice to fix the problem, but it did not work. I am running the example on a server through Xshell. The problem exists when SONAR is installed locally or by DOCKER. Hope you can help me, thank you!

error in blastn output format

I'm trying to run blast_V on a dataset and I'm getting this error. It looks like the outfmt isn't specified correctly.

Thu Feb 18 10:38:07 2021 -- SONAR v4.2-40-g9931941-dirty run with command:
/SONAR/annotate/1.1-blast_V.py --fasta heavies.fasta --locus H --derep --species rhesus -f

vsearch v2.8.1_linux_x86_64, 340.2GB RAM, 40 cores
https://github.com/torognes/vsearch

Reading file /datacommons/dhvi/mb488/SONAR/test/RM10N011/work/internal/tempForDerep.fa 100%
1261706 nt in 2078 seqs, min 303, max 1002, avg 607
Dereplicating 100%
Sorting 100%
2078 unique sequences, avg cluster 1.0, median 1, max 1
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
TOTAL: 2078 processed, 1061 good
Starting blast of /datacommons/dhvi/mb488/SONAR/test/RM10N011/work/annotate/vgene/RM10N011_001.fasta against /SONAR/germDB/IgHV_BU_DD.fasta...
Traceback (most recent call last):
File "/SONAR/annotate/init.py", line 35, in blastProcess
cline()
File "/hpc/home/mb488/.local/lib/python3.6/site-packages/Bio/Application/init.py", line 569, in call
raise ApplicationError(return_code, str(self), stdout_str, stderr_str)
Bio.Application.ApplicationError: Non-zero return code 1 from '/SONAR/third-party/blastn_linux64 -out /datacommons/dhvi/mb488/SONAR/test/RM10N011/work/annotate/vgene/RM10N011_001.txt -outfmt "'6 qseqid sseqid pident length mismatch gaps qstart qend sstart send evalue bitscore sstrand'" -query /datacommons/dhvi/mb488/SONAR/test/RM10N011/work/annotate/vgene/RM10N011_001.fasta -db /SONAR/germDB/IgHV_BU_DD.fasta -evalue 0.001 -word_size 7 -max_target_seqs 10 -gapopen 5 -gapextend 2 -penalty -1 -reward 1', message "BLAST query/options error: ''6' is not a valid output format"