Giter VIP home page Giter VIP logo

diegozea / mitos.jl Goto Github PK

View Code? Open in Web Editor NEW
74.0 2.0 18.0 57.73 MB

A Julia package to analyze protein sequences, structures, and evolutionary information

Home Page: https://diegozea.github.io/MIToS.jl/stable/

License: Other

Julia 98.63% Parrot 1.37%
bioinformatics protein-sequence-analysis protein-structure protein-evolution coevolution mutual-information conservation structural-bioinformatics pfam pdb

mitos.jl's Introduction

MIToS MIToS

🐉 MIToS: Mutual Information Tools for protein Sequence analysis

A Julia Package to Analyze Protein Sequences, Structures, and Evolutionary Information


DOCUMENTATION:

Linux, OSX & Windows: Status Code Coverage: Coverage Status codecov.io

NOTE: Some breaking changes were introduced between MIToS 2.15 and MIToS 3.0, inclusive. See the NEWS.md file to migrate code from an old version of MIToS. Most breaking changes will show a deprecation warning with a hint on how to perform the migration. If you need more help migrating code towards MIToS v3, you can write an email to diegozea at gmail dot com asking for assistance.

MIToS provides a comprehensive suite of tools for the analysis of protein sequences and structures. It allows working with Multiple Sequence Alignments (MSAs) to obtain evolutionary information in the Julia language [1]. In particular, it eases the analysis of coevoling position in an MSA using Mutual Information (MI), a measure of covariation. MI-derived scores are good predictors of inter-residue contacts in a protein structure and functional sites in proteins [2,3]. To allow such analysis, MIToS also implements several useful tools for working with protein structures, such as those available in the Protein Data Bank (PDB) or predicted by AlphaFold 2.

MIToS starting point was an improvement of the algorithm published by Buslje et al. [2]. A BLOSUM62-based pseudo-count strategy, similar to Altschul et al.[4], , was implemented to improve performance in the range of MSAs with a low number of sequences [1]. MIToS offers all the tools for using, developing, and testing MI-based scores—in fact, any measure based on reside frequencies in an MSA—in different modules.

Modules

MIToS tools are separated into different modules for different tasks.

  • MSA This module defines multiple functions and types for dealing with MSAs and their annotations. It also includes facilities for sequence clustering and shuffling, among others.
  • PDB This module defines types and methods to work with protein structures from different sources, such as PDB or AlphaFold DB. It includes functions to superpose structures, measure the distance between residues, and much more.
  • Information This module defines residue contingency tables and methods on them to estimate information measures. This allow to measure evolutionary information on MSAs positions. It includes functions to estimate corrected mutual information (ZMIp, ZBLMIp) between MSA columns, as well as conservation estimations using Shannon entropy and the Kullback-Leibler divergence.
  • SIFTS This module allows access to SIFTS residue-level mapping of UniProt, Pfam, and other databases with PDB entries.
  • Pfam This module uses the previous modules to work with Pfam MSAs. It also offers useful functions for parameter optimization using Pfam alignments.
  • Utils It exports common utils functions and types used in different modules of this package.

Installation

To install MIToS, you need to execute the following code in Julia:

using Pkg; Pkg.add("MIToS")

To update your installed version, you can execute:

using Pkg; Pkg.update("MIToS")`

Scripts

The MIToS_Scripts package offers a set of easy-to-use scripts to access some functionalities MIToS offers from the terminal. These scripts are designed for researchers familiar with command-line interfaces (CLI) but without experience coding in Julia. The available scripts include:

  • Buslje09.jl: Calculates corrected Mutual Information (MI/MIp) based on Buslje et al., 2009.
  • BLMI.jl: Computes corrected mutual information using BLOSUM62-based pseudo-counts, as described in the MIToS publication [1].
  • Conservation.jl: Calculates Shannon entropy and Kullback-Leibler divergence for each MSA column.
  • Distances.jl: Computes inter-residue distances in a PDB file.
  • PercentIdentity.jl: Calculates the percentage identity between all sequences in an MSA and provides statistical summaries.
  • MSADescription.jl: Provides statistics for a given Stockholm file, including clustering information and sequence coverage.

This list is not exhaustive; more scripts are available in the MIToS_Scripts.jl repository. Visit the repository for more details and to access these scripts.

Order versions

MIToS 3.0 requires Julia 1.9 or higher. It is recommended that you use these versions to get the best experience coding with Julia and MIToS. If you need to use MIToS in a Julia version lower than 1.0, you will need to look at the older MIToS v1 documentation.

Citation

If you use MIToS, please cite:

Diego J. Zea, Diego Anfossi, Morten Nielsen, Cristina Marino-Buslje; MIToS.jl: mutual information tools for protein sequence analysis in the Julia language, Bioinformatics, Volume 33, Issue 4, 15 February 2017, Pages 564–565, https://doi.org/10.1093/bioinformatics/btw646

References

  1. Zea, Diego Javier, et al. "MIToS. jl: mutual information tools for protein sequence analysis in the Julia language." Bioinformatics 33, no. 4 (2016): 564-565.
  2. Buslje, Cristina Marino, et al. "Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information." Bioinformatics 25.9 (2009): 1125-1131.
  3. Buslje, Cristina Marino, et al. "Networks of high mutual information define the structural proximity of catalytic sites: implications for catalytic residue identification." PLoS Comput Biol 6.11 (2010): e1000978.
  4. Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389-3402.

Acknowledgments

MIToS was initially developed at the Structural Bioinformatics Unit of the Fundación Instituto Leloir (FIL) in Argentina. Its development now continues at the Molecular Assemblies and Genome Integrity group of the Institute for Integrative Biology of the Cell (I2BC) in France.

We want to thank all contributors who have helped improve MIToS. We also thank the Julia community and all the MIToS users for their feedback and support.

FIL and I2BC FIL and I2BC

mitos.jl's People

Contributors

brnmdrt avatar cossio avatar cristinot avatar diegozea avatar elin- avatar ellisvalentiner avatar femtocleaner[bot] avatar github-actions[bot] avatar javieriserte avatar juliatagbot avatar kool7d avatar louiseb25 avatar timholy avatar tkelman avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

mitos.jl's Issues

New zlib error reading Stockholm files

I think this is a new error associated with a recent change regarding headers.

Input code:

downloadpfam("PF00062")
msa1 = MIToS.MSA.read("PF00062.stockholm.gz", Stockholm, generatemapping=true, useidcoordinates=true)

Error output:

ERROR: zlib error: incorrect header check (code: -3)
Stacktrace:
  [1] changemode!(stream::TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompressor, IOStream}, newmode::Symbol)
    @ TranscodingStreams C:\Users\kool7\.julia\packages\TranscodingStreams\MQucZ\src\stream.jl:742
  [2] callprocess(stream::TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompressor, IOStream}, inbuf::TranscodingStreams.Buffer, outbuf::TranscodingStreams.Buffer)
    @ TranscodingStreams C:\Users\kool7\.julia\packages\TranscodingStreams\MQucZ\src\stream.jl:668
  [3] fillbuffer(stream::TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompressor, IOStream}; eager::Bool)
    @ TranscodingStreams C:\Users\kool7\.julia\packages\TranscodingStreams\MQucZ\src\stream.jl:596
  [4] fillbuffer
    @ C:\Users\kool7\.julia\packages\TranscodingStreams\MQucZ\src\stream.jl:582 [inlined]
  [5] eof(stream::TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompressor, IOStream})
    @ TranscodingStreams C:\Users\kool7\.julia\packages\TranscodingStreams\MQucZ\src\stream.jl:201
  [6] eof(stream::TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompressor, IOStream})
    @ TranscodingStreams C:\Users\kool7\.julia\packages\TranscodingStreams\MQucZ\src\stream.jl:199
  [7] iterate (repeats 2 times)
    @ .\io.jl:1062 [inlined]
  [8] _pre_readstockholm(io::TranscodingStreams.TranscodingStream{CodecZlib.GzipDecompressor, IOStream})
    @ MIToS.MSA C:\Users\kool7\.julia\packages\MIToS\LlUD0\src\MSA\Stockholm.jl:55
  [9] #parse#25
    @ C:\Users\kool7\.julia\packages\MIToS\LlUD0\src\MSA\Stockholm.jl:89 [inlined]
 [10] #parse#29
    @ C:\Users\kool7\.julia\packages\MIToS\LlUD0\src\MSA\Stockholm.jl:131 [inlined]
 [11] _read(::String, ::String, ::Type{Stockholm}; kargs::Base.Pairs{Symbol, Bool, Tuple{Symbol, Symbol}, NamedTuple{(:generatemapping, :useidcoordinates), Tuple{Bool, Bool}}})
    @ MIToS.Utils C:\Users\kool7\.julia\packages\MIToS\LlUD0\src\Utils\Read.jl:100
 [12] read(::String, ::Type{Stockholm}; kargs::Base.Pairs{Symbol, Bool, Tuple{Symbol, Symbol}, NamedTuple{(:generatemapping, :useidcoordinates), Tuple{Bool, Bool}}})
    @ MIToS.Utils C:\Users\kool7\.julia\packages\MIToS\LlUD0\src\Utils\Read.jl:130

Increase test coverage to 95%

Actual test coverage is ~89%. That's great! However, would be great to increase that number to make this package more reliable and avoid future regressions ;)

It's easy to see what lines are missing tests using codecov and coveralls.

We need to find a way to test MIToS.Utils.Scripts and plot recipes. Also we need to test more Base.show and Base.print (example to test show).

ArgParse needed for MIToS' scripts

D:\>"C:\Users\Diego\AppData\Local\Julia-1.2.0\bin\julia.exe" C:\Users\Diego\.julia\packages\MIToS\Hha81\scripts\SplitStockholm.jl -h
ERROR: LoadError: ArgumentError: Package ArgParse not found in current path:
- Run `import Pkg; Pkg.add("ArgParse")` to install the ArgParse package.

404 error from function 'getpdbdescription'

I am getting a 404 error from function 'getpdbdescription'.

`
HTTP/1.1 404 Not Found
X-Powered-By: Express
Strict-Transport-Security: max-age=31536000000; includeSubDomains
Cache-Control: private, no-cache, max-age=0, must-revalidate
Content-Type: text/html; charset=utf-8
Content-Length: 39674
ETag: W/"9afa-t3cWyFWTs7aO0Iy60YWTWEN5EGo"
Vary: Accept-Encoding
Date: Thu, 31 Dec 2020 00:07:16 GMT
X-Backend-Server: 10.100.50.12

""")
in top-level scope at Scratch1.jl:302
in getpdbdescription at MIToS\IWuWP\src\PDB\PDBMLParser.jl:269
in #getpdbdescription#36 at MIToS\IWuWP\src\PDB\PDBMLParser.jl:269
in downloadpdbheader at MIToS\IWuWP\src\PDB\PDBMLParser.jl:244
in #downloadpdbheader#35 at MIToS\IWuWP\src\PDB\PDBMLParser.jl:248
in download_file at MIToS\IWuWP\src\Utils\Read.jl:28
in #download_file#3 at MIToS\IWuWP\src\Utils\Read.jl:28
in open at HTTP\IAI92\src\HTTP.jl:348
in #open#6 at HTTP\IAI92\src\HTTP.jl:348
in request##kw at HTTP\IAI92\src\HTTP.jl:314
in #request#4 at HTTP\IAI92\src\HTTP.jl:314
in at HTTP\IAI92\src\RedirectRequest.jl:21
in #request#1 at HTTP\IAI92\src\RedirectRequest.jl:24
in at HTTP\IAI92\src\BasicAuthRequest.jl:21
in #request#1 at HTTP\IAI92\src\BasicAuthRequest.jl:28
in request##kw at HTTP\IAI92\src\MessageRequest.jl:28
in #request#1 at HTTP\IAI92\src\MessageRequest.jl:51
in at HTTP\IAI92\src\RetryRequest.jl:31
in #request#1 at HTTP\IAI92\src\RetryRequest.jl:44
in 56#58##kw at base\error.jl:284
in at base\error.jl:288
in at HTTP\IAI92\src\ExceptionRequest.jl:19
in #request#1 at HTTP\IAI92\src\ExceptionRequest.jl:22`

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Make use of PkgBenchmark

We've started to port MIToS benchmarks to this repository (in the benchmark folder) to use PkgBenchmark.

Actually, old bechmarks for MIToS are in:

We should update those repositories to newer MIToS and Julia versions. Also, we need to include some of those MIToS' bechmarks into the benchmark suite of this package. That is going to allow us to use PkgBenchmark to test improvements or regressions on speed.

Benchmarks for other packages/languages must not be here. We only need to integrate small MIToS' benchmarks in this package. Using data in the test folder would be useful.

Using a limit distance for speed contact calculation between residues

Calculate CA distance between two residues A and B and contrast against the maximum distance between CA for a pair of residues (A, B) for a limit contact distance

maximum distances ~ len(A) + len(B) + limit + 2 * van der Waals radius + error

Using FAUJ880104 from AAindex for getting len(A) and len(B). len is not taking into account post-translational modifications.

Carbon van der Waals radius is 1.7

Error should be large enough for not missing real contacts, but low enough for speed the calculation avoiding the calculation of not useful distances.

SIFTS download error

Here is the code first:
sfile = downloadsifts("2VB1")

I am getting this error:

ERROR: ArgumentError: missing or unsupported scheme in URL (expected http(s) or ws(s)): ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/vb/2vb1.xml.gz

This happens with basically any of the stuff that requires downloading from ftp. Also I can't find their XML files, it's weird. Only JSON.

Some overloads violate function intent

Some of the overloads of Base methods violate the intended API of the function. For example, print(x, y, z...) should print all of x, y, and z to the terminal, yet we have methods that add a format argument that isn't displayed:

By comparison:

julia> print("Hello", Float32)
HelloFloat32

The way that Base intends you to customize output of print is with an IOContext wrapper around the io object.

Deal with None in SIFTS

Example:

shell> zgrep "None" /home/diego/SIFTS/3cx5.sifts.xml.gz | head -3
          <crossRefDb dbSource="NCBI" dbCoordSys="UniProt" dbAccessionId="10090" dbResNum="None" dbResName="None"/>
          <crossRefDb dbSource="InterPro" dbCoordSys="UniProt" dbAccessionId="IPR013783" dbResNum="None" dbResName="None" dbEvidence="G3DSA:2.60.40.10"/>
          <crossRefDb dbSource="InterPro" dbCoordSys="UniProt" dbAccessionId="IPR007110" dbResNum="None" dbResName="None" dbEvidence="PS50835"/>

This causes:

ArgumentError("invalid base 10 digit 'N' in \"None\"")

Distances.jl: Incremental RAM usage on file loop

Since all the structure is loaded and compared against itself, the scripts use a lot of RAM. But when is used with a list of files, some data is not garbage collected in each iteration (using gc() and empty!() doesn't work). The script is killed by the system because of this (after ~1000 iterations on a PC with 60 Gb of RAM). This is not a problem for small list or for using with -f.

MSA shown representation

It looks like how the MSAs are shown as changed with Julia 1.8:

julia> msa
AnnotatedMultipleSequenceAlignment with 2 annotations : Residue[V L … - -; P L … - -; … ; V L … S -; P I … - -]

Error reading structures generated with extract_from_pdb

Reading a pdb file created using extract_from_pdb (from T-Coffee) with MIToS throw errors, as the ATOM lines have only 66 characters. However, BioStructures' read function works fine. So, the following function uses BioStructures to save an intermediate file that MIToS can read.

"""
This function reads a PDB file created with extract_from_pdb and returns a MIToS' PDB object.
"""
function read_pdb(pdb_file::String)
    # Read the PDB file with BioStructures
    tmp = BioStructures.read(pdb_file, BioStructures.PDB)
    # Save the BioStructures' PDB object into a temporary file
    tmp_file = tempname()
    BioStructures.writepdb(tmp_file, tmp)
    # Read the temporary file with MIToS
    pdb = read(tmp_file, PDBFile)
    # Delete the temporary file
    rm(tmp_file)
    return pdb
end

From one of those PDB files:

REMARK Output of the program extract_from_pdb (Version 2.43)
REMARK Legal PDB format not Guaranteed
REMARK This format is not meant to be used in place of the PDB format
REMARK The header refers to the original entry
REMARK The sequence from the original file has been taken in the field: ATOM
REMARK extract_from_pdb, 2001, 2002, 2003, 2004, 2005 2006 (c) CNRS and Cedric Notredame

Extra whitespace when writing PDB files with specific coordinates

When reading the this PDB file, changing its coordinates and writing these modified versions, PDBs with coordinates 10.000 get an extra whitespace before the coordinates.

Grepping the output files:

grep "10\.000" *pdb
14_1svd_1M14_A.pdb:ATOM    386  CB  THR    49      37.515 -10.000  36.203  1.00  0.00           C
19_1svd_1M14_A.pdb:ATOM   2169  N   ASP   272       10.000  13.255  76.512  1.00  0.00           N
45_1svd_1M14_A.pdb:ATOM    780  CA  ARG   101      10.000  11.768  53.211  1.00  0.00           C
51_1svd_1M14_A.pdb:ATOM    659  OE1 GLN    85      30.041 -10.000  45.708  1.00  0.00           O
55_1svd_1M14_A.pdb:ATOM    853  CA  GLN   110       10.000  11.317  70.605  1.00  0.00           C
58_1svd_1M14_A.pdb:ATOM      6  CG  LEU     1      38.475 -10.000  63.420  1.00120.89           C
62_1svd_1M14_A.pdb:ATOM    909  CZ3 TRP   115      12.990   10.000  62.725  1.00  0.00           C
69_1svd_1M14_A.pdb:ATOM   1248  C   LYS   158      40.377  10.000  59.300  1.00  0.00           C
70_1svd_1M14_A.pdb:ATOM    748  CB  LEU    97      16.920  10.000  54.280  1.00  0.00           C
82_1svd_1M14_A.pdb:ATOM    819  CA  ASP   105       2.233   10.000  60.879  1.00  0.00           C

Files 19, 55, 62 and 82 are misaligned. I included these files on the public guist.

Note that the problem does not appear when writing files with coordinates 20.000, 30.000, etc...

> grep "20\.000" *pdb 
100_1svd_1M14_A.pdb:ATOM    287  N   LYS    37      18.454 -20.000  51.996  1.00  0.00           N
100_1svd_1M14_A.pdb:ATOM    700  N   MET    91      20.000  -4.533  61.817  1.00  0.00           N
10_1svd_1M14_A.pdb:ATOM    308  CB  PRO    39      20.000  -9.583  60.310  1.00  0.00           C
33_1svd_1M14_A.pdb:ATOM   1340  C   ALA   169      42.771  20.000  50.788  1.00  0.00           C
35_1svd_1M14_A.pdb:ATOM   1651  CA  THR   207      11.747  20.000  57.741  1.00  0.00           C
40_1svd_1M14_A.pdb:ATOM    245  O   PRO    31      29.420 -20.000  61.352  1.00  0.00           O
45_1svd_1M14_A.pdb:ATOM   1125  O   LEU   142      20.000   4.175  60.041  1.00  0.00           O
5_1svd_1M14_A.pdb:ATOM    694  C   LEU    90      20.000  -4.129  55.154  1.00  0.00           C
54_1svd_1M14_A.pdb:ATOM   1927  C   MET   243      20.000  26.687  68.258  1.00  0.00           C
55_1svd_1M14_A.pdb:ATOM   1698  CB  TYR   213      20.000  25.617  49.568  1.00  0.00           C
56_1svd_1M14_A.pdb:ATOM   1604  N   VAL   202      20.000  17.469  61.130  1.00  0.00           N
57_1svd_1M14_A.pdb:ATOM    281  CA  VAL    36      22.390 -20.000  61.070  1.00  0.00           C
59_1svd_1M14_A.pdb:ATOM   1094  O   ALA   138      20.000   9.640  56.193  1.00  0.00           O
63_1svd_1M14_A.pdb:ATOM   1680  C   LYS   211      16.675  20.000  50.749  1.00  0.00           C
81_1svd_1M14_A.pdb:ATOM   2080  O   GLU   261      25.065  20.000  73.731  1.00  0.00           O
8_1svd_1M14_A.pdb:ATOM    918  CA  VAL   117      20.000  10.717  69.046  1.00  0.00           C
94_1svd_1M14_A.pdb:ATOM   1932  CE  MET   243      20.000  32.105  66.505  1.00  0.00           C
96_1svd_1M14_A.pdb:ATOM   2129  CZ  PHE   266      20.000  20.517  68.683  1.00  0.00           C
98_1svd_1M14_A.pdb:ATOM    900  O   TRP   115      20.000  10.810  68.305  1.00  0.00           O

downloadpfam fails for full alignments

downloadpfam seems to fail when attempting to download full alignments:

MIToS.Pfam.downloadpfam("PF11591"; alignment="full")

ERROR: HTTP.Exceptions.StatusError(404, "GET", "/interpro/wwwapi/entry/pfam/PF11591/?annotation=alignment:full&download", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Server: gunicorn
Vary: Accept,Cookie,Origin
Content-Type: application/json
InterPro-Version: 99.0
Access-Control-Expose-Headers: Content-Length, InterPro-Version, InterPro-Version-Minor, Cached
Strict-Transport-Security: max-age=0
Date: Sun, 28 Apr 2024 10:53:34 GMT
Access-Control-Max-Age: 3600
InterPro-Version-Minor: 0
Access-Control-Allow-Origin: *
X-Content-Type-Options: nosniff
Connection: Keep-Alive
Allow: GET, HEAD
X-Frame-Options: DENY
Access-Control-Allow-Methods: GET, HEAD, OPTIONS, POST, PUT
Access-Control-Allow-Headers: x-requested-with, Content-Type, origin, authorization, accept, client-security-token
Referrer-Policy: same-origin
Content-Length: 35

""")

However, downloading seed alignments works fine:

MIToS.Pfam.downloadpfam("PF11591"; alignment="seed") # succeeds

Roadmap to MIToS v1.0

  • Scripts
    • Write documentation
    • Use -p in scripts
    • Move general parts to Utils module
    • Use STDIN & STDOUT (pipes)
    • Use real flags
  • MSA
    • Add keepinserts option to read/parse
    • Test mean PID
  • PDB
    • distance and contact taking vectors

Optional

  • cMI
  • Pfam Doc
    • pMI

PFAM is not always round-trippable?

If I load a Pfam file, perform some manipulations, and then write it back out to disk, I can't load what I write:

julia> using MIToS, MIToS.Pfam, MIToS.MSA

julia> pfamfile = "PF18883.alignment.full.gz"
"PF18883.alignment.full.gz"

julia> pfamfile = downloadpfam("PF18883"; filename=pfamfile)
"PF18883.alignment.full.gz"

julia> msa = read(pfamfile, Stockholm, generatemapping=true, useidcoordinates=true);

julia> filtersequences!(msa, coverage(msa) .>= 0.9);

julia> open("PF18883.alignment.fullcoverage", "w") do io
           print(io, msa, Stockholm)
       end

julia> msa2 =  read("PF18883.alignment.fullcoverage", Stockholm, generatemapping=true, useidcoordinates=true)
ERROR: The last residue number 685 in sequence 1 isn't the end coordinate 686
Stacktrace:
 [1] _to_msa_mapping(sequences::Vector{String}, ids::Vector{String})
   @ MIToS.MSA ~/.julia/dev/MIToS/src/MSA/GeneralParserMethods.jl:56
 [2] _generate_annotated_msa(annot::Annotations, IDS::Vector{String}, SEQS::Vector{String}, keepinserts::Bool, generatemapping::Bool, useidcoordinates::Bool, deletefullgaps::Bool)
   @ MIToS.MSA ~/.julia/dev/MIToS/src/MSA/GeneralParserMethods.jl:142
 [3] #parse#25
   @ ~/.julia/dev/MIToS/src/MSA/Stockholm.jl:93 [inlined]
 [4] #parse#29
   @ ~/.julia/dev/MIToS/src/MSA/Stockholm.jl:133 [inlined]
 [5] _read(::String, ::String, ::Type{Stockholm}; kargs::Base.Pairs{Symbol, Bool, Tuple{Symbol, Symbol}, NamedTuple{(:generatemapping, :useidcoordinates), Tuple{Bool, Bool}}})
   @ MIToS.Utils ~/.julia/dev/MIToS/src/Utils/Read.jl:105
 [6] read(::String, ::Type{Stockholm}; kargs::Base.Pairs{Symbol, Bool, Tuple{Symbol, Symbol}, NamedTuple{(:generatemapping, :useidcoordinates), Tuple{Bool, Bool}}})
   @ MIToS.Utils ~/.julia/dev/MIToS/src/Utils/Read.jl:135
 [7] top-level scope
   @ REPL[7]:1

Logos

This issue is to get the URL for the logos used through the README and docs.

IO performance

The updated mitos-benchmarks/MIToS benchmark shows small regression in read/write of MSA for Julia 1.0.2 and MIToS 2.3.1. Maybe, it's related to the use of length(::String): https://discourse.julialang.org/t/performance-of-length-string/12672
Also, @profile shows that most of the time is spent in cache inefficient access to the big MSA matrix: setindex! in _convert_to_matrix_residues.
Maybe https://github.com/JuliaArrays/TiledIteration.jl or using 8 bits instead of 32/64 bits for Residue can help to avoid cache related problems.

using MIToS.MSA, Profile, ProfileView
@profile read("../data/PF00089.sth", Stockholm, MultipleSequenceAlignment, deletefullgaps=false)

image

322 ./task.jl:259; (::getfield(Revise, Symbol("##58#60")){REPL.REPLBackend})()
 322 /home/elin/.julia/packages/Revise/gStbk/src/Revise.jl:771; run_backend(::REPL.REPLBackend)
  322 ...rker/package_linux64/build/usr/share/julia/stdlib/v1.0/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
   322 ./boot.jl:319; eval(::Module, ::Any)
    322 ...ckage_linux64/build/usr/share/julia/stdlib/v1.0/Profile/src/Profile.jl:25; top-level scope
     322 ./none:0; #read
      322 /home/elin/.julia/packages/MIToS/fEKlk/src/Utils/Read.jl:109; #read#10(::Base.Iterators.Pairs{Symbol,Bool,Tuple{Symbol},NamedTuple{(:deletefullgaps,),Tuple{Bool}}}, ::...
       322 ./none:0; (::getfield(MIToS.Utils, Symbol("#kw##_read")))(::NamedTuple{(:deletefullgaps,),Tuple{Bool}}, ::typeof(MIT...
        322 /home/elin/.julia/packages/MIToS/fEKlk/src/Utils/Read.jl:79; #_read#9(::Base.Iterators.Pairs{Symbol,Bool,Tuple{Symbol},NamedTuple{(:deletefullgaps,),Tuple{Bool}}}, ::...
         322 ./none:0; (::getfield(Base, Symbol("#kw##parse")))(::NamedTuple{(:deletefullgaps,),Tuple{Bool}}, ::typeof(parse), :...
          322 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:113; #parse#27
           322 ./none:0; #parse
            86  /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:101; #parse#26(::Bool, ::Function, ::IOStream, ::Type{Stockholm}, ::Type{NamedArrays.NamedArray{Residue,2,A...
             54 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:75; _pre_readstockholm_sequences(::IOStream)
              1  /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:4; _fill_with_sequence_line!
               1 ./strings/util.jl:25; startswith(::String, ::Char)
              48 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:5; _fill_with_sequence_line!
               48 /home/elin/.julia/packages/MIToS/fEKlk/src/Utils/GeneralUtils.jl:36; get_n_words(::String, ::Int64)
                22 ./strings/string.jl:250; getindex(::String, ::UnitRange{Int64})
                 22 ./strings/string.jl:60; _string_n
                  1 ./essentials.jl:355; cconvert
                   1 ./number.jl:7; convert
                    1 ./boot.jl:722; Type
                     1 ./boot.jl:692; toUInt64
                      1 ./boot.jl:581; check_top_bit
                       1 ./boot.jl:571; is_top_bit_set
                26 ./strings/string.jl:253; getindex(::String, ::UnitRange{Int64})
                 1  ./pointer.jl:118; unsafe_store!
                 1  ./range.jl:575; iterate
                  1 ./promotion.jl:425; ==
                 15 ./strings/string.jl:87; codeunit
                  15 ./strings/basic.jl:193; checkbounds
                   9 ./strings/basic.jl:185; checkbounds
                    8 ./int.jl:428; <=
              3  /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:7; _fill_with_sequence_line!
               3 /home/elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_set.jl:26; in
                3 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:367; haskey
                 2 ...elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:232; ht_keyindex(::OrderedCollections.OrderedDict{String,Nothing}, ::String, ::Bool)
                  2 ./array.jl:731; getindex
                 1 ...elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:234; ht_keyindex(::OrderedCollections.OrderedDict{String,Nothing}, ::String, ::Bool)
              1  /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:14; _fill_with_sequence_line!
               1 ./array.jl:856; push!
                1 ./array.jl:814; _growend!
              1  /home/elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_set.jl:28; _fill_with_sequence_line!
               1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:307; setindex!(::OrderedCollections.OrderedDict{String,Nothing}, ::Nothing, ::String)
                1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:276; _setindex!(::OrderedCollections.OrderedDict{String,Nothing}, ::Nothing, ::String, ::Int64)
             1  /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:76; _pre_readstockholm_sequences(::IOStream)
              1 ./strings/util.jl:53; startswith
             31 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:77; _pre_readstockholm_sequences(::IOStream)
              31 ./io.jl:881; iterate(::Base.EachLine{IOStream}, ::Nothing)
               31 ./none:0; #readline
                31 ./iostream.jl:433; #readline#296
            236 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Stockholm.jl:102; #parse#26(::Bool, ::Function, ::IOStream, ::Type{Stockholm}, ::Type{NamedArrays.NamedArray{Residue,2,A...
             234 ...e/elin/.julia/packages/MIToS/fEKlk/src/MSA/GeneralParserMethods.jl:132; _generate_named_array(::Array{String,1}, ::OrderedCollections.OrderedSet{String})
              7   /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Residues.jl:177; _convert_to_matrix_residues(::Array{String,1}, ::Tuple{Int64,Int64})
              23  /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Residues.jl:249; _convert_to_matrix_residues(::Array{String,1}, ::Tuple{Int64,Int64})
               23 ./boot.jl:409; Type
                23 ./boot.jl:396; Type
              3   /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Residues.jl:251; _convert_to_matrix_residues(::Array{String,1}, ::Tuple{Int64,Int64})
               3 ./iterators.jl:138; iterate
                3 ./iterators.jl:139; iterate
                 3 ./strings/string.jl:176; iterate
                  3 ./strings/string.jl:176; iterate
                   3 ./operators.jl:286; >
                    3 ./int.jl:49; <
              201 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Residues.jl:252; _convert_to_matrix_residues(::Array{String,1}, ::Tuple{Int64,Int64})
               173 ./array.jl:771; setindex!
                5 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Residues.jl:176; convert
                 5 ./char.jl:50; Type
                  5 ./char.jl:53; codepoint
                   5 ./char.jl:114; Type
                9 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/Residues.jl:178; convert
                 9 ./array.jl:731; getindex
               23  ./iterators.jl:139; iterate
                1  ./strings/string.jl:176; iterate
                9  ./strings/string.jl:178; iterate
                 9 ./int.jl:450; <<
                  9 ./int.jl:443; <<
                13 ./strings/string.jl:179; iterate
                 13 ./strings/string.jl:17; between
                  13 ./int.jl:429; <=
             2   ...e/elin/.julia/packages/MIToS/fEKlk/src/MSA/GeneralParserMethods.jl:133; _generate_named_array(::Array{String,1}, ::OrderedCollections.OrderedSet{String})
              2 /home/elin/.julia/packages/MIToS/fEKlk/src/MSA/GeneralParserMethods.jl:116; _ids_ordered_dict(::OrderedCollections.OrderedSet{String}, ::Int64)
               1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:301; setindex!(::OrderedCollections.OrderedDict{String,Int64}, ::Int64, ::String)
                1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:253; ht_keyindex2(::OrderedCollections.OrderedDict{String,Int64}, ::String)
                 1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/dict_support.jl:6; hashindex
                  1 ./hashing.jl:18; hash
                   1 ./hashing2.jl:179; hash
               1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:307; setindex!(::OrderedCollections.OrderedDict{String,Int64}, ::Int64, ::String)
                1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:290; _setindex!(::OrderedCollections.OrderedDict{String,Int64}, ::Int64, ::String, ::Int64)
                 1 ...elin/.julia/packages/OrderedCollections/Pr9Pa/src/ordered_dict.jl:183; rehash!(::OrderedCollections.OrderedDict{String,Int64}, ::Int64)
                  1 .../elin/.julia/packages/OrderedCollections/Pr9Pa/src/dict_support.jl:6; hashindex
                   1 ./hashing.jl:18; hash
                    1 ./hashing2.jl:179; hash

Write PDB files

A function for writing PDB can be useful for passing structural information to external programs; useful for contact determination, h-bonds prediction, visualization and others.

Goal: unify on BioStructures for PDB reading

There's a growing ecosystem for handling protein structure in Julia, including two visualization packages https://github.com/MurrellGroup/ProtPlot.jl and https://github.com/BioJulia/BioMakie.jl, neither of which supports the representation used by MIToS.

I started a conversation on the Julia slack (@diegozea was invited) about unifying around a single reader for PDB files, and the conversation has spilled out to public fora like BioJulia/BioStructures.jl#48. I propose that common reader be BioStructures, and that package has undergone several recent changes to make it more palatable as a common foundation. Regardless of what decision we end up making, I think it makes a lot of sense for Julia to have a common PDB reader so that everyone can leverage all the cool tools. In my opinion, MIToS is a bit "heavy" (it does so many things which I love!) for it to serve well as the common reader that everybody depends on. That would mean migrating MIToS to work with the format read by BioStructures. Alternatively, if the community ends up preferring the MIToS representation, one could split out the PDB component of MIToS into a smaller, standalone package.

Feel free to comment here or elsewhere about any obstacles, technical or otherwise, that would stand in the way of converging on a common core.

`count!` performance

count!, when GAPs aren't used in the ResidueContingencyTables, tests if each residue is a gap in order to avoid indexing outside the bonds. This is expensive and prevent us to use @simd. ResidueContingencyTables should have always 22x22 values, even if only 20 will be used.

Make hobohmI() faster

A lot of time of hobohmI looks to be spent on percentageidentity. Maybe making percentageidentity faster can help, but:

  • Check lengths in percentageidentity is not a problem, takes the same.
  • @inbounds doesn't help in percentageidentity.

Maybe using msa.msa' can help in some way for getting the sequences faster...

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.