
hrishidhondge / cromast


CroMaSt (Cross Mapper of domain Structural instances) is an automated iterative workflow to clarify domain definition by cross-mapping of domain structural instances between domain databases.

Home Page: https://workflowhub.eu/workflows/390

License: Other

Common Workflow Language 55.73% Python 44.27%
cath data-integration pfam protein-domains domain-databases

cromast's Introduction

CroMaSt: A workflow for domain family curation through cross-mapping of structural instances between protein domain databases

CroMaSt (Cross Mapper of domain Structural instances) is an automated iterative workflow that clarifies domain definitions by cross-mapping domain structural instances between domain databases. It classifies all structural instances of a given domain type into three categories: core, true and domain-like.

Requirements

  1. Conda or Miniconda
  2. Kpax

Download and install Conda (or Miniconda) and Kpax by following the instructions on their official sites.

Get it running

(Considering the requirements are already met)

  1. Clone the repository and change into its directory
git clone https://gitlab.inria.fr/capsid.public_codes/CroMaSt.git
cd CroMaSt
  2. Create the conda environment for the workflow
conda env create --file yml/environment.yml
conda activate CroMaSt
  3. Change the paths of the variables in the parameter file
sed -i 's/\/home\/hdhondge\/CroMaSt\//\/YOUR\/PATH\/TO_CroMaSt\//g' yml/CroMaSt_input.yml
  4. Create the directories to store files from PDB and SIFTS (if they do not already exist)
mkdir PDB_files SIFTS
  5. Download the source input data
cwl-runner Tools/download_data.cwl yml/download_data.yml
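
The path substitution in the parameter file can also be done without escaping every slash. The following is a minimal Python sketch of the same replacement as the sed command; the function names `repoint_paths` and `repoint_file` are illustrative and not part of CroMaSt, and the old prefix is taken from the sed command itself.

```python
from pathlib import Path

# Prefix hard-coded in the shipped parameter file (copied from the sed command).
OLD_PREFIX = "/home/hdhondge/CroMaSt/"


def repoint_paths(text: str, new_prefix: str, old_prefix: str = OLD_PREFIX) -> str:
    """Return text with every occurrence of old_prefix replaced by new_prefix."""
    if not new_prefix.endswith("/"):
        new_prefix += "/"  # keep the trailing slash so the remaining path stays valid
    return text.replace(old_prefix, new_prefix)


def repoint_file(param_file: str, new_prefix: str) -> None:
    """Rewrite param_file in place, like `sed -i`."""
    path = Path(param_file)
    path.write_text(repoint_paths(path.read_text(), new_prefix))
```

From the repository root, `repoint_file("yml/CroMaSt_input.yml", str(Path.cwd()))` would point the parameter file at the current checkout.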

Basic example

1. First, we will run the workflow for the RRM domain, with family identifiers RRM_1 and RRM in Pfam and CATH, respectively.

Run the workflow:

cwl-runner --parallel  --outdir=Results/  CroMaSt.cwl yml/CroMaSt_input.yml

2. Once the iteration is complete, check the new_param.yml file in the output directory (Results). If it lists any family identifier under either pfam or cath, run the next iteration with the following command, and repeat until the workflow explores no new families:

cwl-runner --parallel  --outdir=Results/  CroMaSt.cwl Results/new_param.yml
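
The check-and-rerun cycle described above could be automated roughly as follows. This is a sketch under assumptions, not part of CroMaSt: it assumes new_param.yml writes the identifiers as inline lists (the `pfam: [...]` layout used in this README), and the helper names, the `Results_iter<N>` directory naming and the `max_iterations` cap are all hypothetical. A separate output directory is used for each iteration so that earlier results are not overwritten.

```python
import ast
import re
import subprocess
from pathlib import Path


def pending_families(param_text: str) -> dict:
    """Collect the pfam/cath identifier lists found in a parameter file's text."""
    found = {"pfam": [], "cath": []}
    for line in param_text.splitlines():
        # Assumed layout: key followed by an inline list, e.g. pfam: ['PF00076']
        match = re.match(r"\s*(pfam|cath)\s*:\s*(\[.*\])\s*$", line)
        if match:
            found[match.group(1)] = list(ast.literal_eval(match.group(2)))
    return found


def run_until_converged(first_params: str, max_iterations: int = 10) -> None:
    """Re-run the workflow until new_param.yml lists no further families."""
    params = first_params
    for i in range(max_iterations):
        outdir = Path(f"Results_iter{i}")  # hypothetical naming scheme
        subprocess.run(
            ["cwl-runner", "--parallel", f"--outdir={outdir}/", "CroMaSt.cwl", params],
            check=True,  # stop immediately if an iteration fails
        )
        new_params = outdir / "new_param.yml"
        families = pending_families(new_params.read_text())
        if not families["pfam"] and not families["cath"]:
            break  # nothing new to explore
        params = str(new_params)
```

Calling `run_until_converged("yml/CroMaSt_input.yml")` from the repository root would start with the shipped parameter file.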

Extra: Start the workflow with multiple families from one or both databases

To start the workflow with multiple families from one or both databases, separate the family identifiers with commas.

pfam: ['PF00076', 'PF08777']
cath: ['3.30.70.330']
  • Pro Tip: When running the workflow multiple times, give a different path to the --outdir option each time, or at least move the results to another location after the first run.

Run the workflow for protein domain of your choice

1. You can run the workflow for the domain of your choice by changing the family identifiers in the yml/CroMaSt_input.yml file.

Replace the following values (for pfam and cath) with the family identifiers of your choice:

pfam: ['PF00076']
cath: ['3.30.70.330']
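
For scripted runs, the two lines can be rewritten programmatically. The sketch below assumes the parameter file contains exactly one `pfam:` line and one `cath:` line in the inline-list form shown above; `set_families` is a hypothetical helper, not a CroMaSt function.

```python
import re


def set_families(param_text: str, pfam: list, cath: list) -> str:
    """Replace the pfam/cath lines of a parameter file's text with new identifier lists."""

    def render(key: str, ids: list) -> str:
        quoted = ", ".join(f"'{i}'" for i in ids)
        return f"{key}: [{quoted}]"

    for key, ids in (("pfam", pfam), ("cath", cath)):
        new_line = render(key, ids)
        # A function replacement avoids any backslash interpretation in new_line.
        param_text, count = re.subn(
            rf"(?m)^{key}\s*:.*$", lambda _m: new_line, param_text
        )
        if count != 1:
            raise ValueError(f"expected exactly one '{key}:' line, found {count}")
    return param_text
```

Failing when a key is missing or duplicated is a deliberate choice here: silently leaving the file unchanged would start the workflow on the wrong families.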

Data files used in current version are as follows:

Files in Data directory can be downloaded as follows:

  1. File used from Pfam database: pdbmap.gz

  2. File used from CATH database: cath-domain-description-file.txt

  3. Obsolete entries from RCSB PDB: obsolete_PDB_entry_ids.txt

CATH Version - 4.3.0 (Ver_Date - 11-Sep-2019)
Pfam Version - 35.0 (Ver_Date - November-2021)

Reference

Poster - 
1. Hrishikesh Dhondge, Isaure Chauvot de Beauchêne, Marie-Dominique Devignes. CroMaSt: A workflow for domain family curation through cross-mapping of structural instances between protein domain databases. 21st European Conference on Computational Biology, Sep 2022, Sitges, Spain. ⟨hal-03789541⟩

Acknowledgements

This project has received funding from the Marie Skłodowska-Curie Innovative Training Network (MSCA-ITN) RNAct, supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 813239.


cromast's Issues

Residue-mapping error because of sequence change in UniProt

The sequence of PTBP1_HUMAN (P26599) has changed in UniProt. It has 4 RRM domains, 2 each from RRM_1 (PF00076) and RRM_5 (PF13893).

  • RRM_5:
    • (38, 152),
    • (309, 433) --> This structural instance (StI) is causing the error
  • RRM_1:
    • (186, 252),
    • (456, 521)

The above numbering matches UniProt version 238 of the sequence, but the sequence changed in version 239.
The corresponding SIFTS file has been updated to follow the new UniProt sequence, whereas Pfam (v35.0) has not.
The sequence change causes workflow errors during the residue-mapping step for P26599_A (309, 433): only the domain end residue is recognized.
According to the new sequence, the RRM domain starts at 335, and the corresponding PDB (2EVZ) starts from the 350th amino acid residue of the UniProt sequence.

The issue can be reproduced with the current version of the workflow (v1.0.0) by using the CWL tool Tools/resmapping_pfam_instances_subwf.cwl (the sub-workflow for residue-mapping of Pfam StIs).
Run this tool for the Pfam family RRM_5 (PF13893).
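
The failure mode can be illustrated with a toy residue map. This is only a sketch of the symptom, not the workflow's actual mapping code: it assumes a dictionary from UniProt residue numbers to PDB residue numbers, such as one might build from a SIFTS file. The start of coverage (350) and the Pfam boundaries (309, 433) come from this report; the end of coverage (460) and the identity numbering are made up for the example.

```python
def map_domain(boundaries, uniprot_to_pdb):
    """Look up both domain boundaries; a boundary missing from the map yields None."""
    start, end = boundaries
    return uniprot_to_pdb.get(start), uniprot_to_pdb.get(end)


def is_fully_mapped(boundaries, uniprot_to_pdb):
    """True only if both the start and the end boundary have a PDB counterpart."""
    return None not in map_domain(boundaries, uniprot_to_pdb)


# Toy map: the structure covers UniProt residues 350..460 (460 is hypothetical),
# with PDB numbering assumed identical to UniProt numbering.
toy_map = {pos: pos for pos in range(350, 461)}

# Pfam v35.0 boundaries still use the old UniProt numbering.
old_pfam_boundaries = (309, 433)

print(map_domain(old_pfam_boundaries, toy_map))
print(is_fully_mapped(old_pfam_boundaries, toy_map))
```

With these values the start boundary has no entry in the map while the end boundary does, which mirrors the "only the domain end residue is recognized" symptom. A pre-check like `is_fully_mapped` would let such instances be reported instead of failing mid-run.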
