chanjo2's Introduction

chanjo2

Chanjo2 is a coverage analysis tool for human clinical sequencing data that uses the d4 (Dense Depth Data Dump) format. It is implemented in Python with FastAPI and provides API endpoints that communicate with the d4tools software to return coverage and coverage completeness over genomic intervals (genes, transcripts, exons and custom intervals), either for single d4 files or for samples stored in the database with associated d4 files.

Run a software demo containing test data

A demo REST server connected with a temporary SQLite database can be launched using Docker:

docker run -d --rm -p 8000:8000 --expose 8000 clinicalgenomics/chanjo2:latest

The app's endpoints will now be reachable, and documented, from any web browser at http://0.0.0.0:8000/docs or http://localhost:8000/docs

From a terminal, you can use the API to access the data contained in the demo database of this Chanjo2 instance.

Examples of endpoint usage:

Return available cases (cases are collections of related samples):

curl -X 'GET' \
  'http://localhost:8000/cases/' \
  -H 'accept: application/json'

This will return a JSON response describing the demo case and its associated sample:

[
  {
    "display_name": "643594",
    "name": "internal_id",
    "id": 1,
    "samples": [
      {
        "coverage_file_path": "/home/worker/app/src/chanjo2/demo/panelapp_109_example.d4",
        "display_name": "NA12882",
        "track_name": "ADM1059A2",
        "name": "ADM1059A2",
        "case_id": 1,
        "created_at": "2023-06-01T08:05:12",
        "id": 1
      }
    ]
  }
]
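
The same call can be made from Python. The sketch below uses only the standard library and assumes the demo server is listening on localhost:8000; the helper names `fetch_cases` and `samples_by_case` are illustrative and not part of chanjo2, and the field names are taken from the demo response above.

```python
import json
from urllib.request import Request, urlopen


def fetch_cases(base_url="http://localhost:8000"):
    """GET /cases/ from a running chanjo2 instance and decode the JSON body."""
    request = Request(f"{base_url}/cases/", headers={"accept": "application/json"})
    with urlopen(request) as response:
        return json.load(response)


def samples_by_case(cases):
    """Map each case name to the names of the samples it contains."""
    return {
        case["name"]: [sample["name"] for sample in case["samples"]]
        for case in cases
    }
```

With the demo running, `samples_by_case(fetch_cases())` gives a compact view of which samples belong to which case.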

The demo sample contains a d4 coverage file with a limited number of genes in genome build GRCh37: those present in PanelApp gene panel 109 (Cerebral folate deficiency).

Loading genes to the database

To perform coverage queries over genes, transcripts and exons, these genomic intervals must first be saved into the database. Gene, transcript and exon definitions are collected from Ensembl using a BioMart service.

To load genes in genome build GRCh37 into the database, send the following POST request to the server:

curl -X 'POST' \
  'http://localhost:8000/intervals/load/genes/GRCh37' \
  -H 'accept: application/json' \
  -d ''

The response will return the number of genes inserted in the database:

{
  "detail": "57849 genes loaded into the database"
}
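
As a rough Python equivalent of the curl call above, the request can be prepared with the standard library. This is a sketch: `build_load_genes_request` is a hypothetical helper, and it assumes the GRCh38 path follows the same pattern as the GRCh37 one shown here.

```python
from urllib.request import Request

SUPPORTED_BUILDS = ("GRCh37", "GRCh38")


def build_load_genes_request(build, base_url="http://localhost:8000"):
    """Prepare (but do not send) the POST request that loads genes for a build."""
    if build not in SUPPORTED_BUILDS:
        raise ValueError(f"Unsupported genome build: {build!r}")
    return Request(
        f"{base_url}/intervals/load/genes/{build}",
        data=b"",  # empty body, like -d '' in the curl example
        headers={"accept": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request to a running instance.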

Return coverage data over genes of database sample

Sequencing coverage and coverage completeness statistics can be returned for genes, transcripts and exons by providing a list of genes. The provided gene list accepts genes in the following formats:

  • Ensembl gene IDs

  • HGNC IDs

  • HGNC symbols

The user should also specify a valid genome build: either GRCh37 or GRCh38.

For instance, to retrieve coverage statistics for the demo sample (mean gene coverage and coverage completeness at sequencing depths of, for instance, 30, 20 and 10) over the genes of the Cerebral folate deficiency PanelApp panel (DHFR, FOLR1, MTHFR and SLC46A1), send the following POST request:

curl -X 'POST' \
  'http://localhost:8000/coverage/samples/genes_coverage' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "build": "GRCh37",
  "completeness_thresholds": [
    30, 20, 10
  ],
  "hgnc_gene_symbols": [
    "FOLR1", "DHFR", "MTHFR", "SLC46A1"
  ],
  "case": "internal_id"
}'
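
When the request is built from a script, assembling the body in code helps catch malformed values early. The sketch below is not part of chanjo2: `genes_coverage_payload` is a hypothetical helper, the key names are copied from the request above, and treating thresholds as positive whole numbers is an assumption.

```python
import json


def genes_coverage_payload(case, hgnc_gene_symbols, completeness_thresholds, build="GRCh37"):
    """Build the JSON body for the genes_coverage request shown above."""
    if build not in ("GRCh37", "GRCh38"):
        raise ValueError(f"Unsupported genome build: {build!r}")
    for threshold in completeness_thresholds:
        # Assumption: thresholds are sequencing depths given as positive integers.
        if not isinstance(threshold, int) or threshold <= 0:
            raise ValueError(f"Invalid completeness threshold: {threshold!r}")
    return {
        "build": build,
        "completeness_thresholds": list(completeness_thresholds),
        "hgnc_gene_symbols": list(hgnc_gene_symbols),
        "case": case,
    }


body = json.dumps(
    genes_coverage_payload("internal_id", ["FOLR1", "DHFR", "MTHFR", "SLC46A1"], [30, 20, 10])
)
```

The resulting `body` string can be sent as the POST data with a `Content-Type: application/json` header.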

That will return this response, containing the requested statistics over the list of 4 genes:

{
  "ADM1059A2": [
    {
      "mean_coverage": 22.76,
      "completeness": {
        "10": 0.89,
        "20": 0.74,
        "30": 0.21
      },
      "interval_id": null,
      "interval_type": "genes",
      "inner_intervals": [],
      "hgnc_id": 2861,
      "hgnc_symbol": "DHFR",
      "ensembl_gene_id": "ENSG00000228716"
    },
    {
      "mean_coverage": 22.48,
      "completeness": {
        "10": 1,
        "20": 0.77,
        "30": 0.03
      },
      "interval_id": null,
      "interval_type": "genes",
      "inner_intervals": [],
      "hgnc_id": 3791,
      "hgnc_symbol": "FOLR1",
      "ensembl_gene_id": "ENSG00000110195"
    },
    {
      "mean_coverage": 22.07,
      "completeness": {
        "10": 1,
        "20": 0.69,
        "30": 0.07
      },
      "interval_id": null,
      "interval_type": "genes",
      "inner_intervals": [],
      "hgnc_id": 7436,
      "hgnc_symbol": "MTHFR",
      "ensembl_gene_id": "ENSG00000177000"
    },
    {
      "mean_coverage": 22.2,
      "completeness": {
        "10": 0.99,
        "20": 0.7,
        "30": 0.09
      },
      "interval_id": null,
      "interval_type": "genes",
      "inner_intervals": [],
      "hgnc_id": 30521,
      "hgnc_symbol": "SLC46A1",
      "ensembl_gene_id": "ENSG00000076351"
    }
  ]
}
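
A response of this shape is easy to post-process. As an illustration, the sketch below lists, per sample, the genes whose completeness at a given depth falls below a chosen cutoff; `genes_below_completeness` is a hypothetical helper, and the key names are those of the response above.

```python
def genes_below_completeness(response, threshold, min_completeness):
    """Return {sample: [HGNC symbols]} for genes under the completeness cutoff."""
    flagged = {}
    for sample, intervals in response.items():
        flagged[sample] = [
            interval["hgnc_symbol"]
            for interval in intervals
            # Completeness keys are strings ("10", "20", "30") in the JSON response.
            if interval["completeness"][str(threshold)] < min_completeness
        ]
    return flagged
```

Applied to the response above with `threshold=20` and `min_completeness=0.75`, this flags DHFR, MTHFR and SLC46A1 for sample ADM1059A2.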

For more information on how to set up a REST server running chanjo2, please visit the software's documentation pages. There you will also find instructions on how to populate the database with custom cases and different genomic intervals.

Coverage report and genes coverage overview

Chanjo2 can be used directly to create the same types of report produced by chanjo-report in conjunction with chanjo.

Given a running demo instance of chanjo2, with genes and transcripts from genome build GRCh37 loaded in the database, an example coverage report based on the PanelApp gene panel described above is provided by the demo report endpoint: http://0.0.0.0:8000/report/demo

Similarly, an example report containing an overview of the genes with partial coverage at the given coverage thresholds is provided by the demo overview endpoint.

Follow the instructions in the documentation pages to learn how to use the report and overview endpoints to create customised gene coverage reports with this software.

chanjo2's People

Contributors

dependabot[bot], northwestwitch, ramprasadn, vince-janv

chanjo2's Issues

Track code coverage using codecov

To do:

  • Add codecov report part in GitHub tests action
  • Register the report in codecov
  • Send the first coverage report from main branch to codecov

Demo app is launched twice because of gunicorn settings

After introducing the logs, I've realised that the demo app is launched twice: it creates the tables twice and prints the "Running a demo instance of Chanjo2" message twice. This is not unexpected, but it is annoying. It's caused by gunicorn running with 2 workers.

We could solve this in 2 ways:

  • Changing the gunicorn config to use 1 worker
  • Launching the demo app with the uvicorn command instead: uvicorn src.chanjo2.main:app, which is also easier.

I would opt for the second option. We just need to change a line in the README file. I'll fix!
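
For reference, the first option would look roughly like the config below. `bind`, `workers` and `worker_class` are standard gunicorn settings, but the values here are assumptions for a demo setup and not a copy of the repository's gunicorn.conf.py.

```python
# gunicorn.conf.py (sketch with assumed values)

# Address and port the server listens on.
bind = "0.0.0.0:8000"

# A single worker means the app module is imported, and the demo set up, only once.
workers = 1

# Worker class that lets gunicorn serve an ASGI app such as FastAPI through uvicorn.
worker_class = "uvicorn.workers.UvicornWorker"
```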

Endpoint to load gene intervals

Create an endpoint that loads the intervals for each single gene (the entire gene). There will be a similar thing for transcripts and exons as well.

It should accept a genome build.

It should download and parse the genes file and create one entry per gene in the intervals table with the following columns:

  • id
  • chromosome
  • start
  • stop

Additionally it should create and link tags for the gene above so it's searchable using the following parameters:

  • build (required)
  • one of the following items: [Ensembl id, HGNC symbol, HGNC ID] (optional)

Establish MVP for chanjo-report

  • Determine minimum viable product which replaces current functionality of chanjo-report

Discussion might be needed with Scout team and potentially some Scout customers.
Are all important metrics present in current report?
Do we need an additional microservice for chanjo-report or should chanjo2 have a report functionality?

VARCHAR requires a length on dialect mysql

Testing the app with a real MySQL database. I get the following error:

sqlalchemy.exc.CompileError: (in table 'samples', column 'coverage_file_path'): VARCHAR requires a length on dialect mysql

The demo is working because it's using another SQL dialect (sqlite)

Some warnings from the docker-image-push GitHub action

Warning: The save-state command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
Docker info
Buildx version
Warning: The set-output command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/

Add instructions on how to use a dockerized chanjo2 on the fly to get coverage data from files

For instance, given a d4 file and a BED file with the intervals.

Chanjo2 can also be used in projects not related to Scout at all, as a container that can be run to gather quick coverage data during a pipeline run, for instance.

Another useful application for non-Scout users: run it as a container and provide a d4 file and a list of genomic coordinates (this doesn't require a populated database at all).

Updating genes can't be done unless transcripts are removed first

This happens in a database pre-populated with data. If the database already contains genes AND transcripts, you get an error: the transcripts table references the genes table, so the genes can't be removed.

What to do:

When the command to update genes is launched, any existing transcript and exon data should be dropped as well, which makes sense!
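
The ordering problem can be reproduced in isolation. The sketch below uses an in-memory SQLite database with a simplified, hypothetical schema (not the actual chanjo2 models) to show that deleting child tables first lets the genes table be emptied without violating foreign keys.

```python
import sqlite3


def create_demo_schema(conn):
    """Create three hypothetical tables linked by foreign keys."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled
    conn.executescript(
        """
        CREATE TABLE genes (id INTEGER PRIMARY KEY, hgnc_symbol TEXT);
        CREATE TABLE transcripts (
            id INTEGER PRIMARY KEY,
            gene_id INTEGER NOT NULL REFERENCES genes(id)
        );
        CREATE TABLE exons (
            id INTEGER PRIMARY KEY,
            transcript_id INTEGER NOT NULL REFERENCES transcripts(id)
        );
        """
    )


def drop_intervals_before_gene_update(conn):
    """Delete child rows first so the final delete on genes cannot violate a foreign key."""
    with conn:  # commits on success, rolls back on error
        for table in ("exons", "transcripts", "genes"):
            conn.execute(f"DELETE FROM {table}")
```

Running `DELETE FROM genes` directly on a populated database raises `sqlite3.IntegrityError`, while `drop_intervals_before_gene_update` succeeds because exons and transcripts are removed before genes.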

Can't launch the app via docker-compose or direct gunicorn command

Hi @Vince-janv and @ramprasadn. I'd like to revive this project and I'm trying to launch the application.

Using docker-compose (docker-compose up) doesn't work. I get the following error:

(chanjo2) chiararasi@ChiaraRMBP:~/Documents/work/GITs/chanjo2$ docker-compose up
WARNING: The D4DB_NAME variable is not set. Defaulting to a blank string.
WARNING: The D4DB_USER_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The MYSQL_HOST_PORT variable is not set. Defaulting to a blank string.
WARNING: The MYSQL_CONTAINER_PORT variable is not set. Defaulting to a blank string.
WARNING: The DATA_VOLUME variable is not set. Defaulting to a blank string.
WARNING: The D4DB_USER_NAME variable is not set. Defaulting to a blank string.
WARNING: The CHANJO_HOST_PORT variable is not set. Defaulting to a blank string.
WARNING: The CHANJO_CONTAINER_PORT variable is not set. Defaulting to a blank string.
ERROR: The Compose file './docker-compose.yml' is invalid because:
services.chanjo2-demo.ports contains an invalid type, it should be a number, or an object
services.d4database.ports contains an invalid type, it should be a number, or an object
(chanjo2) chiararasi@ChiaraRMBP:~/Documents/work/GITs/chanjo2$

Same when I invoke the gunicorn command: gunicorn --config gunicorn.conf.py src.chanjo2.main:app (I guess?):

  File "/Users/chiararasi/Documents/work/GITs/chanjo2/src/chanjo2/main.py", line 9, in <module>
    from chanjo2.dependencies import engine, get_session
  File "/Users/chiararasi/Documents/work/GITs/chanjo2/src/chanjo2/dependencies.py", line 11, in <module>
    engine = create_engine(mysql_url, echo=True)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlmodel/engine/create.py", line 139, in create_engine
    return _create_engine(url, **current_kwargs)
  File "<string>", line 2, in create_engine
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/util/deprecations.py", line 309, in warned
    return fn(*args, **kwargs)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/engine/create.py", line 530, in create_engine
    u = _url.make_url(url)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/engine/url.py", line 731, in make_url
    return _parse_rfc1738_args(name_or_url)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/engine/url.py", line 787, in _parse_rfc1738_args
    components["port"] = int(components["port"])
ValueError: invalid literal for int() with base 10: 'None'
[2023-01-11 10:34:06 +0100] [49258] [INFO] Worker exiting (pid: 49258)
[2023-01-11 10:34:06 +0100] [49255] [WARNING] Worker with pid 49258 was terminated due to signal 15
[2023-01-11 10:34:06 +0100] [49255] [INFO] Shutting down: Master
[2023-01-11 10:34:06 +0100] [49255] [INFO] Reason: Worker failed to boot.

I see where the error is in both cases. If you don't mind I'd like to write a basic fix and some instructions to run the app in README?
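
A basic fix could fail early with a readable message instead of letting the string 'None' reach SQLAlchemy's URL parser. The sketch below is only an illustration: the variable names come from the docker-compose warnings above, while the helper `build_mysql_url`, the `host` argument and the `mysql+pymysql` URL scheme are assumptions.

```python
import os

REQUIRED_VARS = ("D4DB_USER_NAME", "D4DB_USER_PASSWORD", "D4DB_NAME", "MYSQL_HOST_PORT")


def build_mysql_url(env=None, host="localhost"):
    """Build a database URL, raising a clear error when settings are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    port = int(env["MYSQL_HOST_PORT"])  # still fails loudly if the port is not numeric
    user, password, db_name = env["D4DB_USER_NAME"], env["D4DB_USER_PASSWORD"], env["D4DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{db_name}"
```

With an empty environment this raises a `RuntimeError` naming every unset variable, which is easier to act on than the `int()` traceback above.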

Enable d4 on hasta

  • Set up d4 on hasta using singularity.
  • Document installation and maintenance instructions

Create a skeleton

Description

Create a skeleton of the app that describes how the structure should be.

This can be closed when

  • There is a skeleton with a couple of relevant endpoints

Deprecated code in dbutil.py

It will raise an error when switching to SQLAlchemy 2

/Users/chiararasi/Documents/work/GITs/chanjo2/src/chanjo2/dbutil.py:32: MovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings. Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
Base = declarative_base()

Generate d4 test data

Test files in *.d4 format generated

  • MIP-DNA TGS
  • MIP-DNA WES
  • MIP-DNA WGS
  • BALSAMIC TGS
  • BALSAMIC WES
  • BALSAMIC WGS

Add option to display only MANE transcripts in coverage module

(I don't know if this issue should be reported in the chanjo repository instead, feel free to redirect)

In the Chanjo Report in Scout, it would be nice to have an option (perhaps a tick box) to narrow down the information about fully/incompletely covered transcripts so only information on MANE Select (and perhaps also Plus Clinical) transcripts is shown when requested.

Validate region files on post

Description

When region definition files get added to the database they should be validated.
It is not yet decided what format these files should be in. BED is probably the way to go; CSV could be an alternative.

This can be closed when

  • The app validates that the file exists
  • The app validates that the file is in the correct format
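
Assuming BED is chosen, the two checks could be sketched as follows. `validate_bed_file` is a hypothetical helper; it only inspects the three mandatory BED columns (chromosome, start, stop) and assumes tab-separated fields.

```python
from pathlib import Path


def validate_bed_file(path):
    """Return a list of problems found; an empty list means the file looks valid."""
    bed = Path(path)
    if not bed.is_file():
        return [f"{path}: file does not exist"]
    problems = []
    with bed.open() as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            # Skip blank lines, comments and optional BED header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split("\t")
            if len(fields) < 3:
                problems.append(f"line {line_no}: expected at least 3 tab-separated columns")
                continue
            start, stop = fields[1], fields[2]
            if not (start.isdecimal() and stop.isdecimal()):
                problems.append(f"line {line_no}: start and stop must be non-negative integers")
            elif int(start) >= int(stop):
                problems.append(f"line {line_no}: start must be smaller than stop")
    return problems
```

Returning a list of problems rather than raising on the first one lets an endpoint report every issue in a single response.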

Logs printed twice

I have the feeling that the metadata object is called/imported from the wrong place. The bug is present in the current main branch.

Create database models

  • Create a sketch of database models and relationships
  • Create mock objects of database entries according to the model for development and testing
