clinical-genomics / chanjo2

Persistent coverage analysis tool using the d4 format

Home Page: https://clinical-genomics.github.io/chanjo2/

Python 91.71% Dockerfile 0.40% CSS 0.18% HTML 7.72%

chanjo2's Issues

Track code coverage using codecov

To do:

Add codecov report part in GitHub tests action
Register the report in codecov
Send the first coverage report from main branch to codecov

Move code running the queries to a separate API

It shouldn't be inside the endpoints methods

Enable d4 on hasta

Set up d4 on hasta using singularity.
Document installation and maintenance instructions

Description

When region definition files are added to the database, they should be validated.
The format of these files has not been decided yet. BED is probably the way to go; CSV could be an alternative.

This can be closed when

The app validates that the file exists
The app validates that the file is in the correct format
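The two checks above could be sketched roughly as follows, assuming BED ends up being the chosen format. The function name `validate_bed_file` is hypothetical and not part of chanjo2. Rejecting zero-length regions is also an assumption made here, since they carry no coverage.

```python
from pathlib import Path
from typing import List, Tuple


def validate_bed_file(path: str) -> List[Tuple[str, int, int]]:
    """Return the parsed intervals, or raise if the file is missing or malformed."""
    bed_path = Path(path)
    if not bed_path.is_file():
        raise FileNotFoundError(f"Region file not found: {path}")

    intervals = []
    with bed_path.open() as handle:
        for line_nr, line in enumerate(handle, start=1):
            # Skip blank lines and the comment/header lines a BED file may carry
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            if len(fields) < 3:
                raise ValueError(f"Line {line_nr}: expected at least 3 fields")
            chrom, start, end = fields[0], fields[1], fields[2]
            if not (start.isdigit() and end.isdigit()):
                raise ValueError(f"Line {line_nr}: start and end must be non-negative integers")
            # Assumption: zero-length or inverted regions are treated as errors
            if int(start) >= int(end):
                raise ValueError(f"Line {line_nr}: start must be smaller than end")
            intervals.append((chrom, int(start), int(end)))
    return intervals
```

Returning the parsed intervals means the caller does not have to read the file a second time once it has passed validation.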

Create the demo database in a temp directory, so it is deleted when the app is closed

Or something similar: the db file is currently persistent and stays on disk after you stop the app
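One possible approach, sketched with the standard library only: keep the SQLite file inside a `tempfile.TemporaryDirectory`, which is removed when the context exits. The helper name `temporary_demo_db` and the file name `demo.db` are assumptions for illustration, not the app's actual code.

```python
import sqlite3
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_demo_db(filename: str = "demo.db"):
    """Yield a connection to an SQLite file that lives in a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        db_path = Path(tmp_dir) / filename
        connection = sqlite3.connect(db_path)
        try:
            yield connection, db_path
        finally:
            # Close before the directory is removed, otherwise cleanup can fail on Windows
            connection.close()
```

In a web app, the same idea would probably be tied to the application's startup and shutdown hooks rather than to a `with` block.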

Create docs via mkdocs or some other package

It would be nice to have the basics in the README file and more detailed documentation somewhere else

Updating genes can't be done unless transcripts are removed first

This happens in a database pre-populated with data. If the database already contains genes AND transcripts, you get an error: the transcripts table references the genes table, so the genes can't be removed.

What to do:

When the command to update genes is launched, any existing transcripts and exons data should be dropped as well, which makes sense!
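A minimal sketch of that ordering, using plain `sqlite3` rather than the app's ORM. The table names and the assumption that exons reference transcripts are illustrative only. The point is that child tables are emptied before the table they reference.

```python
import sqlite3

# Child tables first: exons and transcripts are assumed to reference genes
TABLES_IN_DELETE_ORDER = ("exons", "transcripts", "genes")


def clear_gene_data(connection: sqlite3.Connection) -> None:
    """Remove exons, transcripts and genes in a single transaction."""
    with connection:  # commits on success, rolls back if any statement fails
        for table in TABLES_IN_DELETE_ORDER:
            connection.execute(f"DELETE FROM {table}")
```

Running the three deletes in one transaction avoids leaving the database with transcripts removed but stale genes still present if something fails halfway.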

Create endpoint for fetching coverage data from a genomic range

Establish what input arguments are to be provided to the API and how they should be formatted (it should be easy to parse/embed a Schug JSON response)
Establish how the response model should be formatted (what response do scout/chanjoreport expect?)

d4 coverage output format doesn't include all fields of a provided regions BED file

For instance, using the demo regions file from this repo:

When the library is used to compute the coverage, the gene names are lost:

While this is not a big deal when calculating the coverage over a list of genes, it perhaps requires special care over a list of transcripts or exons.
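One way to work around the lost columns, sketched under assumptions: read the names from the BED file separately and re-attach them to the coverage rows by interval coordinates. The row layout `(chrom, start, end, mean_coverage)` is hypothetical and does not describe the actual d4 output.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Interval = Tuple[str, int, int]
CoverageRow = Tuple[str, int, int, float]


def read_interval_names(bed_lines: Iterable[str]) -> Dict[Interval, str]:
    """Map (chrom, start, end) to the 4th BED column, when present."""
    names: Dict[Interval, str] = {}
    for line in bed_lines:
        if line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) >= 4:
            names[(fields[0], int(fields[1]), int(fields[2]))] = fields[3]
    return names


def annotate_coverage(
    rows: Iterable[CoverageRow], names: Dict[Interval, str]
) -> List[Tuple[str, int, int, Optional[str], float]]:
    """Attach the interval name (or None) to each hypothetical coverage row."""
    return [
        (chrom, start, end, names.get((chrom, start, end)), coverage)
        for chrom, start, end, coverage in rows
    ]
```

Keying on coordinates only works if no two transcripts or exons share exactly the same interval, which is one reason those lists may need the special care mentioned above.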

Run multi-interval queries over .d4 files

So far one can only query a single interval or a whole chromosome

Change this repo from private to public

We might wait until it's functional before doing so, but since it has to be used by other centers, I think the code in this repo should be public

Endpoint to load gene intervals

Create an endpoint that loads the intervals for each single gene (the entire gene). There will be a similar thing for transcripts and exons as well.

It should accept a genome build.

It should download and parse the genes file and create one entry per gene in the intervals table with the following columns:

id
chromosome
start
stop

Additionally it should create and link tags for the gene above so it's searchable using the following parameters:

build (required)
one of the following items: [Ensembl id, HGNC symbol, HGNC ID] (optional)

Direct support for an env file when launching the app

Right now, startup with an env file only works when the env file is used in Docker to mock env vars. Modify the main file so that it also accepts and uses an env file when the app is started from a conda env.
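A minimal, dependency-free sketch of what reading an env file at startup could look like. The function name `load_env_file` is hypothetical, and the parser only handles simple `KEY=VALUE` lines, not the full syntax Docker accepts.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str, override: bool = False) -> Dict[str, str]:
    """Read KEY=VALUE lines from a file and export them to os.environ."""
    loaded: Dict[str, str] = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Ignore blank lines, comments and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        # By default, variables already set in the environment win over the file
        if override or key not in os.environ:
            os.environ[key] = value
    return loaded
```

The function would need to be called before the database URL is built. The python-dotenv package offers a more complete implementation of the same idea via `load_dotenv`.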

Deprecated code in dbutil.py

It will raise an error when switching to SQLAlchemy 2

/Users/chiararasi/Documents/work/GITs/chanjo2/src/chanjo2/dbutil.py:32: MovedIn20Warning: Deprecated API features detected! These feature(s) are not compatible with SQLAlchemy 2.0. To prevent incompatible upgrades prior to updating applications, ensure requirements files are pinned to "sqlalchemy<2.0". Set environment variable SQLALCHEMY_WARN_20=1 to show all deprecation warnings. Set environment variable SQLALCHEMY_SILENCE_UBER_WARNING=1 to silence this message. (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
Base = declarative_base()

Coverage not updated when merging PRs

This repo should have coverage above 95% by now, but it is still showing 64%, as it did some weeks ago

Wrong build value in genes, transcripts, exons

It should be "GRCh37", not "build_37"
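A mix-up like this can happen when an enum member's name is stored instead of its value. The sketch below is a guess at the mechanism, not the actual chanjo2 code: the `Builds` enum and the `normalise_build` helper are hypothetical, and show one way to make sure only the value reaches the database.

```python
from enum import Enum


class Builds(str, Enum):
    # Member names are internal; the values are what should be stored
    build_37 = "GRCh37"
    build_38 = "GRCh38"


def normalise_build(value: str) -> str:
    """Accept either the member name ("build_37") or the value ("GRCh37")."""
    if value in Builds.__members__:
        return Builds[value].value
    # Enum lookup by value raises ValueError for anything unknown
    return Builds(value).value
```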

I must have introduced this error in one of my latest PRs:

Create endpoint for fetching coverage statistics from a bed file

Define input and output models of endpoint
Evaluate response time of the endpoint given different approaches

Support update genes/transcripts/exons from pre-downloaded files

Load genes from file
Load transcripts from file
Load exons from file

Establish MVP for chanjo-report

Determine minimum viable product which replaces current functionality of chanjo-report

Discussion might be needed with Scout team and potentially some Scout customers.
Are all important metrics present in current report?
Do we need an additional microservice for chanjo-report or should chanjo2 have a report functionality?

Minimize the size of the docker image

The Docker image of the app is currently 1.31 GB in size, and it'd be nice to trim it down a bit.

Poor performance of Docker image on ARM architecture

I only noticed this today, after switching to a computer with an ARM M2 processor

Create a skeleton

Description

Create a skeleton of the app that describes how the structure should be.

This can be closed when

There is a skeleton with a couple of relevant endpoints

Add logging system to the app

Will be useful for debugging and in development!

Add option to display only MANE transcripts in coverage module

(I don't know if this issue should be reported in the chanjo repository instead, feel free to redirect)

In the Chanjo Report in Scout, it would be nice to have an option (perhaps a tick box) to narrow down the information about fully/incompletely covered transcripts so only information on MANE Select (and perhaps also Plus Clinical) transcripts is shown when requested.

Demo data in docker-compose

We could provide an example with real data by following this tutorial. There are also real files that can be used!

Schug and chanjo2 libs are not compatible

Create tests in an environment that mocks the availability of a database

There is no way to test the endpoints unless we mock an environment where a database is available. This is because the database connection and the analysis of its tables are prerequisites for the app to start

Create a dockerfile

Create dockerfile for local development and staging

Add support for multitrack D4 files

So far we support D4s with only one track

Infinite loop when loading genes in genome build 38

Using the API to load genes in build 38 works, but then the app gets stuck printing this line:

Do not crash on SQLAlchemy connection error

Display the error instead

Create database models

Create a sketch of database models and relationships
Create mock objects of database entries according to the model for development and testing

Use SQLAlchemy 2.0

The app currently uses SQLAlchemy version 1.4.31, but 2.0 has great new features that will make the code more readable and operations faster:

https://docs.sqlalchemy.org/en/20/changelog/migration_20.html

Some warnings from the docker-image-push GitHub action

Warning: The save-state command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
Docker info
Buildx version
Warning: The set-output command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/

Generate d4 test data

Test files in *.d4 format generated

Add completeness to the coverage metrics returned

Just like in chanjo
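As a rough illustration, completeness is taken here to mean the fraction of positions in a region whose read depth is at or above a given threshold. The function below is a sketch over an in-memory list of per-base depths; its name and signature are assumptions, and it does not read d4 files.

```python
from typing import Dict, Sequence


def coverage_completeness(
    depths: Sequence[int], thresholds: Sequence[int]
) -> Dict[int, float]:
    """Fraction of positions whose depth is at or above each threshold."""
    if not depths:
        # An empty region has no covered bases
        return {threshold: 0.0 for threshold in thresholds}
    total = len(depths)
    return {
        threshold: sum(1 for depth in depths if depth >= threshold) / total
        for threshold in thresholds
    }
```

For example, depths of `[0, 5, 10, 20]` give a completeness of 0.5 at a threshold of 10 and 0.25 at a threshold of 20.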

Move coverage endpoints to a dedicated file

Separate them from the intervals code

Filter available intervals (genes, transcripts, exons) by gene (HGNC ID, HGNC symbol, or Ensembl ID)

This code will also be used when extracting the intervals to calculate the coverage on the d4 files

Modify the endpoint to create samples to accept d4 files hosted on a static HTTP server

It should check that the provided URL points to the resource.

Since the analysis of remote files is slower, the resource should have an index (check that it exists as well)

Coverage for all samples of a case

Add one more parameter to the query so that the user can specify a list of samples

VARCHAR requires a length on dialect mysql

Testing the app with a real MySQL database. I get the following error:

sqlalchemy.exc.CompileError: (in table 'samples', column 'coverage_file_path'): VARCHAR requires a length on dialect mysql

The demo is working because it's using another SQL dialect (sqlite)

Logs printed twice

I have the feeling that the metadata object is called/imported from the wrong place. The bug is present in current main.

Demo app is launched twice because of gunicorn settings

After introducing the logs, I've realised that the demo app is launched twice:

Basically it's creating the tables twice and printing the "Running a demo instance of Chanjo2" message twice. This is not unexpected, but annoying. It's caused by gunicorn running with 2 workers.

We could solve this in 2 ways:

Changing the gunicorn config to use 1 worker
Launching the demo app with the uvicorn command instead: uvicorn src.chanjo2.main:app, which is also easier.

I would opt for the second option. We just need to change a line in the README file. I'll fix it!

Add instructions on how to use a dockerized chanjo2 on the fly to get coverage data from files

For instance, given a d4 file and a bed file with the intervals.

Chanjo2 can also be used in projects entirely unrelated to Scout, as a container that can be run to gather quick coverage data when running a pipeline, for instance.

Another useful application for non-scout users: run it as a container and provide a d4 file and a list of genomic coordinates (doesn't require a populated database at all).

Speed up computation of coverage completeness over a list of genes

Responses get slow when coverage completeness has to be calculated

Can't launch the app via docker-compose or direct gunicorn command

Hi @Vince-janv and @ramprasadn. I'd like to revive this project and I'm trying to launch the application.

Using docker-compose (docker-compose up) doesn't work. I get the following error:

(chanjo2) chiararasi@ChiaraRMBP:~/Documents/work/GITs/chanjo2$ docker-compose up
WARNING: The D4DB_NAME variable is not set. Defaulting to a blank string.
WARNING: The D4DB_USER_PASSWORD variable is not set. Defaulting to a blank string.
WARNING: The MYSQL_HOST_PORT variable is not set. Defaulting to a blank string.
WARNING: The MYSQL_CONTAINER_PORT variable is not set. Defaulting to a blank string.
WARNING: The DATA_VOLUME variable is not set. Defaulting to a blank string.
WARNING: The D4DB_USER_NAME variable is not set. Defaulting to a blank string.
WARNING: The CHANJO_HOST_PORT variable is not set. Defaulting to a blank string.
WARNING: The CHANJO_CONTAINER_PORT variable is not set. Defaulting to a blank string.
ERROR: The Compose file './docker-compose.yml' is invalid because:
services.chanjo2-demo.ports contains an invalid type, it should be a number, or an object
services.d4database.ports contains an invalid type, it should be a number, or an object
(chanjo2) chiararasi@ChiaraRMBP:~/Documents/work/GITs/chanjo2$

Same when I invoke the gunicorn command: gunicorn --config gunicorn.conf.py src.chanjo2.main:app (I guess?):

  File "/Users/chiararasi/Documents/work/GITs/chanjo2/src/chanjo2/main.py", line 9, in <module>
    from chanjo2.dependencies import engine, get_session
  File "/Users/chiararasi/Documents/work/GITs/chanjo2/src/chanjo2/dependencies.py", line 11, in <module>
    engine = create_engine(mysql_url, echo=True)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlmodel/engine/create.py", line 139, in create_engine
    return _create_engine(url, **current_kwargs)
  File "<string>", line 2, in create_engine
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/util/deprecations.py", line 309, in warned
    return fn(*args, **kwargs)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/engine/create.py", line 530, in create_engine
    u = _url.make_url(url)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/engine/url.py", line 731, in make_url
    return _parse_rfc1738_args(name_or_url)
  File "/Users/chiararasi/miniconda3/envs/chanjo2/lib/python3.8/site-packages/sqlalchemy/engine/url.py", line 787, in _parse_rfc1738_args
    components["port"] = int(components["port"])
ValueError: invalid literal for int() with base 10: 'None'
[2023-01-11 10:34:06 +0100] [49258] [INFO] Worker exiting (pid: 49258)
[2023-01-11 10:34:06 +0100] [49255] [WARNING] Worker with pid 49258 was terminated due to signal 15
[2023-01-11 10:34:06 +0100] [49255] [INFO] Shutting down: Master
[2023-01-11 10:34:06 +0100] [49255] [INFO] Reason: Worker failed to boot.

I see where the error is in both cases. If you don't mind I'd like to write a basic fix and some instructions to run the app in README?
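The traceback shows SQLAlchemy failing on `int(components["port"])` with the literal string 'None', which suggests an unset setting was formatted into the database URL. A basic fix could fail early with a readable message instead. In the sketch below, the variable names follow the docker-compose warnings above, except `MYSQL_HOST`, which is hypothetical. The `mysql+pymysql` driver prefix is also an assumption.

```python
from typing import Mapping
from urllib.parse import quote_plus

# MYSQL_HOST is a hypothetical name; the others appear in the compose warnings
REQUIRED_VARS = (
    "D4DB_USER_NAME",
    "D4DB_USER_PASSWORD",
    "MYSQL_HOST",
    "MYSQL_HOST_PORT",
    "D4DB_NAME",
)


def build_mysql_url(env: Mapping[str, str]) -> str:
    """Build the database URL, or raise a readable error if settings are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    port = env["MYSQL_HOST_PORT"]
    if not port.isdigit():
        raise RuntimeError(f"MYSQL_HOST_PORT must be an integer, got {port!r}")
    # Quote the credentials so special characters do not break URL parsing
    user = quote_plus(env["D4DB_USER_NAME"])
    password = quote_plus(env["D4DB_USER_PASSWORD"])
    return f"mysql+pymysql://{user}:{password}@{env['MYSQL_HOST']}:{port}/{env['D4DB_NAME']}"
```

Passing `os.environ` as `env` would let the app report all unset variables at once instead of crashing inside the URL parser.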
