dieterich-lab / gear

This project forked from igs/gear

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.

Home Page: https://umgear.org

License: GNU Affero General Public License v3.0

Python 29.24% Shell 0.05% Dockerfile 0.14% Jupyter Notebook 25.90% HTML 17.34% R 0.43% PHP 3.27% Perl 0.01% CSS 2.94% JavaScript 20.68%

gear's Issues

No profile (layout) fails on dataset explorer

Traceback (most recent call last):
  File "/var/www/cgi/get_users_layout_members.cgi", line 36, in <module>
    main()
  File "/var/www/cgi/get_users_layout_members.cgi", line 28, in main
    if layout.user_id == user.id:
AttributeError: 'NoneType' object has no attribute 'user_id'

geardb.get_layout_by_id returns None if no layout is found, hence in get_users_layout_members.cgi there should be a check before evaluating if layout.user_id == user.id.
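
A minimal sketch of such a check, assuming the layout and user objects expose user_id and id as in the traceback. The Layout and User classes and the function name layout_owned_by_user are hypothetical stand-ins, not the actual geardb classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layout:
    # Hypothetical stand-in for the object returned by geardb.get_layout_by_id
    user_id: int


@dataclass
class User:
    # Hypothetical stand-in for the logged-in user object
    id: int


def layout_owned_by_user(layout: Optional[Layout], user: Optional[User]) -> bool:
    """Return True only when both objects exist and the layout belongs to the user."""
    if layout is None or user is None:
        # No layout (e.g. a new portal without profiles) or no session: nothing to own
        return False
    return layout.user_id == user.id
```

With a guard like this, a missing layout is treated as "not owned" instead of raising AttributeError. Whether that is the right behaviour for the dataset explorer still needs to be decided.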

What other consequences does this have if a user has no profile?

See New portal with no profiles fails on dataset explorer page

Dataset curator: violin faceted by columns show an unexpected behaviour

Is there an existing issue for this?

I checked the documentation and found no answer
This bug has not already been reported

A clear and concise description of what the issue is.

Go to dataset curator, select violin, and facet by columns. All facets have their own y-axis scales (which are not displayed). The figure only shows one y-axis, which appears to be "tied" to the leftmost facet.

e.g. choose

Dataset: Pathogenic variants damage cell composition and single-cell transcription in cardiomyopathies (pseudobulk)
Display Type: Violin
Gene: MYH6
x-axis: genotype
y-axis: expression
Facet Column: celltype

The resulting plot suggests that all cell types have the same expression of MYH6, but hovering over the data, it is clear that e.g. CM have over 25K, while other cell types are mostly below 100. Re-ordering the cell types toggles a change in y-axis.

Output or error messages.

No response

Anything else?

Until this issue is fixed, we just need to be aware that violin faceted by columns can show an unexpected behaviour.
This can be "corrected" by hand on a case by case basis using Y Tick Range. The max range of data can be found by hovering over the plot, and Y Tick Range set manually. For other displays, just reset Y Tick Range to nothing/null.
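
Instead of reading the maximum off the plot by hovering, the range to enter in Y Tick Range could be computed from the data. This is only a sketch: the function name shared_y_range and the dictionary layout (one list of expression values per facet) are assumptions, not part of the portal code.

```python
from typing import Dict, List, Tuple


def shared_y_range(values_by_facet: Dict[str, List[float]], pad: float = 0.05) -> Tuple[float, float]:
    """Return one (low, high) range covering the values of every facet."""
    all_values = [v for values in values_by_facet.values() for v in values]
    if not all_values:
        raise ValueError("no expression values to compute a range from")
    low, high = min(all_values), max(all_values)
    margin = (high - low) * pad  # small margin so the extremes are not clipped
    return low - margin, high + margin


# Illustrative values only: one facet far above the others, as in the MYH6 example
print(shared_y_range({"CM": [21000.0, 25000.0], "FB": [12.0, 80.0]}))
```

Using one range for all facets makes the single displayed y-axis valid for every facet, at the cost of flattening the facets with low expression.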

What browser were you using?

Firefox, Chrome

Create account notification and log-in/log-out issues

After Create account, there is no notification or message; we are simply sent back to the main page.
When selecting pages from the account user drop-down menu (or, sometimes, when hitting the back button), the user drop-down menu disappears after the new page has loaded (or there is a delay before it appears again). As a result, we are left with the general login template (the user cannot access the profile, cannot log out, etc.).

It is unclear whether the user is logged out; in fact, it looks like they are not.

In general, hitting the back button has inconsistent behaviour.

Data management

Issues and features related to data upload.

Data upload (H5AD)

~~[BUG] Currently, if uploading in H5AD format, the original data is left under uploads/files. We can either handle this case differently, or just remove the original data.~~
[FEATURE] We need to determine how best to allow using existing unstructured metadata, layers, or observation/variable-level matrices (such as UMAP, etc. )

Data upload (large data) - [ENHANCEMENT]

For relatively large datasets (e.g. a 10G H5AD file), the current upload is not suitable: it takes forever or is interrupted.
Meanwhile, I added a new apache2 config unlimited_uploads.conf with LimitRequestBody 0, and further raised the PHP limits

post_max_size = 0
upload_max_filesize = 30000M
max_execution_time = 300

but there may be other timeout configurations that may also interrupt PHP execution. We need to think of a longer term solution. See #14 , I think this will work, at least for now. If we keep this solution, we should clean the PHP upload script and add proper logging.

Data upload (general)

~~[QUESTION] It looks like the original metadata is left under uploads/files. This might raise security issues. Should we remove it?~~
[DOCUMENTATION] We need to update the documentation, in particular for H5AD (and prioritize this format, at least for scRNA-seq).
[REMARK] File names must match exactly, otherwise upload will fail without any meaningful error message, e.g. if using gene.tab instead of genes.tab. Documentation should either be clear about this, or we allow some fuzziness in file names during upload, or we make sure an appropriate error message is displayed.
[REMARK] For failed uploads, some files may remain under /tmp or files/uploads.
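
Regarding the file name remark, a check like the following could produce a meaningful error message before the upload is processed. This is a sketch under assumptions: only genes.tab is named above, the other two expected names are placeholders to be checked against the documentation, and check_upload_names is a hypothetical helper.

```python
import difflib
from typing import Iterable, List

# Assumed set of expected names for the 3-tab format; only genes.tab is named in the issue,
# the other two are placeholders to be checked against the documentation.
EXPECTED_FILES = ("expression.tab", "genes.tab", "observations.tab")


def check_upload_names(found: Iterable[str], expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return human-readable error messages for missing files, with near-miss hints."""
    found = list(found)
    errors = []
    for name in expected:
        if name in found:
            continue
        # Suggest an uploaded file whose name is close to the expected one
        close = difflib.get_close_matches(name, found, n=1, cutoff=0.8)
        hint = f" (found '{close[0]}', did you mean '{name}'?)" if close else ""
        errors.append(f"Missing required file '{name}'{hint}")
    return errors
```

This keeps exact matching (no fuzziness during upload) but tells the user which name is wrong, which covers the gene.tab versus genes.tab case.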

metadata fields

We need to think how to handle annotation source and release.

The problem is that when we try to automate the preparation of e.g. GEO datasets, in many instances there is no information on the annotation source, and even less on the version number, but often only things like the assembly (hg19, mm10, etc.).

Also, the definition of annotation_source, i.e. Ensembl or Genbank in the portal (metadata) is not clear. What is actually Genbank? The Ensembl and GENCODE gene models are more or less the same, but GENCODE uses the UCSC convention. Is Genbank meant to represent NCBI RefSeq? See differences and common gene tracks on the UCSC website.

Couldn't we allow an empty field in annotation_release_number?
Shouldn't we allow annotation_source to be empty (or redefine choices), and/or add an assembly token to be more general?
This is unrelated to annotation, but I also think we should add an additional field for scRNA-seq, to flag whether the data is raw or processed/normalised. My current understanding is that it is possible to upload raw scRNA-seq, and we aim to extend the functionalities of the workbench to properly deal with this case (integrate multiple samples, normalise, etc. ).

If annotation_release_number is not numeric, we currently do not check at metadata validation step, unless I'm mistaken. Suppose that we wrote hg38, instead of e.g. 102 (assuming we wanted hg38/GRCh38 102), then we end up with

Exception: Failed to insert metadata: 1366 (22007): Incorrect integer value: 'hg38' for column `gear_portal`.`dataset`.`annotation_release` at row 1: /var/www/cgi/load_dataset_finalize.cgi

The problem is that there is no meaningful warning displayed in the portal, e.g. Annotation release must be numeric, please change metadata....
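
A check along these lines could run at the metadata validation step, so that the user sees a readable message instead of the MySQL error. This is a sketch only: validate_annotation_release is a hypothetical helper, and accepting an empty value is an assumption that follows the suggestion above, not current behaviour.

```python
from typing import Optional


def validate_annotation_release(value: Optional[str]) -> Optional[int]:
    """Return the release as an int, None when empty, or raise ValueError with a readable message."""
    if value is None or str(value).strip() == "":
        # Assumption: an empty release is tolerated, as proposed above
        return None
    text = str(value).strip()
    if not text.isdecimal():
        raise ValueError(
            f"Annotation release must be numeric (got '{text}'), please change the metadata."
        )
    return int(text)
```

For example, validate_annotation_release("102") returns 102, while validate_annotation_release("hg38") raises a ValueError whose message can be shown in the portal.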

Back up adata fills up disk space

In h5ad_identify_variable_genes.cgi, line 74 adata.filename = dest_datafile_path + ".backed.h5ad" writes adata as a dense matrix. There was this comment: This next command hits a memory issue if not backed. But the previous commands fail for various reasons if backed. So back up now.

The problem is that writing a dense matrix can take up to ~50GB or even more, even after filtering, and thus not only quickly fills diskspace under /tmp, but takes a lot of time to write... In all cases, in our current test setup, this ends up with No space left on device...

test from https://dhart.dieterichlab.org/contact.html

From: Etienne Boileau

Email: [email protected]

Server IP: 10.250.140.151

Msg: Using the contact form with public data, attaching a file.

Attaching a file results in

[Thu Jul 21 16:37:58.493326 2022] [php7:notice] [pid 4033881] [client 10.250.140.13:52836] PHP Notice:  Trying to access array offset on value of type null in /dhart/gEAR/www/contact_screenshots/UploadHandler.php on line 469, referer: https://dhart.dieterichlab.org/contact.html
[Thu Jul 21 16:37:58.493898 2022] [php7:notice] [pid 4033881] [client 10.250.140.13:52836] Function not found: imagecreatetruecolor, referer: https://dhart.dieterichlab.org/contact.html
[Thu Jul 21 16:37:58.493916 2022] [php7:notice] [pid 4033881] [client 10.250.140.13:52836] Function not found: imagecreatetruecolor, referer: https://dhart.dieterichlab.org/contact.html
[Thu Jul 21 16:37:58.493941 2022] [php7:info] [pid 4033881] [client 10.250.140.13:52836] PHP Deprecated:  implode(): Passing glue string after array is deprecated. Swap the parameters in /dhart/gEAR/www/contact_screenshots/UploadHandler.php on line 1048, referer: https://dhart.dieterichlab.org/contact.html

Tags: ['']

Screenshot: https://dhart.dieterichlab.org/contact_screenshots/f71cdd6b-697e-4269-903f-87f65c1f70d2.jpg

Package versions, missing packages, and dependencies

There are many pinned (older) versions of python packages, and dependencies will eventually be a problem...
The gEAR developers have already started to work on this, see below.

Minor changes are only documented in CHANGELOG.
See also Ansible install playbook.
We should eventually move to testing.

Regarding the diffxpy install, we need to make sure it is installed in the virtual environment as www-data user.
We can make a test install using

pip install pip-install-test
pip show pip_install_test
python -c 'import pip_install_test'

then

pip install git+https://github.com/adkinsrs/diffxpy.git@b2ebeb0fb7c6c215d51264cd258edf9d013ff021

But we should find a better solution in the long term.

We also found out that we had issues with dash (probably due to the force re-install), see
ImportError: cannot import name 'get_current_traceback' from 'werkzeug.debug.tbtools'
We could try pip install Werkzeug==2.0.0 in the virtual environment, or solve the dependency problem if possible.
Now we are left with

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dash-bio 0.6.1 requires dash>=1.6.1, but you have dash 1.3.0 which is incompatible.

We also found out that rpy2 is missing: ModuleNotFoundError: No module named 'rpy2'.
This is used in www/api/resources/projectr.py, and

line 30 from resources.projectr import ProjectR
line 42 api.add_resource(ProjectR, '/projectr/<dataset_id>')

in www/api/api.py.

But this is not mentioned anywhere except in docker/requirements.txt? And the install won't be successful unless we have a running R installation... so how is this working?

See Implement ProjectR

There is a software_update_bugfixes branch on gEAR, apparently started on IGS@cc89c46, with following updated dependencies

anndata==0.8.0 \
biocode==0.10.0 \
biopython==1.79 \
dash-bio==1.0.2 \
Flask==2.1.0 \
Flask-RESTful==0.3.9 \
h5py==3.6.0 \
itsdangerous==2.1.2 \
jupyter==1.0.0 \
kaleido==0.2.1 \
llvmlite==0.38.0 \
mod-wsgi==4.9.0 \
MulticoreTSNE==0.1 \
mysql-connector-python==8.0.28 \
numba==0.55.1 \
numexpr==2.8.1 \
numpy==1.21.5 \
opencv-python==4.5.5.64 \
pandas==1.4.1 \
Pillow==9.0.1 \
plotly==5.6.0 \
python-dotenv==0.20.0 \
requests==2.27.1 \
scanpy==1.8.2 \
scanpy[louvain]==1.8.2 \
scikit-learn==1.0.2 \
scipy==1.8.0 \
SQLAlchemy==1.4.32 \
xlrd==2.0.1

but it is not yet integrated. See also Commits software_update_bugfixes, and also Backup production and update OS and all libraries on devel.

It seems as if some earlier changes e.g. commit ae11f63449d7bda768bf6f9503aa3ec2ff42cda7 on www/api/resources/multigene_dash_data.py are lost...

It is not easy to find associated commits that would resolve issues associated with upgrade...

Deprecated Feature Used

There are a lot of Deprecated Feature Used:

[core:info] [pid 3637213] [client 10.250.100.26:33912] AH00128: File does not exist: /var/www/js/vendor/jsrender.min.js.map
[core:info] [pid 3625131] [client 10.250.100.26:33948] AH00128: File does not exist: /var/www/css/vendor/bs-stepper.min.css.map
[core:info] [pid 3637213] [client 10.250.100.26:33912] AH00128: File does not exist: /var/www/js/vendor/jsrender.min.js.map
[core:info] [pid 3625131] [client 10.250.100.26:33948] AH00128: File does not exist: /var/www/js/vendor/bs-stepper.min.js.map

Also, we get

DevTools failed to load source map: Could not load content for http://10.250.135.19/js/vendor/jsrender.min.js.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE

test api 2

From: Etienne Boileau

Email: [email protected]

Server IP: 10.250.135.22

Msg: Posting issue

Tags: ['test']

Upload fails silently for large files

Is there an existing issue for this?

I checked the documentation and found no answer
This bug has not already been reported

A clear and concise description of what the issue is.

This is related to #8 : Data upload (large data) - [ENHANCEMENT].

For large data (how large, what is the limit?), the upload fails silently, i.e. I cannot find any errors in the apache logs (or any other logs). I modified /etc/php/7.4/apache2/php.ini to set all error handling and logging more verbose, specifying the location of a php log e.g. /var/log/php_errors.log, to no avail.

This happens in js/upload_dataset.js in validate_expression. The call to cgi/validate_expression.cgi returns with an error, but filename is not None, so it is as if PHP succeeded. However, the dataset uploader fails to read it because the file is actually not there. Effectively, uploads/files/ does not contain the expression data, but only the metadata.
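
To stop the failure from being silent, the validation step could first confirm that the file reported by the upload is really on disk. A minimal sketch, assuming the upload directory and file name are known at that point; uploaded_file_error is a hypothetical helper and not part of validate_expression.cgi.

```python
import os


def uploaded_file_error(upload_dir: str, filename: str) -> str:
    """Return an error message if the uploaded file is missing or empty, else an empty string."""
    if not filename:
        return "No file name was returned by the upload step."
    path = os.path.join(upload_dir, filename)
    if not os.path.isfile(path):
        return f"Upload of '{filename}' did not complete: no such file under {upload_dir}."
    if os.path.getsize(path) == 0:
        return f"Upload of '{filename}' produced an empty file."
    return ""
```

A non-empty return value could be logged and sent back to the uploader, instead of letting the H5AD reader fail later with Errno 2.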

Output or error messages.

Oops! [Errno 2] Unable to open file (unable to open file: name = '../uploads/files/EGAS00001006374 (1).h5ad', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Anything else?

No response

What browser were you using?

Chrome

test api

From: Etienne Boileau

Email: [email protected]

Server IP: 10.250.135.22

Msg: Posting issue

Tags: ['test']

Data download and redirection goes to umgear.org

Is there an existing issue for this?

I checked the documentation and found no answer
This bug has not already been reported

A clear and concise description of what the issue is.

This happened when I clicked on Download H5AD for a given dataset

Error: Dataset tarball could not be found. Unable to download data file.
Redirecting... Click here if you are not redirected

In download_source_file.cgi, if dtype == 'h5ad' and os.path.isfile(tarball_path) results in NOT downloading H5AD if there is not a tar.gz file present. This could be the case if we uploaded in H5AD format (or in xlsx I suppose, although I haven't tested).
We need to change the path!
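
One possible shape for the fix is to test each download type against its own file. This is a sketch, not the code of download_source_file.cgi: the function resolve_download_path is hypothetical, and 'tarball' as the name of the second download type is an assumption (only 'h5ad' appears above).

```python
import os
from typing import Optional


def resolve_download_path(dtype: str, h5ad_path: str, tarball_path: str) -> Optional[str]:
    """Pick the file to serve for a download request, or None if it is not on disk."""
    # Each download type is checked against its own file, so an H5AD download
    # does not depend on a tar.gz being present.
    candidates = {"h5ad": h5ad_path, "tarball": tarball_path}
    path = candidates.get(dtype)
    if path is not None and os.path.isfile(path):
        return path
    return None
```

A None result would then trigger the existing "could not be found" message, but only when the requested file itself is missing.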

Output or error messages.

No response

Anything else?

Also, we need to change the redirection to DHART.

What browser were you using?

Firefox, Chrome

Take notes option broken

Is there an existing issue for this?

I checked the documentation and found no answer
This bug has not already been reported

A clear and concise description of what the issue is.

The option Take notes is broken. In get_dataset_notes.cgi, user notes are queried using SELECT ... n.date_last_changed, but in the DB this is date_last_change.

This can be solved with ALTER TABLE note RENAME COLUMN date_last_change TO date_last_changed;. I also modified the file create_schema.sql to reflect this for future install.

Output or error messages.

The end of the traceback

[Wed Jul 20 12:04:26.422884 2022] [cgi:error] [pid 3976182] [client 10.250.140.13:52132] AH01215:     raise errors.get_exception(packet): /var/www/cgi/get_dataset_notes.cgi, referer: https://dhart.dieterichlab.org/index.html?multigene_plots=0&share_id=f7891966&gene_symbol_exact_match=1&gene_symbol=nppa
[Wed Jul 20 12:04:26.422931 2022] [cgi:error] [pid 3976182] [client 10.250.140.13:52132] AH01215: mysql.connector.errors.ProgrammingError: 1054 (42S22): Unknown column 'n.date_last_changed' in 'field list': /var/www/cgi/get_dataset_notes.cgi, referer: https://dhart.dieterichlab.org/index.html?multigene_plots=0&share_id=f7891966&gene_symbol_exact_match=1&gene_symbol=nppa

Anything else?

We need to change the color scheme of the side panel notes (sidepanel_notes.css) to reflect the DHART style.
The green button to add notes (Create a new note below/Add a note) is missing! This might be due to our re-definition of some margins? This needs to be checked...

What browser were you using?

Firefox, Chrome

File upload

File upload is too complex. We need to think how this can be simplified/harmonized and/or standardised.
The current 3-tab or 10X-like formats (or even worse, Excel) are not viable for large datasets.

Added h5ad upload minimal support. See below.

I do not understand how public datasets are handled. If I am logged in as admin with curator privilege, the data that I upload will not be seen as public, unless I select this choice in the dataset explorer. But then this is the same for any user. This data will be located under www/analyses/by_user, and not under www/analyses/by_dataset, where it should be, according to the documentation. Obviously, if I change the settings in the DB, e.g. UPDATE dataset SET is_public = 1 WHERE id = "id"; this only affects the status of the dataset, not its actual location. So how are datasets correctly uploaded for public access?

I think I understand better how this works now...

I also document a few more minor issues:

To avoid PHP Warning: failed to open stream: Permission denied, we need to make sure that

cd www
sudo chmod 777 datasets analyses/* uploads/files/

This was mentioned in the gEAR documentation, but somehow overlooked. This has been added to the Ansible playbook.

To avoid PHP Warning: POST Content-Length of 16172687560 bytes exceeds the limit of 3145728000 bytes, limits were set in the apache2/php.ini (not in cli/php.ini). We faced this when trying to upload tar files (as per documentation). In fact, the files should be compressed (tar.gz)! But even then, I could not upload compressed large files.
Minor changes to lib/gear/metadata.py, see CHANGELOG. In particular related to cgi.escape gone in Python3.8.
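
For reference, the usual replacement for cgi.escape is html.escape from the standard library. The sketch below shows the idea; escape_metadata_value is a hypothetical name, and the actual change made in lib/gear/metadata.py is the one recorded in CHANGELOG.

```python
import html


def escape_metadata_value(value: str) -> str:
    """Escape a metadata string for safe inclusion in HTML."""
    # cgi.escape(value) left quotes alone by default; quote=False keeps that behaviour.
    return html.escape(value, quote=False)
```

Note that html.escape escapes quotes by default (quote=True), unlike cgi.escape, so quote=False is needed only if the old output must be reproduced exactly.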
