sulab / biogps_dataset

BioGPS.org dataset & dataset loading code repository

Home Page: http://biogps.org/

Python 26.45% HTML 20.32% CSS 35.13% JavaScript 18.10%

biogps_dataset's Introduction

Detailed steps needed to have a local development version of BioGPS, for dataset loading

Setting up your local development environment:

Make sure you use git for version control (biogps_dataset was migrated to GitHub in May 2016).

Clone the repository and create a virtual environment (use your own GitHub username, of course!):

git clone https://github.com/JTFouquier/biogps_dataset.git
virtualenv biogps
source biogps/bin/activate
pip install -r requirements.txt

You must have three main components running in order to see datasets

1) SSH into the remote BioGPS database

Because the dataset database is much too large to install on a local computer for development, you need to request a connection to our dev DB server.

2) Run the local host server

The settings_dev file is a "secret file." Please ask Chunlei or the BioGPS project manager for it.

python3 manage.py runserver_plus --settings=biogps_dataset.settings_dev

3) Run Elasticsearch

Install Elasticsearch using these directions:

Elasticsearch is a search server based on Lucene.

It is a full-text search engine with an HTTP web interface and schema-free JSON documents.

Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

From within the elasticsearch folder that you set up, run:

./bin/elasticsearch

Next, get data from a BioGPS user/researcher

You will need to get an info sheet, factors sheet and RNAseq data/matrix file from a scientist.

Does the local dataset you are loading have gene symbols in it and is it an RNAseq dataset?

If yes, then you must run reporter_to_entrezgene.py, which will use mygene.info to replace gene symbols with Entrezgene IDs.

Entrezgene IDs are required for data display on BioGPS.org. For microarray datasets, however, keep the probe set reporters.

Dataset Parsers:

load_ds: loads remote datasets (microarray data from ArrayExpress) to the remote server, for dev or prod.
load_ds_local: loads local datasets to the remote server, for dev or prod (written for RNAseq).

Run a command through Django's manage.py like this ("load_ds_local" can be replaced with another command):

python3 manage.py load_ds_local --settings=biogps_dataset.settings_dev

Then use the es_index command to index the data. The newly loaded dataset should then appear in the chart file:

python3 manage.py es_index --settings=biogps_dataset.settings_dev

Use the "-c" argument if you want to clear previous indexing:

python3 manage.py es_index -c --settings=biogps_dataset.settings_dev

Output looks something like this:

added 16 platform, added 5914 dataset

Open this URL and you should see bar charts!

http://localhost:8000/static/data_chart.html

You may sometimes need to restart the localhost server and the server hosting the database, as well as Elasticsearch.

For help:

python3 manage.py load_ds --help --settings=biogps_dataset.settings_dev

Instances (models) to create during dataset loading:

If you don't know what a model is, then read about Django!

dataset:
- Model with information about a certain dataset, including metadata.
dataset_matrix:
- The dataset matrix, containing the entire dataset from the RNA-seq run. You likely do not want to display an instance of this model all at once!
dataset_data:
- One reporter gene and all of its expression information for all samples.
Dataset Platform:
- We created a new platform because we are now loading a sequencing (not microarray) dataset. Because this is a sequencing platform, it does not have to be recreated every time.
- Example input information:
  - RNA seq
  - reporters empty list
  - name = "generic RNA seq platform for mouse"
  - species = mouse

BioGPS averages the samples for you, so you do not need to supply your own averages.

Misc. information for testing/developing BioGPS:

URLs from mygene.info used to look up Entrezgene IDs by gene symbol (from reporter_to_entrezgene):

http://mygene.info/v2/query?q=symbol:CDK2
http://mygene.info/v2/query?q=symbol:0610005C13Rik
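
The lookup behind these URLs can be sketched in Python. This is a minimal illustration, not the code in reporter_to_entrezgene.py: both function names are hypothetical, and the parsing step assumes that mygene.info answers with a JSON object containing a "hits" list whose entries carry "entrezgene" and "taxid" fields.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs above.
MYGENE_QUERY_URL = "http://mygene.info/v2/query"


def build_symbol_query_url(symbol):
    """Return the mygene.info query URL for one gene symbol (hypothetical helper)."""
    # safe=":" keeps the "symbol:" prefix readable, matching the URLs above.
    return MYGENE_QUERY_URL + "?" + urlencode({"q": "symbol:" + symbol}, safe=":")


def pick_entrezgene_id(response, taxid):
    """Return the first Entrez Gene ID among the hits for one species, or None.

    Assumes `response` is the decoded JSON body and that each hit has
    "entrezgene" and "taxid" keys (an assumption about the response shape).
    """
    for hit in response.get("hits", []):
        if hit.get("taxid") == taxid and "entrezgene" in hit:
            return int(hit["entrezgene"])
    return None
```

Filtering by taxid (9606 for human, 10090 for mouse) matters because the same symbol can exist in several species, so the first hit is not necessarily the one the dataset needs.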

To access the dataset via the shell:

python3 manage.py shell_plus --settings=biogps_dataset.settings_dev

Run these commands from shell:

This returns the dataset object which is the foreign key for dataset data and dataset matrix:

ds = BiogpsDataset.objects.get(geo_gse_id="BDS_00015")

This returns all the metadata (from info sheet and factors):

ds.__dict__

Viewing datasets on your BioGPS localhost

The entry selected in the "probeset" dropdown menu is also considered the reporter gene on BioGPS.

Go to the URL for the specific gene and dataset (identified by the dataset's primary key or its geo_gse_id). The geo_gse_id is also important: it will be BDS_XXXXX, using the next number in the sequence.

Example dataset viewing urls:

http://localhost:8000/static/data_chart.html?gene=67669&dataset=10044
http://localhost:8000/static/data_chart.html?gene=12566&dataset=BDS_00015
http://localhost:8000/static/data_chart.html?gene=100152011&dataset=10078
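
When checking several genes or datasets, the chart URLs above can be generated instead of typed. A small sketch, where data_chart_url is a hypothetical helper and the host and port are simply those of the examples:

```python
from urllib.parse import urlencode

# Chart page as served by the local development server in the examples above.
CHART_PAGE = "http://localhost:8000/static/data_chart.html"


def data_chart_url(gene_id, dataset):
    """Build a chart URL from an Entrez Gene ID and a dataset key.

    `dataset` may be the dataset's primary key (e.g. 10044) or its
    geo_gse_id (e.g. "BDS_00015"), as in the example URLs.
    """
    return CHART_PAGE + "?" + urlencode({"gene": gene_id, "dataset": dataset})
```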

Example admin:

http://localhost:8000/admin/dataset/biogpsdataset/2427/

Standard test gene is 1017, which is a human gene! So if you are using a mouse dataset, this will understandably be missing:

CDK2 cyclin-dependent kinase 2, Homo sapiens (human) Gene ID: 1017, updated on 6-Mar-2016

Cdk2 cyclin-dependent kinase 2, Mus musculus (house mouse) Gene ID: 12566, updated on 6-Mar-2016

You can also check the "fixed reporters" data file to see which Entrezgene IDs are actually in your dataset for viewing.

To view the full dataset (api) for a dataset and gene:

http://localhost:8000/dataset/full-data/geo_gse_id%20test/gene/12566/
http://localhost:8000/dataset/full-data/E-GEOD-16054/gene/1017/
http://localhost:8000/dataset/full-data/BDS_00001/gene/1017/
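
The first URL above contains %20 because that dataset's geo_gse_id includes a space. A sketch of building these API URLs with the identifier percent-encoded; full_data_url is a hypothetical helper and the base address is the local server from the examples:

```python
from urllib.parse import quote

# Base of the full-data API on the local development server (from the examples above).
FULL_DATA_BASE = "http://localhost:8000/dataset/full-data"


def full_data_url(dataset_id, gene_id):
    """Build the full-data URL for one dataset and one Entrez Gene ID."""
    # safe="" also encodes "/" so an odd identifier cannot add path segments.
    return "{}/{}/gene/{}/".format(FULL_DATA_BASE, quote(str(dataset_id), safe=""), gene_id)
```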

Misc. Information

Does your dataset have interesting tissue groups or organ systems?

If so, then change the color_idx in the JSON metadata (ex: admin/dataset/biogpsdataset/2509/) accordingly to group samples into meaningful groups. This is done manually because of the numerous possible sample groupings.

Make sure to run Flake8 (to check for PEP 8 compliance) before pushing code to the biogps_dataset repository.

biogps_dataset's Issues

in search, also index factor names and values

For example, a search for "DU145" should return this data set: http://biogps.org/dataset/E-GEOD-10832/transcription-profiling-by-array-of-human-prostate/

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/11
Originally reported by: Andrew Su
Originally created at: 2015-09-10T17:45:45.412

testing hipchat integration

Possible to enable "back button" for each individual "plugin" iframe window?

It would be nice to have this feature, but we need to investigate if that's possible.

Ref: https://stackoverflow.com/questions/3254985/back-and-forward-buttons-in-an-iframe

loading error for "E-GEOD-38516"

#!python

[INFO, L:74] -process experiment E-GEOD-38516-
[INFO, L:155] get experiment info from http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/arrays.txt
[INFO, L:169] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/arrays.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.hyb.sdrf.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.seq.sdrf.txt
[INFO, L:219] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/E-GEOD-38516
[INFO, L:232] get sample file: http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-38516/E-GEOD-38516.processed.1.zip
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 84, in handle
    data_matrix = setup_dataset(e)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 268, in setup_dataset
    splited[i] = float(splited[i])
IndexError: list index out of range

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/3
Originally reported by: Chunlei Wu
Originally created at: 2014-03-18T18:18:04.834

set up monitoring

pingdom or similar...

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/6
Originally reported by: Andrew Su
Originally created at: 2015-07-16T18:27:42.903

loading error for "E-GEOD-26688"

#!python

[INFO, L:57] -process experiment E-GEOD-26688-
[INFO, L:138] get experiment info from http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/arrays.txt
[INFO, L:152] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/arrays.txt
[INFO, L:160] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.hyb.sdrf.txt
[INFO, L:160] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.seq.sdrf.txt
[INFO, L:202] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/E-GEOD-26688
[INFO, L:219] sample file exists
[INFO, L:69] write database
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 103, in handle
    ds_matrix = np.array(data_matrix.values(), np.float32)
ValueError: setting an array element with a sequence.
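
NumPy typically raises this ValueError when the rows passed to np.array have unequal lengths, or when an element is itself a sequence, and a numeric dtype is requested. A quick check for the first case, run before building the array, is sketched below. It assumes data_matrix is a dict mapping each reporter to a list of values (inferred from the .values() call in the traceback); find_ragged_rows is a hypothetical helper, not part of the repository.

```python
from collections import Counter


def find_ragged_rows(data_matrix):
    """Return {reporter: row_length} for rows whose length is not the most common one."""
    if not data_matrix:
        return {}
    lengths = {reporter: len(row) for reporter, row in data_matrix.items()}
    # Treat the most frequent row length as the expected number of samples.
    expected, _count = Counter(lengths.values()).most_common(1)[0]
    return {reporter: n for reporter, n in lengths.items() if n != expected}
```

An empty result means every row has the same length, in which case the cause is more likely a nested value inside a row.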

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/2
Originally reported by: Chunlei Wu
Originally created at: 2014-03-18T02:22:39.731

Click a Y-axis label (usually, a sample name) to display more info about this sample

It would be a nice usability improvement to add this feature. Right now, users need to switch between labels in the "Label" drop-down, or open the dataset detail page, to find this information.

NameError: global name 'basestring' is not defined (possible issue)

I am using Python 3.4 and this known issue (NameError: global name 'basestring' is not defined) is closed and documented from 2008. https://bugs.python.org/issue1931... odd that I am even seeing this...

I added this try/except to widgets.py (see file path) source code in Python site packages. I'm sure that's not good practice but I couldn't get it to work otherwise. Can close this issue, just thought I'd document it.
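
The patch itself is not shown in this issue. For reference, a common compatibility shim of this kind looks like the sketch below; it may or may not match what was added to widgets.py, and string_types and is_text are illustrative names.

```python
# basestring exists only in Python 2; fall back to str on Python 3.
try:
    string_types = basestring  # noqa: F821 (undefined on Python 3 by design)
except NameError:
    string_types = str


def is_text(value):
    """Return True if value is a string on either Python version."""
    return isinstance(value, string_types)
```

Keeping such a shim in the project's own code, rather than editing a file under site-packages, means it survives a reinstall of the dependency.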

load neuronal iPSC data set for Baldwin lab

Please see https://bitbucket.org/sulab/biogps_dataset/issues/13/load-neuronal-ipsc-data-set-for-baldwin for information about this issue. (issue failed to migrate correctly from Bitbucket to Github during merge).

Missing default datasets

When searching for Tmem173 in humans, only 2 of the usual 6 default human datasets are displayed. Gene Atlas U133A, gcrma is NOT one of the default datasets, although it used to be before (and there are values for this gene in this missing default dataset).

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/12
Originally reported by: gtsueng
Originally created at: 2015-10-06T21:14:51.012

reporting search terms on search results page

Two minor changes. First, keep the search terms in the search box at the top for easy refinement. Second, multi-word searches look like they are incorrectly reported in the bread crumb.

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/9
Originally reported by: Andrew Su
Originally created at: 2015-09-10T17:39:28.654

make dataset searching the same within data chart plugin and dataset library

Currently the top of the list looks similar, but the overall count is different.

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/8
Originally reported by: Andrew Su
Originally created at: 2015-09-10T17:35:56.431

display error on E-GEOD-22255 when choosing "AGE-AT-EXAMINATION" as label

On http://biogps.org/dataset/E-GEOD-22255/, select "View dataset", then select "AGE-AT-EXAMINATION" as the label. Data chart then shows "No data to display".

Doesn't seem to affect any of the other factors.

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/7
Originally reported by: Andrew Su
Originally created at: 2015-09-10T17:28:17.622

Load E-GEOD-12199 which used to be in BioGPS, but isn't any more

Used to be here according to a user: http://biogps.org/dataset/844/expression-data-from-breast-cancer-cell-lines-mcf-7-and-mda-mb-231/

Can get it from here:
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-12199/

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12199

Assigning to Chunlei, since I don't know who takes care of these. Please assign to appropriate person, if not you.

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/5
Originally reported by: gtsueng
Originally created at: 2015-07-02T17:18:59.471

loading error for "E-GEOD-49355"

#!python

[INFO, L:74] -process experiment E-GEOD-49355-
[INFO, L:155] get experiment info from http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/arrays.txt
[INFO, L:169] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/arrays.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.hyb.sdrf.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.seq.sdrf.txt
[INFO, L:219] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/E-GEOD-49355
[INFO, L:232] get sample file: http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-49355/E-GEOD-49355.processed.1.zip
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 84, in handle
    data_matrix = setup_dataset(e)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 271, in setup_dataset
    raise Exception, 'file format wrong, check columns of file:%s'%(path+'/'+f)
Exception: file format wrong, check columns of file:tmp/unzip_sample/E-GEOD-49355/GSM1197996_sample_table.txt

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/1
Originally reported by: Chunlei Wu
Originally created at: 2014-03-18T02:20:05.386

switch to use "requests" and "requests_cache" modules to call web services

http://docs.python-requests.org/en/latest/

https://github.com/reclosedev/requests-cache

"requests" is a much better module than the built-in urllib and urllib2.

"requests_cache" provides a cache for requests, regardless of the actual caching headers. Very useful for both development and actual data loading.

Example code:

#!python

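import requests
import requests_cache

# Cache every response locally, regardless of HTTP caching headers.
requests_cache.install_cache('arrayexpress_cache')

# JSON endpoint: decode the body into Python objects.
res = requests.get('http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/E-GEOD-19279')
data = res.json()

# Plain-text file: .text is a property, not a method.
res = requests.get('http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-19279/E-GEOD-19279.sdrf.txt')
data = res.text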

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/4
Originally reported by: Chunlei Wu
Originally created at: 2014-03-18T19:28:57.855

load more complete metadata for CCLE data set (E-GEOD-36133)

The CCLE data set is loaded in BioGPS here: http://biogps.org/dataset/E-GEOD-36133/expression-data-from-the-cancer-cell-line-encyclop/

But the sample metadata table only has these columns: Sample, HISTOLOGY SUBTYPE1, HISTOLOGY, PRIMARY SITE

From the raw metadata file at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-36133/E-GEOD-36133.sdrf.txt), we need to add column 4, "Comment [Sample_title]", which is the cell line name.
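
Pulling that column out of the SDRF file could look like the sketch below. It assumes the file is tab-delimited with a single header row, and it looks the column up by header name rather than by position so it keeps working if columns are reordered. read_sdrf_column is a hypothetical helper, not part of the loader.

```python
import csv
import io


def read_sdrf_column(sdrf_text, column_name):
    """Return the values of one named column from tab-delimited SDRF text."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader)
    # Raises ValueError if the column is absent, which is the desired failure mode.
    idx = header.index(column_name)
    # Skip rows too short to contain the requested column.
    return [row[idx] for row in reader if len(row) > idx]
```

With the SDRF text downloaded from the URL above, read_sdrf_column(text, "Comment [Sample_title]") would return one cell line name per sample row.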

search terms should be combined with "AND" and not "OR"

Currently search terms are combined with "OR". It seems counterintuitive that the number of matching datasets increases with the number of search terms.
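
If the search is served by an Elasticsearch match query, the change could be as small as setting its operator. The sketch below only builds the request body. How this repository actually constructs its query is not shown here, and the field name "name" and the function build_search_body are placeholders.

```python
def build_search_body(terms, field="name", operator="and"):
    """Build an Elasticsearch match query body requiring every term to match.

    `field` is a placeholder; substitute the field(s) the index really uses.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": " ".join(terms),
                    # "or" is the default for match queries; "and" makes
                    # additional terms narrow the results instead of widening them.
                    "operator": operator,
                }
            }
        }
    }
```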

Bitbucket: https://bitbucket.org/sulab/biogps_dataset/issue/10
Originally reported by: Andrew Su
Originally created at: 2015-09-10T17:42:57.892