
biogps_dataset's Introduction

Detailed steps needed to have a local development version of BioGPS, for dataset loading

Setting up your local development environment:

Use git for version control (biogps_dataset was migrated to GitHub in May 2016).

Clone the repository and create a virtual environment (substitute your own GitHub username in the URL, of course!):

  • git clone https://github.com/JTFouquier/biogps_dataset.git
  • virtualenv biogps
  • source biogps/bin/activate
  • pip install -r requirements.txt

You must have three main components running in order to see datasets

1) SSH into the remote BioGPS database

Because the dataset database is much too large to install on your computer for local development, you need to request a connection to our dev database server.

2) Run the local host server

The settings_dev file is a "secret file." Please see Chunlei or BioGPS project manager.

  • python3 manage.py runserver_plus --settings=biogps_dataset.settings_dev

3) Run Elasticsearch

Install Elasticsearch using these directions:

Elasticsearch is a search server based on Lucene.

It is a full-text search engine with an HTTP web interface and schema-free JSON documents.

Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

From within the elasticsearch folder that you set up, run:

  • ./bin/elasticsearch

Next, get data from a BioGPS user/researcher

You will need to get an info sheet, factors sheet and RNAseq data/matrix file from a scientist.

Does the local dataset you are loading have gene symbols in it and is it an RNAseq dataset?

If yes, then you must run reporter_to_entrezgene.py, which will use mygene.info to replace gene symbols with Entrezgene IDs.

Entrezgene IDs are absolutely necessary for Biogps.org data display, but for microarray datasets, keep the probe set reporters.

Dataset Parsers:

  • load_ds loads remote datasets (microarray data from ArrayExpress) to the remote server for dev or prod.
  • load_ds_local loads local datasets to the remote server for dev or prod (written for RNAseq).

Run a command through Django's manage.py like this, substituting another command name for "load_ds_local" as needed:

  • python3 manage.py load_ds_local --settings=biogps_dataset.settings_dev

Then use the es_index command to index the data, after which the newly loaded dataset should appear in the chart file:

  • python3 manage.py es_index --settings=biogps_dataset.settings_dev

Use the "-c" argument if you want to clear previous indexing:

  • python3 manage.py es_index -c --settings=biogps_dataset.settings_dev

Output looks something like this:

  • added 16 platform, added 5914 dataset

Open this url and you should see bar charts!

You may sometimes need to restart the localhost server and the server that hosts the database, as well as Elasticsearch.

For help:

  • python3 manage.py load_ds --help --settings=biogps_dataset.settings_dev

Instances (models) to create during dataset loading:

If you don't know what a model is, then read about Django!

  • dataset:

    • Model with information about a certain dataset including metadata.
  • dataset_matrix:

    • is the dataset matrix that contains the entire dataset from the RNA seq run. This means you likely do not want to display an instance of this model all at once!
  • dataset_data:

    • is one reporter gene and all of its expression information for all samples.
  • Dataset Platform:

    • We created a new platform because we are now loading a sequencing (not microarray) dataset. It is a generic sequencing platform, so it does not have to be recreated every time.

    • Example input information:

      • RNA seq
      • reporters empty list
      • name = "generic RNA seq platform for mouse"
      • species = mouse

BioGPS averages the samples for you, so you do not need to supply your own averages.

Misc. information for testing/developing BioGPS:

URLs from mygene.info used to look up Entrezgene IDs from gene symbols (from reporter_to_entrezgene):

  • http://mygene.info/v2/query?q=symbol:CDK2
  • http://mygene.info/v2/query?q=symbol:0610005C13Rik
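
The symbol lookup behind these URLs can be sketched as follows. This is a minimal illustration, not the code in reporter_to_entrezgene.py: the function names are hypothetical, and it assumes the parsed JSON response carries a "hits" list whose entries may include "entrezgene" and "taxid" keys. The sample response is hand-written in that assumed shape rather than fetched from the service.

```python
from urllib.parse import urlencode

MYGENE_QUERY_URL = "http://mygene.info/v2/query"  # endpoint from the URLs listed above


def symbol_query_url(symbol):
    """Build the query URL for one gene symbol (hypothetical helper)."""
    # safe=":" keeps the "symbol:" prefix readable instead of encoding it as %3A
    return MYGENE_QUERY_URL + "?" + urlencode({"q": "symbol:" + symbol}, safe=":")


def entrez_ids(response, taxid=None):
    """Collect Entrezgene IDs from a parsed response, optionally for one species.

    Assumption: each hit is a dict that may have "entrezgene" and "taxid" keys.
    """
    ids = []
    for hit in response.get("hits", []):
        if "entrezgene" not in hit:
            continue  # hit has no Entrezgene ID, so it cannot be displayed
        if taxid is not None and hit.get("taxid") != taxid:
            continue  # skip other species
        ids.append(int(hit["entrezgene"]))
    return ids


# Hand-written sample in the assumed response shape (not real API output).
sample = {
    "hits": [
        {"symbol": "CDK2", "taxid": 9606, "entrezgene": 1017},
        {"symbol": "Cdk2", "taxid": 10090, "entrezgene": 12566},
        {"symbol": "cdk2", "taxid": 7955},
    ]
}

print(symbol_query_url("CDK2"))
print(entrez_ids(sample, taxid=10090))  # mouse only
```

Filtering by species matters because the same symbol can match genes in several organisms, as the human and mouse Cdk2 IDs in this document show.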

To access the dataset via the shell:

  • python3 manage.py shell_plus --settings=biogps_dataset.settings_dev

Run these commands from shell:

This returns the dataset object which is the foreign key for dataset data and dataset matrix:

  • ds = BiogpsDataset.objects.get(geo_gse_id="BDS_00015")

This returns all the metadata (from info sheet and factors):

  • ds.__dict__

Viewing datasets on your BioGPS localhost

The entries in the "probeset" dropdown menu are what BioGPS also calls reporters (reporter genes).

Go to the URL for the specific gene and dataset (identified by the dataset's primary key or its geo_gse_id). The geo_gse_id is also important: for a newly loaded dataset it will be BDS_XXXXX, the next number in sequence.

Example dataset viewing urls:

  • http://localhost:8000/static/data_chart.html?gene=67669&dataset=10044
  • http://localhost:8000/static/data_chart.html?gene=12566&dataset=BDS_00015
  • http://localhost:8000/static/data_chart.html?gene=100152011&dataset=10078
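
When checking many gene and dataset combinations, the chart URLs can be generated rather than typed. A small sketch, assuming the localhost port and path shown in the examples above; chart_url is a hypothetical helper, not part of the repository.

```python
from urllib.parse import urlencode


def chart_url(gene_id, dataset, base="http://localhost:8000"):
    """Build a data_chart.html URL from an Entrezgene ID and a dataset key.

    `dataset` may be the primary key (e.g. 10044) or a geo_gse_id (e.g. "BDS_00015").
    """
    query = urlencode({"gene": gene_id, "dataset": dataset})
    return base + "/static/data_chart.html?" + query


print(chart_url(12566, "BDS_00015"))
```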

Example admin:

  • http://localhost:8000/admin/dataset/biogpsdataset/2427/

Standard test gene is 1017, which is a human gene! So if you are using a mouse dataset, this will understandably be missing:

CDK2 cyclin-dependent kinase 2, Homo sapiens (human) Gene ID: 1017, updated on 6-Mar-2016

Cdk2 cyclin-dependent kinase 2, Mus musculus (house mouse) Gene ID: 12566, updated on 6-Mar-2016

You can also check the "fixed reporters" data file to see which Entrezgene IDs are actually in your dataset for viewing.

To view the full dataset (api) for a dataset and gene:

  • http://localhost:8000/dataset/full-data/geo_gse_id%20test/gene/12566/
  • http://localhost:8000/dataset/full-data/E-GEOD-16054/gene/1017/
  • http://localhost:8000/dataset/full-data/BDS_00001/gene/1017/
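
Note that the first URL percent-encodes a space in the dataset ID ("geo_gse_id test" becomes "geo_gse_id%20test"). A minimal sketch of building these API URLs with that encoding applied; full_data_url is a hypothetical helper and the base address is the localhost one used above.

```python
from urllib.parse import quote


def full_data_url(dataset_id, gene_id, base="http://localhost:8000"):
    """Build the full-data API URL for one dataset and one Entrezgene ID."""
    # safe="" percent-encodes every reserved character, including "/" and spaces
    encoded = quote(str(dataset_id), safe="")
    return "{}/dataset/full-data/{}/gene/{}/".format(base, encoded, int(gene_id))


print(full_data_url("geo_gse_id test", 12566))
print(full_data_url("E-GEOD-16054", 1017))
```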

Misc. Information

Does your dataset have interesting tissue groups or organ systems?

If so, then change the color_idx in the JSON metadata (e.g. admin/dataset/biogpsdataset/2509/) accordingly to group samples into meaningful groups. This is done manually due to the numerous variations of possible sample groupings.

Make sure to run Flake8 (to check for PEP 8 compliance) prior to pushing code to the biogps_dataset repository.

biogps_dataset's People

Contributors

cyrus0824, jtfouquier, newgene

biogps_dataset's Issues

loading error for "E-GEOD-38516"

#!python

[INFO, L:74] -process experiment E-GEOD-38516-
[INFO, L:155] get experiment info from http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/arrays.txt
[INFO, L:169] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/arrays.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.hyb.sdrf.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.seq.sdrf.txt
[INFO, L:219] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/E-GEOD-38516
[INFO, L:232] get sample file: http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-38516/E-GEOD-38516.processed.1.zip
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 84, in handle
    data_matrix = setup_dataset(e)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 268, in setup_dataset
    splited[i] = float(splited[i])
IndexError: list index out of range

loading error for "E-GEOD-26688"

#!python

[INFO, L:57] -process experiment E-GEOD-26688-
[INFO, L:138] get experiment info from http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/arrays.txt
[INFO, L:152] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/arrays.txt
[INFO, L:160] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.hyb.sdrf.txt
[INFO, L:160] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.seq.sdrf.txt
[INFO, L:202] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/E-GEOD-26688
[INFO, L:219] sample file exists
[INFO, L:69] write database
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 103, in handle
    ds_matrix = np.array(data_matrix.values(), np.float32)
ValueError: setting an array element with a sequence.
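
One common cause of this NumPy error is a matrix whose rows have unequal lengths; whether that is the cause for this experiment is not confirmed. The sketch below shows one way to check before building the array, assuming data_matrix maps each reporter to a list of values (as the call to data_matrix.values() suggests). The function name and the sample reporters are hypothetical.

```python
from collections import Counter


def ragged_rows(data_matrix):
    """Return {reporter: row_length} for rows that differ from the most common length."""
    lengths = {reporter: len(values) for reporter, values in data_matrix.items()}
    if not lengths:
        return {}
    # Treat the most frequent row length as the expected number of samples
    expected, _count = Counter(lengths.values()).most_common(1)[0]
    return {reporter: n for reporter, n in lengths.items() if n != expected}


# Hypothetical reporters and values for illustration only
matrix = {
    "reporter_1": [1.0, 2.0, 3.0],
    "reporter_2": [4.0, 5.0],  # one value short
    "reporter_3": [7.0, 8.0, 9.0],
}
print(ragged_rows(matrix))
```

An empty result means every row has the same length, so the cause would have to be looked for elsewhere (for example, a cell that itself holds a list).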

Missing default datasets

When searching for Tmem173 in humans, only 2 of the usual 6 default human datasets are displayed. Gene Atlas U133A, gcrma is NOT among the displayed default datasets, although it used to be (and this missing default dataset does contain values for this gene).


Load E-GEOD-12199 which used to be in BioGPS, but isn't any more

Used to be here according to a user: http://biogps.org/dataset/844/expression-data-from-breast-cancer-cell-lines-mcf-7-and-mda-mb-231/

Can get it from here:
http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-12199/

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE12199

Assigning to Chunlei, since I don't know who takes care of these. Please assign to appropriate person, if not you.


loading error for "E-GEOD-49355"

#!python

[INFO, L:74] -process experiment E-GEOD-49355-
[INFO, L:155] get experiment info from http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/arrays.txt
[INFO, L:169] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/arrays.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.hyb.sdrf.txt
[INFO, L:177] get experiment sdrf file from http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-22069/E-GEOD-22069.seq.sdrf.txt
[INFO, L:219] get experiment file info from http://www.ebi.ac.uk/arrayexpress/json/v2/files/E-GEOD-49355
[INFO, L:232] get sample file: http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-49355/E-GEOD-49355.processed.1.zip
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/home/cwu/opt/bgpsdatapy/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 84, in handle
    data_matrix = setup_dataset(e)
  File "/home/cwu/prj/biogps/biogps_dataset/dataset/management/commands/start.py", line 271, in setup_dataset
    raise Exception, 'file format wrong, check columns of file:%s'%(path+'/'+f)
Exception: file format wrong, check columns of file:tmp/unzip_sample/E-GEOD-49355/GSM1197996_sample_table.txt


switch to use "requests" and "requests_cache" modules to call web services

http://docs.python-requests.org/en/latest/

https://github.com/reclosedev/requests-cache

"requests" is a much better module than the built-in urllib and urllib2.

"requests_cache" provides a cache for requests, regardless of the actual caching headers. Very useful for both development and actual data loading.

Example code:

#!python

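import requests
import requests_cache

# Cache every response locally so repeated calls skip the network
requests_cache.install_cache('arrayexpress_cache')

# JSON metadata for one experiment
res = requests.get('http://www.ebi.ac.uk/arrayexpress/json/v2/experiments/E-GEOD-19279')
data = res.json()

# Tab-delimited SDRF file; note that .text is a property, not a method
res = requests.get('http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-19279/E-GEOD-19279.sdrf.txt')
data = res.text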


load more complete metadata for CCLE data set (E-GEOD-36133)

The CCLE data set is loaded in BioGPS here: http://biogps.org/dataset/E-GEOD-36133/expression-data-from-the-cancer-cell-line-encyclop/

But the sample metadata table only has these columns: Sample, HISTOLOGY SUBTYPE1, HISTOLOGY, PRIMARY SITE

From the raw metadata file at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-36133/E-GEOD-36133.sdrf.txt), we need to add column 4, "Comment [Sample_title]", which is the cell line name.
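
Pulling that column out of the SDRF text could look roughly like this. It is a sketch under stated assumptions: the file is tab-delimited with a header row, the helper name is hypothetical, and the sample rows are made up for illustration rather than taken from the real E-GEOD-36133 file. Looking the column up by header name avoids depending on it staying in position 4.

```python
import csv
import io


def sdrf_column(sdrf_text, header):
    """Return the values of the first column named `header` in tab-delimited SDRF text."""
    rows = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    columns = next(rows)  # header row
    idx = columns.index(header)  # raises ValueError if the column is absent
    # Skip rows too short to contain the column instead of failing on them
    return [row[idx] for row in rows if len(row) > idx]


# Made-up sample in SDRF layout; real files have many more columns.
sample_sdrf = (
    "Source Name\tComment [Sample_title]\n"
    "SAMPLE_1\tCELL_LINE_A\n"
    "SAMPLE_2\tCELL_LINE_B\n"
)
print(sdrf_column(sample_sdrf, "Comment [Sample_title]"))
```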
