
cgs-data

Databases in CGS (HBase, MySQL, Metastore) + additional storage files (AVRO). This package is part of the CGS project and aims to generate database "contents" in the Hadoop framework (based on the Cloudera distribution): table creation, data file conversion, table field mapping, etc.

cgs-data is

  • flexible: any data structure can be imported into the CGS system (see below for more information).
  • extendable: table generators can be written for any type of database (so far MySQL and HBase are supported, along with the TSV file format).

More specifically, this document describes the data in CGS: how they are created and how they are stored. Scripts to import data into the different tables are available and are described below.

Clinical data are stored in MySQL tables. Genomics/Exomics data are stored in HBase (accessible via Hive Metastore) and in AVRO files.

Generating the tables requires the corresponding access rights.

For the moment, only variant information is meant to be stored in HBase and AVRO files (see the AVRO subsection below for more details about the AVRO format).

Installation

One can install cgs-data with the following command:

pip install git+https://github.com/BRiDGEIris/cgs-data.git#egg=cgsdata

To upgrade an existing installation:

pip install --upgrade git+https://github.com/BRiDGEIris/cgs-data.git#egg=cgsdata

To make use of this package from another package (this is typically done when an application in cgs-apps needs to be aware of the data structure), first install it on your system and then import it in your application as you would any other Python package:

import cgsdata
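For instance, to check that the package is visible from your environment (a trivial smoke test; the attributes available beyond this depend on the installed version):

import cgsdata

# Prints the location of the installed package.
print(cgsdata.__file__)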

Contribute

To contribute to the project, clone the repository on your machine:

git clone https://github.com/BRiDGEIris/cgs-data.git

Install git flow. This is not mandatory if you only work on the master branch, but it makes branch management much easier. More information is available in the git flow documentation.

git flow init

Make sure that you develop on the develop branch:

git branch

A star should appear in front of the current branch.

Then, from the package root directory, create a new release with

sudo ./update_version.sh 0.1.2

Replace 0.1.2 with the actual version. The update_version.sh script starts a release branch, creates the docs, commits the doc changes, and finishes the release branch. Finishing the release branch generates a tag on the master branch.
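Conceptually, the script runs something like the following sequence (a simplified sketch; see update_version.sh itself for the exact steps):

git flow release start 0.1.2        # open a release branch
make docs                           # regenerate the documentation
git commit -am "docs for release 0.1.2"
git flow release finish 0.1.2       # merge back and tag master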

A Makefile is also available for building a python virtual environment (sudo make build), generating docs (sudo make docs) or testing the package (sudo make tests).

The main goal of this package is to be usable with any type of data schema. If you are not happy with the proposed data structures, you can define your own: it is simple, you just need to edit a few configuration files.

How does it work? A parser reads a config file that lists the data substructures to be created, their types (HBase, MySQL, etc.) and the source files where their definitions can be found. A data substructure can be, for instance, a set of tables in MySQL or a single table in HBase. There are no restrictions on a data substructure except that all the tables it contains must belong to the same database. The parser then connects to your Hadoop cluster/machine (this can be done locally) and installs the different data substructures as defined in your config file. The config file is described in the subsection Config file.

Connections between tables are likely to happen, and the corresponding field names may be confusing. We consider two levels of mapping, or relationships, between fields in our data structure: relationships within a data substructure and relationships between data substructures. Relationships within a data substructure are defined in the source file corresponding to that substructure, whereas relationships between data substructures are defined in a separate file called reference.yml. For instance, the "sample_ID" field in a metastore table may correspond to the "ID" field in your MySQL "sample" table; this is a "between" relationship. The reference file is described in the subsection File with reference fields.

All files for one data structure are located in one folder: the config file, the reference file and the source files. More information about the organization of this folder can be found in the subsection Data structure folder tree.
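For a rough idea of what the parser does with the config file, here is a minimal sketch using PyYAML (illustrative only; the actual parser does much more, such as connecting to the cluster and creating the tables):

import yaml

# Read a data structure definition and list its substructures.
with open('variants/config.yml') as f:
    config = yaml.safe_load(f)

print(config['dataStructureName'])
for substructure in config['substructures']:
    # Each substructure points to a source file defining its tables.
    print(substructure['name'], substructure['type'], substructure['source'])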

Data structure folder tree

As mentioned previously, all files specific to a data structure are located in a single folder named after the data structure. Here is an example of the files you will find in the variants data structure:

.
+-- config.yml
+-- reference.yml
+-- sourceFiles
	+-- variants_hbase.yml
	+-- variants_metastore.yml
	+-- variants_api.yml
	+-- clinical_mysql.sql
	+-- ...

Config file

The config file is a YAML file with the following structure:

dataStructureName: variants
substructures:
  - name: clinical
    type: mysql
    description: clinical/phenomic database
    database: default
    source: clinical.sql
    host: localhost
    user: admin
    password: admin

  - name: variants_hbase
    type: hbase
    description: variants table
    database: default
    source: variants_hbase.yml
    host: 192.168.1.42
    user: admin
    password: admin

  - name: variants_metastore
    type: metastore
    description: variants in hive metastore
    database: default
    source: variants_metastore.yml
    host: 192.168.1.42
    user: admin
    password: admin

  - name: variants_api
    type: api
    description: variants resource available through API
    database: NULL
    source: variants_api.yml
    host: 192.168.1.42
    user: admin
    password: admin

where

  • dataStructureName is the name of the data structure and thus the name of the folder
  • name is the data substructure name
  • type is the data substructure type (HBase, MySQL, Metastore, AVRO)
  • description is the description of the data substructure
  • database is the name of the database in which the table(s) will be created
  • source is the source file in which the data substructure is defined
  • host is the IP address or URL of the host machine
  • user is the user name used to connect to the database
  • password is the password used to connect to the database

No database will be created or updated for API resources. The corresponding YAML file (here variants_api.yml) will be copied to the ~/.cgs directory of the machine; it is used to map API resources to fields in our data structure. Note that all information about the data substructures will be copied into a configuration file at ~/.cgs/cgs_config_file. This file should be protected (e.g. chmod 600 ~/.cgs/cgs_config_file). Here is what the file will look like:

[clinical]
type=mysql
database=default
host=localhost
port=3306
user=admin
password=admin
[variants_hbase]
type=hbase
database=default
host=localhost
port=60000
user=admin
password=admin
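Applications can then read these settings back with Python's standard configparser module; for example (a minimal sketch):

import configparser
import os

# ~/.cgs/cgs_config_file contains one section per data substructure.
config = configparser.ConfigParser()
config.read(os.path.expanduser('~/.cgs/cgs_config_file'))

print(config['clinical']['host'])        # localhost
print(config['variants_hbase']['port'])  # 60000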

File with reference fields

The reference file is a YAML file with the following structure:

<variable>:
	description: <description>
	substructures:
		- <data_substructure>: <field_name>
		- <data_substructure>: <field_name>
		- ...
<variable>:
	description: <description>
	substructures:
		- <data_substructure>: <field_name>
		- <data_substructure>: <field_name>
		- ...

Here is a small example for illustration:

gene_symbol:
	description: "Gene symbol."
	substructures:
	  - variants_api: variants.info.gene_symbol
	  - variants_metastore: variants.info.gene_symbol
	  - variants_hbase: R.GS
pos:
	description: "Position : chromosome and position gives the contig on which the variant occurs. For indels this is actually the base preceding the event."
	substructures:
	  - variants_api: variants.start
	  - variants_metastore: variants.start
	  - variants_hbase: R.P
project_sample_id:
	description: "ID of the sample in the project."
	substructures:
	  - variants_api: readGroupSets.readGroups.sampleID 
	  - variants_metastore: readGroupSets.readGroups.sampleID
	  - variants_hbase: R.SI
	  - clinical_mysql: sample.sample_id

where

  • description: description of the field/variable (string, mandatory),

  • substructures: data substructure names where the field/variable is found, names should be the same as in the config file config.yml (string, mandatory),

    • The field name is (usually) a construction of <table_name>.<field> (string, mandatory).

The goal of this file is to list all fields that are common to different data substructures. For instance, the field R.SI in HBase and the field sample.sample_id in MySQL refer to the same thing, namely the project_sample_id.
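As an illustration, here is a minimal sketch of how reference.yml could be used to translate a logical variable into the field name of a given substructure (a hypothetical helper written for this example, not part of the package API):

import yaml

with open('variants/reference.yml') as f:
    reference = yaml.safe_load(f)

def field_name(variable, substructure):
    # Each entry under 'substructures' is a one-key mapping,
    # e.g. {'variants_hbase': 'R.SI'}.
    for mapping in reference[variable]['substructures']:
        if substructure in mapping:
            return mapping[substructure]
    raise KeyError('%s is not mapped in %s' % (variable, substructure))

print(field_name('project_sample_id', 'variants_hbase'))  # R.SI
print(field_name('project_sample_id', 'clinical_mysql'))  # sample.sample_id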

Source files

The source files are located in the sourceFiles folder of the corresponding data structure folder. The syntax used in these files depends on the type of the table/schema. We detail the syntax for HBase, MySQL, the Metastore and AVRO in the following subsections.

HBase

A source file for HBase is a YAML file with the following structure:

<table_name>:
	- <column_family>:
	  - <column_name>:
		type: <type>
		description: <description> 
	  - <column_name>:
		type: <type>
		description: <description>
	  - ...
	- <column_family>:
	  - <column_name>:
		type: <type>
		description: <description>
	- ...	
<table_name>:
	- <column_family>:
	  - <column_name>:
		type: <type>
		description: <description>

Here is a small example for illustration:

variants:
	- R:
	  - SI:
		type: string
		description: "sample id"
	  - GS:
		type: string
		description: "gene symbol"
	- I:
	  - SB:
		type: float
		description: "strand bias"

where

  • table_name: table name in HBase (string)
  • column_family: name of the HBase column family (string)
  • column_name: name of the column (qualifier) inside the family (string)
  • type: type of the field (string)
  • description: description of the field (string)
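To make this concrete, here is a minimal sketch of what a generator for the schema above might do, using the happybase library (an illustrative assumption; the actual cgs-data generator may rely on a different HBase client):

import happybase

# Requires the HBase Thrift server to be reachable on the host.
connection = happybase.Connection('192.168.1.42')

# One entry per column family from the schema above; the column
# qualifiers (SI, GS, SB) are created implicitly when values are written.
connection.create_table('variants', {'R': dict(), 'I': dict()})

table = connection.table('variants')
table.put(b'row-1', {b'R:SI': b'sample42', b'R:GS': b'BRCA1'})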

MySQL

For MySQL, two formats are possible: .sql or .yml. The .sql file is what you get when exporting a schema from MySQL (for instance with mysqldump --no-data). The .yml format is largely inspired by the Doctrine project and should look similar to this (not all options are shown here):

<table_name>:
	columns:
		<column_name>:
			type: <type>
			primary: <true/false>
			autoincrement: <true/false>
		<column_name>:
			type: <type>
			relations:
				<foreign_table_name>:
					foreign: <foreign_column_name>
<table_name>:
	columns:
		<column_name>:
			type: <type>
			primary: <true/false>
			autoincrement: <true/false>
...		

Here is a small example for illustration:

sample:
	columns:
		sample_id:
			type: integer
			primary: true
			autoincrement: true
		patient_id:
			type: integer
			relations:
				patient:
					foreign: patient_id

patient:
	columns:
		patient_id:
			type: integer
		patient_name:
			type: string

In the example above, we have two tables, sample and patient, with a one-to-many relationship between patient and sample: several sample_id values can be associated with the same patient_id.
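Roughly speaking, the .yml example above corresponds to DDL like the following (a hand-written approximation, not the exact output of the generator):

CREATE TABLE patient (
    patient_id INT PRIMARY KEY,  -- must be indexed to be referenced
    patient_name VARCHAR(255)
);

CREATE TABLE sample (
    sample_id INT AUTO_INCREMENT PRIMARY KEY,
    patient_id INT,
    FOREIGN KEY (patient_id) REFERENCES patient(patient_id)
);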

Metastore

The source file is a .yml file comparable to what we have for MySQL (see the MySQL subsection).

AVRO

An AVRO schema is not per se a table in a database but a file with the .avsc extension. More documentation about the AVRO schema format can be found in the Apache Avro specification.

The source file is a .yml file comparable to what we have for MySQL (see the MySQL subsection).
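For illustration, a minimal .avsc schema covering two of the variant fields used earlier might look like this (a hand-written example, not a schema shipped with the package):

{
  "type": "record",
  "name": "Variant",
  "fields": [
    {"name": "sampleId", "type": "string"},
    {"name": "geneSymbol", "type": ["null", "string"], "default": null}
  ]
}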

Importing data into CGS

Both clinical and genomic data can be imported into CGS. Clinical data do not require specific processing and can be added to the relational databases on the fly. Genomic data, on the other hand, require additional processing before they can be uploaded into CGS. In the CGS data structure called 'cgs-data-variants', genomic data are stored in HBase and come from VCF files. The data workflow for the VCF files is the following:

(Figure: VCF data workflow)

The VCF file is first converted into a JSON file, which is in turn converted into an AVRO file, as explained in the subsection Conversion of VCF files to AVRO. Once the AVRO files are generated, they are loaded into HBase (using bulk load), as described in the subsection Bulk load of AVRO files into HBase. A Hive metastore table is then defined on top of HBase, as described in the subsection Metastore table definition on top of HBase.

Conversion of VCF files to AVRO

The conversion of the VCF files into AVRO files is done in two steps:

  • first, the VCF files are converted into a JSON format
  • second, the JSON files are converted into AVRO.

The AVRO files are stored in HDFS, whereas the JSON files are deleted on the fly (by default).
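As an illustration of the second step, here is a minimal sketch that writes JSON records to an AVRO file with the fastavro library (an assumption made for this example; the package's converter may use another Avro implementation):

import json
from fastavro import parse_schema, writer

schema = parse_schema({
    'type': 'record',
    'name': 'Variant',
    'fields': [{'name': 'sampleId', 'type': 'string'}],
})

# One JSON object per line, as produced by the VCF-to-JSON step.
with open('variants.json') as f:
    records = [json.loads(line) for line in f]

with open('variants.avro', 'wb') as out:
    writer(out, schema, records)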

The AVRO format has the advantage of being extendable and changeable without breaking anything. If one decides tomorrow to migrate from HBase to another NoSQL solution, the migration can start again from the AVRO files. Similarly, if field names change or the field structure is redesigned, the AVRO format is a convenient fallback.

Bulk load of AVRO files into HBase

Under development.

Metastore table definition on top of HBase

Under development.

Implementing security in HBase

Under development.
