
Tripl

A data format for "all the things", inspired by Datomic and the Semantic Web that

  • has an easy document-store like write semantic
  • can express arbitrary graph data
  • enables explicit, global meaning and context
  • is extensible and polymorphic
  • primarily targets JSON for reach and interoperability
  • rests upon the theoretical underpinnings in RDF, with a simpler data-oriented buy-in model

Tripl can be created and used from any language with a natural interpretation of JSON data. However, getting the most out of the implied graph structure of the data requires some tooling. (Though not much! The first passable version of this tool was only 120 LOC!)

This repository contains a Python library for working with this data programmatically, as well as a command line tool for working with it from the shell.

Conceptual model

At the heart of the Semantic Web is a data language called the Resource Description Framework, or RDF. Parts of RDF can be a bit complicated, but some of its core ideas about data modelling are very valuable for data scientists and other programmers.

RDF is based on the Entity Attribute Value (EAV) data modelling pattern, in which facts about entities are stored as a set of (entity, attribute, value) triples. It's possible in this framework to have attributes that point from one entity to another (e.g. (entity1, attribute, entity2)). This gives EAV and thus RDF the ability to very flexibly model arbitrary graph relationships between data.
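
To make this concrete, here is the bare EAV idea in plain Python. This is just an illustration of the pattern, not Tripl's API or internal representation:

# Each fact is an (entity, attribute, value) triple.
facts = [
    ('person-1', 'person:name', 'Ada'),
    ('person-2', 'person:name', 'Byron'),
    # A reference attribute: the value is another entity's id.
    ('person-1', 'person:parent', 'person-2')]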

It's worth contrasting this approach to data modelling with that of SQL and NoSQL databases. In SQL databases, entities are stored as rows in tables, and we end up locked into the structure of those tables and their relationships. Meanwhile, NoSQL databases allow us great flexibility in how we organize collections of documents, but we end up locked into the de facto schema we create in the process.

With EAV, any entity can have any attribute assigned to it, and attributes between entities (reference attributes) can point to any entity they like. This means that EAV and RDF are inherently and effortlessly polymorphic. But because everything is being represented under the hood as a collection of simple triples/facts, we end up being less locked into the ways we initially organize our relationships between entities.
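
Continuing the toy triples above (again, an illustration rather than Tripl's API), polymorphism falls out for free: giving an entity a new attribute is just another triple, and a simple query is just a filter over the set:

# Any entity can pick up any attribute; no table or schema migration needed.
facts.append(('person-2', 'poet:movement', 'Romanticism'))

# A minimal 'query': which entities have a person:parent attribute?
children = [e for (e, a, v) in facts if a == 'person:parent']
print(children)  # ['person-1']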

The tooling

There is a catch, though. To get anything meaningful out of a raw set of EAV triples, we really need some tooling: a query language and a convenient write semantic. Additionally, this data tends to be most useful when we know some things about the attributes of our data beforehand, so we would like a way for this tooling to interpret some schema for the attributes (ultimately much easier than specifying a SQL schema).

This Python library provides both a simple query language and a write semantic. The data itself, though, can more or less be used from anywhere.

Usage

Keep in mind that this is still in draft, so details may change, but the flavor is as follows.

For starters, let's create our triple store:

from tripl import tripl
ts = tripl.TripleStore()

Let's start by imagining we have a project named cft. We have sequences, timepoints, and subjects. We might come up with the following attributes to describe our data:

['cft:type', 'cft.subject:id', 'cft.timepoint:id', 'cft.seq:id', 'cft.seq:string', 'cft:description', 'cft.seq:timepoint', 'cft.seq:subject']

Note that each of these attributes is, in and of itself, more or less self-descriptive. If something has a cft.seq:timepoint attribute, it's clear that it is a sequence and has timepoint data associated with it.

We can use that to describe Tripl data as follows:

data = [# some subjects
        {'cft.subject:id': 'QA255', 'cft:type': 'cft.type:subject'},
        {'cft.subject:id': 'QA344', 'cft:type': 'cft.type:subject'},
        # sequence and timepoint data
        {'cft.seq:id': 'QA255-092.Vh',
         'cft:type': 'cft.type:seq',
         'cft:description': 'seed sequence for patient QA255',
         'cft.seq:string': 'AGCGGTGAGCTGA',
         'cft.seq:subject': {'cft.subject:id': 'QA255'},
         'cft.seq:timepoint': [
             {'cft.timepoint:id': 'seed-sample', 'cft:type': 'cft.type:timepoint'},
             {'cft.timepoint:id': 'dpi1204', 'cft:type': 'cft.type:timepoint'}]},
        {'cft.seq:id': '15423-1',
         'cft:type': 'cft.type:seq',
         'cft.seq:string': 'AGCGGTGAGCTGA',
         'cft.seq:subject': {'cft.subject:id': 'QA255'},
         'cft.seq:timepoint': [
             {'cft.timepoint:id': 'dpi234', 'cft:type': 'cft.type:timepoint'},
             {'cft.timepoint:id': 'dpi1204', 'cft:type': 'cft.type:timepoint'}]},
        {'cft.seq:id': '1534-2',
         'cft:type': 'cft.type:seq',
         'cft.seq:string': 'AGCGGTGAGCTGA',
         'cft.seq:subject': {'cft.subject:id': 'QA344'},
         'cft.seq:timepoint': [
             {'cft.timepoint:id': 'L1', 'cft:type': 'cft.type:timepoint'}]}]

There's only one catch here: each map is going to get a new entity. While it's likely clear that the intent of this data structure is for each instance of {'cft.subject:id': _} to correspond to a single entity, the data hasn't explicitly told us this.

There are three things we can do to achieve this:

  1. Simply create a unique ident for these entities (say, via import uuid; uuid.uuid1())
import uuid

subject_255 = uuid.uuid1()

data = [{'db:ident': subject_255, 'cft.subject:id': 'QA255', 'cft:type': 'cft.type:subject'},
        # ...
        {'cft.seq:id': 'QA255-092.Vh',
         'cft:type': 'cft.type:seq',
         'cft:description': 'seed sequence for patient QA255',
         'cft.seq:string': 'AGCGGTGAGCTGA',
         # We can refer to that ident directly, or as a `{'db:ident': subject_255}` dict
         'cft.seq:subject': subject_255,
         'cft.seq:timepoint': [
             {'cft.timepoint:id': 'seed-sample', 'cft:type': 'cft.type:timepoint'},
             {'cft.timepoint:id': 'dpi1204', 'cft:type': 'cft.type:timepoint'}]},
        # ...
        ]
  2. When we assert this data, we can specify that attributes like cft.timepoint:id should be considered unique within the context of that assertion, using the id_attrs option.
data = [
        # as before...
        ]

# Using id_attrs
ts.assert_facts(data, id_attrs=['cft.timepoint:id', 'cft.seq:id', 'cft.subject:id'])
  3. Identity attributes

For attributes we wish to be unique, we should also be able to specify schema asserting this, which effectively fixes the id_attrs setting for us (and, as we'll see, as part of the data itself). However, this should be employed with care: as soon as you have a uniqueness constraint like this, it becomes difficult to (e.g.) compare datasets which might contain overlapping values. For this reason I suggest sticking with the two methods above. Usually, if you are asserting information about something that has already been created, you'll have it in a dictionary, and can just use that_dict['db:ident'] to get the identity for a new set of assertions without too much difficulty. Option 2 helps us deal with the process of creating and asserting a particular set of facts from within our language. There's even been some research on contextual ontological constraints, which would effectively allow you to say "within a particular data set, such-and-such ids are unique". However, doing this raises a lot of questions, and I think the W3C jury is still out on a recommendation here.
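
For completeness, here is a sketch of what option 3 might look like. The special schema attribute names aren't finalized (see the issues below), so db:unique here is a hypothetical placeholder showing the shape of the idea, not settled API:

# Hypothetical sketch: 'db:unique' is a placeholder name, not settled API.
schema = [
    {'db:ident': 'cft.subject:id', 'db:unique': 'db.unique:identity'}]

# With such schema installed, asserting {'cft.subject:id': 'QA255', ...}
# would upsert the existing entity rather than create a new one, without
# having to pass id_attrs on every call.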

Schema

In any case, schema is still a good idea! For one thing, we may wish to specify that the default cardinality should be db.cardinality:one (which is, TBQH, better from a data modelling perspective).

schema = [
    {'db:ident': 'cft.seq:timepoint', 'db:cardinality': 'db.cardinality:one'}]

It's worth pointing out here that we're implicitly treating each of these cft.timepoint:id values as unique identifiers in the context of this data.

# First let's construct some helpers for creating and working with this data

def cft_cons(name):
    return tripl.entity_cons('cft.type:' + name, 'cft.' + name)

subject = cft_cons('subject')
seq = cft_cons('seq')
timepoint = cft_cons('timepoint')
tree_node = tripl.entity_cons('cft.type:tree_node', 'cft.tree.node')


# Next our schema

schema = {
   'cft.seq:timepoint': {'db:valueType': 'db.type:ref',
                         'db:cardinality': 'db.cardinality:many'},
   'cft.seq:subject': {'db:valueType': 'db.type:ref'}}


# Now let's construct a store with this schema; we'll transact our data into it below

ts = tripl.TripleStore(schema=schema, default_cardinality='db.cardinality:one')


# Now we can see our constructors in action :-)

ts.assert_facts([
    subject(id='QA255'),
    subject(id='QA344'),
    seq(id='QA255-092.Vh',
        seq='AGCGGTGAGCTGA',
        timepoint=[timepoint(id='seed-sample'), timepoint(id='dpi1204')],
        **{'cft:description': 'seed sequence for patient QA255'}),
    seq(id='15423-1',
        seq='AGCGGTGAGCTGA',
        timepoint=[timepoint(id='dpi234'), timepoint(id='dpi1204')]),
    seq(id='1534-2',
        seq='AGCGGTGAGCTGA',
        timepoint=[timepoint(id='L1')])],
    id_attrs=['cft.timepoint:id', 'cft.seq:id', 'cft.subject:id'])


# We can query data using a pull query specifying what attributes and references you'd like to extract
pull_expr = ['db:ident', 'cft.seq:id', {'cft.seq:timepoint': ['cft.timepoint:id']}]
pull_data = ts.pull_many(pull_expr, {'cft:type': 'cft.type:seq'})
import pprint
pprint.pprint(list(pull_data))


# Prints out the following:
#
#    [{'cft.seq:id': '1534-2',
#      'cft.seq:timepoint': [{'cft.timepoint:id': 'L1'}]},
#     {'cft.seq:id': '15423-1',
#      'cft.seq:timepoint': [{'cft.timepoint:id': 'dpi1204'},
#                            {'cft.timepoint:id': 'dpi234'}]},
#     {'cft.seq:id': 'QA255-092.Vh',
#      'cft.seq:timepoint': [{'cft.timepoint:id': 'dpi1204'},
#                            {'cft.timepoint:id': 'seed-sample'}]}]


# Save out to file
ts.dump('test.json')

# Reload; note that schema is persisted
ts2 = tripl.TripleStore.load('test.json')

# Reproducibility :-)
pprint.pprint(list(ts2.pull_many(pull_expr, {'cft:type': 'cft.type:seq'})))

# We can also do reverse reference lookups, by using `_` after the namespace separator
pull_expr = ['cft.timepoint:id', {'cft.seq:_timepoint': ['*']}]
pprint.pprint(
    list(ts2.pull_many(pull_expr, {'cft:type': 'cft.type:timepoint'})))


# We also have an entity API
e = ts.entity({'cft.timepoint:id': 'seed-sample'})

# These behave as dict-like views over the EAV index, which update as the store updates.

print(e['cft.timepoint:id'])
pprint.pprint(e['cft.seq:_timepoint'])
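
To illustrate what "live" means here, a quick sketch; cft.timepoint:note is a made-up attribute for this example, but assert_facts and the entity lookup are the same calls used above:

# Assert a new fact about the same timepoint entity...
ts.assert_facts(
    [{'cft.timepoint:id': 'seed-sample', 'cft.timepoint:note': 'first draw'}],
    id_attrs=['cft.timepoint:id'])

# ...and the existing entity view reflects it immediately.
print(e['cft.timepoint:note'])  # 'first draw'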

That's all for now! Stay Tuned!

Tests

To run the tests, you can either install pytest and run it on the tests/ directory (which happens by default if you simply run pytest), or run python2 setup.py test, which will also install pytest and any other required dependencies.

tripl's People

Contributors

devurandom, eharkins, metasoarous, phiweger, teodorlu


tripl's Issues

Add documentation specific to bioinformatic uses

Probably a wiki page or .md file in docs. It should go a little more in depth into how these ideas play out in bioinformatics, and what they might look like in a bioinformatic context, with tree/fasta/csv examples. It should also make reference to the nestly work.

Assert by reverse lookup attribute

Right now it's possible to query reverse relationships by using an underscore after the namespace separator (so the reverse relation for person:parent would be person:_parent, and would map from parents to children). This should be possible for assertions as well. Imagine you wanted to describe a mother with 10 children.

data = [
    {'person:name': 'Momma Jones',
     'person:_parent': [{'person:name': 'Little Joe Jones'}, {'person:name': 'Wilma Jenkins'}, ...]}]

tripl/tripl.py", line 586, in _entity_lookup - AttributeError: 'NoneType' object has no attribute 'keys'

The following code (mostly copy & paste from the readme):

from tripl import tripl


def cft_cons(name):
    return tripl.entity_cons('cft.type:' + name, 'cft.' + name)


def main():
    subject = cft_cons('subject')

    # Next our schema

    schema = {
        'cft.seq:timepoint': {'db:valueType': 'db.type:ref',
                              'db:cardinality': 'db.cardinality:many'},
        'cft.seq:subject': {'db:valueType': 'db.type:ref'}}

    ts = tripl.TripleStore(schema=schema, default_cardinality='db.cardinality:one')

    ts.assert_facts([
        subject(id='QA255')],
        id_attrs=['cft.timepoint:id', 'cft.seq:id', 'cft.subject:id'])

Causes an exception:

Traceback (most recent call last):
  File ".../venv/bin/...", line 11, in <module>
    load_entry_point('...', 'console_scripts', '...')()
  File ".../.../__init__.py", line 38, in main
    id_attrs=['cft.timepoint:id', 'cft.seq:id', 'cft.subject:id'])
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 521, in assert_facts
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 499, in assert_fact
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 472, in _assert_dict
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 448, in _resolve_eid
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 448, in <dictcomp>
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 591, in match
  File "build/bdist.linux-x86_64/egg/tripl/tripl.py", line 586, in _entity_lookup
AttributeError: 'NoneType' object has no attribute 'keys'

Do you have a hint at what might be causing this?

I am using Python 2.7.16 and unmodified Tripl from master branch.

namespaced/keywords instead of namespaced:keywords

My argument is readability: when you specify a dict there are many ":" characters, which (imho) decreases readability; compare

        'mock:type': 'mock.type:seq',
        'mock.seq:id': 'a1',
        'mock.seq:string': 'ACTGA',
        'mock:description': 'some foo from bar',

with

        'mock/type': 'mock.type:seq',
        'mock.seq/id': 'a1',
        'mock.seq/string': 'ACTGA',
        'mock/description': 'some foo from bar',

Document entity API

There's an entity API via tp.entity(entity_id) that lets you traverse the graph as a "live" graph of connected dicts. There might also need to be some cardinality or reverse-lookup ref work here, but whatever the case, the details should be documented in the README.

Finalize all special schema attribute names

Once we're out of alpha/beta we should just NEVER BREAK THE SCHEMA! Otherwise you end up having to figure out how to load different data in different versions, and that just sucks. So we have to settle on all these little things, like: what do we call our primary key? db:ident? tripl:id? Where do we install the schema? On a tripl:schema ident?

Build out the bio namespace

We want a collection of utilities for representing and working with sequence, tree and tabular data. This will involve some data modelling work, tooling for slurping/spitting to standard formats, and in the case of ingest, linking/relating it to the rest of the data (I'm imagining being able to specify a join on some sequence data and a CSV metadata file and representing that as triples, for example). There's also a lot of room here for tooling at the build pipeline level, since these things tend to get into the semantics of the actual data ("for each subject, for each cell cluster, for each ..."; I'll probably specifically build out some thing along these lines for nestly). These things are likely going to have to get broken up into smaller pieces, so this is a bit of an epic issue.
