Giter VIP home page Giter VIP logo

Lior's GitHub statsLior's GitHub streak-stats


Coverage Status GitHub license

Run

Latest News

Project YANG-DB

Members:
Contributors:
Evangelist:
License:

GitHub license

Code Coverage:

Coverage Status

Dependencies Tags:

Infrastructure Technologies

Introduction

A Post introducing our new Open source initiative for building a Scalable Distributed Graph DB Over Opensearch https://www.linkedin.com/pulse/making-db-lior-perry/

Another usage of Opensearch as a graph DB https://medium.com/@imriqwe/elasticsearch-as-a-graph-database-bc0eee7f7622

The world of graph databases has had a tremendous impact during the last few years, in particularity relating to social networks and their effect of our everyday activity.

The once mighty (and lonely) RDBMS is now obliged to make room for an emerging and increasingly important partner in the data center: the graph database.

Twitter’s using it, Facebook’s using it, even online dating sites are using it; they are using a relationship graphs. After all, social is social, and ultimately, it’s all about relationships.

There are two main elements that distinguish graph technology: storage and processing.

Graph DB - Storage

Graph storage commonly refers to the structure of the database that contains graph data.

Such graph storage is optimized for graphs in many aspects, ensuring that data is stored efficiently, keeping nodes and relationships close to each other in the actual physical layer.

Graph storage is classified as non-native when the storage comes from an outside source, such as a relational, columnar or any other type of database (most cases a NoSQL store is preferable)

Non-native graph databases usually comprise of existing relational, document and key value stores, adapted for the graph data model query scenarios.

Graph DB - Processing

Graph Processing includes accessing the graph, traversing the vertices & edges and collecting the results.

A traversal is how you query a graph, navigating from starting nodes to related nodes, following relationships according to some rules.

finding answers to questions like "what music do my friends like that I don’t yet own?"

Graph Models

One of the more popular models for representing a graph is the Property Model.

Property model

This model contains connected entities (the nodes) which can hold any number of attributes (key-value-pairs).

Nodes

Nodes have a unique id and list of attributes represent their features and content.

Nodes can be marked with labels representing their different roles in your domain. In addition to relationship properties, labels can also serve metadata over graph elements.

Nodes are often used to represent entities but depending on the domain relationships may be used for that purpose as well.

Relationships

Relationship is represented by the source and target node they are connecting and in case of multiple connections between the same vertices – additional label of property to distinguish (type of relationship)

Relationships organize nodes into arbitrary structures, allowing a graph to resemble a list, a tree, a map, or a compound entity — any of which may be combined into yet more complex structures.

Very much like foreign keys between tables in relational DB model, In the graph model relationship describes the relations between the vertices.

One major difference in this model (compared to the strict relational schema) is that this schema-less structure enables adding / removing relationship between vertices without any constraints.

Additional graph model is the Resource Description Framework (RDF) model.

Why Opensearch

Our use-case is in the domain of the social networks. A very large social graph that must be frequently updated and available for both:

  • simple (mostly textual) search

  • graph based queries.

All the read & write are made in concurrency with reasonable response time and ever growing throughput.

The first requirement was fulfilled using Opensearch – a well known and established NoSql document search and storage engine capable of containing very large volume of data.

For the second requirement we decided that our best solution would be to use Opensearch as the non-native graph-DB storage layer.

As mentioned before, a graph-DB storage layer can be implemented using a non-native storage such as NoSql storage.

In future discussion I’ll get into details why the most popular community alternative for graph-DB – Neo4J, could not fit our needs.

Modeling data as graph

The first issue on our plate is to design the data model representing the graph, as a set of vertices and edges.

With Opensearch we can utilize its powerful search abilities to efficiently fetch node & relation documents according to the query filters.

Opensearch Index

In Opensearch each index can be described as a table for a specific schema, the index itself is partitioned into shared which allow scale and redundancy (with replicas) across the cluster.

A document is routed to a particular shard in an index using the following formula:

shard_num = hash(_routing) % num_primary_shards

Each index has a schema (called type in Opensearch) which defines the documents structure (called mapping in Opensearch). Each index can hold only a single type of mapping (since Opensearch 6)

The vertices index will contain the vertices documents with the properties, the edges index will contain the edges documents with their properties.

Query Language

The way we describe how to traverse the graph (data source)

There are few graph-oriented query languages:

Some of the languages are more pattern based and declarative, some are more imperative – they all describe the logical way of traversing the data.

Cypher

Let’s consider Cypher - a declarative, SQL-inspired language for describing patterns in graphs visually using an ascii-art syntax.

It allows us to state what we want to select, insert, update or delete from our graph data without requiring us to describe exactly how to do it.

alt text

From logical to physical

Once a logical query is given we need to translate it to the physical layer of the data storage which is Opensearch.

Opensearch has a query DSL which is focused on search and aggregations – not on traversing, we need an additional translation phase that will take into account the schematic structure of the graph (and the underlying indices).

Logical to physical query translation is a process that involves few steps:

  • validating the query against the schema

  • translating the labels into real schema entities (indices)

  • creating the physical Opensearch query

This is the process in a high-level review, in practice - there will be more stages that optimize the logical query; in some cases it is possible to create multiple physical plans (execution plans) and rank them according to some efficiency (cost) strategy such as count of elements needed to fetch...

Conclusion

We started with discussing the purpose of graphs DB in today’s business use cases and reviewed different models for representing a graph. Understanding the fundamental logical building blocks that a potential graph DB should consist and discussed an existing NoSql candidate to fulfill the storage layer requirements.

Once we selected Opensearch as the storage layer we took the LDBC Social Network Benchmark graph model and simplified it to be optimized in that specific storage. We discussed the actual storage schema with the redundant properties and reviewed cypher language to query the storage in an sql-like graph pattern language.

We continued to see the actual transformation of the cypher query into a physical execution query that will run by Opensearch.

In the last section we took a simple graph query and drilled down into the details of the execution strategies and the bulking mechanism.

Start Using

Please review the following tutorial pages:

Installaiton tutorial:

Schema creation tutorial:

Data Loading tutorial:

Query the Graph tutorial:

Projection materialization & count tutorial:

YANGDB's Projects

arcadedb icon arcadedb

ArcadeDB Multi-Model Database that supports SQL, Cypher, Gremlin, HTTP/JSON, MongoDB and Redis

arrows icon arrows

JavaScript library for drawing diagrams of small graphs, using D3 to generate SVG. Useful for explaining Neo4j graph modelling concepts in presentations and blogs.

atlas icon atlas

In-memory dimensional time series database.

awesome-dynamic-graphs icon awesome-dynamic-graphs

A collection of resources on dynamic/streaming/temporal/evolving graph processing systems, databases, data structures, datasets, and related academic and industrial work

aws-otel-collector icon aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)

blinkdb icon blinkdb

BlinkDB: Sub-Second Approximate Queries on Very Large Data.

cdr-gen icon cdr-gen

A Call Detail Record (CDR) generator

cqengine icon cqengine

Ultra-fast SQL-like queries on Java collections

crate icon crate

CrateDB is a distributed SQL database that makes it simple to store and analyze massive amounts of machine data in real-time.

cypher-shell icon cypher-shell

A command line shell where you can execute Cypher against an instance of Neo4j

data-prepper icon data-prepper

Data Prepper is a component of the OpenSearch project that accepts, filters, transforms, enriches, and routes data at scale.

datahelix icon datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation

datashare icon datashare

Better analyze information, in all its forms

dejavu icon dejavu

The Missing Web UI for Elasticsearch: Import, browse and edit data with rich filters and query views, create search UIs visually.

deneb-showcase icon deneb-showcase

A collection of advanced dataviz examples using Vega, Vega-Lite, Deneb and Power BI.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.