dars's Introduction

𓃢 DARS

DARS is an asynchronous DAP/2 server written in Rust, aimed at being fast and lightweight. It implements a subset of the OPeNDAP protocol, and it deliberately serves only the DAP protocol, not common companion services like a catalog or a WMS.

See below for installation instructions.

OPeNDAP server implementation and file formats

Variable and hyperslab constraints are implemented, except strides: a hyperslab such as ?var[0:10][0:5] is supported, while the strided form [0:2:10] is not. File formats based on HDF5 are supported:

  • HDF5
  • NetCDF (version 4)
  • NcML (aggregation along an existing dimension).

HDF5 is read through hidefix, an experimental HDF5 reader built for concurrent reading.

Some simple benchmarks

It is difficult to do meaningful benchmarks. However, here is an attempt at a simple comparison between Dars, Thredds and Hyrax; see comparison/report.md and comparison/benchmarks.sh for more details. wrk is used to measure the maximum requests per second over a fixed duration using 10 concurrent connections. For Thredds and the large dataset, wrk2 was used with a rate limit of 2 requests/sec to avoid too many out-of-memory errors. The servers were run one at a time using their default docker images.

It would be interesting to show latency (HdrHistogram) histograms for the different tests, but the performance (acceptable latency at a given request rate) differs so much between the servers that it is difficult to make any meaningful plots. They should still be included in further analysis, but done individually for each server.

[Figure: requests per second for Dars, Thredds and Hyrax]

It is also interesting to note that the server load was very different during these benchmarks:

[Figure: CPU and memory during load testing]

Installation and basic usage

Set up a nightly Rust toolchain with rustup.

Running from the repository:

$ cargo run --release

or install with:

$ cargo install --path dars

By default a simple catalog can be explored using a browser. If the catalog is disabled, the list of datasets and their DAP URLs can still be queried at http://localhost:8001/data/, or as JSON:

$ curl -H accept:application/json http://localhost:8001/data/

Use e.g. ncdump -h http://.. to explore the datasets.

Docker

Use the gauteh/dars image, or build it yourself:

$ docker build -t dars .
$ docker run -it -p 8001:8001 -v /path/to/data:/data dars

Mount your data at /data (the -v flag above; /path/to/data is a placeholder for your local data directory).

dars's People

Contributors

gauteh, magnusumet

dars's Issues

run cargo clippy

A lot of unnecessary strings are created when handling errors, using the pattern

e.ok_or(anyhow!("{}", "x failed"))

This always evaluates the anyhow! argument, even when it is not used. clippy lints against this (or_fun_call), and also picks up some other "errors".
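
A minimal sketch of the lazy alternative, using the closure-taking variants from anyhow; the function and variable names below are illustrative, not from the dars codebase:

use anyhow::{anyhow, Context, Result};

// Hypothetical example: look up a key, attaching an error message on failure.
fn lookup(map: &std::collections::HashMap<String, i32>, key: &str) -> Result<i32> {
    // Eager: the anyhow! error (and its String) is built on every call:
    //   map.get(key).copied().ok_or(anyhow!("lookup of {} failed", key))
    // Lazy: the closure only runs if `get` returned None:
    map.get(key)
        .copied()
        .ok_or_else(|| anyhow!("lookup of {} failed", key))
}

// with_context is the lazy counterpart of context, for the same reason.
fn read_config(path: &str) -> Result<String> {
    std::fs::read_to_string(path).with_context(|| format!("reading {} failed", path))
}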

wrong dataset count

The server on dars.met.no sometimes says

We are currently serving 0 datasets.

while it actually has 1 dataset. This message usually appears after reloading the page a few times.

concurrent and thread-safe reads from netCDF / HDF5

This is a tracking issue for thread-safe and concurrent reads from netCDF files: either using a pool of file-handles, or, even better, proper concurrent read support. A sketch of the pool approach follows below.
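
A minimal sketch of the file-handle-pool approach, using only std primitives; this is not the dars implementation, and H stands in for whatever handle type the netCDF/HDF5 bindings provide:

use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// A pool of N open handles to the same file, checked out one at a time so
// each handle is only ever used by one thread at once.
struct HandlePool<H> {
    handles: Mutex<VecDeque<H>>,
    available: Condvar,
}

impl<H> HandlePool<H> {
    fn new(handles: impl IntoIterator<Item = H>) -> Self {
        HandlePool {
            handles: Mutex::new(handles.into_iter().collect()),
            available: Condvar::new(),
        }
    }

    // Check out a handle, blocking until one is free.
    fn acquire(&self) -> H {
        let mut guard = self.handles.lock().unwrap();
        loop {
            if let Some(h) = guard.pop_front() {
                return h;
            }
            guard = self.available.wait(guard).unwrap();
        }
    }

    // Return a handle to the pool and wake one waiter.
    fn release(&self, h: H) {
        self.handles.lock().unwrap().push_back(h);
        self.available.notify_one();
    }
}

In an async server the blocking acquire would have to be replaced with an async-aware primitive (e.g. a semaphore), but the ownership pattern stays the same.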

Older netCDF formats depend on the netCDF library, while newer ones depend on HDF5. The following issue will presumably be the best reference for this issue:

Hyrax has its own HDF5 implementation, which might be useful to us (maybe through rust-netcdf):

google font but no privacy policy

<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:300,300italic,700,700italic">

I am not sure whether it is permitted to include these fonts without a privacy policy statement or similar. It is probably fine in the source code, but not for a running public server (e.g. an unmodified docker container). I would suggest dropping these lines; I do not expect any readability problems from using a different font or simpler CSS.

Data-discovery and index

In gauteh/hidefix#8 a couple of different DBs have been benchmarked. Deserializing the full index of a large file (4 GB) takes about 8 µs on my laptop; the index is about 8 MB, and takes about 100-150 ns to read from memory-mapped local databases (sled, heed). Reading the 8 MB binary from redis, sqlite or similar takes about 3 to 6 ms, which is maybe a bit too high. It would be interesting to also try postgres.

  1. We need to keep data-discovery and dataset removal/update in mind:
  • I think datasets should be registered, not auto-discovered by the data-server: the registration could be run by another dedicated service that auto-detects/scrapes sources.
  • When a data-file turns out to be missing, or its mtime has changed, we return an error, possibly notifying the scraper-service.
  2. I think we have to assume internal network latency is OK; I don't see how we can do much about that, except keeping communication to a minimum.

A solution could be:

  • Keep a central db with the index, DAS, DDS and list of datasets. This could be an SQL server or whatever; it is only written to by the scraper.
  • Each worker keeps a local cache of datasets (index, DAS, DDS), e.g. in heed or maybe even just in memory. To avoid having to verify that a dataset still exists, it checks the mtime of the source on each request, and updates the cache from the central server if the mtime has changed (see the sketch after this list). NcML aggregates will not be caught this way.
  • When the central DB is changed, a cache clear is triggered at the workers. Retrieving new data from the central server is pretty cheap. This handles NcML changes.
  • This makes it possible to extend to cloud data-sources, since the central DB would then point to e.g. an S3 URL.
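
The mtime check could be as simple as comparing the source file's modification time against the one recorded when the dataset was indexed; a sketch with hypothetical names, not the actual dars cache API:

use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

// Returns true if the source file has been modified since it was indexed,
// i.e. the cached index/DAS/DDS should be refreshed from the central DB.
fn cache_is_stale(source: &Path, indexed_mtime: SystemTime) -> io::Result<bool> {
    let current_mtime = fs::metadata(source)?.modified()?;
    Ok(current_mtime != indexed_mtime)
}

A missing file surfaces here as the io::Error, which matches the "return an error, possibly notifying the scraper-service" case above.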

Unfortunately this complicates things significantly, but I don't see how to avoid it when scaling up. It would be nice to still support a stand-alone server that does not need a central db, but just caches locally and discovers datasets itself in some way; that would make it significantly easier to test the server out.

Some reasons:

  • Storing the full index of all datasets on every worker takes a lot of space and must be kept in sync.
  • Keeping the index on a network disk is probably too slow, and embedded databases like SQLite are still too slow, so a memory-mapped DB is needed anyway.
  • Indexing on demand is too slow, especially for aggregated datasets.

Since data usually lives on network disks, caching the data itself could possibly be done with a large file-system cache, or maybe something like https://docs.rs/freqfs/latest/freqfs/index.html.

@magnusuMET
