
elfstore's Introduction

ElfStore: Edge-Local Federated Store

Sumit K Monga, Sheshadri K R, Yogesh Simmhan.

To appear in the Proceedings of the IEEE International Conference on Web Services (ICWS), 2019

ElfStore is a first-of-its-kind edge-local federated store for streams of data blocks. It uses reliable fog devices as a super-peer overlay to monitor the edge resources, locates data within 2 hops via federated metadata indexing using Bloom filters, and maintains approximate global statistics about the reliability and storage capacity of edges. Edges host the actual data blocks, and we use a unique differential replication scheme to select the edges on which to replicate blocks, to guarantee a minimum reliability and to balance storage utilization.

Command Line Interface (CLI), credits: Ishan Sharma

Link to the paper: https://arxiv.org/pdf/1905.08932.pdf

Instructions to set up:

  1. Run the Maven build at the level of the 'src' directory to generate the executable jar: mvn clean compile assembly:single
  2. For setup and installation of ElfStore, please refer to the README file, which contains step-by-step instructions to test ElfStore using the Command Line Interface (CLI) on your local machine.

Note: This software is a research prototype made available on a best-effort basis, without any guarantees 🙂 If you have any questions or comments, you can send the respective author a note.

Copyright 2019 DREAM:Lab, Indian Institute of Science, Bangalore

Licensed under the Apache License, Version 2.0 (the "License"); you may not use the files in this project except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

elfstore's People

Contributors

ishansharma07 · sheshadri-1992 · simmhan · skmonga

Forkers

nifey tmnnt

elfstore's Issues

Handling of edge failures wherein an edge device may miss heartbeats due to network link failures and comes back

In the current implementation, an edge device is removed from all usage for read() and put() by its managing Fog if it misses a certain number of heartbeats. However, these heartbeat misses may be due to network link failures, in which case the edge device may come back once the network failure heals. What should the proper behaviour of the Fog be, in terms of allowing or discarding this edge device once it comes back?
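A minimal sketch of one way to handle this, assuming a grace window between suspending an edge and permanently evicting it; all names (EdgeTracker, MISS_LIMIT, GRACE_MISSES) are illustrative, not part of ElfStore's actual code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: suspend an edge after a few missed heartbeats, but
// only evict it (triggering re-replication) after a longer grace window, so
// a short network outage does not cause permanent removal.
public class EdgeTracker {
    private static final int MISS_LIMIT = 3;    // misses before the edge is suspended
    private static final int GRACE_MISSES = 10; // misses before the edge is evicted

    private final Map<Integer, Integer> misses = new ConcurrentHashMap<>();

    public void onHeartbeat(int edgeId) {
        misses.put(edgeId, 0); // edge is alive again; reinstate it if suspended
    }

    /** Called for every heartbeat interval in which an edge fails to report. */
    public void onMiss(int edgeId) {
        int m = misses.merge(edgeId, 1, Integer::sum);
        if (m == MISS_LIMIT) {
            suspend(edgeId);   // stop routing read()/put() to this edge
        } else if (m == GRACE_MISSES) {
            evict(edgeId);     // give up: drop from index, recover its blocks
        }
    }

    private void suspend(int edgeId) { /* remove from allocation candidates */ }
    private void evict(int edgeId)   { /* drop from index, trigger recovery */ }
}
```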

Support for edge rejoining with previous blocks and their metadata

When an edge dies and is removed from the fog/its index, and it rejoins later with replicas, should we include its blocks (after MD5 verification) among the available replicas? May this cause excess reliability? Handle it using #13? What if it rejoins quickly while re-replication is still ongoing?
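The MD5 verification mentioned above could look like the following sketch, using only standard library calls; the class and method names are illustrative:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Verify a rejoining edge's block against the checksum kept in the stream
// metadata before re-admitting it as a replica.
public class BlockVerifier {
    public static boolean matchesChecksum(Path blockFile, String expectedMd5Hex) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(blockFile)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md5.update(buf, 0, n); // stream the block through the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString().equals(expectedMd5Hex);
    }
}
```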

Block recovery in a given setup with incremental edge failures works fine with few failures but all blocks are not recovered for later edge failures

It has been observed with both the D20 and D272 setups that, as edge devices fail incrementally, the complete recovery of blocks succeeds only for the first few failures (2 in both setups); for subsequent failures, not all blocks are recovered properly. Ideally the system should be able to handle failures up to the point where there is still enough storage available to replicate the lost blocks (link this issue with replica identification to make sure the storage assumption holds true).

Print replica allocation state diagram.

As part of verifying the correctness of the replica identification algorithm, print the following, which enables easy verification:

  • A state diagram depicting the choices made for replica allocation. (local allocation + remote allocation)
  • Print the Global Edge allocation table.
  • Min Replica, Max Replica and Expected Reliability constraints (all of these are stream-level constraints).
  • The Reliability choices of the Edge and Fog which achieved the Expected Reliability.

Enable support for changing edge reliability with time

Currently we assume edge reliability to be a fixed value. However, that is rarely the case, as the reliability of an edge device changes with usage over time. We need to accommodate the changing reliability of the edge device in a future implementation.
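One possible way to track a time-varying reliability, sketched here as an exponentially weighted moving average over heartbeat outcomes; the smoothing factor and names are assumptions, not part of the current implementation:

```java
// Illustrative EWMA tracker: edge reliability drifts toward recent behaviour.
public class EdgeReliability {
    private static final double ALPHA = 0.05; // weight of the newest observation (assumed)
    private double reliability;

    public EdgeReliability(double initialReliability) {
        this.reliability = initialReliability;
    }

    /** Update with 1.0 for a delivered heartbeat, 0.0 for a missed one. */
    public void observe(double outcome) {
        reliability = ALPHA * outcome + (1 - ALPHA) * reliability;
    }

    public double current() {
        return reliability;
    }
}
```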

Add support for updating the block data

This new feature allows updating a block's data using FindBlock with the force flag set to true (details at #39). All replicas are fetched and their content is updated. The block should also maintain a block version number as an internal property.

Handle metadata updates when edge dies

How do we update the bloom filter and metadata index when an edge dies?
Maintain one bloom filter per edge and construct the parent fog's bloom filter as the OR of all of these.
Loss of an edge then requires only a local bloom filter recomputation.
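A minimal sketch of that composition, using java.util.BitSet as a stand-in for whatever bloom filter representation ElfStore actually uses:

```java
import java.util.BitSet;
import java.util.Collection;

// The parent fog's filter is the bitwise OR (union) of its edges' filters.
public class FogBloomIndex {
    public static BitSet combine(Collection<BitSet> edgeFilters) {
        BitSet parent = new BitSet();
        for (BitSet edgeFilter : edgeFilters) {
            parent.or(edgeFilter); // union of the edge-level filters
        }
        return parent;
    }
}
```

When an edge dies, the fog drops that edge's filter and recomputes the OR over the surviving ones: purely local work, with no network traffic and no need to re-hash the surviving blocks.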

Recovery of blocks doesn't work to completion in presence of simultaneous edge failures

With simultaneous edge failures (2, to be precise) on a D20 setup, the blocks of neither of the two failed edge devices were recovered successfully.

One issue that may be causing the recovery failure is the lack of an implementation of java.util.concurrent.RejectedExecutionHandler, which handles tasks that cannot be executed by a ThreadPoolExecutor when there are no threads or queue slots left to hold them. In such a case, an implementation of this interface could wait to add the lost block to the recovery queue until it succeeds.
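A sketch of such a handler; the interface and getQueue() are standard java.util.concurrent, while whether this matches ElfStore's recovery executor is an assumption:

```java
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;

// Instead of silently dropping a recovery task when the pool is saturated,
// block until the task can be enqueued, so no lost block is ever discarded.
public class BlockingRecoveryHandler implements RejectedExecutionHandler {
    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        try {
            // put() waits for space in the work queue rather than failing.
            // A real implementation should also check executor.isShutdown()
            // first, since put() would block forever on a terminated pool.
            executor.getQueue().put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while queueing recovery task", e);
        }
    }
}
```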

Add retry mechanism to handle write failures of edge

The current implementation finds the replica locations, which are then written to by the client. The order we took is: first write the local replica, then write the remote ones. However, if there are fog failures or edge failures, the write may not succeed, and a retry is needed to achieve the expected reliability. To meet the desiderata in terms of the number of replicas and the expected reliability, one order that can be taken is to write the remote replicas first and the local replica last.
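The reordering could look like the following client-side sketch; ReplicaLocation and writeTo(...) are hypothetical placeholders for the actual Thrift write path:

```java
import java.util.List;

// Write remote replicas first, then the local one, retrying each failed
// location a bounded number of times.
public class ReplicaWriter {
    private static final int MAX_RETRIES = 3; // assumed bound

    public void writeAll(byte[] block, ReplicaLocation local, List<ReplicaLocation> remotes) {
        for (ReplicaLocation remote : remotes) {
            writeWithRetry(block, remote);
        }
        writeWithRetry(block, local); // local replica is written last
    }

    private void writeWithRetry(byte[] block, ReplicaLocation loc) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            if (writeTo(block, loc)) {
                return;
            }
            // A later revision could instead ask the fog for an alternative
            // location (see the blacklisted-fogs issue below).
        }
    }

    // Placeholder for the actual Thrift call; returns whether the write succeeded.
    private boolean writeTo(byte[] block, ReplicaLocation loc) { return true; }
}

interface ReplicaLocation {}
```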

Include a simple commandline shell

Have a (Python-based?) shell to perform add edge, create stream, append file, put block (with and without lock), update block, get file, get block, list stream, list block, etc.

Support checkpointing of Fog's state in case a fog restart is required

Currently a Fog maintains all its state in memory. As a result, if a Fog is restarted, all of this state is lost, which might require all its edge devices to resend their data and metadata information, along with the neighbours and buddies doing their part. Instead, provide a facility wherein the Fog can checkpoint its state and be restarted with all the metadata preserved.

StreamMetadata object should also contain the mapping of blockId and its MD5 checksum as a dynamic property

Currently the StreamMetadata object doesn't contain the mapping from a blockId to its MD5 checksum. This mapping is only maintained at the fog where the stream was registered. As a result, when a client or a fog uses getStreamMetadata(), the returned object doesn't contain the mapping. The mapping should not be returned to the client directly, but it is useful for the fog devices (recovery, when a fog device dies, is one scenario where it helps). So this mapping should be added to the StreamMetadata object as a dynamic property.

Stress test verification for storage utilization

Past stress test runs show that even though storage is available on the edge devices, it is not used fully, because the Fog devices were unable to return heartbeat responses to their edges, to their buddies, and to the fogs for which they are a neighbour. This issue was seen when we were using TNonblockingServer, which uses a single network I/O thread that also handles the requests themselves. Since the current implementation uses TThreadedSelectorServer, which allows separate thread pools for network I/O and for request handling, stress test reruns are required to validate the correct behaviour.
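For reference, a minimal sketch of that server setup with Apache Thrift's Java API; the thread counts are illustrative, not ElfStore's actual configuration:

```java
import org.apache.thrift.TProcessor;
import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TThreadedSelectorServer;
import org.apache.thrift.transport.TNonblockingServerSocket;
import org.apache.thrift.transport.TTransportException;

// TThreadedSelectorServer separates network I/O (selector threads) from
// request handling (worker threads), so a slow handler no longer starves
// heartbeat responses the way TNonblockingServer's single I/O thread could.
public class FogServerBootstrap {
    public static TServer build(TProcessor processor, int port) throws TTransportException {
        TNonblockingServerSocket transport = new TNonblockingServerSocket(port);
        TThreadedSelectorServer.Args args = new TThreadedSelectorServer.Args(transport)
                .processor(processor)
                .selectorThreads(2)   // threads doing network I/O (assumed count)
                .workerThreads(16);   // threads executing request handlers (assumed count)
        return new TThreadedSelectorServer(args);
    }
}
```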

Efficient use of bloom filter to optimise for the number of network calls during GetBlock

For GetBlock, as part of FindBlock, in the current implementation a fog receiving the request checks its local data structures to see whether it contains the block locally and, if so, adds itself to the list of candidate fogs. It then looks into the bloom filters of its neighbours, and when a filter indicates a possible containment of the block, it makes an API call to the neighbour to check whether the neighbour actually contains it; such neighbours are also added to the list. A similar procedure is followed for the buddies, with one change: each buddy also contacts its own neighbours to find suitable replica fogs. However, this leads to a very high number of network calls, which can be reduced by using a positive bloom filter response alone as an indicator of possible containment of the block.

This can be done using only the contacted fog's local data and the bloom filters of its neighbours. If the bloom filter of a neighbour indicates that the neighbour may contain the block, that neighbour is returned as a candidate fog to the client. The client then uses this list and tries to read the block from those fogs. If the block is not found among them, the block is not present on the fog or its neighbours, and the buddies should be contacted to get the data. Similar bloom filter handling can be done for the buddies as well. A force flag can also be passed: when unset, the efficient version described above is used, while a set flag uses the current implementation to make sure proper replicas are returned as part of FindBlock.
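A minimal sketch of the efficient path, building the candidate list purely from local state; mayContain(...) is a single-hash stand-in for the real k-hash bloom filter membership test, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Build FindBlock candidates from the fog's own index plus cached neighbour
// bloom filters, with zero per-neighbour network calls. False positives are
// acceptable: the client simply fails to read from that fog and moves on.
public class CandidateFinder {
    public List<Integer> findCandidates(long blockId, int selfFogId,
                                        boolean hasBlockLocally,
                                        Map<Integer, BitSet> neighbourFilters) {
        List<Integer> candidates = new ArrayList<>();
        if (hasBlockLocally) {
            candidates.add(selfFogId);
        }
        for (Map.Entry<Integer, BitSet> e : neighbourFilters.entrySet()) {
            if (mayContain(e.getValue(), blockId)) {
                candidates.add(e.getKey()); // possible false positive; client verifies on read
            }
        }
        return candidates;
    }

    // Simplified single-hash membership test; a real bloom filter checks k positions.
    private boolean mayContain(BitSet filter, long blockId) {
        return filter.get((int) Math.floorMod(blockId, (long) filter.size()));
    }
}
```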

StreamMetadata should have a mechanism to allow sequential access to the blocks

The current implementation has a bug: the metadata maintains a last blockId along with only a mapping from a blockId to its MD5 checksum. Thus there is no way blocks can be consumed in the order in which they were added to the stream, so no sequential access pattern for the blocks is available. As a correction, a list of blocks can be maintained. Two cases need to be looked into carefully:

  1. PutBlock with locking
  2. PutBlock without locking

For case 1, only one writer can append blocks at any time, so there is no race condition. In case 2, however, multiple writers may be writing to the stream at the same time, and their appends can interleave. It is not necessary for the last blockId to strictly increase, and this interleaving is perfectly fine. So the race condition should be handled properly, i.e. the last blockId, the list of blocks, and the mapping of blockId to checksum should be updated atomically as a whole.
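A sketch of that atomic update under a single lock; the field and class names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The last blockId, the ordered block list, and the checksum map change
// together, so concurrent non-locking writers can interleave appends
// without tearing the metadata.
public class StreamBlockIndex {
    private final Object lock = new Object();
    private long lastBlockId = -1;
    private final List<Long> blockOrder = new ArrayList<>();    // enables sequential access
    private final Map<Long, String> blockMd5 = new HashMap<>(); // integrity checks

    public void appendBlock(long blockId, String md5Hex) {
        synchronized (lock) {
            lastBlockId = blockId; // may move non-monotonically under interleaving
            blockOrder.add(blockId);
            blockMd5.put(blockId, md5Hex);
        }
    }

    public List<Long> blocksInArrivalOrder() {
        synchronized (lock) {
            return new ArrayList<>(blockOrder); // defensive copy for readers
        }
    }
}
```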

Maintain blacklisted fogs during writes to ensure same fogs are not requested in case of write failures

It might happen that replica identification returns a set of fogs, but some of those fogs are unable to serve the write request for one of several reasons (edge failures, edge storage below the disk watermark, etc.). In such a case, during the retry phase, the replica identification API must accept a list of blacklisted fogs that will not be picked for writing again, so that only the remaining fogs are searched.
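A hypothetical shape for that API extension; the interface and parameter names are illustrative only:

```java
import java.util.List;
import java.util.Set;

public interface ReplicaIdentifier {
    /**
     * Identify fogs to host replicas of a block.
     *
     * @param blacklistedFogs fogs that failed a previous write attempt and
     *                        must not be returned again during this retry
     */
    List<Integer> identifyReplicas(long blockId, int replicaCount, Set<Integer> blacklistedFogs);
}
```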

Support searching blocks of data using metadata based query

Currently the system supports searching for data blocks via the read() API, which takes the blockId as the query parameter. This issue extends the search functionality to use the metadata parameters of the data blocks in addition to the blockId.

Proper exception handling for server and client side

Proper exception handling is needed on both the server and client sides. The server side (FogServer and EdgeServer) has some exception handling added, which should be revisited, while the client side currently lacks exception handling.

Use 2-level statistics to form global stats

A buddy creates stats for its neighbours and itself, and passes these intermediate stats to its buddy pool. This takes O(mn + mb) messages instead of O(m) messages (m: number of peers; b: buddy pool size - 1; n: neighbour count).
This will require one level of indirection to the buddy to decide the actual fog to hold a replica. It will also make the global stats more approximate.

Fog to return actual edge reliability to client as part of put block op

Each fog where we place a replica should return the exact reliability of the edge the replica is being put on. This will help the client calculate the exact reliability of the block.
Question: what do we do with this info? Can we "drop" the last replica blocks if the required reliability is already achieved?
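For reference, the client-side calculation this enables: assuming independent edge failures, a block is lost only if every replica's edge fails, so:

```java
// Block reliability from the per-edge reliabilities returned by each fog:
// R = 1 - prod(1 - r_i), assuming independent edge failures.
public class BlockReliability {
    public static double fromReplicaReliabilities(double[] edgeReliabilities) {
        double allFail = 1.0;
        for (double r : edgeReliabilities) {
            allFail *= (1.0 - r); // probability that this replica's edge fails
        }
        return 1.0 - allFail;
    }
}
```

For example, replicas on edges with reliabilities 0.9, 0.8 and 0.7 give 1 - (0.1 × 0.2 × 0.3) = 0.994; once this running value exceeds the required reliability, the remaining replica writes could indeed be dropped.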

Add retry mechanism to handle write failures of fog

The current implementation finds the replica locations, which are then written to by the client. The order we took is: first write the local replica, then write the remote ones. However, if there are fog failures, the write may not succeed, and a retry is needed to achieve the expected reliability.

Talk to the parent fog to get an alternative in case of a fog failure?
Related to general concept of handling fog failures/buddy failures, #22

Support additional APIs such as readNext(), putNext(), dropStream()

We wish to implement the following features in the future (a possible interface shape is sketched after the list):

  • readNext(), which allows searching data blocks based on the previous search (similar to the iterator pattern)
  • putNext(), which allows writing data blocks to the same locations as the previous write, without making a call to identify replicas
  • dropStream(), which puts an end to the currently running stream by adding an end time to it
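A hypothetical shape for these three APIs; the types and names are illustrative, not ElfStore's actual interface:

```java
import java.util.List;

public interface StreamClientExtensions {
    /** Continue a previous search, iterator-style, from the last block returned. */
    List<Long> readNext(long streamId, long afterBlockId, int maxBlocks);

    /** Write to the same replica locations as the previous put, skipping replica identification. */
    void putNext(long streamId, byte[] blockData);

    /** Close the stream by recording an end time; no further appends are accepted. */
    void dropStream(long streamId, long endTimeMillis);
}
```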

Priority based recovery of blocks on edge failure

Right now, there is a priority across multiple edges that fail, but for a single edge failure, all blocks have the same priority.

When an edge fails, can we recover first the blocks that have a much lower reliability than the others? How do we find the current reliability of the blocks?
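One sketch of such prioritisation, recovering the blocks with the lowest remaining reliability first (computed over the surviving replicas, e.g. with the formula shown earlier); all names are illustrative:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Order lost blocks by their current (degraded) reliability, lowest first.
public class RecoveryQueue {
    public static final class LostBlock {
        final long blockId;
        final double currentReliability; // computed over surviving replicas

        LostBlock(long blockId, double currentReliability) {
            this.blockId = blockId;
            this.currentReliability = currentReliability;
        }
    }

    private final PriorityQueue<LostBlock> queue =
            new PriorityQueue<>(Comparator.comparingDouble((LostBlock b) -> b.currentReliability));

    public void add(LostBlock block) { queue.add(block); }

    /** Returns the most at-risk block to re-replicate next, or null if empty. */
    public LostBlock nextToRecover() { return queue.poll(); }
}
```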
