hbase-sdk-for-net's Introduction

Microsoft HBase REST Client Library for .NET

This is a C# client library for HBase on Azure HDInsight.

It has been compatible with all HBase versions since 0.96.2 (HDI 3.0).

The communication works through HBase REST (StarGate) which uses ProtoBuf as a serialization format.

Non-HDInsight HBase clusters can use VNET mode, which does not require OAuth credentials.

Getting Started

Build

Import the solution file into VS2013 and compile. Retrieve the resulting *.dll files.

We have published the signed binary on nuget.org (https://www.nuget.org/packages/Microsoft.HBase.Client/).

Usage

After compilation, you can easily use the library to get the version of the HBase/HDInsight cluster you're running on:

var creds = new ClusterCredentials(new Uri("https://myclustername.azurehdinsight.net"), "myusername", "mypassword");
var client = new HBaseClient(creds);

var version = client.GetVersionAsync().Result;
Console.WriteLine(version);

// yields: RestVersion: 0.0.2, JvmVersion: Azul Systems, Inc. 1.7.0_55-24.55-b03, OsVersion: Windows Server 2012 R2 6.3 amd64, ServerVersion: jetty/6.1.26, JerseyVersion: 1.8, ExtensionObject:

Table creation works like this:

var creds = new ClusterCredentials(new Uri("https://myclustername.azurehdinsight.net"), "myusername", "mypassword");
var client = new HBaseClient(creds);

var testTableSchema = new TableSchema();
testTableSchema.name = "mytablename";
testTableSchema.columns.Add(new ColumnSchema() { name = "d" });
testTableSchema.columns.Add(new ColumnSchema() { name = "f" });
client.CreateTableAsync(testTableSchema).Wait();

Inserting data can be done like this:

var creds = new ClusterCredentials(new Uri("https://myclustername.azurehdinsight.net"), "myusername", "mypassword");
var client = new HBaseClient(creds);

var tableName = "mytablename";
var testKey = "content";
var testValue = "the force is strong in this column";
var set = new CellSet();
var row = new CellSet.Row { key = Encoding.UTF8.GetBytes(testKey) };
set.rows.Add(row);

var value = new Cell { column = Encoding.UTF8.GetBytes("d:starwars"), data = Encoding.UTF8.GetBytes(testValue) };
row.values.Add(value);
client.StoreCellsAsync(tableName, set).Wait();

Retrieving all cells for a key looks like this:

var creds = new ClusterCredentials(new Uri("https://myclustername.azurehdinsight.net"), "myusername", "mypassword");
var client = new HBaseClient(creds);

var testKey = "content";
var tableName = "mytablename";

var cells = client.GetCellsAsync(tableName, testKey).Result;
// get the first value from the row.
Console.WriteLine(Encoding.UTF8.GetString(cells.rows[0].values[0].data));
// with the previous insert, it should yield: "the force is strong in this column"

Scanning over rows looks like this:

var creds = new ClusterCredentials(new Uri("https://myclustername.azurehdinsight.net"), "myusername", "mypassword");
var client = new HBaseClient(creds);

var tableName = "mytablename";

// assume the table has integer keys and we want data between keys 25 and 35
var scanSettings = new Scanner()
{
    batch = 10,
    startRow = BitConverter.GetBytes(25),
    endRow = BitConverter.GetBytes(35)
};
RequestOptions scanOptions = RequestOptions.GetDefaultOptions();
scanOptions.AlternativeEndpoint = "hbaserest0/";
ScannerInformation scannerInfo = null;
try
{
    // CreateScannerAsync returns a Task, so block on .Result like the other examples
    scannerInfo = client.CreateScannerAsync(tableName, scanSettings, scanOptions).Result;
    CellSet next = null;
    while ((next = client.ScannerGetNextAsync(scannerInfo, scanOptions).Result) != null)
    {
        foreach (var row in next.rows)
        {
            // ... read the rows
        }
    }
}
finally
{
    if (scannerInfo != null)
    {
        client.DeleteScannerAsync(tableName, scannerInfo, scanOptions).Wait();
    }
}

There is also a VNET mode which can be used if your application is in the VNET with your HDI HBase cluster. NOTE: VNET mode also works for non-HDI clusters or on-premises HBase clusters.

// Options used for requests that may go to any node in the VNET.
var scanOptions = RequestOptions.GetDefaultOptions();
scanOptions.Port = 8090;
scanOptions.AlternativeEndpoint = "/";
var nodeIPs = new List<string>();
nodeIPs.Add("10.0.0.15");
nodeIPs.Add("10.0.0.16");
// VNET mode needs no credentials; requests are balanced round-robin across the nodes.
var client = new HBaseClient(null, scanOptions, new LoadBalancerRoundRobin(nodeIPs));
var testTableName = "mytablename";
var scanSettings = new Scanner { batch = 10 };
ScannerInformation scannerInfo = client.CreateScanner(testTableName, scanSettings, scanOptions);
// A scanner is local to the node that created it, so pin follow-up calls to that host.
var options = RequestOptions.GetDefaultOptions();
options.Port = 8090;
options.AlternativeEndpoint = "/";
options.AlternativeHost = scannerInfo.Location.Host;
client.DeleteScanner(testTableName, scannerInfo, options);

hbase-sdk-for-net's People

Contributors

abhiver, ajithg-msft, anliu, danzajork, duoxu, mwprochaska, pawelpabich, pshrosbree, thomasjungblut

hbase-sdk-for-net's Issues

GetCells does not work when key contains colons

Hi,

I have a key that follows this pattern: "Guid_DateTimeOffset.ToString("O")". Sample key: "b7c5510bed2c400b9ecf51df24a735eb_2015-06-26T10:35:00.5325853+10:00". When I try to use this key to retrieve the data I get a 404. When I use the HBase shell I can find the data without any problems.
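
The report does not state the cause. One thing worth checking, offered here only as an assumption, is whether the row key is percent-encoded before it is placed in the REST URL path, since `:` and `+` are reserved characters. A minimal sketch using the standard `Uri.EscapeDataString` call (the `RowKeyEncoding` helper is hypothetical and not part of the SDK):

```csharp
using System;

// Hypothetical helper, not part of Microsoft.HBase.Client.
public static class RowKeyEncoding
{
    // Percent-encodes a row key so it can be used as a single URL path segment.
    // Reserved characters such as ':' and '+' become %3A and %2B.
    public static string ToPathSegment(string rowKey)
    {
        return Uri.EscapeDataString(rowKey);
    }
}
```

For the sample key above, every colon in the timestamp would be emitted as `%3A` and the `+` of the offset as `%2B`, while the underscore, hyphens, and dot stay unchanged.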

[HDI] Client timeouts on compression

Take the HBaseClientTests and add a compression to the schema:

_testTableSchema.columns.Add(new ColumnSchema { name = "d", compression = "XXX"});

If you provide "GZ" it works fine; if you provide anything else (e.g. "SNAPPY", although Snappy is installed and actually works on the cluster), the request just times out.
I guess the problem is on the server side: it hangs somewhere while verifying the compression algorithm (which should just be a config lookup).

Cluster version: 3.1.1.406

We should return an appropriate error message instead of timing out here.

CreateScannerAsync and ScannerGetNextAsync called on different hosts

When using a VNet and multiple region servers, the calls to CreateScannerAsync and ScannerGetNextAsync are not guaranteed to be sent to the same region server. This is a problem for a scanner because the scanner id information is local to that server. This results in a 404 error from the ScannerGetNextAsync call.

As a workaround for this issue, I added a "ScannerHost" field to the RequestOptions structure that is passed down with the call. I populate it after the CreateScannerAsync call with the value from the load balancer.

RequestOptions options = RequestOptions.GetDefaultOptions();
scannerInfo = await _hbaseClient.CreateScannerAsync(_tablename, scanSettings, options);
options.ScannerHost = scannerInfo.Location.Host;
CellSet next = await _hbaseClient.ScannerGetNextAsync(scannerInfo, options);

I then use this in IssueWebRequestAsync as follows. This allows me to specify the host for the ScannerGetNextAsync and DeleteScannerAsync but leave other calls untouched.

string host = balancedEndpoint.Host;
if (options.ScannerHost != null)
    host = options.ScannerHost;

UriBuilder builder = new UriBuilder(
    balancedEndpoint.Scheme,
    host,
    options.Port,
    options.AlternativeEndpoint + endpoint);

There is probably a cleaner way to set this up, but I wanted to provide the workaround for reference.

Add an interface for the client

Hey guys,

I'm just testing things on top of the lib and I forgot to extract an interface for the HBaseClient (for mocking).

I don't know if I'll manage to create a pull request before you release, but it would be a great addition so that people aren't forced to extract their own.
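
To illustrate what such an interface would enable, here is a minimal sketch. All names (`IVersionClient`, `ClusterHealthCheck`, `FakeVersionClient`) are hypothetical, and the `string` return type is a simplification of whatever the real client returns; nothing here is the SDK's actual API.

```csharp
using System.Threading.Tasks;

// Hypothetical interface; an assumption for illustration only.
public interface IVersionClient
{
    Task<string> GetVersionAsync();
}

// Application code depends on the interface rather than on the concrete client.
public sealed class ClusterHealthCheck
{
    private readonly IVersionClient _client;

    public ClusterHealthCheck(IVersionClient client)
    {
        _client = client;
    }

    // Treats any non-empty version string as "cluster reachable".
    public async Task<bool> IsReachableAsync()
    {
        var version = await _client.GetVersionAsync().ConfigureAwait(false);
        return !string.IsNullOrEmpty(version);
    }
}

// Hand-written fake for unit tests; needs no cluster and no network.
public sealed class FakeVersionClient : IVersionClient
{
    private readonly string _version;

    public FakeVersionClient(string version)
    {
        _version = version;
    }

    public Task<string> GetVersionAsync()
    {
        return Task.FromResult(_version);
    }
}
```

A test can then construct `new ClusterHealthCheck(new FakeVersionClient("0.0.2"))` and check the result without touching a real cluster.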

Fix inconsistency in how async API throws exception

Examples:
CreateScannerAsync wraps the exception in a Task and throws when awaited:

public async Task<ScannerInformation> CreateScannerAsync(string tableName, Scanner scannerSettings, RequestOptions options)
{
    tableName.ArgumentNotNullNorEmpty("tableName");
    scannerSettings.ArgumentNotNull("scannerSettings");
    options.ArgumentNotNull("options");
    return await options.RetryPolicy.ExecuteAsync(() => CreateScannerAsyncInternal(tableName, scannerSettings, options));
}

DeleteScannerAsync throws exception right away:

public Task DeleteScannerAsync(string tableName, ScannerInformation scannerInfo, RequestOptions options)
{
    tableName.ArgumentNotNullNorEmpty("tableName");
    scannerInfo.ArgumentNotNull("scannerInfo");
    options.ArgumentNotNull("options");
    return options.RetryPolicy.ExecuteAsync(() => DeleteScannerAsyncInternal(tableName, scannerInfo, options));
}

A fix can be considered as a breaking change.
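
The difference comes from the `async` keyword alone, and can be reproduced without the SDK. The sketch below uses hypothetical stand-in methods (`ValidateInsideAsync`, `ValidateBeforeTask`) that mirror the two shapes shown above:

```csharp
using System;
using System.Threading.Tasks;

public static class AsyncThrowDemo
{
    // Marked async (like CreateScannerAsync): the argument check runs inside the
    // compiler-generated state machine, so the exception is stored in the Task.
    public static async Task<int> ValidateInsideAsync(string tableName)
    {
        if (string.IsNullOrEmpty(tableName))
            throw new ArgumentException("must not be empty", nameof(tableName));
        await Task.Yield();
        return tableName.Length;
    }

    // Not marked async (like DeleteScannerAsync): the argument check runs before
    // any Task exists, so the exception propagates directly to the caller.
    public static Task<int> ValidateBeforeTask(string tableName)
    {
        if (string.IsNullOrEmpty(tableName))
            throw new ArgumentException("must not be empty", nameof(tableName));
        return Task.FromResult(tableName.Length);
    }

    // Returns true if merely calling the method throws, false if it returns a Task.
    public static bool ThrowsOnCall(Func<Task<int>> call)
    {
        try
        {
            call();
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

With an empty table name, `ThrowsOnCall` reports `false` for the async variant (the returned Task is faulted instead) and `true` for the non-async one, which is why a caller's `try`/`catch` placement has to differ between the two methods.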

Make retry policies visible for usage

There is no way for me to set up a new IRetryPolicyFactory implementation from a different namespace, since all of the classes there are marked internal.

Idea: expose Async API only

Why? The sync API simply calls .Result on the async API which will cause deadlocks in environments like WPF or MVC. This is a trap that confuses many people.

Support for .NET Core

Hi, can you make a version of this project compatible with .NET Core, please? Best regards

Deadlock in WebRequesterBasic

In the WebRequesterBasic class, we have the following method:

public HttpWebResponse IssueWebRequest(string endpoint, string method = "GET", Stream input = null)
{
    var response = IssueWebRequestAsync(endpoint, method, input).Result;
    return response;
}

A GUI client (such as a Windows Forms app) or an MVC client will deadlock if it calls the method as follows:

var requester = new Microsoft.HBase.Client.WebRequesterBasic();
requester.IssueWebRequest("https://github.com/hdinsight/hbase-sdk-for-net");

Mixing synchronous and asynchronous waiting is dangerous; see, for example, http://blog.stephencleary.com/2012/07/dont-block-on-async-code.html

There are instances of synchronous blocking in WebRequesterSecure and WebRequester as well.
A possible fix would be to configure the awaits inside IssueWebRequestAsync so that they do not capture the current SynchronizationContext, using ConfigureAwait(false).
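
A minimal sketch of that suggestion, using hypothetical stand-ins (`FetchAsync`, `Fetch`) rather than the real WebRequesterBasic code, and a `Task.Delay` in place of the actual network call:

```csharp
using System.Threading.Tasks;

public static class ConfigureAwaitSketch
{
    // Stand-in for IssueWebRequestAsync. ConfigureAwait(false) tells the await
    // not to resume on the captured SynchronizationContext, so the continuation
    // can run on a thread-pool thread even if the caller's thread is blocked.
    public static async Task<string> FetchAsync(string endpoint)
    {
        await Task.Delay(10).ConfigureAwait(false); // simulated I/O
        return "response from " + endpoint;
    }

    // Stand-in for the synchronous wrapper that blocks on .Result.
    public static string Fetch(string endpoint)
    {
        return FetchAsync(endpoint).Result;
    }
}
```

Without `ConfigureAwait(false)`, a call to `Fetch` from a UI thread would block that thread on `.Result` while the continuation waits to be posted back to the same thread. Note that every await on the call path needs the same treatment for the wrapper to be safe.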

Add the ability to talk to hbase infrastructure from a different vnet

I'd like to have the ability to talk to the HBase infrastructure from a different VNET. Currently it's a requirement for the REST API to be in the same VNET as HBase. This puts additional restrictions on the customer and requires short-term solutions (i.e. creating gateways and maintaining them to communicate between the two VNETs). I'd like to understand the priority of this feature and how soon it can be completed. In general, the best customer experience is to be able to communicate with HBase from any VNET within HDInsight.
Thanks

Add support for multiget requests

The SDK doesn't support multiget requests. Adding support for multiget requests (see pull request #65) will provide the ability to request multiple rows, by row key, in a single API call.

Address filters that do not work with ScannerModel.stringifyFilter

Both FirstKeyValueMatchingQualifiersFilter and FuzzyRowFilter are incompatible with ScannerModel.stringifyFilter. In addition, the following classes are only partially stringified (some state is not included): ColumnPaginationFilter, KeyOnlyFilter, and RegexStringComparator.

Fix inconsistency in configuring VNET clusters

Right now the API exposes the following constructors for the client:

  • HBaseClient(ClusterCredentials credentials)
  • HBaseClient(int numRegionServers)
  • HBaseClient(ClusterCredentials credentials, IRetryPolicyFactory retryPolicyFactory, ILoadBalancer loadBalancer = null)

The first constructor is for gateway deployments (agnostic of OS); the second is the default for VNET load balancing on Windows clusters.

For Linux clusters on a VNET you need to pass a load balancer, created with the right hostnames, to the third constructor. Besides not being able to instantiate a retry factory (see #43), this is quite inconsistent (why are the credentials needed?) and not nice for a user to set up.

Let's add a builder that creates a new HBaseClient:

var client = HBaseClientBuilder
   .New()
   .OnLinux()
   .OnVNet(numNodes: 32, loadBalancer: new LoadBalancerRoundRobin())
   .WithRetryPolicy(new ExponentialRetryPolicy(...))
   .Create();

This will configure 32 nodes on Linux with a VNET and round-robin balancing.

For the gateway case:

var client = HBaseClientBuilder
   .New()
   .OnWindows()
   .WithClusterUri(new Uri(...))
   .WithCredentials(new ClusterCredentials(...))
   .WithRetryPolicy(new ExponentialRetryPolicy(...))
   .Create();

We also need to refactor all the stuff in between; it has become a big mess, with URL strings getting passed around and no real structure.

Hard to read complete data

The batch parameter on the Scanner data structure controls how many cells (not rows) will be retrieved at a time. This means that, for it to be useful, one needs to write code that combines the data from possibly multiple batches so that only complete rows are returned to the client code. This might not be needed with stateless scanners (#7), as they have a limit parameter which operates on the row level.

I have a piece of code that does that, and if there is enough appetite I can open a pull request.
The gist of it is here: https://gist.github.com/pawelpabich/53d5cfc3ef51a083a042. I also have a set of tests that covers it.
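
For readers who want the general idea without opening the gist, here is a rough sketch of one way to do the merging. It is not the code from the gist. `RowFragment` is a hypothetical, simplified stand-in for the SDK's row type (string keys and cells instead of byte arrays), and the sketch assumes that fragments of the same row arrive consecutively, which is what a scan in key order produces.

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for one row, or part of one row, inside a scanner batch.
public sealed class RowFragment
{
    public string Key { get; }
    public List<string> Cells { get; }

    public RowFragment(string key, IEnumerable<string> cells)
    {
        Key = key;
        Cells = cells.ToList();
    }
}

public static class RowAssembler
{
    // Merges consecutive fragments that share a row key, so a row split across
    // two batches comes back as a single row with all of its cells.
    public static List<RowFragment> MergeBatches(IEnumerable<IEnumerable<RowFragment>> batches)
    {
        var rows = new List<RowFragment>();
        foreach (var batch in batches)
        {
            foreach (var fragment in batch)
            {
                if (rows.Count > 0 && rows[rows.Count - 1].Key == fragment.Key)
                {
                    // Same key as the previous fragment: continue that row.
                    rows[rows.Count - 1].Cells.AddRange(fragment.Cells);
                }
                else
                {
                    // New key: start a new row with a copy of the cells.
                    rows.Add(new RowFragment(fragment.Key, fragment.Cells));
                }
            }
        }
        return rows;
    }
}
```

This version buffers the whole scan before returning, which keeps it short; a streaming variant would instead hold back only the most recent row until a fragment with a different key, or the end of the scan, shows that it is complete.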
