Giter VIP home page Giter VIP logo

hbasesqlimport's Introduction

##HBaseSQLImport

Eric Z. Beard

[email protected]

Imports data from SQL Server into HBase.

Mappings are defined between the columns in custom queries and columns in HBase.

This is a command line application written in Java that is similar to Sqoop.

It also offers some convenient functions for getting and scanning rows in HBase, displaying the values according to the schema established when importing data.

DISCLAIMER

This code is very early in its life cycle and has not undergone any serious testing! I don't recommend using it for any mission critical applications yet. The design is still evolving, and it may undergo serious changes before it's ready for production. It's been years since I've done any serious Java programming. In other words, use at your own risk!

Development Configuration: Mac OS X, Java 1.7, Hadoop 1.1.2, HBase 0.94.7 in pseudo-distributed mode, SQL Server 2008/12, and Vim of course. :w!

Build:

Just type "ant".

The build system probably needs some work. I created it with NetBeans, and I also created an IntelliJ IDEA project. Seems like all the cool kids are using Maven these days, so I should get that figured out at some point.

Usage:

java HBaseSQLImport.jar

  • -qn QueryName

    This is given for all commands that need to interact with HBase. It names the query to be used for importing data.

  • -save

    Save a mapping to the "schema" table. This table must be created manually before using hbsqli.

  • -q QueryFile

    Used when saving a table definition. The SQL import script is saved to HBase.

  • -ty Type (Table or Column)

    The HBase schema table holds the basic information for the table as well as the columns.

  • -c SQL Column Name

    Maps a SQL column name from the import script.

  • -t Column Data Type

    int, string, nstring, boolean, byte, short, double, float, datetime, guid, decimal

    These type names represent a relationship between a Java SQL Type and a regular Java type. For example, "guid" represents a SQL Server UNIQEUIDENTIFIER, which is a java.sql.CHAR, which gets converted to a Java String.

  • -k SQL Key

    The key column in the SQL query. This is used as the row key for each row saved to this HBase table. May be composite.

  • -hbt HBase Table

    The name of the HBase table.

  • -hbcf HBase Column Family

    The name of the HBase column family. HBase doesn't actually support many families well per table, so often you just have a single column family called "d".

  • -hbq HBase Column Qualifier

    The HBase column qualifer. Can be composite for nested values.

  • -hbl HBase Logical Column Name

    For documentation purposes. The logical name for the column. A data dictionary is created when defining mappings, so you can see what fields are where.

  • -hbd HBase Column Description

    For documentation purposes. A long description of the field.

  • -hbn HBase Nested Column (denormalized rows)

    You can store nested rows in the HBase table by supplying a composite qualifier. For example if a customer has multiple email addresses, you could have a qualifier of "Email_{EmailAddress}" or "Email_{EmailID}". The data from the field matching the part of the qualifer in braces would get written to the column name (yes, in HBase you store data in column names sometimes, it's Ok).

  • -show [-format]

    Show the schema mapping information for -qn QueryName

  • -dictionary

    Describe the columns in an HBase table, based on the schema mapping descriptions. These descriptions are independent of a any single import, so if two imports describe the same column, the last description processed wins (ideally they would have the same description).

  • -delete

    Delete a table or column schema mapping

  • -import

    Import the data for -qn QueryName

  • -daily

    For imports, add the SQL at the end of the file in a special comment section.

    e.g.

    -- DAILY

    -- where InsertedOn > dateadd(d, -2, getdate())

    -- /DAILY

    This enables you to use the same schema file and SQL for both the initial import of all data, and also for the daily incremental addition of new records. To handle more fine-grained updates to specific columns, you should simply write a new SQL query and a new JSON schema file to match it.

  • -schema FileName.json

    Define the schema using a JSON file instead of typing each one out manually as a separate command. This is how you will define schemas most of the time. There should be a one to one relationship between JSON schema files and SQL import files. If the same column is defined in multiple schema files, the most recent version wins. The schema and dictionary entries will be overwritten each time the schema is saved.

    [ { "qn":"Notifications", "ty":"Table", "hbt":"notification", "k":"{CompanyId}_{NotificationRunId}_{NotificationId}", "q":"/Users/ericzbeard/Hadoop/LoopHadoop/hbsqli/sql/notifications.sql" }, { "qn":"Notifications", "ty":"Column", "c":"NotificationId", "hbcf":"d", "hbq":"nid", "hbl":"Notification ID", "hbd":"The primary key for the Notification, a Guid String", "t":"guid" } ]

  • -get RowKey -hbt TableName

    Get a single row

  • -scan "d:x=y" -hbt TableName

    Scan all rows that match the filter

  • -columns "d:x,d:y"

    Column filter for -get or -scan

  • -examples Output some example commands

Tables and Columns are linked by the Query Name.

The idea is that you write import queries and name them so that as the application is processing rows, it knows where to put the data.

The import query can be any valid SQL Server TSQL query, it just needs to produce a single result set with values that match the expected data types.

Examples:

java -jar dist/HBaseSQLImport.jar -qn Companies -ty Table -hbt company -k CompanyId -save -q sql/companies.sql

java -jar dist/HBaseSQLImport.jar -qn Companies -ty Column -c CompanyName -hbcf d -hbq cn -hbl CompanyName -hbd "The name of the company" -save

java -jar dist/HBaseSQLImport.jar -qn Companies -show

java -jar dist/HBaseSQLImport.jar -qn Companies -show -format

java -jar dist/HBaseSQLImport.jar -qn Companies -delete -ty Table

java -jar dist/HBaseSQLImport.jar -scan "d:asn=Scheduled Maintenance" -hbt notification -columns "d:cid,d:nid"

java -jar dist/HBaseSQLImport.jar -get 123 -hbt company

java -jar dist/HBaseSQLImport.jar -import -qn Companies -sqlh mymachine -sqlu hadoop -sqldb dbname

java -jar dist/HBaseSQLImport.jar -schema ~/company.json

java -jar dist/HBaseSQLImport.jar -scan \* -hbt lookup

hbasesqlimport's People

Contributors

ericzbeard avatar venkateshgovind avatar

Watchers

Jason DiOrio avatar  avatar Oliver Kinzl avatar James Cloos avatar Alexey Lisin avatar

Forkers

loop-dryan

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.