The hbase-secondary-index from thushear

#More details, see wiki page(Chinese wiki 中文使用手册): https://github.com/mayanhui/hbase-secondary-index/wiki

###################################################

Methods of building secondary index

###################################################

##0.Environment

hadoop: 1.0.4
hbase: 0.94.0
zookeeper: 3.4.3
hive: 0.9.0
thrift: 0.9.0

##1.Many ways to build index

###1.1 MapReduce

Using integration mapreduce to build hbase index for main table. The main structure is:

(1) scan input table by TableMapper<ImmutableBytesWritable, Writable>

(2) get the rowkey and special colum name and value

(3) create instance of Put with value=rowkey, and rowkey=columnName + "_" +columnValue

(4) use IdentityTableReducer to put data into index table

Index type support:

build single column index
build multi single-column index together
build combined-column index
build json column index. single-field, combined-field index
build rowkey only index

Command to build index:

1. build single column index

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid -s 20130101 -e 20130120 -v 1

1. build multi single-column index together

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg -s 20130101 -e 20130120 -v 3

1. build combined-column index

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg -si false

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:mid,cf1:age,cf2:msg -si false -s 20130101 -e 20130120 -v 1

1. build json column index. single-field, combined-field index

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category -s 20130101 -e 20130120 -v 1

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category -si false

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c cf1:msg -j area,type,category -si false -s 20130101 -e 20130120 -v 1

1. build rowkey only index

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey -r uid:1,mid:2,isrowkey:1

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey:cf1:content -r uid:1,mid:2,isrowkey:1

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey:cf1:content -r uid:1,mid:2,isrowkey:1 -s 20130101 -e 20130120

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i demo_table -o demo_table_index -c rowkey:cf1:content -r uid:1,mid:2,isrowkey:1 -s 20130101 -e 20130120 -v 1

###1.2 ITHBase

$HBASE_HOME/conf/hbase-site.xml：

hbase.hlog.splitter.impl

org.apache.hadoop.hbase.regionserver.transactional.THLogSplitter

hbase.regionserver.class

org.apache.hadoop.hbase.ipc.IndexedRegionInterface

hbase.regionserver.impl

org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer

hbase.hregion.impl

org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegion

###1.3 IHBase

The implementation of this method is from https://github.com/ykulbak/ihbase. However, the code is not available at all due to many classes missing. This method is not recommended because it is invasive.

###1.4 Coprocessor A demo is implemented. This method is proposed from habse-0.92.0 and not perfect now. The characteristic are:

Must implement a train of code for an index. Poor Reusability.
Must disable table before using alter table. unfriendly method for online service.
Better than other online index building methods(invasive).

#####################

MapReduce

#####################

##2 MapReduce Usage

###2.1 Build from source code Download the source code first and then use maven to build jar. go into the project and do:

mvn install

Note: You need to install maven >= 2.2.1

###2.2 use jar You can see the jar file in root directory of project: hbase-secondary-index-0.1.jar You can use it directly!

###2.3 Build index Use the example of buildindex.sh in directory 'src/main/resources' Such as:

hadoop jar hbase-secondary-index-0.1.jar net.hbase.secondaryindex.mapred.Main -i user_behavior_attribute_noregistered -o user_behavior_attribute_noregistered_index -c bhvr:vvmid -s 20130101 -e 20130120 -v 3

usage: Build-Secondary-Index -c family:qualifier [-d] [-e ] -i [-j ] -o [-r ] [-s ] [-si ] [-v ]

-c,--column family:qualifier column to store row data into (must exist). Such as: cf1:age,cf2:tag,cf2:msg or rowkey or rowkey,cf1:age. The last two usage are for 'rowkey' index building.

-d,--debug switch on DEBUG log level

-e,--edate the end date of data to build index(default is today), such as: 20130120

-i,--input the directory or file to read from (must exist)

-j,--json json fields to build index. The max number of fields is 3! This kind of data uses IndexJsonMapper.class.

-o,--output table to import into (must exist)

-r,--rowkey rowkey fields to build index. The max number of fields is 2! This kind of data uses IndexRowkeyMapper.class. The format is: uid:1,msgid:2,isrowkey:1 uid and msgid are the field name, 1 and 2 is the order in the rowkey(like: uid_msgid_ts). isrowkey is the label to define which field is the new rowkey. The separator in rowkey is _ . You can use validate column to build incremental index. If use validate column, you need to add a column to -c parameter, the -c should be 'rowkey,cf1:age'

-s,--sdate the start date of data to build index(default is 19700101), such as: 20130101

-si,--sindex if use single index. true means 'single index', false means 'combined index'(default is true). If build combined index, the max number of columns is 3.

-v,--versions the versions of each cell to build index(default is Integer.MAX_VALUE)

##License Released under the GPLv3 license. For full details, pleasesee the LICENSE file included in this distribution.

thushear / hbase-secondary-index Goto Github PK

hbase-secondary-index's Introduction

Methods of building secondary index

MapReduce

hbase-secondary-index's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent