huawei-hadoop / hindex
Secondary Index for HBase
License: Apache License 2.0
Something like index TO_UPPER(c1). We can have a set of predefined functions like this. (Refer to Oracle/PG.) Also allow UDFs(?)
Will add the details later in description
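As a rough sketch of the functional-index idea above, a predefined transform could be applied to the cell value before it is written into the index row key. Note this is only an illustration of the concept: the `IndexFunction` interface, the registry, and the `TO_UPPER`/`TO_LOWER` names are hypothetical, not part of hindex.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a registry of predefined index functions, so an index
// could be declared on TO_UPPER(c1) instead of the raw column value. UDFs
// could be registered into the same map.
public class FunctionalIndexSketch {

  /** Transforms a cell value before it is stored in the index row key. */
  interface IndexFunction {
    byte[] apply(byte[] value);
  }

  static final Map<String, IndexFunction> FUNCTIONS = Map.of(
      "TO_UPPER", v -> new String(v, StandardCharsets.UTF_8)
          .toUpperCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8),
      "TO_LOWER", v -> new String(v, StandardCharsets.UTF_8)
          .toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));

  /** Value that would go into the index table for the given function. */
  static byte[] indexValue(String function, byte[] cellValue) {
    return FUNCTIONS.get(function).apply(cellValue);
  }

  public static void main(String[] args) {
    byte[] out = indexValue("TO_UPPER", "hbase".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(out, StandardCharsets.UTF_8)); // HBASE
  }
}
```

With such a registry, two rows whose c1 values differ only in case would land on the same index key, which is the point of a functional index.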
When I insert test data, there is an error:
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 5000 actions: NotServingRegionException: 5000 times, servers with issues: xxxxx:60020,
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1677)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1453)
at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:738)
I saw this in the future work list.
-> WAL for secondary index can be created from the WAL for main region.
-> But the order of region opening may be important.
-> New hooks in the creation/reading of recovered.edits may be needed if current ones are not sufficient.
As I said, you can close these issues if you already have plans to implement this. I am just using this as a way to add comments. Thank you.
Presently, when we add new indices we are not checking for the existence of indices, so if the user specifies an index which is already present, we do duplicate index puts.
Also, the index table descriptor creation is not proper.
Presently, while doing a split from an external client or after compaction, we have a check for the index table region, which is a kernel change. Instead, we can define a split policy which returns false on an explicit split.
We store some 6 bytes of data with every KV added to the index table (one int and one short). Can we use vints here? Mostly these int and short values can be represented using 1 byte each, so in effect we could save 4 bytes with each KV in the index table.
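A minimal sketch of the savings, using a protobuf-style varint rather than Hadoop's exact WritableUtils vint wire format (an assumption made for illustration): any value below 128 encodes in a single byte, so in the common case the 4-byte int plus 2-byte short shrink from 6 bytes to 2.

```java
import java.io.ByteArrayOutputStream;

// Minimal varint sketch (protobuf-style; NOT Hadoop's exact WritableUtils
// encoding): small non-negative values occupy a single byte instead of the
// fixed 4 (int) or 2 (short) bytes.
public class VIntSketch {

  static byte[] encodeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Emit 7 bits per byte, low bits first; the high bit of each byte
    // marks "more bytes follow".
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // One int + one short per index KV is 6 bytes fixed-width; with
    // varints, typical small values need 1 byte each.
    System.out.println(encodeVInt(42).length);   // 1 byte instead of 4
    System.out.println(encodeVInt(300).length);  // 2 bytes
  }
}
```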
Are the user region and index region balanced at the same time?
When putting data into HBase, it causes a deadlock issue. After tracing the code, I found it in the HRegion.batchMutate method:
startRegionOperation();
if (coprocessorHost != null) {
coprocessorHost.postStartRegionOperation(); ------a
}
try {
......
} finally {
closeRegionOperation(); ----------b
if (coprocessorHost != null) {
coprocessorHost.postCloseRegionOperation();
}
}
......
When step 'a' executes, if the index region has not been balanced to this server yet, it throws an IOException, so step 'b' does not execute, which then causes a deadlock when closing this region.
Could you check it?
Now the index can be specified on one or more columns. Adding a part of the RK into the index specification could also be done (?)
A point raised by James Taylor.
Handle points mentioned by Ted at
https://issues.apache.org/jira/browse/HBASE-9203?focusedCommentId=13740511&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13740511
This helps to avoid put failures during a split when the user table's daughter regions are assigned but the index table's daughter region assignment is still in progress.
Presently we don't have merge support for index regions on a user region merge request. This is very important for >0.95 because we have online merge support: regions can be merged at any time.
This helps to make it more user friendly, and we can reuse some methods of AbstractHBaseTool.
Now the creation of this new class/HalfStoreFileReader happens (to read half files) based on table name checks in core code. We can look at adding a CP hook in core code and implementing this using the new hook.
Presently we are calling postStartRegionOperation right after startRegionOperation:
{code}
startRegionOperation();
if (coprocessorHost != null) {
coprocessorHost.postStartRegionOperation();
}
{code}
In startRegionOperation we acquire the read lock. If there is any exception in postStartRegionOperation, we throw the exception out and the lock won't be released. This will block region closing or other operations which need the write lock.
If we call it inside the try block, then we can release the lock in closeRegionOperation.
Some test cases are hanging while closing all regions in shutdown because of this issue.
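A hedged sketch of the fix described above, using a plain ReentrantReadWriteLock as a stand-in for the region's internal lock (the class and hook here are hypothetical, mirroring the snippet only in shape): moving the post-start hook inside try means a hook failure still releases the read lock in finally.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the proposed try/finally pattern: the post-start hook runs inside
// try, so even if it throws, the read lock is released in finally and region
// close (which needs the write lock) cannot hang.
public class RegionOperationSketch {
  static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  static void batchMutate(Runnable postStartHook) {
    lock.readLock().lock();           // startRegionOperation()
    try {
      postStartHook.run();            // may throw, e.g. from a coprocessor
      // ... the actual mutation work ...
    } finally {
      lock.readLock().unlock();       // closeRegionOperation() always runs
    }
  }

  public static void main(String[] args) {
    try {
      batchMutate(() -> { throw new RuntimeException("hook failed"); });
    } catch (RuntimeException expected) {
      // Even after the hook failure, the write lock is still obtainable,
      // so closing the region would not block.
      System.out.println(lock.writeLock().tryLock()); // true
    }
  }
}
```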
Maybe we can have an IndexAdmin extended from HBaseAdmin.
This helps to avoid kernel changes in HBaseAdmin.
Also, if there are any new admin operations related to the index, we can add them to this class.
If we maintain index data in a separate column family instead of a separate table, then we can avoid all admin operations like create, enable, disable and split on the index table. We can achieve this with minimal changes to our code base; most of the region-wise operations would then be taken care of by the kernel itself.
I get blocked in the following situation.
During a put operation, indexRegion.closing is set to true while the close waits for the lock. The indexRegion.batchMutateForIndex method logs a WARN when executing doMiniBatchMutation, because indexRegion.closing is true:
try {
acquiredLockId = getLock(providedLockId, mutation.getRow(), shouldBlock);
} catch (IOException ioe) {
LOG.warn("Failed getting lock in batch put, row=" + Bytes.toStringBinary(mutation.getRow()), ioe);
}
doMiniBatchMutation always returns 0L, so indexRegion.batchMutateForIndex loops endlessly:
while (!batchOp.isDone()) {
....
}
It keeps logging the WARN to the log file until the disk is full.
Hi, I use hindex to scan data like this:
public void testSingleIndexExpressionWithMoreEqualsExpsAndOneRangeExp() throws Exception {
String indexName = "IDX1";
SingleIndexExpression singleIndexExpression = new SingleIndexExpression(indexName);
byte[] value1 = "g".getBytes();
Column column = new Column(FAMILY1, QUALIFIER9);
EqualsExpression equalsExpression = new EqualsExpression(column, value1);
singleIndexExpression.addEqualsExpression(equalsExpression);
column = new Column(FAMILY1, QUALIFIER2);
byte[] value2_1 = Bytes.toBytes("1383633260000");
byte[] value2_2 = Bytes.toBytes("1383633262000");
RangeExpression re = new RangeExpression(column, value2_1, value2_2, true, false);
singleIndexExpression.setRangeExpression(re);
Scan scan = new Scan();
scan.setAttribute(Constants.INDEX_EXPRESSION, IndexUtils.toBytes(singleIndexExpression));
FilterList fl = new FilterList(Operator.MUST_PASS_ALL);
Filter filter = new SingleColumnValueFilter(FAMILY1, QUALIFIER9, CompareOp.EQUAL, value1);
fl.addFilter(filter);
filter = new SingleColumnValueFilter(FAMILY1, QUALIFIER2, CompareOp.GREATER_OR_EQUAL, value2_1);
fl.addFilter(filter);
filter = new SingleColumnValueFilter(FAMILY1, QUALIFIER2, CompareOp.LESS, value2_2);
fl.addFilter(filter);
scan.setFilter(fl);
HTablePool pool = new HTablePool(configuration, 1000);
HTableInterface table = pool.getTable(tableName);
long current = System.currentTimeMillis();
outputResult(scan, table);
System.out.println(System.currentTimeMillis() - current);
}
When I use scan.setAttribute(Constants.INDEX_EXPRESSION, IndexUtils.toBytes(singleIndexExpression)), the scan can't return any data, but if I remove this line I can scan the correct data, so I'm confused.
Here is some of the log output after removing scan.setAttribute():
2013-11-26 10:58:05,224 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Checking for the best index(s) for the cols combination : [[cf1 : a2, cf1 : a9]]
2013-11-26 10:58:05,224 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Trying to find a best index for the cols : [cf1 : a2, cf1 : a9]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Possible indices for cols [cf1 : a2, cf1 : a9] : [Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Best index selected for the cols [cf1 : a2, cf1 : a9] : null
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Not even one index found for the cols combination : [[cf1 : a2, cf1 : a9]]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Checking for the best index(s) for the cols combination : [[cf1 : a2], [cf1 : a9]]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Trying to find a best index for the cols : [cf1 : a2]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Possible indices for cols [cf1 : a2] : [Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Index Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9] seems to be suitable for the columns [cf1 : a2]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Best index selected for the cols [cf1 : a2] : Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Trying to find a best index for the cols : [cf1 : a9]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Possible indices for cols [cf1 : a9] : [Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]]
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Best index selected for the cols [cf1 : a9] : null
2013-11-26 10:58:05,225 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Index(s) which will be used for columns [cf1 : a2, cf1 : a9] : {[cf1 : a2]=Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]}
2013-11-26 10:58:05,225 INFO org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Index using for the columns [cf1 : a2] : Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]
2013-11-26 10:58:05,231 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionScannerForAND: Removing scanner org.apache.hadoop.hbase.index.coprocessor.regionserver.LeafIndexRegionScanner@3aa49259 from the list.
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Checking for the best index(s) for the cols combination : [[cf1 : a2, cf1 : a9]]
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Trying to find a best index for the cols : [cf1 : a2, cf1 : a9]
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Possible indices for cols [cf1 : a2, cf1 : a9] : [Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]]
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Best index selected for the cols [cf1 : a2, cf1 : a9] : null
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Not even one index found for the cols combination : [[cf1 : a2, cf1 : a9]]
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Checking for the best index(s) for the cols combination : [[cf1 : a2], [cf1 : a9]]
2013-11-26 10:58:05,873 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Trying to find a best index for the cols : [cf1 : a2]
2013-11-26 10:58:05,874 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Possible indices for cols [cf1 : a2] : [Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]]
2013-11-26 10:58:05,874 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Index Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9] seems to be suitable for the columns [cf1 : a2]
2013-11-26 10:58:05,874 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Best index selected for the cols [cf1 : a2] : Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]
2013-11-26 10:58:05,874 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Trying to find a best index for the cols : [cf1 : a9]
2013-11-26 10:58:05,874 DEBUG org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator: Possible indices for cols [cf1 : a9] : [Index : IDX1,Index Columns : [CF : cf1,Qualifier : a2, CF : cf1,Qualifier : a3, CF : cf1,Qualifier : a9]]
It was missed during rebase, and we observed lower performance with the secondary index.
Is it possible to migrate hindex to a higher version of HBase, like 0.96?
Based on my understanding, there is a huge difference between 0.96 and 0.94.8, like the changed project structure, some core API changes, etc.
How can I merge hindex's code into HBase 0.96?
Thank you
These days I am running write performance tests. There are two tables, one with an index (called idx_table) and the other without (called no_idx_table).
Writing to no_idx_table is OK, but after writing some data to idx_table, the region server keeps throwing this exception:
org.apache.hadoop.hbase.NotServingRegionException:
xxxx. is closing
At http://hmaster:60010, under "Regions in Transition", two regions (one from idx_table, the other from idx_table_idx) display
state=PENDING_CLOSE, ts=xxxxx, server=null
Can anyone tell me what is wrong? This keeps happening.
I think the API is very important to beginners. Another question: when will the secondary index support Hadoop 2.X?
Thank you
SplitTransaction uses ThreadLocal variables to deal with the details of the parent's daughter regions and the index region's daughter regions.
I think the CP hooks added can be contributed (maybe they already are), but the other changes, like adding an atomic Put to the main and index region and carrying the daughter region's info in the ThreadLocal, either have to be moved out of the core code or we need to find a way to do it in core code through a CP hook. If we go ahead with the CF approach, then this may not be needed.
Hi, I used bulkload to import data, and there is some code in TableIndexer like this:
String[] tableName = conf.getStrings(TABLE_NAME_TO_INDEX);
if (tableName == null) {
System.out
.println("Wrong usage. Usage is pass the table -Dindex.tablename='table1' "
+ "-Dtable.columns.index='IDX1=>cf1:[q1->datatype& length],[q2],"
+ "[q3];cf2:[q1->datatype&length],[q2->datatype&length],[q3->datatype& lenght]#IDX2=>cf1:q5,q5'");
System.out.println("The format used here is: ");
TABLE_NAME_TO_INDEX's actual value is tablename.to.index, but the usage message says -Dindex.tablename='table1'. Is that right?
I have deployed Hadoop and hindex successfully, created a table and inserted data, and the index table also exists. So how do I scan for a specific qualifier which has an index? Like this:
get 'test','rowkey','Family:Qualifier','value' ?
Also, how do I deploy hindex into a cluster? Is it just like HBase 0.90.4? And how do I build the project?
And can it work with Hadoop 1.X?
We can avoid the new IndexedHTD
If we create a table first and there is already some data in it, and now we want to use hindex: I use the HtableAdmin.modifyAdmin method, and it can create the index table, but it cannot index the data already in the table. Is there a method to resolve this issue?
Hi, I'm a Chinese coder, and I use your project for secondary indexing, but putting data is much slower than in native HBase.
Thanks for helping me.
When I run this MapReduce job, I found it can read from the HBase table, but the map output is 0.
I found that IndexUtils.getIndexedHTableDescriptor(tablename, conf) always returns null, so the mapper cannot output any records to the reducer.
I found some interfaces not implemented when compiling with Eclipse using CDH 4.3.0 HBase jars.
Does hindex take supporting CDH4 HBase into consideration?
Long time = indexWALEdit.getKeyValues().get(0).getTimestamp();
ctx.getEnvironment()
.getWAL()
.appendNoSync(indexRegion.getRegionInfo(), Bytes.toBytes(indexTableName), indexWALEdit,
logKey.getClusterId(), time, indexRegion.getTableDesc());
This is added specifically in the core but is used only for the index scenario. We need to find a better way.
I was able to build the project and run the MapReduce bulk insert and incremental HFile load:
hbase org.apache.hadoop.hbase.index.mapreduce.IndexImportTsv
hbase org.apache.hadoop.hbase.index.mapreduce.IndexLoadIncrementalHFile
But something strange is happening: after the completion of the process for just 6 GB of data, the size of the HBase table kept increasing up to 200 GB, after which I had to shut down the cluster.
Please suggest what's going wrong here.
Thanks
Can you create a patch for all the core changes (which are not present in the 94 code base) and attach it as a file in this Git?
Comparing hindex to Phoenix's secondary index (https://github.com/forcedotcom/phoenix): what are the advantages and drawbacks? Would combining Phoenix's SQL with hindex's secondary index be a good scheme?
This can be implemented by passing an Expression such as SkipRangeScan as an attribute of the scan.
The security check is currently bypassed while accessing the index tables. A basic implementation could grant the same ACL as on the user table. We need to work out the best solution, considering per-cell ACLs in HBase >= 0.98.
You can close this issue. I am not sure how to comment, so I am using this facility. If you want to keep it open, you can.
The data was uploaded to HBase by the bulkload tool.
The regionserver log is:
ERROR org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver: Exception occured in postScannerOpen for the table test_CDR2
java.util.NoSuchElementException
at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:375)
at java.util.LinkedHashMap$KeyIterator.next(LinkedHashMap.java:384)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.selectBestFitIndexForColumn(ScanFilterEvaluator.java:1086)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.selectBestFitAndPossibleIndicesForSCVF(ScanFilterEvaluator.java:1064)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.evalFilterForIndexSelection(ScanFilterEvaluator.java:480)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.evaluate(ScanFilterEvaluator.java:128)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver.postScannerOpen(IndexRegionObserver.java:484)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1315)
at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2560)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
2013-08-20 17:46:36,796 ERROR org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver: Exception occured in postScannerOpen for the table test_CDR2
java.util.NoSuchElementException
at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(LinkedHashMap.java:375)
at java.util.LinkedHashMap$KeyIterator.next(LinkedHashMap.java:384)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.selectBestFitIndexForColumn(ScanFilterEvaluator.java:1086)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.selectBestFitAndPossibleIndicesForSCVF(ScanFilterEvaluator.java:1064)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.evalFilterForIndexSelection(ScanFilterEvaluator.java:480)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.ScanFilterEvaluator.evaluate(ScanFilterEvaluator.java:128)
at org.apache.hadoop.hbase.index.coprocessor.regionserver.IndexRegionObserver.postScannerOpen(IndexRegionObserver.java:484)
at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.postScannerOpen(RegionCoprocessorHost.java:1315)
at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2560)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
thanks!!