airbnb / reair
ReAir is a collection of easy-to-use tools for replicating tables and partitions between Hive data warehouses.
License: Apache License 2.0
See the code that passes a single ThriftClient to many threads.
I don't think the ThriftClient is thread-safe. We have many databases, so many threads (16 by default, as defined in the source code) will run against a single ThriftClient, which causes unexpected failures on the server side.
Suggested fixes:
A. Create a new ThriftClient for each database (see the sketch below), or
B. Just get rid of the multi-threaded case here, since listing through tens or even hundreds of databases won't take much time.
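A minimal sketch of suggestion A, assuming a hypothetical newMetastoreClient() factory and process() helper (neither exists under these names in the ReAir source, and the getAllTables/close calls on the client are assumed): each task builds and closes its own client instead of sharing one across threads.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

static void scanDatabases(List<String> databases) throws InterruptedException {
  ExecutorService pool = Executors.newFixedThreadPool(16); // same default thread count as the source
  for (String database : databases) {
    pool.submit((Callable<Void>) () -> {
      // One client per task, so no ThriftClient instance is shared across threads.
      ThriftHiveMetastoreClient client = newMetastoreClient(); // hypothetical factory
      try {
        for (String table : client.getAllTables(database)) { // assumed listing call
          process(database, table); // hypothetical per-table work
        }
      } finally {
        client.close();
      }
      return null;
    });
  }
  pool.shutdown();
  pool.awaitTermination(1, TimeUnit.HOURS);
}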
This one line creates a new Hive client each time; it must waste resources and may cause an OOM!
This version of ReAir doesn't support two nameservices; the relevant configuration should be added, for example:
dfs.nameservices = srcCluster,destCluster
dfs.ha.namenodes.srcCluster = nn1,nn2
dfs.ha.namenodes.destCluster = namenode317,namenode319
dfs.namenode.rpc-address.srcCluster.nn1 = a1-ops-hdnamenode01.hz:8020
dfs.namenode.rpc-address.srcCluster.nn2 = a1-ops-hdnamenode02.hz:8020
dfs.namenode.rpc-address.destCluster.namenode317 = a2-prod-sh-namenode-XX-124.sh:8020
dfs.namenode.rpc-address.destCluster.namenode319 = a2-prod-sh-namenode-XX-123.sh:8020
dfs.client.failover.proxy.provider.srcCluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
dfs.client.failover.proxy.provider.destCluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
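For reference, a minimal sketch of applying these settings through the Hadoop client API (normally they would live in hdfs-site.xml; the properties are the standard HDFS HA client settings, only the srcCluster half is shown):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

static FileSystem openSourceFs() throws IOException {
  Configuration conf = new Configuration();
  conf.set("dfs.nameservices", "srcCluster,destCluster");
  conf.set("dfs.ha.namenodes.srcCluster", "nn1,nn2");
  conf.set("dfs.namenode.rpc-address.srcCluster.nn1", "a1-ops-hdnamenode01.hz:8020");
  conf.set("dfs.namenode.rpc-address.srcCluster.nn2", "a1-ops-hdnamenode02.hz:8020");
  conf.set("dfs.client.failover.proxy.provider.srcCluster",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
  // ...plus the matching dfs.ha.namenodes / rpc-address / proxy.provider entries for destCluster...
  return FileSystem.get(URI.create("hdfs://srcCluster/"), conf); // resolved via the HA failover proxy
}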
Is it possible to replicate the metastore only?
Is it possible to copy metadata between different Hive metastore versions?
If Kerberos is in use, the client can't connect to the metastore; I suggest using the Hive client.
I deployed two Hive metastores in one cluster, Hive 2.1 and Hive 3.0. Now I just want to synchronize the Hive metadata from 2.1 to 3.0 without changing the HDFS data. Is this supported?
Does ReAir support replicating tables to a secured/Kerberized cluster?
I am facing the exception below:
2017-07-07 01:52:06,930 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-07-07 01:52:06,951 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: database 0, table some_hive_db.some_table_in_hive got exception
    at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapperWithTextInput.map(MetastoreReplicationJob.java:622)
    at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapperWithTextInput.map(MetastoreReplicationJob.java:593)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: com.airbnb.reair.common.HiveMetastoreException: org.apache.thrift.transport.TTransportException
    at com.airbnb.reair.common.ThriftHiveMetastoreClient.getTable(ThriftHiveMetastoreClient.java:126)
    at com.airbnb.reair.incremental.primitives.TaskEstimator.analyzeTableSpec(TaskEstimator.java:84)
    at com.airbnb.reair.incremental.primitives.TaskEstimator.analyze(TaskEstimator.java:68)
    at com.airbnb.reair.batch.hive.TableCompareWorker.processTable(TableCompareWorker.java:136)
    at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapperWithTextInput.map(MetastoreReplicationJob.java:614)
    ... 9 more
Caused by: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:340)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:202)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1263)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1249)
    at com.airbnb.reair.common.ThriftHiveMetastoreClient.getTable(ThriftHiveMetastoreClient.java:121)
    ... 13 more
I saw code in TableCompareWorker.java#L90 calling getMetastoreClient() in HardCodedCluster.java#L62. I don't see any implementation that checks whether the connection needs to be in secure mode. Based on a security flag in the config, we could switch which type of TTransport to use.
Any thoughts on how to implement it?
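Here is a minimal sketch of what I mean, assuming a hypothetical isSecure flag and server principal read from the ReAir config (the SASL wrapping is the standard Kerberos Thrift client pattern, with TUGIAssumingTransport from Hive's shims):

import java.util.HashMap;
import java.util.Map;
import javax.security.sasl.Sasl;
import org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.thrift.transport.TSaslClientTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

static TTransport openMetastoreTransport(String host, int port, boolean isSecure,
    String serverPrincipalPrimary) throws Exception { // isSecure + principal: hypothetical config values
  TTransport transport = new TSocket(host, port);
  if (isSecure) {
    Map<String, String> saslProps = new HashMap<String, String>();
    saslProps.put(Sasl.QOP, "auth-conf");
    saslProps.put(Sasl.SERVER_AUTH, "true");
    TSaslClientTransport sasl = new TSaslClientTransport(
        "GSSAPI", null, serverPrincipalPrimary, host, saslProps, null, transport);
    // TUGIAssumingTransport runs open() inside a UGI doAs, so the SASL
    // handshake happens as the Kerberos-logged-in user.
    transport = new TUGIAssumingTransport(sasl, UserGroupInformation.getCurrentUser());
  }
  transport.open();
  return transport;
}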
I was trying to connect to Thrift from a standalone program based on https://github.com/joshelser/krb-thrift, but no luck. Now I get:
Exception in thread "main" org.apache.thrift.transport.TTransportException: SASL authentication not complete
Standalone Java program snippet:
import java.util.HashMap;
import java.util.Map;
import javax.security.sasl.Sasl;
import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
import org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport; // vendored in the krb-thrift example
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSaslClientTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

TTransport transport = new TSocket(host, port);
Map<String, String> saslProperties = new HashMap<String, String>();
// Use authorization and confidentiality
saslProperties.put(Sasl.QOP, "auth-conf");
saslProperties.put(Sasl.SERVER_AUTH, "true");
System.out.println("Security is enabled: " + UserGroupInformation.isSecurityEnabled());

// Log in via UGI; ensures we have logged in with our KRB credentials
UserGroupInformation.loginUserFromKeytab("someuser", "/etc/security/keytabs/someuser.headless.keytab");
UserGroupInformation currentUser = UserGroupInformation.getCurrentUser();
System.out.println("Current user: " + currentUser);

// SASL client transport -- does the Kerberos lifting for us
TSaslClientTransport saslTransport = new TSaslClientTransport(
    "GSSAPI",        // tell SASL to use GSSAPI, which supports Kerberos
    null,            // authorizationid - null
    args[0],         // kerberos primary for server - "myprincipal" in myprincipal/my.server.com@REALM
    args[1],         // kerberos instance for server - "my.server.com" in myprincipal/my.server.com@REALM
    saslProperties,  // properties set above
    null,            // callback handler - null
    transport);      // underlying transport

// Make sure the transport is opened as the user we logged in as
TUGIAssumingTransport ugiTransport = new TUGIAssumingTransport(saslTransport, currentUser);
ThriftHiveMetastore.Client client = new ThriftHiveMetastore.Client(new TBinaryProtocol(ugiTransport));
// Open the UGI wrapper rather than the raw socket: calling transport.open()
// here skips the SASL handshake and is what produces "SASL authentication not complete".
ugiTransport.open();
Any guidance will help me contribute back a patch enabling Kerberos support.
Changing the org.apache.thrift version doesn't work properly: bumping from 0.9.1 to 0.12.0 gives an issue while trying to build:
com.airbnb.reair.incremental.thrift.TReplicationService.AsyncClient.pause_call is not abstract and does not override abstract method getResult() in org.apache.thrift.async.TAsyncMethodCall
(TAsyncMethodCall became generic with an abstract getResult() in later Thrift releases, so the checked-in generated sources need to be regenerated with the matching Thrift compiler.)
I need some understanding of ReAir: how does it help on top of a Hadoop/Hive warehouse, given that Hadoop is already a big platform for exploring big data?
Stage 2 maps are unbalanced: many files end up in one reducer, and as a result the task runs very slowly.
With different clusters, if their block sizes differ, stage 3 will fail, because directoryCopier.equalDirs must return false: the checksums are different. If a file is bigger than one block, this is bound to happen. For now I check files by their byte counts.
Below are the differing checksums:
/test/123.lzo MD5-of-0MD5-of-512CRC32C
/test/123.lzo MD5-of-0MD5-of-256CRC32C
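A minimal sketch of the byte-count check (plain Hadoop FileSystem API; HDFS checksum algorithm names like MD5-of-0MD5-of-512CRC32C encode the CRC chunking, so they legitimately differ across clusters with different block sizes even for identical bytes):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Returns true when the two files have the same length; a weaker check than a
// checksum, but stable across clusters whose block sizes differ.
static boolean sameLength(Configuration conf, Path src, Path dest) throws IOException {
  FileSystem srcFs = src.getFileSystem(conf);
  FileSystem destFs = dest.getFileSystem(conf);
  return srcFs.getFileStatus(src).getLen() == destFs.getFileStatus(dest).getLen();
}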
private static void copyFile(Path srcFile, Path destFile) throws IOException {
    String[] copyArgs = { "-cp", srcFile.toString(), destFile.toString() };
    FsShell shell = new FsShell();
    // 1. ============ Whether it should be like this: ======================
    // String[] copyArgs = { "-put", srcFile.toString(), destFile.toString() };
    // 2. ================================================
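For reference: hdfs dfs -cp copies between (possibly remote) filesystems when given full URIs, while -put expects a local source, so -cp fits an HDFS-to-HDFS copy. A minimal sketch of driving FsShell through ToolRunner, which supplies the Configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;

static void copyFile(Configuration conf, Path srcFile, Path destFile) throws Exception {
  String[] copyArgs = { "-cp", srcFile.toString(), destFile.toString() };
  int rc = ToolRunner.run(conf, new FsShell(), copyArgs); // runs like "hdfs dfs -cp src dest"
  if (rc != 0) {
    throw new RuntimeException("FsShell -cp failed with exit code " + rc);
  }
}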
May I ask whether the project is still in development?
I started the ReAir jar to replicate the source warehouse to the destination warehouse, but I ran into the following exception and I don't know why:
17/05/23 17:05:33 INFO mapreduce.Job: The url to track the job: http://cdh.master.linesum:8088/proxy/application_1495527145657_0006/
17/05/23 17:05:33 INFO mapreduce.Job: Running job: job_1495527145657_0006
17/05/23 17:05:38 INFO mapreduce.Job: Job job_1495527145657_0006 running in uber mode : false
17/05/23 17:05:38 INFO mapreduce.Job: map 0% reduce 0%
17/05/23 17:05:41 INFO mapreduce.Job: Task Id : attempt_1495527145657_0006_m_000001_0, Status : FAILED
Error: java.util.NoSuchElementException
    at java.util.AbstractList$Itr.next(AbstractList.java:364)
    at com.google.common.collect.Iterators.any(Iterators.java:684)
    at com.google.common.collect.Iterables.any(Iterables.java:620)
    at com.airbnb.reair.batch.hive.TableCompareWorker.processTable(TableCompareWorker.java:123)
    at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapper.map(MetastoreReplicationJob.java:570)
    at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapper.map(MetastoreReplicationJob.java:556)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Thanks for open-sourcing this project. I'm having an issue integrating a Spark app to record audit logs using the hive-hooks module. Just curious, have you ever tried Spark integration? It seems Spark supports Hive hooks in theory, but it doesn't work when initializing the remote HiveMetastore client.
This is a bug as discussed in https://groups.google.com/forum/#!topic/airbnb-reair/fP-0F9xCBtI
Here are the 4 places:
The result of these is that we recreate many connections to the Hive metastore in a very short period of time, with hundreds of mappers/reducers running concurrently.
We should cache the MetastoreClient connections in a thread-safe way, using ThreadLocal<>.
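A minimal sketch of that caching, assuming a hypothetical createMetastoreClient() factory for the configured cluster (the client type is the one in com.airbnb.reair.common):

// Each mapper/reducer thread gets exactly one cached client instead of
// reconnecting on every task invocation.
private static final ThreadLocal<ThriftHiveMetastoreClient> CLIENT =
    new ThreadLocal<ThriftHiveMetastoreClient>() {
      @Override
      protected ThriftHiveMetastoreClient initialValue() {
        return createMetastoreClient(); // hypothetical factory
      }
    };

static ThriftHiveMetastoreClient getCachedClient() {
  return CLIENT.get();
}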
Allow a database prefix to be added on the destination cluster. This can be used for production testing; for example, add "test_" to all database names on the destination cluster.
Fun, and I learned a lot.
Hi,
When I try to replicate the Hive warehouse from one cluster to another, it is not successful.
Can you tell me which parameters need to be changed and which user to run as? Also, how do I use the blacklist parameter with databases and tables?
And for batch metastore replication, which configuration file should be edited? One variant uses MySQL and the other is without MySQL.
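For reference, here is the shape I would expect a blacklist entry to take in the batch config; both the key name and the db-regex:table-regex value format below are guesses, so the sample config files shipped in the repo should be checked:

airbnb.reair.batch.metastore.blacklist = tmp_db:.*,.*:tmp_.*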