
reair's People

Contributors

ahojman, aoen, dkrapohl, fermich, jameadows, jingweilu1974, john-bodley, karentycoon, ljharb, plypaul, ronbak, saguziel, sungjuly, uzshao, zshao


reair's Issues

MetastoreScanInputFormat uses a single ThriftClient in many threads

See the code that passes a single ThriftClient to many threads.

I don't think the ThriftClient is thread-safe. We have many databases, so many threads (16 by default, as defined in the source code) will run on a single ThriftClient, which causes unexpected failures on the server side.

Suggested fixes:
A. Create a new ThriftClient for each database, or
B. Just get rid of the multi-threaded case here, since listing through tens or even hundreds of databases won't take much time.
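Suggestion A can be sketched as follows. This is a minimal, self-contained illustration: `FakeClient` is a hypothetical stand-in for the non-thread-safe Thrift client, not ReAir's actual class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of suggestion A: every worker thread constructs its own client
// instead of receiving a shared one. FakeClient is a stand-in for the
// non-thread-safe Thrift metastore client.
public class PerThreadClientDemo {
    static class FakeClient { }  // stand-in for an expensive metastore connection

    // Runs workerCount threads and returns how many distinct client
    // instances they used (identity-based set, since FakeClient does
    // not override equals/hashCode).
    public static int run(int workerCount) throws InterruptedException {
        Set<FakeClient> used = Collections.synchronizedSet(new HashSet<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(() -> {
                FakeClient client = new FakeClient(); // one client per thread
                used.add(client);                     // never shared across threads
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) t.join();
        return used.size(); // == workerCount: no instance is shared
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("distinct clients: " + run(16));
    }
}
```

Because each worker owns its connection, no cross-thread interleaving of Thrift request/response traffic can occur.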

nameserver support

This version of ReAir doesn't support two nameservices. HA nameservice configuration should be supported, for example:

<property>
  <name>dfs.nameservices</name>
  <value>srcCluster,destCluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.srcCluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.ha.namenodes.destCluster</name>
  <value>namenode317,namenode319</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.srcCluster.nn1</name>
  <value>a1-ops-hdnamenode01.hz:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.srcCluster.nn2</name>
  <value>a1-ops-hdnamenode02.hz:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.destCluster.namenode317</name>
  <value>a2-prod-sh-namenode-XX-124.sh:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.destCluster.namenode319</name>
  <value>a2-prod-sh-namenode-XX-123.sh:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.srcCluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.destCluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

Copying metastore only

Is it possible to replicate metastore only?
Is it possible to copy metadata between different Hive metastore versions?

Add Kerberos Support -Thrift Based

Does reair support replicating tables to Secured/Kerberized cluster?

I am facing the exception below:

2017-07-07 01:52:06,930 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2017-07-07 01:52:06,951 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: database 0, table some_hive_db.some_table_in_hive got exception
	at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapperWithTextInput.map(MetastoreReplicationJob.java:622)
	at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapperWithTextInput.map(MetastoreReplicationJob.java:593)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: com.airbnb.reair.common.HiveMetastoreException: org.apache.thrift.transport.TTransportException
	at com.airbnb.reair.common.ThriftHiveMetastoreClient.getTable(ThriftHiveMetastoreClient.java:126)
	at com.airbnb.reair.incremental.primitives.TaskEstimator.analyzeTableSpec(TaskEstimator.java:84)
	at com.airbnb.reair.incremental.primitives.TaskEstimator.analyze(TaskEstimator.java:68)
	at com.airbnb.reair.batch.hive.TableCompareWorker.processTable(TableCompareWorker.java:136)
	at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapperWithTextInput.map(MetastoreReplicationJob.java:614)
	... 9 more
Caused by: org.apache.thrift.transport.TTransportException
	at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
	at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
	at org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:340)
	at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:202)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1263)
	at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1249)
	at com.airbnb.reair.common.ThriftHiveMetastoreClient.getTable(ThriftHiveMetastoreClient.java:121)
	... 13 more

I saw code in TableCompareWorker.java#L90 calling getMetastoreClient() of HardCodedCluster.java#L62. I don't see any implementation that checks whether the connection needs to be in secure mode. Based on a security flag in the config, we could switch which type of TTransport to use.

Any thoughts on how to implement it?
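One possible shape for that config-driven switch, as a rough sketch: the `TTransport` interface below is a stand-in defined inline (the real implementation would build Thrift's `TSocket` and wrap it in `TSaslClientTransport`), and `createTransport` and its flag are hypothetical names, not ReAir's API.

```java
// Sketch of choosing the transport type from a security flag. The
// TTransport interface is a stand-in for org.apache.thrift.transport.TTransport;
// in the real code the SASL branch would wrap a TSocket in a
// TSaslClientTransport configured for GSSAPI/Kerberos.
public class TransportFactoryDemo {
    interface TTransport { String kind(); }

    static TTransport plainSocket() {
        return () -> "plain";                      // stand-in for new TSocket(host, port)
    }

    static TTransport saslWrapped(TTransport inner) {
        return () -> "sasl(" + inner.kind() + ")"; // stand-in for TSaslClientTransport
    }

    // Hypothetical factory: pick the transport based on whether the
    // cluster config says Kerberos is enabled.
    static TTransport createTransport(boolean kerberosEnabled) {
        TTransport socket = plainSocket();
        return kerberosEnabled ? saslWrapped(socket) : socket;
    }

    public static void main(String[] args) {
        System.out.println(createTransport(true).kind());  // sasl(plain)
        System.out.println(createTransport(false).kind()); // plain
    }
}
```

The point is simply that the secure and insecure paths can share one construction site, with the branch taken from configuration.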

I was trying to connect to the Thrift metastore with a standalone program based on https://github.com/joshelser/krb-thrift, but no luck. Now I get:

Exception in thread "main" org.apache.thrift.transport.TTransportException: SASL authentication not complete

Standalone Java program snippet:

import java.util.HashMap;
import java.util.Map;
import javax.security.sasl.Sasl;
import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSaslClientTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
// TUGIAssumingTransport is taken from the krb-thrift example

TTransport transport = new TSocket(host, port);

// Use authorization and confidentiality
Map<String, String> saslProperties = new HashMap<String, String>();
saslProperties.put(Sasl.QOP, "auth-conf");
saslProperties.put(Sasl.SERVER_AUTH, "true");
System.out.println("Security is enabled: " + UserGroupInformation.isSecurityEnabled());

// Log in via UGI; ensures we have logged in with our KRB credentials
UserGroupInformation.loginUserFromKeytab("someuser", "/etc/security/keytabs/someuser.headless.keytab");
UserGroupInformation currentUser = UserGroupInformation.getCurrentUser();
System.out.println("Current user: " + currentUser);

// SASL client transport -- does the Kerberos lifting for us
TSaslClientTransport saslTransport = new TSaslClientTransport(
        "GSSAPI",       // tell SASL to use GSSAPI, which supports Kerberos
        null,           // authorizationid - null
        args[0],        // kerberos primary for server, e.g. "myprincipal" in myprincipal/instance@REALM
        args[1],        // kerberos instance for server, e.g. "my.server.com" in primary/my.server.com@REALM
        saslProperties, // properties set above
        null,           // callback handler - null
        transport);     // underlying transport

// Make sure the transport is opened as the user we logged in as
TUGIAssumingTransport ugiTransport = new TUGIAssumingTransport(saslTransport, currentUser);
ThriftHiveMetastore.Client client = new ThriftHiveMetastore.Client(new TBinaryProtocol(ugiTransport));
transport.open();

Any guidance will help me contribute back patch for enabling support for Kerberos.

change in org.apache.thrift version not working properly

Changing the org.apache.thrift version from 0.9.1 to 0.12.0 causes a failure when trying to build:

com.airbnb.reair.incremental.thrift.TReplicationService.AsyncClient.pause_call is not abstract and does not override abstract method getResult() in org.apache.thrift.async.TAsyncMethodCall

Migration of Hive Warehouse -- How ?

I need to understand how ReAir helps on top of a Hadoop/Hive warehouse, given that Hadoop is already a big platform for exploring Big Data.

file checksum comparison with different block sizes may be invalid

If the two clusters use different block sizes, stage 3 will fail: directoryCopier.equalDirs must return false because the checksums differ. This happens whenever a file is larger than one block. As a workaround, I now compare files by their bytes.
Below are the differing checksums:
/test/123.lzo MD5-of-0MD5-of-512CRC32C
/test/123.lzo MD5-of-0MD5-of-256CRC32C

cp: ... No such file or directory

private static void copyFile(Path srcFile, Path destFile) throws IOException {
    String[] copyArgs = { "-cp", srcFile.toString(), destFile.toString() };

    FsShell shell = new FsShell();

1. Should it instead be:

    String[] copyArgs = { "-put", srcFile.toString(), destFile.toString() };

2. May I also ask whether the project is still in development?

java.util.NoSuchElementException

I started the ReAir jar to replicate the source warehouse to the destination warehouse, but hit the following exception and I don't know why:

17/05/23 17:05:33 INFO mapreduce.Job: The url to track the job: http://cdh.master.linesum:8088/proxy/application_1495527145657_0006/
17/05/23 17:05:33 INFO mapreduce.Job: Running job: job_1495527145657_0006
17/05/23 17:05:38 INFO mapreduce.Job: Job job_1495527145657_0006 running in uber mode : false
17/05/23 17:05:38 INFO mapreduce.Job: map 0% reduce 0%
17/05/23 17:05:41 INFO mapreduce.Job: Task Id : attempt_1495527145657_0006_m_000001_0, Status : FAILED
Error: java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(AbstractList.java:364)
	at com.google.common.collect.Iterators.any(Iterators.java:684)
	at com.google.common.collect.Iterables.any(Iterables.java:620)
	at com.airbnb.reair.batch.hive.TableCompareWorker.processTable(TableCompareWorker.java:123)
	at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapper.map(MetastoreReplicationJob.java:570)
	at com.airbnb.reair.batch.hive.MetastoreReplicationJob$Stage1ProcessTableMapper.map(MetastoreReplicationJob.java:556)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

Incremental Replication

  1. Can you give an "Incremental Replication" example?
  2. Have you used "Incremental Replication" in a production environment?

Have you ever tried Spark integration?

Thanks for open sourcing this project. I'm having an issue integrating a Spark app to record audit logs using the hive-hooks module. Just curious, have you ever tried Spark integration? It seems Spark supports Hive hooks in theory, but it doesn't work when initializing the remote HiveMetastore client.

TaskEstimator does not cache MetastoreClient connections

Here are the 4 places:

  • analyzeTableSpec (1, 2)
  • analyzePartitionSpec (1, 2).

As a result, we recreate many connections to the Hive Metastore in a very short period of time, with hundreds of mappers/reducers running concurrently.

We should cache the MetastoreClient connections in a thread-safe way, using ThreadLocal<>.
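The proposed ThreadLocal caching can be sketched as follows. `FakeClient` is a hypothetical stand-in for the metastore client; the real fix would wrap ReAir's client the same way.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of per-thread connection caching with ThreadLocal. Each thread
// lazily creates one client on first use and reuses it for every later
// call, so N threads open N connections regardless of call count.
public class ClientCacheDemo {
    // Counts how many client instances were actually constructed.
    static final AtomicInteger CREATED = new AtomicInteger();

    // Stand-in for an expensive-to-open metastore connection.
    static class FakeClient {
        FakeClient() { CREATED.incrementAndGet(); }
        String getTable(String db, String table) { return db + "." + table; }
    }

    // One client per thread, created lazily on first get().
    static final ThreadLocal<FakeClient> CLIENT =
        ThreadLocal.withInitial(FakeClient::new);

    // Returns the number of clients created by `threads` threads each
    // making `callsPerThread` calls.
    public static int run(int threads, int callsPerThread) throws InterruptedException {
        CREATED.set(0);
        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(() -> {
                for (int c = 0; c < callsPerThread; c++) {
                    CLIENT.get().getTable("db", "t"); // reuses this thread's client
                }
            });
            pool[i].start();
        }
        for (Thread t : pool) t.join();
        return CREATED.get(); // == threads, not threads * callsPerThread
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("clients created: " + run(4, 100));
    }
}
```

Because each ThreadLocal value is confined to its owning thread, this also sidesteps the thread-safety concerns with sharing a single Thrift client.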

New Feature: Metastore Database Prefix

Allow a database prefix to be added to the destination cluster. This can be used for production testing. For example, add "test_" to all database names on the destination cluster.
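The proposed mapping might look like the sketch below; the method name and the empty-prefix behavior are assumptions for illustration, not an existing ReAir API.

```java
// Sketch of the proposed destination-database prefixing. An empty or
// null prefix leaves names unchanged, preserving current behavior.
public class DbPrefixDemo {
    static String toDestDb(String prefix, String srcDb) {
        return (prefix == null || prefix.isEmpty()) ? srcDb : prefix + srcDb;
    }

    public static void main(String[] args) {
        System.out.println(toDestDb("test_", "logs")); // test_logs
        System.out.println(toDestDb(null, "logs"));    // logs
    }
}
```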

Hive warehouse replication and Hive batch metastore replication

Hi,
Warehouse replication from one cluster to another is not succeeding. Could you tell me which parameters need to change and which user should run the job? Also, how do I use the blacklist parameter with databases and tables?

For batch metastore replication, which configuration file should be edited? One variant uses MySQL and the other is without MySQL.
