projectnessie / iceberg-catalog-migrator
CLI tool to bulk migrate the tables from one catalog to another without a data copy.
License: Apache License 2.0
Error: An unexpected error occurred while trying to open file iceberg-catalog-migrator
The same thing can also be observed in the released version:
https://github.com/projectnessie/iceberg-catalog-migrator/releases/tag/catalog-migrator-0.1.0
When the user kills the process or another hard failure occurs, the "result files" are not populated and the user is left in the dark.
Need to fix this.
Currently, in this case, users have to list the tables in the target catalog to see which tables were newly registered.
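With a Nessie target like the one in the README example, one way to do that check today is to list the entries on the branch through the Nessie REST API. A rough sketch, assuming the default local endpoint used elsewhere on this page and the v1 API:

curl -s http://localhost:19120/api/v1/trees/tree/main/entries

The output can then be filtered (with jq or similar) for entries of type ICEBERG_TABLE to see what the migration actually registered.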
I hit this issue while trying to follow the example: https://github.com/projectnessie/iceberg-catalog-migrator/blob/main/README.md#bulk-migrating-all-the-tables-from-hadoop-catalog-to-nessie-catalog-main-branch
java -jar iceberg-catalog-migrator-cli-0.2.0.jar migrate \
--source-catalog-type HADOOP \
--source-catalog-properties warehouse=/tmp/iceberg_checks,type=hadoop \
--target-catalog-type NESSIE \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=/tmp/nessie_warehouse
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass$3$1 (file:/Users/harshmaheshwari/Downloads/iceberg-catalog-migrator-cli-0.2.0.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass$3$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
ERROR - Source catalog is a Hadoop catalog and it doesn't support deleting the table entries just from the catalog. Please use 'register' command instead.
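As the error message says, a Hadoop source catalog cannot have its table entries deleted from the catalog alone, so the same bulk copy has to be run with the register command instead of migrate. A minimal sketch reusing the exact options from the failing command above (the warehouse paths are from my local setup, not canonical values):

java -jar iceberg-catalog-migrator-cli-0.2.0.jar register \
--source-catalog-type HADOOP \
--source-catalog-properties warehouse=/tmp/iceberg_checks,type=hadoop \
--target-catalog-type NESSIE \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=/tmp/nessie_warehouse

The only change is migrate -> register; register leaves the source catalog untouched, which is why it is allowed for Hadoop catalogs.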
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
- software.amazon.awssdk: url-connection-client, s3, sts, lakeformation, kms, glue, dynamodb, auth, apache-client
- org.apache.iceberg: iceberg-spark-runtime-3.3_2.12, iceberg-dell

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.
- org.apache.hadoop: hadoop-common, hadoop-aws
- org.apache.hadoop: hadoop-common, hadoop-aws

Detected dependencies:
.github/workflows/main.yml
actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11
actions/setup-java v4
gradle/gradle-build-action v2
actions/upload-artifact v4
ubuntu 22.04
.github/workflows/release-create.yml
actions/setup-java v4
actions/checkout v4.1.1@b4ffde65f46336ab88eb53be808477a3936bae11
gradle/gradle-build-action v2
gradle/gradle-build-action v2
ubuntu 22.04
.github/workflows/release-publish.yml
actions/setup-java v4
actions/checkout v4.1.1@b4ffde65f46336ab88eb53be808477a3936bae11
actions/checkout v4.1.1@b4ffde65f46336ab88eb53be808477a3936bae11
ubuntu 22.04
buildSrc/src/main/kotlin/Checkstyle.kt
buildSrc/src/main/kotlin/CodeCoverage.kt
buildSrc/src/main/kotlin/Errorprone.kt
buildSrc/src/main/kotlin/Ide.kt
buildSrc/src/main/kotlin/Jandex.kt
buildSrc/src/main/kotlin/Java.kt
buildSrc/src/main/kotlin/PublishingHelperPlugin.kt
buildSrc/src/main/kotlin/ReleaseSupportPlugin.kt
buildSrc/src/main/kotlin/Spotless.kt
buildSrc/src/main/kotlin/Testing.kt
buildSrc/src/main/kotlin/Utilities.kt
buildSrc/src/main/kotlin/VersionTuple.kt
gradle.properties
settings.gradle.kts
build.gradle.kts
api/build.gradle.kts
junit:junit 4.13.2
api-test/build.gradle.kts
buildSrc/settings.gradle.kts
buildSrc/build.gradle.kts
buildSrc/src/main/kotlin/build-conventions.gradle.kts
cli/build.gradle.kts
junit:junit 4.13.2
gradle/baselibs.versions.toml
net.ltgt.gradle:gradle-errorprone-plugin 3.1.0
gradle.plugin.org.jetbrains.gradle.plugin.idea-ext:gradle-idea-ext 1.1.8
com.github.vlsi.gradle:jandex-plugin 1.90
com.github.johnrengelman:shadow 8.1.1
com.diffplug.spotless:spotless-plugin-gradle 6.25.0
gradle/libs.versions.toml
org.assertj:assertj-core 3.25.3
software.amazon.awssdk:apache-client 2.20.18
software.amazon.awssdk:auth 2.20.18
software.amazon.awssdk:dynamodb 2.20.18
software.amazon.awssdk:glue 2.20.18
software.amazon.awssdk:kms 2.20.18
software.amazon.awssdk:lakeformation 2.20.18
software.amazon.awssdk:sts 2.20.18
software.amazon.awssdk:s3 2.20.18
software.amazon.awssdk:url-connection-client 2.20.18
com.puppycrawl.tools:checkstyle 10.12.5
com.google.errorprone:error_prone_annotations 2.26.1
com.google.errorprone:error_prone_core 2.26.1
jp.skypencil.errorprone.slf4j:errorprone-slf4j 0.1.22
com.google.code.findbugs:annotations 3.0.1
com.google.code.findbugs:jsr305 3.0.2
com.google.googlejavaformat:google-java-format 1.21.0
com.google.guava:guava 32.1.3-jre
org.apache.hadoop:hadoop-aws 2.7.3
org.apache.hadoop:hadoop-common 2.7.3
org.apache.iceberg:iceberg-dell 1.3.1
org.apache.iceberg:iceberg-spark-runtime-3.3_2.12 1.3.1
org.immutables:builder 2.10.1
org.immutables:value-annotations 2.10.1
org.immutables:value-processor 2.10.1
org.jacoco:org.jacoco.ant 0.8.11
org.jacoco:org.jacoco.report 0.8.11
org.jacoco:jacoco-maven-plugin 0.8.11
org.jboss:jandex 3.1.7
org.junit:junit-bom 5.10.2
org.junit.jupiter:junit-jupiter-api 5.10.2
ch.qos.logback:logback-classic 1.5.3
io.github.hakky54:logcaptor 2.9.2
info.picocli:picocli 4.7.5
org.slf4j:log4j-over-slf4j 1.7.36
net.ltgt.errorprone 3.1.0
org.projectnessie 0.30.7
io.github.gradle-nexus.publish-plugin 1.3.0
com.github.johnrengelman.shadow 8.1.1
gradle/wrapper/gradle-wrapper.properties
gradle 8.6
My Hadoop warehouse is S3a://XXXXXX, and I added
--source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT
Then it fails with:
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:960)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.iceberg.hadoop.HadoopCatalog.isDirectory(HadoopCatalog.java:175)
at org.apache.iceberg.hadoop.HadoopCatalog.isNamespace(HadoopCatalog.java:376)
at org.apache.iceberg.hadoop.HadoopCatalog.listNamespaces(HadoopCatalog.java:306)
at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.getAllNamespacesFromSourceCatalog(CatalogMigrator.java:202)
at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.getMatchingTableIdentifiers(CatalogMigrator.java:97)
at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:136)
at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:38)
at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
at picocli.CommandLine.execute(CommandLine.java:2170)
at org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI.main(CatalogMigrationCLI.java:48)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 23 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field:
true
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:610)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1718)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2883)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 29 more
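For reference, the --source-catalog-hadoop-conf flag goes on the same register/migrate invocation as the catalog options; a sketch of the full command, where the warehouse locations are placeholders and everything except the flags shown above is an assumption about this setup:

java -jar iceberg-catalog-migrator-cli-0.2.0.jar register \
--source-catalog-type HADOOP \
--source-catalog-properties warehouse=s3a://XXXXXX,type=hadoop \
--source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT \
--target-catalog-type NESSIE \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=/tmp/nessie_warehouse

The ListBucketHandler failure itself is raised by the old com.amazonaws SDK that comes with the bundled hadoop-aws 2.7.3 (see the dependency list above), so an S3-compatible endpoint whose listing response that SDK cannot parse is one plausible cause, though I have not verified it.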
https://github.com/projectnessie/iceberg-catalog-migrator/releases/tag/catalog-migrator-0.2.0
Looks like it is comparing with the same version
My environment: Hadoop 3.3.1, Hive 3.1.0, Iceberg 1.2.1, Spark 3.2.1, and a Nessie 0.59.0 server.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/client-truststore.jks \
-Djavax.net.ssl.trustStorePassword=password \
-jar iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar register \
--source-catalog-type HIVE \
--source-catalog-properties warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/,uri=thrift://hive-metastore:9083 \
--identifiers hive_data.t_ers_event_perf,hive_data.T_KWH_MATCH_RECORD_PERF \
--target-catalog-type NESSIE \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive
$ cat logs/catalog_migration.log
2023-10-07 00:30:19,852 [main] INFO o.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/usr/lib/spark/conf/hive-site.xml
2023-10-07 00:30:20,186 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.tez.cartesian-product.enabled does not exist
2023-10-07 00:30:20,186 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.warehouse.external.dir does not exist
2023-10-07 00:30:20,186 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.heapsize does not exist
2023-10-07 00:30:20,186 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.materializedview.rewriting.incremental does not exist
2023-10-07 00:30:20,186 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.webui.cors.allowed.headers does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.hook.proto.base-directory does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.load.data.owner does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.max-partitions-per-writers does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.service.metrics.codahale.reporter.classes does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.strict.managed.tables does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.ignore-absent-partitions does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.create.as.insert.only does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.webui.enable.cors does not exist
2023-10-07 00:30:20,187 [main] WARN o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.db.type does not exist
2023-10-07 00:30:20,422 [main] WARN o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-10-07 00:30:20,440 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://hive-metastore:9083
2023-10-07 00:30:20,498 [main] INFO hive.metastore - Opened an SSL connection to metastore, current connections: 1
2023-10-07 00:30:20,851 [main] INFO hive.metastore - Connected to metastore.
2023-10-07 00:30:21,116 [main] INFO o.a.i.BaseMetastoreTableOperations - Refreshing table metadata from new version: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
2023-10-07 00:30:21,160 [main] WARN org.apache.iceberg.util.Tasks - Retrying task after failure: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
at org.apache.iceberg.hadoop.Util.getFs(Util.java:58)
at org.apache.iceberg.hadoop.HadoopInputFile.fromLocation(HadoopInputFile.java:56)
at org.apache.iceberg.hadoop.HadoopFileIO.newInputFile(HadoopFileIO.java:90)
at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)
at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:189)
at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:208)
at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:208)
at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:185)
at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:180)
at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:176)
at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.registerTableToTargetCatalog(CatalogMigrator.java:212)
at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.registerTable(CatalogMigrator.java:147)
at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:159)
at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:38)
at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
at picocli.CommandLine.execute(CommandLine.java:2170)
at org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI.main(CatalogMigrationCLI.java:48)
Caused by: java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)
... 29 common frames omitted
2023-10-07 00:30:21,272 [main] WARN org.apache.iceberg.util.Tasks - Retrying task after failure: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespa..........
The same error repeats over and over.
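"No FileSystem for scheme: hdfs" typically means no HDFS FileSystem implementation is on the classpath; the dependency list above shows the CLI bundling hadoop-common and hadoop-aws but no hadoop-hdfs. One workaround worth trying, as an untested sketch, is to launch the CLI main class (seen in the stack trace) with the cluster's own Hadoop jars appended to the classpath instead of using -jar:

java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/client-truststore.jks \
-Djavax.net.ssl.trustStorePassword=password \
-cp "iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar:$(hadoop classpath)" \
org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI register \
... (same register options as above)

Whether the classes shaded into the CLI jar conflict with the cluster jars is not something I have verified; treat the classpath ordering here as a guess.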