
iceberg-catalog-migrator's People

Contributors

ajantha-bhat, harshm-dev, renovate[bot], snazy


iceberg-catalog-migrator's Issues

Handle Abort scenarios gracefully

When the user kills the process or another hard failure occurs, the "result files" are not populated and the user is left in the dark.

This needs to be fixed.

Currently, in this case, users have to run listTables against the target catalog to see which tables were newly registered.
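
A minimal sketch of one way to address this, assuming the CLI accumulates results in memory and writes the result files only at the end (the class and file names below are hypothetical, not from the actual code base): record each successful registration as it happens and flush from a JVM shutdown hook, so that a Ctrl-C still leaves a usable result file behind.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical helper, not part of the actual CLI code base.
public class PartialResultWriter {
  private final List<String> registeredTables = new CopyOnWriteArrayList<>();
  private final Path resultFile;

  public PartialResultWriter(Path resultFile) {
    this.resultFile = resultFile;
    // Runs on normal exit and on SIGINT/SIGTERM; a SIGKILL cannot be caught.
    Runtime.getRuntime().addShutdownHook(new Thread(this::flush));
  }

  public void recordSuccess(String tableIdentifier) {
    registeredTables.add(tableIdentifier);
  }

  private void flush() {
    try {
      Files.write(resultFile, registeredTables); // one identifier per line
    } catch (IOException e) {
      e.printStackTrace(); // best effort during shutdown
    }
  }
}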

Fix example: Hadoop to Nessie doesn't support migrate

Hit this issue while trying to follow the example: https://github.com/projectnessie/iceberg-catalog-migrator/blob/main/README.md#bulk-migrating-all-the-tables-from-hadoop-catalog-to-nessie-catalog-main-branch

java -jar iceberg-catalog-migrator-cli-0.2.0.jar migrate \
--source-catalog-type HADOOP \
--source-catalog-properties warehouse=/tmp/iceberg_checks,type=hadoop \
--target-catalog-type NESSIE  \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=/tmp/nessie_warehouse
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.codehaus.groovy.reflection.CachedClass$3$1 (file:/Users/harshmaheshwari/Downloads/iceberg-catalog-migrator-cli-0.2.0.jar) to method java.lang.Object.finalize()
WARNING: Please consider reporting this to the maintainers of org.codehaus.groovy.reflection.CachedClass$3$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

ERROR - Source catalog is a Hadoop catalog and it doesn't support deleting the table entries just from the catalog. Please use 'register' command instead.
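
Following the error's own suggestion, the same invocation should work with the register command in place of migrate; register copies the table entries into the target catalog without attempting to delete them from the Hadoop source:

java -jar iceberg-catalog-migrator-cli-0.2.0.jar register \
--source-catalog-type HADOOP \
--source-catalog-properties warehouse=/tmp/iceberg_checks,type=hadoop \
--target-catalog-type NESSIE \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=/tmp/nessie_warehouse

The README example should likewise either use register or note that migrate cannot work for Hadoop source catalogs.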

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Ignored or Blocked

These are blocked by an existing closed PR and will not be recreated unless you click a checkbox below.

Detected dependencies

github-actions
.github/workflows/main.yml
  • actions/checkout v4@b4ffde65f46336ab88eb53be808477a3936bae11
  • actions/setup-java v4
  • gradle/gradle-build-action v2
  • actions/upload-artifact v4
  • ubuntu 22.04
.github/workflows/release-create.yml
  • actions/setup-java v4
  • actions/checkout v4.1.1@b4ffde65f46336ab88eb53be808477a3936bae11
  • gradle/gradle-build-action v2
  • gradle/gradle-build-action v2
  • ubuntu 22.04
.github/workflows/release-publish.yml
  • actions/setup-java v4
  • actions/checkout v4.1.1@b4ffde65f46336ab88eb53be808477a3936bae11
  • actions/checkout v4.1.1@b4ffde65f46336ab88eb53be808477a3936bae11
  • ubuntu 22.04
gradle
buildSrc/src/main/kotlin/Checkstyle.kt
buildSrc/src/main/kotlin/CodeCoverage.kt
buildSrc/src/main/kotlin/Errorprone.kt
buildSrc/src/main/kotlin/Ide.kt
buildSrc/src/main/kotlin/Jandex.kt
buildSrc/src/main/kotlin/Java.kt
buildSrc/src/main/kotlin/PublishingHelperPlugin.kt
buildSrc/src/main/kotlin/ReleaseSupportPlugin.kt
buildSrc/src/main/kotlin/Spotless.kt
buildSrc/src/main/kotlin/Testing.kt
buildSrc/src/main/kotlin/Utilities.kt
buildSrc/src/main/kotlin/VersionTuple.kt
gradle.properties
settings.gradle.kts
build.gradle.kts
api/build.gradle.kts
  • junit:junit 4.13.2
api-test/build.gradle.kts
buildSrc/settings.gradle.kts
buildSrc/build.gradle.kts
buildSrc/src/main/kotlin/build-conventions.gradle.kts
cli/build.gradle.kts
  • junit:junit 4.13.2
gradle/baselibs.versions.toml
  • net.ltgt.gradle:gradle-errorprone-plugin 3.1.0
  • gradle.plugin.org.jetbrains.gradle.plugin.idea-ext:gradle-idea-ext 1.1.8
  • com.github.vlsi.gradle:jandex-plugin 1.90
  • com.github.johnrengelman:shadow 8.1.1
  • com.diffplug.spotless:spotless-plugin-gradle 6.25.0
gradle/libs.versions.toml
  • org.assertj:assertj-core 3.25.3
  • software.amazon.awssdk:apache-client 2.20.18
  • software.amazon.awssdk:auth 2.20.18
  • software.amazon.awssdk:dynamodb 2.20.18
  • software.amazon.awssdk:glue 2.20.18
  • software.amazon.awssdk:kms 2.20.18
  • software.amazon.awssdk:lakeformation 2.20.18
  • software.amazon.awssdk:sts 2.20.18
  • software.amazon.awssdk:s3 2.20.18
  • software.amazon.awssdk:url-connection-client 2.20.18
  • com.puppycrawl.tools:checkstyle 10.12.5
  • com.google.errorprone:error_prone_annotations 2.26.1
  • com.google.errorprone:error_prone_core 2.26.1
  • jp.skypencil.errorprone.slf4j:errorprone-slf4j 0.1.22
  • com.google.code.findbugs:annotations 3.0.1
  • com.google.code.findbugs:jsr305 3.0.2
  • com.google.googlejavaformat:google-java-format 1.21.0
  • com.google.guava:guava 32.1.3-jre
  • org.apache.hadoop:hadoop-aws 2.7.3
  • org.apache.hadoop:hadoop-common 2.7.3
  • org.apache.iceberg:iceberg-dell 1.3.1
  • org.apache.iceberg:iceberg-spark-runtime-3.3_2.12 1.3.1
  • org.immutables:builder 2.10.1
  • org.immutables:value-annotations 2.10.1
  • org.immutables:value-processor 2.10.1
  • org.jacoco:org.jacoco.ant 0.8.11
  • org.jacoco:org.jacoco.report 0.8.11
  • org.jacoco:jacoco-maven-plugin 0.8.11
  • org.jboss:jandex 3.1.7
  • org.junit:junit-bom 5.10.2
  • org.junit.jupiter:junit-jupiter-api 5.10.2
  • ch.qos.logback:logback-classic 1.5.3
  • io.github.hakky54:logcaptor 2.9.2
  • info.picocli:picocli 4.7.5
  • org.slf4j:log4j-over-slf4j 1.7.36
  • net.ltgt.errorprone 3.1.0
  • org.projectnessie 0.30.7
  • io.github.gradle-nexus.publish-plugin 1.3.0
  • com.github.johnrengelman.shadow 8.1.1
gradle-wrapper
gradle/wrapper/gradle-wrapper.properties
  • gradle 8.6

  • Check this box to trigger a request for Renovate to run again on this repository

When I register tables from HADOOP to NESSIE, I get a com.amazonaws.AmazonClientException:

My Hadoop warehouse is s3a://XXXXXX, and I add

--source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT

It then fails with:

com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
        at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:960)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
        at org.apache.iceberg.hadoop.HadoopCatalog.isDirectory(HadoopCatalog.java:175)
        at org.apache.iceberg.hadoop.HadoopCatalog.isNamespace(HadoopCatalog.java:376)
        at org.apache.iceberg.hadoop.HadoopCatalog.listNamespaces(HadoopCatalog.java:306)
        at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.getAllNamespacesFromSourceCatalog(CatalogMigrator.java:202)
        at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.getMatchingTableIdentifiers(CatalogMigrator.java:97)
        at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:136)
        at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:38)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
        at picocli.CommandLine.execute(CommandLine.java:2170)
        at org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI.main(CatalogMigrationCLI.java:48)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
        at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
        at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
        at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
        at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
        ... 23 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field: 
        true
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:610)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1718)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2883)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
        ... 29 more
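
The root failure, "Invalid value for IsTruncated field: true" (note the whitespace in the value), is the old AWS SDK pulled in via hadoop-aws 2.7.3 (see the dependency list above) failing to parse the list-objects XML returned by an S3-compatible endpoint. As an assumption rather than a verified fix, it may help to put a newer hadoop-aws/AWS SDK on the classpath, or to pass additional S3A options through the same flag, e.g. path-style access (only honored by Hadoop 2.8+):

--source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT,fs.s3a.path.style.access=true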

Migrating from Hive to Nessie fails with java.io.IOException: No FileSystem for scheme: hdfs

My environment:
  • Hadoop 3.3.1
  • Hive 3.1.0
  • Iceberg 1.2.1
  • Spark 3.2.1
  • Nessie server 0.59.0

export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf

java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/client-truststore.jks \
     -Djavax.net.ssl.trustStorePassword=password \
-jar iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar register \
--source-catalog-type HIVE \
--source-catalog-properties warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/,uri=thrift://hive-metastore:9083 \
--identifiers hive_data.t_ers_event_perf,hive_data.T_KWH_MATCH_RECORD_PERF \
--target-catalog-type NESSIE \
--target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive
$ cat logs/catalog_migration.log 
2023-10-07 00:30:19,852 [main] INFO  o.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/usr/lib/spark/conf/hive-site.xml
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.tez.cartesian-product.enabled does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.warehouse.external.dir does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.heapsize does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.materializedview.rewriting.incremental does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.webui.cors.allowed.headers does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.hook.proto.base-directory does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.load.data.owner does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.max-partitions-per-writers does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.service.metrics.codahale.reporter.classes does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.strict.managed.tables does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.ignore-absent-partitions does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.create.as.insert.only does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.webui.enable.cors does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.db.type does not exist
2023-10-07 00:30:20,422 [main] WARN  o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-10-07 00:30:20,440 [main] INFO  hive.metastore - Trying to connect to metastore with URI thrift://hive-metastore:9083
2023-10-07 00:30:20,498 [main] INFO  hive.metastore - Opened an SSL connection to metastore, current connections: 1
2023-10-07 00:30:20,851 [main] INFO  hive.metastore - Connected to metastore.
2023-10-07 00:30:21,116 [main] INFO  o.a.i.BaseMetastoreTableOperations - Refreshing table metadata from new version: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
2023-10-07 00:30:21,160 [main] WARN  org.apache.iceberg.util.Tasks - Retrying task after failure: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
	at org.apache.iceberg.hadoop.Util.getFs(Util.java:58)
	at org.apache.iceberg.hadoop.HadoopInputFile.fromLocation(HadoopInputFile.java:56)
	at org.apache.iceberg.hadoop.HadoopFileIO.newInputFile(HadoopFileIO.java:90)
	at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)
	at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:189)
	at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:208)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
	at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:208)
	at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:185)
	at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:180)
	at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:176)
	at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
	at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
	at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
	at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.registerTableToTargetCatalog(CatalogMigrator.java:212)
	at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.registerTable(CatalogMigrator.java:147)
	at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:159)
	at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:38)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
	at picocli.CommandLine.execute(CommandLine.java:2170)
	at org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI.main(CatalogMigrationCLI.java:48)
Caused by: java.io.IOException: No FileSystem for scheme: hdfs
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)
	... 29 common frames omitted
2023-10-07 00:30:21,272 [main] WARN  org.apache.iceberg.util.Tasks - Retrying task after failure: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespa..........

The same error then repeats over and over as the task is retried.
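
"No FileSystem for scheme: hdfs" means no HDFS FileSystem implementation is on the classpath; the shaded CLI jar apparently does not bundle hadoop-hdfs. A sketch of a possible workaround (an assumption, not a documented invocation): start the CLI via its main class, which is visible in the stack trace above, with the cluster's Hadoop client jars appended to the classpath via `hadoop classpath`. Version conflicts between the Hadoop 2.7.3 bundled in the jar and the cluster's 3.3.1 remain possible.

java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/client-truststore.jks \
     -Djavax.net.ssl.trustStorePassword=password \
     -cp "iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar:$(hadoop classpath)" \
     org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI register \
     --source-catalog-type HIVE \
     --source-catalog-properties warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/,uri=thrift://hive-metastore:9083 \
     --identifiers hive_data.t_ers_event_perf,hive_data.T_KWH_MATCH_RECORD_PERF \
     --target-catalog-type NESSIE \
     --target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive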
