
Comments (4)

jatorre avatar jatorre commented on June 16, 2024

So, two topics here. It sounds like some systems, such as Presto or Athena (which are the same, right?), rely on catalogs to read the metadata, and those systems will need to support GeoParquet. I think that is a fair assumption and I don't think we should override it; we should try to get those systems to recognise GeoParquet.

Meanwhile, a GeoParquet file read by Presto/Athena will appear to have a binary column if the catalog has not been updated to support GeoParquet. That's fine; it's the same behaviour I would expect from other products.

So I think it is the same as with BigQuery, Redshift, Snowflake, etc. It is just that on systems like Presto/Athena there are two pieces of software that need to be updated, while on the others there is only one, since they combine the catalog with the engine.

Now, the multi-file datasets are a different story. I assume the recommendation is for every part to contain the same metadata, and if there is a global metadata file for all of them, we can also recommend including the GeoParquet metadata there?

from geoparquet.

mojodna avatar mojodna commented on June 16, 2024

For purposes of this issue, I'm only expressing a desire for the GeoParquet-specific metadata (bbox, CRS, geometry type(?)) to be duplicated from the Parquet footers into a file that can be directly addressed and read independently of the files containing data.

"is a global metadata file to all of them we can include also, as a recommendation, the geoparquet metadata?"

Effectively, yeah.
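A rough sketch of what that duplication could look like, in Python. It assumes the per-file `geo` metadata has already been read out of each Parquet footer and decoded from JSON; the function name `merge_geo_metadata`, the file paths, and the placeholder CRS strings are hypothetical, and the bbox union assumes 2D `[xmin, ymin, xmax, ymax]` boxes that do not cross the antimeridian. Nothing here is part of the GeoParquet spec.

```python
import copy
import json


def merge_geo_metadata(per_file):
    """Combine per-file "geo" metadata dicts into one dataset-level dict.

    per_file maps a data file path to the decoded "geo" JSON object taken
    from that file's Parquet footer (hypothetical input shape).
    """
    merged = None
    for path in sorted(per_file):
        geo = per_file[path]
        if merged is None:
            # Start from a deep copy of the first file's metadata.
            merged = copy.deepcopy(geo)
            continue
        for name, col in geo["columns"].items():
            target = merged["columns"].get(name)
            if target is None:
                raise ValueError(f"{path}: unexpected geometry column {name!r}")
            if target.get("crs") != col.get("crs"):
                raise ValueError(f"{path}: CRS mismatch for column {name!r}")
            if "bbox" in target and "bbox" in col:
                a, b = target["bbox"], col["bbox"]
                # Union of two [xmin, ymin, xmax, ymax] boxes.
                target["bbox"] = [
                    min(a[0], b[0]), min(a[1], b[1]),
                    max(a[2], b[2]), max(a[3], b[3]),
                ]
            else:
                # If any part lacks a bbox, the dataset-level bbox is unknown.
                target.pop("bbox", None)
    return merged


# Illustrative input: two parts of one dataset (values are made up).
parts = {
    "part-0.parquet": {
        "version": "0.1.0",
        "primary_column": "geometry",
        "columns": {"geometry": {"encoding": "WKB", "crs": "placeholder-crs",
                                 "bbox": [-10.0, 0.0, 5.0, 15.0]}},
    },
    "part-1.parquet": {
        "version": "0.1.0",
        "primary_column": "geometry",
        "columns": {"geometry": {"encoding": "WKB", "crs": "placeholder-crs",
                                 "bbox": [0.0, -5.0, 20.0, 10.0]}},
    },
}

dataset_geo = merge_geo_metadata(parts)
# This JSON string is what would be written to the separately addressable file.
sidecar_json = json.dumps(dataset_geo, indent=2)
```

The resulting JSON could live next to the data files (or in whatever global metadata file the dataset already has), so a catalog or query could read the CRS and overall bbox without opening any data file.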

To @jatorre's other comments:

Athena is related to Presto, but they're not the same; Athena v2's functions are currently based on Presto 0.217's, Athena includes support for user-defined functions implemented with AWS Lambda, and Athena has diverged with its support for Hudi and Iceberg. GeoParquet files can be registered with Glue Catalog (as vanilla Parquet) and queried using Athena; there's no external "geometry" type, so WKB columns appear as byte arrays and can be converted to internal geometries for use w/ geospatial functions. Neither the catalog nor the engine (Athena) understands GeoParquet-specific metadata, so bbox-related optimizations aren't possible and there's no way to programmatically know the CRS of a GeoParquet source (hence the desire for the metadata to be directly addressable/join-able).

Glue Catalog is responsible for storing and tracking metadata about the objects that make up tables (and surfaces this metadata to a variety of services, software, and customer tools that relate to Hadoop, so not just Athena). Glue Catalog is probably the starting point for AWS to fully support GeoParquet (esp. for engine optimizations), but there are ecosystem-wide considerations around geometry types (optionally including CRS), since the "Hive Metastore" (Catalog) has become a de facto standard, and with it come assumptions about the universe of Hive data types.
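To illustrate the "WKB columns appear as byte arrays" point: an engine that only sees a binary column has to decode the bytes itself before any geometry function can use them. A minimal Python sketch that decodes just a 2D WKB Point with the standard library is below. `decode_wkb_point` is a hypothetical helper, not an Athena or Presto function, and a real engine would handle every geometry type rather than only points.

```python
import struct


def decode_wkb_point(wkb: bytes):
    """Decode a 2D WKB Point into an (x, y) tuple.

    WKB layout for a point: 1 byte for byte order, a 4-byte unsigned
    geometry type (1 = Point), then two 8-byte doubles (x, y).
    """
    if len(wkb) < 21:
        raise ValueError("too short to be a 2D WKB point")
    if wkb[0] == 1:
        endian = "<"  # little-endian (NDR)
    elif wkb[0] == 0:
        endian = ">"  # big-endian (XDR)
    else:
        raise ValueError("invalid WKB byte-order marker")
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError("this sketch only handles geometry type 1 (Point)")
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y


# A little-endian WKB point built by hand, as it might sit in a binary column.
raw = struct.pack("<BIdd", 1, 1, 2.5, -7.25)
point = decode_wkb_point(raw)
```

Note that nothing in these bytes says which CRS the coordinates are in; that information only exists in the GeoParquet metadata, which is why having it addressable outside the footers matters.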

I view Athena and Glue Catalog as separate considerations (and will do what I can to raise them within AWS, though I don't have much visibility into those teams).


jatorre avatar jatorre commented on June 16, 2024

I am going to ask some people from Apache for support here, to provide best practices on how to handle this. It does not sound like a geo-specific thing, but more a question of how you treat advanced custom data types in this situation.


cholmes avatar cholmes commented on June 16, 2024

We should discuss whether to move this to the 'future' milestone, in line with the latest discussions where we're focusing on 'interoperability' in 1.0.0, and after that we'll dig into use cases of using GeoParquet as a direct source / streaming from it.

But I could see addressing the specific request in conjunction with #79

