Comments (4)
So two topics here. It sounds like some systems, like Presto or Athena (which are the same, right?), rely on catalogs to read the metadata, and those systems will need to support GeoParquet. I think that is a fair assumption and I don't think we should override it; we should try to get those systems to recognise GeoParquet.
Meanwhile, a GeoParquet file read by Presto/Athena will look like it has a binary column if the catalog has not been updated to support GeoParquet. That's fine; that's the same behaviour I would expect on other products.
So I think it is the same as with BigQuery, Redshift, Snowflake, etc. It's just that with Presto/Athena there are 2 pieces of software that need to be updated, and with the others only 1, since they combine the catalog with the engine.
Now, a different story is with the multi-file datasets... I assume the recommendation is for every part to contain the same metadata, and if there is a global metadata file to all of them we can include also, as a recommendation, the geoparquet metadata?
from geoparquet.
For purposes of this issue, I'm only expressing a desire for the GeoParquet-specific metadata (bbox, CRS, geometry type(?)) to be duplicated from the Parquet footers into a file that can be directly addressed and read independently of the files containing data.
"is a global metadata file to all of them we can include also, as a recommendation, the geoparquet metadata?"
Effectively, yeah.
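As a rough illustration of the request above, here is a minimal sketch of building a dataset-level summary from per-file GeoParquet metadata so it can be written to a separately addressable file. It assumes the `geo` metadata of each part has already been read from its Parquet footer and parsed into a dict (not shown here). The function name `merge_geo_metadata`, the file names, the `files` key, and the use of `geometry_types` and a plain 2D `bbox` per column are illustrative assumptions, not something the spec or this thread prescribes. Antimeridian-crossing bounding boxes are not handled.

```python
import json


def merge_geo_metadata(per_file_geo):
    """Combine per-file 'geo' metadata dicts into one dataset-level summary.

    per_file_geo maps a file name to that file's parsed 'geo' metadata.
    Assumes every column carries a 2D bbox [xmin, ymin, xmax, ymax].
    """
    merged = {}
    for path, geo in per_file_geo.items():
        for name, col in geo["columns"].items():
            if name not in merged:
                # First time we see this column: copy its values as the baseline.
                merged[name] = {
                    "crs": col.get("crs"),
                    "geometry_types": sorted(col.get("geometry_types", [])),
                    "bbox": list(col["bbox"]),
                }
                continue
            entry = merged[name]
            # All parts of one dataset are expected to share a CRS per column.
            if col.get("crs") != entry["crs"]:
                raise ValueError(f"CRS mismatch for column {name!r} in {path}")
            entry["geometry_types"] = sorted(
                set(entry["geometry_types"]) | set(col.get("geometry_types", []))
            )
            xmin, ymin, xmax, ymax = col["bbox"]
            box = entry["bbox"]
            # Union of the two boxes.
            entry["bbox"] = [
                min(box[0], xmin),
                min(box[1], ymin),
                max(box[2], xmax),
                max(box[3], ymax),
            ]
    return {"columns": merged, "files": sorted(per_file_geo)}


# Hypothetical parts; a real CRS would be a PROJJSON object, a string stands in here.
parts = {
    "part-0.parquet": {"columns": {"geometry": {
        "crs": "EPSG:4326", "geometry_types": ["Point"], "bbox": [0, 0, 10, 10]}}},
    "part-1.parquet": {"columns": {"geometry": {
        "crs": "EPSG:4326", "geometry_types": ["Polygon"], "bbox": [-5, 2, 3, 20]}}},
}
summary = merge_geo_metadata(parts)
print(json.dumps(summary, indent=2))
```

The resulting JSON could be stored next to the data files, which would let a catalog or query engine read the CRS and overall extent without opening any footer.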
To @jatorre's other comments:
Athena is related to Presto, but they're not the same; Athena v2's functions are currently based on Presto 0.217's, Athena includes support for user-defined functions implemented with AWS Lambda, and Athena has diverged with its support for Hudi and Iceberg. GeoParquet files can be registered with Glue Catalog (as vanilla Parquet) and queried using Athena; there's no external "geometry" type, so WKB columns appear as byte arrays and can be converted to internal geometries for use w/ geospatial functions. Neither the catalog nor the engine (Athena) understands GeoParquet-specific metadata, so bbox-related optimizations aren't possible and there's no way to programmatically know the CRS of a GeoParquet source (hence the desire for the metadata to be directly addressable/join-able).
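To make the "WKB columns appear as byte arrays" point concrete, here is a small sketch of what a geometry value looks like to an engine that only sees vanilla Parquet, and what converting it involves in the simplest case. It decodes a 2D WKB Point using only the standard library; `decode_wkb_point` is a hypothetical helper for illustration, not Athena's or any engine's actual conversion routine, and it handles no other geometry type.

```python
import struct


def decode_wkb_point(wkb: bytes):
    """Decode a 2D WKB Point into an (x, y) tuple."""
    byte_order = wkb[0]
    if byte_order not in (0, 1):
        raise ValueError("invalid WKB byte-order flag")
    # WKB byte-order flag: 1 means little-endian, 0 means big-endian.
    endian = "<" if byte_order == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:  # 1 is the WKB type code for Point
        raise ValueError("this sketch only handles 2D points")
    x, y = struct.unpack_from(endian + "dd", wkb, 5)
    return x, y


# What the column value looks like without geometry support: opaque bytes.
raw = struct.pack("<BIdd", 1, 1, -122.4, 37.8)
print(raw.hex())
print(decode_wkb_point(raw))
```

Note that nothing in the bytes says which CRS the coordinates are in; that information lives only in the GeoParquet metadata, which is why reading it independently matters.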
Glue Catalog is responsible for storing and tracking metadata about the objects that make up tables (and surfaces this metadata to a variety of services, software, and customer tools that relate to Hadoop, so not just Athena). Glue Catalog is probably the starting point for AWS to fully support GeoParquet (esp. for engine optimizations), but there are ecosystem-wide considerations around geometry types (optionally including CRS), since the "Hive Metastore" (Catalog) has become a de facto standard, and with it comes assumptions about the universe of Hive data types.
I view Athena and Glue Catalog as separate considerations (and will do what I can to raise them within AWS, though I don't have much visibility into those teams).
from geoparquet.
I am going to ask some people from Apache to provide best practices on how to handle this. It does not sound like a geo-specific thing, but more a question of how you treat advanced custom data types in this situation.
from geoparquet.
We should discuss whether to move this to the 'future' milestone, in line with the latest discussions where we're focusing on 'interoperability' in 1.0.0, and after that we'll dig into use cases of using GeoParquet as a direct source / streaming from it.
But I could see addressing the specific request in conjunction with #79
from geoparquet.