Comments (11)
I think we want to have a single defined column acting as "the geometry". Your suggestion is definitely a lot cleaner.
I could see some scenarios where you might want multiple geometries in a row with the ability to easily switch. For example you could store different simplification levels in different columns and switch which to use under different conditions.
I'll try this out.
from geopandas.
@jtratner I like this suggestion. It seems to me the "pandas way" to handle this, using a property that has special behavior rather than a specially named column. Of course you can always store geometries in any column, but there's no easy way to know which ones (they're just dtype
object). It seems right that there should only be one "geometry" of a given row at a time, even if other "geometries" are actually stored in the same row.
Another advantage of making the geometry a property is that normal reduction methods won't have to worry about treating the geometry column differently.
@jwass Your PR looks like a good first step. It gives us part of that API.
from geopandas.
Okay. @jwass - I was (implicitly I guess) offering to do it, but if you'd like to, go ahead :)
from geopandas.
Sorry Jeff, didn't mean to step on your toes there. I think I just read it as a suggestion based on your experience with Pandas. Obviously feel free to rip apart my commit with what you had in mind.
from geopandas.
I suppose the next steps for this would be to store the geometry as a member and remove any special treatment of a geometry
column. Maybe the GeoDataFrame's constructor would look for a column named geometry
by default?
from geopandas.
@jwass Already merged #47, as it fit with our current API. Now that I think about it, though, pandas columns are already accessible as properties, so the property there was redundant. The set_geometry
method makes sense, though.
@jtratner I like the idea of geometry as a real property, feel free to try it out and send a PR.
from geopandas.
This discussion is an exciting one. I actually work with multiple geometry columns in databases all the time (origin and destination locations), so the ability to switch between them is ideal. The more I follow this discussion, the more I like the idea of set_geometry
acting like default pandas set_index
. In fact, the whole concept of geometries acting as the 'index' of a data frame is a nice one, and avoids lots of issues (as @kjordahl suggested re: reduction methods). This also sounds a lot like the 'pandas way' of doing things... turn a column into an index, which has its own internal 'index' to speed up retrieval etc. For now, we can avoid this internal index at the expense of some speed, but when we do eventually get around to working that out (via RTree?), the API shouldn't change much/at all. Cool stuff!
from geopandas.
Happy that you put something through already - just wanted to make it clear
I'd do it if not. :)
I don't think you need to change the idea of treating the geo column
specially by default, I think we should change places that do
self['geometry']
to be self.geometry
or
self[[self._geo_column_name]]
, etc. For example, if I do this:
gf = GeoDataFrame(zip([1, 2, 3], [4, 5, 6], [geo1, geo2, geo3], [geo1,
geo2, geo3]), columns = ["A", "B", "geometry", "geometry"])
Looking up on geometry now produces a (Geo)DataFrame (oh no!). So you
already have a potential breakage.
I suggest the following:
- Have
'geometry'
be the default for now. - Override the column property to validate that there aren't multiple
names for whatever the geo column is on setting. - Store the geo column name as metadata on the frame (so it'll get carried
forward).
1-3 will help avoid issues with dup cols for geometry and avoid
encountering issues with reordering.
Side note - how will GeoPandas handle hierarchical columns? Unless you use
integer column locations (which create their own issues with reordering),
it's not totally clear to me. Maybe warn for now or something?
from geopandas.
After re-reading the discussion here, I'm not sure everyone is on the same page about how to approach this. I'll try to summarize my interpretation. If I misrepresent your position, let me know.
@jtratner is recommending that GeoDataFrame have a property indicating the column to return when accessing the geometry
property. We can have several columns containing geometries and calls to set_geometry
simply switch the name of the column to use. However the geometry column(s) would still be ordinary columns accessible in the usual way. For example:
gdf = GeoDataFrame(...) # Uses 'geometry' column by default
shape = gdf.geometry.iloc[0]
shape = gdf['geometry'].iloc[0] # Returns the same as above because 'geometry' is just a column
gdf['other_geometry'] = GeoSeries(...) # (or list of geometries, etc.)
a = gdf.area
a = gdf.geometry.area
a = gdf['geometry'].area # All three are the same
gdf.set_geometry('other_geometry', inplace=True)
a2 = gdf.area
a2 = gdf.geometry.area
a2 = gdf['other_geometry'].area # All three are the same
@kjordahl is recommending that the geometry would not really be a column in the DataFrame but would be a separate property that we always do the right thing with. It would serve the similar function above, but you couldn't access it using the normal DataFrame operations. In this approach:
gdf = GeoDataFrame(...)
shape = gdf.geometry.iloc[0] # The only way to get the geometry
shape = gdf['geometry'].iloc[0] # Raises KeyError because there's no column named geometry
I think the major problem with this approach is that we'd need to override all the row indexing operations to grab out the proper rows in our geometry property...? gdf[:5]
would have to separately grab those elements in our geometry property. If we keep the geometry as a column, the usual indexing takes care of that for us.
I tried implementing the basic idea of the first suggestion: (https://github.com/jwass/geopandas/commits/set_geometry). It too raises a bunch of questions.
- If I get a column containing geometries but is not the defined geometry, would it return a GeoSeries? (i.e. what does
gdf['other_geometry']
return whenother_geometry
is not the geometry column?). I think it would be powerful to return a GeoSeries. Would the GeoBlock mentioned in Issue #51 help with this, so we don't have to check all the elements whenever a column is retrieved? - What type is returned here:
gdf2 = gdf[['prop1', 'prop2', 'other_geometry']]
whenother_geometry
is not the geometry? - Would we allow different columns to have different crs?
I think both have their pros and cons but want to make sure we're talking about the same thing moving forward. I'm leaning toward the first approach where they the geometry is an ordinary column...
from geopandas.
I'm on board with @kjordahl suggestion (and it's really just a further
extension of making geometry a property). But the downside is that you have
to reapply all the slicing whenever you do things to the other columns...
from geopandas.
Some background on pandas internals that have bearing on this: internal
representation of DataFrame (and Series) is a block manager which holds
individual dtypes in separate blocks. (not sure if object dtypes are put in
the same block). Series uses a SingleBlockManager that does something
similar (only in 0.13 and above). When you ask for a column, part/all of
the block is converted to a Series, and the Series is cached on the parent
(so if you query twice you get the same thing) and Series updates or
invalidates its cache as necessary.
If you always want to return Geo Series from geocolumns, would need to
intercept all the indexers (loc, iloc, ix, xs, etc) OR you could monkey
patch DataFrames' constructor_sliced to be GeoSeries and then return
regular Series if it's not Geo.
One possibility that came to mind would be to bind all the geo methods to
Series and DataFrame (or put them in a geo namespace like StringMethods)
and then you could swap out the class at that point.
from geopandas.
Related Issues (20)
- BUG: pandas 2.2.0 changes `GeoDataFrame.from_records()` behavior HOT 2
- ENH: (Geo)Arrow interoperability & Arrow PyCapsule Interface HOT 5
- BUG: Missing geo metadata in Parquet/Feather file which has a geo column HOT 1
- REGR: geometry column not found after groupby & column select HOT 4
- TST: Make Pytest 8 xfail less verbose? HOT 5
- BUG: assignment destination is read-only when assigning after applying ufunc HOT 8
- GeoSeries.transform returns "unsupported operand type(s)" HOT 5
- ENH: Allow to change schema and/or metadata in to_parquet HOT 4
- Support `orient` keyword in to_json HOT 3
- TST/COMPAT: Ignore int64 vs int_nativesize mismatches introduced by pandas 2.x HOT 2
- COMPAT: Default value of observed parameter in dissolve
- BUG: explore() using `cmap` instead of `color` attribute for *list-like of colors* when if `column` and `categorical` is set HOT 4
- ENH: parity between GeoDataFrame.to_json and GeoSeries.to_json HOT 1
- BUG: reducing shapely ufunc failing when called on a GeoSeries
- GeoPandas 1.0 release HOT 11
- MAINT: Fiona 1.10 support HOT 4
- BUG: up to 4 times slower in Linux compared to Windows when using gpd.read_file to read vector data HOT 6
- BUG: Is it possible to use GeoPandas on Windows 7 ?
- BUG: Unable to append to a layer when using fiona backend HOT 3
- BUG: plotting_with_geoplot example is broken HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from geopandas.