Giter VIP home page Giter VIP logo

Comments (11)

jwass avatar jwass commented on May 16, 2024

I think we want to have a single defined column acting as "the geometry". Your suggestion is definitely a lot cleaner.

I could see some scenarios where you might want multiple geometries in a row with the ability to easily switch. For example you could store different simplification levels in different columns and switch which to use under different conditions.

I'll try this out.

from geopandas.

kjordahl avatar kjordahl commented on May 16, 2024

@jtratner I like this suggestion. It seems to me the "pandas way" to handle this, using a property that has special behavior rather than a specially named column. Of course you can always store geometries in any column, but there's no easy way to know which ones (they're just dtype object). It seems right that there should only be one "geometry" of a given row at a time, even if other "geometries" are actually stored in the same row.

Another advantage of making the geometry a property is that normal reduction methods won't have to worry about treating the geometry column differently.

@jwass Your PR looks like a good first step. It gives us part of that API.

from geopandas.

jtratner avatar jtratner commented on May 16, 2024

Okay. @jwass - I was (implicitly I guess) offering to do it, but if you'd like to, go ahead :)

from geopandas.

jwass avatar jwass commented on May 16, 2024

Sorry Jeff, didn't mean to step on your toes there. I think I just read it as a suggestion based on your experience with Pandas. Obviously feel free to rip apart my commit with what you had in mind.

from geopandas.

jwass avatar jwass commented on May 16, 2024

I suppose the next steps for this would be to store the geometry as a member and remove any special treatment of a geometry column. Maybe the GeoDataFrame's constructor would look for a column named geometry by default?

from geopandas.

kjordahl avatar kjordahl commented on May 16, 2024

@jwass Already merged #47, as it fit with our current API. Now that I think about it, though, pandas columns are already accessible as properties, so the property there was redundant. The set_geometry method makes sense, though.

@jtratner I like the idea of geometry as a real property, feel free to try it out and send a PR.

from geopandas.

carsonfarmer avatar carsonfarmer commented on May 16, 2024

This discussion is an exciting one. I actually work with multiple geometry columns in databases all the time (origin and destination locations), so the ability to switch between them is ideal. The more I follow this discussion, the more I like the idea of set_geometry acting like default pandas set_index. In fact, the whole concept of geometries acting as the 'index' of a data frame is a nice one, and avoids lots of issues (as @kjordahl suggested re: reduction methods). This also sounds a lot like the 'pandas way' of doing things... turn a column into an index, which has its own internal 'index' to speed up retrieval etc. For now, we can avoid this internal index at the expense of some speed, but when we do eventually get around to working that out (via RTree?), the API shouldn't change much/at all. Cool stuff!

from geopandas.

jtratner avatar jtratner commented on May 16, 2024

Happy that you put something through already - just wanted to make it clear
I'd do it if not. :)

I don't think you need to change the idea of treating the geo column
specially by default, I think we should change places that do
self['geometry'] to be self.geometry or
self[[self._geo_column_name]], etc. For example, if I do this:

gf = GeoDataFrame(zip([1, 2, 3], [4, 5, 6], [geo1, geo2, geo3], [geo1,
geo2, geo3]), columns = ["A", "B", "geometry", "geometry"])

Looking up on geometry now produces a (Geo)DataFrame (oh no!). So you
already have a potential breakage.

I suggest the following:

  1. Have 'geometry' be the default for now.
  2. Override the column property to validate that there aren't multiple
    names for whatever the geo column is on setting.
  3. Store the geo column name as metadata on the frame (so it'll get carried
    forward).

1-3 will help avoid issues with dup cols for geometry and avoid
encountering issues with reordering.

Side note - how will GeoPandas handle hierarchical columns? Unless you use
integer column locations (which create their own issues with reordering),
it's not totally clear to me. Maybe warn for now or something?

from geopandas.

jwass avatar jwass commented on May 16, 2024

After re-reading the discussion here, I'm not sure everyone is on the same page about how to approach this. I'll try to summarize my interpretation. If I misrepresent your position, let me know.

@jtratner is recommending that GeoDataFrame have a property indicating the column to return when accessing the geometry property. We can have several columns containing geometries and calls to set_geometry simply switch the name of the column to use. However the geometry column(s) would still be ordinary columns accessible in the usual way. For example:

gdf = GeoDataFrame(...)  # Uses  'geometry' column by default
shape = gdf.geometry.iloc[0]
shape = gdf['geometry'].iloc[0]  # Returns the same as above because 'geometry' is just a column

gdf['other_geometry'] = GeoSeries(...) # (or list of geometries, etc.)
a = gdf.area
a = gdf.geometry.area
a = gdf['geometry'].area  # All three are the same

gdf.set_geometry('other_geometry', inplace=True)
a2 = gdf.area
a2 = gdf.geometry.area
a2 = gdf['other_geometry'].area  # All three are the same

@kjordahl is recommending that the geometry would not really be a column in the DataFrame but would be a separate property that we always do the right thing with. It would serve the similar function above, but you couldn't access it using the normal DataFrame operations. In this approach:

gdf = GeoDataFrame(...)
shape = gdf.geometry.iloc[0]  # The only way to get the geometry
shape = gdf['geometry'].iloc[0]  # Raises KeyError because there's no column named geometry

I think the major problem with this approach is that we'd need to override all the row indexing operations to grab out the proper rows in our geometry property...? gdf[:5] would have to separately grab those elements in our geometry property. If we keep the geometry as a column, the usual indexing takes care of that for us.

I tried implementing the basic idea of the first suggestion: (https://github.com/jwass/geopandas/commits/set_geometry). It too raises a bunch of questions.

  • If I get a column containing geometries but is not the defined geometry, would it return a GeoSeries? (i.e. what does gdf['other_geometry'] return when other_geometry is not the geometry column?). I think it would be powerful to return a GeoSeries. Would the GeoBlock mentioned in Issue #51 help with this, so we don't have to check all the elements whenever a column is retrieved?
  • What type is returned here: gdf2 = gdf[['prop1', 'prop2', 'other_geometry']] when other_geometry is not the geometry?
  • Would we allow different columns to have different crs?

I think both have their pros and cons but want to make sure we're talking about the same thing moving forward. I'm leaning toward the first approach where they the geometry is an ordinary column...

from geopandas.

jtratner avatar jtratner commented on May 16, 2024

I'm on board with @kjordahl suggestion (and it's really just a further
extension of making geometry a property). But the downside is that you have
to reapply all the slicing whenever you do things to the other columns...

from geopandas.

jtratner avatar jtratner commented on May 16, 2024

Some background on pandas internals that have bearing on this: internal
representation of DataFrame (and Series) is a block manager which holds
individual dtypes in separate blocks. (not sure if object dtypes are put in
the same block). Series uses a SingleBlockManager that does something
similar (only in 0.13 and above). When you ask for a column, part/all of
the block is converted to a Series, and the Series is cached on the parent
(so if you query twice you get the same thing) and Series updates or
invalidates its cache as necessary.

If you always want to return Geo Series from geocolumns, would need to
intercept all the indexers (loc, iloc, ix, xs, etc) OR you could monkey
patch DataFrames' constructor_sliced to be GeoSeries and then return
regular Series if it's not Geo.

One possibility that came to mind would be to bind all the geo methods to
Series and DataFrame (or put them in a geo namespace like StringMethods)
and then you could swap out the class at that point.

from geopandas.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.