Giter VIP home page Giter VIP logo

javisantana.com's People

Contributors

desarrolla2 avatar gimco avatar javierarce avatar javisantana avatar nimbusaeta avatar

Stargazers

 avatar  avatar  avatar

javisantana.com's Issues

[blogpost] Local search with PostGIS

Local search with PostGIS

I was talking with some friends about a geo problem they have:

Given a set of elements with a position and a name in a database give me the N closest to a certain point that
match some pattern on the name.

So imagine you have openstreetmap database and want to find the first 300 banks and bars closer to
Madrid city center (pretty interesting combination I'd say, in Spain ATMs are in the banks).

So in order to test it I loaded Madrid OSM in a postgres database. I just downloaded the data from geofabrik site and imported using osm2pgsql tool, pretty straightforward.

Then I created some indices for the way and amenity columns (using full text search stuff)

-- gist index for the geometry, full text search for the text
create index on planet_osm_point gist(way);
create index on planet_osm_point using to_tsvector('spanish', amenity)

First try, use Nearest Neighbour search

Since postgis 2.0 we have Nearest Neighbour search (thanks to CartoDB which founded it) that allows
to use the geospatial index to sort results (read this blogpost in boundless blog), so the first try was to order by distance operator <-> the results from the text filter.

select way, amenity from planet_osm_point 
where 
to_tsvector('spanish', amenity) @@ to_tsquery('ba:*') 
order by way <->'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry limit 301

This takes around 48ms (everything cached). Looking at the explain analyze out of curiosity I
realize spatial index wasn't being used:

Limit  (cost=23947.86..23948.61 rows=301 width=41) (actual time=48.729..48.780 rows=301 loops=1)
   ->  Sort  (cost=23947.86..24035.30 rows=34976 width=41) (actual time=48.727..48.755 rows=301 loops=1)
         Sort Key: ((way <-> '010100002031BF0D00F8DAD368D52F19C1C324815603CB5241'::geometry))
         Sort Method: top-N heapsort  Memory: 48kB
         ->  Bitmap Heap Scan on planet_osm_point  (cost=1363.07..22333.09 rows=34976 width=41) (actual time=12.120..41.008 rows=17382 loops=1)
               Recheck Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
               ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0) (actual time=10.889..10.889 rows=17382 loops=1)
                     Index Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
 Total runtime: 48.877 ms

I don't fully understand how the postgres planner works but sounds like it might be using a bitmap
and operation both indices. Increasing the limit does not change anything, I thought it could change
the selectivity. I tried a search by distance:

SELECT way, amenity FROM planet_osm_point 
WHERE
to_tsvector('spanish', amenity) @@ to_tsquery('ba:*') 
AND
st_dwithin(way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 500)

and I get the desired BitmapAnd:

 ...
 ->  BitmapAnd  (cost=1422.90..1422.90 rows=42 width=0) (actual time=14.881..14.881 rows=0 loops=1)
         ->  Bitmap Index Scan on planet_osm_point_index  (cost=0.00..68.32 rows=2121 width=0) (actual time=2.229..2.229 rows=6028 loops=1)
               Index Cond: (way && '010300002031BF0D000100000005000000F8DAD368154F19C1C32481560FC95241F8DAD368154F19C1C3248156F7CC5241F8DAD368951019C1C3248156F7CC5241F8DAD368951019C1C32481560FC95241F8DAD368154F19C1C32481560FC95241'::geometry)
         ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0) (actual time=12.607..12.607 rows=17382 loops=1)
 ...

Second try, use a recursive query

But that's not the result I was looking for, I need the first 300 bars and banks closer to my
location, no matter if they are 30km away

So I though about having a kind of python generator, a query that returns results on demand until I
have enough results.

Reading postgres documentation there is a nice trick in CTE documentation, you can do a recursive
query without stop condition that stops where the external query reach the limit.

WITH RECURSIVE t(osm_id, way, amenity, distance) AS (
  (
    SELECT osm_id, way, amenity, 100.0 as distance from planet_osm_point
        WHERE
    to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')
      AND
    st_dwithin(way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 100) 
    order by way <-> 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry
  )
  UNION ALL
  (
    SELECT p.osm_id, p.way, p.amenity, prev.distance * 2 as distance
    FROM planet_osm_point p , (SELECT distance FROM t LIMIT 1) prev
       WHERE
    to_tsvector('spanish', p.amenity) @@ to_tsquery('ba:*')
      AND
    st_dwithin(p.way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, prev.distance * 2)
      AND not
    st_dwithin(p.way, 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, prev.distance)
    order by way <-> 'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry
   )
),
results as  (
    select *, st_distance(way,'SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry) as real_dist FROM t limit 300
)
SELECT * FROM results

The time for this query is around 48ms so no improvement at all. But that may depend on the number
of iterations it needs to do until fetch all the results. Starting with 300 meters takes 4 loops
since last bar is 2393 meters away form city center.

If it starts the iteration with a bigger radius, like 1500 (2 iterations), the query time is 25ms. If it only needs
to do the fist iteration, it's 18ms which is much better.

So how do we know what would be a good value to start? Hard to say without some density information stats... luckily postgres has pretty good stats about indices and there are good ways to access it: EXPLAIN and
_postgis_selectivity

spain=# explain select 1 FROM planet_osm_point where   to_tsvector('spanish', amenity) @@ to_tsquery('ba:*')       ;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on planet_osm_point  (cost=1363.07..22245.65 rows=34976 width=0)
   Recheck Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
   ->  Bitmap Index Scan on planet_osm_point_to_tsvector_idx1  (cost=0.00..1354.32 rows=34976 width=0)
         Index Cond: (to_tsvector('spanish'::regconfig, amenity) @@ to_tsquery('ba:*'::text))
(4 rows)

Forget everything but rows=34976 that's the estimate of total number of bars and banks in the
whole table (the real count value is 17k but let's see if it's good enough)

In order to know how many points there are in a certain are we can use _postgis_selectivity
function. It's kind of hidden, is what actually postgres stats planner uses.

select _postgis_selectivity('planet_osm_point', 'way', st_expand('SRID=900913;POINT(-412661.352370664 4926477.3516323)'::geometry, 5000))
 _postgis_selectivity
----------------------
  0.00751143626570702

(you can get the same value using EXPLAIN but I like _postgis_selectivity)

So knowing the selectivity (the part of total rows selected by that bbox), the total number of bars
and banks and the total count (1.8M) we can estimate:

total_rows * percentaje_categories * selectivity
1748797 * 0.02 * 0.00751143626570702 ~ 262.0

So in 10km area we have an stimation of 262 points, so to get 300 points:

density = 262/(10000*10000) // points/m^2
area_to_300 = 300/density
R = sqrt(area_to_300/2*PI);
R -> 4268.93996744354m

So using a radius of 4200 meters the query takes ~18ms as it's doing a single iteration. It's
important to say that calculate the stats is almost free, it takes less than 1ms (except total count
that could be precalculated).

Other nice thing about the recursive query is it can be paginated, so you can find 100 results get
the last distance and the next time use that distance to start iterating.

Maybe the method works well for this case but not for other ones, depending on how postgres uses the indices the times may vary a lot but this is the best approach I found, any idea?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.