Comments (25)

Robinlovelace commented on August 22, 2024

Does end of March sound like a doable target for this, @mpadge? Just added this to the 0.1.8 milestone: https://github.com/ropensci/stplanr/milestone/3 . Note: we've just removed another dependency: leaflet (which is quite big) is no longer an import.

Robinlovelace commented on August 22, 2024

Closing as we're no longer doing this. It will live in https://github.com/mpadge/bikedata

mpadge commented on August 22, 2024

Yup. Thanks for the motivation to finish my half-started trial of NYC citibike data ... watch this space

mpadge commented on August 22, 2024

Now we're at the stage where we need to discuss a rather important issue: for such data to be at all useful, they really are going to have to be stored in a postgres database (because, as far as I know, that's the only format compatible with general R licensing requirements and the like). Once stored, queries can then be run supremely fast. This, however, requires the creation and likely ongoing existence of such a database, and I'm not at all sure how this fits in with general R policies and practices.

I've got some very preliminary code here that does all the postgres stuff. Before I go any further, we'll need to decide how to incorporate the need for a database within stplanr. Initial thoughts?

(Oh, and note that several necessary bash scripts are in inst/sh, as suggested here. This is where all the magic happens.)
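
By way of illustration, the R side of such a workflow might look roughly like this - a minimal sketch assuming a local postgres instance, a database called 'bikedb' and a 'trips' table, not the actual code linked above:

library(DBI)
library(RPostgreSQL)

# connect to a hypothetical local database holding the trip data
con <- dbConnect(PostgreSQL(), dbname = "bikedb",
                 host = "localhost", user = "postgres")
# queries then run against the stored tables rather than the raw files
ntrips <- dbGetQuery(con, "SELECT COUNT(*) FROM trips;")
dbDisconnect(con)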

richardellison commented on August 22, 2024

Is expecting everybody to have a fully set-up and configured instance of postgres on the same computer they run R on really reasonable? My normal approach is to do exactly what you're suggesting and offload much of the data storage and processing to a postgres database (although generally one hosted on a different computer), but in this case I'm struggling to see how we can reasonably build it into stplanr.

Have you tried using spatialite with these datasets? It will be slower than postgres, but at least up to a certain size it may well work well enough. Above that size you probably end up needing to set up indexes and configure other settings that I wouldn't expect most users of stplanr to be familiar with. The licence should also be compatible with R licensing requirements.

richardellison commented on August 22, 2024

I would add that if we can let a user specify a postgres + postgis database (with a connection string, for instance) then that could be an option. That would avoid needing a local postgres instance (as your current code requires) if somebody has data that spatialite cannot handle efficiently. postGIStools might be useful for this.

mpadge commented on August 22, 2024

Thanks Richard, I'll check out those options and modify accordingly....

mpadge commented on August 22, 2024

postgres versus C++

Some quantitative food for thought is now on the bikedata README. I've plugged in my C++ routines and run a timing comparison against constructing a postgres database. The latter takes >62 s, while reading the same data with C++ takes <10 s, so it is over six times faster. Given Richard's entirely legitimate concerns about database storage within stplanr, this raises the question of whether you'd simply prefer the C++ code.

Drawbacks:

  1. Lack of flexibility - I would have to hard-code the extraction of particular 'queries', such as grouped by hour-of-day or weekday
  2. Each 'query' will require running the entire extraction routine from scratch again, raising what I see as the abiding issue here, which is ...

How to decide?

If you think an 'average' user is likely to want fewer than around 10 different queries, then C++ should enable data extraction that is both quicker and easier to use overall. Otherwise, postgres will be better.

Note that most 'queries' are likely to be generic - I can easily code fixed C++ routines to bin data by hour, weekday, month, and user profile (which exists for some cities, including NYC), and the main R function can simply accept start and end dates for the desired data. That should cover most query needs. I could even return all of those standard permutations of queries in a single list, and almost certainly do it quicker than postgres.
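
As a pure-R sketch of what those fixed binning routines would return (the function and column names here are hypothetical, just to illustrate the idea):

# bin trips by hour, weekday, or month of their start times
bin_trips <- function(trips, by = c("hour", "weekday", "month")) {
    by <- match.arg(by)
    fmt <- switch(by, hour = "%H", weekday = "%a", month = "%m")
    tab <- table(format(trips$start_time, fmt))
    data.frame(bin = names(tab), ntrips = as.integer(tab))
}

# example with fabricated data:
trips <- data.frame(start_time = as.POSIXct("2017-01-01") +
                        runif(1000, 0, 365 * 24 * 3600))
bin_trips(trips, "weekday")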

I also imagine that overt slowness in extracting the data may not be such an issue, because most time is likely to be spent in analysis rather than running endless queries on different aspects of the data.

Thoughts?

Additional comments

  1. I guess this also resolves Richard's sqlite question: If postgres is already so slow in comparison, then using sqlite would seem rather pointless indeed.
  2. Don't worry about the very non-official structure of the bikedata repo - it's just there to run the current demo, and pulling it into standard R package form will be no issue.
  3. Note that the C++ code uses Dirk's BH package, which adds a bit of weight, though not as much as it used to! I could conceivably get rid of this dependency, but that would likely be quite some work ...
  4. My C++ code is likely not particularly well optimised in many places, and I imagine that some pretty hefty speed-ups could be made ... down the track.

mpadge commented on August 22, 2024

Slight update: the C++ routines weren't quite doing comparable things. Now they are, and the speed difference is more like 3+ times faster rather than 6. More importantly, note the differences in numbers of reported trips on the bikedata README - these data files are full of junk trips (missing starts, ends, or other bits that don't make sense). My C++ code controls for all of that, whereas automatic import into postgres obviously doesn't. Another argument in favour of C++?

Robinlovelace commented on August 22, 2024

My view: the C++ option sounds way more appealing based on the above.

Note that we already install BH when stplanr is installed, as now noted in the README, along with the kitchen sink (any counsel on how to reduce this is welcome):

[Figure: stplanr dependency graph from the README]

So yes, sounds like that leaves a clear path for the C++ solution to go ahead - happy to take further input from others. I'm due to learn some C++ from @virgesmith in January, so that's another reason, if any were needed!

mpadge commented on August 22, 2024

That dep graph is monstrous! The strongest argument of all for not adding another monster via an RPostgreSQL dep.

richardellison commented on August 22, 2024

I'm not sure benchmarking reading data in C++ against constructing and importing data into a postgres (or, for that matter, spatialite) table is an appropriate comparison. Inserting data into a database is always going to be slower than reading data into memory. The question is whether subsequent queries make up for the additional import time, particularly if you would otherwise need to reread the data every time. My experience is that they do, particularly with very large datasets and some properly configured indexes, provided you are not trying to retrieve every single trip in every query (which effectively reduces to scanning the whole dataset, at which point you may as well read the whole dataset every time). One other thing to mention: importing data into a postgres or spatialite database is many times faster using the C++ postgres (or spatialite) libraries than using R as an intermediary.

I might have a play with the citibike data and see if using spatialite is a reasonable option. To add yet another alternative, there are the bigmemory and associated packages, which allow you to use datasets larger than available memory in R.
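
To illustrate the kind of index configuration meant here, using RSQLite (a sketch only; the database file and the table/column names are assumptions rather than anything that exists yet):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "bikedb.sqlite")
# an index on the origin-destination pair makes grouped
# station-to-station queries much cheaper than full scans
dbExecute(con, "CREATE INDEX IF NOT EXISTS trips_od_idx
                ON trips (start_station_id, end_station_id);")
dbDisconnect(con)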

richardellison commented on August 22, 2024

I think we may want to try pruning some of those dependencies. Can we get rid of RCurl, for instance, and replace it entirely with httr? maptools seems to be used for spRbind; is there an alternative in one of the other packages we already import?

Robinlovelace commented on August 22, 2024

Yes, we're already using raster::bind(), which is equivalent, so we can chuck that. And yes, please ditch RCurl - I think that lives in one of your funs @richardellison ; )

returnval[i] <- gsub('\\\\\\\\\"','\\\\\\"',gsub('\\\\','\\\\\\\\',RCurl::getURL(paste0(qryurl,"loc=",paste0(viapoints[[i]][,1],',',viapoints[[i]][,2],'&u=',viapoints[[i]][,3],collapse='&loc='),'&',
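
A rough httr-based equivalent of that RCurl::getURL() call might look like the following sketch, with the endpoint and query parameters standing in schematically for those built up in the excerpt above:

library(httr)

# hypothetical routing endpoint standing in for qryurl
qryurl <- "http://localhost:5000/viaroute"
res <- GET(qryurl, query = list(loc = "51.5,-0.1", u = "m"))
txt <- content(res, as = "text", encoding = "UTF-8")

httr also handles URL escaping itself, which should remove the need for most of that nested gsub() escaping.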

mpadge commented on August 22, 2024

The question is whether subsequent queries make up for the additional import time, particularly if you would otherwise need to reread the data every time. My experience is that they do, particularly with very large datasets and some properly configured indexes, provided you are not trying to retrieve every single trip in every query (which effectively reduces to scanning the whole dataset, at which point you may as well read the whole dataset every time)

That's my question exactly, and I don't know the answer. I'd suggest you guys put some thought into what a typical use case might look like. Noting that the timing comparisons are with the C++ postgres interface (hence the scripts in /inst/sh), I think one way to approach an answer would be to estimate the number of distinct queries likely to be necessary in a typical use case. If it's more than a handful, then of course postgres/sqlite will be the way to go.

@richardellison everything you need for an sqlite version should be in the bikedata scripts. Data download is handled within R. Feel free to adapt and PR ... And note that the total number of trips reported from postgres far exceeds the number of lines in the files (which are consistent either way, and given in the current Rcpp section). There's obviously something fishy going on there ...

Oh, and finally, memory ought not to be a real concern because (given that Chinese data can't yet be accessed) the biggest bike system remains Paris, with 'only' 1,230 stations, so there should be no trouble doing everything in RAM.

Robinlovelace commented on August 22, 2024

Also: I'm confident we can ditch data.table - dplyr does all we need there, I think. Happy to have a bash at that, @richardellison? I could also do it, but I think it's in your code too.
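
For reference, the kind of translation involved is usually mechanical - e.g. counting trips per origin-destination pair (a generic sketch; column names are hypothetical):

library(dplyr)

# data.table idiom: trips_dt[, .N, by = .(start_station, end_station)]
# dplyr equivalent:
trips <- data.frame(start_station = c(1, 1, 2), end_station = c(2, 2, 1))
count(trips, start_station, end_station)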

richardellison commented on August 22, 2024

RCurl is definitely in my code; I'll have a go at removing both RCurl and data.table.

Having had another look at the scripts in /inst/sh, I see that they first import the data using the COPY command and then copy it into a new table. What I'm suggesting is a C++ function that reads the data and inserts it into the final table in one go, effectively eliminating one of the copies. In any case, I'll fork the bikedata repo and try a spatialite version with a C++ import function.

richardellison commented on August 22, 2024

I have removed the RCurl, data.table and maptools dependencies with #169.

richardellison commented on August 22, 2024

As an update on this, I have so far written a (at this point rudimentary) C++ function to import the data into a spatialite database. This process is considerably slower than using C++ directly, and even slightly slower than importing into a postgres database (as I expected). However, it is still reasonably quick, and even without any indexes I can run queries like the following in about 1 second per 1 million trips, most of which is spent printing the results:

SELECT start_station_id, end_station_id, COUNT(*) as numtrips 
FROM trips 
GROUP BY start_station_id, end_station_id;

With indexes it should be faster, but I still need to run some additional tests.
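
For completeness, that query can be run from R with RSQLite along these lines (a sketch; the database file name is an assumption, and spatial functions would additionally require the spatialite extension to be loaded):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "bikedata.sqlite")
od <- dbGetQuery(con, "SELECT start_station_id, end_station_id,
                              COUNT(*) AS numtrips
                       FROM trips
                       GROUP BY start_station_id, end_station_id;")
dbDisconnect(con)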

mpadge commented on August 22, 2024

Great stuff @richardellison - as said, please feel free to PR into bikedata and I'll gladly merge it with the rest of what I've got (notably the adaptations to other cities).

Robinlovelace commented on August 22, 2024

Awesome work removing the data.table dep - that will make the graph a little less monstrous. Would like to get that in before the next release. Suggest we save the PR resulting from this issue for the release after next.

mpadge commented on August 22, 2024

Latest thoughts: the bikedata routines seem more like a separate package, I'd say. They'd likely also find more usage that way, rather than being 'buried' within stplanr. They will nevertheless be constructed to enable direct import into stplanr. Another plus for you guys is that it will avoid extra stplanr deps (like sqlite). End of March still looks feasible. Does that sound okay?

Robinlovelace commented on August 22, 2024

Sounds good to me Mark.

Robinlovelace commented on August 22, 2024

And no pressure - we may have more osmdata pressure around then for all we know.

richardellison commented on August 22, 2024

Sounds good to me too.
