Comments (11)
Regarding cloud storage, I only have experience with training on Google Cloud Platform; however, I don't feel comfortable assessing whether GCP is a viable alternative.
from paris-bikes.
I asked on Slack what the best practices of other projects are.
From the solutions that popped up on Slack, I'm most excited about two:
DVC
I've been wanting to try this out for some time.
It would require people to learn a few more command line tools and probably some learning/setting-up, but it doesn't seem too hard.
Here's a brief tutorial.
The big advantage is that this is completely independent of our code, so it would work for both R and Python scripters without extra overhead.
The files would just be on our laptops for use.
I think we could use DVC to store our data on NextCloud via the WebDAV protocol, but this is just an assumption for now.
At minimum, people would have to somehow configure a secure connection to NextCloud, and all of us would need credentials for that.
The disadvantage is that we would have to find a way to keep track of where all the files came from, e.g. some kind of metadata file.
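For what it's worth, a NextCloud remote via WebDAV could be configured roughly like this, based on DVC's remote configuration commands. The URL, remote name, and credentials are all placeholders; I haven't tested this:

```shell
# Register a WebDAV remote as the default; the URL is a placeholder
dvc remote add -d nextcloud webdavs://example.com/nextcloud/remote.php/dav/files/myuser/
# Store credentials locally only (kept out of the shared repo config)
dvc remote modify --local nextcloud user myuser
dvc remote modify --local nextcloud password mypassword
```

After that, `dvc push` / `dvc pull` would move the tracked data files to and from NextCloud.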
Custom Python data loader class
We could also write our own custom data loader class.
This could abstract the handling of external servers and connections away from the user, who would simply have to establish the secure connection to the remote server we want to use.
I'm thinking I could write something like this:
import os
import pandas as pd

class DataLoader:
    def load(self, dataset):
        # Resolve the dataset name to a filename via the metadata/config file
        filename = self._get_filename_from_config_file(dataset)
        # Fetch the file from the remote server if there is no local copy yet
        if not os.path.exists(filename):
            self._get_file_from_remote_server(filename)
        # Load the local copy (pd.read_csv or whatever)
        return pd.read_csv(filename)
_get_file_from_remote_server would spare the user from having to know or handle the remote server.
We could start with Google Drive until we get NextCloud credentials, and then move to NextCloud.
The user would not be affected (except for having to set up the authorization to access the server).
The other advantage of this is that we could have a config file where we store the metadata (data source etc.), read by self._get_filename_from_config_file. That could also abstract the actual file away from the user.
E.g. I could load the census data with DataLoader.load("census"), without having to know the actual filename or keep track of where we got that data from.
The metadata file could be quite simple. E.g.:
metadata.yaml
census:
  source: https://data-apur.opendata.arcgis.com/datasets/Apur::recensement-iris-population
  remote: <link to NextCloud/google drive>
  notes: <whatever notes we would like to add about it>
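The lookup behind self._get_filename_from_config_file could then be as simple as this sketch. For illustration I use an in-memory dict mirroring metadata.yaml; in practice it would be parsed from the file (e.g. with PyYAML's yaml.safe_load). The function name and error handling are my own assumptions:

```python
# Hypothetical in-memory version of metadata.yaml; in practice, parse the
# file itself, e.g. yaml.safe_load(open("metadata.yaml")) with PyYAML
METADATA = {
    "census": {
        "source": "https://data-apur.opendata.arcgis.com/datasets/Apur::recensement-iris-population",
        "remote": "<link to NextCloud/google drive>",
        "notes": "<whatever notes we would like to add about it>",
    }
}

def get_remote_url(dataset):
    """Resolve a dataset name like "census" to its remote download link."""
    try:
        return METADATA[dataset]["remote"]
    except KeyError:
        # Fail loudly so missing entries are caught early
        raise KeyError(f"Unknown dataset {dataset!r}; add it to metadata.yaml")
```

This keeps all provenance information in one place, so adding a dataset is just a new entry in the YAML file.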
Another advantage is that for raw files, the remote link could simply be their original source URL, so we would not have to use storage space for them.
And yet another advantage: if we ever want to deploy a dashboard or some code, we might not have to change much on the data loading side, just the base data loader class or the metadata files.
The disadvantage is that this would be specific to one language, so we would have to duplicate the effort between R and Python.
@dietrichsimon @astrid4559 @Goldmariek
What do you think? I feel I'm overthinking this :p
Looks like we can start with GDrive :) I'll have some time to try it out this evening (if my train has a stable connection :p)
On GitHub LFS max file sizes:
It seems we have a limit between 2 and 5 GB, depending on the plan. Sounds decent. More than that would anyway be hard to process on our local laptops, I guess.
Here's a guide on how to check what the current consumption is: https://docs.github.com/en/billing/managing-billing-for-git-large-file-storage/viewing-your-git-large-file-storage-usage
Ok, I prepared a test branch to use git-lfs.
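For reference, the setup on that branch looked roughly like this; the track pattern and branch details are assumptions on my part:

```shell
git lfs install               # one-time setup per machine
git lfs track "data/*.csv"    # store matching files in LFS instead of plain git
git add .gitattributes data/
git commit -m "Track raw census data with git-lfs"
git push                      # uploads the LFS objects to the remote
```

The .gitattributes file is what records which patterns go through LFS, so it has to be committed along with the data.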
I downloaded a quite granular census file from here, which is about 4 or 5 MB large.
When I tried to push these to the remote, I got the following error:
[2022-07-26T20:19:07.734Z] batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
I assume that either we need to somehow get more storage space for the correlaid project, or find an alternative solution. @katoss do you know who we could ask?
Thank you for trying git lfs @operte ! I'll try to find out if the error is tied to the correlaid account and if we can do something about it
So, apparently another project is already using the bandwidth of CorrelAid's git lfs. They are most probably ok with paying for a 50 GB upgrade; however, frie is on holiday until mid-August, and phil would like to talk to her about that first.
An alternative for storage is https://correlcloud.org/ (a server in Germany, for GDPR compliance). For correlcloud we need accounts, for which we also need to wait for frie. As we are only working with open data, I guess we could just use Google Drive for data storage as an alternative to LFS. But I don't know if it is possible to integrate that smoothly with GitHub.
I see that it is possible to have our code on GitHub while the LFS points elsewhere.
If correlcloud supported it, we could try pointing our LFS to it.
However, I briefly checked the NextCloud documentation and there is no mention of git-lfs servers.
For now we could proceed by writing some simple data loader functions/classes that check whether a copy of the raw data already exists in the user's environment: if yes, load it normally; if not, download it from the source address.
Of course, if/when we start storing processed data, we will have to figure out where to store it (GDrive/correlcloud) and check whether these functions can easily access those addresses.
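A minimal sketch of such a loader function, assuming raw files are cached under a local data directory (the function name and paths are my own):

```python
import pathlib
import urllib.request

def load_raw(name, url, data_dir="data/raw"):
    """Return the local path of a raw file, downloading it on first use."""
    local = pathlib.Path(data_dir) / name
    if not local.exists():
        # No local copy yet: fetch it once from the source address
        local.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, local)
    return local
```

Subsequent calls hit the local copy and never touch the network, which also means the function keeps working offline once the data has been fetched.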
I talked with @Liyubov; we think it could be a good idea for now to create branches of our repo and use your personal git lfs storage. Then, once we have the upgrade, we can merge into the main branch. I haven't tried it myself yet, but I guess every account has personal lfs storage, so it should normally work. What do you think?
Thanks a lot for sharing your thoughts @operte! I don't have much experience with either option, but from what I've read so far, DVC seems like a good option (we could choose our own cloud system, would not have to pay extra for more lfs storage, and apparently DVC has been built for data science).
Do you think we can use it with Google Drive in the beginning, too, and then possibly change to NextCloud, or does that only work reasonably with custom Python loader classes?
We can discuss this in our meeting on Wednesday. :)