Comments (11)

Goldmariek commented on July 24, 2024

Regarding cloud storage, I only have experience with training on Google Cloud Platform; however, I don't feel comfortable assessing whether GCP is a viable alternative here.

operte commented on July 24, 2024

I asked on Slack what best practices other projects use.

operte commented on July 24, 2024

From the solutions that popped up on Slack, I'm most excited about two:

DVC

I've been wanting to try this out for some time.
It would require people to learn a few more command-line tools and do some initial setup, but it doesn't seem too hard.
Here's a brief tutorial.

The big advantage is that this is completely independent of our code, so it would work for both R and Python users without extra overhead.
The files would simply be available locally on our laptops.

I think we could use DVC to store our data on NextCloud via the WebDAV protocol.
But this is just an assumption for now.
At minimum, people would have to configure a secure connection to NextCloud, and all of us would need credentials for that.
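
If my WebDAV assumption holds, the DVC side could look roughly like this (the remote URL is only a guess at a typical NextCloud WebDAV path; credentials stored with --local stay out of git):

# version a data file with DVC instead of git
dvc add data/census.csv
git add data/census.csv.dvc .gitignore
git commit -m "Track census data with DVC"

# point DVC at a (hypothetical) NextCloud WebDAV remote
dvc remote add -d nextcloud webdavs://<nextcloud-host>/remote.php/dav/files/<user>/paris-bikes
dvc remote modify --local nextcloud user <username>
dvc remote modify --local nextcloud password <password>

dvc push   # upload the data to the remote
dvc pull   # fetch it on another machine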

The disadvantage is that we would have to find a way to keep track of where all the files came from, e.g. with some kind of metadata file.

Custom python data loader class

We could also write our own custom data loader class.
This could hide the handling of external servers and connections from the user (who would only have to establish the secure connection to the remote server we want to use).

I'm thinking I could write something like this:

import pandas as pd
from pathlib import Path

class DataLoader:
    def load(self, dataset):
        # resolve the dataset name to a local file path via the metadata file
        filename = Path(self._get_filename_from_config_file(dataset))
        if not filename.exists():
            # no local copy yet: fetch it from the remote server first
            self._get_file_from_remote_server(filename)
        return pd.read_csv(filename)  # or read_json etc., whatever fits

_get_file_from_remote_server would spare the user from having to know about or handle the remote server.
We could start with Google Drive until we get NextCloud credentials, and then move to NextCloud.
The user would not be affected (except for having to set up the authorization to access the server).

The other advantage of this is that we could have a config file where we store the metadata (data source etc.), read by self._get_filename_from_config_file.
That could also abstract the actual file away from the user.
E.g. I could load the census data with DataLoader().load("census") without having to know the actual filename or keep track of where we got that data from.
The metadata file could be quite simple. E.g.:

metadata.yaml

census:
    source: https://data-apur.opendata.arcgis.com/datasets/Apur::recensement-iris-population
    remote: <link to NextCloud/google drive>
    notes: <whatever notes we would like to add about it>
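
With such a file, the lookup inside the loader could be just a few lines of PyYAML; here's a rough sketch (the function name is made up):

import yaml  # PyYAML

def get_dataset_metadata(dataset):
    # look the dataset name up in metadata.yaml
    with open("metadata.yaml") as f:
        metadata = yaml.safe_load(f)
    return metadata[dataset]  # dict with "source", "remote", "notes"

# e.g. find out where the census data originally came from
print(get_dataset_metadata("census")["source"])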

Another advantage is that the remote link for raw files could simply be their original source URL, so we would not have to use storage space for them.

And yet another advantage is that if we ever want to deploy a dashboard or some code, we might not have to change much on the data-loading side, just the base data loader class or the metadata files.

The disadvantage is that this would then be specific to one language, so we would have to duplicate the effort between R and python.

@dietrichsimon @astrid4559 @Goldmariek
What do you think? I feel I'm overthinking this :p

operte commented on July 24, 2024

Looks like we can start with GDrive :) I'll have some time to try it out this evening (if my train has a stable connection :p)

operte commented on July 24, 2024

On GitHub LFS max file sizes:

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

Seems like we have a limit between 2 and 5 GB, depending on the plan. Sounds decent. More than that would anyway be hard to process on our local laptops, I guess.

Here's a guide on how to check what the current consumption is: https://docs.github.com/en/billing/managing-billing-for-git-large-file-storage/viewing-your-git-large-file-storage-usage

operte commented on July 24, 2024

Ok, I prepared a test branch to use git-lfs.
I downloaded a quite granular census file from here, which is about 4 or 5 MB.
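
For reference, the test branch followed what I understand to be the standard git-lfs workflow (the paths are just examples):

git checkout -b test-git-lfs
git lfs install                  # one-time setup per machine
git lfs track "data/*.csv"       # writes the pattern to .gitattributes
git add .gitattributes data/census.csv
git commit -m "Track census data with git-lfs"
git push -u origin test-git-lfs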

When I tried to push this to the remote, I got the following error:

[2022-07-26T20:19:07.734Z] batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

I assume that we either need to somehow get more storage space for the CorrelAid project or find an alternative solution. @katoss do you know who we could ask?

katoss commented on July 24, 2024

Thank you for trying git-lfs @operte! I'll try to find out if the error is tied to the CorrelAid account and whether we can do something about it.

katoss commented on July 24, 2024

So, apparently another project already uses up the bandwidth of CorrelAid's git-lfs. They are most probably OK with paying for a 50 GB upgrade; however, Frie is on holiday until mid-August, and Phil would like to talk to her about that first.
An alternative for storage is https://correlcloud.org/ (a server in Germany, for GDPR compliance). For correlcloud we need accounts, for which we also need to wait for Frie. As we are only working with open data, I guess we could just use Google Drive for data storage as an alternative to LFS. But I don't know whether that can be integrated smoothly with GitHub.

operte commented on July 24, 2024

I see that it is possible to keep our code on GitHub while LFS points elsewhere.
If correlcloud supported it, we could try pointing our LFS to it.
However, I briefly checked the NextCloud documentation and there's no mention of git-lfs servers.
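
For completeness, pointing LFS elsewhere is just a repository-level config entry, so if a compatible server ever turns up, it would be something like this (the URL is purely hypothetical, and the server would have to implement the git-lfs API):

git config -f .lfsconfig lfs.url https://lfs.example.org/paris-bikes
git add .lfsconfig
git commit -m "Point LFS at an external server"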

For now we could proceed by writing some simple data loader functions/classes that check whether a copy of the raw data already exists in the user's environment. If yes, they load it normally; if not, they download it from the source address, as sketched below.
Of course, if/when we start storing processed data, we will have to figure out where to store it (gdrive/correlcloud) and check whether those addresses are easily accessible from these functions.
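
A minimal sketch of such a function, assuming the raw files are CSVs that can be fetched straight from their public source URLs (names and paths are illustrative, and the URL mapping could live in the metadata file instead):

from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

# dataset name -> original source URL (placeholder entries)
SOURCES = {"census": "https://<source-url>/census.csv"}

def load_raw(dataset):
    path = Path("data") / f"{dataset}.csv"
    if not path.exists():
        # no local copy yet: download from the original source
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(SOURCES[dataset], path)
    return pd.read_csv(path)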

katoss commented on July 24, 2024

I talked with @Liyubov, and we think it could be a good idea for now to create branches of our repo and use your personal git-lfs storage. Then, once we have the upgrade, we can merge into the main branch. I haven't tried it myself yet, but I guess every account has personal LFS storage, so it should normally work. What do you think?

katoss commented on July 24, 2024

Thanks a lot for sharing your thoughts @operte! I don't have much experience with either option, but from what I've read so far, DVC seems like a good choice (we could choose our own cloud system, we wouldn't have to pay extra for more LFS storage, and apparently DVC was built for data science).
Do you think we could use it with Google Drive in the beginning too, and then possibly change to NextCloud, or does that only work reasonably with the custom Python loader classes?
We can discuss this in our meeting on Wednesday. :)
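
(For reference: DVC supports Google Drive remotes out of the box, so the later switch should mostly be a matter of changing the default remote and pushing again; the folder ID and URL below are placeholders.)

# start out with a Google Drive folder as the DVC remote
dvc remote add -d gdrive gdrive://<google-drive-folder-id>

# later: add the NextCloud remote and make it the new default
dvc remote add nextcloud webdavs://<nextcloud-host>/remote.php/dav/files/<user>/paris-bikes
dvc remote default nextcloud
dvc push   # re-upload the data to the new remote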
