
Comments (3)

TomAugspurger commented on June 24, 2024

Thanks for opening this issue. A couple questions:

  1. Can you share an example of an operation that's enabled by the data lake APIs that isn't possible (or is maybe slower) with the Blob Storage APIs? Just trying to understand why a user might want this.
  2. Do you have a suggestion for how this might be implemented, and what the user facing API would be?

It looks like historically this library had two implementations: one for Data Lake (Gen 1) and one for Blob Storage. API-wise, would we want to keep these separate? Or would we want a single AzureBlobFileSystem with a keyword that controls the underlying Azure client we use?

from adlfs.

kaaveland commented on June 24, 2024

Not OP here, but the initial sales pitch for Data Lake Gen2 when it launched was that it understands file system structure. The name is a bit unfortunate, because it suggests the product is closely related to Data Lake Gen1, which it really isn't. I always considered it "blob storage with first-class folder structure".

With azure-storage-file-datalake, listing the contents of a directory is very fast, and you can expand the file tree one level at a time. If I recall correctly, with BlobServiceClient you need to do a prefix/glob match instead (this might not be true anymore; it's been years since I worked with it). Depending on the use case, that can be very slow. For datasets with many partitions, it makes a pretty big difference if you mostly access them with partition filters, since listing blobs is (or used to be) so slow. I guess you also get atomic, cheap renames of folders, which I can't imagine is easy to achieve with the blob API.
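
A rough sketch of the two listing styles, in case it helps the discussion. It assumes the clients behave like `azure.storage.filedatalake.FileSystemClient` (with `get_paths`) and `azure.storage.blob.ContainerClient` (with `walk_blobs`). The helper names are made up for illustration and are not part of adlfs, and this says nothing about how fast either call actually is.

```python
from typing import List


def list_one_level_datalake(file_system_client, directory: str) -> List[str]:
    """List the immediate children of `directory` with a Data Lake client.

    Assumption: `file_system_client` behaves like
    azure.storage.filedatalake.FileSystemClient, whose get_paths() takes
    `path` and `recursive` and yields items with a `.name` attribute.
    """
    paths = file_system_client.get_paths(path=directory, recursive=False)
    return [p.name for p in paths]


def list_one_level_blob(container_client, directory: str) -> List[str]:
    """Approximate the same listing with a Blob client (prefix + delimiter).

    Assumption: `container_client` behaves like
    azure.storage.blob.ContainerClient, whose walk_blobs() takes
    `name_starts_with` and `delimiter` and yields items with a `.name`.
    """
    # Normalise the directory into a prefix that ends with exactly one slash.
    prefix = directory.rstrip("/") + "/"
    items = container_client.walk_blobs(name_starts_with=prefix, delimiter="/")
    return [item.name for item in items]
```

With a partitioned dataset, either helper would be called once per level (for example on `data/year=2024`) instead of enumerating every blob under `data/`.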

Data Lake Gen2 also supports a bunch of things that I don't think are relevant to this project, like setting up ACL/RBAC for folders, that aren't supported by blob storage. People who use adlfs may also be using azure-storage-file-datalake to do those things before or after writing data, so they may already have a configured client instance available, but that seems like a pretty weak reason to take on the complexity of supporting both clients.


WaterKnight1998 commented on June 24, 2024

@efiop @hayesgb this will increase the speed a lot, please take a look :)

