Comments (14)
@martindurant @zooba thanks for the replies!
@martindurant not complaining that it doesn't exist yet, just trying to gather the info needed to maybe kickstart it myself :)
@zooba do you mean the library I'd want if I'm thinking of implementing this type of dask connector?
from adlfs.
OK, I understand now that you have to pass the storage_options dict like so:
df = dd.read_csv("adl://somedatalakestore.azuredatalakestore.net/somefile.csv",
                 storage_options={"tenant_id": "something", "client_id": "something"})
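For service-principal auth a secret is normally needed as well; a minimal sketch, assuming adlfs accepts tenant_id, client_id and client_secret in storage_options (the store name and all credential values below are placeholders):

```python
# Hypothetical service-principal values; the assumption is that adlfs
# reads tenant_id, client_id and client_secret from storage_options
# for Azure AD authentication.
storage_options = {
    "tenant_id": "my-tenant-guid",
    "client_id": "my-app-guid",
    "client_secret": "my-app-secret",
}

def load_csv(path="adl://somedatalakestore.azuredatalakestore.net/somefile.csv"):
    """Not executed here: needs dask, adlfs and a real Data Lake store."""
    import dask.dataframe as dd
    return dd.read_csv(path, storage_options=storage_options)
```

Keeping the credentials in a separate dict also makes it easy to load them from a config file instead of hard-coding them.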
@mrocklin , I am not aware of a way to get Azure default credentials, but I would be surprised if there isn't one, probably in https://github.com/AzureAD/azure-activedirectory-library-for-python. I would ask some people at MS directly.
@noelbundick or @zooba might know the right people to contact about helping out here
@lmazuel may be able to help, but I suspect the easiest way is to use classes from msrestazure (which should already be installed as a dependency of the DataLake SDK)
What about a similar thing for Azure Blob Storage? I reckon it resembles S3 more closely and is much cheaper. And authentication is done via shared secrets rather than Active Directory credentials.
I'm guessing it would amount to writing an equivalent to https://github.com/dask/s3fs, and then a wrapper like this one. Is that right?
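As a rough illustration of what "an equivalent to s3fs" would involve: mostly (a) splitting "container/key" paths and (b) routing list/open operations to the blob API. A minimal sketch; the class and method names are hypothetical, not the actual adlfs or azure-storage API:

```python
class AzureBlobFileSystemSketch:
    """Skeleton of an s3fs-like wrapper over Azure Blob Storage.

    Real list/open operations would go through the azure-storage blob
    client; here they are deliberately left unimplemented.
    """

    def __init__(self, account_name, account_key):
        self.account_name = account_name
        self.account_key = account_key  # shared-secret auth, as with S3 keys

    @staticmethod
    def split_path(path):
        """Split 'container/some/key.csv' into (container, key)."""
        path = path.lstrip("/")
        container, _, key = path.partition("/")
        return container, key

    def ls(self, path):
        raise NotImplementedError("would call the blob list-blobs API")

    def open(self, path, mode="rb"):
        raise NotImplementedError("would return a file-like blob reader")
```

The dask wrapper layer would then register this filesystem for a protocol like "abfs://", the same way the adl wrapper does for Data Lake.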
@colobas , it's simply a case of no-one having spent the time to look into it. I don't know how usable the existing MS library code might be for deriving from.
@colobas The library you want in that case is azure-storage, and it should be pretty trivial to map. It is essentially the same thing as S3, though with some bonus features more-or-less layered on top of blob storage.
Both would be good, as DataLake is the better service for "dump all my files and process them later" (though it comes with its own analytics service, but I'd rather use Dask, so I guess other people would too :) )
Thanks @zooba for the mention :)
There are two ways to automatically authenticate a ServicePrincipal for the SDK, without any configuration:
- Use the currently loaded Azure CLI 2.0 profile
- Use an authentication file, pointed to by the AZURE_AUTH_LOCATION environment variable
This will require this dask module to depend on azure-common, and a little code in this dask file (i.e. try the CLI profile; if not, try the env variable; if not, die, etc.). And probably a few changes for me as well, since these are tied to the SDK clients; I'm sure there's a small gap to fill to make them a little more generic. But it makes sense to enable this scenario :)
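The fallback chain described here (try the CLI profile, then the auth-file env variable, then give up) can be sketched generically. In practice the loaders would be get_azure_cli_credentials and an AZURE_AUTH_LOCATION file reader from azure-common; the generic shape is an assumption:

```python
def resolve_credentials(loaders):
    """Try each (name, loader) pair in order; return the first that succeeds.

    `loaders` is a list of (label, zero-argument callable). Each callable
    either returns a credentials object or raises. In practice the entries
    would be the Azure CLI profile lookup and an AZURE_AUTH_LOCATION file
    reader from azure-common (assumption).
    """
    errors = []
    for name, loader in loaders:
        try:
            return loader()
        except Exception as exc:  # each loader may fail independently
            errors.append("%s: %s" % (name, exc))
    raise RuntimeError("No Azure credentials found; tried:\n" + "\n".join(errors))
```

Collecting the per-loader errors makes the final failure message explain which auth paths were attempted, which helps when debugging a misconfigured environment.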
@colobas That's right. Our Azure library is broken up more than boto, so sometimes it's less obvious that you don't need to depend upon the whole thing. (In particular, azure-mgmt-*
will pull in a lot of dependencies, and if you're just trying to use one of the client libraries then you'll want to bypass that by depending on the more specific piece.)
Our Azure SDK expert for Python is @lmazuel, so feel free to call him in whenever you have questions :)
@zooba thanks a lot for the quick answers and tips! Take care and have a good holiday
@lmazuel , assuming you do
from azure.common.client_factory import get_client_from_cli_profile
from azure.mgmt.compute import ComputeManagementClient
client = get_client_from_cli_profile(ComputeManagementClient)
how do you get the appropriate credentials out of the client object?
@martindurant It's why I was saying "And probably a few changes for me as well, since these are tied to the SDK clients" :)
But you can do
from azure.common.credentials import get_azure_cli_credentials
credentials, subscription_id = get_azure_cli_credentials()
From the credentials attributes, you should be able to get client_id (not tested).
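Since the attribute layout isn't documented (hence the "(not tested)"), a defensive probe is safer than a direct attribute access; the attribute names tried below are guesses, not a documented contract of msrestazure credential objects:

```python
def extract_client_id(credentials):
    """Probe common attribute names for a client id on a credentials object.

    The attribute names below are assumptions about what an msrestazure
    credentials object might expose, not a documented API.
    """
    for attr in ("client_id", "id"):
        value = getattr(credentials, attr, None)
        if value:
            return value
    raise AttributeError("credentials object exposes no client id")
```

Failing loudly here is deliberate: silently passing None as a client_id would only surface much later as an opaque auth error.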