Comments (15)
Sorry to resurrect, but just wanted to know if this example was reproducible or if I'm the only one having the problem?
from adlfs.
This exception seems to follow checks for fs.isdir(path) and fs.isfile(path). Do you have a chance to check whether the behaviour of those functions has changed for your path with the newer version of adlfs?
Given that the attempt to read immediately follows the write, can you try eliminating any caching effects by doing (I think):
import adlfs
adlfs.AzureDatalakeFileSystem.clear_instance_cache()
from adlfs.
Sure!
Clearing cache had no effect - I assume you mean AzureBlobFileSystem, though neither has an effect.
I've had some issues with ls - continuing the snippet from above:
>>> fs = adlfs.AzureBlobFileSystem(**STORAGE_OPTIONS)
>>> fs.ls("test")
Traceback (most recent call last):
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/adlfs/core.py", line 518, in ls
    elif len(blobs) == 1 and blobs[0]["blob_type"] == "BlockBlob":
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/azure/storage/blob/_shared/models.py", line 191, in __getitem__
    return self.__dict__[key]
KeyError: 'blob_type'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/adlfs/core.py", line 534, in ls
    raise FileNotFoundError(f"File {path} does not exist!!")
FileNotFoundError: File does not exist!!
But there is data in the container:
>>> fs.ls("test/test_group")
['test/test_group/partition_key=1/', 'test/test_group/partition_key=2/', 'test/test_group/_common_metadata', 'test/test_group/_metadata', 'test/test_group/part.0.parquet']
Previous behaviour (just verified by rolling back)
>>> fs.ls("test")
['test/test_group/']
Interestingly, both fs.isdir and fs.isfile return False:
>>> fs.isfile("test/test_group")
False
>>> fs.isdir("test/test_group")
False
Previous behaviour
>>> fs.isdir("test/test_group")
True
>>> fs.isfile("test/test_group")
False
Which probably is not good 😄
from adlfs.
Do you observe this behavior with adlfs<0.3.0? The update to 0.3 migrates to Azure storage v12 from v2.0. It would be helpful to know if it's related to this change.
from adlfs.
@hayesgb You're fast 😄 Just updated my reply with equivalent behaviour from v0.2.4
from adlfs.
LOL. I just sat down at my computer and saw this. I just added a branch "isfile_tests" that checks to verify if files and directories in the top level directory are identified properly. These pass, and can be found under test_core.py:
def test_isdir(storage):
    fs = adlfs.AzureBlobFileSystem(
        account_name=storage.account_name, connection_string=CONN_STR
    )
    assert fs.isdir("data") is True
    assert fs.isdir("data/root") is True
    assert fs.isdir("data/top_file.txt") is False

def test_isfile(storage):
    fs = adlfs.AzureBlobFileSystem(
        account_name=storage.account_name, connection_string=CONN_STR
    )
    assert fs.isfile("data") is False
    assert fs.isfile("data/root") is False
    assert fs.isfile("data/top_file.txt") is True
Two questions. 1) Any chance you can add a failing example, and 2) Can you share the versions of pyarrow, fastparquet, and fsspec you are using now vs what was working previously?
from adlfs.
Sure - a "pure" AzureBlobFileSystem (essentially copying your test) works fine - it seems the issue is with the way dask.dataframe.to_parquet writes the data when using the abfs backend.
So a complete failing example would be (again, using Azurite):
Versions:
adlfs=0.3.0
pandas=1.0.3
azure-storage-blob=12.3.1
dask=2.16.0
fsspec=0.7.4
pyarrow=0.17.1
fastparquet=0.4.0
import dask.dataframe as dd
import pandas as pd
import adlfs
from azure.storage.blob import BlobServiceClient
conn_str = "DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey" \
           "=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr" \
           "/KBHBeksoGMGw==;BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
STORAGE_OPTIONS = {"account_name": "devstoreaccount1",
                   "connection_string": conn_str}
client: BlobServiceClient = BlobServiceClient.from_connection_string(conn_str)
container_client = client.create_container("test")
df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4],
        "col2": [2, 4, 6, 8],
        "index_key": [1, 1, 2, 2],
        "partition_key": [1, 1, 2, 2],
    }
)
dask_dataframe = dd.from_pandas(df, npartitions=1)
dask_dataframe.to_parquet(
    "abfs://test/test_group",
    storage_options=STORAGE_OPTIONS,
    engine="pyarrow",
)
fs = adlfs.AzureBlobFileSystem(**STORAGE_OPTIONS)
fs.ls("test")
Traceback (most recent call last):
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/adlfs/core.py", line 518, in ls
    elif len(blobs) == 1 and blobs[0]["blob_type"] == "BlockBlob":
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/azure/storage/blob/_shared/models.py", line 191, in __getitem__
    return self.__dict__[key]
KeyError: 'blob_type'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/anders/.pyenv/versions/feature_env/lib/python3.8/site-packages/adlfs/core.py", line 534, in ls
    raise FileNotFoundError(f"File {path} does not exist!!")
FileNotFoundError: File does not exist!!
Should this be a Dask issue instead? Seems like to_parquet is writing files in an unexpected manner, which causes errors when trying to read them again.
from adlfs.
Can you tell whether files have indeed been created in the blob container?
from adlfs.
Yes, I can do the following:
>>> fs.ls("test/test_group")
['test/test_group/_common_metadata', 'test/test_group/_metadata', 'test/test_group/part.0.parquet']
I have also "manually" confirmed by checking with the BlobServiceClient directly.
from adlfs.
If ls("test/test_group") works, but ls("test") does not, because it apparently tries to list the root/empty path, then the problem does lie here. Why dask should need to list the parent directory is another matter.
from adlfs.
I think the issue arises because dask is trying to do fs.isdir("test/test_group"), which returns False, which is new behaviour.
Digging into the adlfs/fsspec code, this is probably because fs.info("test/test_group") also raises the same error as fs.ls("test"), which makes sense, since fs.info delegates to AbstractFileSystem, which calls ls: https://github.com/intake/filesystem_spec/blob/master/fsspec/spec.py#L538
What I don't understand is why ls is getting screwed up by dask writing a parquet file to it...
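For anyone following along, the chain described above (isdir relies on info, which relies on listing the parent) can be illustrated with a toy model. This is only a sketch under assumptions, not the actual fsspec or adlfs code: ToyFileSystem, listings and broken are made-up names, and the real implementations do more than this.

```python
class ToyFileSystem:
    """Simplified stand-in for an fsspec-style filesystem (hypothetical, not adlfs)."""

    def __init__(self, listings, broken=()):
        # listings: directory path -> list of entry dicts with "name" and "type"
        self.listings = listings
        # broken: directory paths whose listing raises, mimicking the ls("test") failure
        self.broken = set(broken)

    def ls(self, path):
        path = path.rstrip("/")
        if path in self.broken or path not in self.listings:
            raise FileNotFoundError(f"File {path} does not exist!!")
        return self.listings[path]

    def info(self, path):
        # Look the entry up in the listing of its parent directory.
        path = path.rstrip("/")
        parent = path.rsplit("/", 1)[0] if "/" in path else ""
        for entry in self.ls(parent):
            if entry["name"].rstrip("/") == path:
                return entry
        raise FileNotFoundError(path)

    def isdir(self, path):
        # Any listing error is swallowed and reported as "not a directory".
        try:
            return self.info(path)["type"] == "directory"
        except OSError:
            return False


listings = {
    "test": [{"name": "test/test_group/", "type": "directory"}],
    "test/test_group": [{"name": "test/test_group/part.0.parquet", "type": "file"}],
}

healthy = ToyFileSystem(listings)
broken = ToyFileSystem(listings, broken=["test"])

print(healthy.isdir("test/test_group"))   # parent listing works
print(broken.isdir("test/test_group"))    # parent listing raises, so isdir is False
print(len(broken.ls("test/test_group")))  # the directory itself still lists fine
```

In this model a failing ls on the parent is enough to make isdir return False for a child that lists correctly on its own, which matches the symptoms reported in this thread.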
from adlfs.
I'm looking into it.
from adlfs.
I believe I have a fix for this. In some instances, Azure Blob Filesystem will return an ItemPaged iterator instead of a BlobPrefix. The scenario you were seeing appears to be one of those instances, so it wasn't being picked up. I'm going to do a push to master. Any chance you can take a look and see if it fixes your issue?
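To illustrate the kind of failure in the traceback (the blobs[0]["blob_type"] lookup raising KeyError), here is a minimal sketch of a defensive check. It is not the change that was pushed to master: classify_entry is a hypothetical helper, plain dictionaries stand in for the SDK's listing objects, and treating an entry without blob_type as a directory prefix is an assumption.

```python
def classify_entry(entry):
    """Return "file", "directory", or "unknown" for one listing entry."""
    # .get avoids the KeyError raised by entry["blob_type"] when the key is missing.
    blob_type = entry.get("blob_type")
    if blob_type == "BlockBlob":
        return "file"
    if blob_type is None and "name" in entry:
        # Assumption: an entry that only carries a name is a virtual directory prefix.
        return "directory"
    return "unknown"


entries = [
    {"name": "test/test_group/part.0.parquet", "blob_type": "BlockBlob"},
    {"name": "test/test_group/"},  # no blob_type, like the entry that broke ls
]
kinds = [classify_entry(e) for e in entries]
print(kinds)
```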
from adlfs.
@hayesgb That's great news! My test case is working fine now, looks like you found the issue 👍
from adlfs.
Sounds great. I've released this in 0.3.1.
from adlfs.