Comments (11)

mgsnuno avatar mgsnuno commented on May 27, 2024

@hayesgb any pointers on how to debug this? Thanks

hayesgb avatar hayesgb commented on May 27, 2024

Does repartitioning the dataframe before writing it resolve the problem?

Are you still seeing the issue with either 0.4.x or 0.5.0? If so, can you provide me with a small example that replicates the problem?

Something like this:

import numpy as np
import pandas as pd
import dask.dataframe as dd

storage_options = {...}  # credentials omitted
A = np.random.randint(0, 100, size=(10000, 4))
df2 = pd.DataFrame(data=A, columns=list("ABCD"))
ddf2 = dd.from_pandas(df2, npartitions=4)
ddf2.to_csv("abfs://container/path_to_my.csv/*.csv", storage_options=storage_options)

What is the exact expected result of the to_csv operation?

mgsnuno avatar mgsnuno commented on May 27, 2024

@hayesgb repartitioning the dataframe beforehand, using partition_size to make sure the partitions are evenly sized and none are empty (see the sketch below), didn't fix the issue with 0.4.x.
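
A minimal sketch of that repartitioning step, assuming a hypothetical source path and a 100MB target size (dask's repartition does accept a partition_size argument):

import dask.dataframe as dd

ddf = dd.read_parquet("s3://bucket/table")  # hypothetical source path
ddf = ddf.repartition(partition_size="100MB")  # target evenly sized, non-empty partitions
ddf.to_csv("abfs://container/out/*.csv", storage_options=storage_options)  # storage_options as above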

I've tried 0.5.0 briefly, need more time to see if it's working.

P.S. - What I notice with 0.5.0 (you're using asyncio, correct?) is that some delete operations that were fast before are now sometimes really slow and sometimes really fast. I use Python Interactive in VSCode a lot and sometimes it just gets stuck. As reported by other people, VSCode's Python Interactive doesn't seem too friendly with asyncio and multiple processes.

hayesgb avatar hayesgb commented on May 27, 2024

v0.5.x uses asyncio and caching for filesystem operations (not for the AzureBlobFile class) to speed up operations.
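
If listings ever look stale because of that cache, fsspec filesystems expose invalidate_cache(); a short sketch (container and path names are hypothetical, credentials as in the snippets below):

fs = AzureBlobFileSystem(account_name=account_name, connection_string=CONN_STR)
fs.invalidate_cache("container/path")  # drop cached listings under this prefix
fs.invalidate_cache()  # or clear the whole listing cache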

mgsnuno avatar mgsnuno commented on May 27, 2024

@hayesgb a quick way to trigger this issue:

file = open('sample.txt', 'wb')  # create an empty file
abfs.put("./sample.txt", "<blob_name>/sample.txt")
Traceback (most recent call last):

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/adlfs/spec.py in put(self, lpath, rpath, recursive, **kwargs)
   1121
   1122         for lpath, rpath in zip(lpaths, rpaths):
--> 1123             self.put_file(lpath, rpath, **kwargs)
   1124
   1125     def upload(self, lpath, rpath, recursive=False, **kwargs):

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/spec.py in put_file(self, lpath, rpath, **kwargs)
    677         while data:
    678             data = f1.read(self.blocksize)
--> 679             f2.write(data)
    680
    681     def put(self, lpath, rpath, recursive=False, **kwargs):

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/adlfs/spec.py in __exit__(self, *args)
   1608
   1609     def __exit__(self, *args):
--> 1610         self.close()
   1611
   1612     def __del__(self):

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/adlfs/spec.py in close(self)
   1585         else:
   1586             if not self.forced:
--> 1587                 self.flush(force=True)
   1588         if self.fs is not None:
   1589             self.fs.invalidate_cache(self.path)

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/adlfs/spec.py in flush(self, force)
   1460             self._initiate_upload()
   1461
--> 1462         if self._upload_chunk(final=force) is not False:
   1463             self.offset += self.buffer.seek(0, 2)
   1464             self.buffer = io.BytesIO()

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/adlfs/spec.py in _upload_chunk(self, final, **kwargs)
   1422         block_id = len(self._block_list)
   1423         block_id = f"{block_id:07d}"
--> 1424         self.blob_client.stage_block(block_id=block_id, data=data, length=length)
   1425         self._block_list.append(block_id)
   1426

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/azure/core/tracing/decorator.py in wrapper_use_tracer(*args, **kwargs)
     81         span_impl_type = settings.tracing_implementation()
     82         if span_impl_type is None:
-->  83             return func(*args, **kwargs)
     84
     85         # Merge span is parameter is set, but only if no explicit parent are passed

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/azure/storage/blob/_blob_client.py in stage_block(self, block_id, data, length, **kwargs)
   2015             return self._client.block_blob.stage_block(**options)
   2016         except StorageErrorException as error:
--> 2017             process_storage_error(error)
   2018
   2019     def _stage_block_from_url_options(

~/miniconda3/envs/pipelines/lib/python3.8/site-packages/azure/storage/blob/_shared/response_handlers.py in process_storage_error(storage_error)
    145     error.error_code = error_code
    146     error.additional_info = additional_data
--> 147     raise error
    148
    149

HttpResponseError: The value for one of the HTTP headers is not in the correct format.
RequestId:8d688c63-601e-000a-3e91-8db477000000
Time:2020-09-18T07:56:45.3665244Z
ErrorCode:InvalidHeaderValue
Error:None
HeaderName:Content-Length
HeaderValue:0
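
The failing frame is stage_block with length 0. A minimal sketch that should reproduce the same InvalidHeaderValue directly against azure.storage.blob 12.5.0 (connection string, container and blob names here are hypothetical):

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(CONN_STR, container_name="test", blob_name="sample.txt")
blob.stage_block(block_id="0000000", data=b"", length=0)  # Content-Length: 0 -> HttpResponseError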

With or without repartitioning the issue still occurs, and even more frequently in v0.5.x. Something like the following fails with v0.5.x:

s3.get(path_s3, path_local, recursive=True)
abfs.put(path_local, path_abfs, recursive=True)

I believe it is tied to recursive. I use this code to copy folders of parquet tables from S3 to Azure. This code was working on v0.3.3 and now raises those InvalidHeaderValue errors.

Versions:
adlfs - 0.5.3
fsspec - 0.8.2
azure.storage.blob - 12.5.0

hayesgb avatar hayesgb commented on May 27, 2024

I tried to replicate your issue (unsuccessfully) as follows:

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(account_name=storage.account_name, connection_string=CONN_STR)

# Create a directory for the sample file
fs.mkdir("test")
with open('sample.txt', 'wb') as f:
    f.write(b"test of put method")
fs.put("./sample.txt", "test/sample.txt")
fs.get("test/sample.txt", "sample2.txt")
with open("./sample.txt", 'rb') as f:
    f1 = f.read()
with open("./sample2.txt", 'rb') as f:  # compare against the downloaded copy
    f2 = f.read()
assert f1 == f2

This passes with:
adlfs -- master branch
fsspec -- 0.8.2
azure.storage.blob -- 12.5.0
azure-common -- 1.1.24
azure-core -- 1.8.0
azure-identity -- 1.3.1
azure-storage-common -- 2.1.0
aiohttp -- 3.6.2

azure storage account:
Access Tier: standard/hot
Replication: RA-GRS

mgsnuno avatar mgsnuno commented on May 27, 2024

@hayesgb as I mentioned, sample.txt has to be empty: file = open('sample.txt', 'wb')  # empty file.

As the error points out, it's an empty (zero-content) file.

Here is your test, modified to cause the error (I just got the error using the same versions as you, apart from adlfs==0.5.3):

fs = AzureBlobFileSystem(account_name=storage.account_name, connection_string=CONN_STR)

fs.mkdir("test")
_ = open('sample.txt', 'wb')  # create an empty file
fs.put("./sample.txt", "test/sample.txt")  # will raise HttpResponseError

I think the question is: why is adlfs sending empty files when writing a dataframe with to_csv that has no empty partitions (verified, as below)?
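
For completeness, a quick way to verify that no partition is empty before writing (a short sketch; ddf is the dask dataframe in question):

sizes = ddf.map_partitions(len).compute()  # rows per partition
assert (sizes > 0).all(), sizes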

mgsnuno avatar mgsnuno commented on May 27, 2024

I just caused the issue in another way, using the same versions as you. Download the latest Ubuntu desktop and server images:

wget https://releases.ubuntu.com/20.04.1/ubuntu-20.04.1-desktop-amd64.iso
wget https://releases.ubuntu.com/20.04.1/ubuntu-20.04.1-live-server-amd64.iso

And then:

fs = AzureBlobFileSystem(account_name=storage.account_name, connection_string=CONN_STR)
fs.mkdir("test")
file_name = "ubuntu-20.04.1-desktop-amd64.iso"  # 2.6Gb
fs.put(f"{file_name}", f"test/{file_name}")  # raises empty header error at some point
file_name = 'ubuntu-20.04.1-live-server-amd64.iso'  # 914Mb
fs.put(f"{file_name}", f"test/{file_name}")  # no error

Can it be that adlfs is doing some automatic chunking under the hood and we end up with empty chunks at some point?
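
That would be consistent with the put_file loop quoted in the traceback above: it reads each chunk inside the loop body and writes it unconditionally, so the final iteration writes zero bytes. A paraphrase of the mechanism (the data initialization is assumed from the elided context, not the exact fsspec source):

data = True  # assumed: so the loop body runs at least once
with open(lpath, "rb") as f1, fs.open(rpath, "wb") as f2:
    while data:
        data = f1.read(fs.blocksize)
        f2.write(data)  # last iteration writes b"", which can stage an empty block

If the file size lands on a block boundary, the final flush then stages a zero-length block, which the service rejects with the Content-Length: 0 error above.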

hayesgb avatar hayesgb commented on May 27, 2024

#108 adds a fix for the put operation on an empty file.
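
A sketch of the kind of guard such a fix implies in _upload_chunk (names taken from the traceback above; an illustration, not the actual patch):

def _upload_chunk(self, final=False, **kwargs):
    data = self.buffer.getvalue()
    length = len(data)
    if length == 0:
        return  # skip staging zero-length blocks, which the service rejects
    block_id = f"{len(self._block_list):07d}"
    self.blob_client.stage_block(block_id=block_id, data=data, length=length)
    self._block_list.append(block_id)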

mgsnuno avatar mgsnuno commented on May 27, 2024

Thanks a lot.
Any idea why putting a single file fails when it is above a certain size, as in the case above of the Ubuntu desktop vs. live-server images? Is there chunking happening somewhere?
Thank you again.

hayesgb avatar hayesgb commented on May 27, 2024

#115 fixes both. I tested it with the Ubuntu image and with a 5 GB zipfile downloaded from AWS. The latter took nearly 90 minutes to write, but it was successful. I'll implement the same for get_file next.

from adlfs.
