stitchfix / splits
A Python library for dealing with splittable files
License: MIT License
The decode step will sometimes raise a UnicodeDecodeError, I think because it tries to decode num bytes from the file at a time, which isn't necessarily a valid UTF-8 encoded string even if the full contents of the file are a valid UTF-8 encoded string.
To reproduce:
This works fine:
>>> from ripley.readers import SFReader
>>> f = SFReader('prod', 'style')
2018-04-06 14:11:31,837 [INFO] wednesday.client:18 - __init__ ServiceClient for 'staunch' in environment 'prod' with base_uri 'http://staunch.vertigo.stitchfix.com'
2018-04-06 14:11:32,289 [INFO] ripley.metadata:41 - Finding metadata for prod.style?_no_partition_=y
2018-04-06 14:11:33,279 [INFO] ripley.metadata:54 - Only hive metadata found (no s3)
>>> s = f.read()
>>>
This fails:
>>> f = SFReader('prod', 'style')
2018-04-06 14:20:02,814 [INFO] wednesday.client:18 - __init__ ServiceClient for 'staunch' in environment 'prod' with base_uri 'http://staunch.vertigo.stitchfix.com'
2018-04-06 14:20:06,387 [INFO] ripley.metadata:41 - Finding metadata for prod.style?_no_partition_=y
2018-04-06 14:20:07,349 [INFO] ripley.metadata:54 - Only hive metadata found (no s3)
>>> s = ''
>>> while True:
...     s += f.read(1)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/ahunter/.virtualenvs/aa-py2/lib/python2.7/site-packages/splits/readers.py", line 62, in read
    new_data = new_data.decode('utf-8')
  File "/Users/ahunter/.virtualenvs/aa-py2/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 0: unexpected end of data
>>>
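The 0xe2 byte in the traceback is the lead byte of a three-byte UTF-8 sequence, so a one-byte read can stop partway through a character. One possible way around this, sketched below, is the standard library's incremental decoder, which buffers incomplete sequences between calls. The helper name read_text_in_chunks and the in-memory test data are illustrative assumptions, not part of splits, and this is not necessarily how the library should or does fix it.

```python
import codecs
import io


def read_text_in_chunks(raw, num):
    """Yield decoded text from a binary file-like object, reading `num` bytes at a time.

    Hypothetical helper for illustration; not part of the splits API.
    """
    # The incremental decoder keeps incomplete multi-byte sequences in an
    # internal buffer instead of raising on a chunk boundary.
    decoder = codecs.getincrementaldecoder('utf-8')()
    while True:
        chunk = raw.read(num)
        if not chunk:
            # final=True raises UnicodeDecodeError if the stream ended
            # in the middle of a character.
            decoder.decode(b'', final=True)
            break
        text = decoder.decode(chunk)
        if text:
            yield text


# Made-up sample data containing two-byte and three-byte characters.
data = u'caf\u00e9 \u2013 ok'.encode('utf-8')
pieces = list(read_text_in_chunks(io.BytesIO(data), 1))
result = u''.join(pieces)
```

Note that with this approach a request for num bytes may return fewer characters than bytes read (or none until a character completes), so callers that count on an exact length per call would need to loop.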
Python 2.7.9 enabled strict certificate verification by default (see PEP 476), which causes loads from buckets with dots in their names to fail, e.g.:
> boto.s3.get_bucket('my.bucket')
ssl.CertificateError: hostname 'my.bucket.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'
More at boto/boto/issues/2836.
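The mismatch comes from how certificate wildcards work: a `*` covers a single DNS label, so it cannot span the extra dot that a dotted bucket name adds to the hostname. The sketch below is a deliberately simplified illustration of that rule, not the ssl module's actual matching code; wildcard_matches and host_is_covered are made-up names.

```python
def wildcard_matches(pattern, hostname):
    """Simplified check in which '*' stands for exactly one DNS label."""
    pattern_labels = pattern.lower().split('.')
    host_labels = hostname.lower().split('.')
    # A wildcard never spans a dot, so the label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False
    for p, h in zip(pattern_labels, host_labels):
        if p != '*' and p != h:
            return False
    return True


# The two names listed in the error message above.
CERT_NAMES = ['*.s3.amazonaws.com', 's3.amazonaws.com']


def host_is_covered(hostname):
    """Return True if any certificate name matches the hostname."""
    return any(wildcard_matches(name, hostname) for name in CERT_NAMES)


plain = host_is_covered('mybucket.s3.amazonaws.com')      # one label before s3
dotted = host_is_covered('my.bucket.s3.amazonaws.com')    # two labels before s3
```

A commonly cited workaround is path-style addressing, which keeps the bucket name out of the hostname (in boto, passing `calling_format=OrdinaryCallingFormat()` from `boto.s3.connection` when connecting); whether that suits splits is left open here.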
The addition of .decode to the splits reader broke several downstream use cases (I believe all through ripley).
See:
https://github.com/stitchfix/ripley/issues/83
Loads to redis that use unicodecsv + SFReader
https://aa-jenkins.vertigo.stitchfix.com/job/orbiter-etl--0.2.22--load/296/console
https://github.com/stitchfix/orbiter/blob/master/orbiter/lib/redis_cache_loader.py
Adding the issue here as well in case there are other non-ripley issues.
Exceptions thrown by the authentication layer are occasionally swallowed by the read function of SplitReader:
def read(self, num=None):
    val = ''
    try:
        while True:
            if num > 0:
                new_data = self._get_current_file().read(num - len(val))
            else:
                new_data = self._get_current_file().read()
            if not new_data:
                self._current_file.close()
            else:
                val += new_data
            if num > 0 and len(val) == num:
                break
    except:
        pass
    return val
The bare except handler is apparently there to ignore a StopIteration exception.
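If StopIteration really is the only exception the handler is meant to absorb, one option is to catch just that and let everything else propagate. The following is a minimal sketch under that assumption; SketchReader and AuthError are hypothetical stand-ins, and the way _get_current_file advances through files is guessed for the sake of a runnable example rather than copied from splits.

```python
class AuthError(Exception):
    """Hypothetical stand-in for an error raised by the authentication layer."""


class SketchReader(object):
    """Hypothetical reader over a sequence of file-like objects."""

    def __init__(self, files):
        self._files = iter(files)
        self._current_file = None

    def _get_current_file(self):
        # Assumed behaviour: move to the next file once the current one is
        # closed; next() raises StopIteration when no files remain.
        if self._current_file is None or self._current_file.closed:
            self._current_file = next(self._files)
        return self._current_file

    def read(self, num=None):
        val = ''
        # Explicit None check: comparing None > 0 raises TypeError on Python 3.
        limited = num is not None and num > 0
        try:
            while True:
                if limited:
                    new_data = self._get_current_file().read(num - len(val))
                else:
                    new_data = self._get_current_file().read()
                if not new_data:
                    self._current_file.close()
                else:
                    val += new_data
                if limited and len(val) == num:
                    break
        except StopIteration:
            # Running out of files ends the read quietly; any other
            # exception (such as AuthError) now reaches the caller.
            pass
        return val
```

With this shape, a read across two files still returns their concatenated contents, while an error raised from inside a file's read method surfaces instead of producing a silently truncated string.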