hdfs3's Introduction

hdfs3

This project is not undergoing development

Pyarrow's JNI hdfs interface is mature and stable. It also has fewer problems with configuration and various security settings, and does not require the complex build process of libhdfs3. Therefore, all users who have trouble with hdfs3 are recommended to try pyarrow.

Old README

hdfs3 is a lightweight Python wrapper for libhdfs3, a native C/C++ library to interact with the Hadoop File System (HDFS).

View the documentation for hdfs3.

hdfs3's People

Contributors

adamchainz, aswinjoseroy, danielfrg, gglanzani, jcrist, koverholt, martindurant, mrocklin, nickolai-dr, pitrou, quasiben, sagagliardo, sk1p, wyishai

hdfs3's Issues

Seek fails when going off end of file

def test_seek(hdfs):
    # `hdfs` and `a` are test fixtures: an HDFileSystem and a temporary file path.
    with hdfs.open(a, 'w', repl=1) as f:
        f.write(b'123')

    with hdfs.open(a) as f:
        f.seek(1000)          # seek well past the end of the 3-byte file
        assert not f.read(1)  # expected: reading at/after EOF returns b''
        f.seek(0)
        assert f.read(1) == b'1'

Read fs.default.name settings from core-site.xml

I configured HDFS Client on Ubuntu 16.04 and I can successfully run this command:

hdfs --config /etc/hadoop/conf/ dfs -ls /

The --config parameter picks up core-site.xml from /etc/hadoop/conf/.

From core-site.xml

<property>
    <name>fs.default.name</name>
    <value>igfs://[email protected]:10500</value>
  </property>

Is there a way to configure the Python HDFS library using the above section from core-site.xml?

>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='10.200.10.1', port=10500)

When I run this code, I get the following error message:
ConnectionError: Connection Failed: HdfsRpcException: Failed to invoke RPC call "getFsStats" on server "10.200.10.1:10500"
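
For clarity, here is the kind of configuration I was hoping would work, sketched with the pars keyword from the HDFileSystem signature (the value is copied from the core-site.xml above; whether libhdfs3 accepts the igfs scheme at all is exactly my question):

from hdfs3 import HDFileSystem

# Sketch only: pass the core-site.xml setting straight through as a conf pair.
hdfs = HDFileSystem(host='10.200.10.1', port=10500,
                    pars={'fs.default.name': 'igfs://[email protected]:10500'})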

In HA mode, HDFileSystem() doesn't initialize when port is omitted or None.

Hi,

First, thank you for a great library!
It is a very helpful utility for using HDFS functionality from Python.

While using hdfs3 against an HA-mode Hadoop cluster, I fail to initialize it with the HA configuration using the following commands.

hdfs = HDFileSystem()
# or hdfs = HDFileSystem(host="mycluster", port=None)

It fails with the following error message.

/home/test/.conda/envs/py3/lib/python3.6/site-packages/hdfs3/core.py:127: UserWarning: Setting conf parameter port failed
  warnings.warn('Setting conf parameter %s failed' % par)

While digging into the cause in the source, I found the following code suspicious.

    def connect(self):
        if conf['port'] is not None:     # <---- Don't set port if it is None
            _lib.hdfsBuilderSetNameNodePort(o, conf.pop('port'))
        _lib.hdfsBuilderSetNameNode(o, ensure_bytes(conf.pop('host')))
        ...
        for par, val in conf.items():    # <---- But, this loop uses ALL items in conf including port.
            if not _lib.hdfsBuilderConfSetStr(
                    o, ensure_bytes(par), ensure_bytes(val)) == 0:
                warnings.warn('Setting conf parameter %s failed' % par)

When I add a check that simply skips the assignment when par == "port", HDFileSystem initializes properly.
It seems that properties such as port/user/etc. that are consumed before this loop shouldn't be set again inside it, as in the following code:

https://github.com/priancho/hdfs3/blob/bugfix_SkipAlreadyEvaluatedConnectionProperties/hdfs3/core.py#L125

        for par, val in conf.items():
            if par in ['port', 'user', 'ticket_cache', 'token']:
                continue

            if not _lib.hdfsBuilderConfSetStr(o, ensure_bytes(par),
                                              ensure_bytes(val)) == 0:
                warnings.warn('Setting conf parameter %s failed' % par)

I would like to hear any comments on this since I wonder if this is a valid fix for the problem.

Best wishes,
Han-Cheol

Cannot load a pickle file containing an array

Tested with Python 3.5

Here is a small snippet reproducing the problem:

import pickle
import array
from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='localhost', port=8020)

a = array.array('d', [1, 2, 3, 4])

# Dump works and the pickle is valid:
# (when retrieved locally using hadoop CLI, pickle can load it)
with hdfs.open("/user/aabadie/test.pkl", "wb") as f:
    pickle.dump(a, f)

# But loading via hdfs file object fails:
with hdfs.open("/user/aabadie/test.pkl", "rb") as f:
    print(pickle.load(f))

Here is the error:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input-17-d986117cc344> in <module>()
      1 with hdfs.open("/user/aabadie/test.pkl", "rb") as f:
----> 2     print(pickle.load(f))
      3 
      4 
      5 

EOFError: Ran out of input

I get the same result if I try to pickle a numpy array instead of a Python array.
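
A possible workaround sketch, assuming the bytes on HDFS are intact (connection details as above): read the whole file into memory first and unpickle with pickle.loads, which also helps isolate whether HDFile's read semantics are the culprit.

import pickle
from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='localhost', port=8020)

# Read the full byte string, then unpickle from memory instead of from the file object.
with hdfs.open("/user/aabadie/test.pkl", "rb") as f:
    data = f.read()
print(pickle.loads(data))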

Include endlines in readline

hdfs3 differs from typical Python file handling in that readline does not include the trailing newline. This prevents convenient use with wrappers like io.TextIOWrapper.

In [1]: with open('foo', 'wb') as f:
    f.write('Alice 100\nBob 200\nCharlie 300'.encode())
   ...:     

In [2]: f = open('foo', 'rb')

In [3]: f.readline()
Out[3]: 'Alice 100\n'

In [4]: f.readline()
Out[4]: 'Bob 200\n'

In [5]: f.readline()
Out[5]: 'Charlie 300'

In [6]: import hdfs3

In [8]: hdfs = hdfs3.HDFileSystem()

In [9]: with hdfs.open('/tmp/test/text.1.txt', 'wb') as f:
    f.write('Alice 100\nBob 200\nCharlie 300'.encode())
   ...:     

In [10]: f = hdfs.open('/tmp/test/text.1.txt', 'rb')

In [11]: f.readline()
Out[11]: 'Alice 100'

In [12]: f.readline()
Out[12]: 'Bob 200'

In [13]: f.readline()
Out[13]: 'Charlie 300'

Large writes

Writes of ~GB size fail every time, presumably because of a server timeout. Calls to write() should be on the order of the block size.
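
A minimal sketch of the suggested pattern, assuming the data is already in memory (connection details and the helper name are hypothetical):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='localhost', port=8020)  # placeholder connection details

def write_in_chunks(hdfs, path, data, chunk_size=64 * 2**20):
    # Hypothetical workaround: issue many block-sized write() calls instead of one
    # huge call, so no single call has to push gigabytes through in one go.
    with hdfs.open(path, 'wb') as f:
        for start in range(0, len(data), chunk_size):
            f.write(data[start:start + chunk_size])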

Problems reading from Pandas: invalid path or buffer object type

I'm trying to read a CSV into a pandas DataFrame using hdfs3 and I'm getting the following error:

In [5]: with client.open('...', 'r') as fh:
   ...:     x = pd.read_csv(filepath_or_buffer=fh)
   ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-1aae9741b43b> in <module>()
      1 with client.open('...', 'rb') as fh:
----> 2     x = pd.read_csv(filepath_or_buffer=fh)
      3

/disk1/home/polisds/.anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654
--> 655         return _read(filepath_or_buffer, kwds)
    656
    657     parser_f.__name__ = name

/disk1/home/polisds/.anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    390     compression = _infer_compression(filepath_or_buffer, compression)
    391     filepath_or_buffer, _, compression = get_filepath_or_buffer(
--> 392         filepath_or_buffer, encoding, compression)
    393     kwds['compression'] = compression
    394

/disk1/home/polisds/.anaconda3/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression)
    208     if not is_file_like(filepath_or_buffer):
    209         msg = "Invalid file path or buffer object type: {_type}"
--> 210         raise ValueError(msg.format(_type=type(filepath_or_buffer)))
    211
    212     return filepath_or_buffer, None, compression

ValueError: Invalid file path or buffer object type: <class 'hdfs3.core.HDFile'>

If I simply iterate over the file's lines and print them, everything works.

hdfs3 v0.1.4
pandas v0.20.1
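
A workaround sketch, assuming the file fits in memory: read the raw bytes and hand pandas a standard buffer, which passes the is_file_like check (the path here is a placeholder).

import io
import pandas as pd
from hdfs3 import HDFileSystem

client = HDFileSystem(host='localhost', port=8020)  # placeholder connection details

# io.BytesIO is accepted by pandas as a file-like buffer.
with client.open('/path/to/data.csv', 'rb') as fh:
    x = pd.read_csv(io.BytesIO(fh.read()))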

Usage with multiprocessing.Pool

I want to read from HDFS in parallel using the following dummy code; it actually reads the files but does nothing with them, purely for demonstration purposes:

import sys
import time
import random
import logging
from multiprocessing import Pool

from hdfs3 import HDFileSystem
from hdfs3.compatibility import ConnectionError

_logger = logging.getLogger(__name__)


def get_hdfs_client(host, port):
    """Creates a HDFS client

    Args:
        host (str): host name of the HDFS servers
        port (int): port of the HDFS servers

    Returns:
        hdfs client object from the hdfs3 library
    """
    try:
        pars = {"hadoop.security.authentication": "kerberos"}
        hdfs_client = HDFileSystem(host=host, port=port, pars=pars)
    except ConnectionError:
        _logger.exception("HDFSConnectionError")
        _logger.error("Have you authenticated with `kinit` yet?")
        sys.exit(1)
    return hdfs_client


class MultiReader(object):
    def __init__(self, host, port, process_count):
        self.pool = Pool(processes=process_count, initializer=self._setup,
                         initargs=(host, port, process_count))

    def read_profiles(self, file_paths):
        self.pool.map(_multi_read, file_paths)

    def close_pool(self):
        self.pool.close()
        self.pool.join()

    @classmethod
    def _setup(cls, host, port, process_count):
        cls.client = get_hdfs_client(host, port)

    @classmethod
    def _read_concurrent(cls, file_path):
        with cls.client.open(file_path) as fh:
            for row in fh:
                pass


def _multi_read(file_path):
    MultiReader._read_concurrent(file_path)


if __name__ == '__main__':
    host = 'my_host'
    port = 8020
    file_path = '/user/my_user/tmp'
    
    hdfs = get_hdfs_client(host, port)
    file_paths = [f for f in hdfs.walk(file_path) if hdfs.info(f)['kind'] == 'file']
    hdfs.disconnect()
    del hdfs
    time.sleep(5)
    
    # Using this line and commenting the block above causes no error messages
    # file_paths = ['{}/{:0>6}_0'.format(file_path, i) for i in range(250)]

    multi_read = MultiReader(host, port, 10)
    multi_read.read_profiles(file_paths)

So whenever the main process has created an hdfs client object at some point, I see a lot of the following errors, although it still runs (is it okay to ignore those errors?):

2016-11-16 12:52:50.671476, p9902, th139871594358528, ERROR Failed to invoke RPC call "getFsStats" on server "XXXXXXXXXXXXXX:8020": 
RpcChannel.cpp: 393: HdfsRpcException: Failed to invoke RPC call "getFsStats" on server "XXXXXXXXXXXX:8020"
        @       Hdfs::Internal::RpcChannelImpl::invokeInternal(boost::shared_ptr<Hdfs::Internal::RpcRemoteCall>)
        @       Hdfs::Internal::RpcChannelImpl::invoke(Hdfs::Internal::RpcCall const&)
        @       Hdfs::Internal::NamenodeImpl::invoke(Hdfs::Internal::RpcCall const&)
        @       Hdfs::Internal::NamenodeImpl::getFsStats()
        @       Hdfs::Internal::NamenodeProxy::getFsStats()
        @       Hdfs::Internal::FileSystemImpl::getFsStats()
        @       Hdfs::Internal::FileSystemImpl::connect()
        @       Hdfs::FileSystem::connect(char const*, char const*, char const*)
        @       hdfsBuilderConnect
        @       ffi_call_unix64
        @       ffi_call
        @       _ctypes_callproc
        @       PyCFuncPtr_call
        @       PyObject_Call
        @       PyEval_EvalFrameEx
        @       PyEval_EvalFrameEx
        @       _PyEval_EvalCodeWithName
        @       PyEval_EvalCodeEx
        @       function_call
        @       PyObject_Call
        @       method_call
        @       PyObject_Call
        @       slot_tp_init
        @       type_call
        @       PyObject_Call
        @       PyEval_EvalFrameEx
        @       PyEval_EvalFrameEx
        @       _PyEval_EvalCodeWithName
        @       PyEval_EvalCodeEx
        @       function_call
        @       PyObject_Call
....

I know that I should not hold an hdfs client object when forking, which is why I disconnect and even delete the object first. Is this a bug or intended behavior? Is there no way to use hdfs client objects in the master process as well as in the child processes? Is the only workaround to fork a dedicated process just to execute the walk command?

Remove command line interface from docs

We don't do this well and aren't putting resources behind this. I would like to remove this from the docs so as to lower expectations and reduce scope. Any objections?

Publish wheel to PyPI that includes binary dependencies

Following up on #64, I'm wondering if there is a way to update our packaging so that users can simply pip install hdfs3 without needing to independently install libhdfs3 using their system package manager. Currently, the pip installation docs include a separate apt-get step, which can definitely be a pain depending on your OS and distribution.

There is some information on the Python Packaging Guide about publishing binary extensions, which I think is what we need here, but the docs are incomplete. I also know of a project that bundles a C extension in its package on PyPI, but I don't know if that example is relevant to what we would need to do here.

Kerberos Support

Hi Team,

Does hdfs3 support Kerberos? I tried to follow the signature HDFileSystem(host=None, port=None, user=None, ticket_cache=None, token=None, pars=None, connect=True) to connect to a kerberized HDFS namenode, but it's not working.

Can you please give me an example or a reference showing how to use hdfs3 to connect to a kerberized cluster?

Appreciate your support!

Thanks!
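
For reference, this is the kind of call I expected to work, sketched from the constructor signature above (host, port, and the ticket cache path are placeholders):

from hdfs3 import HDFileSystem

# Placeholders: replace with your namenode, port, and Kerberos ticket cache.
hdfs = HDFileSystem(host='namenode.example.com', port=8020,
                    pars={'hadoop.security.authentication': 'kerberos'},
                    ticket_cache='/tmp/krb5cc_1000')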

Please do a release to pypi

I would like to do some work on integrating it with Airflow, but a pypi package is required to make sure people can pull in the dependencies.

hdfs3-0.2.0 undefined symbol: hdfsConcat

Since updating to hdfs3 0.2.0, connections to HDFS are failing:

$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)

File "/usr/local/lib/python2.7/dist-packages/hdfs3/core.py", line 75, in __init__
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/hdfs3/core.py", line 94, in connect
    get_lib()
  File "/usr/local/lib/python2.7/dist-packages/hdfs3/core.py", line 604, in get_lib
    from .lib import _lib as l
  File "/usr/local/lib/python2.7/dist-packages/hdfs3/lib.py", line 408, in <module>
    hdfsConcat = _lib.hdfsConcat
  File "/usr/lib/python2.7/ctypes/__init__.py", line 378, in __getattr__
    func = self.__getitem__(name)
  File "/usr/lib/python2.7/ctypes/__init__.py", line 383, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib/libhdfs3.so: undefined symbol: hdfsConcat

System is Ubuntu 14.04 with libhdfs3 version 2.2.31-1 from Pivotal (https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3/releases)

HDFS High Availability

Does the module support HDFS High Availability? I have a system that uses primary and secondary namenodes, and when there is a failover it changes which one is the primary namenode. Looking at the code, it appears to only support one connection or IP address.

More useful conf_to_dict, or avoid meaningless conversions

Right now the function hdfs3.core.conf_to_dict performs a lot of conversions from string to int, float, and bool. Later, when the returned dictionary is used as the pars keyword parameter to the HDFileSystem constructor, ensure_bytes is used all over to convert int, float, bool and whatever else back to string or bytes, which does not work on Python 3 in some cases (Python 3 has, for instance, to_bytes() instead of tobytes(), and many other things changed).
The actual question is why this conversion is done in the first place. Couldn't you just remove the conversion in conf_to_dict and thereby avoid the back-conversions in HDFileSystem.connect?
Okay, that would mean that if you build the dictionary passed to pars yourself you have to write 'true' instead of True, but it saves a lot of headaches later when figuring out what the right byte conversion of True is (see the sketch below).
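
A sketch of what the proposal would look like from the user's side, keeping all values as plain strings so that no round-trip conversion is needed (the keys and values here are only illustrative):

from hdfs3 import HDFileSystem

# Keep values as strings, exactly as they would appear in *-site.xml.
pars = {
    'dfs.client.use.datanode.hostname': 'true',  # 'true' rather than Python True
    'dfs.replication': '2',                      # '2' rather than int 2
}
hdfs = HDFileSystem(host='namenode', port=8020, pars=pars)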

`boost` version incompatibility

After installing hdfs3 in a clean conda environment with these commands

conda create --name hdfs3test python=3
source activate hdfs3test
conda install -c dask hdfs3

I get the following error after trying import hdfs3 in the Python interpreter:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/work/analytics2/analytics/python/envs/hdfs3test/lib/python3.5/site-packages/hdfs3/__init__.py", line 1, in <module>
    from .core import HDFileSystem, HDFile
  File "/work/analytics2/analytics/python/envs/hdfs3test/lib/python3.5/site-packages/hdfs3/core.py", line 12, in <module>
    from .lib import _lib
  File "/work/analytics2/analytics/python/envs/hdfs3test/lib/python3.5/site-packages/hdfs3/lib.py", line 11, in <module>
    _lib = ct.cdll.LoadLibrary('libhdfs3.so')
  File "/work/analytics2/analytics/python/envs/hdfs3test/lib/python3.5/ctypes/__init__.py", line 425, in LoadLibrary
    return self._dlltype(name)
  File "/work/analytics2/analytics/python/envs/hdfs3test/lib/python3.5/ctypes/__init__.py", line 347, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libboost_thread.so.1.57.0: cannot open shared object file: No such file or directory

Downgrading the installed version of boost with conda install boost=1.57 seems to fix the issue.

token user - provide default?

hdfs3 can generate delegation tokens. For actual Kerberos use, this requires providing a user name, the only one that can thereafter renew the token. Is it reasonable to default to os.environ['USER'] if nothing is supplied?
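
A minimal sketch of the proposed default, assuming a helper of this shape (the function name is hypothetical):

import getpass
import os

def default_token_user(user=None):
    # Fall back to the current login user when no explicit user is given
    # for delegation-token renewal.
    return user if user is not None else os.environ.get('USER', getpass.getuser())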

readline may confuse use of seek/tell

Readline, or iterating over an open file, reads a buffer of some chunk size, which is then split into lines. That means that after a single readline, tell() (the true position in the file on disk) is at the end of the chunk, even though we have only seen the first line so far. Similarly, seek() will move the real file pointer, but readline will return the next buffered line anyway.
For now, we have a warning in the readline docstring; we could fix up seek and tell instead.

Can not find the shared library:libhdfs3.so

Hello, my environment is Ubuntu 16 Desktop. I ran into this problem after installing hdfs3 alongside Dask distributed:

Error Can not find the shared library:libhdfs3.so

It happens when I try to have Dask retrieve data from HDFS. I have also noticed that a lot of people have this issue.
What have I done?

  1. conda install hdfs3 -c conda-forge
     conda install distributed hdfs3 -c conda-forge

  2. conda install libprotobuf

Add HDFSMap to API docs

I'd like to link from the Zarr docs to documentation for the HDFSMap class as an example of a compatible storage layer. This class is currently missing from the API docs; it would be nice to add it.

Testing on travis.ci

@danielfrg has a nice docker container for HDFS. Perhaps we can use this to provide continuous integration with a service like travis.ci.

Release version 1.3 on conda

Using pip installs version 1.3 of hdfs3, but conda still installs 1.2. Could you also push version 1.3 to the conda repos? Thanks.

Support for Google Cloud Storage connector

Thanks for all the work on Dask!

Many Hadoop clusters running on Google's cloud use the GCS connector since it turns GCS into a persistent and performant drop-in HDFS replacement. Unfortunately I can't figure out how to make hdfs3 recognize gs://... URIs as valid.

Dask on HDFS is so great, but it kills (prevents?) the whole workflow to have to distcp from gs:// to hdfs:// just to use it. Is there any way to support gs:// URIs natively?

Fill out tests

Should we add coverage as well? This is the sort of library that can and should achieve 100% coverage. I've aimed for 100% test coverage in other projects (toolz) and have been happy with the result.

Windows Support -- help needed

Hi all,

what is needed to make this support Windows? I am in need of a Python HDFS client, so I would be willing to develop this if it is reasonable.

Incorrect Glob pattern behavior

The glob() function, at core.py:321, doesn't implement the correct behavior for the ? wildcard: it replaces it in the regex with ., which will match /, something glob wildcards should never do.

pattern = re.compile("^" + path.replace('//', '/')
                               .rstrip('/')
                               .replace('*', '[^/]*')
                               .replace('?', '.') + "$")

I believe changing that line to use .replace('?', '[^/]') would fix this issue; see the sketch below.
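
A sketch of the proposed fix, together with a quick check that ? no longer matches the path separator (the helper name is mine, and characters other than * and ? are still not escaped, as in the original code):

import re

def glob_to_regex(path):
    # Proposed fix: '?' matches any single character except '/'.
    return re.compile("^" + path.replace('//', '/')
                                .rstrip('/')
                                .replace('*', '[^/]*')
                                .replace('?', '[^/]') + "$")

assert glob_to_regex('/tmp/file?').match('/tmp/file1')
assert not glob_to_regex('/tmp/file?').match('/tmp/file/')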

MapR compatibility

I understand that libhdfs3 isn't really relevant to MapR since they have their own C API, but it would be nice for this library to support MapR. It seems like the MapR API is different enough from libhdfs3 to make this a quick and simple project. Still, it seems a shame to start a different project instead of adding MapR support to this one.

Any high level thoughts or insights on this?

Add HA detection during configuration reading

While #118 works around getting the defaults set correctly so the wrapper will run successfully, the larger issue is that None values need to remain valid for HA setups long term. While the configuration can be stated explicitly, detection of HA setups from the Hadoop configuration should be added at some point; a rough sketch follows below.

Will put some time into this in the near future to build this out.
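
A rough sketch of what the detection could look like, assuming the standard Hadoop HA keys (dfs.nameservices and dfs.ha.namenodes.<nameservice>) are present in the parsed configuration dictionary; the helper itself is hypothetical:

def is_ha_config(conf):
    # Treat a configuration as HA when it defines at least one nameservice
    # that lists more than one namenode.
    services = conf.get('dfs.nameservices', '')
    for service in filter(None, services.split(',')):
        namenodes = conf.get('dfs.ha.namenodes.%s' % service, '')
        if len([n for n in namenodes.split(',') if n]) > 1:
            return True
    return False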

Test delegate tokens

We don't currently allow access to libhdfs3's hdfsGetDelegationToken although we do allow passing a token to the client initialisation.

Provide a function to recursively put a directory with subdirectories to HDFS

Right now it is only possible to recursively delete a directory. It would be nice if hdfs3 came with a function out of the box that allows recursively pushing a directory to HDFS.

A naive implementation would be something like:

import os


class cd(object):
    """Context manager for changing the current working directory"""
    def __init__(self, new_path):
        self.new_path = os.path.expanduser(new_path)

    def __enter__(self):
        self.old_path = os.getcwd()
        os.chdir(self.new_path)

    def __exit__(self, etype, value, traceback):
        os.chdir(self.old_path)

def put_dir(hdfs_client, origin_path, destination_path):
    """Recursively push a directory to HDFS

    Args:
        hdfs_client: hdfs client
        origin_path: origin path
        destination_path: destination path on HDFS
    """
    hdfs_client.mkdir(destination_path)
    with cd(origin_path):
        for root, dirs, files in os.walk('./'):
            dest_path = os.path.join(destination_path, root)
            for dir in dirs:
                hdfs_client.mkdir(os.path.join(dest_path, dir))
            for file in files:
                file_path = os.path.join(root, file)
                hdfs_client.put(file_path, os.path.join(dest_path, file))

But this is pretty slow in practice.

dfs.client.use.datanode.hostname have no effect

I was trying to use hdfs3 in a setup similar to the one described in this post.

I am failing to upload a file due to a network connectivity issue, which is caused by the HDFS client trying to use the datanodes' IP addresses instead of their hostnames.
I've tried setting "dfs.client.use.datanode.hostname" to "true" both in a conf = {} dictionary and in hdfs-site.xml; that is, I tried passing all the configuration options via both of those methods (sketched below). No effect.
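
For reference, a minimal sketch of the two approaches I tried (connection details are placeholders):

from hdfs3 import HDFileSystem

# Approach 1: pass the option programmatically.
hdfs = HDFileSystem(host='namenode', port=8020,
                    pars={'dfs.client.use.datanode.hostname': 'true'})

# Approach 2: the same key set in hdfs-site.xml instead:
# <property>
#   <name>dfs.client.use.datanode.hostname</name>
#   <value>true</value>
# </property>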

I understand this may be issue with the underlying lib. Don't hesitate to reject if that's the case.

Build libhdfs3 conda package

Todo items continuing work from #24:

  • Conda package for hdfs3
  • Conda package for libhdfs3
  • Conda package for libgsasl
  • Conda package for libntlm
  • Conda package for libprotobuf
  • Fix version in libhdfs3 conda package
  • Fix hdfs3 recipe

Cannot load a pickle file with python 2.7

Here is a small snippet to reproduce the issue:

import pickle
from hdfs3 import HDFileSystem                      

hdfs = HDFileSystem(host='localhost', port=8020)

# Dump works
with hdfs.open("/user/aabadie/test.pkl", "wb") as f:
    pickle.dump('test', f)

# Load fails (see stacktrace below)
with hdfs.open("/user/aabadie/test.pkl", "rb") as f:
    print(pickle.load(f))

Error:

---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
<ipython-input-5-97136b6bb693> in <module>()
      1 with hdfs.open("/user/aabadie/test.pkl", "rb") as f:
----> 2     print(pickle.load(f))
      3 

/home/aabadie/conda3/envs/py27/lib/python2.7/pickle.pyc in load(file)
   1382 
   1383 def load(file):
-> 1384     return Unpickler(file).load()
   1385 
   1386 def loads(str):

/home/aabadie/conda3/envs/py27/lib/python2.7/pickle.pyc in load(self)
    862             while 1:
    863                 key = read(1)
--> 864                 dispatch[key](self)
    865         except _Stop, stopinst:
    866             return stopinst.value

/home/aabadie/conda3/envs/py27/lib/python2.7/pickle.pyc in load_eof(self)
    884 
    885     def load_eof(self):
--> 886         raise EOFError
    887     dispatch[''] = load_eof
    888 

If I download the pickle file with hadoop, pickle is able to load it, meaning that the pickle dump worked.

$ hadoop fs -get hdfs://localhost/user/aabadie/test.pkl

The same code works in Python 3.5. Maybe Python 2.7 is not supported; in that case, feel free to close.

Small block sizes

I'm not sure if this is possible, but it'd be great if we could tune the block size of files down to something very small. This would make certain testing activities much easier. There is currently a block_size kwarg in open/HDFile, but any use of it results in an error (see the sketch below); I'm not sure why.
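
A hypothetical reproduction sketch (connection details and path are placeholders; per the above, passing any explicit block_size currently raises an error):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='localhost', port=8020)

# Request a 1 MiB block size for a tiny test file.
with hdfs.open('/tmp/test/small-blocks.bin', 'wb', block_size=1048576) as f:
    f.write(b'x' * 10)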

Test Python 3

The docker container currently only tests Python 2. It would be nice to extend this.

get_block_locations should work on directories

Current behavior of get_block_locations:

In [23]: hdfs.get_block_locations('/tmp/test/file')
Out[23]: 
[{'hosts': ['8d2452d1af62'], 'length': 67108864, 'offset': 0},
 {'hosts': ['8d2452d1af62'], 'length': 32891136, 'offset': 67108864}]

However, if we want to read many files as in either of the following cases

In [23]: hdfs.get_block_locations('/tmp/test/file-*.csv')  # glob string
In [23]: hdfs.get_block_locations('/tmp/test/2015/')  # directory

Then we want to get both the existing information and the explicit filename:

Out[23]: 
[{'filename': '/tmp/test/file-1.csv', 'hosts': ['8d2452d1af62'], 'length': 67108864, 'offset': 0},
 {'filename': '/tmp/test/file-1.csv', 'hosts': ['8d2452d1af62'], 'length': 32891136, 'offset': 67108864},
 {'filename': '/tmp/test/file-2.csv', 'hosts': ['8d2452d1af62'], 'length': 67108864, 'offset': 0},
 {'filename': '/tmp/test/file-2.csv', 'hosts': ['8d2452d1af62'], 'length': 32891136, 'offset': 67108864}]
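
Until get_block_locations supports this directly, a minimal sketch of the desired behaviour built on the existing API (the helper name is hypothetical):

def block_locations_with_filenames(hdfs, paths):
    # `paths` is a list of file paths, e.g. from hdfs.glob('/tmp/test/file-*.csv')
    # or hdfs.walk('/tmp/test/2015/').
    blocks = []
    for path in paths:
        if hdfs.info(path)['kind'] != 'file':
            continue
        for block in hdfs.get_block_locations(path):
            entry = dict(block)
            entry['filename'] = path  # tag each block with its source file
            blocks.append(entry)
    return blocks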

Conda packages not working on Ubuntu 12.04

After installing hdfs3 via conda, a libstdc++ error occurs on import.

Issues:

  • Needs conda install libgcc on Ubuntu 12.04
  • ELF errors on CentOS 5, need to compile on older OS (CentOS 5.8 instead of Ubuntu 14.04)
  • Move some build/run dependencies to meta.yaml for libhdfs3
