cthoyt / pystow

👜 Easily pick a place to store data for your Python code.

Home Page: https://pystow.readthedocs.io

License: MIT License

Python 100.00%

file-management file-manager filesystem pathlib python reproducibility reproducible-research reproducible-science

pystow's Issues

Return types for some ensure functions

For functions like ensure_open_zip, the return value is documented as

:yields: An open file object

so I thought this would be something I can call e.g., read() on. Checking the type of the returned value, it's contextlib._GeneratorContextManager, and it took me a little while to figure out that this means I have to use with to interact with this function. Are the documentation and type annotation for these functions correct?
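A minimal sketch of why the return value cannot be read directly: any function decorated with contextlib.contextmanager returns a _GeneratorContextManager, and the yielded file object only becomes available inside a with block. The helper open_zip_member below is hypothetical. It stands in for ensure_open_zip and uses only the standard library, so it says nothing about how pystow implements the function.

```python
import io
import zipfile
from contextlib import contextmanager


@contextmanager
def open_zip_member(zip_bytes: bytes, inner_path: str):
    """Yield an open file object for one member of an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(inner_path) as file:
            yield file


def make_zip(inner_path: str, payload: bytes) -> bytes:
    """Build a small zip archive in memory for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(inner_path, payload)
    return buffer.getvalue()


ZIP_BYTES = make_zip("data.txt", b"hello\n")

# Calling the function only builds the context manager; it has no read().
manager = open_zip_member(ZIP_BYTES, "data.txt")
HAS_READ = hasattr(manager, "read")

# Entering the context manager gives access to the yielded file object.
with open_zip_member(ZIP_BYTES, "data.txt") as file:
    CONTENT = file.read()
```

The ":yields:" wording in the docstring describes what the with statement binds, not what the plain call returns.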

Package on Anaconda via conda-forge?

https://conda-forge.org/docs/maintainer/adding_pkgs.html

sparks-baird/matbench-genmetrics#47 (comment)

syncing an upstream gzip file with an expanded local version

pystow has methods for syncing with a gzipped file from a URL and dynamically opening it

but if my upstream file is a gzipped sqlite (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder, before I make a connection to it (the same may hold for things like OWL)

I can obviously do this trivially, but this would require introspecting paths and would seem to defeat the point of having an abstraction layer.

For now I am putting duplicative .db and .db.gz files on s3, and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions

What I am imagining is:

url = 'https://s3.amazonaws.com/bbop-sqlite/hp.db.gz'
path = pystow.ensure('oaklib', 'sqlite', url=url, decompress=True)
conn = connect(f"file:///{path}")

Does that make sense?

As an aside, it may also be useful to have specific ensure methods for sqlite and/or sqlalchemy the same way you have for pandas.
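Until something like decompress=True exists, the expansion step can be sketched with the standard library alone. The helper decompress_gzip is hypothetical and not part of pystow. In practice, the archive path would be whatever pystow.ensure returns for the .db.gz URL, while here a throwaway database is created so the sketch runs without a download.

```python
import gzip
import shutil
import sqlite3
import tempfile
from pathlib import Path


def decompress_gzip(source: Path, force: bool = False) -> Path:
    """Decompress ``source`` (ending in .gz) next to itself and return the new path."""
    target = source.with_suffix("")  # hp.db.gz -> hp.db
    if force or not target.exists():
        with gzip.open(source, "rb") as compressed, target.open("wb") as expanded:
            shutil.copyfileobj(compressed, expanded)
    return target


# Build a small SQLite database and gzip it, standing in for the upstream file.
workdir = Path(tempfile.mkdtemp())
original = workdir / "original.db"
conn = sqlite3.connect(original)
conn.execute("CREATE TABLE term (id TEXT)")
conn.execute("INSERT INTO term VALUES ('HP:0000001')")
conn.commit()
conn.close()

archive = workdir / "hp.db.gz"
with original.open("rb") as raw, gzip.open(archive, "wb") as compressed:
    shutil.copyfileobj(raw, compressed)

# Expand once, then connect to the uncompressed copy.
db_path = decompress_gzip(archive)
conn = sqlite3.connect(db_path)
ROWS = conn.execute("SELECT id FROM term").fetchall()
conn.close()
```

Skipping the work when the target already exists keeps repeated calls cheap, at the cost of needing force=True when the upstream archive changes.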

Missing type information

I just noticed that the type information of pystow is not shipped, because no py.typed file is present. See this mypy output I had in another project.

...
sylloge/base.py:28: error: Skipping analyzing "pystow": module is installed, but missing library stubs or py.typed marker  [import-untyped]
sylloge/base.py:30: error: Skipping analyzing "pystow.utils": module is installed, but missing library stubs or py.typed marker  [import-untyped]
....

This can be easily fixed by adding a py.typed file inside src/pystow.

Raise error if directory passed as name

3cd6d59#r50109723

Alignment with platform-specific directories

See https://pypi.org/project/appdirs/

Potential naming conflicts with default pystow home directory name.

Pystow does not check if there is an existing .data directory on the user's system and happily commandeers this folder even if it already exists. Since this is a very common and generic name, it is not unlikely that a user already has a .data directory in their home folder. It is also not unlikely that they, or some software they have installed other than pystow, will try to place a .data directory in their home folder. I suggest changing to a less generic name such as ".pystow_data" to avoid potential naming conflicts. I think just having the possibility of changing the default folder name with an environment variable is insufficient, because the direct users of pystow are Python package developers, not Python package users. We should seek to minimize any cognitive burden or sources of surprise for end users of Python packages that use pystow.

[WINDOWS] TypeError when using `ensure_csv`: "argument of type 'WindowsPath' is not iterable"

import pystow
from pystow import ensure_csv
MBGM_HOME = pystow.join("matbench-genmetrics")
ensure_csv(MBGM_HOME, url="https://figshare.com/ndownloader/files/36581838")

..\..\..\..\Miniconda3\envs\matbench-genmetrics\lib\site-packages\pystow\api.py:610: in ensure_csv
    _module = Module.from_key(key, ensure_exists=True)
..\..\..\..\Miniconda3\envs\matbench-genmetrics\lib\site-packages\pystow\impl.py:83: in from_key
    base = get_base(key, ensure_exists=False)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

key = WindowsPath('C:/Users/sterg/.data/matbench-genmetrics')
ensure_exists = False

    def get_base(key: str, ensure_exists: bool = True) -> Path:
        """Get the base directory for a module.
    
        :param key:
            The name of the module. No funny characters. The envvar
            <key>_HOME where key is uppercased is checked first before using
            the default home directory.
        :param ensure_exists:
            Should all directories be created automatically?
            Defaults to true.
        :returns:
            The path to the given
    
        :raises ValueError: if the key is invalid (e.g., has a dot in it)
        """
>       if "." in key:
E       TypeError: argument of type 'WindowsPath' is not iterable
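The traceback shows that key is a WindowsPath while get_base annotates it as str, so the membership test "." in key fails before any useful message can be produced. The sketch below reproduces that failure with PureWindowsPath (so it runs on any platform) and shows a hypothetical guard, check_key, that is not part of pystow and only illustrates how a clearer error could be raised.

```python
from pathlib import PureWindowsPath


def check_key(key: object) -> str:
    """Return ``key`` unchanged if it is a valid module name, else raise."""
    if not isinstance(key, str):
        raise TypeError(f"key must be a str module name, got {type(key).__name__}")
    if "." in key:
        raise ValueError(f"The module should not have a dot in it: {key}")
    return key


# The membership test from the traceback fails for any path object,
# because pathlib paths are neither containers nor iterables.
try:
    "." in PureWindowsPath("C:/Users/sterg/.data/matbench-genmetrics")
    RAW_ERROR = None
except TypeError as error:
    RAW_ERROR = error

# The hypothetical guard reports the actual problem instead.
try:
    check_key(PureWindowsPath("C:/Users/sterg/.data/matbench-genmetrics"))
    GUARD_ERROR = ""
except TypeError as error:
    GUARD_ERROR = str(error)
```

In the reported snippet, passing the module name "matbench-genmetrics" rather than the joined path would satisfy the str annotation.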

loading comma-separated values format defaults to tab separators

After reading through https://en.wikipedia.org/wiki/Comma-separated_values, I think I can understand the decision to make tab separators the default as "the only safe option," though it still seems confusing to me that ensure_csv assumes sep="\t". It may be worth mentioning in the documentation that pandas' DataFrame.to_csv() defaults to commas, not tabs.

pystow/src/pystow/impl.py

Lines 1414 to 1417 in 2ce9690

def _clean_csv_kwargs(read_csv_kwargs):
    read_csv_kwargs = {} if read_csv_kwargs is None else dict(read_csv_kwargs)
    read_csv_kwargs.setdefault("sep", "\t")
    return read_csv_kwargs
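Because the helper uses setdefault, the tab is only a fallback: a separator supplied by the caller survives. The sketch below repeats the helper with type hints added and checks both cases. It assumes that ensure_csv forwards a read_csv_kwargs mapping through this helper, which the parameter name suggests but the issue does not confirm.

```python
from typing import Any, Dict, Mapping, Optional


def _clean_csv_kwargs(read_csv_kwargs: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Copy the keyword arguments and default the separator to a tab."""
    read_csv_kwargs = {} if read_csv_kwargs is None else dict(read_csv_kwargs)
    read_csv_kwargs.setdefault("sep", "\t")
    return read_csv_kwargs


# No arguments given: the tab default applies.
DEFAULTED = _clean_csv_kwargs(None)

# An explicit separator is left untouched by setdefault.
OVERRIDDEN = _clean_csv_kwargs({"sep": ","})
```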

Confusing folder configuration

In the README, the instructions say

Data gets stored in ~/.data by default. If you want to change the name of the directory,
set the environment variable PYSTOW_NAME. If you want to change the default parent directory
to be other than the home directory, set PYSTOW_HOME.

I interpreted this to say that PYSTOW_HOME can be configured to /path to make it create its .data folder in /path/.data. However, this made it put the individual project folders into /path/project1, /path/project2, etc. I think it would make sense to either interpret PYSTOW_HOME as the parent of .data and PYSTOW_NAME as the name for the .data folder, or change the instructions above to describe the current behavior (i.e., instead of "default parent directory to be other than the home directory" say "default pystow data directory to be other than ~/.data")
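The behavior reported above can be summarized in a small sketch. The function resolve_base is hypothetical and only mirrors what this issue describes (PYSTOW_HOME replaces the whole data directory, PYSTOW_NAME renames the folder under the home directory). It is not copied from pystow.

```python
from pathlib import PurePosixPath
from typing import Mapping


def resolve_base(env: Mapping[str, str], home: PurePosixPath) -> PurePosixPath:
    """Mimic the reported behavior of the two environment variables."""
    if "PYSTOW_HOME" in env:
        # The value is used as the data directory itself, not as its parent.
        return PurePosixPath(env["PYSTOW_HOME"])
    return home / env.get("PYSTOW_NAME", ".data")


HOME = PurePosixPath("/home/user")
DEFAULT = resolve_base({}, HOME)
RENAMED = resolve_base({"PYSTOW_NAME": ".pystow"}, HOME)
RELOCATED = resolve_base({"PYSTOW_HOME": "/path"}, HOME)

# Project folders land directly inside PYSTOW_HOME, without a .data level.
PROJECT = RELOCATED / "project1"
```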

Issue with directories containing dots

These are considered as files by pathlib. Make name of file more explicit and keyword only.

Get configuration from file corresponding to leading part

If I want to use zenodo:sandbox as a key, it should know to look in zenodo.ini and other zenodo.* files if they exist.
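One way to read this request is sketched below: take the part of the key before the colon and look for files with that stem. The function find_config_files is hypothetical, and the temporary directory stands in for the pystow home directory.

```python
import tempfile
from pathlib import Path
from typing import List


def find_config_files(home: Path, key: str) -> List[Path]:
    """List files in ``home`` named after the part of ``key`` before the colon."""
    stem = key.split(":", 1)[0]
    return sorted(home.glob(f"{stem}.*"))


# Create a few empty files to search through.
home = Path(tempfile.mkdtemp())
for name in ("zenodo.ini", "zenodo.cfg", "other.ini"):
    (home / name).touch()

FOUND = [path.name for path in find_config_files(home, "zenodo:sandbox")]
```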

Best practices for clearing stale files

My workflow involves periodic

ls -alt ~/.data/oaklib

followed by looking at timestamps, using tacit knowledge about update frequencies of different ontologies, and selectively removing older files.

But if pystow is used in toolchains used by less technical users this could be confusing. What are the long term plans here? Should application developers write bespoke cache management solutions? This is not a bad idea as they can take advantage of specific conventions (e.g. I am caching sqlites of ontology files and I know versionIRI, when present, should uniquely identify the version). But it may be useful to have some kind of general purpose cache management helpers in the core, together with some kind of autoflush-after-N-days type options?
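As a starting point for discussion, an age-based flush could look like the sketch below. The helper remove_stale and its max_age_days parameter are hypothetical and not part of pystow. The sketch relies on modification times only, so it cannot use domain knowledge such as a versionIRI.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List


def remove_stale(directory: Path, max_age_days: float) -> List[Path]:
    """Delete files under ``directory`` not modified within ``max_age_days``."""
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    removed = []
    for path in sorted(directory.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


# Demonstration in a temporary directory: one fresh file, one backdated file.
cache = Path(tempfile.mkdtemp())
fresh, stale = cache / "fresh.db", cache / "stale.db"
fresh.touch()
stale.touch()
forty_days_ago = time.time() - 40 * 24 * 60 * 60
os.utime(stale, (forty_days_ago, forty_days_ago))

REMOVED = [path.name for path in remove_stale(cache, max_age_days=30)]
```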

Ensure from figshare

This repo does some analysis, but the authors weren't able to post their data on GitHub (https://github.com/ky66/ROBIN#data). The data is on figshare at https://figshare.com/ndownloader/files/36477873. Seemingly, this could be a simple wrapper around ensure() that just formats the file number into the URL.
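The URL-formatting half of such a wrapper is sketched below. The function figshare_url is hypothetical, and the URL pattern is inferred only from the single link quoted in this issue.

```python
FIGSHARE_DOWNLOAD = "https://figshare.com/ndownloader/files/{file_id}"


def figshare_url(file_id: int) -> str:
    """Format a figshare file number into a direct download URL."""
    return FIGSHARE_DOWNLOAD.format(file_id=file_id)


URL = figshare_url(36477873)
```

The result could then be passed as url= to ensure(), probably together with an explicit file name, since the URL itself does not carry a meaningful one.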

Python 3.10 support

Joining to existing pystow path

Assume I start with

workdir = pystow.join('part1', 'part2', ..., 'partn')

This guarantees that workdir will be created if it doesn't yet exist. However, if I now want to make a subfolder inside workdir, I either have to do

work_subdir = workdir.joinpath('part_nplusone')

which doesn't create the folder for me and so I have to do additional bookkeeping to make that happen or do

workdir = pystow.join('part1', 'part2', ..., 'partn', 'part_nplusone')

which is redundant.

What is the recommended approach here?
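For reference, the "additional bookkeeping" mentioned above amounts to one mkdir call, which can be wrapped in a small helper. The function ensure_subdir is hypothetical and not a pystow API. The temporary directory stands in for the result of pystow.join.

```python
import tempfile
from pathlib import Path


def ensure_subdir(base: Path, *parts: str) -> Path:
    """Join ``parts`` onto ``base`` and create the directory if needed."""
    path = base.joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)
    return path


workdir = Path(tempfile.mkdtemp())  # stands in for pystow.join(...)
work_subdir = ensure_subdir(workdir, "part_nplusone")
```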

Bug in ensure_open_zip when using API

Hello, when using the API function ensure_open_zip, there is an error thrown, e.g. the following:

import pystow
url = "https://cloud.enterprise.informatik.uni-leipzig.de/index.php/s/LHPbMCre7SLqajB/download/MultiKE_D_Y_15K_V1.zip"
inner_path = "MultiKE/D_Y_15K_V1/721_5fold/1/20210219183115/kg1_ent_ids"
with pystow.ensure_open_zip("kiez", url=url, inner_path=inner_path) as file:
    for line in file:
        print(line)
        break

results in TypeError: '_GeneratorContextManager' object is not iterable.
Using the module does work.
I figured out how to fix this by changing the respective API code to the following:

@contextmanager
def ensure_open_zip(
    key: str,
    *subkeys: str,
    url: str,
    inner_path: str,
    name: Optional[str] = None,
    force: bool = False,
    download_kwargs: Optional[Mapping[str, Any]] = None,
    mode: str = "r",
    open_kwargs: Optional[Mapping[str, Any]] = None,
):
    """Ensure a file is downloaded then open it with :mod:`zipfile`."""
    _module = Module.from_key(key, ensure_exists=True)
    with _module.ensure_open_zip(
        *subkeys,
        url=url,
        inner_path=inner_path,
        name=name,
        force=force,
        download_kwargs=download_kwargs,
        mode=mode,
        open_kwargs=open_kwargs,
    ) as inner_ensure_open_zip:
        yield inner_ensure_open_zip

Autogenerate README in .data folder

Make pandas optional

Need to create a config file before it can be written to using write_config

Related: cthoyt/zenodo-client#6

pystow/src/pystow/config_api.py

Lines 151 to 164 in ecbe7ea

def write_config(module: str, key: str, value: str) -> None:
    """Write a configuration value.

    :param module: The name of the app (e.g., ``indra``)
    :param key: The key of the configuration in the app
    :param value: The value of the configuration in the app
    """
    _get_cfp.cache_clear()
    cfp = ConfigParser()
    path = get_home() / f"{module}.ini"
    cfp.read(path)
    cfp.set(module, key, value)
    with path.open("w") as file:
        cfp.write(file)

I needed to call cfp.add_section(module) before I could use cfp.set(module, key, value) given the file doesn't already exist.
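The workaround can be sketched as a standalone function. ConfigParser.read silently skips missing files, so the parser starts out without the section and set() raises NoSectionError unless the section is added first. The name write_config_value, its explicit path argument, and the example key are assumptions made so the sketch runs without pystow.

```python
import tempfile
from configparser import ConfigParser
from pathlib import Path


def write_config_value(path: Path, module: str, key: str, value: str) -> None:
    """Set ``key`` in section ``module`` of the INI file at ``path``."""
    cfp = ConfigParser()
    cfp.read(path)  # silently skips a file that does not exist
    if not cfp.has_section(module):
        cfp.add_section(module)  # without this, set() raises NoSectionError
    cfp.set(module, key, value)
    with path.open("w") as file:
        cfp.write(file)


# Write to a file that does not exist yet, then read the value back.
path = Path(tempfile.mkdtemp()) / "zenodo.ini"
write_config_value(path, "zenodo", "api_token", "example")

check = ConfigParser()
check.read(path)
VALUE = check.get("zenodo", "api_token")
```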

Feature: Add progress bar during download

Optionally behind a verbose parameter, but it'd be nice in some cases for the user to be aware that downloads are happening in the background.

Ensure zip file?

Hello,
I love the project. However, I am missing something like an ensure_zip_file function. While ensure can make sure that the actual .zip is there, it would be nice to have functionality that ensures a file from this zip is there and loads it as a file object, to then read it in however it is needed.
Poking around in the code I found Module.ensure_open_zip. I think this contextmanager can be easily wrapped for the API, to get what I need like this:

def ensure_open_zip(
    key: str,
    *subkeys: str,
    url: str,
    inner_path: str,
    name: Optional[str] = None,
    download_kwargs: Optional[Mapping[str, Any]] = None,
):
    _module = Module.from_key(key, ensure_exists=True)
    return _module.ensure_open_zip(*subkeys, url=url, inner_path=inner_path, name=name)

If it's fine with you I could make a PR and add it (with tests and doc of course)?

Variants of ensure functions focused on loading

Currently, pystow provides a number of ensure_* functions that target different file types, downloads them if not available, and loads them. For instance, ensure_csv takes a path to a CSV file and loads it with pandas. However, all of these functions also require a url argument from which the file is first downloaded if it doesn't already exist. I think it would be useful to have variants of these functions where the same functionality of loading canonical file types is provided but without the url/download part, under the assumption that the given file is already there and never necessarily came from a URL in the first place. I would definitely use pickle loading or JSON loading for instance, knowing that a given file is already there.
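What such load-only variants might look like is sketched below for JSON and pickle. The names load_json and load_pickle are hypothetical and not existing pystow functions. They take a plain path and fail early if the file is missing instead of trying to download it.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any


def load_json(path: Path) -> Any:
    """Load a JSON file that is assumed to exist already."""
    if not path.is_file():
        raise FileNotFoundError(path)
    with path.open() as file:
        return json.load(file)


def load_pickle(path: Path) -> Any:
    """Load a pickle file that is assumed to exist already."""
    if not path.is_file():
        raise FileNotFoundError(path)
    with path.open("rb") as file:
        return pickle.load(file)


# Write two small files by hand, then load them without any URL involved.
directory = Path(tempfile.mkdtemp())
json_path = directory / "data.json"
json_path.write_text(json.dumps({"a": 1}))
pickle_path = directory / "data.pkl"
pickle_path.write_bytes(pickle.dumps([1, 2, 3]))

JSON_DATA = load_json(json_path)
PICKLE_DATA = load_pickle(pickle_path)
```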
