pystow's Introduction

PyStow

👜 Easily pick a place to store data for your Python code.

🚀 Getting Started

Get a directory for your application.

import pystow

# Get a directory (as a pathlib.Path) for ~/.data/pykeen
pykeen_directory = pystow.join('pykeen')

# Get a subdirectory (as a pathlib.Path) for ~/.data/pykeen/experiments
pykeen_experiments_directory = pystow.join('pykeen', 'experiments')

# You can go as deep as you want
pykeen_deep_directory = pystow.join('pykeen', 'experiments', 'a', 'b', 'c')

If you reuse the same directory structure a lot, you can save it in a module:

import pystow

pykeen_module = pystow.module("pykeen")

# Access the module's directory with .base
assert pystow.join("pykeen") == pystow.module("pykeen").base

# Get a subdirectory (as a pathlib.Path) for ~/.data/pykeen/experiments
pykeen_experiments_directory = pykeen_module.join('experiments')

# You can go as deep as you want past the original "pykeen" module
pykeen_deep_directory = pykeen_module.join('experiments', 'a', 'b', 'c')

Get a file path for your application by adding the name keyword argument. This is made explicit so PyStow knows which parent directories to automatically create. This works with pystow or any module you create with pystow.module.

import pystow

# Get a file path (as a pathlib.Path) for ~/.data/indra/database/database.tsv
indra_database_path = pystow.join('indra', 'database', name='database.tsv')

Ensure a file from the internet is available in your application's directory:

import pystow

url = 'https://raw.githubusercontent.com/pykeen/pykeen/master/src/pykeen/datasets/nations/test.txt'
path = pystow.ensure('pykeen', 'datasets', 'nations', url=url)

Ensure a tabular data file from the internet and load it for usage (requires pip install pandas):

import pystow
import pandas as pd

url = 'https://raw.githubusercontent.com/pykeen/pykeen/master/src/pykeen/datasets/nations/test.txt'
df: pd.DataFrame = pystow.ensure_csv('pykeen', 'datasets', 'nations', url=url)

Ensure a comma-separated tabular data file from the internet and load it for usage (requires pip install pandas):

import pystow
import pandas as pd

url = 'https://raw.githubusercontent.com/cthoyt/pystow/main/tests/resources/test_1.csv'
df: pd.DataFrame = pystow.ensure_csv('pykeen', 'datasets', 'nations', url=url, read_csv_kwargs=dict(sep=","))

Ensure an RDF file from the internet and load it for usage (requires pip install rdflib):

import pystow
import rdflib

url = 'https://ftp.expasy.org/databases/rhea/rdf/rhea.rdf.gz'
rdf_graph: rdflib.Graph = pystow.ensure_rdf('rhea', url=url)

Also see pystow.ensure_excel(), pystow.ensure_rdf(), pystow.ensure_zip_df(), and pystow.ensure_tar_df().

If your data comes with a lot of different files in an archive, you can ensure the archive is downloaded and get specific files from it:

import numpy as np
import pystow

url = "https://cloud.enterprise.informatik.uni-leipzig.de/index.php/s/LHPbMCre7SLqajB/download/MultiKE_D_Y_15K_V1.zip"
# the path inside the archive to the file you want
inner_path = "MultiKE/D_Y_15K_V1/721_5fold/1/20210219183115/ent_embeds.npy"
with pystow.ensure_open_zip("kiez", url=url, inner_path=inner_path) as file:
    emb = np.load(file)

Also see pystow.module.ensure_open_lzma(), pystow.module.ensure_open_tarfile() and pystow.module.ensure_open_gz().

⚙️ Configuration

By default, data is stored in the $HOME/.data directory, so an app named <app> gets the $HOME/.data/<app> folder.

If you want to use an alternate folder name to .data inside the home directory, you can set the PYSTOW_NAME environment variable. For example, if you set PYSTOW_NAME=mydata, then the following code for the pykeen app will create the $HOME/mydata/pykeen/ directory:

import os
import pystow

# Only for demonstration purposes. You should set environment
# variables either in your .bashrc or on the command line.
os.environ['PYSTOW_NAME'] = 'mydata'

# Get a directory (as a pathlib.Path) for ~/mydata/pykeen
pykeen_directory = pystow.join('pykeen')

If you want to specify a completely custom directory that isn't relative to your home directory, you can set the PYSTOW_HOME environment variable. For example, if you set PYSTOW_HOME=/usr/local/, then the following code for the pykeen app will create the /usr/local/pykeen/ directory:

import os
import pystow

# Only for demonstration purposes. You should set environment
# variables either in your .bashrc or on the command line.
os.environ['PYSTOW_HOME'] = '/usr/local/'

# Get a directory (as a pathlib.Path) for /usr/local/pykeen
pykeen_directory = pystow.join('pykeen')

Note: if you set PYSTOW_HOME, then PYSTOW_NAME is disregarded.
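The precedence between the two variables can be sketched in plain Python. This is only an illustration of the rules described above, not PyStow's actual implementation: resolve_base is a hypothetical helper, the /home/user path is made up, and the sketch ignores the PYSTOW_USE_APPDIRS option.

```python
from pathlib import Path
from typing import Mapping


def resolve_base(env: Mapping[str, str], home: Path) -> Path:
    """Return the base directory implied by the rules described above.

    Hypothetical helper for illustration; not part of PyStow.
    """
    custom_home = env.get("PYSTOW_HOME")
    if custom_home:
        # PYSTOW_HOME replaces the whole base directory, so PYSTOW_NAME is ignored.
        return Path(custom_home)
    # Otherwise the base is a folder inside the home directory (".data" by default).
    return home / env.get("PYSTOW_NAME", ".data")


home = Path("/home/user")  # made-up home directory
default_dir = resolve_base({}, home) / "pykeen"
named_dir = resolve_base({"PYSTOW_NAME": "mydata"}, home) / "pykeen"
custom_dir = resolve_base(
    {"PYSTOW_HOME": "/usr/local/", "PYSTOW_NAME": "mydata"}, home
) / "pykeen"
```

Note that in the last case the app folder sits directly inside PYSTOW_HOME, with no .data level in between.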

X Desktop Group (XDG) Compatibility

While PyStow's main goal is to make application data less opaque and less hidden, some users might want to use the XDG specifications for storing their app data.

If you set the environment variable PYSTOW_USE_APPDIRS to true or True, then the appdirs package will be used to choose the base directory based on the user data dir option. This can still be overridden by PYSTOW_HOME.

🚀 Installation

The most recent release can be installed from PyPI with:

$ pip install pystow

Note, as of v0.3.0, Python 3.6 isn't officially supported (its end-of-life was in December 2021). For the time being, pystow might still work on py36, but this is only coincidental.

The most recent code and data can be installed directly from GitHub with:

$ pip install git+https://github.com/cthoyt/pystow.git

To install in development mode, use the following:

$ git clone https://github.com/cthoyt/pystow.git
$ cd pystow
$ pip install -e .

⚖️ License

The code in this package is licensed under the MIT License.

pystow's People

Contributors

cthoyt, dobraczka, mberr, sgbaird

pystow's Issues

Confusing folder configuration

In the README, the instructions say

Data gets stored in ~/.data by default. If you want to change the name of the directory,
set the environment variable PYSTOW_NAME. If you want to change the default parent directory
to be other than the home directory, set PYSTOW_HOME.

I interpreted this to say that PYSTOW_HOME can be configured to /path to make it create its .data folder in /path/.data. However, this made it put the individual project folders into /path/project1, /path/project2, etc. I think it would make sense to either interpret PYSTOW_HOME as the parent of .data and PYSTOW_NAME as the name for the .data folder, or change the instructions above to describe the current behavior (i.e., instead of "default parent directory to be other than the home directory" say "default pystow data directory to be other than ~/.data")

Bug in ensure_open_zip when using API

Hello, when using the API function ensure_open_zip, there is an error thrown, e.g. the following:

import pystow
url = "https://cloud.enterprise.informatik.uni-leipzig.de/index.php/s/LHPbMCre7SLqajB/download/MultiKE_D_Y_15K_V1.zip"
inner_path = "MultiKE/D_Y_15K_V1/721_5fold/1/20210219183115/kg1_ent_ids"
with pystow.ensure_open_zip("kiez", url=url, inner_path=inner_path) as file:
    for line in file:
        print(line)
        break

results in TypeError: '_GeneratorContextManager' object is not iterable.
Using the module does work.
I figured out how to fix this by changing the respective API code to the following:

@contextmanager
def ensure_open_zip(
    key: str,
    *subkeys: str,
    url: str,
    inner_path: str,
    name: Optional[str] = None,
    force: bool = False,
    download_kwargs: Optional[Mapping[str, Any]] = None,
    mode: str = "r",
    open_kwargs: Optional[Mapping[str, Any]] = None,
):
    """Ensure a file is downloaded then open it with :mod:`zipfile`."""
    _module = Module.from_key(key, ensure_exists=True)
    with _module.ensure_open_zip(
        *subkeys,
        url=url,
        inner_path=inner_path,
        name=name,
        force=force,
        download_kwargs=download_kwargs,
        mode=mode,
        open_kwargs=open_kwargs,
    ) as inner_ensure_open_zip:
        # Yield the opened file object itself so callers can iterate over it
        yield inner_ensure_open_zip

Best practices for clearing stale files

My workflow involves periodic

ls -alt ~/.data/oaklib

followed by looking at timestamps, using tacit knowledge about update frequencies of different ontologies, and selectively removing older files.

But if pystow is used in toolchains used by less technical users this could be confusing. What are the long term plans here? Should application developers write bespoke cache management solutions? This is not a bad idea as they can take advantage of specific conventions (e.g. I am caching sqlites of ontology files and I know versionIRI, when present, should uniquely identify the version). But it may be useful to have some kind of general purpose cache management helpers in the core, together with some kind of autoflush-after-N-days type options?
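As a rough sketch of what an "autoflush-after-N-days" helper might look like, the following uses only the standard library and file modification times. remove_stale_files is a hypothetical name, not something PyStow provides, and modification time is only a proxy for staleness; version-aware logic such as the versionIRI convention mentioned above would need application-specific code.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List


def remove_stale_files(directory: Path, max_age_days: float) -> List[Path]:
    """Delete files under ``directory`` not modified within ``max_age_days``.

    Hypothetical helper for illustration; not part of PyStow.
    """
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    removed = []
    # Materialize the listing first so deletion does not disturb the traversal.
    for path in list(directory.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


# Demonstration in a throwaway directory rather than a real ~/.data folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_file = root / "old.db"
    new_file = root / "new.db"
    old_file.write_text("old")
    new_file.write_text("new")
    # Backdate one file by 60 days.
    sixty_days_ago = time.time() - 60 * 24 * 60 * 60
    os.utime(old_file, (sixty_days_ago, sixty_days_ago))
    removed_names = [p.name for p in remove_stale_files(root, max_age_days=30)]
    remaining_names = sorted(p.name for p in root.iterdir())
```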

loading comma-separated values format defaults to tab separators

After reading through https://en.wikipedia.org/wiki/Comma-separated_values, I think I can understand the decision behind making tab separators the default as "the only safe option," though it does seem confusing to me that ensure_csv assumes sep="\t". It may be worth mentioning in the documentation that pandas' DataFrame.to_csv() uses commas (not tabs) by default.

pystow/src/pystow/impl.py

Lines 1414 to 1417 in 2ce9690

def _clean_csv_kwargs(read_csv_kwargs):
    read_csv_kwargs = {} if read_csv_kwargs is None else dict(read_csv_kwargs)
    read_csv_kwargs.setdefault("sep", "\t")
    return read_csv_kwargs
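To make the consequence concrete, here is the same logic restated as a standalone function (with a default argument added for convenience), showing that the tab default only applies when the caller does not pass a separator. This is a sketch for illustration; the real helper lives in pystow/src/pystow/impl.py.

```python
def _clean_csv_kwargs(read_csv_kwargs=None):
    # Same logic as the snippet quoted above.
    read_csv_kwargs = {} if read_csv_kwargs is None else dict(read_csv_kwargs)
    # setdefault only fills in "sep" if the caller did not provide one.
    read_csv_kwargs.setdefault("sep", "\t")
    return read_csv_kwargs


default_kwargs = _clean_csv_kwargs(None)
comma_kwargs = _clean_csv_kwargs({"sep": ","})
```

So a comma-separated file needs read_csv_kwargs=dict(sep=",") passed explicitly, as in the README example.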

Variants of ensure functions focused on loading

Currently, pystow provides a number of ensure_* functions that target different file types, downloads them if not available, and loads them. For instance, ensure_csv takes a path to a CSV file and loads it with pandas. However, all of these functions also require a url argument from which the file is first downloaded if it doesn't already exist. I think it would be useful to have variants of these functions where the same functionality of loading canonical file types is provided but without the url/download part, under the assumption that the given file is already there and never necessarily came from a URL in the first place. I would definitely use pickle loading or JSON loading for instance, knowing that a given file is already there.

[WINDOWS] TypeError when using `ensure_csv`: "argument of type 'WindowsPath' is not iterable"

import pystow
from pystow import ensure_csv
MBGM_HOME = pystow.join("matbench-genmetrics")
ensure_csv(MBGM_HOME, url="https://figshare.com/ndownloader/files/36581838")
..\..\..\..\Miniconda3\envs\matbench-genmetrics\lib\site-packages\pystow\api.py:610: in ensure_csv
    _module = Module.from_key(key, ensure_exists=True)
..\..\..\..\Miniconda3\envs\matbench-genmetrics\lib\site-packages\pystow\impl.py:83: in from_key
    base = get_base(key, ensure_exists=False)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

key = WindowsPath('C:/Users/sterg/.data/matbench-genmetrics')
ensure_exists = False

    def get_base(key: str, ensure_exists: bool = True) -> Path:
        """Get the base directory for a module.
    
        :param key:
            The name of the module. No funny characters. The envvar
            <key>_HOME where key is uppercased is checked first before using
            the default home directory.
        :param ensure_exists:
            Should all directories be created automatically?
            Defaults to true.
        :returns:
            The path to the given
    
        :raises ValueError: if the key is invalid (e.g., has a dot in it)
        """
>       if "." in key:
E       TypeError: argument of type 'WindowsPath' is not iterable
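The traceback suggests that the key passed to ensure_csv was the pathlib.Path returned by pystow.join rather than a string, and that the "." in key check then fails because path objects do not support membership tests. The following sketch reproduces just that check with the standard library; it does not call PyStow, and whether this is the full cause of the report is an assumption.

```python
from pathlib import PurePath

key = PurePath("matbench-genmetrics")

try:
    "." in key  # the same kind of membership check that get_base() performs
    error = None
except TypeError as exc:
    error = exc

# The same check is valid when the key is a plain string.
string_key_ok = "." not in "matbench-genmetrics"
```

If that diagnosis is right, passing the module name itself as the key, as the README examples do (for example ensure_csv("matbench-genmetrics", url=...)), would likely avoid the error.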

syncing an upstream gzip file with an expanded local version

pystow has methods for syncing with a gzipped file from a URL and dynamically opening it

but if my upstream file is a gzipped sqlite (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder, before I make a connection to it (the same may hold for things like OWL)

I can obviously do this trivially, but this would require introspecting paths and would seem to defeat the point of having an abstraction layer.

For now I am putting duplicative .db and .db.gz files on s3, and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions

What I am imagining is:

url = 'https://s3.amazonaws.com/bbop-sqlite/hp.db.gz'
path = pystow.ensure('oaklib', 'sqlite', url=url, decompress=True)
conn = connect(f"file:///{path}")

Does that make sense?

As an aside, it may also be useful to have specific ensure methods for sqlite and/or sqlalchemy the same way you have for pandas.
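For reference, the manual workaround described above could look roughly like this. decompress_gz is a hypothetical helper, not a PyStow function, and the sketch assumes the compressed file name ends in .gz. It also never refreshes the uncompressed copy once it exists, so an updated upstream file would need extra handling.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gz(gz_path: Path) -> Path:
    """Decompress ``gz_path`` next to itself and return the new path.

    Hypothetical helper for illustration; not part of PyStow.
    """
    out_path = gz_path.with_suffix("")  # e.g. hp.db.gz -> hp.db
    if not out_path.exists():
        with gzip.open(gz_path, "rb") as src, out_path.open("wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path


# Demonstration with a small stand-in file instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "hp.db.gz"
    with gzip.open(gz_path, "wb") as file:
        file.write(b"example bytes")
    db_path = decompress_gz(gz_path)
    db_name = db_path.name
    db_bytes = db_path.read_bytes()
```

In real use, the gz_path argument would be the path returned by pystow.ensure('oaklib', 'sqlite', url=url).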

Missing type information

I just noticed that the type information of pystow is not shipped, because no py.typed file is present. See this mypy output I had in another project:

...
sylloge/base.py:28: error: Skipping analyzing "pystow": module is installed, but missing library stubs or py.typed marker  [import-untyped]
sylloge/base.py:30: error: Skipping analyzing "pystow.utils": module is installed, but missing library stubs or py.typed marker  [import-untyped]
....

This can be easily fixed by adding a py.typed file inside src/pystow.

Joining to existing pystow path

Assume I start with

workdir = pystow.join('part1', 'part2', ..., 'partn')

This guarantees that workdir will be created if it doesn't yet exist. However, if I now want to make a subfolder inside workdir, I either have to do

work_subdir = workdir.joinpath('part_nplusone')

which doesn't create the folder for me and so I have to do additional bookkeeping to make that happen or do

workdir = pystow.join('part1', 'part2', ..., 'partn', 'part_nplusone')

which is redundant.

What is the recommended approach here?
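One possible approach, not necessarily the one the maintainers recommend, is a small wrapper around pathlib that joins and creates in one step. ensure_subdir is a hypothetical name used only for this sketch.

```python
import tempfile
from pathlib import Path


def ensure_subdir(parent: Path, *parts: str) -> Path:
    """Join ``parts`` onto ``parent`` and create the resulting directory.

    Hypothetical helper for illustration; not part of PyStow.
    """
    path = parent.joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory standing in for the pystow folder.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp) / "part1" / "part2"
    work_subdir = ensure_subdir(workdir, "part_nplusone")
    created = work_subdir.is_dir()
    relative = work_subdir.relative_to(tmp).as_posix()
```

The README also describes pystow.module(...), whose join() accepts further parts, which may cover the same need without extra bookkeeping.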

Potential naming conflicts with default pystow home directory name.

Pystow does not check if there is an existing .data directory on the user's system and happily commandeers this folder even if it already exists. Since this is a very common and generic name, it is not unlikely that a user may already have a .data directory in their home folder. It is also not unlikely that a user will end up in a situation where they, or some software they have installed other than pystow, will try to place a .data directory in their home folder. I suggest changing to a less generic name such as ".pystow_data" to avoid potential naming conflicts. I think just having the possibility of changing the default folder name with an environment variable is insufficient, because the direct users of pystow are Python package developers, not Python package users. We should seek to minimize any cognitive burden or sources of surprise for end users of Python packages that use pystow.

Return types for some ensure functions

For functions like ensure_open_zip, the return value is documented as

:yields: An open file object

so I thought this would be something I can call e.g., read() on. Checking the type of the returned value, it's contextlib._GeneratorContextManager, and it took me a little while to figure out that this means I have to use with to interact with this function. Are the documentation and type annotations for these functions correct?
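The behavior can be reproduced with the standard library alone. In this sketch, open_example is a made-up stand-in for a function decorated with contextlib.contextmanager; it is not PyStow code.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_example():
    """Stand-in for a function documented with ``:yields: An open file object``."""
    file = io.StringIO("first line\nsecond line\n")
    try:
        yield file
    finally:
        file.close()


# Calling the function returns a context manager, not the file object.
returned_type = type(open_example()).__name__

# The yielded file object only becomes available inside a with block.
with open_example() as file:
    first_line = file.readline()
```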

Ensure zip file?

Hello,
I love the project. However, I am missing something like an ensure_zip_file function. While ensuring that the actual .zip is there can be done with ensure, it would be nice to have functionality where I ensure a file from this zip is there and load it as a file object, to then read it in however it is needed.
Poking around in the code, I found Module.ensure_open_zip. I think this context manager can be easily wrapped for the API to get what I need, like this:

def ensure_open_zip(
    key: str,
    *subkeys: str,
    url: str,
    inner_path: str,
    name: Optional[str] = None,
    download_kwargs: Optional[Mapping[str, Any]] = None,
):
    _module = Module.from_key(key, ensure_exists=True)
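    # Forward download_kwargs too, so the argument is not silently dropped
    return _module.ensure_open_zip(
        *subkeys, url=url, inner_path=inner_path, name=name, download_kwargs=download_kwargs
    )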

If it's fine with you I could make a PR and add it (with tests and doc of course)?

Need to create a config file before it can be written to using write_config

Related: cthoyt/zenodo-client#6

def write_config(module: str, key: str, value: str) -> None:
    """Write a configuration value.

    :param module: The name of the app (e.g., ``indra``)
    :param key: The key of the configuration in the app
    :param value: The value of the configuration in the app
    """
    _get_cfp.cache_clear()
    cfp = ConfigParser()
    path = get_home() / f"{module}.ini"
    cfp.read(path)
    cfp.set(module, key, value)
    with path.open("w") as file:
        cfp.write(file)

I needed to call cfp.add_section(module) before I could use cfp.set(module, key, value) given the file doesn't already exist.
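A minimal standalone sketch of that workaround, using only the standard library: write_config_value, the zenodo section, and the api_token key are all made up for illustration, and the function takes an explicit path instead of calling PyStow's get_home().

```python
import tempfile
from configparser import ConfigParser
from pathlib import Path


def write_config_value(path: Path, module: str, key: str, value: str) -> None:
    """Write a value to an INI file, creating the section if needed.

    Hypothetical helper for illustration; not part of PyStow.
    """
    cfp = ConfigParser()
    cfp.read(path)  # a missing file is silently ignored
    if not cfp.has_section(module):
        # Without this, cfp.set() raises NoSectionError for a new file.
        cfp.add_section(module)
    cfp.set(module, key, value)
    with path.open("w") as file:
        cfp.write(file)


# Demonstration against a file that does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "zenodo.ini"
    write_config_value(config_path, "zenodo", "api_token", "example")
    check = ConfigParser()
    check.read(config_path)
    stored = check.get("zenodo", "api_token")
```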
