clldutils's Introduction

clldutils

Utilities for programmatic data curation

Install

Install from PyPI by running

pip install clldutils

Overview

clldutils started out as a library for functionality often used in clld web apps. Over time, it turned into a toolbox for various data curation tasks, with a focus on cross-linguistic data (as reflected by modules such as clldutils.iso_639_3 or clldutils.sfm).

Design goals are

  • wide applicability of the included functionality
  • small number of dependencies, thus wide installability

API

Documentation of the package is at https://clldutils.readthedocs.io/en/latest/index.html

clldutils's People

Contributors

chrzyki, eva-dlce-zenodo, mcmtroffaes, simongreenhill, xflr6, xrotwang

clldutils's Issues

`readlines` fails on empty line when `comment` is specified

>>> from clldutils.path import readlines
>>> readlines(['', 'abc'], comment='#')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./lib/python2.7/site-packages/clldutils/path.py", line 98, in readlines
    res = [None if l.startswith(comment) else l for l in res]
AttributeError: 'NoneType' object has no attribute 'startswith'
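
One way to avoid this kind of error is to filter empty lines and comment lines in a single pass, so that `startswith` is never called on `None`. The following is a minimal sketch, not the actual `clldutils.path.readlines` code; the helper name `strip_lines` and its behaviour of dropping empty lines are assumptions made for illustration.

```python
def strip_lines(lines, comment=None):
    """Strip whitespace, then drop empty lines and (optionally) comment lines.

    Hypothetical helper, not part of clldutils.
    """
    res = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines before any comment check
        if comment and line.startswith(comment):
            continue  # skip commented lines
        res.append(line)
    return res


print(strip_lines(['', 'abc', '# note'], comment='#'))
```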

Add `apilib` module

Add an apilib module providing basic boilerplate functionality for implementing API objects for data repositories. In particular, there should be

  • commonly applicable validators for data objects based on the attr package,

def valid_enum_member(choices, instance, attribute, value):
    if value not in choices:
        raise ValueError(value)


def valid_range(min_, max_, instance, attribute, value):
    if value < min_ or value > max_:
        raise ValueError(value)

  • a base class for API objects, taking a path to a directory as its single argument.

class API(UnicodeMixin):
    def __init__(self, repos):
        self.repos = Path(repos)

    def __unicode__(self):
        return '<{0} repository {1} at {2}>'.format(self.repos.name, git_describe(self.repos), self.repos)
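
The validators proposed above take extra leading arguments (`choices`, or `min_` and `max_`), while attrs calls a validator with exactly three positional arguments: the instance, the attribute and the value. A sketch of how the two could be bridged with `functools.partial` follows. The name `valid_latitude` is hypothetical, and the validator is called directly here, with `None` standing in for the instance and attribute, so that the example runs without attrs installed.

```python
from functools import partial


def valid_range(min_, max_, instance, attribute, value):
    if value < min_ or value > max_:
        raise ValueError(value)


# Bind the bounds; the resulting callable takes (instance, attribute, value).
valid_latitude = partial(valid_range, -90, 90)  # hypothetical name

# A value inside the range passes silently.
valid_latitude(None, None, 45.0)

# A value outside the range raises ValueError.
try:
    valid_latitude(None, None, 120.0)
except ValueError as e:
    print('rejected: {0}'.format(e))
```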

Add `clldutils.path.readlines` function

A function clldutils.path.readlines that returns a list of lines read from a file would be useful, in particular as a place to add functionality like stripping "commented" lines. Combined with the ability of clldutils.dsv.reader to read csv from lines, this would go a long way towards replacing e.g. LingPy's "qlc" format.
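
The combination described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the clldutils implementation: `read_data_lines` and `read_rows` are hypothetical names, the stdlib `csv.reader` stands in for `clldutils.dsv.reader`, and tab-separated input is assumed.

```python
import csv
import io


def read_data_lines(path, comment='#', encoding='utf-8'):
    """Return the lines of a file, without blank lines and comment lines."""
    with io.open(path, encoding=encoding) as fp:
        lines = [line.rstrip('\r\n') for line in fp]
    return [l for l in lines if l.strip() and not l.startswith(comment)]


def read_rows(path, **kw):
    """Parse the remaining lines as tab-separated values."""
    # csv.reader accepts any iterable of strings, not only file objects.
    return list(csv.reader(read_data_lines(path, **kw), delimiter='\t'))
```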

Markdown aware wrap function

A text wrapper which is aware of markdown tables (i.e. leaves them alone) would be useful (e.g. for glottolog).
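
A rough sketch of such a wrapper, built on `textwrap` from the standard library. It assumes that blocks are separated by blank lines and that a block containing any line starting with `|` is a table; both are simplifications, and `wrap_markdown` is a hypothetical name.

```python
import textwrap


def wrap_markdown(text, width=80):
    """Wrap paragraphs, but pass through blocks that contain table rows."""
    out = []
    for block in text.split('\n\n'):
        lines = block.split('\n')
        if any(line.lstrip().startswith('|') for line in lines):
            out.append(block)  # table: leave untouched
        else:
            out.append(textwrap.fill(' '.join(lines), width=width))
    return '\n\n'.join(out)
```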

with temp dir context manager

Since tempfile.TemporaryDirectory is not available in python 2.7, we should provide a temporary directory as context manager.
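
A minimal sketch of such a context manager, using only `tempfile.mkdtemp` and `shutil.rmtree`, which are available in both Python 2.7 and 3. The name `temp_dir` is an assumption, not necessarily the name used in clldutils.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_dir(**kw):
    """Create a temporary directory and remove it when the block exits."""
    path = tempfile.mkdtemp(**kw)
    try:
        yield path
    finally:
        # Clean up even if the body of the with-statement raised.
        shutil.rmtree(path, ignore_errors=True)


with temp_dir() as d:
    print(os.path.isdir(d))
```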

Factor out csvw sub-package

The csvw sub-package should be turned into its own proper package (maybe requiring clldutils), since it contains a well-defined set of functionality that could be useful outside the context of clld apps.

Switch import of pathlib

To make sure we use pathlib from the standard library for Python 3.5, the import order in clldutils.path should be switched to first try the standard lib module and fall back to pathlib2 on ImportError.
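
The import order described above would look roughly as follows (a sketch; the alias `Path` is added for illustration):

```python
try:
    import pathlib  # standard library since Python 3.4
except ImportError:
    import pathlib2 as pathlib  # backport, e.g. for Python 2.7

Path = pathlib.Path
```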

Support for simple download of ISO 639-3 code tables

Currently, download and unpacking are baked into the same function iter_tables. Since the ISO object can be instantiated by pointing it to a local path with the zipped tables, it would be good if these could be downloaded easily.

Table.write does not know how to deal with foreign keys

Foreign key specifications do not have a .header (instead, they can be multiple columns, described by columnReference, if I understand http://w3c.github.io/csvw/metadata/#dfn-foreign-key-definition correctly). I think that one also says that other properties (.virtual) should not exist?

def write(self, items, fname=NO_DEFAULT):

Additionally, Table.write passes data writing to its columns' .write method. Foreign keys don't implement that method, and it might be hard to implement anyway because in the general case it outputs more than one column, if I get it correctly.

`clilib.ArgumentParser.main` should print usage when invalid command is passed

Instead of raising KeyError when encountering an invalid command name (see below), clilib.ArgumentParser.main should print usage and an error message.

(pycldf)dlt5502178l:~/venvs/pycldf/pycldf$ cldf pycldf/tests/fixtures/ds1.csv
Traceback (most recent call last):
  File "/home/shh.mpg.de/forkel/venvs/pycldf/bin/cldf", line 9, in <module>
    load_entry_point('pycldf==0.5.2', 'console_scripts', 'cldf')()
  File "/home/shh.mpg.de/forkel/venvs/pycldf/pycldf/pycldf/cli.py", line 88, in main
    sys.exit(parser.main())
  File "/home/shh.mpg.de/forkel/venvs/pycldf/local/lib/python2.7/site-packages/clldutils/clilib.py", line 35, in main
    self.commands[args.command](args)
KeyError: 'pycldf/tests/fixtures/ds1.csv'
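
The requested behaviour can be sketched with plain `argparse`. This is not the `clilib.ArgumentParser` code: the function `main`, the `commands` mapping, the program name `cli` and the exit status 64 are all assumptions for illustration.

```python
import argparse
import sys


def main(commands, argv=None):
    """Dispatch to a command; print usage instead of raising KeyError."""
    parser = argparse.ArgumentParser(prog='cli')
    parser.add_argument('command', help=' | '.join(sorted(commands)))
    parser.add_argument('args', nargs=argparse.REMAINDER)
    parsed = parser.parse_args(argv)

    func = commands.get(parsed.command)
    if func is None:
        # Unknown command: report it and show usage, with a non-zero status.
        parser.print_usage(sys.stderr)
        sys.stderr.write('invalid command: {0}\n'.format(parsed.command))
        return 64
    return func(parsed.args) or 0
```

A caller would then write `sys.exit(main(commands))`, as in the traceback above.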

Remove testing module or support testing with pytest

In the medium term, we aim to switch all our code to pytest for testing rather than the unittest/nose combo. Thus, most of the code in testing will be obsolete or should be reimplemented to work in the pytest paradigm.

context manager to temporarily adapt `sys.path`

For several projects we want to be able to import custom code that is not within a python package. The most convenient (and portable) way to do this seems to be via a context manager, adapting sys.path appropriately:

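import sys
from contextlib import contextmanager

@contextmanager
def with_sys_path(d):
    p = d.as_posix()
    sys.path.append(p)
    try:
        yield
    finally:
        # Remove the entry again, even if the body raised.
        if sys.path[-1] == p:
            sys.path.pop()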

Alternatively, the complete functionality, i.e. importing a module by filesystem path could be provided:

        with with_sys_path(path.parent):
            return import_module(path.name)
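
A self-contained sketch of the second alternative follows. The function name `import_module_from_path` is hypothetical. Note one deviation from the fragment above: for a file such as `foo.py` the module name is `path.stem` (without the `.py` suffix), whereas `path.name` only works when `path` is a package directory.

```python
import sys
from contextlib import contextmanager
from importlib import import_module
from pathlib import Path


@contextmanager
def with_sys_path(d):
    """Temporarily append directory d to sys.path."""
    p = Path(d).as_posix()
    sys.path.append(p)
    try:
        yield
    finally:
        # Remove our entry again, even if the import failed.
        if p in sys.path:
            sys.path.remove(p)


def import_module_from_path(path):
    """Import a module (foo.py) or package (directory) given its path."""
    path = Path(path)
    with with_sys_path(path.parent):
        # path.stem equals path.name for directories without a suffix.
        return import_module(path.stem)
```

Because the directory is appended rather than prepended, a module of the same name found earlier on `sys.path` takes precedence.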

`clilib` allow custom names for sub-commands passed to ArgumentParser

Right now, subcommand names for clilib.ArgumentParser can only be function names. Thus, reserved words like class are impossible and names like list would cause shadowing of built-ins where the function is defined. So there should be a way to register callables as commands under custom names.
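
One possible shape for such a registration mechanism is a decorator that stores the callable under an explicit name. This is a sketch only; `CommandRegistry` and its `command` method are hypothetical and not part of `clilib`.

```python
class CommandRegistry(dict):
    """Map command names to callables."""

    def command(self, name=None):
        """Register the decorated function under name (default: its own name)."""
        def decorator(func):
            self[name or func.__name__] = func
            return func
        return decorator


commands = CommandRegistry()


@commands.command('list')
def list_(args):
    # Registered as "list" without shadowing the built-in in this module.
    return 'listing {0}'.format(args)


@commands.command('class')
def class_(args):
    # "class" is a reserved word, so it can only be used as a custom name.
    return 'classifying {0}'.format(args)
```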

Add functionality to capture stdout in tests

In clldutils.testing we should add functionality to capture stdout, along the lines of

import sys
from contextlib import contextmanager

from six import StringIO

@contextmanager
def capture(func, *args, **kw):
    out, sys.stdout = sys.stdout, StringIO()
    try:
        func(*args, **kw)
        sys.stdout.seek(0)
        yield sys.stdout.read()
    finally:
        # Restore the real stdout even if func or the with-body raises.
        sys.stdout = out

Support passing encoding to `INI.from_file`

Currently it is not possible to specify an encoding for ini files to be read in INI.from_file. On Windows, the encoding used seems to default to cp1252, which typically isn't what we want.
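
With the Python 3 standard library, an explicit encoding can be applied by opening the file first and handing the file object to the parser. The sketch below uses `configparser` directly; `read_ini` is a hypothetical stand-in and does not reflect the actual signature of `INI.from_file`.

```python
import configparser
import io


def read_ini(path, encoding='utf-8'):
    """Read an ini file with an explicit encoding instead of the platform default."""
    parser = configparser.ConfigParser()
    with io.open(str(path), encoding=encoding) as fp:
        parser.read_file(fp)
    return parser
```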

Add info about local use codes

Some three-letter codes are reserved for local use by ISO 639-3. This information should be available programmatically, e.g. as a list:

from string import ascii_lowercase
local_use = ['q' + x + y for x in ascii_lowercase[:ascii_lowercase.index('t') + 1] for y in ascii_lowercase]
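
The list above covers the range qaa to qtz, i.e. 20 × 26 = 520 codes. For membership tests a set may be more convenient; a small sketch, in which `LOCAL_USE` and `is_local_use` are hypothetical names:

```python
from string import ascii_lowercase

# All codes from "qaa" to "qtz": second letter a..t, third letter a..z.
LOCAL_USE = frozenset(
    'q' + x + y
    for x in ascii_lowercase[:ascii_lowercase.index('t') + 1]
    for y in ascii_lowercase)


def is_local_use(code):
    """Return True if code lies in the range reserved for local use."""
    return code in LOCAL_USE
```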

iterdicts has no docstring

The method Table.iterdicts() has no docstring, but it's not entirely obvious what it does. I suggest the following.

def iterdicts(self, log=None, with_metadata=False, fname=None):

"""Iterate over the rows of the table

Create an iterator that maps the information in each row to a `dict` whose keys are the column names of the table and whose values are the values in the corresponding table cells, or for virtual columns (which have no values) the valueUrl for that column. This includes columns not specified in the table specification.

Parameters
----------
log: Logger object (default None)
    The object that reports parsing errors. If none is given, parsing errors raise ValueError instead.
with_metadata: bool (default False)
    Also yield fname and lineno
fname: file-like, pathlib.Path, or str (default None)
    The file to be read. Defaults to inheriting from a parent object, if one exists.

Yields
------
fname: str (only if with_metadata)
lineno: int (only if with_metadata)
row: dict
"""

Allow nullable foreign keys in csvw

We want to allow foreign keys in tables to be optional, marked by a NULL value of the referencing column(s). The current behaviour can still be had by making the column(s) required.
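
The intended semantics can be illustrated independently of the csvw code. In this sketch rows are plain dicts, `None` plays the role of NULL, and `check_foreign_key` is a hypothetical helper rather than an existing API: a NULL reference is accepted unless the column is required.

```python
def check_foreign_key(rows, column, valid_keys, required=False):
    """Yield (row_number, value) for each row that violates the reference."""
    for i, row in enumerate(rows, start=1):
        value = row.get(column)
        if value is None:
            # NULL is only a violation if the column is required.
            if required:
                yield i, value
            continue
        if value not in valid_keys:
            yield i, value


rows = [{'lang': 'a'}, {'lang': None}, {'lang': 'x'}]
print(list(check_foreign_key(rows, 'lang', {'a'})))
print(list(check_foreign_key(rows, 'lang', {'a'}, required=True)))
```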

Add `write` method to `dsv_metadata.Table`

The idea is to turn the metadata support into a mechanism to reformat csv files, e.g. to reformat dates one could do

>>> tg = TableGroup(...)
>>> items = list(tg.tables[0])
>>> tg.tables[0].schema.columns[0].datatype.format = 'yyyy-MM-dd'
>>> tg.tables[0].write(items)

Remove the \ufeff character from any file that is read?

This is the 10th time I have run into this issue: I read in a file, search for a header, get an error, and only at the end find out why the header (in this case "ID") was not found: the first cell was \ufeffID, i.e. the header was preceded by a byte order mark.

This happened in the concepticon-api, which uses clldutils. Given that this usually leads to a long debugging session, I wonder whether we should remove this character immediately if it occurs at the start of the header. Or are there other preferred ways of dealing with it?
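
For UTF-8 files, one option that needs no special-casing is Python's `utf-8-sig` codec, which drops a leading byte order mark when decoding and behaves like plain UTF-8 otherwise. A minimal sketch using the stdlib `csv` module; `read_header` is a hypothetical helper, not a clldutils function.

```python
import csv
import io


def read_header(path):
    """Return the first row of a CSV file, ignoring a leading BOM."""
    # "utf-8-sig" strips \ufeff at the start of the file if it is present.
    with io.open(path, encoding='utf-8-sig', newline='') as fp:
        return next(csv.reader(fp))
```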
