octue / octue-sdk-python

The Python SDK for @Octue services and digital twins.

Home Page: https://octue.com

License: Other

Languages: Python 98.75%, Dockerfile 0.42%, HCL 0.83%

Topics: data, data-service, data-service-development-kit, data-services, digital-twin, digital-twin-application, digital-twin-web, digital-twins, microservice, microservices, python, python3, renewable-energy, renewables, sdk, sdk-python, wind-energy, wind-energy-analytics

octue-sdk-python's People

Contributors

cortadocodes, dependabot[bot], thclark

octue-sdk-python's Issues

Service Resource and interaction methods

  • Create a new app template in the SDK which shows how to access/call a child resource from a parent.
    • This will actually be two apps: child-example and parent-example (or call them something more interesting).
Site (parent, local service): accepts a set of lat/lon location pairs as input and returns site characteristics as output: an average wind speed for each location, and an elevation in metres above WGS84 for each location.
    |
    | -> Atmosphere (child1, local service)
    | -> Elevation (child2, remote service)
  • Specify the children requirements in the Site's twine, and specify their details in a children JSON string/file that gets parsed on run (see the sketch after this list).

    • Details should contain a URI of a child service which is accessible.
    • Child resources should be tagged and keyed as per the twine file, and should have a method which allows them to be asked questions.
  • Add a Child Service Resource which is instantiated (analogous to input_manifest), from this JSON specification of the child services.

  • In the templates, attempt to use what feels like the most appropriate API for asking questions of the children. This will probably invoke a method; perhaps something like:

    analysis.children['elevation-service-or-whatever-key-you-gave-it-in-the-twine'].ask(input_values={
        "locations": [[0, 0], [51, 0]]
    })

    The method, which is fundamentally async, should invoke the same pattern as the tests of django-twined.

  • Figure out how to run child services in their own virtual environment using tox, just like pre-commit does.

  • This will solve #57
  • Need to build and install the virtual environments, rebuilding on dependency changes.
  • Ideally store somewhere so that we don't have to recreate at each invocation.
  • And have a method for deleting them all when everything goes to hell
  • It may be that the analysis.children object is actually a ServiceManager that makes sure they're available and created, or perhaps the Service resource does that itself.
  • Establish a way of communicating between the children, to exchange messages

    • Possible refactor of ReelMessage from twined-server into twined or into here for the purpose of structuring messages
    • Possible ways of communicating messages locally:
    • Should be abstracted so that we can switch out to using cloud-based pubsub for online stuff as opposed to local.
  • Figure out how to configure local services!!! Possibly by adding configuration to the children json file?
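
A rough sketch of what such a children file might contain and how it might be parsed; the key names, the uri field and the local:// scheme are assumptions rather than a settled format:

import json

# Hypothetical children specification: each entry's key mirrors a child key declared
# in the parent's twine, and "uri" points at wherever that child service is reachable.
CHILDREN_JSON = """
[
    {"key": "atmosphere", "uri": "local://children/atmosphere", "configuration_values": {}},
    {"key": "elevation", "uri": "https://example.com/services/elevation", "configuration_values": {}}
]
"""

children = {child["key"]: child for child in json.loads(CHILDREN_JSON)}
print(children["elevation"]["uri"])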

Optimise log messages

Some of the log messages interpolate variables even when the message isn't emitted. We can avoid this, saving a lot of processing, by not interpolating ourselves and instead using percent-notation, passing the variables to e.g. logger.info as extra positional arguments, i.e.

a = 3
logger.info('This is %s', a) 

rather than

a = 3
logger.info(f'This is {a}') 

This applies to all repos using python logging.

CoolNameable Mixin

Use the coolname library to automatically generate slugified names like thine-mega-thingy via a mixin, which could subclass the Nameable mixin. This would allow users to create resources whose names are recognisable by default.

Here's some code from amy which uses coolname (requires pip install coolname); it's designed as a Django model mixin, but can be refactored simply into an SDK library mixin (see the sketch after this snippet).

from coolname import generate_slug


class CoolNamed:
    """ Sets the 'name' field (if not already populated) using a cool name string and the id field
    """

    def save(self, *args, **kwargs):
        if hasattr(self, 'name') and self.name is None:
            id_appendix = ''
            if hasattr(self, 'id') and self.id is not None:
                id_appendix = '-' + str(self.id)[:7]

            self.name = '{coolname}{id}'.format(coolname=generate_slug(2), id=id_appendix)

        super(CoolNamed, self).save(*args, **kwargs)
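
A minimal sketch of how this might look as a plain SDK mixin rather than a Django model mixin, assuming the resource exposes name and id attributes:

from coolname import generate_slug


class CoolNameable:
    """Mixin that populates a missing 'name' with a cool slug, optionally suffixed with the first part of the id."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if getattr(self, "name", None) is None:
            id_appendix = ""
            if getattr(self, "id", None) is not None:
                id_appendix = "-" + str(self.id)[:7]
            self.name = generate_slug(2) + id_appendix


print(CoolNameable().name)  # e.g. 'adorable-falcon'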

JSONSchema DeprecationWarning raised when running tests

The following DeprecationWarning is raised when running the tests:

/repos/octue-sdk-python/venv/lib/python3.8/site-packages/jsonschema/validators.py:928: DeprecationWarning: The metaschema specified by $schema was not found. Using the latest draft to validate, but this will raise an error in the future.
  cls = validator_for(schema)

We should address it at some point if it's easy to do so.
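
If it does turn out to be easy, the likely fix is either adding an explicit $schema key to the schema in question, or selecting a validator class directly instead of relying on validator_for's fallback. A hedged sketch of the latter:

from jsonschema import Draft7Validator

# Selecting a draft explicitly avoids the "metaschema not found" fallback warning.
schema = {"type": "object", "properties": {"name": {"type": "string"}}}
Draft7Validator.check_schema(schema)
Draft7Validator(schema).validate({"name": "example"})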

Possible issues getting secret values into tox tests

Bug report

What is the expected behavior?

Using:

env:
    GCP_SERVICE_ACCOUNT: {{ secrets.GCP_SERVICE_ACCOUNT }}

in the tox testing action should lead to the value of that being in the environment ready for testing.

What is the current behavior?

It doesn't

Hints

Over in windquest we were able to create a .env file containing secrets - using the following action step:

      - name: Create .env File (enables injection of operational secrets into test container)
        uses: SpicyPizza/create-envfile@v1
        with:
          envkey_GITHUB_ACTIONS: True
          envkey_CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
          envkey_DJANGO_SECRET_KEY: ${{ secrets.DJANGO_SECRET_KEY }}
          envkey_DJANGO_SECURE_SSL_REDIRECT: True
          envkey_GOOGLE_APPLICATION_ASSETS_BUCKET_NAME: windquest-assets-test-github-actions
          envkey_GOOGLE_APPLICATION_CREDENTIALS_JSON: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_JSON }}
          envkey_GOOGLE_MAP_API_KEY: ${{ secrets.GOOGLE_MAP_API_KEY }}
          envkey_KOMBU_FERNET_KEY: ${{ secrets.KOMBU_FERNET_KEY }}
          envkey_MAILGUN_API_KEY: ${{ secrets.MAILGUN_API_KEY }}
          envkey_MAILGUN_SENDER_DOMAIN: mailgun.wind-pioneers.com
          file_name: .env


Analysis attributes are None when they shouldn't be


Bug report

What is the current behavior?

analysis.input_dir is None
>>> True
analysis.data_dir is None
>>> True
analysis.output_dir is None
>>> True
output_manifest is None
>>> True

What is the expected behavior?

  • Analysis *_dir attributes should either not be available or should not be None (i.e. they should be correct)
  • Where a twine file has an output_manifest strand, the output manifest should be pre-created.
  • There should be an example of their use in either the documentation or the demo apps.

Other information

@time-trader please could you attach the twine file for this case that we discussed this afternoon?

CLI raises TypeError when running an app from IDE

While running an octue 0.1.3 app from an IDE:

python app.py run

Traceback (most recent call last):
  File "app.py", line 56, in <module>
    octue_cli(args)
  File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 1256, in invoke
    Command.invoke(self, ctx)
  File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
TypeError: octue_cli() missing 3 required positional arguments: 'data_dir', 'input_dir', and 'tmp_dir'

Complete field lookups

The Dataset.get_files method allows only a very limited set of field lookups.

Implement a complete set like this (a sketch of how a lookup table might back these follows the list):

  • exact
  • iexact
  • contains
  • icontains
  • in
  • gt
  • gte
  • lt
  • lte
  • startswith
  • istartswith
  • endswith
  • iendswith
  • range (of integers)
  • date
  • range (of date)
  • year_equals
  • year_in... etc
  • iso_year
  • month
  • day
  • week
  • week_day
  • iso_week_day
  • quarter
  • time
  • hour
  • minute
  • second
  • isnull
  • regex
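
A minimal sketch of how a lookup table could back these; the names and dispatch are illustrative, not the final get_files API:

import re

# Hypothetical mapping from lookup suffixes to predicates; get_files would split
# "field__lookup", pull the field value off each Datafile and apply the predicate.
LOOKUPS = {
    "exact": lambda value, target: value == target,
    "iexact": lambda value, target: str(value).lower() == str(target).lower(),
    "contains": lambda value, target: target in value,
    "icontains": lambda value, target: str(target).lower() in str(value).lower(),
    "in": lambda value, target: value in target,
    "gt": lambda value, target: value > target,
    "gte": lambda value, target: value >= target,
    "lt": lambda value, target: value < target,
    "lte": lambda value, target: value <= target,
    "startswith": lambda value, target: str(value).startswith(target),
    "endswith": lambda value, target: str(value).endswith(target),
    "range": lambda value, target: target[0] <= value <= target[1],
    "isnull": lambda value, target: (value is None) == target,
    "regex": lambda value, target: re.search(target, str(value)) is not None,
}


def matches(value, field_lookup, target):
    """Apply a single 'field__lookup' comparison to a value."""
    lookup = field_lookup.rsplit("__", 1)[-1] if "__" in field_lookup else "exact"
    return LOOKUPS[lookup](value, target)


print(matches("results.csv", "name__endswith", ".csv"))  # True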

Dataset has no len()


Feature request

Use Case

I'm creating a dataset to match NASA's Digital Elevation Model, and want to cross-check that I've added all 22911 files to it.

Current state

print(dataset)
>>> Dataset 7430991f-bf5e-4e17-8a42-b44e02495717

len(dataset)
>>> Traceback (most recent call last):
>>>   File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
>>>     exec(exp, global_vars, local_vars)
>>>   File "<input>", line 1, in <module>
>>> TypeError: object of type 'Dataset' has no len()

What is the expected behavior?

len(dataset) should return the number of files in the dataset.

Workaround:

print(len(dataset.files))
>>> 22911
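
The likely fix is a __len__ that delegates to the file collection; a minimal sketch, assuming Dataset keeps its files in a files attribute as the workaround suggests:

class Dataset:
    """Sketch only: the real class has many more attributes."""

    def __init__(self, files=None):
        self.files = files or []

    def __len__(self):
        # len(dataset) delegates to the underlying file collection.
        return len(self.files)


assert len(Dataset(files=["a.csv", "b.csv"])) == 2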

Easier pass-through of files to child services

Feature request

Use Case

A parent service may need to hand off individual files within its manifest to child services.

It would be nice if we had a helper to pass files directly through to twins (possibly by specifying their path within the input_values schema or similar).

Current state

Currently, you would create a new Manifest for running the child service, with a single Dataset containing the Datafiles required.

You'd then pass that manifest to the child service so it could access the files it requires.

But, this is a cumbersome way of passing files through to children.

Workaround

@time-trader, when working locally (i.e. with all the data files on one machine), is working around this by specifying the path to the file directly as an input_value, but this won't scale to a system where datafiles aren't necessarily local.

Documentation embedded links don't work on ReadTheDocs

Bug report

What is the current behavior?

The links embedded in the ReadTheDocs documentation aren't working (possibly because they're pointing to .rst rather than .html files).

What is the expected behavior?

The links link to the expected page on ReadTheDocs.

Ensure FileNotFound exceptions are consistent with the twine

@cortadocodes In the CLI at present for release/0.1.4 we have

    for directory in (config_dir, input_dir):
        for filename in (VALUES_FILENAME, MANIFEST_FILENAME):
            if not file_in_directory(filename, directory):
                raise exceptions.FileNotFoundException(f"No file named {filename} found in {directory}.")

    # ... Runner ...


def file_in_directory(filename, directory):
    return os.path.isfile(os.path.join(directory, filename))

However, I think (!!! getting back into this after a couple of months focused elsewhere!) this can be handled at the twined level in the validate() method. The default behaviour is that if the src (in this case a file path) is given but that strand is not in the twine, an error is thrown; if it's not given and not required by the twine, validation passes. However, it is possible to pass allow_extra so that we can provide that src even when the twine is empty, without validation throwing an error (it simply won't load the file).

Whereas this early check requires those files to be present regardless of whether they're required by the twine. So I'll comment this out for the time being in case it gives @time-trader any trouble using this new release.

TODO - write a unit test to make sure I'm actually correct in this and that twined will vomit when it's supposed to; no more, no less.

Deprecate python3.7

Caching properties

functools.cached_property is only available in python3.8 and above. To retain compatibility with python3.6 and python3.7, we've used this to the same effect:

@property
@functools.lru_cache(maxsize=None)

When we deprecate python3.7, we should update this to @functools.cached_property.
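
A minimal before/after sketch of that change:

import functools


class Analysis:
    # Current pattern, compatible with Python 3.6/3.7: an lru_cache-backed property.
    @property
    @functools.lru_cache(maxsize=None)
    def expensive_summary_legacy(self):
        return sum(range(10_000))

    # Once Python 3.7 support is dropped, the same effect with less machinery (3.8+ only):
    @functools.cached_property
    def expensive_summary(self):
        return sum(range(10_000))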

Dramatiq actors - enable direct use as a worker

UPDATE: We're not doing it this way yet. Redis isn't suitable for direct connection from a potentially malicious source because it doesn't have auth built in. We need a solution we can auth against directly, e.g. pub/sub, or to go straightforwardly via django-twined.

A step toward solving octue/twined-server#2 and octue/django-twined#1 is to decorate an entrypoint to the app as dramatiq actor.

That way, workers can be run in any environment with any dependency set, picking tasks off of the broker... tasks which can be put onto the broker by any other app or worker or server with a differing environment.

Essentially this should allow us to call something like (from dramatiq -h):

  # Run dramatiq workers with actors defined in `./octue/actors.py`, and a broker named "redis_broker" defined in "octue.brokers", listening only to the "app-appname-version-0.0.1" queue.
  $ dramatiq octue.brokers:redis_broker octue.actors -Q app-appname-version-0.0.1

order_by filtering for Dataset files

The Dataset method get_file_sequence(self, field_lookup, files=None, filter_value=None, strict=True) does some vanilla sorting on the sequence key. However, related to #6 it should be possible to provide more generic ordering on the queryset.

Clean up cli testing to avoid race conditions

Tests that run the CLI create output directories that currently need to be deleted manually.

That's mildly annoying, but indicative of a deeper concern: race conditions may exist between tests if they're executed in parallel using the default directory.

Use or adapt the callCli method of BaseTestCase throughout to ensure that CLI calls are done in their own tempfile.TemporaryDirectory context, ensuring that artifacts are automatically cleaned up and that tests are deterministic.

Queryset for dataset files - chainable filters

Refactor the Dataset.get_files method to behave like Django's QuerySet.

This would return an object that acts like a chainable filter, so you can do something like:

       my_dataset.filter('sequence__not', True).filter('extension', 'csv').filter('posix_timestamp__between', [123456, 135790]).all()

The final all() method should yield a generator. Basically, we're talking about Django's filtering here!

>>> fs = Dataset.filter(criterion=whatever).filter(othercriterion=whatever2)
>>> type(fs)
FilteredSet

Note: We have TagGroups, but Datasets. Make TagGroup => Tagset.
Note 2: Datasets are currently lists, not sets. They probably should be sets.

Potential FilterSet methods

  • .as_object() creates a new object, e.g. a Dataset, from the FilteredSet
  • .filter(criterion=something) returns a new FilterSet with additional filter criteria applied
  • .order_by(field)
  • yield to iterate through the results
  • .apply() applies a callable to the yielded results

Filter Criteria

Implement at first some basics (see #7 for potential further criteria to implement in future)
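
A minimal sketch of the chainable object; the FilterSet name, predicate-based filtering and method names are placeholders rather than the final API:

class FilterSet:
    """Sketch of a lazily-evaluated, chainable filter over a collection of files."""

    def __init__(self, items, predicates=None):
        self._items = items
        self._predicates = predicates or []

    def filter(self, predicate):
        # Each call returns a new FilterSet with one more predicate, applied lazily.
        return FilterSet(self._items, self._predicates + [predicate])

    def order_by(self, key):
        return FilterSet(sorted(self._items, key=key), self._predicates)

    def __iter__(self):
        return (item for item in self._items if all(p(item) for p in self._predicates))

    def all(self):
        # Yield the filtered results as a generator, as described above.
        yield from self


files = FilterSet(["a.csv", "b.txt", "c.csv"])
print(list(files.filter(lambda f: f.endswith(".csv")).all()))  # ['a.csv', 'c.csv']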

Tests failing due to dependency

Unit tests all pass when run in series, but in Travis two tests fail.

These are two tests of the run command of the CLI that pass by themselves, but not when run in series with test_fractal_configuration. The two tests pass when run in series with all other tests (i.e. when test_fractal_configuration is disabled).

@cortadocodes thinks it's something to do with paths being leaked from the template setting and making their way into other tests using AppFrom.

Failure message, on Python 3.6.3

======================================================================
FAIL: test_run_command_can_be_added (tests.test_cli.RunnerTestCase)
Test that an arbitrary run command can be used in the run command of the CLI.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/octue/octue-sdk-python/tests/test_cli.py", line 41, in test_run_command_can_be_added
    assert CUSTOM_APP_RUN_MESSAGE in result.output
AssertionError
======================================================================
FAIL: test_run_command_works_with_data_dir (tests.test_cli.RunnerTestCase)
Test that the run command of the CLI works with the --data-dir option.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/octue/octue-sdk-python/tests/test_cli.py", line 55, in test_run_command_works_with_data_dir
    assert CUSTOM_APP_RUN_MESSAGE in result.output
AssertionError
----------------------------------------------------------------------

Remove tmp-dir

I'm considering removing the tmp-dir functionality, which @cortadocodes has pointed out is presently unhooked in the 0.1.4 release (#25)

The reason is that a unique tmp_dir needs to be assigned for each analysis in order to remain thread-safe. Although it could be useful as a cache between runs, I believe the expectation of something named like this would be that it's temporary to that run (so for example listing all the files created in tmp_dir would be deterministic for a given analysis).

In which case, it would be more helpful to our users if we manage this for them by creating a temporary directory (tempfile.TemporaryDirectory) scoped to the analysis. Then, if caches are required to cross analyses, we could handle that in a more sensible way, providing a helpful API and docs specifically for this.

@time-trader are you using tmp_dir for storing temporary files anywhere? If so I'll implement this change carefully to avoid breaking your code. Otherwise I'll use a sledgehammer.

Aggregation functions on the get_files method of Dataset

Django queryset syntax allows aggregation as part of the queryset, e.g. Avg, Count, Max, Min, StdDev, Sum and Variance, each accepting expressions, output_field, filter and **extra.

Could consider enabling this to apply expressions to queryset (e.g. in the Dataset.get_files method) as is being developed in #6

For example, it'd be good to query for a subset of the files, then .apply() a function to them, e.g.

subset = Dataset.get_files(has_tag="something", extension="csv", whatever="whatever")  # or whatever the query syntax is to get a data subset

subset.apply_to_all_files(lambda file: print(file.hash_value), asynchronous=True)

This will allow Scientists to

  • easily get a subset of the dataset in order to access files / do work on files (sequentially or in parallel)

This will allow us to

  • more straightforwardly implement helpers to merge/split/recreate datasets (e.g. enables issue #4 )

Merge and split datasets

Add methods to the Dataset() class which

  • Merge multiple datasets
  • Split a dataset based on a criterion.
  • Map-reduce a dataset based on a filter (see #5)

Ultimately I think this'll become quite useful, but we don't have anybody desperately needing it right now. Any views on whether this is a worthwhile feature?

Likely to become easy to implement if #6 is properly solved.
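
Roughly what the two methods might look like; a sketch only, since the real Dataset class is much richer than this:

class Dataset:
    """Sketch only."""

    def __init__(self, files=None):
        self.files = list(files or [])

    @classmethod
    def merge(cls, *datasets):
        # Concatenate the files of several datasets into a new one.
        return cls(files=[f for dataset in datasets for f in dataset.files])

    def split(self, criterion):
        # Partition into two datasets: files matching the criterion and the rest.
        matching = type(self)(files=[f for f in self.files if criterion(f)])
        rest = type(self)(files=[f for f in self.files if not criterion(f)])
        return matching, rest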

Update SDK readme and ensure that documentation gets served


Bug report

What is the current behavior?

What is the expected behavior?

Docs build, readme is relevant and has correct links!

We recognise the wider issue that documentation will be moved around substantially in #69, but at present we just want to get what we have serving correctly.

Log analysis to socket

When setting up the analysis logging, optionally create an additional handler to send log entries to a websocket, whose URI could be passed as a CLI option and then into the runner.

This will enable the logs from the analyses to be streamed back to the end user.
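
A rough sketch of such a handler, assuming the third-party websocket-client package; the package choice and connection handling are placeholders:

import logging

from websocket import create_connection  # assumption: the websocket-client package


class WebsocketLogHandler(logging.Handler):
    """Send each formatted log record over a websocket connection."""

    def __init__(self, uri):
        super().__init__()
        self._connection = create_connection(uri)

    def emit(self, record):
        try:
            self._connection.send(self.format(record))
        except Exception:
            self.handleError(record)


# Usage: added alongside the existing analysis handlers when a URI is passed in, e.g.
# logger.addHandler(WebsocketLogHandler("ws://localhost:8765"))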

@cortadocodes how do you feel about taking this one on?

Testing

I'm not sure how to create a websocket for testing purposes but it must be doable.

Review: normalising path splitter respecting absolute paths

From amy.utils.path we have a path splitter which retains the initial '/' if present in a path.

Review and refactor into the SDK if helpful.

import os.path
import sys


def split(path, max_depth=1):
    """
    http://nicks-liquid-soapbox.blogspot.co.uk/2011/03/splitting-path-to-list-in-python.html
    :param path:        str     Path to split
    :param max_depth:   int     Recursion limit (max number of directories in the path to be split out), default 1.
                                Setting max_depth=1 gives the normal (head, tail) result consistent with
                                os.path.split(path), whilst setting max_depth=None allows recursion to the system limit
    :return:            list    The path components, retaining a leading '/' if present
    """

    if max_depth is None:
        # Add one so that the system raises a recursion exception when the limit is reached,
        # instead of quietly returning the wrong thing.
        max_depth = sys.getrecursionlimit() + 1

    def splitpath_recurse(path, depth=1):
        # TODO replace the last \ or / with os.path.sep to allow processing of paths not generated on this system
        (head, tail) = os.path.split(path)
        return splitpath_recurse(head, depth - 1) + [tail] if depth and head and head != path else [head or tail]

    return splitpath_recurse(path, max_depth)

Investigate gRPC performance and security for communicating between services

@thclark and I have decided to use python-socketio to allow communication between services, as it is simple to use and we already have an outline implementation. However, it is not secure by default and may not provide any compression of data, both of which will become a problem as we scale. To address this, we think we could replace the use of python-socketio with gRPC at some point in the future, as it has security and compression built in. However, for the first iteration of inter-service communication, gRPC seems a little complex as it requires a .proto interface-definition file to define message types outside of Python.

Useful links

gRPC

socket.io

Path management accessible from analysis

For 0.1.3 I removed the cumbersome path-based system for running analyses, upgrading to the Runner() class.

However, this has the side effect of no longer having self-made paths. It'll be more flexible, because a dataset can sit there on a system with a fixed path regardless of which analysis it belongs to... but means there's no baked-in path on the dataset.

We need to resolve this, either by attaching a path index to the Analysis instance which manages all of that... or by doing something like a mixin that gets instantiated with the appropriate path and local_path_prefix.

Inspired by the Datafile class, this might look something like the following, but I haven't yet figured out a way to instantiate it that will be easy to use.

import os


class Pathable:
    """ Mixin to allow a class to have a path attribute (which may be a directory or file path and name)

    Prevents setting path after an object is instantiated.

    ```
    class MyResource(Pathable):
        pass

    MyResource(path='sub/dir/file.csv').path  # 'sub/dir/file.csv'
    MyResource(path='/sub/dir/file.csv').path  # Leading slashes are stripped: 'sub/dir/file.csv'
    MyResource(path='sub/dir/file.csv', local_path_prefix='/data').full_path  # '/data/sub/dir/file.csv'
    ```
    """

    _path_field = None

    def __init__(self, *args, path=None, local_path_prefix=".", **kwargs):
        """ Constructor for Pathable class

        # TODO Update datafile to use this mixin
        """
        super().__init__(*args, **kwargs)
        path = path or self._get_default_path()
        self._path = self._clean_path(path)

        self.local_path_prefix = str(os.path.abspath(local_path_prefix))

    def _get_default_path(self):
        if self._path_field is not None:
            return str(getattr(self, self._path_field))

    @staticmethod
    def _clean_path(path):
        if path is not None:
            return str(os.path.normpath(path)).lstrip(r"\/")

    @property
    def name(self):
        return str(os.path.split(self.path)[-1])

    @property
    def full_path(self):
        return os.path.join(self.local_path_prefix, self.path)

    @property
    def path(self):
        return self._path

Update CLI

Redesign CLI to a single CLI, instead of a mechanism that defines a new CLI for every app.

  • give an app path argument to specify which app to run
  • allow the user to specify a --data-dir argument which, on instantiation of Analysis(), gives a hint to the location of sources which aren't specified on the command line (allowing use straight from the terminal)
  • get version from git or from setup.py to remove the need for @octue_version decorator
  • remove deprecated logs directory
  • add CLI usage instructions to README

Human-friendly file-size utils

Code presently in amy.utils.files should be refactored to the Octue SDK utils, allowing us to present friendly file sizes for datasets and datafiles:

def size_kb(size_bytes):
    if not size_bytes:
        return 0
    return size_bytes / 1024


def size_mb(size_bytes):
    if not size_bytes:
        return 0
    return size_bytes / 1048576


def size_gb(size_bytes):
    if not size_bytes:
        return 0
    return size_bytes / 1073741824


def size_tb(size_bytes):
    if not size_bytes:
        return 0
    return size_bytes / 1099511627776


def size_str(size_bytes, fmt='%.02f '):
    """ Return sensible human formatted size string
    :param size_bytes: file/dataset size in bytes
    :param fmt: string format specifier for the floating point size. default '%.02f '
    :return: str
    """
    if size_bytes >= 1099511627776:
        return fmt % size_tb(size_bytes) + 'tb'
    if size_bytes >= 1073741824:
        return fmt % size_gb(size_bytes) + 'gb'
    if size_bytes >= 1048576:
        return fmt % size_mb(size_bytes) + 'mb'
    if size_bytes >= 1024:
        return fmt % size_kb(size_bytes) + 'kb'
    return fmt % size_bytes + 'b'

Consider path hinting in the CLI

The current way of explicitly finding data directories is good and works well.

Longer term, if we find users are struggling to get all the right locations hooked up, we could consider taking a path hinting approach.

A snippet of code towards this (removed because it was unused in release 0.1.4):


FOLDERS = (
    "configuration",
    "input",
    "tmp",
    "output",
)

def from_path(path_hints, folders=FOLDERS):
    """ NOT IMPLEMENTED YET - Helper to find paths to individual configurations from hints
    TODO Fix this
    """
    # Set paths
    if isinstance(path_hints, str):
        if not os.path.isdir(path_hints):
            raise exceptions.FolderNotFoundException(f"Specified data folder '{path_hints}' not present")

        paths = {folder: os.path.join(path_hints, folder) for folder in folders}

    else:
        if (
            not isinstance(path_hints, dict)
            or (len(path_hints.keys()) != len(folders))
            or not all(k in folders for k in path_hints.keys())
        ):
            raise exceptions.InvalidInputException(
                f"Input 'paths' should be a dict containing directory paths with the following keys: {folders}"
            )

        paths = path_hints

    # Ensure paths exist on disc??
    for folder in folders:
        isfolder(paths[folder], make_if_absent=True)

Add hashes of input data


Feature request

Use Case


I need to be able to check that a particular set of inputs was used to create an output.

Hashes of input_values, configuration_values, input_manifest, configuration_manifest, made available on the analysis object, would allow me to tag output data with the input hashes for auditability.

  • Add the hashes and attach them to the analysis object (a sketch of the hashing is below)
  • Add a section in the documentation (and/or an app demo template) describing usage.
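
A minimal sketch of how such hashes could be produced for the values strands (hashing manifests would need a little more thought about what to include); the hash_values helper name is hypothetical:

import hashlib
import json


def hash_values(values):
    """Return a deterministic SHA-256 hash of a JSON-serialisable values object."""
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


input_values_hash = hash_values({"locations": [[0, 0], [51, 0]]})
print(input_values_hash[:12])  # a short prefix is enough to tag outputs with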

Consider use of tmp_dir

The original design allowed for a tmp_dir option in which to place working files, which would be cleared up afterward.

But Python 3.2 introduced tempfile.TemporaryDirectory, which is a cleaner way of managing this, so we probably shouldn't be encouraging the use of tmp_dir.

It's possible, however, to introduce the concept of a cache_dir (for persisting intermediate results between analyses) but this requires much more thought.
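
For reference, the standard-library pattern that would replace tmp_dir looks like this:

import tempfile

# Scoped to the analysis: the directory and everything in it are removed on exit.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(f"{tmp_dir}/intermediate.csv", "w") as f:
        f.write("working data\n")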

Original tmp_dir option in the CLI was:

@click.option(
    "--tmp-dir",
    type=click.Path(),
    default="<data-dir>/tmp",
    show_default=True,
    help="Absolute or relative path to a folder where intermediate files should be saved. Will be cleaned up post-analysis",
)

Runner().run() method within parent twin: Avoid using 'app' as a python module, except for your main entrypoint



Support request

I'm trying to use a Runner().run() method to run a "child twin" from another "parent" twin. What would be a proper setup for this?

Use Case

Parent twins.

Current state

Running FSI Simulation twin:
octue-app run --data-dir "data"
Returns:
"Module 'app' already on system path. Using 'AppFrom' context will yield unexpected results. Avoid using 'app' as a python module, except for your main entrypoint"

Your environment

  • Library Version: Octue 0.1.6
  • Platform Linux

Additional helper methods on Taggables


Feature request

Use Case


I'm trying to extract tags (and subtags) from taggable groups.

e.g. if I have a datafile entry with:

{
   tags: `a-tag another:23`
}

That gives me a TagGroup object.

I want to be able to get 23 easily, so that I can use it when searching for other files with the same tag.

Current state


I work around this by doing str(tags), then manually parsing the string with a loop. Very annoying.
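
A sketch of the kind of helper that would remove that workaround; the get_value method and the TagGroup internals here are assumptions, not the SDK's actual implementation:

class TagGroup:
    """Sketch only: parses space-separated tags, where 'name:value' tags carry a subtag value."""

    def __init__(self, tag_string):
        self._tags = {}
        for tag in tag_string.split():
            name, _, value = tag.partition(":")
            self._tags[name] = value or None

    def get_value(self, name, default=None):
        # e.g. TagGroup("a-tag another:23").get_value("another") == "23"
        return self._tags.get(name, default)


print(TagGroup("a-tag another:23").get_value("another"))  # "23"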

Use separate environments for local child services to avoid conflicts.

Figure out how to run child services in their own virtual environment using tox, just like pre-commit does.

  • This will solve #57
  • Need to build and install the virtual environments, rebuilding on dependency changes.
  • Ideally store somewhere so that we don't have to recreate at each invocation.
  • And have a method for deleting them all when everything goes to hell
  • It may be that the analysis.children object is actually a ServiceManager that makes sure they're available and created, or perhaps the Service resource does that itself.

Note: we decided to move this into a separate issue from #46

Output values not encoding properly

In the yuriyfoil demo, the following code:

    # Assign to the analysis outputs
    for key, value in zip(('cl', 'cdp', 'cdv', 'cp_x', 'cp'), results):
        # TODO see issue 
        analysis.output_values[key] = value

Results in:

analysis-fa8f9ad9-8c01-4709-8e46-54ee8aada52d ERROR 2020-10-06 22:52:29,699 runner 1 140199177582336 array([0.25859084]) is not of type 'array'

Failed validating 'type' in schema['properties']['cl']:
    {'description': 'Output cl values corresponding to input alpha values',
     'items': {'type': 'number'},
     'title': 'cl',
     'type': 'array'}

On instance['cl']:
    array([0.25859084])

It's fixed by manually casting the numpy arrays to lists:

    # Assign to the analysis outputs
    for key, value in zip(('cl', 'cdp', 'cdv', 'cp_x', 'cp'), results):
        # TODO see issue
        analysis.output_values[key] = value.tolist()

Clearly, the output isn't getting correctly passed through the encoder.
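
One way this could be fixed at the encoder level; a sketch only, assuming the SDK serialises output values with the json module and that an encoder class can be swapped in:

import json

import numpy as np


class NumpyAwareEncoder(json.JSONEncoder):
    """JSON encoder that converts numpy arrays and scalars to plain Python types."""

    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)


print(json.dumps({"cl": np.array([0.25859084])}, cls=NumpyAwareEncoder))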

Local file_like compatibility

Currently, the Datafile resources only record locations on the current filesystem, from which (or to which) files can be read (or written) in the user's analysis code.

Providing a set of methods, or a class inheritance, so that the Datafile can be used as a context manager for opening the file itself could provide a powerful way of easing the creation of results files.

Something like:

df = Datafile(path='my_file.bmp')
with open(df, 'w') as fp:
  fp.write('data')

or (less desirable as it's not a standard pattern but far easier to implement)

df = Datafile(path='my_file.bmp')
with df.open('w') as fp:
  fp.write('data')

Would be more elegant than

df = Datafile(path='my_file.bmp')  # or getting it from the manifest
with open(df.full_name, 'w') as fp:
    fp.write(data)

Even better, being able to use NamedTemporaryFile and similar could be useful to avoid hassle in garbage collection:

with NamedTemporaryFile(suffix='.csv') as fp:
    df = Datafile(fp=fp)
    self.assertEqual('csv', df.extension)

Here's a function written toward achieving that (the feature is presently shelved as it's tricky to get right):

def get_local_path_prefix_from_fp(fp, path=None):
    """ Handles extraction of path and local_path_prefix from a file-like object, with checking around a bunch of edge
    cases.

    Useful when you have a file-like object you've created during an analysis, to find the local_path_prefix
    you need to create a datafile:
    my_file = 'a/file/to/put/analysis/results.in'
    with open(my_file) as fp:
        # ...
        # Write stuff to file
        # ...
        # Create datafile
        path, local_path_prefix = get_local_path_prefix_from_fp(fp, path=my_file)
        Datafile(path=path, local_path_prefix=local_path_prefix)
    """

    # TODO Revamp to use path-likes properly instead of managing strings

    # Allow file-likes or class (like the tempfile classes) that wrap file-likes with a .file attribute
    instance_check = isinstance(fp.file, io.IOBase) if hasattr(fp, "file") else isinstance(fp, io.IOBase)
    if (not instance_check) or (not hasattr(fp, "name")):
        raise InvalidFilePointerException("'fp' must be a file-like object with a 'name' attribute")

    # Allow `path` to define what portion of the file path is considered a local prefix and what portion is
    # considered to be this file's path within a dataset
    fp_name = str(fp.name)  # Allows use of temporary files, whose name might be interpreted as an integer (sigh!).

    # If path not given, use the filename only
    if path is None:
        path = fp_name.replace("\\", "/").split("/")[-1]

    # Remove any directory prefix on the path, which should always be relative
    path = path.lstrip("\\/")

    # Check that the path given actually properly matches the end of the real location on disc
    if not fp_name.endswith(path):
        raise InvalidInputException(f"'path' ({path}) must match the end of the file path on disc ({fp_name}).")

    # Check that the path given is a whole portion
    # TODO this could be tidier. Split both paths and iterate back from the filename up the directory tree,
    #  checking at each step that things match
    local_path_prefix = utils.strip_from_end(fp_name, path.strip("\\/"))
    if len(local_path_prefix) > 0 and not local_path_prefix.endswith(("\\", "/")):
        raise InvalidInputException(f"The 'path' provided ({path}) is not a valid portion of the file path ({fp_name})")

    return path, local_path_prefix

Here are some test cases for it:

    def test_with_temporary_file(self):
        """ Ensures that a datafile can be created using an un-named temporary file.
        """
        with TemporaryFile() as fp:
            df = Datafile(fp=fp)
            self.assertEqual('', df.extension)
    
    def test_with_named_temporary_file(self):
        """ Ensures that if a user creates a namedTemporaryFile and shoves data into it, they can create a Datafile from
        it which picks up the name successfully
        """
        with NamedTemporaryFile(suffix='.csv') as fp:
            df = Datafile(fp=fp)
            self.assertEqual('csv', df.extension)
    
    def test_with_fp_and_conflicting_name(self):
        """ Ensures that a conflicting name won't work if instantiating a file pointer
        """
        with NamedTemporaryFile(suffix='/me.csv') as fp:
            # temp_name = fp.name.split('/\\')[-1].split('.')[0]
            with self.assertRaises(exceptions.InvalidInputException):
                Datafile(fp=fp, name=f'some_other_name.and_extension')
    
    def test_with_fp_and_correct_name(self):
        """ Ensures that a matching name will correctly split the file name and local path
        """
        with NamedTemporaryFile(suffix='.csv') as fp:
            temp_name = fp.name.split('/\\')[-1]
            print(temp_name)
            df = Datafile(fp=fp, path=temp_name)
            self.assertEqual(temp_name, df.full_path)
            self.assertEqual(fp.name, df.full_path)

Decide on correct OSS License and implement throughout

Licensing

We have several libraries now under OSS, but we haven't formally decided which license to use.

@AndyClifton suggested that some variant (?) of the BSD license is superior to MIT (which we currently use), though I can't remember why.

Andy, can you remember the reason and clarify which variant?

Once we have decided which to use, we'll apply it throughout all of the public repos on github.com/octue, including this one.

Figurefile - specialised datafile class

Ideally, we'd have a Figurefile which is a specialised Datafile subclass, validating that the produced JSON is actually a figure according to the plotly spec, handling write functionality, etc.

Needs to be part of a wider decision about how to specialise and subclass resources.

In the now-deprecated MATLAB SDK we had something like add_figure, and the pathetic start at porting it to Python looked like this (mainly useful for the pkg_resources import of the plotly schema):

import json
import pkg_resources

# TODO use __get_attr__ to lazy load this once we can rely on use of python 3.7 and upward
plotly_schema = json.loads(pkg_resources.resource_string("twined", "twined/schema/plotly_schema.json"))


def add_figure(**kwargs):
    """ Adds a figure to an output dataset. Automatically adds the tags 'type:fig extension:json'
    %
    %   ADDFIGURE(p) writes a JSON file from a plotlyfig object p (see
    %   figure.m example file in octue-app-matlab).
    %
    %   ADDFIGURE(data, layout) writes a JSON file from data and layout
    %   structures, which must be compliant with plotly spec, using MATLAB's native
    %   json encoder (2017a and later).
    %
    %   ADDFIGURE(..., tags) adds a string of tags to the figure to help the
    %   intelligence system find it. These are appended to the automatically added
    %   tags identifying it as a file.
    %
    %   uuid = ADDFIGURE(...) Returns the uuid string of the created figure,
    %   allowing you to find and refer to it from anywhere (e.g. report templates or
    %   in hyperlinks to sharable figures).

    % Generate a unique filename and default tags
    % TODO generate on the octue api so that the figure can be trivially registered
    % in the DB and rendered
    uuid = octue.utils.genUUID;
    key = [uuid '.json'];
    name = fullfile(octue.get('OutputDir'), [uuid '.json']);
    tags = 'type:fig extension:json ';

    % Parse arguments, appending tags and generating json
    % TODO validate inputs, parse more elegantly, and accept cases where the data
    % and layout keys are part of the structure or not.
    if nargin == 1
        str = plotly_json(varargin{1});

    elseif (nargin == 2) && (isstruct(varargin{2}))
        data = varargin{1};
        layout = varargin{2};
        str = jsonencode({data, layout});

    elseif (nargin == 2)
        str = plotly_json(varargin{1});
        tags = [tags varargin{2}];

    elseif nargin == 3
        data = varargin{1};
        layout = varargin{2};
        str = jsonencode({data, layout});
        tags = [tags varargin{3}];

    end

    % Write the file
    fid = fopen(name, 'w+');
    fprintf(fid, '%s', str);
    fclose(fid);

    % Append it to the output manifest
    file = octue.DataFile(name, key, uuid, tags);
    octue.get('OutputManifest').Append(file)

    end

    function str = plotly_json(p)
    %PLOTLY_JSON extracts json data from a plotlyfig object.

    jdata = m2json(p.data);
    jlayout = m2json(p.layout);
    str = sprintf('{"data": %s, "layout": %s}', escapechars(jdata), escapechars(jlayout));

    end
    """
    uuid = None
    return uuid

Manifest helpers

amy.utils.files has some helpers for creating manifests of files, validating presence of inputs and recreating file/folder structures.

Review which are needed still, update and refactor to the SDK so they can be used if necessary

import json
import os
import shutil
from pathlib import Path


def replicate_folder_structure_with_empty_files(input_folder, output_folder=None):
    """ Walk the contents of an input folder and create the same file and folder structure in the output (by touching),
    with empty files. This is good for testing the file uploader with extremely quick upload speeds
    :param input_folder:
    :param output_folder:
    :return:
    """

    # TODO refactor this out to the octue/octue-utils library and maybe use the walker for a more elegant solution

    if output_folder is None:
        input_folder_dir, input_folder_name = os.path.split(input_folder.rstrip('/\\'))
        output_folder = os.path.join(input_folder_dir, input_folder_name + '_empty')

    print('Replicating folder structure with empty files...')
    print('Input folder:', input_folder)
    print('Output folder', output_folder)

    # Traverse the input directory
    for dir_path, dirs, files in os.walk(input_folder):

        print('Traversing directory:', dir_path)

        # Make the output directory if it doesn't already exist
        rel_path = dir_path.partition(input_folder)[2]
        print('    Relative path:', rel_path)
        output_path = output_folder + rel_path
        print('    Absolute path:', output_path)
        try:
            os.mkdir(output_path)
            print('    Created directory:', output_path)
        except FileExistsError:
            print('    Output directory already exists:', output_path)

        # Touch each file in the current directory
        for file in files:
            new_filename = os.path.join(output_path, file)
            Path(new_filename).touch()
            print('    Touched file:', new_filename)


def make_input_folder_from_local(octue_input_folder='.', manifest_file='manifest.json', hints=(), link=True):
    """ Ensure that each file in the manifest is present in the octue_input_folder. If not, search for the file in the list of top level directories given in hints and symlink to those files to create the correct input data structure.

    :param octue_input_folder:
    :type octue_input_folder: str

    :param manifest_file:
    :type manifest_file: str

    :param hints: list of strings containing possible local directories where the data structure described in the manifest resides.
    :type hints: list

    :param link: If true (default), creates symlinks to local files. If false, copies them to the correct location, renaming
    :type link: bool

    :return: None
    """

    # TODO refactor this out to the octue/octue-utils library and maybe use the walker for a more elegant solution

    # TODO use the octue SDK to read the manifest

    man_file_path = os.path.join(octue_input_folder, manifest_file)
    print('Loading manifest from', man_file_path)
    with open(man_file_path) as man_file:
        man = json.load(man_file)
    print('Attempting to find', len(man['files']), 'files')
    for file in man['files']:

        local_path = None

        # Check for the file at the correct location and at the original path name
        path = os.path.join(octue_input_folder, file['data_file']['key'])
        alt_path = os.path.join(octue_input_folder, file['name'].lstrip(r'\/'))
        abs_path = os.path.join(octue_input_folder, file['name'])
        if os.path.isfile(path):
            local_path = path

        elif os.path.isfile(alt_path):
            local_path = alt_path

        elif os.path.isfile(abs_path):
            local_path = abs_path

        else:
            # Check each hint, stopping on finding the first one
            for hint in hints:
                if local_path is None:
                    # Check for both the key and the name
                    hint_path = os.path.join(hint, file['data_file']['key'])
                    alt_hint_path = os.path.join(hint, file['name'].lstrip(r'\/'))
                    if os.path.isfile(hint_path):
                        local_path = hint_path

                    elif os.path.isfile(alt_hint_path):
                        local_path = alt_hint_path

        # Error on missing file
        if local_path is None:
            raise Exception(f'Unable to locate file: {file}')

        # Either symlink or copy, if the file isn't in the right place
        if local_path != path:
            if not os.path.isdir(os.path.dirname(path)):
                os.makedirs(os.path.dirname(path))

            if link:
                os.symlink(local_path, path)
                print('Symlinked', local_path, 'to', path)
            else:
                shutil.copyfile(local_path, path)
                print('Copied', local_path, 'to', path)

Bug report: pre-commit hooks cause tox.ini to be deleted without cause

What is the current behavior?

From the repo root on the feature/search-for-subtags and main branches (and probably others), if I run:

python -m unittest
git add tests
git commit
git add tests
git commit
<Cancel commit>
git restore --staged tests
<Cancel commit>
git status

I then find that tox.ini has been deleted.

What is the expected behavior?

tox.ini should not be touched.

Your environment

  • Library Version: feature/search-for-subtags or main branches (and probably others)
  • Platform: MacOS in the Pycharm terminal
  • pre-commit version: 2.9.3
  • git version: 2.23.0

Other information

  • It might be the last file alphabetically that is deleted
  • I think either pre-commit or one of the steps in it is deleting this file (and others in other circumstances)

Github Action Version Check

If we're on a branch release/x.y.z, then the version in setup.py should match x.y.z.

Create a check in GitHub Actions that fails if it doesn't match, and apply it to this repo and all the others where GitHub Actions are used (e.g. twined, django-twined, etc.). A sketch of the check is below.

@cortadocodes this is busy work for filling in gap time, not high priority right now.
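
A sketch of the check itself; reading the branch from GITHUB_REF and regexing setup.py are assumptions about how it would be wired into an action:

import os
import re
import sys

# Assumption: the release branch name arrives via the GITHUB_REF environment variable,
# e.g. "refs/heads/release/0.1.7".
branch = os.environ.get("GITHUB_REF", "")
match = re.search(r"release/(\d+\.\d+\.\d+)$", branch)

if match:
    expected_version = match.group(1)
    with open("setup.py") as f:
        setup_version = re.search(r"version=[\"']([^\"']+)[\"']", f.read()).group(1)

    if setup_version != expected_version:
        sys.exit(f"setup.py version {setup_version} does not match branch version {expected_version}")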

Consolidate Twined repository

Epic

We want a clear proposition for what the different Octue products are. At the moment, "octue" the company has a range of services and applications, whilst "octue" the Python package is about services and functions (i.e. it actually implements what we call twined).

We wish to rework this so that octue (the SDK package) can span the whole range of Octue's offerings, and so that the twined ecosystem has a clear offering, purpose and place to exist.

Decision

We had considered splitting out to create several separate packages, but decided against it for the sake of bulletproof version management (and because our set of services isn't so large that a single repository becomes cumbersome).
Based on discussions here, we've decided to:

  • keep code in a single repository
  • in a way that subpackages could be split out later (then their CLIs reimported into the octue CLI) should we grow to need a more maintainable solution
  • in a way that is logically separated and presentationally separated so we can show clear product offerings within the one package, beginning with twined.

Migration plan

  • Move twined repo code into a twined subpackage of octue-sdk
  • Archive twined repository
  • Move twined-related elements of octue-sdk down into twined subpackage
  • Move current CLI commands down into octue twined subcommand
  • Redraft documentation so that either there’s a clear ‘twined’ top level heading (eg octue.readthedocs.org/twined) or there is a separate documentation set for each subpackage (eg twined.readthedocs.org, strands.readthedocs.org)
  • Rename octue deploy create-push-subscription subcommand to octue twined create-push-subscription, factor the logic out into a function within a deploy subpackage, and consider whether a CLI command is needed at all (if it isn’t, change the github action to use the function)
  • Versioning carries on as normal, as we only deal with the octue-sdk repo for twined

This means:

  • Users carry on installing octue-sdk as normal
  • New sub-SDKs and clients can be added to octue-sdk as they’re developed as either a subpackage or as separate repositories that are imported
  • In the future, we can move the twined subpackage back into its own repository if needed
    • Corollary: we sit on the twined pypi package rather than releasing it
  • CLI has a breaking change but it’s not like changing the whole thing to twined run, twined start, etc.
