octue / octue-sdk-python
The python SDK for @Octue services and digital twins.
Home Page: https://octue.com
License: Other
Site (parent, local service) *Accept as input a set of lat/lon location pairs and return site characteristics. Return as output an average wind speed for each location, and an elevation in m above WGS84 for each location*
|
| -> Atmosphere (child1, local service)
| -> Elevation (child2, remote service)
Specify the children requirements in the Site's twine, and specify their details in a children JSON string/file that gets parsed on run.
Add a Child Service Resource, which is instantiated from this JSON specification of the child services (analogous to input_manifest).
In the templates, attempt to use what feels like the most appropriate API for asking questions of the children. This will probably invoke a method; perhaps something like:
analysis.children['elevation-service-or-whatever-key-you-gave-it-in-the-twine'].ask(input_values={
"locations": [[0, 0], [51, 0]]
})
The method, which is fundamentally async
, should invoke the same pattern as the tests of django-twined
Figure out how to run child services in their own virtual environment using tox, just like pre-commit does.
Establish a way of communicating between the children, to exchange messages
Figure out how to configure local services!!! Possibly by adding configuration to the children
json file?
Some of the log messages are interpolating variables even when they're not being emitted. We can avoid this, saving lots of processing, by not interpolating but by using percent-notation and passing the variables in to e.g. logger.info
as extra positional arguments. i.e.
a = 3
logger.info('This is %s', a)
rather than
a = 3
logger.info(f'This is {a}')
This applies to all repos using python logging.
Use the coolname library to automatically generate slugified names like thine-mega-thingy
as a mixin, which could subclass
the Nameable mixin. This would allow users to create resources whose names are recognisable by default.
Here's some code which uses coolname from amy
(requires pip install coolname
), which is designed as a model mixin but can be refactored simply to an sdk library mixin.
from coolname import generate_slug
class CoolNamed:
""" Sets the 'name' field (if not already populated) using a cool name string and the id field
"""
def save(self, *args, **kwargs):
if hasattr(self, 'name') and self.name is None:
id_appendix = ''
if hasattr(self, 'id') and self.id is not None:
id_appendix = '-' + str(self.id)[:7]
self.name = '{coolname}{id}'.format(coolname=generate_slug(2), id=id_appendix)
super(CoolNamed, self).save(*args, **kwargs)
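A sketch of how this could look as a plain SDK mixin rather than a Django model mixin (the import fallback is only so the sketch runs without coolname installed; with it, generate_slug(2) yields names like 'ambitious-ape'):

```python
try:
    from coolname import generate_slug
except ImportError:  # Fallback stub so the sketch runs without the dependency installed
    def generate_slug(n_words):
        return "cool-name"


class CoolNamed:
    """Mixin that populates a 'name' attribute (if not given) with a cool slug,
    suffixed with the first 7 characters of an 'id' attribute if one is present.
    """

    def __init__(self, *args, name=None, **kwargs):
        super().__init__(*args, **kwargs)
        id_appendix = ""
        if getattr(self, "id", None) is not None:
            id_appendix = "-" + str(self.id)[:7]
        self.name = name or f"{generate_slug(2)}{id_appendix}"
```

Unlike the Django version, the name is assigned at construction time rather than on save(), since SDK resources have no save hook.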
The following DeprecationWarning
is raised when running the tests:
/repos/octue-sdk-python/venv/lib/python3.8/site-packages/jsonschema/validators.py:928: DeprecationWarning: The metaschema specified by $schema was not found. Using the latest draft to validate, but this will raise an error in the future.
cls = validator_for(schema)
We should address it at some point if it's easy to do so.
Using:
env:
  GCP_SERVICE_ACCOUNT: ${{ secrets.GCP_SERVICE_ACCOUNT }}
in the tox testing action should lead to the value of that being in the environment ready for testing.
It doesn't.
Over in windquest we were able to create a .env file containing secrets - using the following action step:
- name: Create .env File (enables injection of operational secrets into test container)
  uses: SpicyPizza/create-envfile@v1
  with:
    envkey_GITHUB_ACTIONS: True
    envkey_CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
    envkey_DJANGO_SECRET_KEY: ${{ secrets.DJANGO_SECRET_KEY }}
    envkey_DJANGO_SECURE_SSL_REDIRECT: True
    envkey_GOOGLE_APPLICATION_ASSETS_BUCKET_NAME: windquest-assets-test-github-actions
    envkey_GOOGLE_APPLICATION_CREDENTIALS_JSON: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_JSON }}
    envkey_GOOGLE_MAP_API_KEY: ${{ secrets.GOOGLE_MAP_API_KEY }}
    envkey_KOMBU_FERNET_KEY: ${{ secrets.KOMBU_FERNET_KEY }}
    envkey_MAILGUN_API_KEY: ${{ secrets.MAILGUN_API_KEY }}
    envkey_MAILGUN_SENDER_DOMAIN: mailgun.wind-pioneers.com
    file_name: .env
Please give as much detail as you can, like:
I'm submitting a ...
Please fill out the relevant sections below.
analysis.input_dir is None
>>> True
analysis.data_dir is None
>>> True
analysis.output_dir is None
>>> True
output_manifest is None
>>> True
@time-trader please could you attach the twine file for this case that we discussed this afternoon?
While running octue 0.1.3 app from IDE:
python app.py run
Traceback (most recent call last):
File "app.py", line 56, in
octue_cli(args)
File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 829, in call
return self.main(*args, **kwargs)
File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 1256, in invoke
Command.invoke(self, ctx)
File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/batman/Software/anaconda3/envs/foam_2d_twine/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
TypeError: octue_cli() missing 3 required positional arguments: 'data_dir', 'input_dir', and 'tmp_dir'
The Dataset.get_files
method allows only a very limited set of field lookups.
Implement a complete set like this
I'm creating a dataset to match NASA's Digital Elevation Model, and want to cross-check that I've added all 22911 files to it.
print(dataset)
>>> Dataset 7430991f-bf5e-4e17-8a42-b44e02495717
len(dataset)
>>> Traceback (most recent call last):
>>> File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in >>> Exec
>>> exec(exp, global_vars, local_vars)
>>> File "<input>", line 1, in <module>
>>> TypeError: object of type 'Dataset' has no len()
To return the number of files in the dataset
print(len(dataset.files))
>>> 22911
This should apply when the delete_topic_and_subscription_on_exit
flag is set to True
for tests of Service
and the CLI start
command. It works currently unless the test fails, in which case deletion is missed.
A parent service may need to hand off individual files within its manifest to child services.
It would be nice if we had a helper to pass files directly through to twins (possibly by specifying their path within the input_values schema or similar).
Currently, you would create a new Manifest for running the child service, with a single Dataset containing the required Datafiles.
You'd then pass that manifest to the child service so it could access the files it requires.
But, this is a cumbersome way of passing files through to children.
@time-trader, when working locally (i.e. having all the data files on one machine), is using the workaround of specifying the path directly to the file as an input_value, but this won't scale in general to a system where datafiles aren't necessarily local.
The links embedded in the ReadTheDocs documentation aren't working (possibly because they're pointing to .rst rather than .html files).
The links link to the expected page on ReadTheDocs.
@cortadocodes In the CLI at present for release/0.1.4 we have
for directory in config_dir, input_dir:
    for filename in VALUES_FILENAME, MANIFEST_FILENAME:
        if not file_in_directory(filename, directory):
            raise exceptions.FileNotFoundException(f"No file named {filename} found in {directory}.")
# ... Runner ...
def file_in_directory(filename, directory):
    return os.path.isfile(os.path.join(directory, filename))
However, I think (!!! getting back into this after a couple of months focussed elsewhere!) this can be handled at the twined level in the validate() method - the default behaviour is that if the src (in this case a file path) is given but that strand is not in the twine, an error is thrown. If it's not given and not required by the twine, validation passes. However, it is possible to allow_extra so that we can provide that src even when the twine is empty, without validation throwing an error (it simply won't load the file).
Whereas this early check requires that those files are present regardless of whether they're requisite in the twine. So I'll comment this out for the time being in case it gives @time-trader any trouble using this new release.
TODO - write a unit test to make sure I'm actually correct in this and that twined will vomit when it's supposed to; no more, no less
functools.cached_property
is only available in python3.8 and above. To retain compatibility with python3.6 and python3.7, we've used this to the same effect:
@property
@functools.lru_cache(maxsize=None)
When we drop support for python3.6 and python3.7, we should update this to @functools.cached_property
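A self-contained illustration of the backport pattern (the Analysis class and its counter are made up for demonstration):

```python
import functools


class Analysis:
    """Demonstrates the python3.6/3.7-compatible stand-in for cached_property."""

    compute_count = 0  # Class-level counter to show the body runs only once per instance

    def __init__(self, values):
        self.values = values

    @property
    @functools.lru_cache(maxsize=None)
    def total(self):
        # Runs once per instance; subsequent accesses are served from the lru_cache,
        # which is keyed on `self`.
        Analysis.compute_count += 1
        return sum(self.values)
```

One caveat worth noting: the lru_cache holds a strong reference to every instance it has seen, so instances are never garbage-collected - one of the reasons @functools.cached_property (which stores the value on the instance itself) was added in python3.8.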
A step toward solving octue/twined-server#2 and octue/django-twined#1 is to decorate an entrypoint to the app as dramatiq actor.
That way, workers can be run in any environment with any dependency set, picking tasks off of the broker... tasks which can be put onto the broker by any other app or worker or server with a differing environment.
Essentially this should allow us to call something like (from dramatiq -h
):
# Run dramatiq workers with actors defined in `./octue/actors.py`, and a broker named "redis_broker" defined in "octue.brokers", listening only to the "app-appname-version-0.0.1" queue.
$ dramatiq octue.brokers:redis_broker octue.actors -Q app-appname-version-0.0.1
The Dataset method get_file_sequence(self, field_lookup, files=None, filter_value=None, strict=True)
does some vanilla sorting on the sequence key. However, related to #6 it should be possible to provide more generic ordering on the queryset.
Tests that run the CLI create output
directories that currently need to be deleted manually.
That's mildly annoying but indicative of a deeper concern that race conditions may exist between tests if executed in parallel and using the default directory
Use or adapt the callCli
method of BaseTestCase throughout to ensure that cli calls are done in their own NamedTemporaryDirectory
context, ensuring that artifacts are automatically cleaned up and that tests are deterministic
Refactor the Dataset.get_files
method to behave like Django's QuerySet
This would return an object that could act like a chainable filter so you can do like:
my_dataset.filter('sequence__not', True).filter('extension', 'csv').filter('posix_timestamp__between', [123456, 135790]).all()
The final all()
method should yield a generator. Basically, we're talking about django's filtering here!
>>> fs = Dataset.filter(criterion=whatever).filter(othercriterion=whatever2)
>>> class(fs)
FilteredSet
Note: We have TagGroups, but Datasets. Make TagGroup => Tagset.
Note 2: Datasets are currently lists, not sets. They probably should be sets.
FilterSet methods:
- .as_object() creates a new resource, e.g. a Dataset, from the FilterSet
- .filter(criterion=something) returns a new FilterSet with additional filter criteria applied
- .order_by(field) orders the results
- yield to iterate through the results
- .apply() applies a callable to the yielded results
Implement at first some basics (see #7 for potential further criteria to implement in future)
Unit tests all pass when run in series locally, but in travis, two tests fail.
These are two tests of the run command of the CLI that pass by themselves, but not when run in series with test_fractal_configuration. The two tests pass when run in series with all other tests (i.e. when test_fractal_configuration is disabled).
@cortadocodes thinks it's something to do with paths being leaked from the template setting and making their way into other tests using AppFrom.
Failure message, on Python 3.6.3
======================================================================
FAIL: test_run_command_can_be_added (tests.test_cli.RunnerTestCase)
Test that an arbitrary run command can be used in the run command of the CLI.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/travis/build/octue/octue-sdk-python/tests/test_cli.py", line 41, in test_run_command_can_be_added
assert CUSTOM_APP_RUN_MESSAGE in result.output
AssertionError
======================================================================
FAIL: test_run_command_works_with_data_dir (tests.test_cli.RunnerTestCase)
Test that the run command of the CLI works with the --data-dir option.
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/travis/build/octue/octue-sdk-python/tests/test_cli.py", line 55, in test_run_command_works_with_data_dir
assert CUSTOM_APP_RUN_MESSAGE in result.output
AssertionError
----------------------------------------------------------------------
I'm considering removing the tmp-dir
functionality, which @cortadocodes has pointed out is presently unhooked in the 0.1.4 release (#25)
The reason is that a unique tmp_dir needs to be assigned for each analysis in order to remain thread-safe. Although it could be useful as a cache between runs, I believe the expectation of something named like this would be that it's temporary to that run (so for example listing all the files created in tmp_dir would be deterministic for a given analysis).
In which case it would be more helpful to our users if we manage this for them, by creating a NamedTemporaryDirectory scoped to the analysis. Then if caches are required to cross analyses, we could handle that in a more sensitive way, providing a helpful API and docs specifically for this.
@time-trader are you using tmp_dir
for storing temporary files anywhere? If so I'll implement this change carefully to avoid breaking your code. Otherwise I'll use a sledgehammer.
Django queryset syntax allows aggregation as part of the queryset:
expressions, output_field, filter, **extra, Avg, Count, Max, Min, StdDev, Sum, Variance
Could consider enabling this to apply expressions to queryset (e.g. in the Dataset.get_files
method) as is being developed in #6
For example, it'd be good to query for a subset of the files, then .apply() a function to them.
eg
subset = Dataset.get_files(has_tag="something", extension="csv", whatever="whatever")  # or whatever the query syntax is to get a data subset
subset.apply_to_all_files(lambda file: print(file.hash_value), asynchronous=True)  # "async" is a reserved word, so spelt out
This will allow Scientists to
This will allow us to
Add methods to the Dataset() class which
Ultimately I think this'll become quite useful, but we don't have anybody desperately needing it right now. Any views on whether this is a worthwhile feature?
Likely to become easy to implement if #6 is properly solved.
I'm submitting a ...
Please fill out the relevant sections below.
Docs not found at the (supposed) link from the README due to outdated location (https://octue.readthedocs.io/en/latest/?badge=latest)
Both badge and links to documentation need to be updated with correct location https://readthedocs.org/projects/octue-python-sdk/
Docs not building on Readthedocs
https://readthedocs.org/projects/octue-python-sdk/builds/12694260/
README file is still based on python library template, so needs to be cut down (to match twined README or similar)
Docs build, readme is relevant and has correct links!
At the moment we recognise the wider issue that documentation will be moved around substantially in #69 but at present, we just want to get what we have to serve correctly.
When setting up the analysis logging, optionally create an additional handler to send log entries to a websocket, whose URI could be passed as a CLI option then into the runner.
This will enable the logs from the analyses to be streamed back to the end user.
@cortadocodes how do you feel about taking this one on?
I'm not sure how to create a websocket for testing purposes but it must be doable.
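One way to sidestep the testing problem is to make the handler depend only on a `send` callable, so tests inject a plain list's append while production wires in a real websocket client's send method. A sketch (the handler class is hypothetical, not existing SDK code):

```python
import logging


class WebsocketHandler(logging.Handler):
    """Log handler that forwards formatted records to a `send` callable.

    In production, `send` might be a websocket client's send method (with the
    URI having been passed in via a CLI option); in tests it can simply be
    `some_list.append`, so no real socket is needed.
    """

    def __init__(self, send, level=logging.NOTSET):
        super().__init__(level=level)
        self._send = send

    def emit(self, record):
        try:
            self._send(self.format(record))
        except Exception:
            # Standard logging convention: never let a handler crash the app
            self.handleError(record)
```

Attaching this alongside the normal console handler streams each analysis's log entries back to the end user without changing any logging call sites.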
From amy.utils.path
we have a path splitter which retains the initial '/' if present in a path.
Review and refactor into the SDK if helpful.
import os.path
import sys
def split(path, max_depth=1):
"""
http://nicks-liquid-soapbox.blogspot.co.uk/2011/03/splitting-path-to-list-in-python.html
:param path: str Path to split
:param max_depth: int Recursion limit (max number of directories in the path to be split out), default 1.
Setting max_depth=1 gives normal (head, tail) result consistent with
os.path.split(path), whilst setting max_depth=None allows recursion to the system limit
:return:
"""
if max_depth is None:
# Add one so that the system will raise a recursion exception when the limit is reached, instead of quietly returning the wrong thing.
max_depth = sys.getrecursionlimit() + 1
def splitpath_recurse(path, depth=1):
    # TODO replace the last \ or / with os.path.sep to allow processing of paths not generated on this system
    (head, tail) = os.path.split(path)
    if depth and head and head != path:
        return splitpath_recurse(head, depth - 1) + [tail]
    # Return the unsplit remainder; this keeps max_depth=1 consistent with os.path.split
    return [path]
return splitpath_recurse(path, max_depth)
@thclark and I have decided to use python-socketio
to allow communication between services as it is simple to use and we already have an outline implementation. However, it is not secure by default and may not provide any compression of data, both of which will become a problem as we scale. To address this, we think we could replace the use of python-socketio
with gRPC
at some point in the future as it has security and compression built in. However, for the first iteration of inter-service communication, gRPC
seems a little complex as it requires a .proto
javascript-like file to define message types outside of python.
Octue user, developing a digital twin for HAWT under AeroSense project.
FSI Module is the overarching service for running simulations and retrieving results, which should be called by other parts of Aerosense.
For 0.1.3 I removed the cumbersome path-based system for running analyses, upgrading to the Runner() class.
However, this has the side effect of no longer having self-made paths. It'll be more flexible, because a dataset can sit there on a system with a fixed path regardless of which analysis it belongs to... but means there's no baked-in path on the dataset.
We need to resolve this, either by attaching a path index to the Analysis instance which manages that all... or by doing something like a mixin that gets instantiated with the appropriate path
and local_path_prefix
.
Inspired by the Datafile
class this might look something like the following, but I haven't figured out a way to instantiate it yet which will then be easily used.
import os

class Pathable:
    """ Mixin to allow a class to have a path attribute (which may be a directory or file path and name)
    Prevents setting path after an object is instantiated.
    ```
    class MyResource(Pathable):
        pass
    MyResource(path='a/b/file.txt').path  # 'a/b/file.txt' (normalised, with leading slashes stripped)
    MyResource(path='a/b/file.txt').name  # 'file.txt'
    MyResource().path  # None, unless a default is supplied via _path_field
    ```
    """
_path_field = None
def __init__(self, *args, path=None, local_path_prefix=".", **kwargs):
""" Constructor for Pathable class
# TODO Update datafile to use this mixin
"""
super().__init__(*args, **kwargs)
path = path or self._get_default_path()
self._path = self._clean_path(path)
self.local_path_prefix = str(os.path.abspath(local_path_prefix))
def _get_default_path(self):
if self._path_field is not None:
return str(getattr(self, self._path_field))
@staticmethod
def _clean_path(path):
if path is not None:
return str(os.path.normpath(path)).lstrip(r"\/")
@property
def name(self):
return str(os.path.split(self.path)[-1])
@property
def full_path(self):
return os.path.join(self.local_path_prefix, self.path)
@property
def path(self):
return self._path
Remove travis file from repo for #36 now that github actions are used instead.
Redesign CLI to a single CLI, instead of a mechanism that defines a new CLI for every app.
@octue_version decorator
logs directory
Code presently in amy.utils.files should be refactored to the Octue SDK utils, allowing us to present friendly file sizes for datasets and datafiles:
def size_kb(size_bytes):
if not size_bytes:
return 0
return size_bytes / 1024
def size_mb(size_bytes):
if not size_bytes:
return 0
return size_bytes / 1048576
def size_gb(size_bytes):
if not size_bytes:
return 0
return size_bytes / 1073741824
def size_tb(size_bytes):
if not size_bytes:
return 0
return size_bytes / 1099511627776
def size_str(size_bytes, fmt='%.02f '):
""" Return sensible human formatted size string
:param size_bytes: file/dataset size in bytes
:param fmt: string format specifier for the floating point size. default '%.02f '
:return: str
"""
if size_bytes >= 1099511627776:
return fmt % size_tb(size_bytes) + 'tb'
if size_bytes >= 1073741824:
return fmt % size_gb(size_bytes) + 'gb'
if size_bytes >= 1048576:
return fmt % size_mb(size_bytes) + 'mb'
if size_bytes >= 1024:
return fmt % size_kb(size_bytes) + 'kb'
return fmt % size_bytes + 'b'
The current way of explicitly finding data directories is good and works well.
Longer term, if we find users are struggling to get all the right locations hooked up, we could consider taking a path hinting approach.
A snippet of code towards this (removed because it was unused in release 0.1.4):
FOLDERS = (
"configuration",
"input",
"tmp",
"output",
)
def from_path(path_hints, folders=FOLDERS):
""" NOT IMPLEMENTED YET - Helper to find paths to individual configurations from hints
TODO Fix this
"""
# Set paths
paths = dict()
if isinstance(path_hints, str):
if not os.path.isdir(path_hints):
raise exceptions.FolderNotFoundException(f"Specified data folder '{path_hints}' not present")
paths = {folder: os.path.join(path_hints, folder) for folder in folders}
else:
    if (
        not isinstance(path_hints, dict)
        or (len(path_hints.keys()) != len(folders))
        or not all(k in folders for k in path_hints.keys())
    ):
        raise exceptions.InvalidInputException(
            f"Input 'paths' should be a dict containing directory paths with the following keys: {folders}"
        )
    paths = path_hints
# Ensure paths exist on disc??
for folder in FOLDERS:
isfolder(paths[folder], make_if_absent=True)
I need to be able to check that a particular set of inputs was used to create an output.
Hashes of input_values, configuration_values, input_manifest, configuration_manifest, made available on the analysis object, would allow me to tag output data with the input hashes for auditability.
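One conventional way to get deterministic hashes of JSON-serialisable values is a canonical sorted-key dump fed into sha256; a sketch of the idea (not necessarily how the SDK would implement it):

```python
import hashlib
import json


def hash_values(values):
    """Return a deterministic hex digest of JSON-serialisable values.

    sort_keys plus compact separators make the serialisation canonical, so
    logically equal inputs always produce the same hash regardless of key
    order or whitespace.
    """
    canonical = json.dumps(values, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Output data could then be tagged with e.g. `hash_values(analysis.input_values)` for auditability.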
The original design allowed for a tmp_dir option in which to place working files, which would be cleared up afterward.
But, python 3.2 introduced tempfile.TemporaryDirectory, which is a cleaner way of managing this, so we probably shouldn't be encouraging the use of tmp_dir.
It's possible, however, to introduce the concept of a cache_dir (for persisting intermediate results between analyses) but this requires much more thought.
Original tmp_dir option in the CLI was:
@click.option(
"--tmp-dir",
type=click.Path(),
default="<data-dir>/tmp",
show_default=True,
help="Absolute or relative path to a folder where intermediate files should be saved. Will be cleaned up post-analysis",
)
I'm trying to use the Runner().run() method to run a "child twin" from another "parent" twin. What would be a proper setup for this?
Parent twins.
Running FSI Simulation twin:
octue-app run --data-dir "data"
Returns:
"Module 'app' already on system path. Using 'AppFrom' context will yield unexpected results. Avoid using 'app' as a python module, except for your main entrypoint"
Please describe your use case
I'm trying to extract tags (and subtags) from taggable groups.
e.g if I have a datafile entry with:
{
tags: `a-tag another:23`
}
That gives me a TagGroup object.
I want to be able to get 23 easily so that I can use it in searching for other files with the same tag.
Please describe what you're doing presently to work around this or achieve what you're doing.
Workaround: doing str(tags) then manually parsing the string with a loop. Very annoying.
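The parsing itself is simple enough to sketch: split the space-delimited tag string and partition each token on ':' into plain tags and key:value subtags (function name and return shape are hypothetical, not the SDK's API):

```python
def parse_tags(tag_string):
    """Split a space-delimited tag string into (plain_tags, subtags).

    'a-tag another:23' -> ({'a-tag'}, {'another': '23'})
    """
    plain, subtags = set(), {}
    for token in tag_string.split():
        key, sep, value = token.partition(":")
        if sep:  # token contained a ':', so it's a key:value subtag
            subtags[key] = value
        else:
            plain.add(token)
    return plain, subtags
```

With something like this on TagGroup, `tags.subtags['another']` would return '23' directly, ready for use in a search.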
Figure out how to run child services in their own virtual environment using tox, just like pre-commit does.
Note: we decided to move this into a separate issue from #46
In the yuriyfoil demo, the following code:
# Assign to the analysis outputs
for key, value in zip(('cl', 'cdp', 'cdv', 'cp_x', 'cp'), results):
# TODO see issue
analysis.output_values[key] = value
Results in:
analysis-fa8f9ad9-8c01-4709-8e46-54ee8aada52d ERROR 2020-10-06 22:52:29,699 runner 1 140199177582336 array([0.25859084]) is not of type 'array'
Failed validating 'type' in schema['properties']['cl']:
{'description': 'Output cl values corresponding to input alpha values',
'items': {'type': 'number'},
'title': 'cl',
'type': 'array'}
On instance['cl']:
array([0.25859084])
It's fixed by manually casting the numpy arrays to lists:
# Assign to the analysis outputs
for key, value in zip(('cl', 'cdp', 'cdv', 'cp_x', 'cp'), results):
# TODO see issue
analysis.output_values[key] = value.tolist()
Clearly, the output isn't getting correctly passed through the encoder.
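A conventional fix is a json.JSONEncoder subclass whose default() converts anything exposing a .tolist() method (numpy arrays included) to a plain list; this is a sketch of the technique, not necessarily how octue's encoder is wired up. The FakeArray in the test stands in for a numpy array so the example doesn't require numpy:

```python
import json


class NumpyAwareEncoder(json.JSONEncoder):
    """JSON encoder that serialises numpy arrays (or anything exposing a
    .tolist() method) as plain lists, so output_values validate as 'array'.
    """

    def default(self, obj):
        if hasattr(obj, "tolist"):
            return obj.tolist()  # numpy arrays convert themselves to nested lists
        return super().default(obj)
```

If the Runner serialised output_values with `json.dumps(values, cls=NumpyAwareEncoder)`, users wouldn't need to call .tolist() manually.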
Update the available command line options to only include those defined in the twine, so that the CLI is customised to the actual application.
Currently, the Datafile
resources only record locations on the current filesystem, from which (or to which) files can be read (or written) in the user's analysis code.
Providing a set of methods, or a class inheritance, so that the Datafile can be used as a context manager for opening the file itself could provide a powerful way of easing the creation of results files.
Something like:
df = Datafile(path='my_file.bmp')
with open(df, 'w') as fp:
fp.write('data')
or (less desirable as it's not a standard pattern but far easier to implement)
df = Datafile(path='my_file.bmp')
with df.open('w') as fp:
fp.write('data')
Would be more elegant than
df = Datafile(path='my_file.bmp') # or getting it from the manifest
with open(df.full_name, 'w') as fp:
fp.write(data)
Even better, being able to use NamedTemporary files and similar could be useful to avoid hassle in garbage collection:
with NamedTemporaryFile(suffix='.csv') as fp:
df = Datafile(fp=fp)
self.assertEqual('csv', df.extension)
Here is a function written toward achieving that (feature presently shelved as it's tricky to get right):
def get_local_path_prefix_from_fp(fp, path=None):
""" Handles extraction of path and local_path_prefix from a file-like object, with checking around a bunch of edge
cases.
Useful when you have a file-like object you've created during an analysis, to find the local_path_prefix
you need to create a datafile:
my_file = 'a/file/to/put/analysis/results.in'
with open(my_file) as fp:
# ...
# Write stuff to file
# ...
# Create datafile
path, local_path_prefix = get_local_path_prefix_from_fp(fp, path=my_file)
Datafile(path=path, local_path_prefix=local_path_prefix)
"""
# TODO Revamp to use path-likes properly instead of managing strings
# Allow file-likes or class (like the tempfile classes) that wrap file-likes with a .file attribute
instance_check = isinstance(fp.file, io.IOBase) if hasattr(fp, "file") else isinstance(fp, io.IOBase)
if (not instance_check) or (not hasattr(fp, "name")):
raise InvalidFilePointerException("'fp' must be a file-like object with a 'name' attribute")
# Allow `path` to define what portion of the file path is considered a local prefix and what portion is
# considered to be this file's path within a dataset
fp_name = str(fp.name) # Allows use of temporary files, whose name might be interpreted as an integer (sigh!).
# If path not given, use the filename only (splitting on either separator)
if path is None:
    path = fp_name.replace("\\", "/").split("/")[-1]
# Remove any directory prefix on the path, which should always be relative
path = path.lstrip("\\/")
# Check that the path given actually properly matches the end of the real location on disc
if not fp_name.endswith(path):
raise InvalidInputException(f"'path' ({path}) must match the end of the file path on disc ({fp_name}).")
# Check that the path given is a whole portion
# TODO this could be tidier. Split both paths and iterate back from the filename up the directory tree,
# checking at each step that things match
local_path_prefix = utils.strip_from_end(fp_name, path.strip("\\/"))
if len(local_path_prefix) > 0 and not local_path_prefix.endswith(("\\", "/")):
raise InvalidInputException(f"The 'path' provided ({path}) is not a valid portion of the file path ({fp_name})")
return path, local_path_prefix
Here are some test cases for it:
def test_with_temporary_file(self):
""" Ensures that a datafile can be created using an un-named temporary file.
"""
with TemporaryFile() as fp:
df = Datafile(fp=fp)
self.assertEqual('', df.extension)
def test_with_named_temporary_file(self):
""" Ensures that if a user creates a namedTemporaryFile and shoves data into it, they can create a Datafile from
it which picks up the name successfully
"""
with NamedTemporaryFile(suffix='.csv') as fp:
df = Datafile(fp=fp)
self.assertEqual('csv', df.extension)
def test_with_fp_and_conflicting_name(self):
""" Ensures that a conflicting name won't work if instantiating a file pointer
"""
with NamedTemporaryFile(suffix='/me.csv') as fp:
# temp_name = fp.name.split('/\\')[-1].split('.')[0]
with self.assertRaises(exceptions.InvalidInputException):
Datafile(fp=fp, name='some_other_name.and_extension')
def test_with_fp_and_correct_name(self):
""" Ensures that a matching name will correctly split the file name and local path
"""
with NamedTemporaryFile(suffix='.csv') as fp:
temp_name = fp.name.split('/\\')[-1]
print(temp_name)
df = Datafile(fp=fp, path=temp_name)
self.assertEqual(temp_name, df.full_path)
self.assertEqual(fp.name, df.full_path)
We have several libraries now under OSS but haven't formally decided what license to use.
@AndyClifton suggested some variant (?) of the BSD license was superior to MIT (which we currently use) because of .
Andy, can you remember the reason and clarify that variant?
Once we have decided which to use, we'll apply throughout to all repos public on github.com/octue including this one.
Ideally, we'd have a Figurefile which is a specialised data file subclass, validating that produced json is actually a figure according to the plotly spec and handling write functionality etc.
Needs to be part of a wider decision about how to specialise and subclass resources.
In the now deprecated matlab sdk we had something like add_figure
and the pathetic start at porting it to python looked like this (mainly useful for the pkg_resources import of the plotly schema)
import json
import pkg_resources

# TODO use __getattr__ to lazy-load this once we can rely on use of python 3.7 and upward
plotly_schema = json.loads(pkg_resources.resource_string("twined", "twined/schema/plotly_schema.json"))


def add_figure(**kwargs):
    """ Adds a figure to an output dataset. Automatically adds the tags 'type:fig extension:json'

    Original MATLAB help text and implementation, kept here for reference while porting:

    % ADDFIGURE(p) writes a JSON file from a plotlyfig object p (see
    % figure.m example file in octue-app-matlab).
    %
    % ADDFIGURE(data, layout) writes a JSON file from data and layout
    % structures, which must be compliant with plotly spec, using MATLAB's native
    % json encoder (2017a and later).
    %
    % ADDFIGURE(..., tags) adds a string of tags to the figure to help the
    % intelligence system find it. These are appended to the automatically added
    % tags identifying it as a file.
    %
    % uuid = ADDFIGURE(...) Returns the uuid string of the created figure,
    % allowing you to find and refer to it from anywhere (e.g. report templates or
    % in hyperlinks to sharable figures).

        % Generate a unique filename and default tags
        % TODO generate on the octue api so that the figure can be trivially
        % registered in the DB and rendered
        uuid = octue.utils.genUUID;
        key = [uuid '.json'];
        name = fullfile(octue.get('OutputDir'), [uuid '.json']);
        tags = 'type:fig extension:json ';

        % Parse arguments, appending tags and generating json
        % TODO validate inputs, parse more elegantly, and accept cases where the
        % data and layout keys are part of the structure or not.
        if nargin == 1
            str = plotly_json(varargin{1});
        elseif (nargin == 2) && (isstruct(varargin{2}))
            data = varargin{1};
            layout = varargin{2};
            str = jsonencode({data, layout});
        elseif (nargin == 2)
            str = plotly_json(varargin{1});
            tags = [tags varargin{2}];
        elseif nargin == 3
            data = varargin{1};
            layout = varargin{2};
            str = jsonencode({data, layout});
            tags = [tags varargin{3}];
        end

        % Write the file
        fid = fopen(name, 'w+');
        fprintf(fid, '%s', str);
        fclose(fid);

        % Append it to the output manifest
        file = octue.DataFile(name, key, uuid, tags);
        octue.get('OutputManifest').Append(file)
    end

    function str = plotly_json(p)
        %PLOTLY_JSON extracts json data from a plotlyfig object.
        jdata = m2json(p.data);
        jlayout = m2json(p.layout);
        str = sprintf('{"data": %s, "layout": %s}', escapechars(jdata), escapechars(jlayout));
    end
    """
    uuid = None
    return uuid
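A slightly less pathetic sketch of what the port could look like. The function name follows the MATLAB original, but the signature, the output-directory handling, and the file layout here are all assumptions; validation against the plotly schema would slot in once we settle on a validator and the Figurefile subclass exists:

```python
import json
import os
import tempfile
import uuid as uuid_module


def add_figure(data, layout, output_dir=None):
    """Hypothetical sketch: encode a plotly-style figure to JSON, write it out and return its uuid.

    Automatic tagging ('type:fig extension:json') and validation against the bundled
    plotly schema are omitted until the Figurefile subclass is designed.
    """
    figure = {"data": data, "layout": layout}
    # Assumption: a temp dir stands in for octue.get('OutputDir') from the MATLAB version
    output_dir = output_dir or tempfile.gettempdir()
    figure_uuid = str(uuid_module.uuid4())
    name = os.path.join(output_dir, figure_uuid + ".json")
    with open(name, "w") as fid:
        json.dump(figure, fid)
    return figure_uuid
```

As in the MATLAB version, the returned uuid lets the caller refer to the figure later (e.g. from report templates or hyperlinks).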
amy.utils.files
has some helpers for creating manifests of files, validating the presence of inputs, and recreating file/folder structures.
Review which are still needed, then update and refactor them into the SDK so they can be reused if necessary:
import os
from pathlib import Path


def replicate_folder_structure_with_empty_files(input_folder, output_folder=None):
    """ Walk the contents of an input folder and recreate the same file and folder structure in the
    output (by touching), with empty files. This is useful for testing the file uploader with
    extremely quick upload speeds.
    :param input_folder: folder whose structure will be replicated
    :param output_folder: destination folder; defaults to a sibling of the input suffixed '_empty'
    :return: None
    """
    # TODO refactor this out to the octue/octue-utils library and maybe use the walker for a more elegant solution
    if output_folder is None:
        input_folder_dir, input_folder_name = os.path.split(input_folder.rstrip('/\\'))
        output_folder = os.path.join(input_folder_dir, input_folder_name + '_empty')

    print('Replicating folder structure with empty files...')
    print('Input folder:', input_folder)
    print('Output folder:', output_folder)

    # Traverse the input directory
    for dir_path, dirs, files in os.walk(input_folder):
        print('Traversing directory:', dir_path)

        # Make the output directory if it doesn't already exist
        rel_path = dir_path.partition(input_folder)[2]
        print(' Relative path:', rel_path)
        output_path = output_folder + rel_path
        print(' Absolute path:', output_path)
        try:
            os.mkdir(output_path)
            print(' Created directory:', output_path)
        except FileExistsError:
            print(' Output directory already exists:', output_path)

        # Touch each file in the current directory
        for file in files:
            new_filename = os.path.join(output_path, file)
            Path(new_filename).touch()
            print(' Touched file:', new_filename)
import json
import os
import shutil


def make_input_folder_from_local(octue_input_folder='.', manifest_file='manifest.json', hints=None, link=True):
    """ Ensure that each file in the manifest is present in octue_input_folder. If a file is missing,
    search for it in the list of top-level directories given in hints, then symlink (or copy) it into
    place to create the correct input data structure.
    :param octue_input_folder: folder in which the input data structure should be assembled
    :type octue_input_folder: str
    :param manifest_file: name of the manifest file within octue_input_folder
    :type manifest_file: str
    :param hints: list of local directories where the data structure described in the manifest may reside
    :type hints: list
    :param link: If True (default), creates symlinks to local files. If False, copies them to the correct location, renaming.
    :type link: bool
    :return: None
    """
    # TODO refactor this out to the octue/octue-utils library and maybe use the walker for a more elegant solution
    # TODO use the octue SDK to read the manifest
    hints = hints or []
    man_file_path = os.path.join(octue_input_folder, manifest_file)
    print('Loading manifest from', man_file_path)
    with open(man_file_path) as man_file:
        man = json.load(man_file)

    print('Attempting to find', len(man['files']), 'files')
    for file in man['files']:
        local_path = None

        # Check for the file at the correct location and at the original path name
        path = os.path.join(octue_input_folder, file['data_file']['key'])
        alt_path = os.path.join(octue_input_folder, file['name'].lstrip('/\\'))
        abs_path = os.path.join(octue_input_folder, file['name'])
        if os.path.isfile(path):
            local_path = path
        elif os.path.isfile(alt_path):
            local_path = alt_path
        elif os.path.isfile(abs_path):
            local_path = abs_path
        else:
            # Check each hint, stopping on finding the first one
            for hint in hints:
                if local_path is None:
                    # Check for both the key and the name
                    hint_path = os.path.join(hint, file['data_file']['key'])
                    alt_hint_path = os.path.join(hint, file['name'].lstrip('/\\'))
                    if os.path.isfile(hint_path):
                        local_path = hint_path
                    elif os.path.isfile(alt_hint_path):
                        local_path = alt_hint_path

        # Error on missing file
        if local_path is None:
            raise FileNotFoundError(f'Unable to locate file: {file}')

        # Either symlink or copy, if the file isn't in the right place
        if local_path != path:
            if not os.path.isdir(os.path.dirname(path)):
                os.makedirs(os.path.dirname(path))
            if link:
                os.symlink(local_path, path)
                print('Symlinked', local_path, 'to', path)
            else:
                shutil.copyfile(local_path, path)
                print('Copied', local_path, 'to', path)
We lose information on test coverage with one-line statements, so @thclark and I have agreed to only use them for default mutable arguments.
This will make the pre-commit hooks faster and more efficient. It applies to all repos using pre-commit.
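For clarity, the one agreed exception, the default-mutable-argument guard, looks like this (the function name is illustrative):

```python
def append_item(item, items=None):
    # The only agreed use of a one-line statement: guarding a mutable default.
    # Everywhere else, statements go on their own lines so coverage can see them.
    if items is None: items = []
    items.append(item)
    return items
```

Using `items=None` plus this guard gives each call a fresh list, avoiding the classic shared-mutable-default bug.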
From the repo root on the feature/search-for-subtags
and main
branches (and probably others), if I run:
python -m unittest
git add tests
git commit
<Cancel commit>
git restore --staged tests
git status
I then find that tox.ini
has been deleted.
tox.ini
should not be touched.
Branches: feature/search-for-subtags or main (and probably others)
pre-commit version: 2.9.3
git version: 2.23.0
Hypothesis: pre-commit, or one of the steps in it, is deleting this file (and others in other circumstances).
If we're on a branch release/x.y.z, then the version from setup.py
should match x.y.z.
Create a check in github actions that fails if it doesn't match, and apply it to this repo and all the others where github actions are used (e.g. twined, django-twined, etc.).
@cortadocodes this is busy work for filling in gap time, not high priority right now.
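The check itself could be as small as the sketch below. The helper name is ours; in a real workflow the branch would come from the GitHub Actions environment (e.g. GITHUB_REF) and the version from `python setup.py --version`:

```python
import re


def check_release_version(branch_name, package_version):
    """Raise if a release/x.y.z branch name disagrees with the package version.

    Non-release branches are ignored, so the check can run unconditionally in CI.
    """
    match = re.fullmatch(r"release/(\d+\.\d+\.\d+)", branch_name)
    if match and match.group(1) != package_version:
        raise ValueError(
            f"Branch version {match.group(1)} does not match setup.py version {package_version}"
        )
```

Calling this early in the workflow makes the job fail fast before any build or publish step runs.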
We want a clear proposition for what the different Octue products are. At the moment "octue" the company has a range of services and applications, whilst "octue" the python package is about services and functions (i.e. it actually implements what we call twined).
We wish to rework this so that octue (the SDK package) can span the whole range of Octue's offerings, and so that the twined ecosystem has a clear offering, purpose and place to exist.
We had considered splitting out into several separate packages, but decided against it for the sake of bulletproof version management (and because our set of services isn't so large that a single repository becomes cumbersome).
Based on discussions here, we've decided to:
This means: