ckanext-blob-storage's Introduction

ckanext-blob-storage

Move CKAN resource storage management to an external micro-service

ckanext-blob-storage replaces CKAN's default local blob storage functionality with a pluggable storage layer supporting both cloud and local backends. It supports direct-to-cloud file uploading, following the design in https://tech.datopian.com/blob-storage/#ckan-v3

The design is pluggable, so one can use any of the major storage backends, whether local or cloud based (e.g. S3, Azure Blobs, GCP, etc.), or any other storage. In addition, the service allows clients (typically browsers) to upload and download files directly to and from storage without passing them through CKAN, which can greatly improve file access efficiency.

Authentication and authorization to the blob storage management service is done via JWT tokens provided by ckanext-authz-service.

Internally, the blob storage management service is in fact a Git LFS server implementation, which means access via 3rd party Git based tools is also potentially possible.

Configuration settings

ckanext.blob_storage.storage_service_url = 'https://...'

Set the URL of the blob storage microservice (the Git LFS server). This must be a URL accessible to browsers connecting to the service.

ckanext.blob_storage.storage_namespace = my-ckan-instance

Set the in-storage namespace used for this CKAN instance. This is useful if multiple CKAN instances are using the same storage microservice instance, and you need to separate permission scopes between them.

If not specified, ckan will be used as the default namespace.

Required resource fields

There are a few resource fields that are required for ckanext-blob-storage to operate. API / SDK users need to set them in requests that create new resources.

The required fields are:

  • url: the file name, without path (required by vanilla CKAN not just by blob storage)
  • url_type: set to "upload" for uploaded files
  • sha256: the SHA256 of the file
  • size: the size of the file in bytes
  • lfs_prefix: the LFS server path of where the file has been stored by Giftless. Something like org/dataset or storage_namespace/dataset_id.

If sha256, size or lfs_prefix are missing for uploads (url_type == 'upload'), the API call will return a ValidationError:

{
  "help": "http://ckan:5000/api/3/action/help_show?name=resource_create",
  "success": false,
  "error": {
    "__type": "Validation Error",
    "url_type": [
      "Resource's sha256 field cannot be missing for uploads.",
      "Resource's size field cannot be missing for uploads.",
      "Resource's lfs_prefix field cannot be missing for uploads."
    ]
  }
}
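
For example, a resource_create API call for a file that has already been uploaded to the LFS server might look like the sketch below (the URL, API key and field values are placeholders for illustration only):

import requests

response = requests.post(
    "https://ckan.example.com/api/3/action/resource_create",
    headers={"Authorization": "<your-api-key>"},
    json={
        "package_id": "my-dataset",
        "url": "data.csv",                  # file name, without path
        "url_type": "upload",
        "sha256": "<sha256-of-the-file>",
        "size": 12345,                      # file size in bytes
        "lfs_prefix": "my-org/my-dataset",  # where Giftless stored the file
    },
)
response.raise_for_status()
print(response.json()["result"]["id"])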

Requirements

  • This extension works with CKAN 2.8.x and CKAN 2.9.x.
  • ckanext-authz-service must be installed and enabled
  • A working and configured Git LFS server accessible to the browser. We recommend using Giftless, but other implementations may be configured to work as well.

Installation

To install ckanext-blob-storage:

  1. Activate your CKAN virtual environment, for example:
. /usr/lib/ckan/default/bin/activate
  2. Install the ckanext-blob-storage Python package into your virtual environment:
pip install ckanext-blob-storage
  3. Add blob_storage to the ckan.plugins setting in your CKAN config file (by default the config file is located at /etc/ckan/default/production.ini).

  4. Restart CKAN. For example if you've deployed CKAN with Apache on Ubuntu:

sudo service apache2 reload

Developer installation

To install ckanext-blob-storage for development, do the following:

  1. Pull the project code from Github
git clone https://github.com/datopian/ckanext-blob-storage.git
cd ckanext-blob-storage
  2. Create a Python 2.7 virtual environment (the -p py27 flag is used to ensure that you are using the right Python version when creating the virtualenv).
virtualenv .venv27 -p py27
source .venv27/bin/activate
  3. Run the following command to bootstrap the entire environment:
make dev-start

This will pull and install CKAN and all its dependencies into your virtual environment, create all necessary configuration files, launch external services using Docker Compose and start the CKAN development server.

You can create a user using the web interface at localhost:5000, but the user will not be an admin with permissions to create organizations or datasets. If you need to turn your user into an admin, make sure the virtual environment is still active and use this command, replacing <USERNAME> with the user name you created:

paster --plugin=ckan sysadmin -c ckan/development.ini add <USERNAME>
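
Note: the paster command applies to CKAN 2.8. If your development environment runs CKAN 2.9, the equivalent click-based command is ckan -c ckan/development.ini sysadmin add <USERNAME>.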

You can repeat the make dev-start command at any time to start developing again.

Type make help to get a list of useful commands for managing the local environment.

Update DataPub (resource editor) app

  1. Init submodule for the resource editor app
git submodule init
git submodule update
  2. Build the resource editor app
cd datapub
yarn
yarn build
  3. Replace bundles in fanstatic directory
rm ckanext/blob_storage/fanstatic/js/*
cp datapub/build/static/js/*.js ckanext/blob_storage/fanstatic/js/

If you also want to re-use stylesheets:

rm ckanext/blob_storage/fanstatic/css/*
cp datapub/build/static/css/*.css ckanext/blob_storage/fanstatic/css/
  4. Now, make sure to update the resources in templates/blob_storage/snippets/upload_module.html
{% resource 'blob-storage/css/main.{hash}.chunk.css' %}

{% resource 'blob-storage/js/runtime-main.{hash}.js' %}
{% resource 'blob-storage/js/2.{hash}.chunk.js' %}
{% resource 'blob-storage/js/main.{hash}.chunk.js' %}

Installing with Docker

Unlike other CKAN extensions, blob storage needs node modules to be installed and built in order to work properly. You will need to install Node.js and npm. Below is an example of how your Dockerfile might look:

RUN apt-get -q -y install \
        python-pip \
        curl \
        git-core

RUN curl -sL https://deb.nodesource.com/setup_14.x | bash - && apt-get install -y nodejs && npm version

# Install ckanext-blob-storage
RUN git clone --branch ${CKANEXT_BLOB_STORAGE_VERSION} https://github.com/datopian/ckanext-blob-storage
RUN pip install --no-cache-dir -r "ckanext-blob-storage/requirements.py2.txt"
RUN pip install -e ckanext-blob-storage

# Install other extensions
...

NOTE: We assume that you have a Giftless server running with a configuration as in giftless.yaml, and that nginx is configured as in nginx.conf.

Working with requirements.txt files

tl;dr

  • You do not touch *requirements.*.txt files directly. We use pip-tools and custom make targets to manage these files.
  • Use make develop to install the right development time requirements into your current virtual environment
  • Use make install to install the right runtime requirements into your current virtual environment
  • To add requirements, edit requirements.in or dev-requirements.in and run make requirements. This will recompile the requirements file(s) for your current Python version. You may need to do this for the other Python version by switching to a different Python virtual environment before committing your changes.

More background

This project manages requirements in a relatively complex way, in order to seamlessly support Python 2.7 and 3.x.

For this reason, you will see 4 requirements files in the project root:

  • requirements.py2.txt - Python 2 runtime requirements
  • requirements.py3.txt - Python 3 runtime requirements
  • dev-requirements.py2.txt - Python 2 development requirements
  • dev-requirements.py3.txt - Python 3 development requirements

These are generated using the pip-compile command (a part of pip-tools) from the corresponding requirements.in and dev-requirements.in files.

To understand why pip-compile is used, read the pip-tools manual. In short, this allows us to pin dependencies of dependencies, thus resolving potential deployment conflicts, without the headache of managing the specific version of each Nth-level dependency.

In order to support both Python 2.7 and 3.x, which tend to require slightly different dependencies, we use requirements.in files to generate major-version specific requirements files. These, in turn, should be used when installing the package.

In order to simplify things, the make targets specified above will automate the process for the current Python version.

Adding Requirements

Requirements are managed in .in files - these are the only files that should be edited directly.

Take care to specify a version for each requirement, to the level required to maintain future compatibility, but not to specify an exact version unless necessary.

For example, the following are good requirements.in lines:

pyjwt[crypto]==1.7.*
pyyaml==5.*
pytz

This allows these packages to be upgraded to a minor version, without the risk of breaking compatibility.

Note that pytz is specified with no version on purpose, as we want it updated to the latest possible version on each new rebuild.

Developers wanting to add new requirements (runtime or development time) should take special care to update the requirements.txt files for all supported Python versions by running make requirements in a different virtual environment for each version, after updating the relevant .in file.

Applying Patch-level upgrades to requirements

You can delete *requirements.*.txt and run make requirements.

TODO: we can probably do this in a better way - create a make target for this.

Tests

To run the tests, do:

make test

To run the tests and produce a coverage report, first make sure you have coverage installed in your virtualenv (pip install coverage) then run:

make coverage

Releasing a new version of ckanext-blob-storage

ckanext-blob-storage should be available on PyPI as https://pypi.org/project/ckanext-blob-storage. To publish a new version to PyPI follow these steps:

  1. Update the version number in the setup.py file. See PEP 440 for how to choose version numbers.

  2. Make sure you have the latest version of necessary packages:

    pip install --upgrade setuptools wheel twine
  3. Create source and binary distributions of the new version:
    python setup.py sdist bdist_wheel && twine check dist/*

Fix any errors you get.

  4. Upload the source distribution to PyPI:
    twine upload dist/*
  5. Commit any outstanding changes:
    git commit -a
  6. Tag the new release of the project on GitHub with the version number from the setup.py file. For example, if the version number in setup.py is 0.0.1 then do:
    git tag 0.0.1
    git push --tags

ckanext-blob-storage's People

Contributors

amercader, anuveyatsu, cotts, cuducos, dependabot[bot], mariorodeghiero, mbeilin, pdelboca, rufuspollock, shevron, tomeksabala, zelima

ckanext-blob-storage's Issues

Implement resource download flow

Implement the resource download flow for resources uploaded to external storage. Most likely this should be based on a route that gets a signed URL from LFS on the server side and redirects the user to it.

Acceptance Criteria:

  • Resources uploaded to external storage via LFS are accessible to download from all CKAN UI links
  • Resources uploaded to previous storage model are still accessible to download
  • External URL resources are still accessible
  • Users without full read access or :data subscope read access to the resource cannot download the resource

Technical Tasks

  • Register a CKAN blueprint to override the download route
  • Blueprint should check for the lfs_prefix resource extra attribute and if not set, fall back to CKAN's default behavior (or cloudstorage's behavior if it is installed)
  • If it is an LFS managed file:
    • Get an authz token by internally calling authz_authorize action
    • Call Giftless from Python and check the response
    • Redirect the user to the pre-signed download URL provided by Giftless
  • Even better: that logic should be encapsulated in an action; this allows using the API to get the download URL
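
A rough sketch of how such a blueprint could branch between LFS-managed files and the default behavior (illustrative only; fallback_download, get_authz_token and get_giftless_download_url are hypothetical stand-ins for the real helpers):

from flask import Blueprint, redirect
import ckan.plugins.toolkit as toolkit

blob_storage = Blueprint("blob_storage", __name__)

@blob_storage.route("/dataset/<id>/resource/<resource_id>/download/<filename>")
def download(id, resource_id, filename):
    resource = toolkit.get_action("resource_show")({}, {"id": resource_id})
    if not resource.get("lfs_prefix"):
        # Not an LFS-managed file: fall back to CKAN's default download
        # behavior (or cloudstorage's, if that extension is installed)
        return fallback_download(id, resource_id, filename)
    # LFS-managed file: get an authz token, ask Giftless for a pre-signed
    # URL and redirect the user to it
    token = get_authz_token(resource)
    signed_url = get_giftless_download_url(resource, token)
    return redirect(signed_url)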

Open Questions

  • We cannot maintain headers when redirecting (e.g. if LFS provides header: { Authorization: ... }). Should we rely on URLs including everything that the client needs in order to download?
  • How do we fall back to legacy download methods?

Authorization breaks down once a dataset is moved or renamed, can't download

Right now, when a dataset is moved to a different organization, or is renamed, or the organization is renamed, authorization will break down and the resource will no longer be available for download.

To reproduce:

  1. Set Giftless up in a manner that requires JWT based authorization to read a file
  2. Create a private dataset + resource stored in Blob Storage.
  3. Move the dataset to a new organization or rename the dataset or rename the organization
  4. Try to download the file

Analysis

In get_authz_token() we obtain an authorization token to download resources of the organization / dataset whose name is saved in lfs_prefix.

  • Problem 1: when the name is no longer valid, getting an auth token for it will no longer work
  • Problem 2: If we use the dataset / org actual name (not the old one saved in lfs_prefix), we will be rejected by Giftless because it has the resource stored based on lfs_prefix.

Potential fixes:

  1. Switching to store using UUIDs and not names: will only partially solve the problem, won't help if a dataset is moved between organizations

  2. Writing a custom resource authorization handler in ckanext-blob-storage that authorizes based on the actual resource, then provides a token for lfs_prefix - may work, but it is not clear this is possible given the current ckanext-authz-service API, which may not allow setting a custom scope for a resource auth request. This would require decoupling the dataset / organization package the scope is requested for from the scope string itself in some way - that is, some work on ckanext-authz-service.

  3. Add support for object-specific scopes in Giftless and generate / use these kinds of scopes in ckanext-blob-storage. For example, instead of the current obj:my-org/my-dataset/*:read tokens that we use to get download access (in which my-org/my-dataset comes from lfs_prefix), we generate tokens that look like obj:<sha256>:read. This will need to be supported by Giftless first (this will require slight modifications to the Giftless authorization code). Then add generation of such tokens in ckanext-blob-storage. This may also require some modification in ckanext-authz-service to make scope formats more flexible, as with 2 above. I kind of prefer this to 2, as we'll end up with slightly cleaner scopes. Downloading will still require us to keep lfs_prefix and use it for download batch requests (but not in the JWT token).

  4. Do 3 but also do away with the hierarchical storage structure in Giftless entirely (or at least make it optional), so that all objects are accessible without lfs_prefix and just require sha256 + size. This will be the cleanest solution but will require the most refactoring on all of giftless, ckanext-blob-storage and ckanext-authz-service. Benefits: no need to keep lfs_prefix around, as long as an object's sha256 doesn't change you can read it (if it changes it's not the same object...). Download scopes will need to be for a specific sha256. Upload tokens - need to do more analysis but probably you can always upload (assuming you have write access to anywhere in CKAN). Overwriting objects is not possible with Giftless anyway ("should not happen":tm:) This also adds the benefit of de-duplicating uploads across all objects, not just if they happen to share organization / dataset. This requires some deeper analysis but is most likely the cleanest, most robust but also more expensive solution.

[epic] Support uploading new resources via JWT Authorization -> Giftless -> Cloud Storage flow

Override CKAN's default resource uploading flow with a client-side driven flow:

     ┌───────┐          ┌────────┐                               ┌────────┐          ┌────────────┐          ┌────┐
     │Browser│          │AuthzAPI│                               │Giftless│          │CloudStorage│          │CKAN│
     └───┬───┘          └───┬────┘                               └───┬────┘          └─────┬──────┘          └─┬──┘
         │    authorize     │                                        │                     │                   │   
         │─────────────────>│                                        │                     │                   │   
         │                  │                                        │                     │                   │   
         │      token       │                                        │                     │                   │   
         │<─────────────────│                                        │                     │                   │   
         │                  │                                        │                     │                   │   
         │authorize_upload  │                                        │                     │                   │   
         │(token, supported adapters, resource sha256, resource size)│                     │                   │   
         │───────────────────────────────────────────────────────────>                     │                   │   
         │                  │                                        │                     │                   │   
         │             upload action                                 │                     │                   │   
         │             (transfer adapter, href, headers)             │                     │                   │   
         │<───────────────────────────────────────────────────────────                     │                   │   
         │                  │                                        │                     │                   │   
         │                  │         PUT (href, headers, file)      │                     │                   │   
         │─────────────────────────────────────────────────────────────────────────────────>                   │   
         │                  │                                        │                     │                   │   
         │                  │              OK (file meta)            │                     │                   │   
         │<─────────────────────────────────────────────────────────────────────────────────                   │   
         │                  │                                        │                     │                   │   
         │                  │            add resource (resource meta, file meta)           │                   │   
         │─────────────────────────────────────────────────────────────────────────────────────────────────────>   
         │                  │                                        │                     │                   │   
         │                  │                              OK        │                     │                   │   
         │<─────────────────────────────────────────────────────────────────────────────────────────────────────   
     ┌───┴───┐          ┌───┴────┐                               ┌───┴────┐          ┌─────┴──────┐          ┌─┴──┐
     │Browser│          │AuthzAPI│                               │Giftless│          │CloudStorage│          │CKAN│
     └───────┘          └────────┘                               └────────┘          └────────────┘          └────┘

Most of the client-side logic for this is implemented in the ckan3-js-sdk work done here: https://gitlab.com/datopian/experiments/ckan3-js-sdk

Acceptance Criteria

  • When a new resource is added to a Dataset and the button Save is clicked the resource should be uploaded to the specific Blob and the page should redirect to the dataset's view.

  • When a new resource is added to a Dataset and the button Save & Add Another is clicked the resource should be uploaded to the specific Blob and the page should be redirected to the resource_new form so the user can add another resource

  • The name of the uploaded resource should be: filename? url?

Tasks

  • Create a new javascript ckan module and override the upload actions to call the SDK instead of the CKAN logic.

  • The following templates should be overridden to add the new logic:

    • resource_edit.html

    • new_resource.html

    • new_resource_not_draft.html

Analysis

Open Questions:

  • Where are we going to call the SDK? (I guess when clicking on Save or Save Another)

  • Which parameters will it receive? (I guess a complete Dataset object with the new resource / list of resources?)

  • What will it return? (If we call the CKAN create API from the SDK then probably it will just redirect to the dataset page)

  • What are we going to store in CKAN Metastore? (Link to the dataset? Hash? Filename?)

Allow other extensions to provide fallback download methods

For the sake of backwards compatibility and extensibility, it would be really nice if other extensions could be used to handle downloads of some resources, e.g. resources uploaded to cloud storage before ckanext-external-storage was in use.

For this purpose, I suggest defining a new interface that will allow other CKAN plugins to register a download handler, and call these download handlers in order until one of them decides to handle the download. This process will be called by the blueprint responsible for handling the download resource route.

The ckanext-external-storage download handler could also be registered as such a handler instead of being hard-coded in the process. This will allow plugin load order to decide who gets priority.

As a last resort, if no handler would do the job, we can fall back to the built-in CKAN download logic of redirecting to a URL or calling Flask's send_file assuming the file is local.
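
A sketch of what such an interface could look like (the interface name and method signature here are only a suggestion, not an existing API):

import ckan.plugins.interfaces as interfaces

class IResourceDownloadHandler(interfaces.Interface):

    def download_resource(self, resource, package, filename):
        """Return a Flask response (e.g. a redirect to a signed URL) if this
        plugin can handle downloading the given resource, or None to let the
        next registered handler (and eventually the built-in logic) try."""
        return None

The download blueprint would then iterate over plugins implementing this interface, in plugin load order, and use the first non-None response.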

Discussion re extending download to support multiple blob files per resource

# bit more complex as we have verify and multipart
def get_upload_url_and_headers(file_id, file_size, lfs_prefix, identity):
    # file_id is the file's sha256, lfs_prefix is org/dataset, identity is the caller
    return {"upload_url": ..., "headers": ...}

def get_download_url_and_headers(file_id, file_size, lfs_prefix, identity):
    return {"download_url": ...}

Storage bucket:

{bucket}/{configured prefix}/{lfs_prefix := org/dataset}/{file-path usually sha256}

Uploading files:

  • Get an authz token from CKAN
  • Go to giftless with that token
  • Exchange that token for an upload token
  • Upload

Downloading files ...

  • Have a desired url (?)
  • Get an authz token from CKAN
  • Go to giftless with that token + desired url
  • Exchange that token for download token (ie. url + headers)
  • Download the file ...

How CKAN (ckanext-blob-storage) works (?):

Questions:

  • Should we assume everything in the storage is always non-public and you need signed credentials to get in? YES, this is the secure way to do things
    • Atm you can't set this per resource in your call to Giftless.
  • What is stored on a resource?
    • What is added to resource_show / package_show? The preference would be to have a url with time-limited expiry ... No, that does not work, as a user may have their browser open for 30m and then the url no longer works
  • How do I set the download name for a file as part of upload (can I set a header)? Or is that set on downloading?
  • How would giftless work without CKAN, e.g. integrated with git? Specifically, how would one do authorization?

Proposal

Proposal B:

Modify download handler to be a bit more generic:

{ckan-instance}/dataset/65050ec0-5abd-48ce-989d-defc08ed837e/resource/26f3d260-9b90-40c8-90de-c540704f59ac/download/sha256:{sha256::size}

  => 404, 401 or 302 to download location

Proposal A: download_url API - rejected because we don't want an API ...

resource:

{
  // current
  url: {ckan-instance}/dataset/65050ec0-5abd-48ce-989d-defc08ed837e/resource/26f3d260-9b90-40c8-90de-c540704f59ac/download/{file-name}
  // new
  url: {ckan-instance}/api/3/action/download_url?sha256=...,size=...,dataset=...,resource=...
  zip_url: {ckan-instance}/api/3/action/download_url?...
}

What I want ...

is the ability to store more than one piece of blob data for a resource and get download urls for that

[upload] browser crashes while trying to upload large file (~600Mb)

Browser crashes while trying to upload large CSV file (~600Mb) for the resource in NHS Staging (https://ckan.nhs.staging.datopian.com/)

Please see the screencast

Acceptance

  • The large files (currently up to ~6Gb) are uploaded directly through the GUI portal.

Analysis

@shevron as per our discussion earlier today, and from what I checked after, it looks like the problem here is that we are sending the actual file through Giftless and not directly into the cloud storage (Google Cloud specifically here). So this approach works fine with small files (up to ~100Mb), but not with big ones.

Installing with wheels fails

The following instruction fails:
RUN pip3 wheel git+https://github.com/datopian/[email protected]#egg=ckanext-blob-storage

This throws the following error:

ERROR: Could not find a version that satisfies the requirement ckanext-authz-service (from ckanext-blob-storage) (from versions: none)
ERROR: No matching distribution found for ckanext-authz-service

Example of how we're implementing it: https://github.com/opticrd/datos-portal-backend/blob/master/ckan/Dockerfile#L48

The other extensions are working just fine

Integrate new resource editor into this app

Integrate new resource editor created here datopian/datapub#1

Acceptance criteria

  • When this extension is installed, create a new resource UI uses the DataPub app
  • I can quickly pull the latest changes from the DataPub project into this extension, e.g. by running a couple of commands

Tasks

  • git submodule vs other approaches
  • remove current implementation of new resource
  • enable the DataPub app
  • document

Analysis

Git submodule

  1. Create submodule in the root directory.
  2. Build bundles by cd ./datapub && yarn && yarn build
  3. Move bundles to the fanstatic directory: mv ./static/js/ ./ckanext/external-storage/fanstatic/js/datapub
  4. Keep hashes in the bundle names for version control. Every time a new bundle is built, we will need to update the template.

Do we need to use stylesheets? If yes, we can also move CSS files:

mv ./static/css/ ./ckanext/external-storage/fanstatic/css/datapub

Use pre built bundles

If we build and commit bundles in the DataPub repo, we can use them directly from Github (?):

  1. Bundle file is built at github.com/datopian/datapub/bundle.js - can't have hashes in file names here as it would break once changed.
  2. From CKAN template we import the module...

Supported Versions of CKAN

CKAN has announced that they are in the process to release 2.10: https://ckan.org/blog/getting-ready-for-ckan-210

This release will include dropping support for 2.8. Also, support for Python 2.7 in extensions is going to be dropped starting in 2023.

Is the plan of this extension to continue providing support for 2.8 and Python 2.7?

I'm asking in the context of a migration of this extension to 2.10. Should a future PR keep supporting old versions or is it safe to drop it?

Not Compatible with the latest versions of SetupTools

Using setuptools version > 67.2.0, this line:

import ckanext.blob_storage

causes an error to be raised during install:

ModuleNotFoundError: No module named 'ckanext.blob_storage'

To circumvent the issue for now we have just pinned setuptools to "==67.2.0", but this is probably not a permanent solution.

pip install setuptools==67.2.0

Would appreciate knowing if any others are seeing the same issue, and opinions on whether this is an issue to be raised with setuptools, or an issue to be fixed by a PR to this repo.

Simplify development setup for new developers

Currently it is quite hard (as with all CKAN extensions) to quickly set up a new development environment for new developers less familiar with the CKAN extension development process.

It would be really useful to create some automation (+ documentation) that will assist setting up a development environment.

This can be done via docker-compose + some Make targets

Move to storage layout of {dataset-uuid}/{sha256}

Follow up to #45: the current blob storage approach has an issue when one moves a dataset from one organization to another (or the dataset is renamed). This is because we are storing data in blob storage at {org}/{dataset-name}/ and using the information when performing scope validation in giftless.

To avoid this, it is proposed to use <static-prefix>/<dataset-UUID> as the LFS prefix when storing new resources.

  • The <static-prefix> part is set in config for the whole CKAN site. It is not technically essential, but having it will allow us to avoid any modifications to Giftless, which expects a two-part (org name / repo name) prefix.
  • Using dataset-UUID instead of dataset-name means the resource's container prefix will not need to be rewritten / mangled if the dataset is renamed or moved to a different organization, or if the organization is renamed.
  • This will allow us to drop the Scope Normalizer solution altogether, as scopes will always be <static-prefix>/<dataset-UUID>/<sha256>.
  • As a compatibility measure, we can:
    • Keep lfs_prefix for now
    • If lfs_prefix is set to something that doesn't look like the new prefix format, still go through scope normalization
    • Run migration to move all in-storage objects to new format containers
    • Stop using lfs_prefix altogether, as it is not needed anymore (although we don't have to, and keeping it can be beneficial at some point, e.g. if an additional change is required).
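
A minimal sketch of how the proposed prefix could be composed (reusing the storage_namespace config option as the static prefix is an assumption, not a decision made in this issue):

import ckan.plugins.toolkit as toolkit

def dataset_lfs_prefix(dataset_id):
    # <static-prefix>/<dataset-UUID>: stable even if the dataset or its
    # organization is later renamed, or the dataset is moved
    static_prefix = toolkit.config.get(
        "ckanext.blob_storage.storage_namespace", "ckan")
    return "{}/{}".format(static_prefix, dataset_id)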

Tasks

  • Change ckanext-blob-storage to use static-prefix/uuid as prefix when uploading ~2d
    • Need to decide if we keep lfs_prefix around or not. If we do not, need to flag resources that are in Git LFS in some other way (or rely on sha256 being set as the indicator)
    • Static prefix should be config based
    • Token handling - this can be done by registering a new auth handler, by using (different) scope normalizer logic (probably easiest), or by making some adjustments to ckanext-authz-service.
    • Upload location - probably an easy change
  • Ensure backwards compatibility with already-migrated resources ~1d
    • e.g. by dealing with lfs_prefix in the scope normalizer - not needed if we don't need BC, e.g. if we can do the migration during downtime.
  • Write and run migration script to move resources from name-based LFS prefix to UUID based - ~2d
  • Deployment and testing ~1-2d

Analysis

What's the problem

Imagine I want to download the blob related to a resource ...

  • I get the resource metadata
  • I go to the ckanext-authz endpoint and say: give me a token to read a resource
    • The token I ask for will contain the scope: obj:myorg/mydataset/*:read - to read every resource of myorg/mydataset.
  • I take that token to giftless and give the token along with XXX
    • How does giftless know whether it should grant access?
    • It looks at the requested storage object identifier:
      • POST to /myorg/myrepo/objects/batch with {oid: <sha256>} - this identifies the object and can be checked against the scope in the token.
        • Can I do a POST to /{prefix}/objects/batch with {oid: <sha256>} and prefix can be ...
    • and compares it with the provided scopes ...
      • How does it know that storage object is covered by the scope? A scope accepted by Giftless looks something like obj:myorg/mydataset/*:read
  • Giftless gives me a token for the storage (a url)

Where this goes wrong is if I have moved the dataset ... because now the giftless location is still the old dataset whilst the scope is for the new dataset ...

Options

  • Flat namespace /{sha256} in storage space that is pure content addressed
  • Scoped storage with entity UUID: /{dataset-uuid}/sha256 Preferred
  • Relocate data ...
  • Temporary solution

Quick Fix: Scope Normalizer based Quick Fix - DONE in #47

Change from obj:myorg/myrepo/sha256:read to obj:*/*/sha256:read or even obj:sha256:read

Assumption: a scope normalizer function registered in ckanext-blob-storage for obj scopes can mangle requests for res:<org>/<dataset>/<sha256>:read to something like res:*/*/<sha256>.

If this is true, we can:

  • Fix up Giftless to accept such scopes and only check the sha256 (most likely quick)
  • Fix up ckanext-blob-storage and all relevant JS code handling downloads (if any) to include the sha256 in the scope auth request and ensure this input format is accepted and scope is granted
  • This should work around this problem. It means that:
    • We continue to rely on lfs_prefix to send the batch request but not to get the auth token when downloading
    • Uploads will continue to work as now
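
A minimal sketch of such a normalizer function, assuming the scope string format shown above (the function name and the registration mechanism with ckanext-authz-service are assumptions and are not shown here):

def normalize_object_scope(scope):
    # Rewrite obj:<org>/<dataset>/<sha256>:read into obj:*/*/<sha256>:read,
    # so that the organization / dataset names no longer matter for authorization
    parts = scope.split(":", 2)
    if len(parts) != 3 or parts[0] != "obj":
        return scope
    sha256 = parts[1].split("/")[-1]
    return "obj:*/*/{}:{}".format(sha256, parts[2])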

Document resource fields required for blob-storage to operate

There are a few resource fields that are required for ckanext-blob-storage to operate. Normally, users should not care about these, but when integrating datapub there were a few issues with some of these fields not being set. If users use the API / SDK to create new resources, they might face the same issues. Most of these fields are standard CKAN fields but some may not be, and we should document them.

The required fields are:

  • url - the file name, without path (required by vanilla CKAN not just by blob storage)
  • url_type - set to "upload" for uploaded files
  • sha256 - the SHA256 of the file
  • size - the size of the file in bytes
  • lfs_prefix - the LFS server path of where the file has been stored by Giftless. Something like <org>/<dataset>. This is important so that if the org / dataset are renamed in the future, the original file can still be found.

Migration script from ckanext-cloudstorage to this storage

A script to migrate resources can be written based on the CKAN API (+ Python SDK, as it will need access to Git LFS as well as CKAN), or by directly accessing the DB and Azure Blob Storage. The former is most likely preferable and easier, though the latter may be faster.

Acceptance

  • Script to migrate data from one storage bucket (on Azure) to another
  • Script to update metadata in CKAN MetaStore
  • Test for this script (?)

Tasks

  • Get all resources that do not have lfs_prefix and sha256 attributes set
  • For each resource found, download the file
  • Calculate sha256 and upload the resource to blob storage via the Git LFS server. NOTE: we probably want to use azure copy commands rather than download and upload, as the data is large. We still need to download the file to calculate the sha
  • Update the resource lfs_prefix and sha256 attributes
  • Iterate until no more resources are found
  • Test out ...

This can easily be parallelized if we need to (e.g. if we need to run fast if the system is taken down) or run slowly in the background. It can also be restarted if needed and will continue from where it stopped.
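
A rough sketch of the simple download-and-upload variant of the migration loop (assuming the ckanapi client library is available; upload_to_lfs is a hypothetical callable standing in for the Git LFS batch + upload step and is expected to return the lfs_prefix the file was stored under):

import hashlib
import requests
from ckanapi import RemoteCKAN

def migrate_resources(ckan_url, api_key, upload_to_lfs):
    ckan = RemoteCKAN(ckan_url, apikey=api_key)
    start = 0
    while True:
        found = ckan.action.package_search(
            rows=100, start=start, include_private=True)["results"]
        if not found:
            break
        for dataset in found:
            for res in dataset.get("resources", []):
                # Skip external URLs and already-migrated resources
                if res.get("url_type") != "upload" or res.get("sha256"):
                    continue
                data = requests.get(res["url"]).content
                sha256 = hashlib.sha256(data).hexdigest()
                lfs_prefix = upload_to_lfs(dataset, res, data, sha256)
                ckan.action.resource_patch(
                    id=res["id"], sha256=sha256,
                    size=len(data), lfs_prefix=lfs_prefix)
        start += 100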

Figure out what UI, if at all, we bundle with this extension

In most current deployments, the default UI bundled with this extension is overridden with specific datapub customization, and is never used. On the other hand, shipping this extension with no UI at all kind of defeats the purpose, as the whole process of uploading resources with Git LFS based storage is driven by GUI.

We need to have a discussion and figure out what UI (templates + JS + CSS) code we ship with this extension, and how it allows customization by other extensions.

After that is done, clean up the code here and remove any redundant / unused code.

Add actions tests

Create tests to cover the actions resource_sample_show and resource_schema_show after the PR #30

Acceptance

  • Add test for resource_sample_show
  • Add test for resource_schema_show

Tasks

  • Create new actions_test file
  • Add test for resource_sample_show
  • Add test for resource_schema_show

cc: @shevron

This package depends on OpenSSL version no longer supported by the OpenSSL project

pip-installing from requirements.py2.txt gives this error (surely we can work around it with the suggested env var, but I think it is worth reporting anyway, see below):

Traceback (most recent call last):
  File "/usr/bin/pip", line 5, in <module>
    from pip._internal import main
  File "/usr/lib/python2.7/site-packages/pip/_internal/__init__.py", line 40, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/usr/lib/python2.7/site-packages/pip/_internal/cli/autocompletion.py", line 8, in <module>
    from pip._internal.cli.main_parser import create_main_parser
  File "/usr/lib/python2.7/site-packages/pip/_internal/cli/main_parser.py", line 12, in <module>
    from pip._internal.commands import (
  File "/usr/lib/python2.7/site-packages/pip/_internal/commands/__init__.py", line 6, in <module>
    from pip._internal.commands.completion import CompletionCommand
  File "/usr/lib/python2.7/site-packages/pip/_internal/commands/completion.py", line 6, in <module>
    from pip._internal.cli.base_command import Command
  File "/usr/lib/python2.7/site-packages/pip/_internal/cli/base_command.py", line 18, in <module>
    from pip._internal.download import PipSession
  File "/usr/lib/python2.7/site-packages/pip/_internal/download.py", line 15, in <module>
    from pip._vendor import requests, six, urllib3
  File "/usr/lib/python2.7/site-packages/pip/_vendor/requests/__init__.py", line 97, in <module>
    from pip._vendor.urllib3.contrib import pyopenssl
  File "/usr/lib/python2.7/site-packages/pip/_vendor/urllib3/contrib/pyopenssl.py", line 46, in <module>
    import OpenSSL.SSL
  File "/usr/lib/python2.7/site-packages/OpenSSL/__init__.py", line 8, in <module>
    from OpenSSL import crypto, SSL
  File "/usr/lib/python2.7/site-packages/OpenSSL/crypto.py", line 17, in <module>
    from OpenSSL._util import (
  File "/usr/lib/python2.7/site-packages/OpenSSL/_util.py", line 6, in <module>
    from cryptography.hazmat.bindings.openssl.binding import Binding
  File "/usr/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 222, in <module>
    _verify_openssl_version(Binding.lib)
  File "/usr/lib/python2.7/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 183, in _verify_openssl_version
    "You are linking against OpenSSL 1.0.2, which is no longer "
RuntimeError: You are linking against OpenSSL 1.0.2, which is no longer supported by the OpenSSL project. To use this version of cryptography you need to upgrade to a newer version of OpenSSL. For this version only you can also set the environment variable CRYPTOGRAPHY_ALLOW_OPENSSL_102 to allow OpenSSL 1.0.2.
The command '/bin/sh -c pip install --no-cache-dir -r ckanext-external-storage/requirements.py2.txt' returned a non-zero code: 1

Delete a file from the external storage when deleting a resource

After deleting an uploaded resource, the resource disappears from the resource list of the related dataset. But after uploading the same file again, the UI displays a message informing the user that the file already exists in the external storage.

When a user deletes a resource in the edit resource view, the user expects that the resource is also deleted from the external storage.

Steps to reproduce:

  1. Create a dataset
  2. Create a resource
  3. Upload a file
  4. Delete the created resource
  5. Create a new resource
  6. Upload the same deleted file again
  7. UI shows a message warning the user that file already exists in the storage.

Screenshot from 2020-10-01 19-20-51

Acceptance

  • Have the file deleted from the storage after deleting a resource.
  • It's possible to upload a file that was deleted previously

Setting lfs_prefix on the server side

This is a follow up of #51

Re lfs_prefix it kind of makes sense to create it on the server side, but it also doesn't - the uploading client should set it after doing the upload, after all the server isn't uploading. I'm thinking that maybe the right solution would be an endpoint that provides the "right" lfs_prefix, along with the LFS server URL and an auth token. This can be used by API clients to upload and then set the right prefix. I'm not sure about this. What do you think?

Main issue:

Nowadays all the API calls in SDKs to upload files need to know implementation details of the server in order to set lfs_prefix correctly. How can we provide a better UX for this?

Rename extension and Python module to ckanext-blob-storage

external is really unclear - storage is always external ...

Really this is cloud storage so ckanext-cloud-storage would be best, but I gather that name is already taken. In which case I recommend the simple:

ckanext-storage

Or, if we want to be pedantic:

ckanext-blob-storage

/cc @shevron - any thoughts?

Integrate ckan3-js-sdk into extension

Integrate the standalone JS SDK into the extension in a clean way, using modern JS / ES build tools.

Acceptance Criteria

  • ckan3-js-sdk is specified in package.json and can be installed by running npm install
  • Running npm build or a similar command will generate a bundle JS file in ckanext/external_storage/fanstatic/js/index.js (or similar) which will include all the required, transpiled ES code in a single bundle
  • This file can be included in templates using {% resource() %} and defines a CKAN JS module using ckan3-js-sdk API
  • Add a Makefile target to wrap the NPM build command and help automate the process

Technical Tasks

  • Set up package.json with the right dependencies + commands
  • Set up Webpack / Babel config files to generate the bundle
  • Create a wrapper script defining CKAN JS module
  • Add Makefile target to automate the build process

References

See https://github.com/datopian/ckanext-querytool for an example of a CKAN extension with a modern JS environment

Implement resource upload flow

Acceptance Criteria

  • When a file is uploaded for a resource, the file is saved via Git LFS to storage and the resource metadata is saved correctly in CKAN DB
  • When a resource is an external URL normal CKAN behavior is maintained
  • This is tested and works in all the following cases:
    • Add single resource to new dataset
    • Add multiple resources to a new dataset
    • Add single / multiple resources to an existing dataset
    • Replace a file in an existing resource (edit)

Technical Tasks

The following should happen in a single template (prob. templates/external_storage/snippets/resource_upload.html included from other templates):

  • When the user selects a file, the file's hash is calculated (this can also be done when "Save" is clicked, but it would be nice to start early)
  • When the user clicks "Save" or "Save and Add Another":
    • Calculate file hash if not already done
    • Get an authz token (or verify we have one cached) (set the right scope)
    • Send a batch API upload request to the LFS server (see the sketch after this list)
    • Perform the upload + show progress bar
    • Verify Upload
    • Store file metadata in resource object - this must include the original file name, oid / sha256, and file size (needed to later download from LFS)
    • Post resource form
  • Continue normal CKAN flow - e.g. redirect to resource / dataset page
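
The batch request step above follows the standard Git LFS batch API. A rough sketch of that call, shown here in Python for illustration (the actual implementation is browser-side JavaScript; the function name is hypothetical and the inputs are assumed to come from the extension config, the authz token request and the selected file):

import requests

def request_upload_action(lfs_server_url, lfs_prefix, authz_token, sha256, size):
    resp = requests.post(
        "{}/{}/objects/batch".format(lfs_server_url.rstrip("/"), lfs_prefix),
        json={
            "operation": "upload",
            "transfers": ["basic"],
            "objects": [{"oid": sha256, "size": size}],
        },
        headers={
            "Authorization": "Bearer {}".format(authz_token),
            "Accept": "application/vnd.git-lfs+json",
            "Content-Type": "application/vnd.git-lfs+json",
        },
    )
    resp.raise_for_status()
    obj = resp.json()["objects"][0]
    # If no "actions" are returned, the object already exists in storage
    # and no upload (only the resource metadata update) is needed
    return obj.get("actions", {}).get("upload")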

Tidy up README ...

README seems to have a lot of boilerplate about extensions that I'm not sure is relevant.

Also very simple instructions on how to install the extension in e.g. a docker setup would be useful.

Create a new CKAN api action `custom_resource_show` to return schema and sample attributes as a JSON object instead of a string

The resource_show output for schema and sample is a string containing a Python unicode repr, like the example below, and we need this information as JSON to be able to render/edit it in the tableschema component when the user is on the edit page.

"{u'fields': [{u'type': u'string', u'name': u'Series_reference', u'format': u'default'}, {u'type': u'number', u'name': u'Period', u'format': u'default'}, {u'type': u'integer', u'name': u'Data_value', u'format': u'default'}, {u'type': u'string', u'name': u'STATUS', u'format': u'default'}, {u'type': u'string', u'name': u'UNITS', u'format': u'default'}, {u'type': u'integer', u'name': u'MAGNTUDE', u'format': u'default'}, {u'type': u'string', u'name': u'Subject', u'format': u'default'}, {u'type': u'string', u'name': u'Group', u'format': u'default'}, {u'type': u'string', u'name': u'Series_title_1', u'format': u'default'}, {u'type': u'string', u'name': u'Series_title_2', u'format': u'default'}, {u'type': u'string', u'name': u'Series_title_3', u'format': u'default'}, {u'type': u'string', u'name': u'Series_title_4', u'format': u'default'}, {u'type': u'string', u'name': u'Series_title_5', u'format': u'default'}], u'missingValues': [u'']}"

more details about the issue in datopian/datapub#37

Acceptance

Resource edit page works correctly

  • create a new CKAN api action custom_resource_show to return the schema and sample properties as JSON, not as a string. Or override the resource_show method in the extension.

Tasks

  • create a new CKAN api action custom_resource_show
  • change the schema and sample properties to be returned as JSON, not as a string.
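
A sketch of how such an action could parse the stored string (the action name follows the resource_schema_show naming mentioned elsewhere in this repo, but this implementation is only illustrative; ast.literal_eval is used because the stored value is a Python repr, not valid JSON):

import ast
import ckan.plugins.toolkit as toolkit

@toolkit.side_effect_free
def resource_schema_show(context, data_dict):
    resource = toolkit.get_action("resource_show")(context, data_dict)
    schema = resource.get("schema")
    if schema and not isinstance(schema, dict):
        # The stored value is a repr() of a Python dict (u'...' style),
        # so ast.literal_eval is used here instead of json.loads
        schema = ast.literal_eval(schema)
    return schema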

Should we commit JS dependencies and built JS bundle?

This CKAN extension bundles some modern ES6 code, and requires running npm install && npm run build to properly function after installation. This is unusual for CKAN extensions and causes some confusion during installation.

One option to avoid this is to commit our build bundle to the repo.

This means users will not need to do anything special on installation, but that every change to our JS source code or even worse, to any of our tree of npm dependencies, will require a commit.
