data's Introduction

DIALS: Diffraction Integration for Advanced Light Sources


X-ray crystallography for structural biology has benefited greatly from a number of recent advances, including high-performance pixel array detectors, new beamlines capable of delivering micron and sub-micron focus, and new light sources such as XFELs. The DIALS project is a collaborative endeavour to develop new diffraction integration software to meet the data analysis requirements presented by these advances. There are three end goals: to develop an extensible framework for the development of algorithms to analyse X-ray diffraction data; to implement algorithms within this framework; and finally to provide a set of user-facing tools that use these algorithms to integrate data from diffraction experiments at synchrotron and free-electron sources.

Website

https://dials.github.io

Reference

Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. and Evans, G. (2018) Acta Cryst. D74.

Funding

DIALS development at Diamond Light Source is supported by the BioStruct-X EU grant, Diamond Light Source, and CCP4.

DIALS development at Lawrence Berkeley National Laboratory is supported by National Institutes of Health / National Institute of General Medical Sciences grant R01-GM117126. Work at LBNL is performed under Department of Energy contract DE-AC02-05CH11231.

data's People

Contributors

anthchirp, benjaminhwilliams, d-j-hatton, dagewa, dependabot-preview[bot], dependabot[bot], diamondlightsource-build-server, elena-pascal, graeme-winter, jbeilstenedmands, lgtm-com[bot], ndevenish, phyy-nx, pyup-bot, renovate-bot, renovate[bot], rjgildea, toastisme


data's Issues

Dependency Dashboard

This issue provides visibility into Renovate updates and their statuses. Learn more

Edited/Blocked

These updates have been manually edited so Renovate will no longer make changes. To discard all commits and start over, click on a checkbox.


  • Check this box to trigger a request for Renovate to run again on this repository

Allow test data to be downloaded from google drive

Dependency += gdown

Example code snippet:

import gdown

# Google Drive file IDs for the files to fetch
files = {
    "indexed.refl": "1iXejo0YSBpBq_WezDnz4bKuA4CpVE0dz",
    "indexed.expt": "145e0pSGSYtu5uZ9pfvyXwhWnhklkAA-m",
}

for out, fid in files.items():
    gdown.download(f"https://drive.google.com/uc?id={fid}", out, quiet=True)

This would make it very easy to fetch data from somewhere to which people can easily add new files.

Evaluate pooch

Learnt of the existence of pooch at SciPy 2020. It could be a new backend for dials-data, handling downloading, caching, and unpacking.
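The main thing pooch would bring is cached, hash-verified downloads. The verification half of that idea can be sketched with the standard library alone (the function names below are illustrative, not the pooch API):

```python
import hashlib
from pathlib import Path


def file_matches_hash(path, expected_sha256):
    """Return True if path exists and its SHA-256 digest matches expected_sha256."""
    p = Path(path)
    if not p.is_file():
        return False
    return hashlib.sha256(p.read_bytes()).hexdigest() == expected_sha256


def needs_download(path, expected_sha256):
    """A pooch-style cache check: fetch only when the file is missing or corrupted."""
    return not file_matches_hash(path, expected_sha256)
```

pooch wraps this pattern up with a per-dataset registry of hashes and an OS-appropriate cache directory, which is most of what dials-data currently does by hand.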

Documentation refers to both `dials_data` and `dials-data`

And I'm pretty sure there is also a dials.data. Maybe even a dials/data. I think the most consistent package name is 'dials-data' (while pip is happy to understand dials_data conda is not) but github repo and pytest fixture are dials_data.

I'm trying to move all package references to dials-datain #400, but not sure that reduces entropy.

refinement_test_data needs a directory structure

The refinement_test_data directory on data-files has sub-directories mirroring the original layout in dials_regression. I forgot about that when adding the definition, so the resulting dials-data dataset is missing some of the files, which share file names across sub-directories.

Is it possible to define the directory structure? I note the spring8_ccp4_2018 dataset does this, but by downloading and unpacking a tar archive.
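For reference, the tar approach works because tarfile stores each member's path relative to the archive root and recreates it on extraction, so files with shared names in different sub-directories survive the round trip. A minimal sketch (the helper names are illustrative):

```python
import tarfile
from pathlib import Path


def pack_with_structure(src_dir, archive_path):
    """Create a tar.gz whose members keep their paths relative to src_dir."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in sorted(src.rglob("*")):
            if f.is_file():
                tar.add(f, arcname=str(f.relative_to(src)))


def unpack_with_structure(archive_path, dest_dir):
    """Extract the archive, recreating the sub-directory layout under dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
```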

Retry downloads

Sometimes I see sporadic 403 errors or timeouts, so we should retry downloads a few times before giving up.
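A minimal retry wrapper along those lines, using only the standard library (the function name, attempt count, and backoff policy are illustrative):

```python
import time
import urllib.error
import urllib.request


def download_with_retries(url, dest, attempts=3, delay=2.0):
    """Fetch url to dest, retrying on errors such as sporadic 403s or timeouts."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=60) as response:
                payload = response.read()
            with open(dest, "wb") as fh:
                fh.write(payload)
            return
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay * attempt)  # simple linear backoff
```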

Get data directory path via command line

It would be useful to have a way to get the directory path for a given dials.data dataset. Essentially, a simpler way of facilitating command lines such as:

$ xia2.multiplex $(dials.data get multi_crystal_proteinase_k | grep : | cut -d: -f 2)/multi_crystal_proteinase_k/*
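One possible shape for this, sketched as a hypothetical --quiet flag that prints only the path; the flag and the storage-path lookup below are assumptions for illustration, not the current CLI:

```python
import argparse


def main(args=None):
    # Hypothetical sketch of "dials.data get" with a --quiet option.
    parser = argparse.ArgumentParser(prog="dials.data get")
    parser.add_argument("dataset")
    parser.add_argument(
        "--quiet", action="store_true", help="print only the directory path"
    )
    opts = parser.parse_args(args)
    path = f"/path/to/store/{opts.dataset}"  # placeholder, not the real lookup
    if opts.quiet:
        print(path)
    else:
        print(f"Dataset {opts.dataset} stored in: {path}")
```

With something like this, the grep/cut pipeline above would collapse to a plain $(dials.data get --quiet multi_crystal_proteinase_k).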

.expt files do not consistently point at the correct image file locations

  • DIALS Regression Data version: v2.4.0
  • Python version: 3.8.5
  • Operating System: Ubuntu 18.04

Description

.expt files often do not have the correct image file locations, so any tests involving accessing the images will fail.
e.g. insulin_processed/*.expt assume images are in the same directory, but they are actually in ../insulin/.

It's not entirely clear that correcting this is even desirable, as the description of how the data were obtained would then no longer be quite accurate (i.e. repeating the same steps would not produce .expt files with the same imageset paths). But I think it's worth flagging.

What I Did

>>> ExperimentListFactory.from_json_file(dials_data("insulin_processed", pathlib=True)/"refined.expt")
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/model/experiment_list.py", line 767, in from_json_file
    return ExperimentListFactory.from_json(
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/model/experiment_list.py", line 755, in from_json
    return ExperimentListFactory.from_dict(
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/model/experiment_list.py", line 743, in from_dict
    experiments = ExperimentListDict(
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/model/experiment_list.py", line 355, in decode
    imageset = self._imageset_from_imageset_data(imageset_data, models)
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/model/experiment_list.py", line 214, in _imageset_from_imageset_data
    imageset = self._make_sequence(
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/model/experiment_list.py", line 433, in _make_sequence
    return ImageSetFactory.make_sequence(
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/imageset.py", line 597, in make_sequence
    format_class = dxtbx.format.Registry.get_format_class_for_file(
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/format/Registry.py", line 122, in get_format_class_for_file
    if scheme in format_class.schemes and format_class.understand(image_file_str):
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/format/FormatCBF.py", line 24, in understand
    with FormatCBF.open_file(image_file, "rb") as fh:
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/format/Format.py", line 565, in open_file
    return cls.get_cache_controller().check(filename_str, fh_func)
  File "/home/davidmcdonagh/work/dials/modules/dxtbx/src/dxtbx/filecache_controller.py", line 69, in check
    self._cache = dxtbx.filecache.lazy_file_cache(open_method())
FileNotFoundError: [Errno 2] No such file or directory: '/home/davidmcdonagh/work/dials/build/dials_data/insulin_processed/insulin_1_001.img'
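Until that is decided, one workaround is to rewrite the image paths in the JSON before loading it. This sketch assumes the usual ExperimentList layout (a top-level "imageset" list whose sequence entries carry a "template" key); treat those key names as assumptions:

```python
import json
from pathlib import Path


def repoint_imagesets(expt_file, image_dir):
    """Rewrite the imageset template paths in a .expt (JSON) file.

    Each template keeps its file name but is repointed at image_dir, so
    e.g. insulin_processed/*.expt can be aimed at ../insulin/.
    """
    doc = json.loads(Path(expt_file).read_text())
    for imageset in doc.get("imageset", []):
        if "template" in imageset:
            name = Path(imageset["template"]).name
            imageset["template"] = str(Path(image_dir) / name)
    Path(expt_file).write_text(json.dumps(doc, indent=2))
```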

PR builds broken on Python 3.6

I would like to get #254 and #256 merged soon, but builds have been broken on Python 3.6 for all recent PRs. This issue with the cryptography package may be relevant:
pyca/cryptography#5771

Could these PRs be merged anyway please? I have two dxtbx PRs that are waiting for these data definitions.

Out of date version number in repo vs PyPI?

@Anthchirp, I seem to be having trouble with dials_data versions when running this test (though Jenkins seems to be fine, instead falling over at the expected values of beam shift, which I was trying to look into). My dials_data is installed from a clone of the dials_data repo with libtbx.pip install -e <dials_data directory>. My copy seems to be up to date, but it reports version number 1.0.0-dev < 1.0.5, so pytest complains:

$ libtbx.pip show dials_data
Name: dials-data
Version: 1.0.0
Summary: DIALS Regression Data Manager
Home-page: https://github.com/dials/data
Author: Markus Gerstel
Author-email: [email protected]
License: BSD license
Location: /home/ekm22040/DIALS/data
Requires: pytest, pyyaml, setuptools, six
Required-by:
$ dials.data
usage: dials.data <command> [<args>]

DIALS regression data manager v1.0.0-dev

<...etc.>

Meanwhile, there appear to be several newer versions listed on PyPI:

$ libtbx.pip install dials_data==
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting dials_data==
  Could not find a version that satisfies the requirement dials_data== (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.6.1, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.0.5, 1.0.6, 1.0.7, 1.0.8, 1.0.9, 1.0.10, 1.0.11, 1.0.12, 1.0.13)
No matching distribution found for dials_data==

I notice that the module's __version__ hasn't changed since 63ca7a6. Should it be bumped to 1.0.13 now, and then bumped by Travis with each "update file information" PR?

Automated Zenodo uploader

  • DIALS Regression Data version: current
  • Python version: 3.6
  • Operating System: UNIX based

Description

Propose to add an automated Zenodo data uploader, which could also generate the appropriate JSON text for the new data set - there is a REST API which appears to work simply enough. It will require the user to generate an upload token using the instructions at:

https://zenodo.org/account/settings/applications/tokens/new/

What I Did

import os
import pprint
import sys

import requests

# Get yourself an access token from:
# https://zenodo.org/account/settings/applications/tokens/new/
ACCESS_TOKEN = "aaaaaaaaa"

# Create a new (empty) deposit to attach files to
headers = {"Content-Type": "application/json"}
r = requests.post(
    "https://zenodo.org/api/deposit/depositions",
    params={"access_token": ACCESS_TOKEN},
    json={},
    headers=headers,
)
print(r.status_code)
print(r.json())

d_id = r.json()["id"]

# Upload every file in each directory given on the command line
for directory in sys.argv[1:]:
    for filename in os.listdir(directory):
        print(filename)
        with open(os.path.join(directory, filename), "rb") as fh:
            r = requests.post(
                "https://zenodo.org/api/deposit/depositions/%s/files" % d_id,
                params={"access_token": ACCESS_TOKEN},
                data={"name": filename},
                files={"file": fh},
            )
        pprint.pprint(r.json())

This allows automated upload of every file in a directory, as an example. The token can have permission to complete the upload and publish; in my test case I did not try that, and just used it to upload 3,450 files.
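For the remaining step, Zenodo's REST API exposes a publish action on the deposit. A hedged sketch, untested against the live service per the note above:

```python
import requests


def publish_deposit(d_id, access_token):
    """POST the deposit's "publish" action; Zenodo returns the record as JSON."""
    r = requests.post(
        f"https://zenodo.org/api/deposit/depositions/{d_id}/actions/publish",
        params={"access_token": access_token},
    )
    r.raise_for_status()
    return r.json()
```

Note that publishing is final: published Zenodo records cannot be deleted, so the uploader should probably leave this step behind a confirmation.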

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Add Rigaku HyPix 6000 datasets for multi-sweep image to indexing solution tests

@biochem-fan pointed out some suitable datasets here: cctbx/dxtbx#653 (comment)

To do: look through these and find a reasonably small subset that can be added to dials-data in order to test multi-sweep indexing across different scan axes and $2\theta$ swings.

As discussed in a DIALS catch-up, this would be a fairly long test, but it could be added to the xia2 tests rather than DIALS. xia2 has a --regression-full option that can be used for long-running tests.
