materials-data-facility / scythe
An extensible library of tools that extract metadata from scientific files
License: Apache License 2.0
RDKit includes this functionality but requires a conda install.
https://rdkit.org
I installed MaterialsIO using poetry (see https://github.com/jat255/MaterialsIO/tree/poetry_compat), and I'm getting test failures on the crystal structure parser. It appears the issue is the ASE adapter, as it results in an incorrect interpretation (at least according to test_crystal_structure.py). Here are the test failures:
> assert isclose(output['crystal_structure']['number_of_atoms'], 5070.0)
E assert False
E + where False = isclose(4656.0, 5070.0)
> assert output == {'material': {'composition': 'Co270H1680C1872N324O924'},
'crystal_structure': {'space_group_number': 146,
'stoichiometry': 'A45B54C154D280E312'}}
E AssertionError: assert {'crystal_str...728N288O864'}} == {'crystal_stru...872N324O924'}}
E Differing items:
E {'material': {'composition': 'Co240H1536C1728N288O864'}} != {'material': {'composition': 'Co270H1680C1872N324O924'}}
E {'crystal_structure': {'space_group_number': 210, 'stoichiometry': 'A5B6C18D32E36'}} != {'crystal_structure': {'space_group_number': 146, 'stoichiometry': 'A45B54C154D280E312'}}
E Full diff:
E - {'crystal_structure': {'space_group_number': 210,
E ? - ^
E + {'crystal_structure': {'space_group_number': 146,
E ? ^^
E - 'stoichiometry': 'A5B6C18D32E36'},
E ? ^ ^^^ ^
E + 'stoichiometry': 'A45B54C154D280E312'},
E ? + ^^ ++++ ^ ^^
E - 'material': {'composition': 'Co240H1536C1728N288O864'}}
E ? ^ -- - ^^ ^^
E + 'material': {'composition': 'Co270H1680C1872N324O924'}}
E ? ^ ++ + + ^ ^^
Here is the list of packages installed in my environment. I tried to follow the package versions in requirements.txt and setup.py:
Package Version
--------------------- -----------
ase 3.22.1
atomicwrites 1.4.0
attrs 21.4.0
backcall 0.2.0
boto3 1.20.35
botocore 1.23.35
cached-property 1.5.2
certifi 2021.10.8
cffi 1.15.0
chardet 4.0.0
charset-normalizer 2.0.10
click 8.0.3
cloudpickle 2.0.0
coverage 6.2
coveralls 3.3.1
cryptography 36.0.1
cycler 0.11.0
Cython 0.29.26
dask 2021.12.0
debugpy 1.5.1
decorator 5.1.1
dftparse 0.3.0
dfttopif 1.1.0
dill 0.3.4
dnspython 2.2.0
docopt 0.6.2
docutils 0.17.1
entrypoints 0.3
et-xmlfile 1.1.0
fair-research-login 0.2.6
fett 0.3.2
flake8 4.0.1
fsspec 2022.1.0
globus-nexus-client 0.4.1
globus-sdk 3.2.1
greenlet 1.1.2
h5py 3.6.0
hyperspy 1.6.5
idna 3.3
ijson 3.1.4
imageio 2.9.0
importlib-metadata 4.2.0
importlib-resources 5.4.0
ipykernel 6.7.0
ipyparallel 8.1.0
ipython 7.31.0
isodate 0.6.1
jedi 0.18.1
jmespath 0.10.0
jsonlines 3.0.0
jsonschema 4.4.0
jupyter-client 7.1.0
jupyter-core 4.9.1
kiwisolver 1.3.2
linear-tsv 1.1.0
llvmlite 0.34.0
locket 0.2.1
materials-io 0.0.1
matplotlib 3.4.3
matplotlib-inline 0.1.3
mccabe 0.6.1
mdf-toolbox 0.5.11
monty 2022.1.12.1
more-itertools 8.12.0
mpmath 1.2.1
natsort 8.0.2
nest-asyncio 1.5.4
networkx 2.6.3
numba 0.51.2
numexpr 2.8.1
numpy 1.21.1
openpyxl 3.0.9
packaging 21.3
palettable 3.3.0
pandas 1.1.5
parso 0.8.3
partd 1.2.0
pbr 5.8.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
Pint 0.18
pip 21.3.1
pluggy 1.0.0
prettytable 3.0.0
prompt-toolkit 3.0.24
psutil 5.9.0
ptyprocess 0.7.0
py 1.11.0
pycalphad 0.9.2
pycodestyle 2.8.0
pycparser 2.21
PyDispatcher 2.0.5
pyflakes 2.4.0
Pygments 2.11.2
PyJWT 2.3.0
pymatgen 2018.12.12
pymongo 3.12.3
pyparsing 3.0.6
pypif 2.1.2
pyrsistent 0.18.0
pytest 3.10.1
pytest-cov 2.9.0
python-dateutil 2.8.2
python-jsonrpc-server 0.3.4
python-magic 0.4.24
pytz 2021.3
PyWavelets 1.2.0
PyYAML 5.4.1
pyzmq 22.3.0
requests 2.27.1
rfc3986 2.0.0
ruamel.yaml 0.17.20
ruamel.yaml.clib 0.2.6
s3transfer 0.5.0
scikit-image 0.19.1
scipy 1.6.1
setuptools 59.5.0
setuptools-scm 6.3.2
six 1.16.0
snooty-lextudio 1.11.6
sparse 0.13.0
spglib 1.16.3
SQLAlchemy 1.4.29
stevedore 1.32.0
symengine 0.7.2
sympy 1.8
tableschema 1.20.2
tabulate 0.8.9
tabulator 1.53.5
tifffile 2021.11.2
tinydb 4.5.2
toml 0.10.2
tomli 2.0.0
toolz 0.11.2
tornado 6.1
tqdm 4.62.3
traitlets 5.1.1
traits 6.3.2
typing-extensions 3.10.0.2
ujson 1.35
unicodecsv 0.14.1
urllib3 1.26.8
watchdog 1.0.2
wcwidth 0.2.5
wheel 0.37.0
xarray 0.20.2
xlrd 2.0.1
xmltodict 0.12.0
zipp 3.7.0
As discussed in #38 (comment), we should use GitHub Actions to set up automatic CI/CD, code coverage, and documentation generation.
We need to port all of the parsers available in Connect over to the MDF:
The MDF DFT parser showcases the group operation and leveraging external libraries (i.e., dfttopif).
When done, remove the group TODO from Read the Docs.
The MaRDA automated extractors group calls tools like ours "extractors," so let's go that way.
Currently, some tools are named "extractors" and others "parsers," while the documentation uses the term "parser."
Based on our group conversations, the team seems to prefer extract/extractors to parse/parsers. Let's change MaterialsIO to reflect that. @jgaff
We only use MDF Toolbox for 5 functions, and its main developer is no longer active.
Our parser summary is pretty ugly. We should:
We'll probably need to write some code that injects this information into the rst documentation before building the docs. I bet Sphinx has a clever trick for that, or we could modify the Makefile.
We haven't quite described how to integrate MaterialsIO into other applications. See the stub in our docs:
To facilitate easier installation, materials_io should be published on PyPI, meaning it would then be possible to install via pip install materials_io rather than having to clone the repo and install using poetry or in editable mode.
We need more details on how to document new parsers in the documentation.
Describe a definition for "group" within the documentation.
We have agreed that a "file group" is defined as "a set of files that can be parsed together by a parser that, when treated together, can be used to create better metadata."
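To make the definition concrete, here is a hypothetical illustration (the grouping rule and function name are invented, not from the codebase) in which files sharing a base name, such as an input/output pair from the same run, form one file group:

```python
import os
from collections import defaultdict

def group_by_stem(files):
    """Group files that share a base name (e.g. 'run1.in' with 'run1.out'),
    since parsing such files together can yield better metadata."""
    groups = defaultdict(list)
    for path in files:
        stem, _ext = os.path.splitext(path)
        groups[stem].append(path)
    return [sorted(group) for group in groups.values()]
```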
Add this documentation for the parse and group methods.
For all of the existing parsers, add the following information:
Writing tools that use parsers would be much easier with example output data.
Is this something we could automate with Sphinx and make available through the readthedocs?
We do not currently define any practice for distinguishing exceptions that are due to incompatible files from those due to other problems. There could be cases where exceptions due to incompatible files should be ignored, while those due to other problems should halt operation of a parsing program.
One option for differentiating these types of errors would be providing a special exception type for "this file is incompatible."
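As a sketch of that option (the class and helper names are hypothetical), a dedicated exception type lets a driver skip incompatible files while letting genuine failures propagate:

```python
class IncompatibleFileError(Exception):
    """Raised when a file is simply not the kind this parser handles."""

def parse_ignoring_incompatible(parser, files):
    """Skip files the parser declares incompatible; any other exception
    propagates and halts the parsing program."""
    results = []
    for path in files:
        try:
            results.append(parser.parse([path]))
        except IncompatibleFileError:
            continue  # expected for mismatched files; keep going
    return results
```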
This parser will be used to index data created by the CATS EFRC.
We should make it easy to list libraries that hold other MaterialsIO-compatible parsers.
test_file.py is failing on a system with a more up-to-date libmagic due to a change in the output for a given file.
(on my more up-to-date system)
$ file --version
file-5.41
magic file from /usr/share/file/misc/magic
seccomp support included
$ file ./MaterialsIO/tests/data/image/dog2.jpeg
./MaterialsIO/tests/data/image/dog2.jpeg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [\012- TIFF image data, little-endian, direntries=2, GPS-Data], baseline, precision 8, 1910x1000, components 3
(on an older Debian system):
$ file --version
file-5.32
magic file from /etc/magic:/usr/share/misc/magic
$ file MaterialsIO/tests/data/image/dog2.jpeg
./MaterialsIO/tests/data/image/dog2.jpeg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=2, GPS-Data], baseline, precision 8, 1910x1000, frames 3
This causes the test suite to fail on the more up-to-date system. Notice in particular the \012- added to the output (which, according to the man page, indicates that multiple flags were present) and also the difference between "components" and "frames" at the end.
Since the specific version of libmagic cannot be mandated (because it is a system package), my suggestion would be to loosen the requirement on the data_type assertion in the test suite. Something like the following:
for i in ['JPEG image data', 'density 300x300', 'TIFF image data', '1910x1000']:
    assert i in output['data_type']
The following test method works on my system:
def test_file():
    my_file = os.path.join(os.path.dirname(__file__), 'data', 'image', 'dog2.jpeg')
    parser = GenericFileParser(store_path=True, compute_hash=True)
    output = parser.parse([my_file])
    expected = {
        'mime_type': 'image/jpeg',
        'length': 269360,
        'filename': 'dog2.jpeg',
        'path': my_file,
        'data_type': 'JPEG image data, JFIF standard 1.01, resolution (DPI), '
                     'density 300x300, segment length 16, Exif Standard: [TIFF '
                     'image data, little-endian, direntries=2, GPS-Data], '
                     'baseline, precision 8, 1910x1000, frames 3',
        'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c'
                  '9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf27'
                  '9d281270211cff8f90'}
    # data_type varies across libmagic versions, so check key substrings
    # and compare everything else exactly
    for i in ['JPEG image data', 'density 300x300', 'TIFF image data',
              '1910x1000']:
        assert i in output['data_type']
    del output['data_type']
    del expected['data_type']
    assert output == expected
    assert isinstance(parser.schema, dict)
Just running a parser on all files in a path requires:
metadata = [parser.parse(x) for x in parser.group('.') if parser.is_valid(x)]
We could simplify that into a single function call, which should also leverage pipelining (e.g., via generators) and, ideally, multiprocessing.
Then remove the TODO from Read the Docs.
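A generator-based sketch of such a helper (run_all is a hypothetical name; the final generator could be swapped for a multiprocessing.Pool to add parallelism, at the cost of requiring picklable parsers):

```python
def run_all(parser, directory):
    """One-call wrapper around group/is_valid/parse. Generators keep the
    pipeline lazy, so metadata is produced as groups are discovered."""
    valid_groups = (g for g in parser.group(directory) if parser.is_valid(g))
    return (parser.parse(g) for g in valid_groups)
```

A consumer would simply iterate the result, e.g. `for metadata in run_all(parser, '.')`.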
Back in our original discussions, we identified a potential need for MaterialsIO: users should be able to change their versions of parsers individually. Implementing several parsers in this package seems likely to complicate that feature, as the versions of each of the parsers are interleaved with one another.
Should we break the complicated parsers out into separate libraries before making this widely available?
As discussed in #37, we should remove setup.py and any references to it. We should also more thoroughly document the use of Poetry in the contributor and user guides.
We currently define the parse operation to operate on a single logical group of files. However, all of our implementations take a list of files and generate data for each file individually - a task I generalized in BaseSingleFileParser.
Should we instead enforce that files must always be treated as a group? If you have a group of files that are all self-contained, you would run parse for each file rather than on a list of the entire group.
In the last line below, should "parse_files" be "parse"?
The main operation for any parser is the data extraction operation: parse.
In most cases, the parse operation takes the path to a file and returns a summary of the data the file holds:
metadata = parser.parse_files(['/my/file'])
Add this concept to the documentation
The error message from python-magic that occurs when the underlying libmagic library is not installed is confusing to users. The message "failed to find libmagic" means that users need to install the libmagic C library, but users interpret it as needing to pip install libmagic. We need to make this clearer.
We should also allow GenericFileParser to still work without libmagic, as that library is not compatible with Windows and is not installed by default on some Mac/Linux systems.
Would it be helpful to define a set of expected file extensions or filename patterns that a parser may be best matched against?
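If we go that way, one hypothetical shape (the pattern list and class name are invented for illustration) is a declared pattern set that a cheap is_valid can check before any expensive parsing:

```python
import fnmatch
import os

class PatternMatchedParser:
    """A parser declares the filename patterns it is best matched
    against; is_valid then rejects mismatched groups cheaply."""
    patterns = ['*.jpg', '*.jpeg']  # illustrative example patterns

    def is_valid(self, group):
        # Accept only when every file matches at least one pattern
        return all(
            any(fnmatch.fnmatch(os.path.basename(f), p)
                for p in self.patterns)
            for f in group)
```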
The problem is that the DFT parser often groups an entire directory into one group. In the case of distributed data processing, however, it becomes difficult to treat large directories as atomic structures that must be processed as a whole.
The problem directory I'm dealing with in the MDF data is at "/MDF/mdf_connect/prod/data/h2o_13_v1-1/split_xyz_files/watergrid_60_HOH_180__0.7_rOH_1.8_vario_PBE0_AV5Z_delta_PS_data" (on Petrel).
In this directory (at the time of this writing) there are 6,900 files, and the DFT grouping for this directory yields just 1 group containing all 6,900 files.
I've spoken with @WardLT, and we've discussed that this issue is non-urgent.
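One possible mitigation (an assumption on my part, not a decided design) would be capping group sizes so distributed workers can take a large directory in batches:

```python
def chunk_group(group, max_size=500):
    """Split one oversized file group into batches of at most max_size
    files; the 500-file cap here is an arbitrary illustrative choice."""
    for start in range(0, len(group), max_size):
        yield group[start:start + max_size]
```

For the 6,900-file case above, a 500-file cap would yield 14 batches, though whether the DFT metadata stays meaningful when a group is split would need checking.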