
scythe's People

Contributors

blaiszik, ianfoster, jat255, jgaff, maxtuecke, tskluzac, wardlt


scythe's Issues

crystal_structure cif test not passing

I installed MaterialsIO using poetry (see https://github.com/jat255/MaterialsIO/tree/poetry_compat), and I'm getting test failures on the crystal structure parser. It appears the issue is the ASE adapter, as it results in an incorrect interpretation (at least according to test_crystal_structure.py). Here are the test failures:

>       assert isclose(output['crystal_structure']['number_of_atoms'], 5070.0)
E       assert False
E        +  where False = isclose(4656.0, 5070.0)

>       assert output == {'material': {'composition': 'Co270H1680C1872N324O924'},
                          'crystal_structure': {'space_group_number': 146,
                                                'stoichiometry': 'A45B54C154D280E312'}}
E       AssertionError: assert {'crystal_str...728N288O864'}} == {'crystal_stru...872N324O924'}}
E         Differing items:
E         {'material': {'composition': 'Co240H1536C1728N288O864'}} != {'material': {'composition': 'Co270H1680C1872N324O924'}}
E         {'crystal_structure': {'space_group_number': 210, 'stoichiometry': 'A5B6C18D32E36'}} != {'crystal_structure': {'space_group_number': 146, 'stoichiometry': 'A45B54C154D280E312'}}
E         Full diff:
E         - {'crystal_structure': {'space_group_number': 210,
E         ?                                              - ^
E         + {'crystal_structure': {'space_group_number': 146,
E         ?                                               ^^
E         -                        'stoichiometry': 'A5B6C18D32E36'},
E         ?                                             ^   ^^^  ^
E         +                        'stoichiometry': 'A45B54C154D280E312'},
E         ?                                           +  ^^  ++++ ^  ^^
E         -  'material': {'composition': 'Co240H1536C1728N288O864'}}
E         ?                                  ^   --     -  ^^ ^^
E         +  'material': {'composition': 'Co270H1680C1872N324O924'}}
E         ?                                  ^    ++  +   + ^ ^^

Here is the list of packages installed in my environment. I tried to follow the package versions in requirements.txt and setup.py:

Output of "pip list":
Package               Version
--------------------- -----------
ase                   3.22.1
atomicwrites          1.4.0
attrs                 21.4.0
backcall              0.2.0
boto3                 1.20.35
botocore              1.23.35
cached-property       1.5.2
certifi               2021.10.8
cffi                  1.15.0
chardet               4.0.0
charset-normalizer    2.0.10
click                 8.0.3
cloudpickle           2.0.0
coverage              6.2
coveralls             3.3.1
cryptography          36.0.1
cycler                0.11.0
Cython                0.29.26
dask                  2021.12.0
debugpy               1.5.1
decorator             5.1.1
dftparse              0.3.0
dfttopif              1.1.0
dill                  0.3.4
dnspython             2.2.0
docopt                0.6.2
docutils              0.17.1
entrypoints           0.3
et-xmlfile            1.1.0
fair-research-login   0.2.6
fett                  0.3.2
flake8                4.0.1
fsspec                2022.1.0
globus-nexus-client   0.4.1
globus-sdk            3.2.1
greenlet              1.1.2
h5py                  3.6.0
hyperspy              1.6.5
idna                  3.3
ijson                 3.1.4
imageio               2.9.0
importlib-metadata    4.2.0
importlib-resources   5.4.0
ipykernel             6.7.0
ipyparallel           8.1.0
ipython               7.31.0
isodate               0.6.1
jedi                  0.18.1
jmespath              0.10.0
jsonlines             3.0.0
jsonschema            4.4.0
jupyter-client        7.1.0
jupyter-core          4.9.1
kiwisolver            1.3.2
linear-tsv            1.1.0
llvmlite              0.34.0
locket                0.2.1
materials-io          0.0.1
matplotlib            3.4.3
matplotlib-inline     0.1.3
mccabe                0.6.1
mdf-toolbox           0.5.11
monty                 2022.1.12.1
more-itertools        8.12.0
mpmath                1.2.1
natsort               8.0.2
nest-asyncio          1.5.4
networkx              2.6.3
numba                 0.51.2
numexpr               2.8.1
numpy                 1.21.1
openpyxl              3.0.9
packaging             21.3
palettable            3.3.0
pandas                1.1.5
parso                 0.8.3
partd                 1.2.0
pbr                   5.8.0
pexpect               4.8.0
pickleshare           0.7.5
Pillow                7.2.0
Pint                  0.18
pip                   21.3.1
pluggy                1.0.0
prettytable           3.0.0
prompt-toolkit        3.0.24
psutil                5.9.0
ptyprocess            0.7.0
py                    1.11.0
pycalphad             0.9.2
pycodestyle           2.8.0
pycparser             2.21
PyDispatcher          2.0.5
pyflakes              2.4.0
Pygments              2.11.2
PyJWT                 2.3.0
pymatgen              2018.12.12
pymongo               3.12.3
pyparsing             3.0.6
pypif                 2.1.2
pyrsistent            0.18.0
pytest                3.10.1
pytest-cov            2.9.0
python-dateutil       2.8.2
python-jsonrpc-server 0.3.4
python-magic          0.4.24
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                5.4.1
pyzmq                 22.3.0
requests              2.27.1
rfc3986               2.0.0
ruamel.yaml           0.17.20
ruamel.yaml.clib      0.2.6
s3transfer            0.5.0
scikit-image          0.19.1
scipy                 1.6.1
setuptools            59.5.0
setuptools-scm        6.3.2
six                   1.16.0
snooty-lextudio       1.11.6
sparse                0.13.0
spglib                1.16.3
SQLAlchemy            1.4.29
stevedore             1.32.0
symengine             0.7.2
sympy                 1.8
tableschema           1.20.2
tabulate              0.8.9
tabulator             1.53.5
tifffile              2021.11.2
tinydb                4.5.2
toml                  0.10.2
tomli                 2.0.0
toolz                 0.11.2
tornado               6.1
tqdm                  4.62.3
traitlets             5.1.1
traits                6.3.2
typing-extensions     3.10.0.2
ujson                 1.35
unicodecsv            0.14.1
urllib3               1.26.8
watchdog              1.0.2
wcwidth               0.2.5
wheel                 0.37.0
xarray                0.20.2
xlrd                  2.0.1
xmltodict             0.12.0
zipp                  3.7.0

Add DFT Parser

The MDF DFT parser showcases the group operation and the use of external libraries (i.e., dfttopif).

When done, remove group TODO from read-the-docs

Consistently name tools "extractors"

The MaRDA automated extractors group calls tools like ours "extractors," so let's adopt that term.

Some tools are currently named "extractors" and others "parsers," while the documentation uses the term "parser" throughout.

Rename Parsers to Extractors

Based on our group conversations, the team seems to prefer extract/extractors to parse/parsers. Let's change MaterialsIO to reflect that. @jgaff

Improve Parser Summary Page

Our parser summary is pretty ugly. We should:

  • Provide richer descriptions of the purpose and output of each parser
    • First describe the existing parsers in greater detail
  • Provide quick access to the output schemas for a full description
    • First implement the schemas for existing parsers
  • Display the implementors and version.

We'll probably need to write some code that injects this information into the rst documentation before building the docs. I bet Sphinx has a clever trick for that, or we could modify the Makefile.
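As a starting point, the injection step could be as simple as a script the Makefile runs before `sphinx-build`. The sketch below is not existing MaterialsIO code, and the metadata field names (`name`, `description`, `version`, `implementors`) are assumptions about what each parser could expose:

```python
# Sketch: render an rst summary section per parser record, to be written
# into the docs tree before building. Field names are hypothetical.

def render_parser_summary(parsers):
    """Render a reStructuredText summary for a list of parser records."""
    lines = []
    for p in parsers:
        lines.append(p["name"])
        lines.append("-" * len(p["name"]))  # rst section underline
        lines.append("")
        lines.append(p["description"])
        lines.append("")
        lines.append(":Version: " + p["version"])
        lines.append(":Implementors: " + ", ".join(p["implementors"]))
        lines.append("")
    return "\n".join(lines)
```

The output could be written to a generated `.rst` file that the summary page then pulls in with an `.. include::` directive.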

Describe Adapter/Parser Model

We haven't quite described how to integrate MaterialsIO into other applications. See the stub in our docs:

  • Implement a parser/adapter pair (DFT, perhaps?)
  • Describe the approach in the documentation
  • Point to a project that uses MaterialsIO in their pipeline (MDF, perhaps?)
  • Make a schematic for how it is deployed in the MDF
  • Remove TODO from read-the-docs

Package should be published on PyPI

To make installation easier, materials_io should be published on PyPI. It could then be installed via
pip install materials_io rather than by cloning the repo and installing with poetry or in editable mode.

Define What a Group Is

Describe a definition for "group" within the documentation.

We have agreed that a "file group" is defined as "a set of files that can be parsed together by a parser that, when treated together, can be used to create better metadata."

Add this documentation:

  • User documentation
  • Source code documentation within the parse and group methods
  • Contributor documentation

Improve Documentation

For all of the existing parsers, add the following information:

  • Schema describing the output format
  • Types of files that the model can run on

Make Example Outputs for Parsers

Writing tools that use parsers would be much easier with example output data.

Is this something we could automate with Sphinx and make available through the readthedocs?

More Descriptive Exception Types

We do not currently define any convention for distinguishing exceptions caused by incompatible files from those caused by other problems. There could be cases where exceptions due to incompatible files should be ignored, while those due to other problems should halt operation of a parsing program.

One option for differentiating these types of errors would be to provide a special exception type meaning "this file is incompatible."
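A minimal sketch of that idea (the names here are hypothetical, not part of the current MaterialsIO API): a dedicated exception type lets calling code skip incompatible files while real failures still propagate.

```python
class UnsupportedFileError(Exception):
    """Raised when a parser cannot handle the supplied file(s)."""


def parse_all(parser, groups):
    """Parse every group, skipping those the parser cannot handle."""
    results = []
    for group in groups:
        try:
            results.append(parser.parse(group))
        except UnsupportedFileError:
            continue  # incompatible file: safe to ignore
        # Any other exception propagates and halts the run,
        # as it signals a real problem rather than a bad input.
    return results
```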

Create a Public Repository for Compliant Parsers

We should make it easy to list libraries that hold other MaterialsIO-compatible parsers.

  • Find another target library that is a good resource
  • Figure out how to host the library (perhaps GH-pages, or read-the-docs?)
  • Remove TODO from read the docs

`test_file.py` tests fail dependent on version of `libmagic` installed

test_file.py is failing on a system with a more up to date libmagic due to a change in the output for a given file.

(on my more up-to-date system)

$ file --version
file-5.41
magic file from /usr/share/file/misc/magic
seccomp support included

$ file ./MaterialsIO/tests/data/image/dog2.jpeg
./MaterialsIO/tests/data/image/dog2.jpeg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [\012- TIFF image data, little-endian, direntries=2, GPS-Data], baseline, precision 8, 1910x1000, components 3

(on an older Debian system):

$ file --version
file-5.32
magic file from /etc/magic:/usr/share/misc/magic

$ file MaterialsIO/tests/data/image/dog2.jpeg
./MaterialsIO/tests/data/image/dog2.jpeg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=2, GPS-Data], baseline, precision 8, 1910x1000, frames 3

This causes the test suite to fail on the more up-to-date system. Notice in particular the \012- added to the output (which, according to the man page, indicates that multiple flags were present) and the difference between "components" and "frames" at the end.

Since a specific version of libmagic cannot be mandated (because it's a system package), my suggestion would be to loosen the data_type assertion in the test suite. Something like the following:

for i in ['JPEG image data', 'density 300x300', 'TIFF image data', '1910x1000']:
    assert i in output['data_type']

The following test method works on my system:

def test_file():
    my_file = os.path.join(os.path.dirname(__file__), 'data', 'image', 'dog2.jpeg')
    parser = GenericFileParser(store_path=True, compute_hash=True)
    output = parser.parse([my_file])
    expected = {
        'mime_type': 'image/jpeg',
        'length': 269360,
        'filename': 'dog2.jpeg',
        'path': my_file,
        'data_type': 'JPEG image data, JFIF standard 1.01, resolution (DPI), '
                     'density 300x300, segment length 16, Exif Standard: [TIFF '
                     'image data, little-endian, direntries=2, GPS-Data], '
                     'baseline, precision 8, 1910x1000, frames 3',
        'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c'
                  '9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf27'
                  '9d281270211cff8f90'}
    for i in ['JPEG image data', 'density 300x300', 'TIFF image data',
              '1910x1000']:
        assert i in output['data_type']
    del output['data_type']
    del expected['data_type']
    assert output == expected
    assert isinstance(parser.schema, dict)

Shortcut for Running "group then parse"

Just running a parser on all files in a path requires:

metadata = [parser.parse(x) for x in parser.group('.') if parser.is_valid(x)]

We could simplify that into a single function call, which should also leverage pipelining (e.g., via generators) and, ideally, multiprocessing.
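One possible shape for that helper (a sketch; `parse_directory` is not an existing MaterialsIO function): a generator keeps the pipeline lazy, so large directory trees are grouped and parsed incrementally rather than held in memory all at once.

```python
def parse_directory(parser, path='.'):
    """Group, filter, and parse all files under `path`, yielding metadata.

    Lazily chains parser.group -> parser.is_valid -> parser.parse,
    mirroring the one-line list comprehension from the issue text.
    """
    for group in parser.group(path):
        if parser.is_valid(group):
            yield parser.parse(group)
```

Usage would then be `metadata = list(parse_directory(parser, '.'))`, and the loop body is an obvious candidate for `multiprocessing.Pool.imap` later.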

Then remove the TODO from read-the-docs

Versioning Individual Parsers

Back in our original discussions, we identified a potential requirement for MaterialsIO: users should be able to change the versions of individual parsers independently. Implementing several parsers in this one package seems likely to complicate that feature, as the versions of the parsers are interleaved with each other.

Should we break the complicated parsers out into separate libraries before making this widely available?

Do We Need BaseSingleFileParser?

We currently define the parse operation to operate on a single logical group of files. However, all of our implementations take a list of files and generate data for each file individually - a task I generalized in BaseSingleFileParser.

Should we instead enforce that files must always be treated as a group? If you have a group of files that are all self-contained, you would then run parse on each file rather than on a list of the entire group.
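For context, the pattern in question looks roughly like this (a simplified sketch, not the real class): the group-oriented `parse` is implemented once by looping over the files, and subclasses only supply the per-file logic.

```python
class BaseSingleFileParser:
    """Base for parsers that treat each file in a group independently."""

    def parse(self, files):
        # The "group" degenerates to a list of unrelated files,
        # each producing its own metadata record.
        return [self._parse_file(f) for f in files]

    def _parse_file(self, path):
        raise NotImplementedError
```

Dropping the base class would push that loop out to every caller, which is the trade-off this issue is weighing.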

user_guide question

In the last line below, should "parse_files" be "parse"?


The main operation for any parser is the data extraction operation: parse.

In most cases, the parse operation takes the path to a file and returns a summary of the data the file holds:

metadata = parser.parse_files(['/my/file'])

Confusing Error Messages from python-magic

The error message from python-magic that occurs when the underlying libmagic library is not installed is confusing to users.

The error message "failed to find libmagic" means that users need to install the libmagic C library, but users interpret it as needing to pip install libmagic. We need to make this clearer.

We should also allow GenericFileParser to still work without libmagic, as that library is not compatible with Windows and not installed by default on some Mac/Linux systems.
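One way to do that (an assumption about a possible fix, not the current code): catch the ImportError once at module load, and simply omit the magic-derived fields when libmagic is absent.

```python
try:
    import magic  # python-magic; requires the libmagic C library
    _MAGIC_AVAILABLE = True
except ImportError:
    # libmagic is missing (common on Windows); degrade gracefully
    # instead of surfacing "failed to find libmagic" to the user.
    _MAGIC_AVAILABLE = False

import os


def describe_file(path):
    """Return basic file metadata, adding libmagic fields if available."""
    record = {
        'filename': os.path.basename(path),
        'length': os.path.getsize(path),
    }
    if _MAGIC_AVAILABLE:
        record['mime_type'] = magic.from_file(path, mime=True)
        record['data_type'] = magic.from_file(path)
    return record
```

GenericFileParser could follow the same pattern, so the size, name, and hash fields still work everywhere while mime_type/data_type become optional.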

DFT parser putting too many files into groups yielding no metadata

The problem is that the DFT parser often groups an entire directory into one group. In the case of distributed data processing, however, it becomes difficult to treat large directories as atomic units that must be processed as a whole.

The problem directory I'm dealing with in the MDF data is at "/MDF/mdf_connect/prod/data/h2o_13_v1-1/split_xyz_files/watergrid_60_HOH_180__0.7_rOH_1.8_vario_PBE0_AV5Z_delta_PS_data" (on Petrel).

In this directory (at the time of this writing) there are 6,900 files, and the DFT grouping yields just 1 group containing all 6,900 of them.

I've spoken with @WardLT, and we've discussed that this issue is non-urgent.
