materials-data-facility / scythe
An extensible library of tools that extract metadata from scientific files
License: Apache License 2.0
RDKit includes this functionality but requires a conda install.
https://rdkit.org
I installed MaterialsIO using poetry (see https://github.com/jat255/MaterialsIO/tree/poetry_compat), and I'm getting test failures on the crystal structure parser. It appears the issue is the ASE adapter, as it results in an incorrect interpretation (at least according to test_crystal_structure.py). Here are the test failures:
> assert isclose(output['crystal_structure']['number_of_atoms'], 5070.0)
E assert False
E + where False = isclose(4656.0, 5070.0)
> assert output == {'material': {'composition': 'Co270H1680C1872N324O924'},
'crystal_structure': {'space_group_number': 146,
'stoichiometry': 'A45B54C154D280E312'}}
E AssertionError: assert {'crystal_str...728N288O864'}} == {'crystal_stru...872N324O924'}}
E Differing items:
E {'material': {'composition': 'Co240H1536C1728N288O864'}} != {'material': {'composition': 'Co270H1680C1872N324O924'}}
E {'crystal_structure': {'space_group_number': 210, 'stoichiometry': 'A5B6C18D32E36'}} != {'crystal_structure': {'space_group_number': 146, 'stoichiometry': 'A45B54C154D280E312'}}
E Full diff:
E - {'crystal_structure': {'space_group_number': 210,
E ? - ^
E + {'crystal_structure': {'space_group_number': 146,
E ? ^^
E - 'stoichiometry': 'A5B6C18D32E36'},
E ? ^ ^^^ ^
E + 'stoichiometry': 'A45B54C154D280E312'},
E ? + ^^ ++++ ^ ^^
E - 'material': {'composition': 'Co240H1536C1728N288O864'}}
E ? ^ -- - ^^ ^^
E + 'material': {'composition': 'Co270H1680C1872N324O924'}}
E ? ^ ++ + + ^ ^^
Here is the list of packages installed in my environment. I tried to follow the package versions in requirements.txt and setup.py:
Package Version
--------------------- -----------
ase 3.22.1
atomicwrites 1.4.0
attrs 21.4.0
backcall 0.2.0
boto3 1.20.35
botocore 1.23.35
cached-property 1.5.2
certifi 2021.10.8
cffi 1.15.0
chardet 4.0.0
charset-normalizer 2.0.10
click 8.0.3
cloudpickle 2.0.0
coverage 6.2
coveralls 3.3.1
cryptography 36.0.1
cycler 0.11.0
Cython 0.29.26
dask 2021.12.0
debugpy 1.5.1
decorator 5.1.1
dftparse 0.3.0
dfttopif 1.1.0
dill 0.3.4
dnspython 2.2.0
docopt 0.6.2
docutils 0.17.1
entrypoints 0.3
et-xmlfile 1.1.0
fair-research-login 0.2.6
fett 0.3.2
flake8 4.0.1
fsspec 2022.1.0
globus-nexus-client 0.4.1
globus-sdk 3.2.1
greenlet 1.1.2
h5py 3.6.0
hyperspy 1.6.5
idna 3.3
ijson 3.1.4
imageio 2.9.0
importlib-metadata 4.2.0
importlib-resources 5.4.0
ipykernel 6.7.0
ipyparallel 8.1.0
ipython 7.31.0
isodate 0.6.1
jedi 0.18.1
jmespath 0.10.0
jsonlines 3.0.0
jsonschema 4.4.0
jupyter-client 7.1.0
jupyter-core 4.9.1
kiwisolver 1.3.2
linear-tsv 1.1.0
llvmlite 0.34.0
locket 0.2.1
materials-io 0.0.1
matplotlib 3.4.3
matplotlib-inline 0.1.3
mccabe 0.6.1
mdf-toolbox 0.5.11
monty 2022.1.12.1
more-itertools 8.12.0
mpmath 1.2.1
natsort 8.0.2
nest-asyncio 1.5.4
networkx 2.6.3
numba 0.51.2
numexpr 2.8.1
numpy 1.21.1
openpyxl 3.0.9
packaging 21.3
palettable 3.3.0
pandas 1.1.5
parso 0.8.3
partd 1.2.0
pbr 5.8.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.2.0
Pint 0.18
pip 21.3.1
pluggy 1.0.0
prettytable 3.0.0
prompt-toolkit 3.0.24
psutil 5.9.0
ptyprocess 0.7.0
py 1.11.0
pycalphad 0.9.2
pycodestyle 2.8.0
pycparser 2.21
PyDispatcher 2.0.5
pyflakes 2.4.0
Pygments 2.11.2
PyJWT 2.3.0
pymatgen 2018.12.12
pymongo 3.12.3
pyparsing 3.0.6
pypif 2.1.2
pyrsistent 0.18.0
pytest 3.10.1
pytest-cov 2.9.0
python-dateutil 2.8.2
python-jsonrpc-server 0.3.4
python-magic 0.4.24
pytz 2021.3
PyWavelets 1.2.0
PyYAML 5.4.1
pyzmq 22.3.0
requests 2.27.1
rfc3986 2.0.0
ruamel.yaml 0.17.20
ruamel.yaml.clib 0.2.6
s3transfer 0.5.0
scikit-image 0.19.1
scipy 1.6.1
setuptools 59.5.0
setuptools-scm 6.3.2
six 1.16.0
snooty-lextudio 1.11.6
sparse 0.13.0
spglib 1.16.3
SQLAlchemy 1.4.29
stevedore 1.32.0
symengine 0.7.2
sympy 1.8
tableschema 1.20.2
tabulate 0.8.9
tabulator 1.53.5
tifffile 2021.11.2
tinydb 4.5.2
toml 0.10.2
tomli 2.0.0
toolz 0.11.2
tornado 6.1
tqdm 4.62.3
traitlets 5.1.1
traits 6.3.2
typing-extensions 3.10.0.2
ujson 1.35
unicodecsv 0.14.1
urllib3 1.26.8
watchdog 1.0.2
wcwidth 0.2.5
wheel 0.37.0
xarray 0.20.2
xlrd 2.0.1
xmltodict 0.12.0
zipp 3.7.0
As discussed in #38 (comment), we should use GitHub Actions to set up automatic CI/CD, code coverage, and documentation generation.
We need to port all of the parsers available in Connect over to the MDF:
The MDF DFT parser showcases the group operation and leveraging external libraries (i.e., dfttopif).
When done, remove the group TODO from Read the Docs.
The MaRDA automated extractors group calls tools like ours "extractors," so let's go that way.
Currently, some tools are named "extractors" and others "parsers," while the documentation uses the term "parser."
Based on our group conversations, the team seems to prefer extract/extractors to parse/parsers. Let's change MaterialsIO to reflect that. @jgaff
We only use MDF Toolbox for 5 functions, and its main developer is no longer active.
Our parser summary is pretty ugly. We should:
We'll probably need to write some code that injects this information into the rst documentation before building the docs. I bet Sphinx has a clever trick for that, or we could modify the Makefile.
We haven't quite described how to integrate MaterialsIO into other applications. See the stub in our docs:
To facilitate easier installation, materials_io should be published on PyPI, meaning it would then be possible to install via pip install materials_io rather than having to clone the repo and install using poetry or in editable mode.
We need more details on how to document new parsers in the documentation.
Describe a definition for "group" within the documentation.
We have agreed that a "file group" is defined as "a set of files that can be parsed together by a parser that, when treated together, can be used to create better metadata."
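To make the definition concrete, here is a hypothetical illustration (the grouping rule and function name are invented, not from the codebase) in which files sharing a base name, such as an input/output pair from the same run, form one file group:

```python
import os
from collections import defaultdict

def group_by_stem(files):
    """Group files that share a base name (e.g. 'run1.in' with 'run1.out'),
    since parsing such files together can yield better metadata."""
    groups = defaultdict(list)
    for path in files:
        stem, _ext = os.path.splitext(path)
        groups[stem].append(path)
    return [sorted(group) for group in groups.values()]
```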
Add this documentation for the parse and group methods.
For all of the existing parsers, add the following information:
Writing tools that use parsers would be much easier with example output data.
Is this something we could automate with Sphinx and make available through the readthedocs?
We do not currently define any practice for distinguishing exceptions that are due to incompatible files from those due to other problems. There could be cases where exceptions due to incompatible files should be ignored, while those due to other problems should halt operation of a parsing program.
One option for differentiating these types of errors would be providing a special exception type for "this file is incompatible."
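As a sketch of that option (the class and helper names are hypothetical), a dedicated exception type lets a driver skip incompatible files while letting genuine failures propagate:

```python
class IncompatibleFileError(Exception):
    """Raised when a file is simply not the kind this parser handles."""

def parse_ignoring_incompatible(parser, files):
    """Skip files the parser declares incompatible; any other exception
    propagates and halts the parsing program."""
    results = []
    for path in files:
        try:
            results.append(parser.parse([path]))
        except IncompatibleFileError:
            continue  # expected for mismatched files; keep going
    return results
```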
This parser will be used to index data created by the CATS EFRC.
We should make it easy to list libraries that hold other MaterialsIO-compatible parsers.
test_file.py is failing on a system with a more up-to-date libmagic due to a change in the output for a given file.
(on my more up-to-date system)
$ file --version
file-5.41
magic file from /usr/share/file/misc/magic
seccomp support included
$ file ./MaterialsIO/tests/data/image/dog2.jpeg
./MaterialsIO/tests/data/image/dog2.jpeg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [\012- TIFF image data, little-endian, direntries=2, GPS-Data], baseline, precision 8, 1910x1000, components 3
(on an older Debian system):
$ file --version
file-5.32
magic file from /etc/magic:/usr/share/misc/magic
$ file MaterialsIO/tests/data/image/dog2.jpeg
./MaterialsIO/tests/data/image/dog2.jpeg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 300x300, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=2, GPS-Data], baseline, precision 8, 1910x1000, frames 3
This causes the test suite to fail on the more up-to-date system. Notice in particular the \012- added to the output (which, according to the man page, indicates that multiple flags were present) and also the difference between "components" and "frames" at the end.
Since the specific version of libmagic cannot be mandated (because it is a system package), my suggestion would be to loosen the requirement on the data_type assertion in the test suite. Something like the following:
for i in ['JPEG image data', 'density 300x300', 'TIFF image data', '1910x1000']:
    assert i in output['data_type']
The following test method works on my system:
def test_file():
    my_file = os.path.join(os.path.dirname(__file__), 'data', 'image', 'dog2.jpeg')
    parser = GenericFileParser(store_path=True, compute_hash=True)
    output = parser.parse([my_file])
    expected = {
        'mime_type': 'image/jpeg',
        'length': 269360,
        'filename': 'dog2.jpeg',
        'path': my_file,
        'data_type': 'JPEG image data, JFIF standard 1.01, resolution (DPI), '
                     'density 300x300, segment length 16, Exif Standard: [TIFF '
                     'image data, little-endian, direntries=2, GPS-Data], '
                     'baseline, precision 8, 1910x1000, frames 3',
        'sha512': '1f47ed450ad23e92caf1a0e5307e2af9b13edcd7735ac9685c9f21c'
                  '9faec62cb95892e890a73480b06189ed5b842d8b265c5e47cc6cf27'
                  '9d281270211cff8f90'}
    # data_type varies across libmagic versions, so check key substrings
    # and compare everything else exactly
    for i in ['JPEG image data', 'density 300x300', 'TIFF image data',
              '1910x1000']:
        assert i in output['data_type']
    del output['data_type']
    del expected['data_type']
    assert output == expected
    assert isinstance(parser.schema, dict)
Just running a parser on all files in a path requires:
metadata = [parser.parse(x) for x in parser.group('.') if parser.is_valid(x)]
We could simplify that into a single function call, which should also leverage pipelining (e.g., via generators) and, ideally, multiprocessing.
Then remove the TODO from Read the Docs.
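A generator-based sketch of such a helper (run_all is a hypothetical name; the final generator could be swapped for a multiprocessing.Pool to add parallelism, at the cost of requiring picklable parsers):

```python
def run_all(parser, directory):
    """One-call wrapper around group/is_valid/parse. Generators keep the
    pipeline lazy, so metadata is produced as groups are discovered."""
    valid_groups = (g for g in parser.group(directory) if parser.is_valid(g))
    return (parser.parse(g) for g in valid_groups)
```

A consumer would simply iterate the result, e.g. `for metadata in run_all(parser, '.')`.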
Back in our original discussions, we identified a potential need for MaterialsIO: users should be able to change their versions of parsers individually. Implementing several parsers in this package seems likely to complicate that feature, as the versions of each of the parsers are interleaved with one another.
Should we break the complicated parsers out into separate libraries before making this widely available?
As discussed in #37, we should remove setup.py and any references to it. We should also more thoroughly document the use of Poetry in the contributor and user guides.
We currently define the parse operation to operate on a single logical group of files. However, all of our implementations take a list of files and generate data for each file individually - a task I generalized in BaseSingleFileParser.
Should we instead enforce that files must always be treated as a group? If you have a group of files that are all self-contained, you would run parse for each file rather than on a list of the entire group.
In the last line below, should "parse_files" be "parse"?
The main operation for any parser is the data extraction operation: parse.
In most cases, the parse operation takes the path to a file and returns a summary of the data the file holds:
metadata = parser.parse_files(['/my/file'])
Add this concept to the documentation
The error message from python-magic that occurs when the underlying libmagic library is not installed is confusing to users. The message "failed to find libmagic" means that users need to install the libmagic C library, but users interpret it as needing to pip install libmagic. We need to make this clearer.
We should also allow GenericFileParser to still work without libmagic, as that library is not compatible with Windows and is not installed by default on some Mac/Linux systems.
Would it be helpful to define a set of expected file extensions or filename patterns that a parser may be best matched against?
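If we go that way, one hypothetical shape (the pattern list and class name are invented for illustration) is a declared pattern set that a cheap is_valid can check before any expensive parsing:

```python
import fnmatch
import os

class PatternMatchedParser:
    """A parser declares the filename patterns it is best matched
    against; is_valid then rejects mismatched groups cheaply."""
    patterns = ['*.jpg', '*.jpeg']  # illustrative example patterns

    def is_valid(self, group):
        # Accept only when every file matches at least one pattern
        return all(
            any(fnmatch.fnmatch(os.path.basename(f), p)
                for p in self.patterns)
            for f in group)
```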
The problem is that the DFT parser often groups an entire directory into one group. In the case of distributed data processing, however, it becomes difficult to treat large directories as atomic structures that must be processed as a whole.
The problem directory I'm dealing with in the MDF data is at "/MDF/mdf_connect/prod/data/h2o_13_v1-1/split_xyz_files/watergrid_60_HOH_180__0.7_rOH_1.8_vario_PBE0_AV5Z_delta_PS_data" (on Petrel).
In this directory (at the time of this writing) there are 6,900 files, and the DFT grouping for this directory yields just 1 group containing all 6,900 files.
I've spoken with @WardLT, and we've discussed that this issue is non-urgent.
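One possible mitigation (an assumption on my part, not a decided design) would be capping group sizes so distributed workers can take a large directory in batches:

```python
def chunk_group(group, max_size=500):
    """Split one oversized file group into batches of at most max_size
    files; the 500-file cap here is an arbitrary illustrative choice."""
    for start in range(0, len(group), max_size):
        yield group[start:start + max_size]
```

For the 6,900-file case above, a 500-file cap would yield 14 batches, though whether the DFT metadata stays meaningful when a group is split would need checking.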