
xerus's People

Contributors

esparzamiren, ml-evs, oslopanda, pedrobcst, qwe789qwec, xuxurla


xerus's Issues

Add timeout settings to Optimade querier

Currently the OptimadeQuery interface does not use the predefined REQUESTS_TIMEOUT defined in settings.py.
Pass this timeout value to the requests.get call of the OptimadeQuery interface.
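A minimal sketch of the wiring, assuming a small wrapper around requests.get; the timeout value and the `optimade_get` name are illustrative, since the real REQUESTS_TIMEOUT lives in settings.py:

```python
import requests

# Illustrative value; the real constant is REQUESTS_TIMEOUT in settings.py.
REQUESTS_TIMEOUT = 30  # seconds

def optimade_get(url: str, timeout: float = REQUESTS_TIMEOUT) -> dict:
    """Fetch an OPTIMADE endpoint, failing fast instead of hanging forever."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()
```

Passing `timeout` makes requests raise `requests.exceptions.Timeout` instead of blocking indefinitely on an unresponsive provider.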

Update to newest Materials project API

The Materials Project has released a new default interface and API. Currently, an API key from the newest version of the Materials Project does not work in Xerus; only legacy keys work.

TODO:

  • Move to the newest version of the Materials Project API
  • Rewrite the tests for the new API
  • Update pymage (duplicate of #3)

Finish the new testing method

We now have a possible new 'cif' testing method, which is much faster and does not do any refinement. After the changes to how the default 'refinement' is done, and after moving the simulation tests to be handled in place (when the data is being simulated), Xerus became much more stable. The only 'test' still required is whether the GSASII engine can actually parse a CIF coming from a provider. The new test then just tries to open a CIF file with GSASII using a dummy project. This leads to a large increase in the number of usable structures that were previously considered "system breaking".

This has to be implemented before #22.
TODO:

  • Re-run the benchmarks using the new testing method on a clean mock database.
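The parse-only check described above might be sketched as follows; `load_phase` is a hypothetical stand-in for the actual GSAS-II call (e.g. adding the phase to a throwaway project), not Xerus's real API:

```python
from typing import Callable

def cif_is_parseable(cif_path: str, load_phase: Callable[[str], object]) -> bool:
    """Return True if the loader can parse the CIF without raising.

    `load_phase` stands in for the real GSAS-II call; any exception
    marks the CIF as 'system breaking' and it is skipped.
    """
    try:
        load_phase(cif_path)
        return True
    except Exception:
        return False
```

Because no refinement is attempted, this check is cheap enough to run over every downloaded structure.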

Unable to run examples

I am new to the Xerus package, and I have successfully installed it using the provided instructions. I have already set up the MongoDB server and the Materials Project API key.

I ran the tests and noticed that some of them were failing. Upon debugging, I discovered that Xerus still uses deprecated functions from its dependencies, whose newer versions were installed alongside Xerus.

I downgraded the following packages:

  • scipy == 1.7.0
  • optimade == 0.16.0

Doing this caused some tests to pass, but now the test session gets stuck at "test_solvers.py::test_boxauto".

When I try running Examples.ipynb, I get the following error:

FileNotFoundError                         Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18928\2514958583.py in <module>
     15 
     16     # Analyze and store.
---> 17     run.analyze(n_runs="auto")
     18 
     19     # Save objecti nto memory?

~\Xerus\Xerus\__init__.py in analyze(self, n_runs, grabtop, delta, combine_filter, select_cifs, plot_all, ignore_provider, ignore_comb, ignore_ids, solver, group_method, auto_threshold, r_ori, n_jobs)
    545 
    546         # Get the cifs, simulate the patterns, run correlation (first phase)
--> 547         self.get_cifs(
    548             ignore_provider=ignore_provider,
    549             ignore_comb=ignore_comb,

~\Xerus\Xerus\__init__.py in get_cifs(self, ignore_provider, ignore_comb, ignore_ids)
    223         self
    224         """
--> 225         cif_meta, cif_notran, cif_notsim = LocalDB().get_cifs_and_write(
    226             element_list=self.elements,
    227             outfolder=self.working_folder,

~\Xerus\Xerus\db\localdb.py in get_cifs_and_write(self, element_list, name, outfolder, maxn, max_oxy)
    238         final_path = os.path.join(outfolder, folder_to_write)
    239         queries = make_system_types(element_list, maxn)
--> 240         self.check_all(queries, name = name)
    241 
    242         # check oxygen limit

~\Xerus\Xerus\db\localdb.py in check_all(self, system_types, name)
    198             else:
    199                 print("Checking the following combination:{}".format(combination))
--> 200                 self.check_and_download(combination, name = name)
    201         return self
    202 

~\Xerus\Xerus\db\localdb.py in check_and_download(self, system_type, name)
    174         if not self.check_system(system_type):
    175             elements = system_type.split("-")
--> 176             multiquery(elements, max_num_elem=len(elements), name = name)
    177         return self
    178 

~\Xerus\Xerus\queriers\multiquery.py in multiquery(element_list, max_num_elem, name, resync)
    160     # ## UPDATE DB ##
    161     print("Uploading database with cifs..")
--> 162     data = load_json(os.path.join(test_folder, 'cif.json'))
    163     print(len(data))
    164     if len(data) == 0:

~\Xerus\Xerus\queriers\multiquery.py in load_json(path)
     77     """
     78 
---> 79     with open(path, "r") as fp:
     80         return json.load(fp)
     81 

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Ali Bhatti\\Xerus\\Xerus\\queriers\\LiMn2O4+Li2MnO3_parsed.csv_Mn_cifs\\cif.json'

Update Examples notebooks

As of the last release, Xerus is locally installable via pip, removing the need for path hacks. The notebooks have to be updated to reflect this change.

Investigate speed ups

Currently there are a few bottlenecks that can make a first-run analysis slow. Investigate how to solve these bottlenecks to make things smoother.

CIF test: Currently, all structures downloaded from a provider go through a simple test with GSAS II to check whether they work or not (some structures can literally break GSASII). Investigate the following:

  • Check whether these tests are still needed after changing a few refinement conditions
  • If they are, evaluate how to parallelize this test
  • Think of a different way to validate the CIFs.

Caching simulations: Currently, every time an analysis is run, Xerus simulates the patterns for all structures. Even if the analysis is repeated for the same sample with different parameters (i.e., increased n_runs, a changed box width, an ignored structure, etc.), all the simulations are redone. When there are a lot of structures to simulate, this can take a while. Check the following options:

  • Feasibility of caching the simulations during the testing above. What to do when the instrument parameters change?
  • When an analysis is run, keep track of what is simulated for that run. Then, if it is run again with different parameters, reuse the previously simulated data, but do not store it locally.
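A minimal sketch of the caching idea, keyed on both the structure and the instrument parameters so that a changed instrument file invalidates old entries; `simulate` is a hypothetical stand-in for the real simulation call:

```python
# Cache of simulated patterns, keyed by (structure, instrument parameters).
_sim_cache: dict = {}

def simulated_pattern(structure_id: str, instrument_params: tuple, simulate):
    """Return a cached pattern, re-simulating only on a cache miss.

    Including instrument_params in the key addresses the first bullet:
    a changed instrument file simply produces a new key.
    """
    key = (structure_id, instrument_params)
    if key not in _sim_cache:
        _sim_cache[key] = simulate(structure_id, instrument_params)
    return _sim_cache[key]
```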

Make it possible to load old search results

Currently, after a search is done, all the results are saved in a dataframe and exported to CSV. However, there is no option to load this CSV to re-visualize past results; the only way is to rerun the search. Solve this so it is possible to use Xerus's plotting functions to quickly re-visualize results, and also to allow further optimization if necessary.

  • Add an option to read 'Refinements Results'.csv
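A sketch of what reading the exported CSV back could look like; the column names and values below are purely illustrative, since the real 'Refinements Results' layout may differ:

```python
import csv
import io

# Illustrative sample; the real input is the CSV exported after a search.
sample_csv = """name,rwp,spacegroup
Ho2O3,8.41,Ia-3
HoB2,12.07,P6/mmm
"""

def load_results(fp) -> list:
    """Parse an exported results CSV back into records, converting rwp
    to float so past searches can be re-plotted without rerunning them."""
    rows = list(csv.DictReader(fp))
    for row in rows:
        row["rwp"] = float(row["rwp"])
    return rows

results = load_results(io.StringIO(sample_csv))
```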

Deprecate 'tcif.py' and move testing to not be a script

As of PR #24, we no longer do test refinements. Since things became much more stable and there are no longer errors that break everything and require rerunning the whole script, the remaining purpose of tcif.py can be moved elsewhere.

TODO:

  • Move testing to be a simple function that does the 'loading' internally
  • Remove tcif.py

discussion: Repeat querying

Hello,

Thank you for the nice work!
While using Xerus, I feel there might be a lot of repeated querying of the database.
For instance, for a system with elements A, B, C, and D,
the program tries to query the combinations A, B, C, D, (AB), (AC), (AD), (BC), (CD), and so on.
However, if you just query for (ABCD), won't you get all the combinations from the database? Or am I totally wrong?
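For context, the number of separate sub-system queries grows quickly with the element count: N elements give 2^N - 1 non-empty combinations. A small illustration:

```python
from itertools import combinations

def sub_systems(elements):
    """All non-empty element combinations that would be queried separately."""
    return [c for r in range(1, len(elements) + 1)
            for c in combinations(sorted(elements), r)]

systems = sub_systems(["A", "B", "C", "D"])
# 4 unary + 6 binary + 4 ternary + 1 quaternary = 15 separate queries
```

Whether a single (ABCD) query can replace them depends on the provider's filter semantics: an OPTIMADE filter like `elements HAS ONLY "A","B","C","D"` would match sub-systems too, while `elements HAS ALL ...` would not, so this may indeed be an avenue to cut down repeated querying.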

Add support to create streamlit interface

To increase ease of use, add functionality to support the creation of a Streamlit interface.

TODO:

  • Allow plots to be returned as objects so they can be displayed in a Streamlit interface
  • Investigate whatever else is necessary

The Xerus-streamlit project will be done in a different repo.

Investigate how to deal when large amount of data is sent by provider through OPTIMADE

Currently, when a provider sends a large amount of data back from an OPTIMADE structure query, the result is a 503/504 error and the querier stops; an example is querying COD for the ['Si', 'O'] system.

For example this query URL will return 504

Example of extremely large file:
COD ID 1552091

Investigate how to handle this.
Ideas:

  1. Maybe if the query returns a 503 error, we should retry it with a reduced page_limit? (current default: 10)
  2. Skip the current query and move to the next page (add page_limit to page_offset?)
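Idea 1 might be sketched like this; `fetch` is a hypothetical stand-in for the actual OPTIMADE request:

```python
def query_with_backoff(fetch, page_limit: int = 10, min_limit: int = 1):
    """Halve page_limit on 503/504 until the query succeeds or no
    smaller page size is possible (current default page_limit is 10)."""
    while page_limit >= min_limit:
        status, data = fetch(page_limit)
        if status == 200:
            return data
        if status in (503, 504):
            page_limit //= 2  # smaller responses are less likely to time out
            continue
        raise RuntimeError(f"unexpected HTTP status {status}")
    raise RuntimeError("query failed even at the smallest page size")
```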

Investigate AFLOW error connection in CI

Currently, the CI is failing due to a connection issue with AFLOW.

GitHub Actions fails to connect to AFLOW using urllib, even though the connection to the COD seems to work fine. This leads to an error in the CI tests, since the structures cannot be downloaded for testing the solvers. If the tests are run on a local machine (i.e., my machine), they pass successfully.

We have to find a way to fix this. For the moment I will create a separate branch where AFLOW is disabled and confirm the tests pass there.

Fix dummy not being added when a certain element combination returns no data

As of the latest version, the "dummy" entry that is created in the database when no structures exist for a given element combination in any of the providers (to avoid continuously re-querying that combination) is no longer being added. This probably broke when the testing method changed. Fix this.

How to handle multi user for the same installation? (Simultaneous query)

Currently, as designed, Xerus can only handle one user per installation when querying for missing CIFs for a given chemical space.
With the development of the Streamlit beta interface, one installation may be used by many users.
In this scenario, Xerus cannot handle concurrent queries for different chemical spaces, as it always saves and tests the CIFs using fixed names.
Change this to support simultaneous queries by multiple users.

Ideas:

  • Instead of using a 'predefined' name per provider (mp_dump, cod_dump, aflow_dump, etc.), we could use the chemical-space name plus the provider, e.g. Ho-B_mp_dump, and do the same for the queried_cifs folder.
  • If we do the above, we also have to change 'test_cif' to accept the final folder as the destination for final testing.

TODO:

  • Change the static query folder names ('xx_dump') to names based on dataset name + chemical space + provider, i.e. from 'mp_dump' to 'DataSetName_ChemSpace_mp_dump' (or something shorter)
  • Change CIF testing to also take its final 'dump folder' from the same scheme, instead of the static queried_cifs
  • When multiple users query overlapping spaces, before adding the CIFs we should check whether each structure already exists in the database, OR we should 'clean up' duplicates from the database after querying (whichever is better; maybe the first option)
  • Write a few tests (not for CI, just to check that everything works) emulating multi-user usage with overlapping, identical, and non-overlapping chemical spaces
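The first TODO item could be as simple as the following; all names here are illustrative:

```python
def dump_folder(dataset: str, elements: list, provider: str) -> str:
    """Per-query dump folder name, e.g. 'run1_B-Ho_mp_dump', replacing
    the fixed 'mp_dump' that concurrent users would collide on."""
    chem_space = "-".join(sorted(elements))
    return f"{dataset}_{chem_space}_{provider}_dump"
```

Sorting the elements keeps the name deterministic, so two users querying the same chemical space still map to the same folder.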

Allow to use first run simulations for subsequent runs of a given element space

Currently, every time the analyze function is run (even with the same parameters), Xerus re-simulates the patterns and re-queries the database. This is time consuming, and it slows down the sometimes-necessary iterative tuning of hyperparameters (i.e., g, delta, n_runs, provider settings, and so on). In light of this problem, the following changes are needed:

  • After the first run, keep track of the already simulated patterns / correlations (should be almost there).
  • If these simulations already exist, use them instead of re-simulating / re-querying.
  • Keep a 'safe' copy of simulations / structures that can always be accessed for filtering purposes.

Set HTTP `User-Agent` header for `requests`

It is usually very nice when software presents itself in the HTTP User-Agent header. Web browsers almost always present themselves in a verbose but neat way. This way server-side developers are informed about what software makes requests to their servers, and may occasionally forward common issues to the client developers. Moreover, usage statistics could be drawn on the server side, which is nice.

Xerus seems to use the Python requests package for HTTP requests. By default its User-Agent is python-requests/<version>. Changing this seems to be rather simple; it would be nice to see Xerus/<version> at least.
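Indeed, with requests this is essentially a one-liner on a Session; the version string below is a placeholder for whatever the package actually exposes:

```python
import requests

XERUS_VERSION = "1.0"  # placeholder; the real version lives in the package

def make_session() -> requests.Session:
    """A session that identifies Xerus to providers instead of the
    default 'python-requests/<version>' User-Agent."""
    session = requests.Session()
    session.headers["User-Agent"] = f"Xerus/{XERUS_VERSION}"
    return session
```

Routing all queriers through such a session would set the header in one place.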

If user privacy is a concern, it should not be forgotten that this is F/LOSS; anyone concerned with their privacy is free to patch their copy of the client to their liking.

Installation problem with conda/pip

Hi all, I get this error message trying to install xerus in a conda environment.
With pip install -e . I get this error:

ImportError: Error importing numpy: you should not try to import numpy from
its source directory; please exit the numpy source tree, and relaunch
your python interpreter from there.

I have no clue what to do with this message.
Thank you
Klemens

Run tests in continuous integration and dockerize the package

Regarding CI, I've looked around but I could not find a way to set it up, especially how to automatically start mongo, create a conda env, run pip, and set up all the config files automatically. If you have any clues on how to do it, let me know. I was also planning to create a Docker environment to easily deploy it, but I got stuck on the same issue.

Originally posted by @pedrobcst in #4 (comment)

As far as I see it, there are a few packaging/testing things that could be added to the package for better re-use:

  • A setup.py to allow the package to be pip install'ed
  • Some CI (presumably GitHub actions) that installs the package and runs the existing tests, which require mongo
  • A Dockerfile (or multiple) that containerize the services required to run the overall package (e.g., database, filesystem and Python code)

I can spend an hour or so on this here and there to get the ball rolling, as these are all requirements for me to use the package elsewhere. Perhaps the overall containerization can wait until we have had a discussion in the future.

Black box optimizer

Hi
when running the Examples notebook, I encounter this issue when getting to the optimization step (both on Windows and Mac):

mixture.run_optimizer(n_trials = 200,            # How many runs to try
                      n_startup = 20,            # How many trials to start the search with
                      allow_pref_orient = True,  # Allow preferred orientation to be refined
                      allow_atomic_params = True,  # Allow atomic parameters (X, U) to be considered
                      allow_broad = True,        # Allow broadening terms (check GSAS II doc.) (test)
                      allow_strain = True,       # Allow strain terms (test)
                      allow_angle = True,        # Allow acute angle refinement first
                      force_ori = False,         # Always consider pref. orientation
                      verbose = 'silent',        # Verbose
                      param = 'rwp',             # Objective goal
                      n_jobs = -1,               # Number of cores to use
                      plot_best = True,          # Plot best result after opt.
                      show_lattice = True,       # Print obtained lattice parameters
                      random_state = 71,         # Random state
                      )

[W 2023-06-26 21:38:25,378] Trial 5 failed because of the following error: AttributeError("Can't pickle local object 'BlackboxOptimizer.objective_mp.<locals>.evaluate'")
Traceback (most recent call last):

CI cannot install gfortran anymore

Recently the CI job is failing at:

Unable to locate package libgfortran4

This is probably (?) due to the newest version of Ubuntu (i.e., ubuntu-latest). Think about how to fix this.

Any ideas @ml-evs ?

Write tests for structure query through OPTIMADE

Write test coverage for the following cases:

  1. Normal OQMD query through Optimade
  2. OQMD query with one filter (_oqmd_stab)
  3. Query with more than one filter
  4. COD query restricting structure volume
  5. Assert all queries are returning 202

Change CI to be OS indepedent

With the possible new release (1.1b), we might support all OSes. In light of this, it might be necessary to update the CI to test on all of them.

This (hypothetically) might work [basically, move to conda for CI environment management]:

  • Create a conda environment file that installs all Xerus dependencies
  • Create a new environment from this file, so there is no need for apt-get to install gfortran etc. (it will come with conda)
  • Set the tests to run on windows-latest / macos-latest

GSASII Scriptable

Hi
I have an issue on the Linux version (through VirtualBox) with the GSASII scriptable installation. When I run the tests, I run into an error in test_gsas2.py:
ERROR GSAS-II binary libraries not found.

I also installed it on Windows, and there it worked fine. Any advice?

Evaluate and Implement adding temperature restriction to COD

Currently, one of the main issues when querying the COD is the lack of control over which structures we obtain. As discussed in the paper, one of the main sources of misclassification is when a distorted low-temperature structure (which usually comes from the COD) is matched instead of the room-temperature one.

In this situation, one possibility is to implement an extra filter in the OPTIMADE querier for the COD (_cod_celltemp) to restrict structures to around room temperature only (maybe 293 +- 5 K?).

This info does not seem to be available through the COD REST API, so the OPTIMADE querier would have to become the main one.

To do this, evaluate:

  • Is there any change in the total number of structures if we use _cod_celltemp as a filter?
  • What would the impact be on the Xerus benchmarks / examples?
  • If, in the first case, there are a lot of structures with no _cod_celltemp, an option might be to apply the filter post-query and keep the structures that have no cell temperature
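The post-query fallback from the last bullet might look like this; the field name `_cod_celltemp` comes from this issue, while the entry format is an assumption:

```python
ROOM_T, TOL = 293.0, 5.0  # proposed window from this issue: 293 +/- 5 K

def keep_structure(entry: dict) -> bool:
    """Post-query filter: keep room-temperature structures, and also
    keep entries that report no cell temperature at all."""
    temp = entry.get("_cod_celltemp")
    if temp is None:
        return True
    return abs(float(temp) - ROOM_T) <= TOL
```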

Add option for 'is ceramic' in streamlit interface

When we are doing powder matching for ceramics, it would be ideal if we could have a check button that searches only 'oxygen' spaces.

TODO:

  • Add a checkbox for ceramic search
  • If the checkbox is checked, automatically add to ignore_combs all combinations that do not contain oxygen
  • If an extra user-specified ignore_comb is added, merge it with the predefined one if it is not already there
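The second TODO item could be sketched as follows, assuming ignore_comb entries are dash-joined element strings (an assumption about Xerus's internal format):

```python
from itertools import combinations

def non_oxygen_combs(elements):
    """Combinations to add to ignore_comb when 'is ceramic' is checked:
    every sub-system that does not contain oxygen."""
    combs = []
    for r in range(1, len(elements) + 1):
        for combo in combinations(sorted(elements), r):
            if "O" not in combo:
                combs.append("-".join(combo))
    return combs
```

A user-specified ignore_comb list could then be merged in with a simple set union.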
