Comments (3)
I can see multiple potential issues/things that are hard to do with an "everything is a 2D array" mindset:
- librascal in practice does one model per central species, but computes a single large kernel. That means selecting sparse points per species, and computing the K_MM and K_NM kernels, needs to account for this.
- When trying to predict "per structure" properties, we need to sum over the rows of the feature matrix/kernel. But this sum might run over different numbers of rows for different structures (because they have different numbers of atoms).
- Sometimes (Chiheb's work on DOS) we need to include only some of all the possible gradients. That means making sure the gradient feature & gradient property refer to the same atom/direction.
from scikit-matter.
> librascal in practice does one model per central species, but computes a single large kernel. That means selecting sparse points per species, and computing the K_MM and K_NM kernels, needs to account for this.

This is taken into account by giving the user the option to create a custom kernel builder function. See the next answer for more elaboration.
> When trying to predict "per structure" properties, we need to sum over the rows of the feature matrix/kernel. But this sum might run over different numbers of rows for different structures (because they have different numbers of atoms).

The X object must carry the information about the atoms in the structures and their species to be able to build a kernel. So X must be a librascal manager or some other object with meta information. To infer this information, the kernel builder must know how the data format is structured. So the kernel builder and the data format depend on each other.
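The varying row counts themselves can be handled with plain numpy once the number of atoms per structure is known. A hedged sketch (the name `atoms_per_struc` is made up here, standing in for the meta information mentioned above):

```python
import numpy as np

# per-atom feature matrix for three structures with 2, 3 and 1 atoms
atoms_per_struc = [2, 3, 1]
X = np.arange(6 * 4, dtype=float).reshape(6, 4)

# row offsets where each structure starts: [0, 2, 5]
offsets = np.concatenate(([0], np.cumsum(atoms_per_struc)[:-1]))
# sum the per-atom rows belonging to each structure
X_per_structure = np.add.reduceat(X, offsets, axis=0)
print(X_per_structure.shape)  # one row per structure: (3, 4)
```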
> sometimes (Chiheb's work on DOS) we need to include only some of all the possible gradients. That means making sure the gradient feature & gradient property refer to the same atom/direction.

This seems to be possible with a specific kernel function that interprets the data format correctly. The required information can be retrieved in librascal by getting the gradient info. I will keep this in mind and try it out.
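Selecting only some gradients then amounts to masking rows of the gradient feature matrix by their labels. A hedged sketch, assuming each gradient row is labelled by a (structure, atom, direction) triple similar to the gradient info librascal exposes (all names here are illustrative):

```python
import numpy as np

# one label row per gradient row: (structure index, atom index, direction)
grad_info = np.array([(0, 0, 0), (0, 0, 1), (0, 0, 2),
                      (0, 1, 0), (0, 1, 1), (0, 1, 2)])
grad_features = np.random.default_rng(0).normal(size=(6, 4))

# keep only the gradients of atom 1 along x and y (directions 0 and 1),
# so gradient features and gradient properties can be matched row by row
mask = (grad_info[:, 1] == 1) & np.isin(grad_info[:, 2], [0, 1])
selected = grad_features[mask]
print(selected.shape)  # (2, 4)
```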
This is the interface I see at the moment. One still has to write a custom kernel builder function for each data type that embeds the meta information about the structures differently. But I still see an advantage in making the solvers, and the efficient way of setting regularizers without recomputing the whole kernel (currently only implemented in the gaptools branch), available outside of librascal.
```python
import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin


class SparseGPR(MultiOutputMixin, RegressorMixin, BaseEstimator):
    # kernel_fit and kernel_predict stand for the custom kernel builder
    # functions mentioned above; they are left undefined in this sketch
    def __init__(self, kernel, alpha=1):
        self.kernel = kernel
        self.alpha = alpha

    def fit(self, X, y):
        XN = X[0]
        XM = X[1]
        # TODO mem copies are missing everywhere
        KMM = self.kernel_fit(XM)
        KNM, Y = self.kernel_fit(XM, XN, y, self.alpha)
        K = KMM + np.dot(KNM.T, KNM)
        Y = np.dot(KNM.T, Y)
        # all the different solvers could be moved here
        # https://github.com/lab-cosmo/librascal/blob/33a8e7b2c8e7bcbe533eabd9a225c1389300e41d/bindings/rascal/models/krr.py#L40
        self.coefs_ = np.linalg.lstsq(K, Y, rcond=None)[0]
        self.X_fit_ = XM
        return self

    def set_regularizers(self, regularizer=1.0, jitter=0.0):
        # would be a copy of
        # https://github.com/lab-cosmo/librascal/blob/33a8e7b2c8e7bcbe533eabd9a225c1389300e41d/bindings/rascal/models/krr.py#L64
        ...

    def predict(self, X):
        return self.kernel_predict(X, self.X_fit_, self.coefs_)


class GridSearchCVSparseGPR(SparseGPR):
    """
    Same as SparseGPR but implements a CV grid search over the regularizers,
    using the set_regularizers function for a performance gain.
    """

    def __init__(self, kernel, alphas=(1e-3, 1e-2, 1e-1), cv=2):
        self.kernel = kernel
        self.alphas = alphas
        self.cv = cv

    def fit(self, X, y):
        # grid search happens here using set_regularizers
        return self

    def predict(self, X):
        return ...
```
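The solve inside `fit` can be checked numerically with a plain linear kernel standing in for the custom kernel builder. This is only a sketch; `linear_kernel` and the way the regularizer is folded into `KNM` and `y` are assumptions made for illustration:

```python
import numpy as np

def linear_kernel(A, B=None):
    # stand-in for a custom kernel builder function
    B = A if B is None else B
    return A @ B.T

rng = np.random.default_rng(0)
XN = rng.normal(size=(20, 3))         # all training samples
XM = XN[::4]                          # 5 "sparse" points picked from XN
y = XN @ np.array([1.0, -2.0, 0.5])   # an exactly linear target

alpha = 1e-6
KMM = linear_kernel(XM)
# fold the regularizer into KNM and y
KNM = linear_kernel(XN, XM) / np.sqrt(alpha)
Y = y / np.sqrt(alpha)

K = KMM + KNM.T @ KNM
coefs = np.linalg.lstsq(K, KNM.T @ Y, rcond=None)[0]

# predictions on the training set should be close to y for small alpha
y_pred = linear_kernel(XN, XM) @ coefs
print(float(np.max(np.abs(y_pred - y))))
```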
When XN and XM are not numpy arrays, the SparseGPR class is not compatible with other scikit-learn-like classes (for example Pipelines). One possible solution would be to write several wrapper classes to fix this; I don't see any other solution for librascal's managers. For people who ditch librascal anyway after the computation of the features/gradients, I thought one might just extend the numpy array class with some simple meta information about the atomic structures. Then one can still use the scikit-learn-like classes without any additional wrapper classes.
```python
import numpy as np


class AtomisticArray(np.ndarray):
    def __new__(cls, input_array, meta=None):
        # Input array is an already formed ndarray instance;
        # we first cast it to our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.meta = meta
        # finally, we must return the newly created object
        return obj

    def __array_finalize__(self, obj):
        # see the InfoArray example in the numpy docs for comments
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', None)


# some atomistic features
X = ...
X_meta = {'atoms_per_struc': [len(frame) for frame in frames],
          'species': [frame.numbers for frame in frames]}
X = AtomisticArray(X, meta=X_meta)
```
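A compact, self-contained check that metadata attached this way actually survives slicing and ufuncs, which is what `__array_finalize__` provides (the class name here is illustrative):

```python
import numpy as np

class MetaArray(np.ndarray):
    """Minimal ndarray subclass carrying a `meta` attribute."""

    def __new__(cls, input_array, meta=None):
        obj = np.asarray(input_array).view(cls)
        obj.meta = meta
        return obj

    def __array_finalize__(self, obj):
        # called for views and ufunc results, so the metadata propagates
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', None)

X = MetaArray(np.arange(12.0).reshape(4, 3), meta={'atoms_per_struc': [1, 3]})
print(X[:2].meta)    # views keep the metadata
print((2 * X).meta)  # results of ufuncs do too
```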
I have put a python script with more details and some further applications of Pipelines in the attachment at the end. It's quite chaotic. I am not sure how a lot of things will work with the already existing scikit-learn functionality, since problems usually appear when one wants to integrate with GridSearchCV or Pipeline, but this is just a sketch of where this development could go. One problem for which I couldn't find a solution is the integration of sample selectors into Pipelines, since they shouldn't select samples in the prediction phase, but this is a bit out of the scope of this issue.
https://gist.github.com/agoscinski/f773885ed85f7415225f26ae5a49c5b6
We do this now with equistore in equisolve.
lab-cosmo/equisolve#14
To replicate a sparse kernel class as in librascal, we need more meta information about the features. Since scikit-matter is designed to be domain agnostic (it does not know what the features are), equisolve is the better place for such an implementation.