Comments (3)
I can see multiple potential issues/things that are hard to do with an "everything is a 2D array" mindset:
- librascal in practice does one model per central species, but computes a single large kernel. That means selecting sparse points per species, and computing the K_MM and K_NM kernels, needs to account for this.
- When trying to predict "per structure" properties, we need to sum over the rows of the feature matrix/kernel. But this sum might run over different numbers of rows for different structures (because they have different numbers of atoms).
- Sometimes (Chiheb's work on DOS) we need to include only some of all the possible gradients. That means making sure the gradient feature & gradient property refer to the same atom/direction.
from scikit-matter.
> librascal in practice does one model per central species, but computes a single large kernel. That means selecting sparse points per species, and computing the K_MM and K_NM kernels, needs to account for this.

This is taken into account by giving the user the option to create a custom kernel builder function. See the next answer for more elaboration.
> When trying to predict "per structure" properties, we need to sum over the rows of the feature matrix/kernel. But this sum might run over different numbers of rows for different structures (because they have different numbers of atoms).

The X object must carry the information about the atoms in the structures and their species to be able to build a kernel. So X must be a librascal manager or some other object with meta information. To infer this information, the kernel builder must know how the data format is structured. So the kernel builder and the data format depend on each other.
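The varying row counts themselves can be handled with plain numpy once the number of atoms per structure is known. A hedged sketch (the name `atoms_per_struc` is made up here, standing in for the meta information mentioned above):

```python
import numpy as np

# per-atom feature matrix for three structures with 2, 3 and 1 atoms
atoms_per_struc = [2, 3, 1]
X = np.arange(6 * 4, dtype=float).reshape(6, 4)

# row offsets where each structure starts: [0, 2, 5]
offsets = np.concatenate(([0], np.cumsum(atoms_per_struc)[:-1]))
# sum the per-atom rows belonging to each structure
X_per_structure = np.add.reduceat(X, offsets, axis=0)
print(X_per_structure.shape)  # one row per structure: (3, 4)
```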
> sometimes (Chiheb's work on DOS) we need to include only some of all the possible gradients. That means making sure the gradient feature & gradient property refer to the same atom/direction.

This seems to be possible with a specific kernel function that interprets the data format correctly. The required information can be retrieved in librascal by getting the gradient info. I will keep this in mind and try it out.
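Selecting only some gradients then amounts to masking rows of the gradient feature matrix by their labels. A hedged sketch, assuming each gradient row is labelled by a (structure, atom, direction) triple similar to the gradient info librascal exposes (all names here are illustrative):

```python
import numpy as np

# one label row per gradient row: (structure index, atom index, direction)
grad_info = np.array([(0, 0, 0), (0, 0, 1), (0, 0, 2),
                      (0, 1, 0), (0, 1, 1), (0, 1, 2)])
grad_features = np.random.default_rng(0).normal(size=(6, 4))

# keep only the gradients of atom 1 along x and y (directions 0 and 1),
# so gradient features and gradient properties can be matched row by row
mask = (grad_info[:, 1] == 1) & np.isin(grad_info[:, 2], [0, 1])
selected = grad_features[mask]
print(selected.shape)  # (2, 4)
```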
This is the interface I see at the moment. One still has to write a custom kernel builder function for each data type that embeds the meta information about the structures differently. But I still see an advantage in making the solvers, and the efficient way of setting regularizers without recomputing the whole kernel (currently only implemented in the gaptools branch), available outside of librascal.
```python
import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin


class SparseGPR(MultiOutputMixin, RegressorMixin, BaseEstimator):
    # kernel_fit and kernel_predict stand for the custom kernel builder
    # functions mentioned above; they are left undefined in this sketch
    def __init__(self, kernel, alpha=1):
        self.kernel = kernel
        self.alpha = alpha

    def fit(self, X, y):
        XN = X[0]
        XM = X[1]
        # TODO mem copies are missing everywhere
        KMM = self.kernel_fit(XM)
        KNM, Y = self.kernel_fit(XM, XN, y, self.alpha)
        K = KMM + np.dot(KNM.T, KNM)
        Y = np.dot(KNM.T, Y)
        # all the different solvers could be moved here
        # https://github.com/lab-cosmo/librascal/blob/33a8e7b2c8e7bcbe533eabd9a225c1389300e41d/bindings/rascal/models/krr.py#L40
        self.coefs_ = np.linalg.lstsq(K, Y, rcond=None)[0]
        self.X_fit_ = XM
        return self

    def set_regularizers(self, regularizer=1.0, jitter=0.0):
        # would be a copy of
        # https://github.com/lab-cosmo/librascal/blob/33a8e7b2c8e7bcbe533eabd9a225c1389300e41d/bindings/rascal/models/krr.py#L64
        ...

    def predict(self, X):
        return self.kernel_predict(X, self.X_fit_, self.coefs_)


class GridSearchCVSparseGPR(SparseGPR):
    """
    Same as SparseGPR but implements a CV grid search over the regularizers,
    using the set_regularizers function for a performance gain.
    """

    def __init__(self, kernel, alphas=(1e-3, 1e-2, 1e-1), cv=2):
        self.kernel = kernel
        self.alphas = alphas
        self.cv = cv

    def fit(self, X, y):
        # grid search happens here using set_regularizers
        return self

    def predict(self, X):
        return ...
```
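The solve inside `fit` can be checked numerically with a plain linear kernel standing in for the custom kernel builder. This is only a sketch; `linear_kernel` and the way the regularizer is folded into `KNM` and `y` are assumptions made for illustration:

```python
import numpy as np

def linear_kernel(A, B=None):
    # stand-in for a custom kernel builder function
    B = A if B is None else B
    return A @ B.T

rng = np.random.default_rng(0)
XN = rng.normal(size=(20, 3))         # all training samples
XM = XN[::4]                          # 5 "sparse" points picked from XN
y = XN @ np.array([1.0, -2.0, 0.5])   # an exactly linear target

alpha = 1e-6
KMM = linear_kernel(XM)
# fold the regularizer into KNM and y
KNM = linear_kernel(XN, XM) / np.sqrt(alpha)
Y = y / np.sqrt(alpha)

K = KMM + KNM.T @ KNM
coefs = np.linalg.lstsq(K, KNM.T @ Y, rcond=None)[0]

# predictions on the training set should be close to y for small alpha
y_pred = linear_kernel(XN, XM) @ coefs
print(float(np.max(np.abs(y_pred - y))))
```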
When XN and XM are not numpy arrays, the SparseGPR class is not compatible with other scikit-learn-like classes (for example Pipelines). One possible solution would be to write several wrapper classes to fix this; I don't see any other solution for librascal's managers. For people who ditch librascal anyway after the computation of the features/gradients, I thought one might just extend the numpy array class with some simple meta information about the atomic structures. Then one can still use the scikit-learn-like classes without any additional wrapper classes.
```python
import numpy as np


class AtomisticArray(np.ndarray):
    def __new__(cls, input_array, meta=None):
        # Input array is an already formed ndarray instance;
        # we first cast it to our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.meta = meta
        # finally, we must return the newly created object
        return obj

    def __array_finalize__(self, obj):
        # see the InfoArray example in the numpy docs for comments
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', None)


# some atomistic features
X = ...
X_meta = {'atoms_per_struc': [len(frame) for frame in frames],
          'species': [frame.numbers for frame in frames]}
X = AtomisticArray(X, meta=X_meta)
```
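A compact, self-contained check that metadata attached this way actually survives slicing and ufuncs, which is what `__array_finalize__` provides (the class name here is illustrative):

```python
import numpy as np

class MetaArray(np.ndarray):
    """Minimal ndarray subclass carrying a `meta` attribute."""

    def __new__(cls, input_array, meta=None):
        obj = np.asarray(input_array).view(cls)
        obj.meta = meta
        return obj

    def __array_finalize__(self, obj):
        # called for views and ufunc results, so the metadata propagates
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', None)

X = MetaArray(np.arange(12.0).reshape(4, 3), meta={'atoms_per_struc': [1, 3]})
print(X[:2].meta)    # views keep the metadata
print((2 * X).meta)  # results of ufuncs do too
```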
I have put a python script with more details and some further applications of Pipelines in the attachment at the end. It's quite chaotic. I am not sure how a lot of things will work with the already existing scikit-learn functionality, since problems usually appear when one wants to integrate with GridSearchCV or Pipeline, but this is just a sketch of where this development could go. One problem for which I couldn't find a solution is the integration of sample selectors into Pipelines, since they shouldn't select samples in the prediction phase, but this is a bit out of the scope of this issue.
https://gist.github.com/agoscinski/f773885ed85f7415225f26ae5a49c5b6
We do this now with equistore in equisolve.
lab-cosmo/equisolve#14
To replicate a sparse kernel class as in librascal, we need more meta information about the features. Since scikit-matter is designed to be domain agnostic (it does not know what the features are), equisolve is the better place for such an implementation.