
Comments (3)

Luthaf commented on June 12, 2024

I can see multiple potential issues/things hard to do with a "everything is a 2D array" mindset:

  • librascal in practice does one model per central species, but computes a single large kernel. That means sparse points have to be selected per species, and the computation of the K_MM and K_NM kernels needs to account for this.
  • When trying to predict "per structure" properties, we need to sum over the rows of the feature matrix/kernel. But this sum might run over different numbers of rows for different structures (because they have different numbers of atoms).
  • sometimes (Chiheb's work on DOS) we need to include only some of all the possible gradients. That means making sure the gradient feature & gradient property refer to the same atom/direction.


agoscinski commented on June 12, 2024
  • librascal in practice does one model per central species, but computes a single large kernel. That means sparse points have to be selected per species, and the computation of the K_MM and K_NM kernels needs to account for this.

This is taken into account by giving the user the option to create a custom kernel builder function. See more elaboration in the next answer.
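
As a minimal sketch of what such a per-species kernel builder could look like (the .features and .species attributes are made up for illustration, they are not a librascal API):

import numpy as np

def per_species_kernel(XM, XN=None):
    """Hypothetical kernel builder: one model per central species
    inside a single large kernel, realized by zeroing out all
    cross-species blocks."""
    if XN is None:
        XN = XM
    # plain linear kernel between all environments
    K = np.dot(XN.features, XM.features.T)
    # keep only entries where the central species match, so each
    # species effectively gets its own sub-model
    mask = XN.species[:, None] == XM.species[None, :]
    return K * mask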

  • When trying to predict "per structure" properties, we need to sum over the rows of the feature matrix/kernel. But this sum might run over different numbers of rows for different structures (because they have different numbers of atoms).

The X object must carry the information about the atoms in the structures and their species to be able to build a kernel. So X must be a librascal manager or some other object with meta information. To infer this information, the kernel builder must know how the data format is structured. So the kernel builder and the data format depend on each other.
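
As a sketch of how such meta information could be used for "per structure" properties (the atoms_per_struc list is the same made-up meta field as in the AtomisticArray example further down):

import numpy as np

def sum_rows_per_structure(K_atomic, atoms_per_struc):
    # collapse a per-atom kernel of shape (n_atoms_total, M) into a
    # per-structure kernel of shape (n_structures, M) by summing the
    # rows belonging to each structure, even though the structures
    # contain different numbers of atoms
    starts = np.cumsum([0] + list(atoms_per_struc[:-1]))
    return np.add.reduceat(K_atomic, starts, axis=0)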

  • sometimes (Chiheb's work on DOS) we need to include only some of all the possible gradients. That means making sure the gradient feature & gradient property refer to the same atom/direction.

This seems possible with a specific kernel function that interprets the data format correctly. The information can be retrieved in librascal by getting the gradient info. I will keep this in mind and try it out.
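
A rough sketch of that row selection, with made-up names for the gradient metadata (each row of grad_info and wanted being an (atom, direction) pair):

import numpy as np

def select_gradients(grad_features, grad_info, wanted):
    # keep only the gradient rows whose (atom, direction) pair also
    # exists on the property side, so that feature gradients and
    # property gradients stay aligned
    wanted_set = {tuple(pair) for pair in wanted}
    mask = np.array([tuple(pair) in wanted_set for pair in grad_info])
    return grad_features[mask]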

This is the interface I am seeing at the moment. One still has to write a custom kernel builder function for each data type that embeds the meta information about the structures differently. But I still see an advantage in having the solvers, and the efficient way to set regularizers without recomputing the whole kernel (currently only implemented in the gaptools branch), also available outside of librascal.

import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin


class SparseGPR(MultiOutputMixin, RegressorMixin, BaseEstimator):
    def __init__(self, kernel, alpha=1):
        # the kernel builder provides the kernel_fit/kernel_predict
        # functions used below
        self.kernel = kernel
        self.alpha = alpha

    def fit(self, X, y):
        XN = X[0]  # features of all environments
        XM = X[1]  # features of the sparse points
        # TODO mem copies are missing everywhere

        KMM = self.kernel_fit(XM)
        # the kernel builder receives y and alpha here and returns
        # the transformed targets Y alongside the kernel
        KNM, Y = self.kernel_fit(XM, XN, y, self.alpha)

        # normal equations of the sparse GPR
        K = KMM + np.dot(KNM.T, KNM)
        Y = np.dot(KNM.T, Y)

        # all the different solvers could be moved here
        # https://github.com/lab-cosmo/librascal/blob/33a8e7b2c8e7bcbe533eabd9a225c1389300e41d/bindings/rascal/models/krr.py#L40
        self.coefs_ = np.linalg.lstsq(K, Y, rcond=None)[0]
        self.X_fit_ = XM
        return self

    def set_regularizers(self, regularizer=1.0, jitter=0.0):
        # would be a copy of
        # https://github.com/lab-cosmo/librascal/blob/33a8e7b2c8e7bcbe533eabd9a225c1389300e41d/bindings/rascal/models/krr.py#L64
        ...

    def predict(self, X):
        return self.kernel_predict(X, self.X_fit_, self.coefs_)


class GridSearchCVSparseGPR(SparseGPR):
    """
    Same as SparseGPR but implements a CV grid search for the regularizers
    using the set_regularizers function for performance gain.
    """
    def __init__(self, kernel, alphas=(1e-3, 1e-2, 1e-1), cv=2):
        self.kernel = kernel
        self.alphas = alphas
        self.cv = cv

    def fit(self, X, y):
        # grid search is happening here using set_regularizers
        return self

    def predict(self, X):
        return ...
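
For illustration, this is how the sketched interface would be used (XN/XM stand in for the full and sparse feature objects; how the kernel builder is wired up to kernel_fit/kernel_predict is left open above):

# hypothetical usage of the interface sketched above
model = SparseGPR(kernel=per_species_kernel, alpha=1e-4)
model.fit((XN, XM), y)           # fit() unpacks X[0] = XN, X[1] = XM
y_pred = model.predict(XN_test)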

When XN and XM are not numpy arrays, the SparseGPR class is not compatible with other scikit-learn-like classes (for example Pipelines). One possible solution would be to write several wrapper classes to fix this; I don't see another solution for librascal's managers. For people who ditch librascal anyway after the computation of the features/gradients, one might just extend the numpy array class with some simple meta information about the atomic structures. Then the scikit-learn-like classes can still be used without any additional wrapper classes.

import numpy as np


class AtomisticArray(np.ndarray):
    def __new__(cls, input_array, meta=None):
        # input_array is an already formed ndarray instance;
        # we first cast it to our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.meta = meta
        # finally, we must return the newly created object
        return obj

    def __array_finalize__(self, obj):
        # see the InfoArray example in the numpy subclassing docs
        # for comments
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', None)

# some atomistic features
X = ...
X_meta = {'atoms_per_struc': [len(frame) for frame in frames],
          'species': [frame.numbers for frame in frames]}
X = AtomisticArray(X, meta=X_meta)
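
Because of __array_finalize__, the meta information is propagated to views and slices created by numpy, for example:

# the meta attribute survives slicing
X_subset = X[:10]
assert X_subset.meta is X.meta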

I have put a Python script with more details and some further applications of Pipelines in the attachment at the end. It's quite chaotic. I am not sure how a lot of things will work with the already existing scikit-learn functionality, since problems usually appear when one wants to integrate with GridSearchCV or Pipeline, but this is just a sketch of where this development could go. One problem for which I couldn't find a solution is the integration of sample selectors into Pipelines, since they shouldn't select samples in the prediction phase, but this is a bit out of the scope of this issue.

https://gist.github.com/agoscinski/f773885ed85f7415225f26ae5a49c5b6


agoscinski commented on June 12, 2024

We do this now with equistore in equisolve.
lab-cosmo/equisolve#14

To replicate a sparse kernel class as in librascal, we need more meta information about the features. Since scikit-matter is designed to be domain agnostic (it does not know what the features are), equisolve is the better place for such an implementation.

