
foundry's People

Contributors

dependabot[bot], jbasilico, jdwendt, jeremy-wendt, smcrosb


foundry's Issues

Make compile using Java 8

Fix the build errors/warnings that happen with Java 8. Ideally in a way that it still builds with Java 6 and 7 as well.

Diagonal matrix multiplication can create lots of zeros in result matrix

May affect not just diagonal but other types of sparse matrices.

Example (Scala code):

import gov.sandia.cognition.math.matrix.{MatrixFactory, VectorFactory}
val factory = MatrixFactory.getSparseDefault()
val matrix1 = factory.createMatrix(2, 2)
matrix1.setElement(0,0,1.0)
matrix1.setElement(1,1,10.0)
val matrix2 = factory.createDiagonal(VectorFactory.getDenseDefault.copyArray(Array(1.0, 0.1)))
matrix1.times(matrix2)

That returns:

res2: gov.sandia.cognition.math.matrix.Matrix =
(0,0): 1.0
(0,1): 0.0
(1,0): 0.0
(1,1): 1.0

instead of:

res2: gov.sandia.cognition.math.matrix.Matrix =
(0,0): 1.0
(1,1): 1.0
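For illustration, a minimal sketch of the expected behavior (plain Java over hypothetical CSR-style arrays, not the Foundry's internal representation): only non-zero products get stored in the result.

import java.util.ArrayList;
import java.util.List;

public class SparseTimesDiagonalSketch {

    // Right-multiplies a CSR sparse matrix by a diagonal matrix, keeping
    // only non-zero products so no explicit zeros appear in the result.
    static List<String> timesDiagonal(int numRows, int[] rowStart,
        int[] colIndex, double[] values, double[] diag) {
        List<String> entries = new ArrayList<String>();
        for (int i = 0; i < numRows; i++) {
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                double product = values[k] * diag[colIndex[k]];
                if (product != 0.0) { // skip structural zeros
                    entries.add("(" + i + "," + colIndex[k] + "): " + product);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // matrix1 = diag(1, 10) in CSR form; matrix2 = diag(1.0, 0.1).
        List<String> result = timesDiagonal(2, new int[] {0, 1, 2},
            new int[] {0, 1}, new double[] {1.0, 10.0},
            new double[] {1.0, 0.1});
        for (String entry : result) {
            System.out.println(entry); // prints (0,0): 1.0 and (1,1): 1.0 only
        }
    }
}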

VectorFactory#copyValues(Collection<? extends Number>): The iteration order of the collection is used blindly

It appears that VectorFactory#copyValues(Collection<? extends Number>) expects the provided collection to be ordered, since it uses the iteration order of the collection when creating the vector. As some Java collections have arbitrary iteration order, there is potential for nasty ordering bugs if the developer is not careful to consider the underlying implementation of the collection before using this method.

I suggest documenting the expectation about iteration order in the javadoc for the method. Alternatively, the method could be replaced by variants accepting only ordered collections (such as List and SortedSet), though this might limit compatibility with unknown/future collections.
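To illustrate the hazard (VectorFactory and copyValues as described above; the rest is a minimal example):

import gov.sandia.cognition.math.matrix.VectorFactory;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class CopyValuesOrderExample {
    public static void main(String[] args) {
        List<Double> ordered = Arrays.asList(1.0, 2.0, 3.0);
        // A List has a defined iteration order: always yields (1, 2, 3).
        System.out.println(VectorFactory.getDefault().copyValues(ordered));

        // A HashSet has an arbitrary iteration order: the element order of
        // the resulting vector is unspecified and may vary between runs.
        System.out.println(VectorFactory.getDefault().copyValues(new HashSet<Double>(ordered)));
    }
}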

Implement random projection method

Random projections of the input data can be a useful method for creating non-linear features. It conceptually fits nicely with the rest of the Foundry, so we should add support for it.
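A minimal sketch of the idea in plain Java (a proposed shape, not an existing Foundry API): multiply the input by a fixed random Gaussian matrix to create k random features.

import java.util.Random;

public class RandomProjectionSketch {

    private final double[][] projection; // k x d random matrix, fixed after construction

    public RandomProjectionSketch(int k, int d, Random random) {
        this.projection = new double[k][d];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                // The 1/sqrt(k) scaling approximately preserves norms.
                this.projection[i][j] = random.nextGaussian() / Math.sqrt(k);
            }
        }
    }

    public double[] project(double[] input) {
        double[] output = new double[this.projection.length];
        for (int i = 0; i < output.length; i++) {
            for (int j = 0; j < input.length; j++) {
                output[i] += this.projection[i][j] * input[j];
            }
        }
        return output;
    }
}

Applying a fixed non-linearity (a threshold or cosine, say) to each projected coordinate then yields non-linear features.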

EigenvectorPowerIteration.java

public static Vector estimateEigenvector(
    final Vector initial,
    final Matrix A,
    final double stoppingThreshold,
    final int maxIterations )

This method takes stoppingThreshold and maxIterations to control the numerical method. Any idea what I should pass for these to achieve an implementation similar to the one here, i.e., with the damping parameter for PageRank (default = 0.85)? I have used the default values for now.
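Note that the PageRank damping parameter (0.85) is a property of how the matrix A is constructed, not of these two arguments: stoppingThreshold and maxIterations only control when the power iteration stops. A usage sketch with commonly chosen values (the 1.0e-6 tolerance and 100-iteration cap below are illustrative assumptions, not Foundry defaults):

// stoppingThreshold: stop once the eigenvector estimate changes by less
// than this amount between iterations; maxIterations: a hard upper bound.
Vector eigenvector = EigenvectorPowerIteration.estimateEigenvector(
    initial,   // starting guess, e.g. a normalized vector of all ones
    A,         // for PageRank, A would already incorporate the 0.85 damping
    1.0e-6,    // assumed tolerance; tighten for more precision
    100);      // assumed iteration cap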

Make it easier to do regularization with optimization methods

Our optimization methods for learning are currently designed with a heavy bias towards being used with a supervised cost function. However, there are other types of cost functions that people often use, such as regularized versions, that do not fit well with the current design.
We should adjust the design to accommodate these types of cost functions by making the generics more permissive and less tied to the specifics of the SupervisedCostFunction directly.
See the forum topic http://www.cognitivefoundry.org/?topic=a-couple-of-usage-questions-learning-package for some background information.

Implement a standard normalization learner

A common step in learning is to do feature normalization. One popular method for doing this is to normalize each feature by mapping it to a standard normal (Gaussian) distribution by subtracting the mean and dividing by the standard deviation.
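A minimal sketch of such a learner for a single feature (a hypothetical class, not an existing Foundry API):

public class StandardNormalizerSketch {

    private double mean;
    private double standardDeviation;

    // Learns the mean and (sample) standard deviation of the feature.
    public void learn(double[] data) {
        double sum = 0.0;
        for (double x : data) {
            sum += x;
        }
        this.mean = sum / data.length;

        double sumSquares = 0.0;
        for (double x : data) {
            sumSquares += (x - this.mean) * (x - this.mean);
        }
        this.standardDeviation = Math.sqrt(sumSquares / (data.length - 1));
    }

    // Maps a value onto a standard normal scale: subtract the mean and
    // divide by the standard deviation (guarding constant features).
    public double evaluate(double x) {
        return this.standardDeviation > 0.0
            ? (x - this.mean) / this.standardDeviation
            : 0.0;
    }
}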

Random Forest accuracy is reduced severely with the addition of zero-information features

Disclaimer: This could very well be a bug in my code. Perhaps someone could try to reproduce locally.

I've stumbled upon a weird problem. I'm using RandomForestFactory with the following parameters:

  • ensembleSize: 200
  • baggingFraction: 1.0
  • dimensionsFraction: 0.2
  • maxTreeDepth: Integer.MAX_VALUE
  • minLeafSize: 1

Consider the following trivial dataset: 10 samples where 5 are labelled 'A' and 5 are labelled 'B'. There is just one feature, with the value '1' for 'A' samples and the value '0' for 'B' samples. As expected, I am able to achieve 100% prediction accuracy on this dataset.

However, if I add 100 zero-information features to the dataset, something weird happens. If the samples are given random values of either '0' or '1' for these features, the accuracy falls to ~75%. If the samples are all given a value of just '0' instead, the accuracy falls further down to ~52% (i.e. only slightly better than random guessing).

I compared against Weka's Random Forest implementation with similar parameters and got 100% accuracy in all three cases.

Any ideas?

Implement basic restricted Boltzmann machine

A good implementation of a basic restricted Boltzmann machine (RBM) would be a valuable addition to the Foundry. It could then be used as a feature transformation for further learning.

More permissive generics on ClusterCreators

The generics on the various ClusterCreators could perhaps be slightly more permissive. I'm using the standard Java convention of referring to objects by the most general interface that makes sense in the context. For example:

List<String> list = new ArrayList<>();

Similarly, I intend to do:

ClusterCreator<Cluster<Vector>, Vector> creator = new DefaultClusterCreator<>();

However, this is not allowed because of the generics of DefaultClusterCreator. Instead, I have to do this:

ClusterCreator<? extends Cluster<Vector>, Vector> creator = new DefaultClusterCreator<>();

Could this be fixed? Of course there might be a good reason for this limitation. If so, please feel free to ignore this request :).

Spherical k-means with sparse vectors is slow

The dot product is happening in the wrong order for spherical k-means (cosine distance), which causes a loop over the dense vector. It should be over the sparse one.
One potential fix for this is to change the vector classes to prefer looping over a sparse vector rather than a dense one.
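The intent of the fix, as a sketch (the arrays below are hypothetical, not the Foundry's internal storage): loop over the sparse vector's stored entries and index into the dense vector, which costs O(number of non-zeros) instead of O(dimensionality).

public class SparseDotSketch {

    // Dot product that loops over the sparse operand only.
    static double dot(int[] sparseIndices, double[] sparseValues, double[] dense) {
        double sum = 0.0;
        for (int k = 0; k < sparseIndices.length; k++) {
            sum += sparseValues[k] * dense[sparseIndices[k]];
        }
        return sum;
    }
}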

Add immutable vector and matrix classes

Add the ability to make immutable vectors and matrices to help prevent accidental plusEquals (and the like) on something that is meant to be immutable.
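One possible shape, as a sketch over a raw array (a real version would presumably implement the Vector interface and throw UnsupportedOperationException from plusEquals, setElement, and the other mutators):

public final class ImmutableVectorSketch {

    private final double[] values;

    public ImmutableVectorSketch(double[] values) {
        // Defensive copy: later edits to the source array cannot leak in.
        this.values = values.clone();
    }

    public double getElement(int index) {
        return this.values[index];
    }

    public int getDimensionality() {
        return this.values.length;
    }

    // No mutators are exposed, so accidental in-place modification is
    // impossible by construction.
}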

Implement support for tensors

We have support for vectors and matrices, so tensors could be another good addition. We could have both a general Tensor class that can have a variable number of ways and also maybe a Tensor3 that has 3-way tensor specialization.

Implement an adapter class for common multi-level learning models

A very common use case for learning in the Foundry is having a supervised learner where we apply some transformation to the input and output data to represent it in the appropriate way for the learner. Currently it is up to the developer to do these transformations as part of calling the Foundry. However, because this happens so frequently, for example when transforming input data into vectors, it would be nice if some of it could happen automatically so that the details are abstracted away. Another place this happens is multi-level learning, where one or more unsupervised algorithms are applied before a supervised one. Thus, we should add a utility class that helps with this very common use case.

Improvement in API/documentation clarity with regards to "maximum minimum distance"

I have a small suggestion for improvement of clarity in the API/documentation for the AgglomerativeClusterer class: Rename "maximum minimum distance" to "maximum distance".

For example:

public void setMaxMinDistance(double maxMinDistance)

The maximum minimum distance between clusters that is allowed for the two clusters to be merged. If there are no clusters that remain that have a distance between them less than or equal to this value, then the clustering will halt. To not have this value factored into the clustering, set it to something such as Double.MAX_VALUE.

KMeansClusterer with a CentroidClusterDivergenceFunction crashes when a cluster ends up empty

In KMeansClusterer, the divergences between an element and each of the clusters are measured every iteration. At its core, it happens like this:

 double distance = this.divergenceFunction.evaluate(cluster, element);

When the divergence function is a CentroidClusterDivergenceFunction, evaluate() does this:

return this.divergenceFunction.evaluate(other, cluster.getCentroid());

However, this throws a NullPointerException when cluster is null (and thus can't be dereferenced at the .getCentroid() call).

A cluster is indeed set to null in KMeansClusterer when all its previous elements have been reassigned to different clusters:

if (members.size() > 0)
{
    cluster = this.creator.createCluster(members);
}
else
{
    cluster = null;
}
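A minimal guard, as a sketch (one possible fix, not necessarily the one the Foundry should adopt): treat a null (empty) cluster as infinitely far away, so it is never dereferenced and never selected as the nearest cluster:

double distance = (cluster == null)
    ? Double.POSITIVE_INFINITY // an empty cluster is never the nearest
    : this.divergenceFunction.evaluate(cluster, element);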

Iterate Over Vector Values Only

Is there a simple way to get an iterator over just the Doubles of a vector/matrix object? I know that one can iterate over VectorEntry's, but that doesn't fit easily into the generics/collections context (if all you're interested in are the values).
For example, it would be great to use something like Guava's Iterables methods on vector and matrix objects.
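A small adapter can bridge the gap (a sketch, not an existing Foundry utility; it relies on iterating over VectorEntry objects, as described above):

import gov.sandia.cognition.math.matrix.Vector;
import gov.sandia.cognition.math.matrix.VectorEntry;
import java.util.Iterator;

public class VectorValuesSketch implements Iterable<Double> {

    private final Vector vector;

    public VectorValuesSketch(Vector vector) {
        this.vector = vector;
    }

    @Override
    public Iterator<Double> iterator() {
        final Iterator<VectorEntry> entries = this.vector.iterator();
        return new Iterator<Double>() {
            public boolean hasNext() { return entries.hasNext(); }
            public Double next() { return entries.next().getValue(); }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}

An instance could then be passed anywhere an Iterable<Double> is expected, including Guava's Iterables methods.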

Random Forests are slowed down by AbstractDataDistribution#getEntropy()

I'm testing your new RandomForestFactory and am getting some pretty good results! However, the algorithm is slower than expected.

My code trains hundreds of Random Forest classifiers on a small test dataset. I profiled it, and noticed that ~45% of the time is spent in AbstractDataDistribution#getEntropy(). I suspect that this is not supposed to happen, but if I'm wrong, and this is indeed the natural center of computation, please feel free to close this issue.

I don't know what the underlying performance bottleneck is, but I suspect that the call to MathUtil.log2(double) may be the culprit.

Convert argument checks to use ArgumentChecker

There are lots of argument checks that were put in the code before ArgumentChecker was created. We should convert as many of these as possible and, where appropriate, add additional methods to ArgumentChecker to support these checks.

Here are some possible new checks:

  • Not empty string, array, or collection.
  • Two arguments are the same size array or collection.

We could also add variants of some methods that return the checked value, as sketched below. This is useful when checking an argument before calling a super constructor or when chaining constructors.
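A sketch of what a value-returning variant could look like (a hypothetical addition in the style of ArgumentChecker, not an existing method):

public static <T> T assertNotNull(final String argumentName, final T value) {
    if (value == null) {
        throw new IllegalArgumentException(argumentName + " cannot be null");
    }
    // Returning the checked value is what enables use inside super(...)
    // or this(...) calls, where statements are not allowed beforehand.
    return value;
}

A constructor could then write super(assertNotNull("data", data)); instead of having to perform the check after the super call.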

Upgrade version of MTJ

The foundry is currently using an older version of MTJ. It should be updated to use a newer one.

Failing MonteCarloSamplerTestHarness test

Hello,

On my setup (Java 1.7.0, Ubuntu Linux), the testSample test in MonteCarloSamplerTestHarness is failing - the mean falls outside of the confidence interval.

Running gov.sandia.cognition.statistics.montecarlo.DirectSamplerTest
Constructors
Known Values
clone
sample
Mean: 0.7308781907032909
Monte Carlo: Mean: 0.6873011446918514 Variance: 0.00459004806254931
Interval: 0.673858098932311 0.6873011446918514 0.7007441904513919 0.95 100
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec <<< FAILURE!

Increasing the number of samples doesn't help.

Many thanks,
Yves

Implement learner for univariate regression

There should be a simple static method and associated batch learner for doing simple univariate regression where there is a single input and a single output. This is the basic case of f(x) = m * x + b.
An incremental learner should be implemented as well.
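A minimal sketch of the batch case (a hypothetical class, not an existing Foundry learner), using the closed-form least-squares solution:

public class UnivariateRegressionSketch {

    // Returns {m, b} minimizing the squared error of f(x) = m * x + b.
    public static double[] learn(double[] x, double[] y) {
        final int n = x.length;
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        final double m = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        final double b = (sumY - m * sumX) / n;
        return new double[] {m, b};
    }
}

An incremental variant would maintain the four running sums and recompute m and b after each new example.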

TukeyKramerConfidence result is wrong

The result of the TukeyKramerConfidence test is wrong. We confirmed this with Statistica and SAS JMP.

The problem is in the calculation of the test statistic / standard error.
One should not use the totalVariance, but this term instead:

1/(N-K) * SUM( Variance_i * (TreatmentCount_i - 1) )

Also, there is no need to multiply the test statistic by an extra sqrt(2), since it is already in the calculation of the standard error.

Attached are a fixed version and a unit test based on a textbook example. The example has been verified with Statistica and SAS JMP.

/*
 * File:                TukeyKramerConfidence.java
 * Authors:             Kevin R. Dixon
 * Company:             Sandia National Laboratories
 * Project:             Cognitive Foundry
 * 
 * Copyright May 16, 2011, Sandia Corporation.
 * Under the terms of Contract DE-AC04-94AL85000, there is a non-exclusive
 * license for use of this work by or on behalf of the U.S. Government.
 * Export of this program may require a license from the United States
 * Government. See CopyrightHistory.txt for complete details.
 * 
 */


package com.gf.ye.yes.service.plot.statistics;

import gov.sandia.cognition.annotation.PublicationReference;
import gov.sandia.cognition.annotation.PublicationType;
import gov.sandia.cognition.math.UnivariateStatisticsUtil;
import gov.sandia.cognition.math.matrix.Matrix;
import gov.sandia.cognition.math.matrix.MatrixFactory;
import gov.sandia.cognition.statistics.distribution.StudentizedRangeDistribution;
import gov.sandia.cognition.statistics.method.AbstractMultipleHypothesisComparison;
import gov.sandia.cognition.statistics.method.ConfidenceTestAssumptions;
import gov.sandia.cognition.util.ObjectUtil;
import gov.sandia.cognition.util.Pair;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Tukey-Kramer test is the multiple-comparison generalization of the unpaired
 * Student's t-test when conducting multiple comparisons.  The t-test and
 * Tukey's Range test are coincident when a single comparison is made.
 * Tukey's Range test is typically used as the post-hoc analysis technique
 * after detecting a difference using a 1-way ANOVA.  This class implements
 * Kramer's generalization to unequal subjects in different treatments.
 * @author Kevin R. Dixon
 * @since 3.1
 */
@ConfidenceTestAssumptions(
    name="Tukey-Kramer Range test",
    alsoKnownAs={
        "Tukey's Range test",
        "Tukey's Honestly Significant Difference test",
        "Tukey's HSD test"
    },
    description={
        "Tukey's test determines which treatment is statistically different from a multiple comparison.",
        "Tukey's test is a generalization of the paired Student's t-test for multiple comparisons using a population-correction factor."
    },
    assumptions={
        "All data came from same distribution, without considering treatment effects.",
        "The observations have equal variance.",
        "Measurements are independent and equivalent within a treatment.",
        "All observations are independent."
    },
    nullHypothesis="Each treatment has no effect on the mean outcome of the subjects",
    dataPaired=false,
    dataSameSize=false,
    distribution=StudentizedRangeDistribution.class,
    reference={
        @PublicationReference(
            author="Wikipedia",
            title="Tukey's range test",
            type=PublicationType.WebPage,
            year=2011,
            url="http://en.wikipedia.org/wiki/Tukey's_range_test"
        )
    }
)

public class TukeyKramerConfidenceN1 extends AbstractMultipleHypothesisComparison<Collection<? extends Number>, TukeyKramerConfidenceN1.Statistic> {

    private static final long serialVersionUID = 1L;

    /**
     * Creates a new instance of TukeyKramerConfidenceN1
     */
    public TukeyKramerConfidenceN1() {
        super();
    }

    @Override
    public TukeyKramerConfidenceN1 clone() {
        return (TukeyKramerConfidenceN1) super.clone();
    }

    @Override
    public TukeyKramerConfidenceN1.Statistic evaluateNullHypotheses(Collection<? extends Collection<? extends Number>> data, double uncompensatedAlpha) {
        // There are "K" treatments
        final int K = data.size();

        // Each treatment can have a different number of subjects
        List<Integer> subjectCounts = new ArrayList<Integer>(K);
        List<Double> treatmentMeans = new ArrayList<Double>(K);

        double treatmentVariancesSum = 0;
        // This is the total subject count.
        int N = 0;
        for (Collection<? extends Number> treatment : data) {
            final int Ni = treatment.size();
            N += Ni;
            subjectCounts.add(Ni);
            Pair<Double,Double> meanAndVariance = UnivariateStatisticsUtil.computeMeanAndVariance(treatment);
            treatmentMeans.add(meanAndVariance.getFirst());
            treatmentVariancesSum += meanAndVariance.getSecond() * (Ni-1);
        }

        final double meanSquaredResiduals = treatmentVariancesSum / (N - K);

        return new TukeyKramerConfidenceN1.Statistic(uncompensatedAlpha, subjectCounts, treatmentMeans, meanSquaredResiduals);
    }

    /**
     * Statistic from Tukey-Kramer's multiple comparison test
     */
    public static class Statistic extends AbstractMultipleHypothesisComparison.Statistic {

        /**
         * 
         */
        private static final long serialVersionUID = 1L;

        /**
         * Number of subjects in each treatment
         */
        protected List<Integer> subjectCounts;

        /**
         * Mean for each treatment
         */
        protected List<Double> treatmentMeans;



        /**
         * Gets the standard errors in the experiment
         */
        protected Matrix standardErrors;


        /**
         * Creates a new instance of Statistic
         * 
         * @param uncompensatedAlpha
         *            Uncompensated alpha (p-value threshold) for the multiple comparison test
         * @param subjectCounts
         *            Number of subjects in each treatment
         * @param treatmentMeans
         *            Mean for each treatment
         * @param meanSquaredResiduals
         *            Mean squared residuals: the pooled within-treatment variance
         */
        public Statistic(final double uncompensatedAlpha, final List<Integer> subjectCounts, final List<Double> treatmentMeans, final double meanSquaredResiduals) {
            this.treatmentCount = treatmentMeans.size();
            this.uncompensatedAlpha = uncompensatedAlpha;
            this.subjectCounts = subjectCounts;
            this.treatmentMeans = treatmentMeans;
            this.testStatistics = this.computeTestStatistics(subjectCounts, treatmentMeans, meanSquaredResiduals);
            this.nullHypothesisProbabilities = this.computeNullHypothesisProbabilities(subjectCounts, this.testStatistics);
        }

        /**
         * Computes the test statistic for all treatments
         * 
         * @param subjectCounts
         *            Number of subjects in each treatment
         * @param treatmentMeans
         *            Mean for each treatment
         * @param meanSquaredResiduals
         *            Mean squared residuals: the pooled within-treatment variance
         * @return Test statistics, where the (i,j) element compares treatment "i" to treatment "j", the statistic is symmetric
         */
        public Matrix computeTestStatistics(final List<Integer> subjectCounts, final List<Double> treatmentMeans, final double meanSquaredResiduals) {
            int K = treatmentMeans.size();
            Matrix Z = MatrixFactory.getDefault().createMatrix(K, K);
            this.standardErrors = MatrixFactory.getDefault().createMatrix(K, K);

            for (int i = 0; i < K; i++) {
                final double yi = treatmentMeans.get(i);
                final int ni = subjectCounts.get(i);
                for (int j = i + 1; j < K; j++) {
                    final int nj = subjectCounts.get(j);
                    final double yj = treatmentMeans.get(j);
                    double standardError = Math.sqrt(meanSquaredResiduals * 0.5 * ((1.0 / ni) + (1.0 / nj)));
                    final double zij = Math.abs(yi - yj) / standardError;
                    Z.setElement(i, j, zij);
                    Z.setElement(j, i, zij);
                    this.standardErrors.setElement(i, j, standardError);
                    this.standardErrors.setElement(j, i, standardError);
                }
            }
            return Z;
        }

        /**
         * Computes the null-hypothesis probabilities for all pairwise treatment comparisons
         * 
         * @param subjectCounts
         *            Number of subjects in each treatment
         * @param Z
         *            Matrix of test statistics, where the (i,j) element compares treatment "i" to treatment "j"
         * @return Matrix of null-hypothesis probabilities for the pairwise treatment comparisons
         */
        public Matrix computeNullHypothesisProbabilities(final List<Integer> subjectCounts, final Matrix Z) {
            final int K = Z.getNumRows();
            final double N = UnivariateStatisticsUtil.computeSum(subjectCounts);

            Matrix P = MatrixFactory.getDefault().createMatrix(K, K);
            StudentizedRangeDistribution.CDF cdf = new StudentizedRangeDistribution.CDF(K, N - K);
            for (int i = 0; i < K; i++) {
                // A classifier is equal to itself.
                P.setElement(i, i, 1.0);
                for (int j = i + 1; j < K; j++) {
                    // The difference is symmetric
                    double zij = Z.getElement(i, j);
                    double pij = 1.0 - cdf.evaluate(zij); // no extra Math.sqrt(2) factor: it is already in the standard error
                    P.setElement(i, j, pij);
                    P.setElement(j, i, pij);
                }
            }

            return P;

        }

        @Override
        public Statistic clone() {
            Statistic clone = (Statistic) super.clone();
            clone.treatmentMeans = ObjectUtil.cloneSmartElementsAsArrayList(this.getTreatmentMeans());
            clone.subjectCounts = ObjectUtil.cloneSmartElementsAsArrayList(this.getSubjectCounts());
            return clone;
        }

        /**
         * Getter for subjectCounts
         * 
         * @return Number of subjects in each treatment
         */
        public List<Integer> getSubjectCounts() {
            return this.subjectCounts;
        }

        /**
         * Getter for treatmentMeans
         * 
         * @return Mean for each treatment
         */
        public List<Double> getTreatmentMeans() {
            return this.treatmentMeans;
        }

        @Override
        public boolean acceptNullHypothesis(final int i, final int j) {
            return this.getNullHypothesisProbability(i, j) >= this.getUncompensatedAlpha();
        }

        /**
         * Getter for standardErrors
         * 
         * @return Gets the standard errors in the experiment
         */
        public Matrix getStandardErrors() {
            return this.standardErrors;
        }


    }

}



/**
 * 
 */
package com.gf.ye.yes.service.plot;

import static org.junit.Assert.assertEquals;
import gov.sandia.cognition.math.UnivariateStatisticsUtil;

import java.util.List;

import org.junit.Test;

import com.gf.ye.yes.service.plot.statistics.TukeyKramerConfidenceN1;
import com.google.common.collect.ImmutableList;

/**
 * @author fkurth
 *
 */
public class TukeyTestTest {


    /**
     * 
     * From
     * 
     * Rasch, Herrendoerfer, Bock, Victor, Guiard
     * ISBN 3-486-23146-4
     * 
     * (In German)
     * 
     * Verfahrensbibliothek. Band 1.
     * Page 851
     * 
     * Verified with Statistica
     * 
     */
    List<List<Double>> testData = ImmutableList.of(
            (List<Double>)ImmutableList.of( 529d, 508d, 501d, 534d, 510d, 504d ),
            (List<Double>)ImmutableList.of( 505d, 521d, 560d, 516d, 598d, 552d ),
            (List<Double>)ImmutableList.of( 537d, 569d, 499d, 501d, 506d, 600d ),
            (List<Double>)ImmutableList.of( 619d, 632d, 644d, 638d, 623d ),
            (List<Double>)ImmutableList.of( 565d, 596d, 631d, 667d, 613d, 580d )
            );

    final TukeyKramerConfidenceN1 t = new TukeyKramerConfidenceN1();



    /**
     * Test method for {@link com.gf.ye.yes.service.plot.statistics.TukeyKramerConfidenceN1#evaluateNullHypotheses(java.util.Collection, double)}.
     */
    @Test
    public final void testEvaluateNullHypothesesCollectionOfQextendsCollectionOfQextendsNumberDouble() {

        TukeyKramerConfidenceN1.Statistic stat = t.evaluateNullHypotheses(testData);

        Integer treatments = stat.getTreatmentCount();

        assertEquals(Integer.valueOf(5) , treatments );

        List<Double> means =  stat.getTreatmentMeans();

        assertEquals( Double.valueOf( 514.33d ),  means.get(0), 0.005 );
        assertEquals( Double.valueOf( 542.00d ),  means.get(1), 0.005 );
        assertEquals( Double.valueOf( 535.33d ),  means.get(2), 0.005 );
        assertEquals( Double.valueOf( 631.20d ),  means.get(3), 0.005 );
        assertEquals( Double.valueOf( 608.67d ),  means.get(4), 0.005 );

        Integer subjects = (int) UnivariateStatisticsUtil.computeSum(stat.getSubjectCounts() );
        assertEquals( Integer.valueOf(29), subjects);

        Integer degOfFreedom = subjects - treatments; 
        assertEquals( Integer.valueOf(24), degOfFreedom);


        // diagonals
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(0, 0)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(1, 1)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(2, 2)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(3, 3)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(4, 4)  , 0.00005 );

        assertEquals( Double.valueOf( 0.541176d ),  stat.getNullHypothesisProbability(0, 1)  , 0.000001 );
        assertEquals( Double.valueOf( 0.541176d ),  stat.getNullHypothesisProbability(1, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.763884d ),  stat.getNullHypothesisProbability(0, 2)  , 0.000001 );
        assertEquals( Double.valueOf( 0.763884d ),  stat.getNullHypothesisProbability(2, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000145d ),  stat.getNullHypothesisProbability(0, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000145d ),  stat.getNullHypothesisProbability(3, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000300d ),  stat.getNullHypothesisProbability(0, 4)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000300d ),  stat.getNullHypothesisProbability(4, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.995624d ),  stat.getNullHypothesisProbability(1, 2)  , 0.000001 );
        assertEquals( Double.valueOf( 0.995624d ),  stat.getNullHypothesisProbability(2, 1)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000766d ),  stat.getNullHypothesisProbability(1, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000766d ),  stat.getNullHypothesisProbability(3, 1)  , 0.000001 );

        assertEquals( Double.valueOf( 0.008328d ),  stat.getNullHypothesisProbability(1, 4)  , 0.000001 );
        assertEquals( Double.valueOf( 0.008328d ),  stat.getNullHypothesisProbability(4, 1)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000391d ),  stat.getNullHypothesisProbability(2, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000391d ),  stat.getNullHypothesisProbability(3, 2)  , 0.000001 );

        assertEquals( Double.valueOf( 0.748831d ),  stat.getNullHypothesisProbability(4, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.748831d ),  stat.getNullHypothesisProbability(3, 4)  , 0.000001 );

    }

}


Design and implement a general index assignment utility

A common task, both in the Foundry and more generally, is to keep track of a set of values and assign each value a unique index. These indices are typically integers starting from 0, though in some cases they may be longs or other values such as UUIDs.
Having a utility to cover this, either generally or just the specific base case, would be a good addition to the Foundry. In particular, it could help with mapping values onto indices in a Vector, for example when converting an InfiniteVector to a Vector.
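A sketch of the specific base case (a hypothetical class; integer indices assigned from 0 in insertion order):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexerSketch<T> {

    private final Map<T, Integer> indices = new HashMap<T, Integer>();
    private final List<T> values = new ArrayList<T>();

    // Returns the existing index for the value, assigning the next
    // integer index (starting from 0) if the value is new.
    public int getOrAddIndex(T value) {
        Integer index = this.indices.get(value);
        if (index == null) {
            index = this.values.size();
            this.indices.put(value, index);
            this.values.add(value);
        }
        return index;
    }

    // Reverse lookup: the value that was assigned the given index.
    public T getValue(int index) {
        return this.values.get(index);
    }
}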

AbstractVectorThresholdMaximumGainLearner: Sanity check triggered

I was playing around with the parameters for the Random Forest example from #6 and somehow triggered a sanity check in AbstractVectorThresholdMaximumGainLearner that probably should not be triggerable:

java.lang.RuntimeException: bestThreshold (8.30760652058587) lies outside range of values (8.30760652058587, 9.14680325466277]
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold(AbstractVectorThresholdMaximumGainLearner.java:383)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold(AbstractVectorThresholdMaximumGainLearner.java:209)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.learn(AbstractVectorThresholdMaximumGainLearner.java:141)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.learn(AbstractVectorThresholdMaximumGainLearner.java:45)
    at gov.sandia.cognition.learning.algorithm.tree.RandomSubVectorThresholdLearner.learn(RandomSubVectorThresholdLearner.java:212)
    at gov.sandia.cognition.learning.algorithm.tree.RandomSubVectorThresholdLearner.learn(RandomSubVectorThresholdLearner.java:47)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:237)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:37)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractDecisionTreeLearner.learnChildNodes(AbstractDecisionTreeLearner.java:129)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:246)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learn(CategorizationTreeLearner.java:178)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learn(CategorizationTreeLearner.java:37)
    at gov.sandia.cognition.learning.algorithm.ensemble.AbstractBaggingLearner.step(AbstractBaggingLearner.java:195)
    at gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner.learn(AbstractAnytimeBatchLearner.java:147)
    ...

The CategorizationTreeLearner produces leaf nodes with fewer data points than leafCountThreshold

I set the leafCountThreshold to 90k for a very large data set. I then used the trained model to predict on the training dataset and found many leaf nodes whose number of records is far below the threshold.

In the source code of CategorizationTreeLearner:

    boolean isLeaf = this.areAllOutputsEqual(data)
        || data.size() <= this.leafCountThreshold
        || (this.maxDepth > 0 && node.getDepth() >= this.maxDepth);

The second condition makes any node whose data size is at most leafCountThreshold a leaf. However, a node larger than the threshold can still be split into children of arbitrary sizes, so the resulting leaves can contain far fewer data points than the threshold.

Create a factory for random forests

The Foundry has support for random forests, however it requires stitching together several components to create the learner. Since it is a very popular method, we should make it easier to get started with it by adding a factory class.
