
foundry's People

Contributors

dependabot[bot], jbasilico, jdwendt, jeremy-wendt, smcrosb


foundry's Issues

Make compile using Java 8

Fix the build errors/warnings that happen with Java 8. Ideally in a way that it still builds with Java 6 and 7 as well.

Diagonal matrix multiplication can create lots of zeros in result matrix

May affect not just diagonal but other types of sparse matrices.

Example (Scala code):

import gov.sandia.cognition.math.matrix.{MatrixFactory, VectorFactory}
val factory = MatrixFactory.getSparseDefault()
val matrix1 = factory.createMatrix(2, 2)
matrix1.setElement(0,0,1.0)
matrix1.setElement(1,1,10.0)
val matrix2 = factory.createDiagonal(VectorFactory.getDenseDefault.copyArray(Array(1.0, 0.1)))
matrix1.times(matrix2)

That returns:

res2: gov.sandia.cognition.math.matrix.Matrix =
(0,0): 1.0
(0,1): 0.0
(1,0): 0.0
(1,1): 1.0

instead of:

res2: gov.sandia.cognition.math.matrix.Matrix =
(0,0): 1.0
(1,1): 1.0
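For illustration, a minimal sketch of the expected behavior (plain Java over hypothetical CSR-style arrays, not the Foundry's internal representation): only non-zero products get stored in the result.

import java.util.ArrayList;
import java.util.List;

public class SparseTimesDiagonalSketch {

    // Right-multiplies a CSR sparse matrix by a diagonal matrix, keeping
    // only non-zero products so no explicit zeros appear in the result.
    static List<String> timesDiagonal(int numRows, int[] rowStart,
        int[] colIndex, double[] values, double[] diag) {
        List<String> entries = new ArrayList<String>();
        for (int i = 0; i < numRows; i++) {
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                double product = values[k] * diag[colIndex[k]];
                if (product != 0.0) { // skip structural zeros
                    entries.add("(" + i + "," + colIndex[k] + "): " + product);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // matrix1 = diag(1, 10) in CSR form; matrix2 = diag(1.0, 0.1).
        List<String> result = timesDiagonal(2, new int[] {0, 1, 2},
            new int[] {0, 1}, new double[] {1.0, 10.0},
            new double[] {1.0, 0.1});
        for (String entry : result) {
            System.out.println(entry); // prints (0,0): 1.0 and (1,1): 1.0 only
        }
    }
}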

VectorFactory#copyValues(Collection<? extends Number>): The iteration order of the collection is used blindly

It appears that VectorFactory#copyValues(Collection<? extends Number>) expects the provided collection to be ordered, since it uses the iteration order of the collection when creating the vector. As some Java collections have arbitrary iteration order, there is potential for nasty ordering bugs if the developer is not careful to consider the underlying implementation of the collection before using this method.

I suggest documenting the expectation about iteration order in the javadoc for the method. Alternatively, the method could be replaced by variants accepting only ordered collections (such as List and SortedSet), though this might limit compatibility with unknown/future collections.
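To illustrate the hazard (VectorFactory and copyValues as described above; the rest is a minimal example):

import gov.sandia.cognition.math.matrix.VectorFactory;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class CopyValuesOrderExample {
    public static void main(String[] args) {
        List<Double> ordered = Arrays.asList(1.0, 2.0, 3.0);
        // A List has a defined iteration order: always yields (1, 2, 3).
        System.out.println(VectorFactory.getDefault().copyValues(ordered));

        // A HashSet has an arbitrary iteration order: the element order of
        // the resulting vector is unspecified and may vary between runs.
        System.out.println(VectorFactory.getDefault().copyValues(new HashSet<Double>(ordered)));
    }
}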

Implement random projection method

Random projections of the input data can be a useful method for creating non-linear features. It conceptually fits nicely with the rest of the Foundry, so we should add support for it.
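A minimal sketch of the idea in plain Java (a proposed shape, not an existing Foundry API): multiply the input by a fixed random Gaussian matrix to create k random features.

import java.util.Random;

public class RandomProjectionSketch {

    private final double[][] projection; // k x d random matrix, fixed after construction

    public RandomProjectionSketch(int k, int d, Random random) {
        this.projection = new double[k][d];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                // The 1/sqrt(k) scaling approximately preserves norms.
                this.projection[i][j] = random.nextGaussian() / Math.sqrt(k);
            }
        }
    }

    public double[] project(double[] input) {
        double[] output = new double[this.projection.length];
        for (int i = 0; i < output.length; i++) {
            for (int j = 0; j < input.length; j++) {
                output[i] += this.projection[i][j] * input[j];
            }
        }
        return output;
    }
}

Applying a fixed non-linearity (a threshold or cosine, say) to each projected coordinate then yields non-linear features.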

EigenvectorPowerIteration.java

public static Vector estimateEigenvector(
    final Vector initial,
    final Matrix A,
    final double stoppingThreshold,
    final int maxIterations )

This method takes stoppingThreshold and maxIterations to control the numerical method. Any idea what I should pass for these to achieve an implementation similar to the one here, i.e., with the damping parameter for PageRank (default = 0.85)? I have used the default values for now.
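Note that the PageRank damping parameter (0.85) is a property of how the matrix A is constructed, not of these two arguments: stoppingThreshold and maxIterations only control when the power iteration stops. A usage sketch with commonly chosen values (the 1.0e-6 tolerance and 100-iteration cap below are illustrative assumptions, not Foundry defaults):

// stoppingThreshold: stop once the eigenvector estimate changes by less
// than this amount between iterations; maxIterations: a hard upper bound.
Vector eigenvector = EigenvectorPowerIteration.estimateEigenvector(
    initial,   // starting guess, e.g. a normalized vector of all ones
    A,         // for PageRank, A would already incorporate the 0.85 damping
    1.0e-6,    // assumed tolerance; tighten for more precision
    100);      // assumed iteration cap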

Make it easier to do regularization with optimization methods

Our optimization methods for learning are currently designed with a heavy bias towards being used with a supervised cost function. However, there are other types of cost functions that people often use, such as regularized versions, that do not fit well with the current design.
We should adjust the design to accommodate these types of cost functions by making the generics more permissive and less tied to the specifics of the SupervisedCostFunction directly.
See the forum topic http://www.cognitivefoundry.org/?topic=a-couple-of-usage-questions-learning-package for some background information.

Implement a standard normalization learner

A common step in learning is to do feature normalization. One popular method for doing this is to normalize each feature by mapping it to a standard normal (Gaussian) distribution by subtracting the mean and dividing by the standard deviation.
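A minimal sketch of such a learner for a single feature (a hypothetical class, not an existing Foundry API):

public class StandardNormalizerSketch {

    private double mean;
    private double standardDeviation;

    // Learns the mean and (sample) standard deviation of the feature.
    public void learn(double[] data) {
        double sum = 0.0;
        for (double x : data) {
            sum += x;
        }
        this.mean = sum / data.length;

        double sumSquares = 0.0;
        for (double x : data) {
            sumSquares += (x - this.mean) * (x - this.mean);
        }
        this.standardDeviation = Math.sqrt(sumSquares / (data.length - 1));
    }

    // Maps a value onto a standard normal scale: subtract the mean and
    // divide by the standard deviation (guarding constant features).
    public double evaluate(double x) {
        return this.standardDeviation > 0.0
            ? (x - this.mean) / this.standardDeviation
            : 0.0;
    }
}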

Random Forest accuracy is reduced severely with the addition of zero-information features

Disclaimer: This could very well be a bug in my code. Perhaps someone could try to reproduce locally.

I've stumbled upon a weird problem. I'm using RandomForestFactory with the following parameters:

  • ensembleSize: 200
  • baggingFraction: 1.0
  • dimensionsFraction: 0.2
  • maxTreeDepth: Integer.MAX_VALUE
  • minLeafSize: 1

Consider the following trivial dataset: 10 samples where 5 are labelled 'A' and 5 are labelled 'B'. There is just one feature, with the value '1' for 'A' samples and the value '0' for 'B' samples. As expected, I am able to achieve 100% prediction accuracy on this dataset.

However, if I add 100 zero-information features to the dataset, something weird happens. If the samples are given random values of either '0' or '1' for these features, the accuracy falls to ~75%. If the samples are all given a value of just '0' instead, the accuracy falls further down to ~52% (i.e. only slightly better than random guessing).

I compared against Weka's Random Forest implementation with similar parameters and got 100% accuracy in all three cases.

Any ideas?

Implement basic restricted Boltzmann machine

A good implementation of a basic restricted Boltzmann machine (RBM) would be a valuable addition to the Foundry. It could then be used as a feature transformation for further learning.

More permissive generics on ClusterCreators

The generics on the various ClusterCreators could perhaps be slightly more permissive. I'm using the standard Java convention of referring to objects by the most general interface that makes sense in the context. For example:

List<String> list = new ArrayList<>();

Similarly, I intend to do:

ClusterCreator<Cluster<Vector>, Vector> creator = new DefaultClusterCreator<>();

However, this is not allowed because of the generics of DefaultClusterCreator. Instead, I have to do this:

ClusterCreator<? extends Cluster<Vector>, Vector> creator = new DefaultClusterCreator<>();

Could this be fixed? Of course there might be a good reason for this limitation. If so, please feel free to ignore this request :).

Spherical k-means with sparse vectors is slow

The dot product is happening in the wrong order for spherical k-means (cosine distance), which causes a loop over the dense vector. It should be over the sparse one.
One potential fix for this is to change the vector classes to prefer looping over a sparse vector rather than a dense one.
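The intent of the fix, as a sketch (the arrays below are hypothetical, not the Foundry's internal storage): loop over the sparse vector's stored entries and index into the dense vector, which costs O(number of non-zeros) instead of O(dimensionality).

public class SparseDotSketch {

    // Dot product that loops over the sparse operand only.
    static double dot(int[] sparseIndices, double[] sparseValues, double[] dense) {
        double sum = 0.0;
        for (int k = 0; k < sparseIndices.length; k++) {
            sum += sparseValues[k] * dense[sparseIndices[k]];
        }
        return sum;
    }
}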

Add immutable vector and matrix classes

Add the ability to make immutable vectors and matrices to help prevent accidental plusEquals (and the like) on something that is meant to be immutable.
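One possible shape, as a sketch over a raw array (a real version would presumably implement the Vector interface and throw UnsupportedOperationException from plusEquals, setElement, and the other mutators):

public final class ImmutableVectorSketch {

    private final double[] values;

    public ImmutableVectorSketch(double[] values) {
        // Defensive copy: later edits to the source array cannot leak in.
        this.values = values.clone();
    }

    public double getElement(int index) {
        return this.values[index];
    }

    public int getDimensionality() {
        return this.values.length;
    }

    // No mutators are exposed, so accidental in-place modification is
    // impossible by construction.
}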

Implement support for tensors

We have support for vectors and matrices, so tensors could be another good addition. We could have both a general Tensor class that can have a variable number of ways and also maybe a Tensor3 that has 3-way tensor specialization.

Implement an adapter class for common multi-level learning models

A very common use case for learning in the Foundry is having a supervised learner where we apply some transformation to the input and output data to represent it in the appropriate way for the learner. Currently it is up to the developer to do these transformations as part of calling the Foundry. However, because this happens so frequently, for example when transforming input data into vectors, it would be nice if some of it could happen automatically so that the details are abstracted away. Another place this happens is multi-level learning, where one or more unsupervised algorithms are applied before a supervised one. Thus, we should add a utility class that helps with this very common use case.

Improvement in API/documentation clarity with regards to "maximum minimum distance"

I have a small suggestion for improvement of clarity in the API/documentation for the AgglomerativeClusterer class: Rename "maximum minimum distance" to "maximum distance".

For example:

public void setMaxMinDistance(double maxMinDistance)

The maximum minimum distance between clusters that is allowed for the two clusters to be merged. If there are no clusters that remain that have a distance between them less than or equal to this value, then the clustering will halt. To not have this value factored into the clustering, set it to something such as Double.MAX_VALUE.

KMeansClusterer with a CentroidClusterDivergenceFunction crashes when a cluster ends up empty

In KMeansClusterer, the divergences between an element and each of the clusters are measured every iteration. At its core, it happens like this:

 double distance = this.divergenceFunction.evaluate(cluster, element);

When the divergence function is a CentroidClusterDivergenceFunction, evaluate() does this:

return this.divergenceFunction.evaluate(other, cluster.getCentroid());

However, this throws a NullPointerException when cluster is null (and thus can't be dereferenced at the .getCentroid() call).

A cluster is indeed set to null in KMeansClusterer when all its previous elements have been reassigned to different clusters:

if (members.size() > 0)
{
    cluster = this.creator.createCluster(members);
}
else
{
    cluster = null;
}
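A minimal guard, as a sketch (one possible fix, not necessarily the one the Foundry should adopt): treat a null (empty) cluster as infinitely far away, so it is never dereferenced and never selected as the nearest cluster:

double distance = (cluster == null)
    ? Double.POSITIVE_INFINITY // an empty cluster is never the nearest
    : this.divergenceFunction.evaluate(cluster, element);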

Iterate Over Vector Values Only

Is there a simple way to get an iterator over just the Doubles of a vector/matrix object? I know that one can iterate over VectorEntry's, but that doesn't fit easily into the generics/collections context (if all you're interested in are the values).
For example, it would be great to use something like Guava's Iterables methods on vector and matrix objects.
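A small adapter can bridge the gap (a sketch, not an existing Foundry utility; it relies on iterating over VectorEntry objects, as described above):

import gov.sandia.cognition.math.matrix.Vector;
import gov.sandia.cognition.math.matrix.VectorEntry;
import java.util.Iterator;

public class VectorValuesSketch implements Iterable<Double> {

    private final Vector vector;

    public VectorValuesSketch(Vector vector) {
        this.vector = vector;
    }

    @Override
    public Iterator<Double> iterator() {
        final Iterator<VectorEntry> entries = this.vector.iterator();
        return new Iterator<Double>() {
            public boolean hasNext() { return entries.hasNext(); }
            public Double next() { return entries.next().getValue(); }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}

An instance could then be passed anywhere an Iterable<Double> is expected, including Guava's Iterables methods.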

Random Forests are slowed down by AbstractDataDistribution#getEntropy()

I'm testing your new RandomForestFactory and am getting some pretty good results! However, the algorithm is slower than expected.

My code trains hundreds of Random Forest classifiers on a small test dataset. I profiled it, and noticed that ~45% of the time is spent in AbstractDataDistribution#getEntropy(). I suspect that this is not supposed to happen, but if I'm wrong, and this is indeed the natural center of computation, please feel free to close this issue.

I don't know what the underlying performance bottleneck is, but I suspect that the call to MathUtil.log2(double) may be the culprit.

Convert argument checks to use ArgumentChecker

There are lots of argument checks that were put in the code before ArgumentChecker was created. We should convert as many of these as possible and, where appropriate, add additional methods to ArgumentChecker to support these checks.

Here are some possible new checks:

  • Not empty string, array, or collection.
  • Two arguments are the same size array or collection.

We could also add variants of some methods that return the checked value, as sketched below. This is useful when checking an argument before calling a super constructor or when chaining constructors.
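A sketch of what a value-returning variant could look like (a hypothetical addition in the style of ArgumentChecker, not an existing method):

public static <T> T assertNotNull(final String argumentName, final T value) {
    if (value == null) {
        throw new IllegalArgumentException(argumentName + " cannot be null");
    }
    // Returning the checked value is what enables use inside super(...)
    // or this(...) calls, where statements are not allowed beforehand.
    return value;
}

A constructor could then write super(assertNotNull("data", data)); instead of having to perform the check after the super call.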

Upgrade version of MTJ

The foundry is currently using an older version of MTJ. It should be updated to use a newer one.

Failing MonteCarloSamplerTestHarness test

Hello,

On my setup (Java 1.7.0, Ubuntu Linux), the testSample test in MonteCarloSamplerTestHarness is failing - the mean falls outside of the confidence interval.

Running gov.sandia.cognition.statistics.montecarlo.DirectSamplerTest
Constructors
Known Values
clone
sample
Mean: 0.7308781907032909
Monte Carlo: Mean: 0.6873011446918514 Variance: 0.00459004806254931
Interval: 0.673858098932311 0.6873011446918514 0.7007441904513919 0.95 100
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec <<< FAILURE!

Increasing the number of samples doesn't help.

Many thanks,
Yves

Implement learner for univariate regression

There should be a simple static method and associated batch learner for doing simple univariate regression where there is a single input and a single output. This is the basic case of f(x) = m * x + b.
An incremental learner should be implemented as well.
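A minimal sketch of the batch case (a hypothetical class, not an existing Foundry learner), using the closed-form least-squares solution:

public class UnivariateRegressionSketch {

    // Returns {m, b} minimizing the squared error of f(x) = m * x + b.
    public static double[] learn(double[] x, double[] y) {
        final int n = x.length;
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumXX = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        final double m = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        final double b = (sumY - m * sumX) / n;
        return new double[] {m, b};
    }
}

An incremental variant would maintain the four running sums and recompute m and b after each new example.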

TukeyKramerConfidence result is wrong

The result of the TukeyKramerConfidence test is wrong. We confirmed this with Statistica and SAS JMP.

The problem is in the calculation of the test statistic / standard error.
One should not use the totalVariance, but this term instead:

1/(N-K) * SUM( Variance_i * (TreatmentCount_i - 1) )

Also, there is no need to multiply the test statistic by an extra sqrt(2), since it is already in the calculation of the standard error.

Attached are a fixed version and a unit test based on a textbook example. The example has been verified with Statistica and SAS JMP.

/*
 * File:                TukeyKramerConfidence.java
 * Authors:             Kevin R. Dixon
 * Company:             Sandia National Laboratories
 * Project:             Cognitive Foundry
 * 
 * Copyright May 16, 2011, Sandia Corporation.
 * Under the terms of Contract DE-AC04-94AL85000, there is a non-exclusive
 * license for use of this work by or on behalf of the U.S. Government.
 * Export of this program may require a license from the United States
 * Government. See CopyrightHistory.txt for complete details.
 * 
 */


package com.gf.ye.yes.service.plot.statistics;

import gov.sandia.cognition.annotation.PublicationReference;
import gov.sandia.cognition.annotation.PublicationType;
import gov.sandia.cognition.math.UnivariateStatisticsUtil;
import gov.sandia.cognition.math.matrix.Matrix;
import gov.sandia.cognition.math.matrix.MatrixFactory;
import gov.sandia.cognition.statistics.distribution.StudentizedRangeDistribution;
import gov.sandia.cognition.statistics.method.AbstractMultipleHypothesisComparison;
import gov.sandia.cognition.statistics.method.ConfidenceTestAssumptions;
import gov.sandia.cognition.util.ObjectUtil;
import gov.sandia.cognition.util.Pair;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Tukey-Kramer test is the multiple-comparison generalization of the unpaired
 * Student's t-test when conducting multiple comparisons.  The t-test and
 * Tukey's Range test are coincident when a single comparison is made.
 * Tukey's Range test is typically used as the post-hoc analysis technique
 * after detecting a difference using a 1-way ANOVA.  This class implements
 * Kramer's generalization to unequal subjects in different treatments.
 * @author Kevin R. Dixon
 * @since 3.1
 */
@ConfidenceTestAssumptions(
    name="Tukey-Kramer Range test",
    alsoKnownAs={
        "Tukey's Range test",
        "Tukey's Honestly Significant Difference test",
        "Tukey's HSD test"
    },
    description={
        "Tukey's test determines which treatment is statistically different from a multiple comparison.",
        "Tukey's test is a generalization of the paired Student's t-test for multiple comparisons using a population-correction factor."
    },
    assumptions={
        "All data came from same distribution, without considering treatment effects.",
        "The observations have equal variance.",
        "Measurements are independent and equivalent within a treatment.",
        "All observations are independent."
    },
    nullHypothesis="Each treatment has no effect on the mean outcome of the subjects",
    dataPaired=false,
    dataSameSize=false,
    distribution=StudentizedRangeDistribution.class,
    reference={
        @PublicationReference(
            author="Wikipedia",
            title="Tukey's range test",
            type=PublicationType.WebPage,
            year=2011,
            url="http://en.wikipedia.org/wiki/Tukey's_range_test"
        )
    }
)

public class TukeyKramerConfidenceN1 extends AbstractMultipleHypothesisComparison<Collection<? extends Number>, TukeyKramerConfidenceN1.Statistic> {

    private static final long serialVersionUID = 1L;

    /**
     * Creates a new instance of TukeyKramerConfidenceN1
     */
    public TukeyKramerConfidenceN1() {
        super();
    }

    @Override
    public TukeyKramerConfidenceN1 clone() {
        return (TukeyKramerConfidenceN1) super.clone();
    }

    @Override
    public TukeyKramerConfidenceN1.Statistic evaluateNullHypotheses(Collection<? extends Collection<? extends Number>> data, double uncompensatedAlpha) {
        // There are "K" treatments
        final int K = data.size();

        // Each treatment can have a different number of subjects
        List<Integer> subjectCounts = new ArrayList<Integer>(K);
        List<Double> treatmentMeans = new ArrayList<Double>(K);

        double treatmentVariancesSum = 0;
        // This is the total subject count.
        int N = 0;
        for (Collection<? extends Number> treatment : data) {
            final int Ni = treatment.size();
            N += Ni;
            subjectCounts.add(Ni);
            Pair<Double,Double> meanAndVariance = UnivariateStatisticsUtil.computeMeanAndVariance(treatment);
            treatmentMeans.add(meanAndVariance.getFirst());
            treatmentVariancesSum += meanAndVariance.getSecond() * (Ni-1);
        }

        final double meanSquaredResiduals = treatmentVariancesSum / (N - K);

        return new TukeyKramerConfidenceN1.Statistic(uncompensatedAlpha, subjectCounts, treatmentMeans, meanSquaredResiduals);
    }

    /**
     * Statistic from Tukey-Kramer's multiple comparison test
     */
    public static class Statistic extends AbstractMultipleHypothesisComparison.Statistic {

        /**
         * 
         */
        private static final long serialVersionUID = 1L;

        /**
         * Number of subjects in each treatment
         */
        protected List<Integer> subjectCounts;

        /**
         * Mean for each treatment
         */
        protected List<Double> treatmentMeans;



        /**
         * Gets the standard errors in the experiment
         */
        protected Matrix standardErrors;


        /**
         * Creates a new instance of Statistic
         * 
         * @param uncompensatedAlpha
         *            Uncompensated alpha (p-value threshold) for the multiple comparison test
         * @param subjectCounts
         *            Number of subjects in each treatment
         * @param treatmentMeans
         *            Mean for each treatment
         * @param meanSquaredResiduals
         *            Mean squared residuals: the pooled within-treatment variance
         */
        public Statistic(final double uncompensatedAlpha, final List<Integer> subjectCounts, final List<Double> treatmentMeans, final double meanSquaredResiduals) {
            this.treatmentCount = treatmentMeans.size();
            this.uncompensatedAlpha = uncompensatedAlpha;
            this.subjectCounts = subjectCounts;
            this.treatmentMeans = treatmentMeans;
            this.testStatistics = this.computeTestStatistics(subjectCounts, treatmentMeans, meanSquaredResiduals);
            this.nullHypothesisProbabilities = this.computeNullHypothesisProbabilities(subjectCounts, this.testStatistics);
        }

        /**
         * Computes the test statistic for all treatments
         * 
         * @param subjectCounts
         *            Number of subjects in each treatment
         * @param treatmentMeans
         *            Mean for each treatment
         * @param meanSquaredResiduals
         *            Mean squared residuals: the pooled within-treatment variance
         * @return Test statistics, where the (i,j) element compares treatment "i" to treatment "j", the statistic is symmetric
         */
        public Matrix computeTestStatistics(final List<Integer> subjectCounts, final List<Double> treatmentMeans, final double meanSquaredResiduals) {
            int K = treatmentMeans.size();
            Matrix Z = MatrixFactory.getDefault().createMatrix(K, K);
            this.standardErrors = MatrixFactory.getDefault().createMatrix(K, K);

            for (int i = 0; i < K; i++) {
                final double yi = treatmentMeans.get(i);
                final int ni = subjectCounts.get(i);
                for (int j = i + 1; j < K; j++) {
                    final int nj = subjectCounts.get(j);
                    final double yj = treatmentMeans.get(j);
                    double standardError = Math.sqrt(meanSquaredResiduals * 0.5 * ((1.0 / ni) + (1.0 / nj)));
                    final double zij = Math.abs(yi - yj) / standardError;
                    Z.setElement(i, j, zij);
                    Z.setElement(j, i, zij);
                    this.standardErrors.setElement(i, j, standardError);
                    this.standardErrors.setElement(j, i, standardError);
                }
            }
            return Z;
        }

        /**
         * Computes the null-hypothesis probabilities for all pairwise treatment comparisons
         * 
         * @param subjectCounts
         *            Number of subjects in each treatment
         * @param Z
         *            Matrix of test statistics, where the (i,j) element compares treatment "i" to treatment "j"
         * @return Matrix of null-hypothesis probabilities for the pairwise treatment comparisons
         */
        public Matrix computeNullHypothesisProbabilities(final List<Integer> subjectCounts, final Matrix Z) {
            final int K = Z.getNumRows();
            final double N = UnivariateStatisticsUtil.computeSum(subjectCounts);

            Matrix P = MatrixFactory.getDefault().createMatrix(K, K);
            StudentizedRangeDistribution.CDF cdf = new StudentizedRangeDistribution.CDF(K, N - K);
            for (int i = 0; i < K; i++) {
                // A classifier is equal to itself.
                P.setElement(i, i, 1.0);
                for (int j = i + 1; j < K; j++) {
                    // The difference is symmetric
                    double zij = Z.getElement(i, j);
                    double pij = 1.0 - cdf.evaluate(zij); // no extra Math.sqrt(2) factor: it is already in the standard error
                    P.setElement(i, j, pij);
                    P.setElement(j, i, pij);
                }
            }

            return P;

        }

        @Override
        public Statistic clone() {
            Statistic clone = (Statistic) super.clone();
            clone.treatmentMeans = ObjectUtil.cloneSmartElementsAsArrayList(this.getTreatmentMeans());
            clone.subjectCounts = ObjectUtil.cloneSmartElementsAsArrayList(this.getSubjectCounts());
            return clone;
        }

        /**
         * Getter for subjectCounts
         * 
         * @return Number of subjects in each treatment
         */
        public List<Integer> getSubjectCounts() {
            return this.subjectCounts;
        }

        /**
         * Getter for treatmentMeans
         * 
         * @return Mean for each treatment
         */
        public List<Double> getTreatmentMeans() {
            return this.treatmentMeans;
        }

        @Override
        public boolean acceptNullHypothesis(final int i, final int j) {
            return this.getNullHypothesisProbability(i, j) >= this.getUncompensatedAlpha();
        }

        /**
         * Getter for standardErrors
         * 
         * @return Gets the standard errors in the experiment
         */
        public Matrix getStandardErrors() {
            return this.standardErrors;
        }


    }

}



/**
 * 
 */
package com.gf.ye.yes.service.plot;

import static org.junit.Assert.assertEquals;
import gov.sandia.cognition.math.UnivariateStatisticsUtil;

import java.util.List;

import org.junit.Test;

import com.gf.ye.yes.service.plot.statistics.TukeyKramerConfidenceN1;
import com.google.common.collect.ImmutableList;

/**
 * @author fkurth
 *
 */
public class TukeyTestTest {


    /**
     * 
     * From
     * 
     * Rasch, Herrendoerfer, Bock, Victor, Guiard
     * ISBN 3-486-23146-4
     * 
     * (In German)
     * 
     * Verfahrensbibliothek. Band 1.
     * Page 851
     * 
     * Verified with Statistica
     * 
     */
    List<List<Double>> testData = ImmutableList.of(
            (List<Double>)ImmutableList.of( 529d, 508d, 501d, 534d, 510d, 504d ),
            (List<Double>)ImmutableList.of( 505d, 521d, 560d, 516d, 598d, 552d ),
            (List<Double>)ImmutableList.of( 537d, 569d, 499d, 501d, 506d, 600d ),
            (List<Double>)ImmutableList.of( 619d, 632d, 644d, 638d, 623d ),
            (List<Double>)ImmutableList.of( 565d, 596d, 631d, 667d, 613d, 580d )
            );

    final TukeyKramerConfidenceN1 t = new TukeyKramerConfidenceN1();



    /**
     * Test method for {@link com.gf.ye.yes.service.plot.statistics.TukeyKramerConfidenceN1#evaluateNullHypotheses(java.util.Collection, double)}.
     */
    @Test
    public final void testEvaluateNullHypothesesCollectionOfQextendsCollectionOfQextendsNumberDouble() {

        TukeyKramerConfidenceN1.Statistic stat = t.evaluateNullHypotheses(testData);

        Integer treatments = stat.getTreatmentCount();

        assertEquals(Integer.valueOf(5) , treatments );

        List<Double> means =  stat.getTreatmentMeans();

        assertEquals( Double.valueOf( 514.33d ),  means.get(0), 0.005 );
        assertEquals( Double.valueOf( 542.00d ),  means.get(1), 0.005 );
        assertEquals( Double.valueOf( 535.33d ),  means.get(2), 0.005 );
        assertEquals( Double.valueOf( 631.20d ),  means.get(3), 0.005 );
        assertEquals( Double.valueOf( 608.67d ),  means.get(4), 0.005 );

        Integer subjects = (int) UnivariateStatisticsUtil.computeSum(stat.getSubjectCounts() );
        assertEquals( Integer.valueOf(29), subjects);

        Integer degOfFreedom = subjects - treatments; 
        assertEquals( Integer.valueOf(24), degOfFreedom);


        // diagonals
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(0, 0)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(1, 1)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(2, 2)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(3, 3)  , 0.00005 );
        assertEquals( Double.valueOf( 1d ),  stat.getNullHypothesisProbability(4, 4)  , 0.00005 );

        assertEquals( Double.valueOf( 0.541176d ),  stat.getNullHypothesisProbability(0, 1)  , 0.000001 );
        assertEquals( Double.valueOf( 0.541176d ),  stat.getNullHypothesisProbability(1, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.763884d ),  stat.getNullHypothesisProbability(0, 2)  , 0.000001 );
        assertEquals( Double.valueOf( 0.763884d ),  stat.getNullHypothesisProbability(2, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000145d ),  stat.getNullHypothesisProbability(0, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000145d ),  stat.getNullHypothesisProbability(3, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000300d ),  stat.getNullHypothesisProbability(0, 4)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000300d ),  stat.getNullHypothesisProbability(4, 0)  , 0.000001 );

        assertEquals( Double.valueOf( 0.995624d ),  stat.getNullHypothesisProbability(1, 2)  , 0.000001 );
        assertEquals( Double.valueOf( 0.995624d ),  stat.getNullHypothesisProbability(2, 1)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000766d ),  stat.getNullHypothesisProbability(1, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000766d ),  stat.getNullHypothesisProbability(3, 1)  , 0.000001 );

        assertEquals( Double.valueOf( 0.008328d ),  stat.getNullHypothesisProbability(1, 4)  , 0.000001 );
        assertEquals( Double.valueOf( 0.008328d ),  stat.getNullHypothesisProbability(4, 1)  , 0.000001 );

        assertEquals( Double.valueOf( 0.000391d ),  stat.getNullHypothesisProbability(2, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.000391d ),  stat.getNullHypothesisProbability(3, 2)  , 0.000001 );

        assertEquals( Double.valueOf( 0.748831d ),  stat.getNullHypothesisProbability(4, 3)  , 0.000001 );
        assertEquals( Double.valueOf( 0.748831d ),  stat.getNullHypothesisProbability(3, 4)  , 0.000001 );

    }

}


Design and implement a general index assignment utility

A common task, both in the Foundry and more generally, is to keep track of a set of values and assign each value a unique index. These indices are typically integers starting from 0, though in some cases they may be longs or other values such as UUIDs.
Having a utility to cover this, either generally or just the specific base case, would be a good addition to the Foundry. In particular, it could help with mapping values onto indices in a Vector, for example when converting an InfiniteVector to a Vector.
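A sketch of the specific base case (a hypothetical class; integer indices assigned from 0 in insertion order):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexerSketch<T> {

    private final Map<T, Integer> indices = new HashMap<T, Integer>();
    private final List<T> values = new ArrayList<T>();

    // Returns the existing index for the value, assigning the next
    // integer index (starting from 0) if the value is new.
    public int getOrAddIndex(T value) {
        Integer index = this.indices.get(value);
        if (index == null) {
            index = this.values.size();
            this.indices.put(value, index);
            this.values.add(value);
        }
        return index;
    }

    // Reverse lookup: the value that was assigned the given index.
    public T getValue(int index) {
        return this.values.get(index);
    }
}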

AbstractVectorThresholdMaximumGainLearner: Sanity check triggered

I was playing around with the parameters for the Random Forest example from #6 and somehow triggered a sanity check in AbstractVectorThresholdMaximumGainLearner that probably should not be triggerable:

java.lang.RuntimeException: bestThreshold (8.30760652058587) lies outside range of values (8.30760652058587, 9.14680325466277]
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold(AbstractVectorThresholdMaximumGainLearner.java:383)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold(AbstractVectorThresholdMaximumGainLearner.java:209)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.learn(AbstractVectorThresholdMaximumGainLearner.java:141)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.learn(AbstractVectorThresholdMaximumGainLearner.java:45)
    at gov.sandia.cognition.learning.algorithm.tree.RandomSubVectorThresholdLearner.learn(RandomSubVectorThresholdLearner.java:212)
    at gov.sandia.cognition.learning.algorithm.tree.RandomSubVectorThresholdLearner.learn(RandomSubVectorThresholdLearner.java:47)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:237)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:37)
    at gov.sandia.cognition.learning.algorithm.tree.AbstractDecisionTreeLearner.learnChildNodes(AbstractDecisionTreeLearner.java:129)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:246)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learn(CategorizationTreeLearner.java:178)
    at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learn(CategorizationTreeLearner.java:37)
    at gov.sandia.cognition.learning.algorithm.ensemble.AbstractBaggingLearner.step(AbstractBaggingLearner.java:195)
    at gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner.learn(AbstractAnytimeBatchLearner.java:147)
    ...

The CategorizationTreeLearner produces leaf nodes with fewer data points than leafCountThreshold

I set the leafCountThreshold to 90k for a very large data set. I then used the trained model to predict on the training dataset and found many leaf nodes whose number of records is far below the threshold.

In the source code of CategorizationTreeLearner:

    boolean isLeaf = this.areAllOutputsEqual(data)
        || data.size() <= this.leafCountThreshold
        || (this.maxDepth > 0 && node.getDepth() >= this.maxDepth);

The second condition makes any node whose data size is at most leafCountThreshold a leaf. However, a node larger than the threshold can still be split into children of arbitrary sizes, so the resulting leaves can contain far fewer data points than the threshold.

Create a factory for random forests

The Foundry has support for random forests, however it requires stitching together several components to create the learner. Since it is a very popular method, we should make it easier to get started with it by adding a factory class.
