algorithmfoundry / foundry
The Cognitive Foundry is an open-source Java library for building intelligent systems using machine learning
License: Other
The current way is mostly based on doing normal iteration via VectorEntry. Another way could be to have a callback for active sparse elements that give the index and the value.
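The callback idea can be sketched as follows. This is only an illustration with stand-in types: `IndexValueConsumer`, `SparseVectorSketch`, and `forEachNonZero` are hypothetical names, not part of the Foundry API. The sketch uses a lambda for brevity; an anonymous class would do the same job on older Java versions.

```java
/** Hypothetical callback type; not part of the Foundry API. */
interface IndexValueConsumer {
    void consume(int index, double value);
}

/** Minimal stand-in sparse vector that stores indices and values in parallel arrays. */
final class SparseVectorSketch {
    private final int[] indices;
    private final double[] values;

    SparseVectorSketch(int[] indices, double[] values) {
        this.indices = indices.clone();
        this.values = values.clone();
    }

    /** Visits only the stored entries, without allocating an entry object per element. */
    void forEachNonZero(IndexValueConsumer consumer) {
        for (int k = 0; k < indices.length; k++) {
            consumer.consume(indices[k], values[k]);
        }
    }

    /** Example use of the callback: sum of the stored values. */
    static double sum(SparseVectorSketch vector) {
        final double[] total = {0.0};
        vector.forEachNonZero((index, value) -> total[0] += value);
        return total[0];
    }
}
```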
Boosted decision trees should fit well into the Foundry learning package. It should also support the stochastic variant.
Fix the build errors/warnings that happen with Java 8. Ideally in a way that it still builds with Java 6 and 7 as well.
Example (Scala code)
May affect not just diagonal but other types of sparse matrices.
import gov.sandia.cognition.math.matrix.{MatrixFactory, VectorFactory}
val factory = MatrixFactory.getSparseDefault()
val matrix1 = factory.createMatrix(2, 2)
matrix1.setElement(0,0,1.0)
matrix1.setElement(1,1,10.0)
val matrix2 = factory.createDiagonal(VectorFactory.getDenseDefault.copyArray(Array(1.0, 0.1)))
matrix1.times(matrix2)
(That returns:
res2: gov.sandia.cognition.math.matrix.Matrix =
(0,0): 1.0
(0,1): 0.0
(1,0): 0.0
(1,1): 1.0
instead of
res2: gov.sandia.cognition.math.matrix.Matrix =
(0,0): 1.0
(1,1): 1.0
)
It appears that VectorFactory#copyValues(Collection<? extends Number>) expects the provided collection to be ordered, since it uses the iteration order of the collection when creating the vector. As some Java collections have arbitrary iteration order, there is potential for nasty ordering bugs if the developer is not careful to consider the underlying implementation of the collection before using this method.
I suggest that the expectation about iteration order is documented in the javadoc for the method. Alternatively, the method could be replaced by variants only accepting ordered collections (such as List and SortedSet, but this might limit the compatibility with unknown/future collections).
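To illustrate the concern, here is a small sketch with a stand-in method, `copyInIterationOrder`, which is hypothetical and only mimics the behavior described above. The same three values end up at different indices depending on the collection type.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

final class CopyValuesOrderSketch {
    /** Copies the values in the collection's iteration order (stand-in for the described behavior). */
    static double[] copyInIterationOrder(Collection<? extends Number> values) {
        double[] result = new double[values.size()];
        int i = 0;
        for (Number n : values) {
            result[i++] = n.doubleValue();
        }
        return result;
    }

    public static void main(String[] args) {
        List<Double> list = Arrays.asList(3.0, 1.0, 2.0);
        // A List iterates in insertion order.
        System.out.println(Arrays.toString(copyInIterationOrder(list)));
        // A TreeSet iterates in sorted order, so the values land at different indices.
        System.out.println(Arrays.toString(copyInIterationOrder(new TreeSet<>(list))));
    }
}
```

A HashSet gives no ordering guarantee at all, which is the case most likely to cause the ordering bugs mentioned above.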
Random projections of the input data can be a useful method for creating non-linear features. It conceptually fits nicely with the rest of the Foundry, so we should add support for it.
Add an interface for distribution of classes from a categorizer and implement it in the relevant classes.
After a value has been set or incremented, setting it back to zero (or to less than zero) doesn't update the total.
public static Vector estimateEigenvector(
final Vector initial,
final Matrix A,
final double stoppingThreshold,
final int maxIterations ) {
}
This method takes stoppingThreshold and maxIterations, as is usual for numerical methods. Any idea what I should pass for these to achieve an implementation similar to the one here (i.e., damping parameter for PageRank, default=0.85)?
I have used the default values for now.
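The signature above has no damping parameter, so one possible reading (an assumption, not confirmed by the Foundry code) is that damping has to be folded into the matrix before the power iteration, while stoppingThreshold and maxIterations only control convergence. A plain-Java sketch of that idea, independent of the Foundry classes and using a hypothetical `pageRank` helper:

```java
import java.util.Arrays;

final class PageRankSketch {
    /**
     * Power iteration on the damped matrix G = d * M + (1 - d) / n * ones.
     * M must be column-stochastic: m[i][j] is the probability of moving from node j to node i.
     * Iteration stops when the L1 change between iterates drops below stoppingThreshold
     * (whether the Foundry uses the same criterion is an assumption).
     */
    static double[] pageRank(double[][] m, double damping,
                             double stoppingThreshold, int maxIterations) {
        int n = m.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[] next = new double[n];
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += m[i][j] * rank[j];
                }
                // rank sums to 1, so the uniform "teleport" term is (1 - d) / n.
                next[i] = damping * sum + (1.0 - damping) / n;
                change += Math.abs(next[i] - rank[i]);
            }
            rank = next;
            if (change < stoppingThreshold) {
                break;
            }
        }
        return rank;
    }
}
```

With damping = 0.85, a small stoppingThreshold such as 1e-8 and a maxIterations of around 100 are common starting points, but those values are suggestions rather than Foundry defaults.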
Our optimization methods for learning are currently designed with a very heavy bias towards being used with a supervised cost function. However, there are other types of cost functions that people often use, such as regularized versions, that do not fit well with the current design.
We should adjust the design to accommodate these types of cost functions by making the generics more permissive and less tied to the specifics of the SupervisedCostFunction directly.
See the forum topic http://www.cognitivefoundry.org/?topic=a-couple-of-usage-questions-learning-package for some background information.
Add a utility that makes sure a gradient computation is working.
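One common way to do this is to compare the analytic gradient against a central finite-difference estimate. The sketch below assumes plain double arrays rather than Foundry vectors, and `maxGradientError` is a hypothetical name.

```java
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

final class GradientCheckSketch {
    /**
     * Returns the largest absolute difference between the analytic gradient
     * and a central finite-difference estimate of it at the point x.
     */
    static double maxGradientError(ToDoubleFunction<double[]> function,
                                   Function<double[], double[]> gradient,
                                   double[] x, double epsilon) {
        double[] analytic = gradient.apply(x);
        double maxError = 0.0;
        for (int i = 0; i < x.length; i++) {
            double[] plus = x.clone();
            double[] minus = x.clone();
            plus[i] += epsilon;
            minus[i] -= epsilon;
            double numeric = (function.applyAsDouble(plus) - function.applyAsDouble(minus))
                / (2.0 * epsilon);
            maxError = Math.max(maxError, Math.abs(numeric - analytic[i]));
        }
        return maxError;
    }
}
```

A small return value suggests the gradient code agrees with the function; a large one points at the coordinate-wise derivative being wrong.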
A common step in learning is to do feature normalization. One popular method for doing this is to normalize each feature by mapping it to a standard normal (Gaussian) distribution by subtracting the mean and dividing by the standard deviation.
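A minimal sketch of that normalization on a plain array of examples (rows) by features (columns). The class and method names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption.

```java
final class ZScoreSketch {
    /** Standardizes each column to zero mean and unit sample standard deviation. */
    static double[][] standardize(double[][] data) {
        int n = data.length;
        int d = data[0].length;
        double[][] out = new double[n][d];
        for (int j = 0; j < d; j++) {
            double mean = 0.0;
            for (double[] row : data) {
                mean += row[j];
            }
            mean /= n;
            double sumOfSquares = 0.0;
            for (double[] row : data) {
                sumOfSquares += (row[j] - mean) * (row[j] - mean);
            }
            double std = Math.sqrt(sumOfSquares / (n - 1));
            for (int i = 0; i < n; i++) {
                // Constant features have no spread; map them to 0 instead of dividing by zero.
                out[i][j] = std > 0.0 ? (data[i][j] - mean) / std : 0.0;
            }
        }
        return out;
    }
}
```

In a learner, the mean and standard deviation would be estimated on the training data and then reused unchanged on new inputs.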
Disclaimer: This could very well be a bug in my code. Perhaps someone could try to reproduce locally.
I've stumbled upon a weird problem. I'm using RandomForestFactory with the following parameters:
Consider the following trivial dataset: 10 samples where 5 are labelled 'A' and 5 are labelled 'B'. There is just one feature, with the value '1' for 'A' samples and the value '0' for 'B' samples. As expected, I am able to achieve a 100% prediction accuracy on this dataset.
However, if I add 100 zero-information features to the dataset, something weird happens. If the samples are given random values of either '0' or '1' for these features, the accuracy falls to ~75%. If the samples are all given a value of just '0' instead, the accuracy falls further down to ~52% (i.e. only slightly better than random guessing).
I compared with Weka's Random Forest implementation with similar parameters, and get 100% accuracy in all 3 cases.
Any ideas?
We had decided a while back to rename the Cognitive part to Algorithm (hence the organization name, Twitter handle, and new domain). This needs to be carried out at some point.
It would be nice to be able to apply optimization methods to learn logistic regression type functions.
For some background, see the forum post: http://www.cognitivefoundry.org/?topic=a-couple-of-usage-questions-learning-package
Using the VectorEntry seems to be doing a log(n) iteration on lookup. See if this can be improved.
The performance of sampling from a Dirichlet is slow when there are large alpha values.
A good implementation of a basic restricted Boltzmann machine (RBM) would be a good addition to the Foundry. It could then be used as a feature transformation for further learning.
The generics on the various ClusterCreators could perhaps be slightly more permissive. I'm using the standard Java convention of referencing stuff by the highest superinterface that makes sense in the context. For example:
List<String> list = new ArrayList<>();
Similarly, I intend to do:
ClusterCreator<Cluster<Vector>, Vector> creator = new DefaultClusterCreator<>();
However, this is not allowed because of the generics of DefaultClusterCreator. Instead, I have to do this:
ClusterCreator<? extends Cluster<Vector>, Vector> creator = new DefaultClusterCreator<>();
Could this be fixed? Of course there might be a good reason for this limitation. If so, please feel free to ignore this request :).
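The limitation is most likely a consequence of Java generics being invariant. The sketch below reproduces it with simplified stand-in types: the names mirror the ones in this issue, but the declarations are assumptions and not the actual Foundry source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Simplified stand-ins; the real Foundry declarations may differ.
interface Cluster<D> {
    Collection<D> getMembers();
}

class DefaultCluster<D> implements Cluster<D> {
    private final List<D> members;

    DefaultCluster(Collection<? extends D> members) {
        this.members = new ArrayList<>(members);
    }

    public Collection<D> getMembers() {
        return members;
    }
}

interface ClusterCreator<C extends Cluster<D>, D> {
    C createCluster(Collection<? extends D> members);
}

// Assumed shape: the creator fixes its cluster type to DefaultCluster<D>.
class DefaultClusterCreator<D> implements ClusterCreator<DefaultCluster<D>, D> {
    public DefaultCluster<D> createCluster(Collection<? extends D> members) {
        return new DefaultCluster<>(members);
    }
}

final class GenericsSketch {
    static int demo() {
        // Does not compile: ClusterCreator<DefaultCluster<String>, String> is not a
        // ClusterCreator<Cluster<String>, String>, because generic types are invariant.
        // ClusterCreator<Cluster<String>, String> bad = new DefaultClusterCreator<>();

        // Compiles: the wildcard accepts any cluster subtype.
        ClusterCreator<? extends Cluster<String>, String> creator =
            new DefaultClusterCreator<String>();
        Cluster<String> cluster = creator.createCluster(Arrays.asList("a", "b"));
        return cluster.getMembers().size();
    }
}
```

If the real class is declared along these lines, the wildcard form is the expected way to reference it through the interface.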
The dot product is happening in the wrong order for spherical k-means (cosine distance), which causes a loop over the dense vector. It should be over the sparse one.
One potential fix for this is to change the vector classes to prefer looping over a sparse vector rather than a dense one.
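The intended loop order can be sketched with plain arrays. The `dot` helper and the parallel-array sparse layout are illustrative assumptions and not the Foundry classes.

```java
final class SparseDotSketch {
    /**
     * Dot product of a sparse vector (parallel index and value arrays) with a dense vector.
     * The loop visits only the stored sparse entries, so the cost is proportional to the
     * number of non-zeros rather than to the full dimensionality.
     */
    static double dot(int[] sparseIndices, double[] sparseValues, double[] dense) {
        double sum = 0.0;
        for (int k = 0; k < sparseIndices.length; k++) {
            sum += sparseValues[k] * dense[sparseIndices[k]];
        }
        return sum;
    }
}
```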
The divergences evaluator should conform to the vector encoder interface.
Some of the decision tree code doesn't seem to have proper clone methods.
Add the ability to make immutable vectors and matrices to help protect against accidental plusEquals (and the like) on something that is meant to be immutable.
We have support for vectors and matrices, so tensors could be another good addition. We could have both a general Tensor class that can have a variable number of ways and also maybe a Tensor3 that has 3-way tensor specialization.
A very common use case for learning in the Foundry is having a supervised learner where we apply some transformation to the input and output data to represent it in the appropriate way for the learner. Currently it is up to the developer to do these transformations as part of calling the Foundry. However, because this happens so frequently, for example when transforming input data into vectors, it would be nice if some of this could happen automatically so that the details can be abstracted away. Another place this happens is multi-level learning where one (or more) unsupervised algorithms are used before applying a supervised one. Thus, we should add a utility class that helps with this very common use case.
I have a small suggestion for improving the clarity of the API/documentation for the AgglomerativeClusterer class: rename "maximum minimum distance" to "maximum distance".
For example:
public void setMaxMinDistance(double maxMinDistance)
The maximum minimum distance between clusters that is allowed for the two clusters to be merged. If there are no clusters that remain that have a distance between them less than or equal to this value, then the clustering will halt. To not have this value factored into the clustering, set it to something such as Double.MAX_VALUE.
In KMeansClusterer, the divergences between an element and each of the clusters are measured every iteration. At its core, it happens like this:
double distance = this.divergenceFunction.evaluate(cluster, element);
When the divergence function is a CentroidClusterDivergenceFunction, evaluate() does this:
return this.divergenceFunction.evaluate(other, cluster.getCentroid());
However, this throws a NullPointerException when cluster is null (and thus can't be dereferenced at the .getCentroid() call).
A cluster is indeed set to null in KMeansClusterer when all its previous elements have been reassigned to different clusters:
if (members.size() > 0)
{
cluster = this.creator.createCluster(members);
}
else
{
cluster = null;
}
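One possible guard, sketched with plain arrays instead of the Foundry cluster types, is to skip null (empty) clusters when measuring distances. The names here are hypothetical, and the actual fix in KMeansClusterer may look different.

```java
import java.util.List;

final class NullClusterGuardSketch {
    /** Returns the index of the closest non-null centroid, or -1 if every cluster is empty. */
    static int closestCluster(List<double[]> centroids, double[] element) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double[] centroid = centroids.get(i);
            if (centroid == null) {
                // Empty cluster: there is nothing to dereference, so skip it.
                continue;
            }
            double distance = squaredDistance(centroid, element);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }
}
```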
Is there a simple way to get an iterator over just the Doubles of a vector/matrix object? I know that one can iterate over VectorEntry's, but that doesn't fit easily into the generics/collections context (if all you're interested in are the values).
For example, it would be great to use something like Guava's Iterables methods on vector and matrix objects.
It would be very helpful to have a sparse singular value decomposition (SVD) for several of the algorithms that we have.
It may be possible to do this using the ARPACK wrapper that is from Netlib-java.
It would be nice to have the ability to run statistical equivalence or noninferiority tests.
I'm testing your new RandomForestFactory and am getting some pretty good results! However, the algorithm is slower than expected.
My code trains hundreds of Random Forest classifiers on a small test dataset. I profiled it, and noticed that ~45% of the time is spent in AbstractDataDistribution#getEntropy(). I suspect that this is not supposed to happen, but if I'm wrong, and this is indeed the natural center of computation, please feel free to close this issue.
I don't know what the underlying performance bottleneck is, but I suspect that the call to MathUtil.log2(double) may be the one.
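Whether log2 is really the bottleneck is not confirmed here. For reference, a sketch of an entropy computation that works on raw counts, uses natural logarithms inside the loop, and converts to base 2 once at the end. It relies on the identity H = log(T) - (1/T) * SUM(c_i * log(c_i)), where T is the total count. The class name is hypothetical and this is not the Foundry implementation.

```java
final class EntropySketch {
    private static final double INVERSE_LOG_2 = 1.0 / Math.log(2.0);

    /** Entropy in bits of the distribution given by non-negative counts. */
    static double entropy(double[] counts) {
        double total = 0.0;
        double sumCountLogCount = 0.0;
        for (double count : counts) {
            if (count > 0.0) {
                total += count;
                sumCountLogCount += count * Math.log(count);
            }
        }
        if (total <= 0.0) {
            return 0.0;
        }
        // H = log(T) - (1/T) * sum(c * log(c)), converted from nats to bits once.
        return (Math.log(total) - sumCountLogCount / total) * INVERSE_LOG_2;
    }
}
```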
There are lots of argument checks that were put in the code before ArgumentChecker was created. We should convert as many of these as possible and where appropriate add additional methods to ArgumentChecker to support these checks.
Here are some possible new checks:
Could add variants of some methods that would return the value. This can be useful in the case of checking before calling super constructor or chaining constructors.
There is UniformDistribution over doubles, but we could use another one over integers.
The numeric map offers methods to get the min and max, but it would be useful to get the top and bottom n keys.
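A sketch of how the top n keys could be selected with a bounded min-heap, using a plain java.util.Map as a stand-in for the numeric map. `topKeys` is a hypothetical name; the bottom n keys would use the reversed comparator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopKeysSketch {
    /** Returns up to n keys with the largest values, ordered from largest to smallest value. */
    static <K> List<K> topKeys(Map<K, Double> map, int n) {
        // Min-heap holding the best n entries seen so far; the smallest sits at the head.
        PriorityQueue<Map.Entry<K, Double>> heap =
            new PriorityQueue<>(Map.Entry.<K, Double>comparingByValue());
        for (Map.Entry<K, Double> entry : map.entrySet()) {
            heap.add(entry);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<K> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        // The heap drains smallest-first, so reverse to get largest-first.
        Collections.reverse(result);
        return result;
    }
}
```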
The foundry is currently using an older version of MTJ. It should be updated to use a newer one.
The AbstractMutableDoubleMap.SimpleEntrySet class is protected, but exposed through the public AbstractMutableDoubleMap#entrySet() method. This is a problem, because callers with only public access can call entrySet() but cannot refer to the SimpleEntrySet type of the object it returns.
Hello,
On my setup (Java 1.7.0, Ubuntu Linux), the testSample test in MonteCarloSamplerTestHarness is failing - the mean falls outside of the confidence interval.
Running gov.sandia.cognition.statistics.montecarlo.DirectSamplerTest
Constructors
Known Values
clone
sample
Mean: 0.7308781907032909
Monte Carlo: Mean: 0.6873011446918514 Variance: 0.00459004806254931
Interval: 0.673858098932311 0.6873011446918514 0.7007441904513919 0.95 100
Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec <<< FAILURE!
Increasing the number of samples doesn't help.
Many thanks,
Yves
There should be a simple static method and associated batch learner for doing simple univariate regression where there is a single input and a single output. This is the basic case of f(x) = m * x + b.
An incremental learner should be implemented as well.
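The batch case has a closed-form least-squares solution: m = SUM((x_i - mean_x) * (y_i - mean_y)) / SUM((x_i - mean_x)^2) and b = mean_y - m * mean_x. A minimal sketch follows; the class and method names are hypothetical and not a proposed Foundry API.

```java
final class UnivariateRegressionSketch {
    /** Returns {m, b} minimizing the squared error of f(x) = m * x + b. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        // If all inputs are identical the slope is undefined; fall back to a constant fit.
        double m = sxx > 0.0 ? sxy / sxx : 0.0;
        double b = meanY - m * meanX;
        return new double[]{m, b};
    }
}
```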
The result of the TukeyKramerConfidence test is wrong. We confirmed this with Statistica and SAS JMP.
The problem is within the calculation of the test statistics/standardError.
One should not use the totalVariance, but this term instead:
1/(N-K) * SUM( Variance_i * (TreatmentCount_i - 1))
Also there is no need to multiply the test statistic by an extra Sqrt(2), since it's already in the calculation of the standard error.
Attached is a fixed version and a unit test from a textbook example.
This example has been verified with Statistica and SAS JMP.
/*
* File: TukeyKramerConfidence.java
* Authors: Kevin R. Dixon
* Company: Sandia National Laboratories
* Project: Cognitive Foundry
*
* Copyright May 16, 2011, Sandia Corporation.
* Under the terms of Contract DE-AC04-94AL85000, there is a non-exclusive
* license for use of this work by or on behalf of the U.S. Government.
* Export of this program may require a license from the United States
* Government. See CopyrightHistory.txt for complete details.
*
*/
package com.gf.ye.yes.service.plot.statistics;
import gov.sandia.cognition.annotation.PublicationReference;
import gov.sandia.cognition.annotation.PublicationType;
import gov.sandia.cognition.math.UnivariateStatisticsUtil;
import gov.sandia.cognition.math.matrix.Matrix;
import gov.sandia.cognition.math.matrix.MatrixFactory;
import gov.sandia.cognition.statistics.distribution.StudentizedRangeDistribution;
import gov.sandia.cognition.statistics.method.AbstractMultipleHypothesisComparison;
import gov.sandia.cognition.statistics.method.ConfidenceTestAssumptions;
import gov.sandia.cognition.statistics.method.TukeyKramerConfidence;
import gov.sandia.cognition.util.ObjectUtil;
import gov.sandia.cognition.util.Pair;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
/**
* Tukey-Kramer test is the multiple-comparison generalization of the unpaired
* Student's t-test when conducting multiple comparisons. The t-test and
* Tukey's Range test are coincident when a single comparison is made.
* Tukey's Range test is typically used as the post-hoc analysis technique
* after detecting a difference using a 1-way ANOVA. This class implements
* Kramer's generalization to unequal subjects in different treatments.
* @author Kevin R. Dixon
* @since 3.1
*/
@ConfidenceTestAssumptions(
name="Tukey-Kramer Range test",
alsoKnownAs={
"Tukey's Range test",
"Tukey's Honestly Significant Difference test",
"Tukey's HSD test"
},
description={
"Tukey's test determines which treatment is statistically different from a multiple comparison.",
"Tukey's test is a generalization of the paired Student's t-test for multiple comparisons using a population-correction factor."
},
assumptions={
"All data came from same distribution, without considering treatment effects.",
"The observations have equal variance.",
"Measurements are independent and equivalent within a treatment.",
"All observations are independent."
},
nullHypothesis="Each treatment has no effect on the mean outcome of the subjects",
dataPaired=false,
dataSameSize=false,
distribution=StudentizedRangeDistribution.class,
reference={
@PublicationReference(
author="Wikipedia",
title="Tukey's range test",
type=PublicationType.WebPage,
year=2011,
url="http://en.wikipedia.org/wiki/Tukey's_range_test"
)
}
)
public class TukeyKramerConfidenceN1 extends AbstractMultipleHypothesisComparison<Collection<? extends Number>, TukeyKramerConfidenceN1.Statistic> {
private static final long serialVersionUID = 1L;
/**
* Creates a new instance of TukeyKramerConfidence
*/
public TukeyKramerConfidenceN1() {
super();
}
@Override
public TukeyKramerConfidenceN1 clone() {
return (TukeyKramerConfidenceN1) super.clone();
}
@Override
public TukeyKramerConfidenceN1.Statistic evaluateNullHypotheses(Collection<? extends Collection<? extends Number>> data, double uncompensatedAlpha) {
// There are "K" treatments
final int K = data.size();
// Each treatment can have a different number of subjects
List<Integer> subjectCounts = new ArrayList<Integer>(K);
List<Double> treatmentMeans = new ArrayList<Double>(K);
double treatmentVariancesSum = 0;
// This is the total subject count.
int N =0;
for (Collection<? extends Number> treatment : data) {
final int Ni = treatment.size();
N += Ni;
subjectCounts.add(Ni);
Pair<Double,Double> meanAndVariance = UnivariateStatisticsUtil.computeMeanAndVariance(treatment);
treatmentMeans.add(meanAndVariance.getFirst());
treatmentVariancesSum += meanAndVariance.getSecond() * (Ni-1);
}
final double meanSquaredResiduals = treatmentVariancesSum / (N -K );
return new TukeyKramerConfidenceN1.Statistic(uncompensatedAlpha, subjectCounts, treatmentMeans, meanSquaredResiduals);
}
/**
* Statistic from Tukey-Kramer's multiple comparison test
*/
public static class Statistic extends AbstractMultipleHypothesisComparison.Statistic {
/**
*
*/
private static final long serialVersionUID = 1L;
/**
* Number of subjects in each treatment
*/
protected List<Integer> subjectCounts;
/**
* Mean for each treatment
*/
protected List<Double> treatmentMeans;
protected List<Double> treatmentVariances;
/**
* Gets the standard errors in the experiment
*/
protected Matrix standardErrors;
/**
*
*/
protected Matrix meanDifferences;
/**
* Creates a new instance of StudentizedMultipleComparisonStatistic
*
* @param uncompensatedAlpha
* Uncompensated alpha (p-value threshold) for the multiple comparison test
* @param subjectCounts
* Number of subjects in each treatment
* @param treatmentMeans
* Mean for each treatment
* @param meanSquaredResiduals
* Pooled within-treatment variance: SUM( Variance_i * (N_i - 1) ) / (N - K)
*/
public Statistic(final double uncompensatedAlpha, final List<Integer> subjectCounts, final List<Double> treatmentMeans, final double meanSquaredResiduals) {
this.treatmentCount = treatmentMeans.size();
this.uncompensatedAlpha = uncompensatedAlpha;
this.subjectCounts = subjectCounts;
this.treatmentMeans = treatmentMeans;
this.testStatistics = this.computeTestStatistics(subjectCounts, treatmentMeans, meanSquaredResiduals);
this.nullHypothesisProbabilities = this.computeNullHypothesisProbabilities(subjectCounts, this.testStatistics);
}
/**
* Computes the test statistic for all treatments
*
* @param subjectCounts
* Number of subjects in each treatment
* @param treatmentMeans
* Mean for each treatment
* @param meanSquaredResiduals
* Pooled within-treatment variance: SUM( Variance_i * (N_i - 1) ) / (N - K)
* @return Test statistics, where the (i,j) element compares treatment "i" to treatment "j", the statistic is symmetric
*/
public Matrix computeTestStatistics(final List<Integer> subjectCounts, final List<Double> treatmentMeans, final double meanSquaredResiduals) {
int K = treatmentMeans.size();
Matrix Z = MatrixFactory.getDefault().createMatrix(K, K);
this.standardErrors = MatrixFactory.getDefault().createMatrix(K, K);
for (int i = 0; i < K; i++) {
final double yi = treatmentMeans.get(i);
final int ni = subjectCounts.get(i);
for (int j = i + 1; j < K; j++) {
final int nj = subjectCounts.get(j);
final double yj = treatmentMeans.get(j);
double standardError = Math.sqrt( meanSquaredResiduals * 0.5 * ((1.0 / ni) + (1.0 / nj)));
final double zij = Math.abs(yi - yj) / standardError;
Z.setElement(i, j, zij);
Z.setElement(j, i, zij);
this.standardErrors.setElement(i, j, standardError);
this.standardErrors.setElement(j, i, standardError);
}
}
return Z;
}
/**
* Computes null-hypothesis probability for the (i,j) treatment comparison
*
* @param subjectCounts
* Number of subjects in the experiment
* @param Z
* Test statistic for the (i,j) treatment comparison
* @return Null-hypothesis probability for the (i,j) treatment comparison
*/
public Matrix computeNullHypothesisProbabilities(final List<Integer> subjectCounts, final Matrix Z) {
final int K = Z.getNumRows();
final double N = UnivariateStatisticsUtil.computeSum(subjectCounts);
Matrix P = MatrixFactory.getDefault().createMatrix(K, K);
StudentizedRangeDistribution.CDF cdf = new StudentizedRangeDistribution.CDF(K, N - K);
for (int i = 0; i < K; i++) {
// A classifier is equal to itself.
P.setElement(i, i, 1.0);
for (int j = i + 1; j < K; j++) {
// The difference is symmetric
double zij = Z.getElement(i, j);
double pij = 1.0 - cdf.evaluate(zij ); // * Math.sqrt(2)
P.setElement(i, j, pij);
P.setElement(j, i, pij);
}
}
return P;
}
@Override
public Statistic clone() {
Statistic clone = (Statistic) super.clone();
clone.treatmentMeans = ObjectUtil.cloneSmartElementsAsArrayList(this.getTreatmentMeans());
clone.subjectCounts = ObjectUtil.cloneSmartElementsAsArrayList(this.getSubjectCounts());
return clone;
}
/**
* Getter for subjectCounts
*
* @return Number of subjects in the experiment
*/
public List<Integer> getSubjectCounts() {
return this.subjectCounts;
}
/**
* Getter for treatmentMeans
*
* @return Mean for each treatment
*/
public List<Double> getTreatmentMeans() {
return this.treatmentMeans;
}
@Override
public boolean acceptNullHypothesis(final int i, final int j) {
return this.getNullHypothesisProbability(i, j) >= this.getUncompensatedAlpha();
}
/**
* Getter for standardErrors
*
* @return Gets the standard errors in the experiment
*/
public Matrix getStandardErrors() {
return this.standardErrors;
}
}
}
/**
*
*/
package com.gf.ye.yes.service.plot;
import static org.junit.Assert.assertEquals;
import gov.sandia.cognition.math.UnivariateStatisticsUtil;
import java.util.List;
import org.junit.Test;
import com.gf.ye.yes.service.plot.statistics.TukeyKramerConfidenceN1;
import com.google.common.collect.ImmutableList;
/**
* @author fkurth
*
*/
public class TukeyTestTest {
/**
*
* From
*
* Rasch, Herrendoerfer, Bock, Victor, Guiard
* ISBN 3-486-23146-4
*
* (In German)
*
* Verfahrensbibliothek. Band 1.
* Page 851
*
* Verified with Statistica
*
*/
List<List<Double>> testData = ImmutableList.of(
(List<Double>)ImmutableList.of( 529d, 508d, 501d, 534d, 510d, 504d ),
(List<Double>)ImmutableList.of( 505d, 521d, 560d, 516d, 598d, 552d ),
(List<Double>)ImmutableList.of( 537d, 569d, 499d, 501d, 506d, 600d ),
(List<Double>)ImmutableList.of( 619d, 632d, 644d, 638d, 623d ),
(List<Double>)ImmutableList.of( 565d, 596d, 631d, 667d, 613d, 580d )
);
final TukeyKramerConfidenceN1 t = new TukeyKramerConfidenceN1();
/**
* Test method for {@link com.gf.ye.yes.service.plot.statistics.TukeyKramerConfidenceN1#evaluateNullHypotheses(java.util.Collection, double)}.
*/
@Test
public final void testEvaluateNullHypothesesCollectionOfQextendsCollectionOfQextendsNumberDouble() {
TukeyKramerConfidenceN1.Statistic stat = t.evaluateNullHypotheses(testData);
Integer treatments = stat.getTreatmentCount();
assertEquals(Integer.valueOf(5) , treatments );
List<Double> means = stat.getTreatmentMeans();
assertEquals( Double.valueOf( 514.33d ), means.get(0), 0.005 );
assertEquals( Double.valueOf( 542.00d ), means.get(1), 0.005 );
assertEquals( Double.valueOf( 535.33d ), means.get(2), 0.005 );
assertEquals( Double.valueOf( 631.20d ), means.get(3), 0.005 );
assertEquals( Double.valueOf( 608.67d ), means.get(4), 0.005 );
Integer subjects = (int) UnivariateStatisticsUtil.computeSum(stat.getSubjectCounts() );
assertEquals( Integer.valueOf(29), subjects);
Integer degOfFreedom = subjects - treatments;
assertEquals( Integer.valueOf(24), degOfFreedom);
// diagonals
assertEquals( Double.valueOf( 1d ), stat.getNullHypothesisProbability(0, 0) , 0.00005 );
assertEquals( Double.valueOf( 1d ), stat.getNullHypothesisProbability(1, 1) , 0.00005 );
assertEquals( Double.valueOf( 1d ), stat.getNullHypothesisProbability(2, 2) , 0.00005 );
assertEquals( Double.valueOf( 1d ), stat.getNullHypothesisProbability(3, 3) , 0.00005 );
assertEquals( Double.valueOf( 1d ), stat.getNullHypothesisProbability(4, 4) , 0.00005 );
assertEquals( Double.valueOf( 0.541176d ), stat.getNullHypothesisProbability(0, 1) , 0.000001 );
assertEquals( Double.valueOf( 0.541176d ), stat.getNullHypothesisProbability(1, 0) , 0.000001 );
assertEquals( Double.valueOf( 0.763884d ), stat.getNullHypothesisProbability(0, 2) , 0.000001 );
assertEquals( Double.valueOf( 0.763884d ), stat.getNullHypothesisProbability(2, 0) , 0.000001 );
assertEquals( Double.valueOf( 0.000145d ), stat.getNullHypothesisProbability(0, 3) , 0.000001 );
assertEquals( Double.valueOf( 0.000145d ), stat.getNullHypothesisProbability(3, 0) , 0.000001 );
assertEquals( Double.valueOf( 0.000300d ), stat.getNullHypothesisProbability(0, 4) , 0.000001 );
assertEquals( Double.valueOf( 0.000300d ), stat.getNullHypothesisProbability(4, 0) , 0.000001 );
assertEquals( Double.valueOf( 0.995624d ), stat.getNullHypothesisProbability(1, 2) , 0.000001 );
assertEquals( Double.valueOf( 0.995624d ), stat.getNullHypothesisProbability(2, 1) , 0.000001 );
assertEquals( Double.valueOf( 0.000766d ), stat.getNullHypothesisProbability(1, 3) , 0.000001 );
assertEquals( Double.valueOf( 0.000766d ), stat.getNullHypothesisProbability(3, 1) , 0.000001 );
assertEquals( Double.valueOf( 0.008328d ), stat.getNullHypothesisProbability(1, 4) , 0.000001 );
assertEquals( Double.valueOf( 0.008328d ), stat.getNullHypothesisProbability(4, 1) , 0.000001 );
assertEquals( Double.valueOf( 0.000391d ), stat.getNullHypothesisProbability(2, 3) , 0.000001 );
assertEquals( Double.valueOf( 0.000391d ), stat.getNullHypothesisProbability(3, 2) , 0.000001 );
assertEquals( Double.valueOf( 0.748831d ), stat.getNullHypothesisProbability(4, 3) , 0.000001 );
assertEquals( Double.valueOf( 0.748831d ), stat.getNullHypothesisProbability(3, 4) , 0.000001 );
}
}
A common task, both in the Foundry and more generally, is to keep track of some values and assign them unique indices. These indices are typically integers starting from 0, though in some cases they may be longs or other values such as UUIDs.
Having a utility to cover this, either generally or just the specific base case, would be a good addition to the Foundry. In particular, it could help with mapping values onto indices in a Vector, for example when converting an InfiniteVector to a Vector.
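For the base case of integer indices starting from 0, such a utility could look roughly like the sketch below. `IndexerSketch` and its methods are hypothetical names, not an existing Foundry class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexerSketch<T> {
    private final Map<T, Integer> indexOf = new HashMap<>();
    private final List<T> values = new ArrayList<>();

    /** Returns the existing index of the value, or assigns the next free index to it. */
    int add(T value) {
        Integer existing = indexOf.get(value);
        if (existing != null) {
            return existing;
        }
        int index = values.size();
        indexOf.put(value, index);
        values.add(value);
        return index;
    }

    /** Returns the index of the value, or -1 if it has never been added. */
    int getIndex(T value) {
        Integer index = indexOf.get(value);
        return index == null ? -1 : index;
    }

    /** Returns the value that was assigned the given index. */
    T getValue(int index) {
        return values.get(index);
    }

    int size() {
        return values.size();
    }
}
```

The size of the indexer is then the dimensionality needed for a Vector that holds one element per tracked value.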
Allows a linear algorithm to be used with an approximation to Gaussian RBF Kernels.
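The issue does not name the technique. One well-known approach that matches the description is random Fourier features, and the sketch below assumes that is what is meant. For the kernel k(x, y) = exp(-gamma * ||x - y||^2), it draws w ~ N(0, 2 * gamma * I) and b ~ Uniform(0, 2 * pi), and maps x to sqrt(2 / D) * cos(w . x + b), so that the dot product of two mapped points approximates the kernel value. All names are hypothetical.

```java
import java.util.Random;

final class RandomFourierFeaturesSketch {
    private final double[][] weights;
    private final double[] offsets;

    RandomFourierFeaturesSketch(int inputDim, int numFeatures, double gamma, long seed) {
        Random random = new Random(seed);
        weights = new double[numFeatures][inputDim];
        offsets = new double[numFeatures];
        double scale = Math.sqrt(2.0 * gamma);
        for (int k = 0; k < numFeatures; k++) {
            for (int d = 0; d < inputDim; d++) {
                weights[k][d] = scale * random.nextGaussian();
            }
            offsets[k] = 2.0 * Math.PI * random.nextDouble();
        }
    }

    /** Maps x to a feature vector whose dot products approximate the Gaussian RBF kernel. */
    double[] transform(double[] x) {
        int numFeatures = weights.length;
        double norm = Math.sqrt(2.0 / numFeatures);
        double[] z = new double[numFeatures];
        for (int k = 0; k < numFeatures; k++) {
            z[k] = norm * Math.cos(dot(weights[k], x) + offsets[k]);
        }
        return z;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

A linear learner trained on the transformed features then behaves approximately like a kernel method with the Gaussian RBF kernel, with accuracy improving as the number of features grows.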
Currently it is in the SparseVector implementation but not available in the Vector interface.
I was playing around with the parameters for the Random Forest example from #6 and somehow triggered a sanity check in AbstractVectorThresholdMaximumGainLearner that probably should not be triggerable:
java.lang.RuntimeException: bestThreshold (8.30760652058587) lies outside range of values (8.30760652058587, 9.14680325466277]
at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold(AbstractVectorThresholdMaximumGainLearner.java:383)
at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold(AbstractVectorThresholdMaximumGainLearner.java:209)
at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.learn(AbstractVectorThresholdMaximumGainLearner.java:141)
at gov.sandia.cognition.learning.algorithm.tree.AbstractVectorThresholdMaximumGainLearner.learn(AbstractVectorThresholdMaximumGainLearner.java:45)
at gov.sandia.cognition.learning.algorithm.tree.RandomSubVectorThresholdLearner.learn(RandomSubVectorThresholdLearner.java:212)
at gov.sandia.cognition.learning.algorithm.tree.RandomSubVectorThresholdLearner.learn(RandomSubVectorThresholdLearner.java:47)
at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:237)
at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:37)
at gov.sandia.cognition.learning.algorithm.tree.AbstractDecisionTreeLearner.learnChildNodes(AbstractDecisionTreeLearner.java:129)
at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learnNode(CategorizationTreeLearner.java:246)
at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learn(CategorizationTreeLearner.java:178)
at gov.sandia.cognition.learning.algorithm.tree.CategorizationTreeLearner.learn(CategorizationTreeLearner.java:37)
at gov.sandia.cognition.learning.algorithm.ensemble.AbstractBaggingLearner.step(AbstractBaggingLearner.java:195)
at gov.sandia.cognition.learning.algorithm.AbstractAnytimeBatchLearner.learn(AbstractAnytimeBatchLearner.java:147)
...
This could be a bit nicer of an interface than having to cast back and forth to doubles.
Implement the Alternating Decision Tree and its learning algorithm.
It doesn't seem like sparse vector times dense matrix is working at an appropriate speed. Neither is dense matrix times sparse vector.
Add implementation of Factorization Machines and the learning algorithms for it.
I set the leafCountThreshold to 90k for a very large data set. I then used the trained model to predict the training dataset and found that many nodes have a number of output records much lower than the threshold.
In the source code of CategorizationTreeLearner:
boolean isLeaf = this.areAllOutputsEqual(data)
|| data.size() <= this.leafCountThreshold
|| (this.maxDepth > 0 && node.getDepth() >= this.maxDepth);
The second condition makes any node whose data size is less than or equal to the leafCountThreshold a leaf node.
The Foundry has support for random forests, however it requires stitching together several components to create the learner. Since it is a very popular method, we should make it easier to get started with it by adding a factory class.
Add methods for looping over vectors via a callback where the index and value are passed instead of using an iterator. According to @dbtsai this could be more efficient for sparse vector iteration.