Comments (5)
Hi Mike,
Thanks for the bug report! I would like to do c and a. The "training bag" is not actually used for training. It is the feature list, and thus should be a set of unique words. As a sanity check, though, the constructor should check for duplicates. I will do both right now. Could you please add a unit test with your test case (in smile.feature.BagTest in the test directory)? Thanks!
Haifeng
from smile.
Hi Mike,
The fix is in the master now. Thanks!
Thanks! I'll add a unit test and PR it to this issue later today, as I'm in a meeting right now.
PS. The reason I found this: I extracted a feature array for one category, then for a second one, and combined the two feature arrays into a new one. That produced duplicates, as some terms were in both categories. I should probably remove the intersecting features to make the algorithm's predictions better, or does the implementation take these into account?
Yes, in general you had better use a Set to get the union of the features rather than just concatenating the arrays. But the constructor now handles duplicates, so it should be fine to use the feature list from your example.
The main problem is the documentation of Bag. It is not for training. It just assumes that you have a set of unique features and uses them to calculate double-valued feature vectors of some text for you.
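For reference, the Set-based union mentioned above could look roughly like this. It is only a sketch using the standard java.util collections, assuming the per-category features are plain String arrays. The class and method names (FeatureUnion, union) are hypothetical and not part of smile.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of smile: merges feature arrays into one
// array of unique features, keeping first-seen order.
class FeatureUnion {
    static String[] union(String[]... featureArrays) {
        // LinkedHashSet drops duplicates and preserves insertion order.
        Set<String> unique = new LinkedHashSet<>();
        for (String[] features : featureArrays) {
            unique.addAll(Arrays.asList(features));
        }
        return unique.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Made-up example terms for two categories; "team" occurs in both.
        String[] sports = {"goal", "match", "team"};
        String[] politics = {"team", "election", "vote"};
        // prints [goal, match, team, election, vote]
        System.out.println(Arrays.toString(union(sports, politics)));
    }
}
```

The resulting array has no duplicates, so it can be used as the feature list whether or not the constructor performs its own duplicate check.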