Comments (4)
Hi @jamesmf,
Thanks for your feedback. This is indeed a good addition to the DSL, and I like your semantics of a new entity at the end of each sentence using &[1]. I think there are a couple of options here, e.g.:
~[topping]
    pepperoni &[99]
    anchovies
    mushrooms
In this case, would anchovies and mushrooms share the remaining 1%?
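One possible reading of the percentage form can be sketched in Python (this is not Chatito code, and `resolve_weights` is a hypothetical helper). It assumes that entries without an explicit percentage split whatever is left of 100% evenly, which is only one of the options being discussed here:

```python
from fractions import Fraction


def resolve_weights(entries):
    """entries: list of (text, percent or None).

    Explicit percentages are kept as given. ASSUMPTION: entries without
    one share the remainder of 100% evenly.
    """
    explicit = sum(p for _, p in entries if p is not None)
    if explicit > 100:
        raise ValueError("explicit percentages exceed 100")
    unset = [t for t, p in entries if p is None]
    share = Fraction(100 - explicit, len(unset)) if unset else Fraction(0)
    return {t: (Fraction(p) if p is not None else share) for t, p in entries}


weights = resolve_weights(
    [("pepperoni", 99), ("anchovies", None), ("mushrooms", None)]
)
for text, percent in weights.items():
    print(text, float(percent))
```

Under that assumption, anchovies and mushrooms would each get half of the remaining 1%.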
Currently, the way to handle probabilities is to repeat the same sentence, or maybe add typos or other augmentation, e.g.:
~[topping]
    pepperoni
    pepperoni
    Pepperoni
    peperoni
    Peperoni
    anchovies
    mushrooms
So I agree in the sense that we may want to make this shorter by adding augmentation mechanisms and probabilities, e.g.:
// augmentation can be: 'none', 'capitalization', 'typos' or 'full'
~[topping]('augmentation': 'full')
    pepperoni &[20x]
    anchovies
    mushrooms
This would be another way to control probabilities: instead of imperatively expressing the 99%, we could express the frequency, as if that word were listed 20 times. I'm not sure which way is better at the moment, so I would love to hear any feedback.
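The multiplier reading can be illustrated with a small Python sketch (not Chatito code; `expand` is a hypothetical helper). It assumes that a multiplier behaves exactly like listing the entry that many times and that values are then drawn uniformly from the flattened list:

```python
import random
from collections import Counter


def expand(entries):
    """entries: list of (text, multiplier).

    Returns the flat list a multiplier would stand for: each text
    repeated `multiplier` times.
    """
    flat = []
    for text, times in entries:
        flat.extend([text] * times)
    return flat


pool = expand([("pepperoni", 20), ("anchovies", 1), ("mushrooms", 1)])

# ASSUMPTION: uniform sampling over the flattened list.
rng = random.Random(0)
counts = Counter(rng.choice(pool) for _ in range(10_000))
print(counts.most_common())
```

With these numbers pepperoni has 20 of the 22 slots, so it would be picked about 91% of the time, and the other two toppings split the rest equally.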
from chatito.
I think it would be reasonable if the & could be treated as a multiplier, so that not much functionality would have to change. I don't think describing them as probabilities would work very well once you start dealing with multiple levels.
I just ran into a tricky situation where I had two very long lists, ~[x] and ~[y], and then wanted to implement:
@[capture]
    ~[x?] ~[y]
    ~[x] ~[y?]
Off the top of my head, I thought this would generate approximately 50% of the data as either x or y, but not both. In reality, it was overwhelmingly likely to generate [x and y] because (I think) there were far, far more combinations of [x and y] than there were of [x or y].
If it is expanding all the combinations and then sampling some number of them, the combinatorial math gets really hard to estimate, and you end up with some seriously weird results (in the example above, I generated hundreds of examples and 0 showed up as [x or y]).
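A rough count makes the effect concrete. The Python sketch below assumes what is guessed at above: that every combination is expanded and then sampled uniformly. The list sizes are hypothetical, and `either_only_fraction` is an illustrative helper, not part of Chatito:

```python
def either_only_fraction(n_x, n_y):
    """Fraction of expanded sentences that contain exactly one of x / y,
    for the two sentences `~[x?] ~[y]` and `~[x] ~[y?]`."""
    # `~[x?] ~[y]`: n_y sentences with y only, n_x * n_y with both.
    # `~[x] ~[y?]`: n_x sentences with x only, n_x * n_y with both.
    only_one = n_x + n_y
    # ASSUMPTION: pairs produced by both sentences are counted twice.
    both = 2 * n_x * n_y
    return only_one / (only_one + both)


# Hypothetical sizes for the "two very long lists".
print(either_only_fraction(1000, 1000))
```

With 1000 values in each list, only about 0.1% of the expanded sentences contain just one of the two, so seeing none in a few hundred samples is not surprising. If duplicate pairs were removed first, the share would roughly double but stay tiny.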
from chatito.
Is there currently any way to offset the problem I described above?
A concrete example would be something like this:
Let's say you want to generate "negative" training data in the form of stop-words and nonsense in order to classify this as having 'unrecognized intent'. For this example, take the following chatito file:
%[other]('training': '1000', 'testing': '20')
    ~[stopword] ~[stopword?] ~[stopword?]
    ~[nonsense] ~[nonsense?] ~[nonsense?]
    ~[nonsense]
    ~[char]~[char?]~[char?]~[char?]~[char?]~[char?]
    ~[char]~[char]~[char?]~[char?]~[char?]~[char?]~[char?]
If ~[stopword] contains a short-ish list, there are orders of magnitude more possible "nonsense" examples than "stopword" examples.
If I wanted to ensure the "stopword" outputs occur approximately as often as "nonsense" outputs, is there a way to do that beyond having more than one chatito file?
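To make the requested behavior concrete, here is a Python sketch of sampling that picks a sentence first and only then fills its slots, so each sentence gets an equal share regardless of how many combinations it has. This is not a Chatito feature or its algorithm; `sample_balanced`, the word lists, and the 0.5 chance of keeping an optional slot are all assumptions made for illustration:

```python
import random


def sample_balanced(templates, slots, n, seed=0):
    """Pick a template uniformly, then fill each of its slots.

    ASSUMPTION: optional slots (trailing '?') are kept with
    probability 0.5. Duplicate outputs are not filtered.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        name, template = rng.choice(templates)
        words = []
        for slot in template:
            if slot.endswith("?") and rng.random() < 0.5:
                continue  # skip this optional slot
            words.append(rng.choice(slots[slot.rstrip("?")]))
        out.append((name, " ".join(words)))
    return out


# Hypothetical placeholder lists.
slots = {
    "stopword": ["the", "a", "of"],
    "nonsense": ["blorp", "zxqv", "fnord", "glarb", "wuzzle"],
}
templates = [
    ("stopword", ["stopword", "stopword?", "stopword?"]),
    ("nonsense", ["nonsense", "nonsense?", "nonsense?"]),
]

examples = sample_balanced(templates, slots, 1000)
stop_share = sum(name == "stopword" for name, _ in examples) / len(examples)
print(stop_share)
```

Because the template is chosen before any slot is filled, the "stopword" share stays near one half even when the "nonsense" list is much longer. The trade-off is that short lists will produce repeated examples.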
from chatito.
The probability operator has been implemented as of v2.2.0.
from chatito.