It would be great if you could specify ratios between entries or probabilities or rela


Feature Request: specify probabilities about chatito HOT 4 CLOSED

rodrigopivi commented on May 29, 2024

Feature Request: specify probabilities

from chatito.

Comments (4)

rodrigopivi commented on May 29, 2024

hi @jamesmf
Thanks for your feedback. This is indeed a good addition to the DSL, and I like your semantics of a new entity at the end of each sentence using &[1]. I think there are a couple of options here, e.g.:

~[topping]
    pepperoni &[99]
    anchovies
    mushrooms

In this case, would anchovies and mushrooms share the remaining 1%?
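One possible reading of that question, sketched in Python: entries with an explicit percentage keep it, and entries without one split whatever is left equally. The function name `resolve_probabilities` and the equal-split rule are assumptions for illustration, not confirmed Chatito behavior.

```python
def resolve_probabilities(entries):
    """Map each entry to a percentage.

    `entries` maps a value to an explicit percentage, or to None when no
    percentage was given. Unspecified entries share the remainder equally
    (an assumed rule, not taken from Chatito itself).
    """
    explicit = {k: v for k, v in entries.items() if v is not None}
    unspecified = [k for k, v in entries.items() if v is None]
    remaining = 100 - sum(explicit.values())
    if remaining < 0:
        raise ValueError("explicit percentages add up to more than 100")
    probs = dict(explicit)
    for key in unspecified:
        probs[key] = remaining / len(unspecified)
    return probs


# The ~[topping] example: pepperoni &[99], the other two unspecified.
topping = {"pepperoni": 99, "anchovies": None, "mushrooms": None}
probs = resolve_probabilities(topping)
print(probs)
```

Under this rule, anchovies and mushrooms each end up with 0.5%.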

Currently, the way to handle probabilities is to repeat the same sentence, possibly adding typos or other augmentation, e.g.:

~[topping]
    pepperoni
    pepperoni
    Pepperoni
    peperoni
    Peperoni
    anchovies
    mushrooms

So I agree that we may want to make this shorter by adding augmentation mechanisms and probabilities.

// augmentation can be: 'none', 'capitalization', 'typos' or 'full'
~[topping]('augmentation': 'full')
    pepperoni &[20x]
    anchovies 
    mushrooms

This would be another way to control probabilities: instead of imperatively expressing the 99%, we can express the frequency, as if that word were listed 20 times. I'm not sure which way is better at the moment, and would love to hear any feedback.
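To compare the two notations, here is a small Python sketch that treats a multiplier such as &[20x] as plain repetition and then measures the resulting share. The helper `expand_multipliers` is hypothetical, and uniform sampling over the expanded list is an assumption.

```python
from collections import Counter


def expand_multipliers(entries):
    """Expand (value, multiplier) pairs into a flat list with repeats."""
    expanded = []
    for value, times in entries:
        expanded.extend([value] * times)
    return expanded


# pepperoni &[20x]; entries without a multiplier count once.
topping = [("pepperoni", 20), ("anchovies", 1), ("mushrooms", 1)]
expanded = expand_multipliers(topping)
counts = Counter(expanded)

# Share of pepperoni if one entry is drawn uniformly from the expanded list.
share = counts["pepperoni"] / len(expanded)
print(f"pepperoni share: {share:.3f}")
```

With two other entries, a 20x multiplier gives pepperoni 20 of 22 slots, which is about 91% rather than 99%. Reaching 99% by repetition alone would take a multiplier of 198, so the two notations are not interchangeable one to one.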


jamesmf commented on May 29, 2024

I think it would be reasonable to treat the & as a multiplier, so that not much functionality would have to change. I don't think describing them as probabilities would work very well once you start dealing with multiple levels.

I just ran into the tricky situation where I had two very long lists ~[x] and ~[y] and then wanted to implement

@[capture]
    ~[x?] ~[y]
    ~[x] ~[y?]

Which off the top of my head I thought would have generated approx 50% of the data as either x or y, but not both. In reality, it was overwhelmingly likely to generate [x and y] because (I think) there were far, far, far more combinations of [x and y] than there were of [x or y].

If it is expanding all the combinations, then sampling some number of them, the combinatorial math gets really hard to actually estimate, and you end up with some seriously weird results (in the example above, I generated hundreds of examples and 0 showed up as [x or y]).
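The imbalance can be estimated with a quick count. The Python sketch below assumes the generator expands every combination and samples uniformly among them, which is only the hypothesis stated above and not confirmed Chatito behavior. The list sizes are made up for illustration, and duplicates between the two sentences are not removed.

```python
def count_combinations(n_x, n_y):
    """Count expansions of the two @[capture] sentences.

    ~[x?] ~[y] yields (n_x + 1) * n_y combinations, n_y of them without x.
    ~[x] ~[y?] yields n_x * (n_y + 1) combinations, n_x of them without y.
    Returns (single, both): combinations with only one of x or y,
    and combinations with both.
    """
    single = n_y + n_x
    both = 2 * n_x * n_y
    return single, both


# Hypothetical sizes for the "very long" lists.
single, both = count_combinations(1000, 1000)
fraction_single = single / (single + both)
print(f"single: {single}, both: {both}, fraction single: {fraction_single:.6f}")
```

With 1,000 entries in each list, only about 0.1% of the combinations contain x or y alone, which fits with seeing none of them in a few hundred samples.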


jamesmf commented on May 29, 2024

Is there currently any way to offset the problem I described above?

A concrete example would be something like this:

Let's say you want to generate "negative" training data in the form of stop-words and nonsense in order to classify this as having 'unrecognized intent'. For this example, take the following chatito file:

%[other]('training': '1000', 'testing': '20')
    ~[stopword] ~[stopword?] ~[stopword?]
    ~[nonsense] ~[nonsense?] ~[nonsense?]

~[nonsense]
    ~[char]~[char?]~[char?]~[char?]~[char?]~[char?]
    ~[char]~[char]~[char?]~[char?]~[char?]~[char?]~[char?]

If ~[stopword] contains a short-ish list, there are orders of magnitude more possible "nonsense" examples than "stopword" examples.

If I wanted to ensure the "stopword" outputs occur approximately as often as "nonsense" outputs, is there a way to do that beyond having more than one chatito file?
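To put rough numbers on that gap, the Python sketch below counts the expansions of each sentence in the file above. The sizes (26 entries in ~[char], 50 in ~[stopword]) are assumptions, since neither list is shown, and repeated strings are counted once per way of producing them.

```python
def count_slots(size, required, optional):
    """Expansions of a sequence of slots drawn from one alias.

    A required slot has `size` choices. An optional slot has `size + 1`,
    the extra one being "omitted".
    """
    return size**required * (size + 1) ** optional


N_CHAR = 26      # assumed size of ~[char]
N_STOPWORD = 50  # assumed size of ~[stopword]

# ~[nonsense] has two lines: 1 required + 5 optional, and 2 required + 5 optional.
nonsense_alias = count_slots(N_CHAR, 1, 5) + count_slots(N_CHAR, 2, 5)

# Each %[other] sentence uses its alias once required and twice optional.
stopword_sentences = count_slots(N_STOPWORD, 1, 2)
nonsense_sentences = count_slots(nonsense_alias, 1, 2)

print(f"stopword sentences: {stopword_sentences}")
print(f"nonsense sentences: {nonsense_sentences:.3e}")
print(f"ratio: {nonsense_sentences / stopword_sentences:.3e}")
```

Under these assumptions the nonsense sentence has more than 10^24 times as many expansions as the stopword sentence, so sampling uniformly over all expansions would almost never pick a stopword example.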


rodrigopivi commented on May 29, 2024

The probability operator has been implemented since v2.2.0.

