Comments (4)

rodrigopivi commented on May 29, 2024

Hi @jamesmf,
Thanks for your feedback. This is indeed a good addition to the DSL, and I like your semantics of a new entity at the end of each sentence using &[1]. I think there are a couple of options here, e.g.:

~[topping]
    pepperoni &[99]
    anchovies
    mushrooms

In this case, would anchovies and mushrooms share the remaining 1%?

Currently, the way to handle probabilities is to just repeat the same sentence, or maybe add typos or other augmentation, e.g.:

~[topping]
    pepperoni
    pepperoni
    Pepperoni
    peperoni
    Peperoni
    anchovies
    mushrooms

So I agree that we may want to make this shorter by adding augmentation mechanisms and probabilities.

// augmentation can be: 'none', 'capitalization', 'typos' or 'full'
~[topping]('augmentation': 'full')
    pepperoni &[20x]
    anchovies 
    mushrooms

This would be another way to control probabilities: instead of imperatively expressing the 99%, we could express the frequency, as if that word were listed 20 times. I'm not sure which way is better at the moment. Would love to hear any feedback.
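
To make the two readings concrete, here is a minimal Python sketch (not Chatito code) of how each proposal would translate into sampling probabilities. The function names are hypothetical, and the rule that unmarked values split the leftover percentage evenly is only an assumption, since that is exactly the open question above.

```python
from fractions import Fraction


def weights_from_multipliers(entries):
    """Frequency reading: `&[20x]` behaves as if the value were listed 20 times.

    entries is a list of (value, multiplier); unmarked values use multiplier 1.
    """
    total = sum(mult for _, mult in entries)
    return {value: Fraction(mult, total) for value, mult in entries}


def weights_from_percentages(entries):
    """Percentage reading: `&[99]` claims 99% of the samples.

    entries is a list of (value, percent or None). ASSUMPTION: values without
    a percentage share whatever is left over in equal parts.
    """
    claimed = sum(p for _, p in entries if p is not None)
    unmarked = [v for v, p in entries if p is None]
    if claimed > 100 or (not unmarked and claimed != 100):
        raise ValueError("percentages must add up to 100")
    rest = Fraction(100 - claimed, len(unmarked)) if unmarked else Fraction(0)
    return {
        v: (Fraction(p) if p is not None else rest) / 100
        for v, p in entries
    }


by_frequency = weights_from_multipliers(
    [("pepperoni", 20), ("anchovies", 1), ("mushrooms", 1)]
)
by_percentage = weights_from_percentages(
    [("pepperoni", 99), ("anchovies", None), ("mushrooms", None)]
)
```

With the multiplier, pepperoni gets 20 of 22 slots; with the percentage, anchovies and mushrooms each end up with 0.5% under the even-split assumption. The multiplier version never needs the values to add up to anything, which is one argument in its favor.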

from chatito.

jamesmf commented on May 29, 2024

I think it would be reasonable to treat the & as a multiplier, so that not much functionality would have to change. I don't think describing them as probabilities would work very well once you start dealing with multiple levels.

I just ran into the tricky situation where I had two very long lists ~[x] and ~[y] and then wanted to implement

@[capture]
    ~[x?] ~[y]
    ~[x] ~[y?]

Off the top of my head, I thought this would generate approximately 50% of the data as either x or y, but not both. In reality, it was overwhelmingly likely to generate [x and y] because (I think) there were far, far more combinations of [x and y] than there were of [x or y].

If it is expanding all the combinations, then sampling some number of them, the combinatorial math gets really hard to actually estimate, and you end up with some seriously weird results (in the example above, I generated hundreds of examples and 0 showed up as [x or y]).
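
A rough count shows why, assuming (as guessed above, not confirmed) that every expanded combination is equally likely to be sampled. The list sizes and function names below are made up for illustration.

```python
def shape_counts(n_x, n_y):
    """Count expansions of the two @[capture] sentences by shape.

    "~[x?] ~[y]" expands to n_y "y only" plus n_x * n_y "x and y".
    "~[x] ~[y?]" expands to n_x "x only" plus n_x * n_y "x and y".
    Expansions are counted per sentence, so "x and y" is counted twice.
    """
    only_one = n_x + n_y
    both = 2 * n_x * n_y
    return only_one, both


def fraction_only_one(n_x, n_y):
    """Share of expansions that contain x or y but not both."""
    only_one, both = shape_counts(n_x, n_y)
    return only_one / (only_one + both)


# Hypothetical sizes: two lists of 1000 values each.
share = fraction_only_one(1000, 1000)
```

For two lists of 1000 values, `share` is 2000 / 2002000, roughly 0.1%, so a few hundred samples containing no single-alias example is about what uniform sampling over combinations would predict. The 50/50 split only holds when both lists have a single value.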

jamesmf commented on May 29, 2024

Is there currently any way to offset the problem I described above?

A concrete example would be something like this:

Let's say you want to generate "negative" training data in the form of stop-words and nonsense in order to classify this as having 'unrecognized intent'. For this example, take the following chatito file:

%[other]('training': '1000', 'testing': '20')
    ~[stopword] ~[stopword?] ~[stopword?]
    ~[nonsense] ~[nonsense?] ~[nonsense?]

~[nonsense]
    ~[char]~[char?]~[char?]~[char?]~[char?]~[char?]
    ~[char]~[char]~[char?]~[char?]~[char?]~[char?]~[char?]

If ~[stopword] contains a short-ish list, there are orders of magnitude more possible "nonsense" examples than "stopword" examples.

If I wanted to ensure the "stopword" outputs occur approximately as often as "nonsense" outputs, is there a way to do that beyond having more than one chatito file?
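
To put numbers on the imbalance, here is a small Python sketch that counts the expansions of each sentence. The sizes (100 stopwords, 26 characters in ~[char]) are assumptions picked for illustration, the helper names are hypothetical, and the final share assumes sampling is uniform over all expansions.

```python
from fractions import Fraction


def optional_chain(n_options, required, optional):
    """Expansions of `required` mandatory slots followed by `optional`
    optional slots, each slot drawing from n_options values.
    An optional slot has one extra choice: being left out."""
    return n_options ** required * (n_options + 1) ** optional


def nonsense_count(n_chars):
    """Expansions of ~[nonsense]: one or two required chars, then five optional."""
    return optional_chain(n_chars, 1, 5) + optional_chain(n_chars, 2, 5)


def stopword_share(n_stopwords, n_chars):
    """Share of %[other] expansions that come from the stopword sentence."""
    stop = optional_chain(n_stopwords, 1, 2)
    nonsense = optional_chain(nonsense_count(n_chars), 1, 2)
    return Fraction(stop, stop + nonsense)


# ASSUMED sizes: 100 stopwords, 26 possible characters.
stop_expansions = optional_chain(100, 1, 2)
nonsense_values = nonsense_count(26)
share = stopword_share(100, 26)
```

With these sizes the stopword sentence has about a million expansions, while ~[nonsense] alone has about ten billion values and is then used three times in its sentence. The resulting stopword share is on the order of 1e-24, so no realistic sample size would surface a stopword example without some form of per-sentence weighting.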

rodrigopivi commented on May 29, 2024

The probability operator has been implemented as of v2.2.0.
