
Comments (9)

dinal24 commented on May 21, 2024

Hey @tmbo Thanks for the insights!

I'd like to keep that and rather separate model persistence and data loading

I think in this case we do something like introducing an AbstractDataLoader, which gets implemented for loading training data from the file system or MongoDB. An AbstractPersistor for persistence will cover AWS, GCS, and MongoDB (no change to the current implementation except introducing a MongoDB class).
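The split between loading and persistence described above can be sketched as follows (the class and method names here are hypothetical illustrations, not the actual rasa API):

```python
import json
from abc import ABC, abstractmethod

class AbstractDataLoader(ABC):
    """Loads training data from some backend (file system, MongoDB, ...)."""

    @abstractmethod
    def load_training_data(self):
        ...

class FileDataLoader(AbstractDataLoader):
    """Hypothetical file-system implementation of the loader interface."""

    def __init__(self, path):
        self.path = path

    def load_training_data(self):
        # read the training JSON from disk
        with open(self.path) as f:
            return json.load(f)
```

A MongoDB implementation would subclass the same interface, so the trainer never needs to know where the data came from.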

from rasa.

plauto commented on May 21, 2024

@amn41 So basically, we can extend that abstract class for MongoDB and store the training JSON (for example for wit or api) inside a MongoDB collection, right?

At this point, do you think it would be OK to perform the training using the POST /parse endpoint that we have? Here we could maybe specify a connector, host, username, password, and maybe a collection...
The do_POST method would then call the appropriate DB connector through the Data Router.

So for example we could think of providing POST /parse with:

  • the training data
{
  "provider": "training_data",
  "data": { ... }
}
  • a db connector and collection
{
  "provider": "mongoDB",
  "config": { 
      "host": "....", 
      "user": "....",
      "pass": "....",
      "collection": "...."
  }
}
  • or you can register a connector along with its configuration in config.json and you just refer to a specific connector/collection to trigger the training process:
{
  "provider": "mongoDB",
  "collection": "...."
}
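A minimal sketch of how the server side could dispatch on the three payload variants above (the payload contents and the `resolve_training_source` helper are hypothetical, not an existing rasa function):

```python
# Hypothetical payloads mirroring the three variants proposed above.
inline_payload = {"provider": "training_data", "data": {"common_examples": []}}
connector_payload = {
    "provider": "mongoDB",
    "config": {"host": "localhost", "user": "u", "pass": "p", "collection": "nlu"},
}
registered_payload = {"provider": "mongoDB", "collection": "nlu"}

def resolve_training_source(payload):
    """Dispatch on the 'provider' field, as the proposal sketches."""
    if payload["provider"] == "training_data":
        # training data shipped inline with the request
        return ("inline", payload["data"])
    if payload["provider"] == "mongoDB":
        if "config" in payload:
            # full connection details come with the request
            return ("mongo", payload["config"])
        # connector was pre-registered in config.json; only the
        # collection name is needed to trigger training
        return ("mongo-registered", payload["collection"])
    raise ValueError("unknown provider: " + payload["provider"])
```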


amn41 commented on May 21, 2024

Closing for now. If this becomes relevant again, we can reopen.


dinal24 commented on May 21, 2024

Hi guys! 😃

I had a chat with @tmbo regarding this and went through the source.

The Rasa-specific I/O operations I noticed are:

  • Training phase:
  1. Opening config file :

    config.json
    
  2. Loading training data :

     data.json
    
  3. Saving model :

     ner
        |——config.json
        |——model
     intent_classifier.pkl
     metadata.json
     training_data.json
    
  • Testing phase:
  1. Opening config file
  2. Loading model
  3. Logging

Considering loading training data, saving models, and loading models:

I would like to suggest and implement a solution as below:

  1. Introduce an abstract DataAdapter:

     from abc import ABC, abstractmethod

     class DataAdapter(ABC):
        @abstractmethod
        def read_training_data(self):
            pass
    
        @abstractmethod
        def save_model(self):
            pass
    
        @abstractmethod
        def read_model(self):
            pass
    
  2. DataAdapter gets implemented in several ways, e.g. MongoAdapter, FileAdapter, and S3Adapter.

  3. An adapter instance gets passed around to the Trainer, MetaData, etc.

  4. When data is required, call read_training_data(); similarly for the other operations.

  5. Leave the room for others to implement DataAdapter when needed. (e.g. DynamoAdapter for dynamodb)

  6. Delete persister.py, as S3Adapter and FileAdapter do that work. We may also remove some of the code in data_router.py.

  7. Change config.json, e.g:

     {
        ...
        "training_data": { "source": "file_system", "path": "./data/training_data.json" }
        ...
     }
    

    or

     {
        ...
        "training_data": { "source": "mongo_db", "host": "", ..., "collection": "" }
        ...
     }
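Wiring the config change in item 7 to the adapters in item 2 could look like this (a sketch under the proposed config layout; the adapter classes and factory are hypothetical):

```python
class FileAdapter:
    """Hypothetical file-system DataAdapter implementation."""
    def __init__(self, path):
        self.path = path

class MongoAdapter:
    """Hypothetical MongoDB DataAdapter implementation."""
    def __init__(self, host, collection):
        self.host = host
        self.collection = collection

def adapter_from_config(config):
    """Pick a DataAdapter implementation based on the 'source' field
    of the proposed training_data config block."""
    td = config["training_data"]
    if td["source"] == "file_system":
        return FileAdapter(td["path"])
    if td["source"] == "mongo_db":
        return MongoAdapter(td["host"], td["collection"])
    raise ValueError("unknown source: " + td["source"])
```

This keeps step 5 open as well: supporting a new backend only means adding a class and one branch (or a registry lookup) to the factory.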
    

What do you think?


tmbo commented on May 21, 2024

@dinal24 Thanks for the nice writeup. Some thoughts:

  • it should be possible to support "mixed" data adapters, e.g. reading the data from mongo but writing the models onto disk (I think writing the models into mongo is a rare use case anyway).
  • not sure yet about the configuration format, but that is a different topic we are currently thinking about (e.g. the nesting is tough if arguments should be passed in via command line or environment), so anything will be fine here for the time being.
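The "mixed" adapter idea in the first bullet can be sketched as a small composite that delegates to two independent backends (class names are hypothetical):

```python
class MixedDataAdapter:
    """Hypothetical composite adapter: delegates data loading to one
    backend and model persistence to another, so e.g. training data can
    come from MongoDB while models are written to disk."""

    def __init__(self, loader, persistor):
        self.loader = loader        # anything with read_training_data()
        self.persistor = persistor  # anything with save_model()

    def read_training_data(self):
        return self.loader.read_training_data()

    def save_model(self, model):
        return self.persistor.save_model(model)
```

Because loading and persistence are separate objects, any combination of backends works without writing a new adapter class per pair.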


dinal24 commented on May 21, 2024

Hey @tmbo, thanks for your thoughts. I've thought of an altered solution.

We keep the abstract DataAdapter and implement a default class like DefaultDataAdapter. It will include all the functionality of the latest rasa (S3 & file) plus Mongo.

      class DefaultDataAdapter(DataAdapter):
          def __init__(self, train_conf, s_model_conf, r_model_conf):
              # deduce the backends from the input params, e.g.:
              # self.train_type = 'mongodb'
              # self.train_param = 'mongodb://localhost:27020/mydb'
              # self.save_model_type = 'file'
              # self.save_model_param = './models'
              pass
  
          def read_training_data(self):
              if self.train_type == 'mongodb':
                  # load from mongo
                  pass
              elif self.train_type == 'file':
                  # load from fs
                  pass
  
          def save_model(self):
              # similarly
              pass
  
          def read_model(self):
              # similarly
              pass

We can keep the configuration the same: extract the information from config.json, create a DataAdapter instance, and pass it around as necessary. The MongoDB preference can also be supplied on the terminal as a standard Mongo URI string.
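Splitting such a Mongo URI into the pieces DefaultDataAdapter needs can be done with the standard library (a sketch; a real implementation would likely use pymongo's own URI parser, which also handles replica sets and options):

```python
from urllib.parse import urlsplit

def parse_mongo_uri(uri):
    """Split a standard MongoDB URI, e.g. 'mongodb://localhost:27020/mydb',
    into host, port, and database name."""
    parts = urlsplit(uri)
    return {
        "host": parts.hostname,          # 'localhost'
        "port": parts.port,              # 27020
        "db": parts.path.lstrip("/"),    # 'mydb'
    }
```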

If someone needs a different combination, they can easily implement DataAdapter themselves and use it.

I believe logs can also be stored in Mongo instead of the file system (if required). That would need another function like write_log(); the implementation can be done in the future.

I hope this addresses your concerns. 😀


tmbo commented on May 21, 2024

Alright, here are my thoughts. I'd rather keep it as simple as possible for now; we can always add more abstractions later if we feel it needs more structure. From your previous idea I really liked that every data source had its own class it was implemented in. I'd like to keep that and rather separate model persistence from data loading (that doesn't mean an implementation like Mongo can't use a helper class to share implementation details between the two).

That said, I'd love it if you could introduce an interface to load data (the interface to persist models already exists with two implementations) and integrate that. I think this is a good start to get you coding. Don't hesitate to share your PR early so we can continue to exchange ideas, you can still change it afterwards.


tmbo commented on May 21, 2024

There is now the possibility to fetch training data from an HTTP endpoint, which should be fine for most use cases.
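Fetching training data over HTTP can be as simple as the following sketch (the function name is illustrative; the actual rasa option name and payload format may differ):

```python
import json
from urllib.request import urlopen

def load_training_data_from_url(url):
    """Fetch training data as JSON from an HTTP endpoint."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```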


anushka17agarwal commented on May 21, 2024

Can we find these updates and how to use them in the documentation?

