rucio / donkeybot
🤖 Question Answering Bot for Rucio User Support (GSoC Project)
License: Apache License 2.0
Progress in the bot's creation and changes in the scope of the project require completely new documentation.
Change all current documentation to match the changes and add missing documentation for new modules and scripts.
While the QuestionDetector works well for the prototype, failing test cases indicate that bugs exist and an improvement is needed.
Look into the failing tests under test/test_question/test_detector
and fix and improve the QuestionDetector.
Running the AnswerDetector and the QAInterface in the example notebooks gives different results than running them through ask_donkeybot and the unittests,
and the reason is not yet known.
Look into the Answer Detector notebook, the ask_donkeybot script and the Answer Detector tests.
Try to find out why we get different results.
When detecting a question, if it has no context then there is no need to keep it in data_storage, since no answer will be found.
Update the question detection scripts to only insert questions into data_storage that have some context.
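The filter described above could be sketched as follows. This is only an illustration under stated assumptions: detected questions are represented as plain dictionaries with a `context` key, and `insert_question` is a hypothetical callable standing in for whatever insert method data_storage actually exposes.

```python
def has_context(question):
    """Return True when the question carries non-empty context text."""
    context = question.get("context")
    return bool(context and context.strip())


def store_questions(questions, insert_question):
    """Insert only the questions that have context; return how many were stored.

    `insert_question` is a hypothetical stand-in for the real storage call.
    """
    stored = 0
    for question in questions:
        if has_context(question):
            insert_question(question)
            stored += 1
    return stored
```

Questions with a missing, empty, or whitespace-only context are skipped before they reach storage.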
We started off expanding the SearchEngine class to work for general documentation.
There is a piece of code left that fails to do this.
An extra if statement inside SearchEngine changes the corpus attribute of the search engine
unless we specifically change its type attribute.
Change this if statement: https://github.com/rucio/donkeybot/blob/master/lib/bot/searcher/base.py#L157
and the SearchEngine class in general so that it does not automatically set the type attribute to Documentation Search Engine.
Another idea is for the Rucio Docs, which use this search engine, to be of type 'Rucio Docs Search Engine'. In that case the
if statement doesn't change; instead, every line where we call/create the SearchEngine changes.
Upon review the save() and load() methods inside the Fetcher classes are redundant.
Now that new data sources have been introduced (#8), our QuestionDetector will have to be improved to handle them.
Make the required changes to the QuestionDetector for question and context detection inside Rucio's GitHub issues.
The first place where DonkeyBot will look for an answer is the FAQ table which contains supervised QA pairs.
If we find an identical/very similar question in FAQ it should be what is returned by the bot with the highest confidence.
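One way to picture the FAQ lookup is a similarity check over the stored questions before any model is consulted. The sketch below is an assumption, not the bot's actual logic: it uses the standard library's `difflib.SequenceMatcher` as the similarity measure, treats the FAQ table as a list of `(question, answer)` tuples, and the `threshold` value is arbitrary.

```python
from difflib import SequenceMatcher


def best_faq_match(question, faq_pairs, threshold=0.9):
    """Return (answer, score) for the most similar FAQ question, or None.

    `faq_pairs` is an iterable of (faq_question, faq_answer) tuples.
    The score is a character-level similarity ratio between 0.0 and 1.0.
    """
    normalized = question.lower().strip()
    best = None
    for faq_question, faq_answer in faq_pairs:
        candidate = faq_question.lower().strip()
        score = SequenceMatcher(None, normalized, candidate).ratio()
        if best is None or score > best[1]:
            best = (faq_answer, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

When a match clears the threshold, the caller can return that answer ahead of any other source; when the function returns None, the bot would fall through to its other answer sources.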
We need to have a single script which when run fetches/parses/detects everything we need and builds the bot.
We also need good test coverage to automatically test the build and package.
Working on #26 showed that the faq.json file, which is located under data/,
is needed to store the FAQ created on my machine.
Thus, there is a need to add the data/ folder to our repository.
We should also fetch Rucio's documentation through github which will be used for context in answering general questions regarding Rucio.
These doc pages are then going to be indexed, so we might need to create the appropriate script for said job.
Related to #6 for the DocumentationSearchEngine
Besides the ask_donkeybot.py script, there is a need for a UI to be built.
It can be a web app, a Slack bot or some other interface for Rucio users to test the bot and speak to a server.
Data from the questions asked and the answers given can be gathered and then used to build an evaluation dataset.
Decide on and create a simple UI for Rucio users to query the bot.
Options:
There is a need to create documentation and wiki pages for any code up to this point (1st GSoC evaluation - June)
Add the documentation
Now that the module has an expanded scope and complexity, it needs a better structure.
This means folder restructuring, name changes and dependency changes.
All scripts will need to change to fit the new structure.
We need to test the clean_subject method better.
I think currently reply chains with subjects like "Re: Re: ... Re: subject" might not
be cleaned correctly.
Test if the above statement is correct and fix the clean_subject method
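To check the suspicion about chained reply prefixes, a small reference implementation can be compared against the real method. The function below is a hypothetical stand-in, not the project's clean_subject: it assumes only `Re:`, `Fw:` and `Fwd:` style prefixes (case-insensitive) need to be stripped from the start of the subject.

```python
import re

# One or more reply/forward prefixes at the start of the subject,
# e.g. "Re: RE: Fwd: ". Hypothetical pattern, written for this sketch only.
_REPLY_PREFIXES = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def strip_reply_prefixes(subject):
    """Remove every leading reply/forward prefix from an email subject."""
    return _REPLY_PREFIXES.sub("", subject).strip()
```

Cases such as `"Re: Re: RE: subject"` and a subject that merely starts with "Re" (for example `"Reminder: meeting"`) make useful test inputs for the real method too.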
There is a need to create a module to handle all of our searching/querying needs.
This module will hold a SearchEngineFactory and any classes related to our requirements.
Rucio's documentation is indexed and is used when querying simple questions the user asks whose answers might exist in our documentation.
Docker will help with the installation and make it OS agnostic.
The current installation is somewhat Windows specific, and not much testing has been done on other operating systems besides a Linux VM I rebuilt the bot on.
Add Docker support and package the bot through that.
This might even make setuptools obsolete, since we aren't really providing Donkeybot through pip.
So it's better to change it.
The __init__ method in EmailQuestion, CommentQuestion and IssueQuestion can simply use super
along with an origin parameter to simplify the object's construction.
Create an __init__ method in the Question superclass and utilize inheritance.
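The proposed refactoring might look roughly like this. The class names come from the issue, but the attributes (`question`, `context`) and the origin strings are assumptions made for the sketch, since the real constructors take whatever fields the project needs.

```python
class Question:
    """Base class holding the fields shared by every question type."""

    def __init__(self, question, context=None, origin=None):
        self.question = question
        self.context = context
        self.origin = origin  # where the question came from


class EmailQuestion(Question):
    def __init__(self, question, context=None):
        super().__init__(question, context, origin="email")


class IssueQuestion(Question):
    def __init__(self, question, context=None):
        super().__init__(question, context, origin="issue")


class CommentQuestion(Question):
    def __init__(self, question, context=None):
        super().__init__(question, context, origin="comment")
```

Each subclass then only states what differs (its origin), and shared attribute handling lives in one place.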
The conversation dict code was implemented early on in the project and needs refactoring.
There are also related tests that we skip.
It's currently hard to test and reuses code which exists in the EmailParser class.
Refactor the code and find a better way to test it.
We are currently skipping around 16 unittests; some of them have corresponding tickets open for bugfixes.
Look into the rest.
Figure out how to test the code for the skipped tests and change them as needed.
Now that the fetching of the Issues and Documentation data is complete (#8),
we need to create the appropriate classes that will handle any processing, metadata creation, etc. that the bot is going to need.
While facing some problems with our current email data and cleaning, there is a suggestion to fetch data from Rucio's issues in github. These are nice for more technical questions and matters regarding the codebase.
To serve the model and close the Docker Support/UI issues, we need to remove the emails from the current predictions.
Questions extracted from emails reside in the questions table along with questions from issues and issue comments.
This needs to change, either by saving them in a different table or by having the option, when predicting, to leave out questions from emails.
Even better would be an option to build the bot end-to-end without utilizing the emails.
The top level folder detector/, under which the question and answer folders reside, is redundant.
Restructure the codebase and remove the folder.
A crucial step in a Question Answering system is the actual answering part, so we need to create the bot's module that will provide our answer-detection capabilities.
Utilize the transformers library to integrate BERT and/or other models into our pipeline. The end result should have a module and related scripts for answer detection. When a question and context are given, an answer should be returned.
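As a rough sketch of the "question plus context in, answer out" contract, the function below wraps the transformers `question-answering` pipeline. It is an assumption about one possible shape, not the bot's module: `detect_answer` is a hypothetical name, the pipeline's default model is used rather than any model the project chose, and the `qa` parameter exists only so the model can be swapped or mocked.

```python
def detect_answer(question, context, qa=None):
    """Return (answer_text, confidence) for a question given its context.

    `qa` is any callable accepting `question=` and `context=` keywords and
    returning a dict with "answer" and "score" keys, which is the shape the
    transformers question-answering pipeline produces for a single input.
    """
    if qa is None:
        # Imported lazily so the sketch can be exercised without the model.
        from transformers import pipeline

        qa = pipeline("question-answering")
    result = qa(question=question, context=context)
    return result["answer"], result["score"]
```

Building the pipeline once and passing it in as `qa` avoids reloading the model on every call.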