rucio / donkeybot

🤖 Question Answering Bot for Rucio User Support (GSoC Project)

License: Apache License 2.0

Python 99.89% Dockerfile 0.11%

bert gsoc gsoc-2020 hacktoberfest natural-language-processing question-answering search-engine transfer-learning

donkeybot's Introduction

Rucio - Scientific Data Management

Rucio is a software framework that provides functionality to organize, manage, and access large volumes of scientific data using customisable policies. The data can be spread across globally distributed locations and across heterogeneous data centers, uniting different storage and network technologies as a single federated entity. Rucio offers advanced features such as distributed data recovery or adaptive replication, and is highly scalable, modular, and extensible. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously extended to support LHC experiments and other diverse scientific communities.

Documentation

General information, API/REST description and guides can be found in our documentation or on our webpage.

Try it out

We provide a dockerized environment which serves both as a demo environment and a development environment. It includes all the necessary preconfigured components for multiple storage and transfers developments.

Developers

For information on how to contribute to Rucio, please refer to and follow our CONTRIBUTING guidelines. We strongly recommend using the dockerized environment for development.

Operators

To learn how to deploy and configure Rucio, consult the documentation available online.

Getting Support

If you are looking for support, please contact us via one of our official channels.

donkeybot's People

Contributors

Stargazers

Watchers

Forkers

mageirakos satyam-kumar-yadav d2anubis

donkeybot's Issues

Change if statement in SearchEngine class

Motivation

We started off expanding the SearchEngine class to work for general documentation.
One piece of leftover code still prevents this.
An extra if statement inside SearchEngine changes the corpus attribute of the search engine
unless we specifically change its type attribute.

Modification

Change this if statement: https://github.com/rucio/donkeybot/blob/master/lib/bot/searcher/base.py#L157
and the SearchEngine class in general so that it does not automatically set the type attribute to Documentation Search Engine.
Another idea: the Rucio docs that use this search engine could be of type 'Rucio Docs Search Engine', so the
if statement doesn't change, but every line where we call/create the SearchEngine does.

Fetching of issue data

Motivation

While we face some problems with our current email data and its cleaning, there is a suggestion to fetch data from Rucio's issues on GitHub. These are well suited to more technical questions and matters regarding the codebase.

Modification

A fetching script for all the issues
A parsing module with an IssueParser class that parses those issues
Another script that uses the above together with the QuestionDetector to parse the issues and save them into our dataset
Additional modifications to QuestionDetector may be needed for this to merge correctly

Add documentation and wiki pages

Motivation

There is a need to create documentation and wiki pages for any code up to this point (1st GSoC evaluation - June)

Modification

Add the documentation

only insert questions with context on data_storage

Motivation

When detecting a question, if it has no context then there is no need to keep it on data_storage since no answer will be found.

Modification

Update the question detection scripts to only insert questions in data_storage that have some context.
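A minimal sketch of that filter, assuming each detected question is a plain dict with a hypothetical `context` key and that `insert` is whatever callable writes one row to data_storage. None of these names are taken from the Donkeybot codebase.

```python
def has_context(question):
    """Return True when the question carries non-blank context text."""
    context = question.get("context")  # "context" is a hypothetical key
    return bool(context and context.strip())


def insert_questions_with_context(questions, insert):
    """Call insert() only for questions with context; return how many were inserted."""
    inserted = 0
    for question in questions:
        if has_context(question):
            insert(question)
            inserted += 1
    return inserted
```

Passing the storage call in as `insert` keeps the filter testable without a database: a test can hand in `list.append` and inspect what was kept.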

Refactor conversation dict creation and unit tests

Motivation

The conversation dict code was implemented early on in the project and needs refactoring.
There are also related tests that we skip.
It's currently hard to test and duplicates code that exists in the EmailParser class.

Modification

Refactor the code and find a better way to test it

Option to not include emails in prediction

Motivation

To serve the model and close Docker Support/UI issues, we need to remove the emails from the current predictions.

Modification

Questions extracted from emails reside in the questions table along with questions from issues and issue comments.
This needs to change, either by saving them in a different table or by having an option to skip email questions when predicting.
Even better would be an option to build the bot end-to-end without using the emails at all.
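One way the prediction-time option could look is sketched below. It assumes the questions table has an `origin` column holding values such as `email`, `issue` and `issue_comment`; the column name, its values and the function names are assumptions for illustration, not the project's actual schema.

```python
import sqlite3


def fetch_questions(conn, include_emails=True):
    """Return (question, origin) rows, optionally skipping email-derived ones."""
    sql = "SELECT question, origin FROM questions"
    params = ()
    if not include_emails:
        sql += " WHERE origin != ?"
        params = ("email",)
    return conn.execute(sql + " ORDER BY rowid", params).fetchall()


def demo_connection():
    """Build an in-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE questions (question TEXT, origin TEXT)")
    conn.executemany(
        "INSERT INTO questions VALUES (?, ?)",
        [
            ("How do I add a rule?", "email"),
            ("Why does the upload fail?", "issue"),
            ("Is this a known bug?", "issue_comment"),
        ],
    )
    return conn
```

Calling `fetch_questions(conn, include_emails=False)` leaves the stored data untouched, so the same database can serve both modes.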

Fetching documentation data

Motivation

We should also fetch Rucio's documentation through GitHub, which will be used as context for answering general questions regarding Rucio.

Modification

A fetching script for most (at least the most important), if not all, of the documentation (some might need to be compiled or exist only in docstrings)
Additional parsing might be needed to clean up the documentation (if so, create that parser as well)

These doc pages are then going to be indexed, so we might need to create an appropriate script for that job.
Related to #6 for the DocumentationSearchEngine

Write unit tests and create build_donkeybot script

Motivation

We need to have a single script which when run fetches/parses/detects everything we need and builds the bot.
We also need good test coverage to automatically test the build and package.

Modification

Create build_donkeybot script
Create tests/. (utilize pytest)

Add data/ and faq.json file to repo

Motivation

Working on #26 showed that the faq.json file located under data/ is needed to store the FAQ created on my machine.
Thus, we need to add the data/ folder to our repository.

Modification

Appropriate modifications to .gitignore for .db files that shouldn't be pushed
Addition of data/faq.json

expand QuestionDetector for issue data

Motivation

Now that new data sources have been introduced (#8), our QuestionDetector will have to be improved to handle them.

Modification

Make the required changes to the QuestionDetector for question and context detection inside Rucio's GitHub issues

Creation of UI

Motivation

Besides the ask_donkeybot.py script, a UI needs to be built.
It can be a webapp, a Slack bot or some other interface for Rucio users to test the bot and speak to a server.

Data from these questions asked and answers given can be gathered and then used to build an evaluation dataset.

Modification

Decide on and create a simple UI for Rucio users to query the bot.
Options:

Slack bot
Flask webapp
other

Look into skipped unit-tests

Motivation

We are currently skipping about 16 unit tests; some of them have corresponding bugfix tickets open.
Look into the rest.

Modification

Figure out how to test the code for the skipped tests and change them as needed.

Look into Email Parser subject cleaning

Motivation

We need to test the clean_subject method better.
I think currently reply chains with subjects like "Re: Re: ... Re: subject" might not
be cleaned correctly.

Modification

Test if the above statement is correct and fix the clean_subject method
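The suspected failure can be checked against a small reference implementation. The sketch below is a standalone function, not the EmailParser method itself, and it assumes that only reply and forward markers ("Re:", "Fwd:", "Fw:", in any letter case) need to be stripped.

```python
import re

# One or more leading reply/forward markers such as "Re:", "RE :", "Fwd:" or "Fw:".
_PREFIX_CHAIN = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def clean_subject(subject):
    """Strip every leading reply/forward prefix and surrounding whitespace."""
    return _PREFIX_CHAIN.sub("", subject).strip()
```

Because the prefix group is repeated with `+`, a chain like "Re: Re: RE: subject" is removed in one pass, while a word that merely starts with "Re" (for example "Regarding") is left alone since no colon follows the marker.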

add RucioDocsParser and IssueParser

Motivation

Now that the fetching of the Issues and Documentation data is complete (#8),
we need to create the appropriate classes that will handle any processing, metadata creation, etc. that the bot is going to need.

Modification

Expand our parsing module to handle the new input data sources
Create the scripts that utilize the new classes/methods

Bugfix AnswerDetector

Motivation

Running the AnswerDetector and the QAInterface in the example notebooks gives different results
from ask_donkeybot and the unit tests, for reasons not yet understood.

Modification

Look into the Answer Detector notebook, the ask_donkeybot script and the Answer Detector tests.
Try to find out why we get different results.

utilize inheritance on Question objects

Motivation

The init methods in EmailQuestion, CommentQuestion and IssueQuestion can simply use super
along with an origin parameter to simplify the objects' construction.

Modification

Create an init method in the Question superclass and utilize inheritance.
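A rough sketch of the proposed shape follows. The class names come from this issue, but the constructor arguments (`question_text`) and the origin strings are assumptions chosen for illustration; the real classes likely carry more fields.

```python
class Question:
    """Superclass that owns the construction logic shared by all question types."""

    def __init__(self, question_text, origin):
        self.question = question_text
        self.origin = origin  # where the question was found


class EmailQuestion(Question):
    def __init__(self, question_text):
        super().__init__(question_text, origin="email")


class IssueQuestion(Question):
    def __init__(self, question_text):
        super().__init__(question_text, origin="issue")


class CommentQuestion(Question):
    def __init__(self, question_text):
        super().__init__(question_text, origin="comment")
```

Each subclass now only states its origin, so any field added to `Question.__init__` later is picked up by all three without further edits.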

Bugfix QuestionDetector

Motivation

While the QuestionDetector works well for the prototype, failing test cases indicate that bugs exist and an improvement is needed.

Modification

Look into the failing tests under test/test_question/test_detector, then fix and improve the QuestionDetector.

Docker Support

Motivation

Docker will help with the installation and make it OS agnostic.

The current installation is somewhat Windows specific, and not much testing has been done on other operating systems besides a Linux VM I rebuilt the bot on.

Modification

Add Docker support and package the bot through that 👍
This might even make setuptools obsolete, since we aren't really providing Donkeybot through pip.
So it's better to change it.

add save_dataframe() method to sqlite wrapper; remove Fetchers save() and load() methods

Motivation

Upon review, the save() and load() methods inside the Fetcher classes are redundant.

Modification

Add a save_dataframe() method to our sqlite wrapper which, in conjunction with the get_dataframe() method,
achieves the same results.
Modify the classes and the fetch_docs/fetch_issues scripts to work as intended

Restructure codebase

Motivation

Now that the module has an expanded scope and complexity, it needs a better structure.

Modification

Folder restructuring, name changes and dependency changes.
All scripts will need to change to fit the new structure.

Update documentation and add examples

Motivation

Progress in the bot's creation and changes in the scope of the project require completely new documentation.

Modification

Change all current documentation to match the changes and add missing documentation for new modules and scripts

Create Answer detection module

Motivation

A crucial step in a Question Answering system is the actual answering part. We need to create the bot's module that will provide our answer-detection capabilities.

Modification

Utilize the transformers library to integrate BERT and/or other models into our pipeline. The end result should be a module and related scripts for answer detection. When a question and context are given, an answer should be returned.
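The question-plus-context contract described above can be sketched without committing to a specific model. In the sketch below, `detect_answer` is a hypothetical name, and `qa_model` is assumed to be any callable that takes `question=` and `context=` keyword arguments and returns a dict with `answer` and `score` keys, which is the shape a Hugging Face question-answering pipeline returns for a single input.

```python
def detect_answer(question, context, qa_model):
    """Run qa_model on one question/context pair and return a small result dict.

    Returns None when there is no usable context, since an extractive model
    has nothing to pull an answer from in that case.
    """
    if not context or not context.strip():
        return None
    prediction = qa_model(question=question, context=context)
    return {"answer": prediction["answer"], "confidence": prediction["score"]}
```

With the transformers library installed, `qa_model` could be created with `transformers.pipeline("question-answering")`, which downloads a default model on first use. Injecting the model as an argument also lets the tests substitute a cheap stand-in for it.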

add FAQ table and expand search engine to use it

Motivation

The first place where DonkeyBot will look for an answer is the FAQ table which contains supervised QA pairs.
If we find an identical/very similar question in FAQ it should be what is returned by the bot with the highest confidence.

Modification

Add an FAQ table to the database with a supervised sample of question and answer pairs.
Expand the query.py script to utilize the FAQ table (e.g. -mfaq/--match_faq).
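To illustrate the "identical/very similar question" lookup, here is a small sketch that scores the query against every FAQ question. It uses character-level similarity from Python's standard `difflib` module as a stand-in; the issue does not say how similarity is measured, and the `match_faq` name, the dict-based FAQ and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't lower the score."""
    return " ".join(text.lower().split())


def match_faq(query, faq, threshold=0.8):
    """Return (answer, score) for the most similar FAQ question, or None.

    faq maps question strings to answer strings.
    """
    best_answer, best_score = None, 0.0
    for question, answer in faq.items():
        score = SequenceMatcher(None, normalize(query), normalize(question)).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    if best_score >= threshold:
        return best_answer, best_score
    return None
```

Returning `None` below the threshold lets the caller fall through to the other search engines instead of answering with a weak FAQ match.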

remove toplevel detector/ folder

Motivation

Top level folder detector/ under which question and answer folders reside is redundant.

Modification

Restructure codebase and remove the folder

Creation of Search Engine module

Motivation

There is a need to create a module to handle all of our searching/querying needs.
This module will hold a SearchEngineFactory and any classes related to our requirements.

Modification

An FAQSearchEngine that looks at the FAQ Table
A QuestionSearchEngine that looks at previous Questions asked inside of emails. With their context, we may then be able to answer the current query. This can be done for both emails and issues if we end up fetching issue data and parsing that as well (related to #5)
A DocumentationSearchEngine which has Rucio's documentation indexed and is used for simple user questions whose answers might exist in our documentation
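The three engines and the factory listed above could be wired together roughly as follows. The class names come from this issue; the `corpus` attribute values, the string keys and the `create` method are assumptions for illustration, and the actual search logic is left out.

```python
class SearchEngine:
    """Base class; each subclass names the corpus it searches."""

    corpus = None

    def describe(self):
        return f"{type(self).__name__} over {self.corpus}"


class FAQSearchEngine(SearchEngine):
    corpus = "faq"


class QuestionSearchEngine(SearchEngine):
    corpus = "questions"


class DocumentationSearchEngine(SearchEngine):
    corpus = "documentation"


class SearchEngineFactory:
    """Map a short key to the engine class that handles it."""

    _engines = {
        "faq": FAQSearchEngine,
        "question": QuestionSearchEngine,
        "documentation": DocumentationSearchEngine,
    }

    @classmethod
    def create(cls, kind):
        try:
            return cls._engines[kind]()
        except KeyError:
            raise ValueError(f"unknown search engine kind: {kind!r}") from None
```

Because each engine declares its own corpus, adding a new source means adding one subclass and one dictionary entry rather than another if statement inside a shared class.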
