donkeybot's Introduction

Rucio - Scientific Data Management

Rucio is a software framework that provides functionality to organize, manage, and access large volumes of scientific data using customisable policies. The data can be spread across globally distributed locations and across heterogeneous data centers, uniting different storage and network technologies as a single federated entity. Rucio offers advanced features such as distributed data recovery and adaptive replication, and is highly scalable, modular, and extensible. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously extended to support LHC experiments and other diverse scientific communities.

Documentation

General information, API/REST description and guides can be found in our documentation or on our webpage.

Try it out

We provide a dockerized environment which serves as both a demo environment and a development environment. It includes all the necessary preconfigured components for storage and transfer development.

Developers

For information on how to contribute to Rucio, please refer to and follow our CONTRIBUTING guidelines. We strongly recommend using the dockerized environment for development.

Operators

To learn how to deploy and configure Rucio, consult the documentation available online.

Getting Support

If you are looking for support, please contact us via one of our official channels.

donkeybot's People

Contributors

bari12, dependabot[bot], mageirakos, mlassnig, tomasjavurek

donkeybot's Issues

Change if statement in SearchEngine class

Motivation

We started expanding the SearchEngine class to work for general documentation.
One piece of leftover code still prevents this.
An extra if statement inside SearchEngine changes the corpus attribute of the search engine
unless we specifically change its type attribute.

Modification

Change this if statement: https://github.com/rucio/donkeybot/blob/master/lib/bot/searcher/base.py#L157
and the SearchEngine class in general so that it does not automatically set the type attribute to Documentation Search Engine.
Another idea: engines built for the Rucio docs could be given the type 'Rucio Docs Search Engine'. In that case the
if statement stays as it is, and every line where we call/create the SearchEngine changes instead.
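The second idea could look roughly like the sketch below. The code at base.py#L157 is not reproduced here, so the class body, the constructor parameters and the helper create_docs_engine are all assumptions made for illustration.

```python
class SearchEngine:
    """Hypothetical stand-in for the SearchEngine in lib/bot/searcher/base.py."""

    def __init__(self, se_type="Search Engine", corpus=None):
        # The caller states the type explicitly; nothing defaults to
        # "Documentation Search Engine" behind the caller's back.
        self.type = se_type
        self.corpus = corpus


def create_docs_engine(corpus):
    """Hypothetical call site: the docs engine names its own type."""
    return SearchEngine(se_type="Rucio Docs Search Engine", corpus=corpus)
```

With this shape, a type check inside the class only fires for engines that were deliberately created as docs engines.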

Fetching of issue data

Motivation

While facing some problems with our current email data and its cleaning, it was suggested to fetch data from Rucio's issues on GitHub. These are well suited to more technical questions and matters regarding the codebase.

Modification

  1. a fetching script for all the issues
  2. a parsing module with an IssueParser class that parses said issues
  3. another script that uses all of the above and the QuestionDetector to parse the issues and save them into our dataset
  4. additional modifications to the QuestionDetector may be needed for this to merge correctly
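As a rough sketch of step 2, the parser below turns one issue payload, shaped like the JSON objects returned by the GitHub REST API issues endpoint, into a small record. The ParsedIssue fields and the parse method are assumptions; the real IssueParser may keep different metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedIssue:
    # Hypothetical record; field names are chosen for illustration only.
    issue_id: int
    title: str
    body: str
    state: str
    creator: str


class IssueParser:
    """Hypothetical parser for a single GitHub issue payload."""

    def parse(self, raw: dict) -> Optional[ParsedIssue]:
        # The issues endpoint also returns pull requests, which carry a
        # "pull_request" key; those are skipped here.
        if "pull_request" in raw:
            return None
        return ParsedIssue(
            issue_id=raw["number"],
            title=(raw.get("title") or "").strip(),
            # "body" can be null for issues without a description.
            body=(raw.get("body") or "").strip(),
            state=raw.get("state", ""),
            creator=(raw.get("user") or {}).get("login", ""),
        )
```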

Add documentation and wiki pages

Motivation

There is a need to create documentation and wiki pages for any code up to this point (1st GSoC evaluation - June)

Modification

Add the documentation

only insert questions with context on data_storage

Motivation

When detecting a question, if it has no context then there is no need to keep it on data_storage since no answer will be found.

Modification

Update the question detection scripts to only insert questions in data_storage that have some context.
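A minimal sketch of such a filter, assuming each detected question is a dict with an optional "context" key (the real question objects may look different):

```python
def has_context(question: dict) -> bool:
    """True when the question carries non-empty, non-whitespace context."""
    context = question.get("context")
    return bool(context and context.strip())


def questions_to_store(questions):
    """Keep only the questions worth inserting into data_storage."""
    return [q for q in questions if has_context(q)]
```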

Refactor conversation dict creation and unit tests

Motivation

The conversation dict code was implemented early on in the project and needs refactoring.
There are also related tests that we skip.
It's currently hard to test and duplicates code that exists in the EmailParser class.

Modification

Refactor the code and find a better way to test it

Option to not include emails in prediction

Motivation

To serve the model and close Docker Support/UI issues, we need to remove the emails from the current predictions.

Modification

Questions extracted from emails reside in the questions table along with questions from issues and issue comments.
This needs to change, either by saving them in a different table or by having the option, when predicting, to skip questions from emails.
Even better would be an option to build the bot end-to-end without using the emails at all.
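The second approach (keep one table, filter at prediction time) might be sketched as follows. The questions table layout, the origin column and its values are assumptions; an in-memory SQLite database stands in for the real storage.

```python
import sqlite3


def fetch_questions(conn, include_emails=True):
    """Return (question, origin) rows, optionally skipping email questions."""
    sql = "SELECT question, origin FROM questions"
    params = ()
    if not include_emails:
        sql += " WHERE origin != ?"
        params = ("email",)
    return conn.execute(sql + " ORDER BY rowid", params).fetchall()


# Small in-memory table used only to demonstrate the filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (question TEXT, origin TEXT)")
conn.executemany(
    "INSERT INTO questions VALUES (?, ?)",
    [
        ("How do I add a rule?", "email"),
        ("Why does the upload fail?", "issue"),
        ("Is this a known bug?", "comment"),
    ],
)
```

Calling fetch_questions(conn, include_emails=False) then yields only the rows that came from issues and issue comments.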

Fetching documentation data

Motivation

We should also fetch Rucio's documentation through GitHub, which will be used as context when answering general questions regarding Rucio.

Modification

  1. a fetching script for the most important parts of the documentation, if not all of it (some might need to be compiled or exists only in docstrings)
  2. additional parsing might be needed to clean up the documentation (if so, create said parser as well)

These doc pages are then going to be indexed, so we might need to create the appropriate script for said job.
Related to #6 for the DocumentationSearchEngine

Write unit tests and create build_donkeybot script

Motivation

We need to have a single script which when run fetches/parses/detects everything we need and builds the bot.
We also need good test coverage to automatically test the build and package.

Modification

  1. Create build_donkeybot script
  2. Create tests/ (utilize pytest)

Add data/ and faq.json file to repo

Motivation

Working on #26 proved that the faq.json file, which is located under data/, is needed to store the FAQ created on my machine.
Thus, we need to add the data/ folder to our repository.

Modification

  1. Appropriate modifications to .gitignore for .db files that shouldn't be pushed
  2. Addition of data/faq.json

expand QuestionDetector for issue data

Motivation

Now that new data sources have been introduced (#8), our QuestionDetector will have to be improved to handle them.

Modification

Make the required changes to the QuestionDetector for question and context detection inside Rucio's github issues

Creation of UI

Motivation

Besides the ask_donkeybot.py script, a UI needs to be built.
It can be a webapp, a Slack bot or some other interface that lets Rucio users test the bot and talk to a server.

Data from the questions asked and the answers given can then be gathered and used to build an evaluation dataset.

Modification

Decide on and create a simple UI for Rucio users to query the bot.
Options:

  1. Slack bot
  2. Flask webapp
  3. other

Look into skipped unit-tests

Motivation

We are currently skipping ~16 unit tests. Some of them have corresponding bugfix tickets open.
Look into the rest.

Modification

Figure out how to test the code for the skipped tests and change them as needed.

Look into Email Parser subject cleaning

Motivation

We need to test the clean_subject method better.
I think currently reply chains with subjects like "Re: Re: ... Re: subject" might not
be cleaned correctly.

Modification

Test whether the above statement is correct and, if so, fix the clean_subject method.
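One way to handle stacked reply prefixes is a single regular expression that consumes every leading marker at once. This is a standalone sketch, not the EmailParser method itself; its signature and the set of prefixes handled ("Re", "Fw", "Fwd") are assumptions.

```python
import re

# One or more leading reply/forward markers such as "Re:", "RE:" or "Fwd:".
_PREFIXES = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", flags=re.IGNORECASE)


def clean_subject(subject: str) -> str:
    """Strip every leading reply/forward marker and surrounding whitespace."""
    return _PREFIXES.sub("", subject).strip()
```

Because the pattern is anchored at the start and requires a colon right after the marker, a subject such as "Reminder: meeting" is left untouched.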

add RucioDocsParser and IssueParser

Motivation

Now that the fetching of the Issues and Documentation data is complete (#8),
we need to create the appropriate classes that will handle any processing, metadata creation etc. that the bot is going to need.

Modification

  1. Expand our parsing module to handle the new input data sources
  2. Create the scripts that utilize the new classes/methods

utilize inheritance on Question objects

Motivation

The init method in EmailQuestion, CommentQuestion and IssueQuestion can simply use super
along with an origin parameter to simplify the objects' construction.

Modification

Create an init method in the Question superclass and utilize inheritance.
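A minimal sketch of that structure. The class names come from the issue text; the constructor parameters and the origin values are assumptions.

```python
class Question:
    """Superclass holding the construction logic shared by all questions."""

    def __init__(self, question, origin, context=None):
        self.question = question
        self.origin = origin
        self.context = context


class EmailQuestion(Question):
    def __init__(self, question, context=None):
        super().__init__(question, origin="email", context=context)


class IssueQuestion(Question):
    def __init__(self, question, context=None):
        super().__init__(question, origin="issue", context=context)


class CommentQuestion(Question):
    def __init__(self, question, context=None):
        super().__init__(question, origin="comment", context=context)
```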

Bugfix QuestionDetector

Motivation

While the QuestionDetector works well for the prototype, failing test cases indicate that bugs exist and an improvement is needed.

Modification

Look into the failing tests under test/test_question/test_detector and fix and improve the QuestionDetector.

Docker Support

Motivation

Docker will help with the installation and make it OS agnostic.

The current installation is somewhat Windows specific, and not much testing has been done on other OSes besides a Linux VM I re-built the bot on.

Modification

Add Docker support and package the bot through that.
This might even make setuptools obsolete, since we aren't really providing Donkeybot through pip.
So it's better to change it.

Restructure codebase

Motivation

Now that the module has an expanded scope and complexity, it needs a better structure.

Modification

Folder restructuring, name changes and dependency changes.
All scripts will need to change to fit the new structure.

Update documentation and add examples

Motivation

Progress in the bot's creation and changes in the scope of the project require completely new documentation.

Modification

Change all current documentation to match the changes and add the missing documentation for new modules and scripts.

Create Answer detection module

Motivation

A crucial step in a Question Answering system is the actual answering part. We need to create the bot's module that will provide our answer-detection capabilities.

Modification

Utilize the transformers library to integrate BERT and/or other models into our pipeline. The end result should be a module and related scripts for answer detection. When a question and context are given, an answer should be returned.
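A minimal sketch of the question-plus-context flow, assuming the Hugging Face transformers "question-answering" pipeline is the integration point. The function names load_qa_pipeline and detect_answer are hypothetical, and no particular model is implied.

```python
def load_qa_pipeline(model_name=None):
    """Build a transformers question-answering pipeline (downloads a model)."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import pipeline

    if model_name is None:
        return pipeline("question-answering")
    return pipeline("question-answering", model=model_name)


def detect_answer(qa, question, context):
    """Run any question-answering callable and return (answer, score)."""
    # The transformers pipeline returns a dict with "answer" and "score" keys.
    result = qa(question=question, context=context)
    return result["answer"], result["score"]
```

Keeping detect_answer independent of how the pipeline is built makes it possible to swap models, or to pass a stub callable in unit tests.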

add FAQ table and expand search engine to use it

Motivation

The first place where DonkeyBot will look for an answer is the FAQ table which contains supervised QA pairs.
If we find an identical/very similar question in FAQ it should be what is returned by the bot with the highest confidence.

Modification

  1. Add FAQ table in database with a supervised sample of question and answer pairs.
  2. Expand the query.py script to utilize the FAQ table (e.g. -mfaq/--match_faq)
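The two steps above could be sketched as below. The similarity measure (difflib from the standard library), the 0.8 threshold and the dict standing in for the FAQ table are all assumptions; only the -mfaq/--match_faq flag comes from the issue text.

```python
import argparse
from difflib import SequenceMatcher


def match_faq(query, faq, threshold=0.8):
    """Return (question, answer, score) for the closest FAQ entry, or None."""
    best = None
    for question, answer in faq.items():
        score = SequenceMatcher(None, query.lower(), question.lower()).ratio()
        if best is None or score > best[2]:
            best = (question, answer, score)
    # Only report a match when it is similar enough to be trusted.
    if best is not None and best[2] >= threshold:
        return best
    return None


def build_parser():
    """Argument parser with the FAQ flag named in the issue."""
    parser = argparse.ArgumentParser(description="Query the bot (sketch).")
    parser.add_argument(
        "-mfaq",
        "--match_faq",
        action="store_true",
        help="look for the query in the FAQ table first",
    )
    return parser
```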

remove toplevel detector/ folder

Motivation

Top level folder detector/ under which question and answer folders reside is redundant.

Modification

Restructure codebase and remove the folder

Creation of Search Engine module

Motivation

There is a need to create a module to handle all of our searching/querying needs.
This module will hold a SearchEngineFactory and any classes related to our requirements.

Modification

  1. An FAQSearchEngine that looks at the FAQ Table
  2. A QuestionSearchEngine that looks at previous Questions asked inside of emails. Moving forward, their context may let us answer the current query. This can be done for both emails and issues if we end up fetching issue data and parsing that as well (related to #5)
  3. A DocumentationSearchEngine which has Rucio's documentation indexed and is used when querying simple questions the user asks whose answer might exist under our documentation
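A rough sketch of how a SearchEngineFactory could hand out the three engines listed above. The engine class names come from the list; the factory's create method, the lookup keys and the empty class bodies are assumptions.

```python
class FAQSearchEngine:
    """Placeholder: would search the FAQ table."""


class QuestionSearchEngine:
    """Placeholder: would search previously asked questions."""


class DocumentationSearchEngine:
    """Placeholder: would search the indexed Rucio documentation."""


class SearchEngineFactory:
    # Hypothetical registry mapping a short name to an engine class.
    _engines = {
        "faq": FAQSearchEngine,
        "question": QuestionSearchEngine,
        "documentation": DocumentationSearchEngine,
    }

    @classmethod
    def create(cls, kind):
        """Instantiate the engine registered under the given name."""
        try:
            return cls._engines[kind]()
        except KeyError:
            raise ValueError(f"unknown search engine kind: {kind!r}") from None
```

Callers then ask for an engine by name, for example SearchEngineFactory.create("faq"), without importing the concrete classes.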
