
github-labeler's People

Contributors

aakankshaduggal, antter, harshad16, michaelclifford, sesheta

github-labeler's Issues

use .env file to manage environment variables

Can you add a .env file to your local repository and provide an .env.example file in the repo that provides a template for the expected environment variables required by your project?

Here is an example of how to use the dotenv package to read env variables into your notebooks.

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())

This issue came up due to this line, where I assume some variable is being set outside the notebook:
https://github.com/aicoe-aiops/github-labeler/pull/3/files#diff-35e4337b926c8d4cfdaefebb14768d4b7fca4988ca74f7f50b833333205e8619R39

make k-fold loop outside training and use same dataset for both models

Is your feature request related to a problem? Please describe.
Currently, to compare fastText and SVM, we create the subdatasets twice and perform k-fold cross validation on each. Since we take negative samples randomly, this introduces variance into the process. If fastText and SVM are trained on two different datasets that differ in how predictable they are, we cannot fairly compare the models.

Describe the solution you'd like
I would like the functions to be altered so both models use the same dataset. Instead of using 5-fold cross validation, we should randomize the negative sampling 5 times, train the models on 80% of each resulting dataset, and average the validation scores obtained on the remaining 20%.
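
A minimal sketch of the proposal above, not the project's actual code. The function name `repeated_holdout`, its parameters, and the `(train, test) -> score` callback shape are all assumptions made for illustration. Each round draws one negative sample and one 80/20 split, then hands that identical split to every model, so score differences come from the models rather than from the sampling.

```python
import random
from statistics import mean


def repeated_holdout(positives, negatives, models, n_rounds=5, train_frac=0.8, seed=0):
    """Score every model on the same randomly drawn datasets.

    `models` maps a model name to a hypothetical callback
    `train_and_score(train, test) -> float`.
    """
    rng = random.Random(seed)
    scores = {name: [] for name in models}
    for _ in range(n_rounds):
        # Draw the negatives once per round; all models share this draw.
        sampled = rng.sample(negatives, k=min(len(positives), len(negatives)))
        data = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
        rng.shuffle(data)
        cut = int(len(data) * train_frac)
        train, test = data[:cut], data[cut:]
        for name, train_and_score in models.items():
            scores[name].append(train_and_score(train, test))
    # Average the held-out scores over all rounds.
    return {name: mean(values) for name, values in scores.items()}
```

Here the fastText and SVM training code would be wrapped in two callbacks and passed in as `models`.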

Create pre-processing notebook

Is your feature request related to a problem? Please describe.
GitHub issue data is very noisy, and it would need a fair amount of pre-processing before it could be fed into most models.

Describe the solution you'd like
A notebook/script to preprocess the text with some options, and testing to see which one works the best

Create methods for visualizing word vectors

Is your feature request related to a problem? Please describe.
fastText is a blackbox model and I would like to visualize how it works.

Describe the solution you'd like
A notebook where a visualization of word vectors is done

Make visualization look cooler

Is your feature request related to a problem? Please describe.
Currently, the ft_viz notebook trains an unsupervised model on the small openshift/origin dataset or something similar. This is really not enough data to give the word vectors a meaningful interpretation. We now have a much more powerful word vector model that would be far more interesting to visualize.

Describe the solution you'd like
Change the ft_viz notebook to import the newest word vector model from ceph, and perform our visualizations on that model instead.

New minor release

Hey, Kebechet!

Create a new minor release, please.

(just want to see what happens)

functionality to exclude bots from dataset

Is your feature request related to a problem? Please describe.
There are often bots that create issues or tag labels that we do not want to include in the dataset.

Describe the solution you'd like
A way to exclude certain users/bots from counting their issues or tags.
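
One way this could look, as a sketch under assumptions: issues are taken to be dictionaries with a `created_by` field holding the author's login (a hypothetical field name, not confirmed by the project), and `exclude_users` is an illustrative helper. GitHub App accounts carry a `[bot]` suffix in their login, so those can be dropped automatically; other bot accounts have to be listed explicitly.

```python
def exclude_users(issues, excluded_logins, drop_bot_suffix=True):
    """Return only the issues whose author is not an excluded user or bot.

    `issues` is a list of dicts; the "created_by" key is an assumed schema.
    """
    excluded = {login.lower() for login in excluded_logins}
    kept = []
    for issue in issues:
        login = issue.get("created_by", "").lower()
        if login in excluded:
            continue  # explicitly excluded account
        if drop_bot_suffix and login.endswith("[bot]"):
            continue  # GitHub App accounts end in "[bot]"
        kept.append(issue)
    return kept
```

The same filter could be applied to label events if the dataset records who applied each label.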

Make README more user-friendly

Is your feature request related to a problem? Please describe.
Currently, the README gives a high-level overview of what the project sets out to do. It gives no explanation of the code that exists within the project or how to use it.

Describe the solution you'd like
A thorough, enhanced README that explains ALL the code would be useful.

Reduce fastText models size

Is your feature request related to a problem? Please describe.
fastText has proven itself to be quite useful in predicting github labels, but the models are way too big, especially for the pre-trained version. It is not convenient to save multiple models and load them in the app. Some care has already been taken to reduce model size, such as reducing vocabulary and vector size in the src/data/build_w2v_vocab notebook. More can still be done to significantly reduce the size.

Describe the solution you'd like

There are two main ideas. One is to quantize the model, which the fastText Python package supports directly. The other is to throw out even more vocabulary (e.g. words that never appear in our target issues) and reduce the vector size yet again. We could have a fine-tuned vector model that is useful for anyone, and smaller, specific ones per user.

Demo the opf/support bot and get it connected!

Is your feature request related to a problem? Please describe.
We need to actually implement this somewhere. A good place would be operate-first/support

Describe the solution you'd like
We need a github app that can label github issues on the opf/support repo. The app has been created and can be found here: https://github.com/apps/issue-labeler-opfirst

Note that it is not deployed yet; it has only been tested locally.

This issue is currently blocked by operate-first/continuous-delivery#18

The image needs to be updated in order for the app to actually work.

create app

Describe the solution you'd like
There must be an app that takes an issue title and body as input and returns a list of label names as output.

Make a blog post

Is your feature request related to a problem? Please describe.
Right now, there is a lot of work done that is not very well explained.

Describe the solution you'd like
A blog post would be a good way to thoroughly explain the work done and the motivations behind it.

Application cannot be managed by Kebechet due to it containing an unsupported package location.

Kebechet cannot support maintaining this application as it contains a local
version of packages.

The package causing the issue is - src
Linked SHA - 9e6e308

For more information, see Pipfile and Pipfile.lock.

Environment details

Kebechet version: 1.5.4
Python version: 3.8.6
Platform: Linux-4.18.0-305.19.1.el8_4.x86_64-x86_64-with-glibc2.2.5
pipenv version: pipenv, version 2020.11.15


/kind bug
/priority critical-urgent

Create overlay for bot

Is your feature request related to a problem? Please describe.
The bot needs to be deployed to OpenShift.

Describe the solution you'd like
The best way to do this is probably to create an overlay for the bot so its own image can be built and ArgoCD will create its pod.

Replace use_ceph with environment variables

Is your feature request related to a problem? Please describe.

Currently, every notebook and script that uses storage contains a hard-coded line saying use_ceph = True. This is inefficient, because switching storage means editing each file.

Describe the solution you'd like

There should be a USE_CEPH environment variable or something similar.
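
A small sketch of how the flag could be read, assuming the variable is named `USE_CEPH` as suggested above. The helper `env_flag` and the set of accepted truthy spellings are illustrative choices, not an existing project API.

```python
import os


def env_flag(name, default=False, environ=os.environ):
    """Interpret an environment variable as a boolean flag."""
    raw = environ.get(name)
    if raw is None:
        return default
    # Accepted spellings are an assumption; adjust to taste.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Replaces the hard-coded `use_ceph = True` in notebooks and scripts.
use_ceph = env_flag("USE_CEPH")
```

Combined with a .env file loaded through `load_dotenv(find_dotenv())`, this would let each user toggle Ceph storage without editing any notebook.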

preprocess data that enters app

The app still applies an old preprocessing technique to incoming data. It should import the proper preprocessing technique from the relevant notebook/script instead.

Kebechet update manager: KeyError - 420da9ba23

Description

This is an automated issue generated by Kebechet. The update manager threw an exception (KeyError) at
runtime. If you think this exception is a bug please open an issue upstream at https://github.com/thoth-station/kebechet
otherwise use the traceback below to help you fix whatever issues were encountered with your repository.

Traceback

Traceback (most recent call last):
  File "/home/user/kebechet/kebechet_runners.py", line 193, in run
    instance.run(**manager_configuration)
  File "/home/user/kebechet/managers/update/update.py", line 919, in run
    result = self._do_update(
  File "/home/user/kebechet/managers/update/update.py", line 762, in _do_update
    old_environment = self._get_all_packages_versions()
  File "/home/user/kebechet/managers/update/update.py", line 210, in _get_all_packages_versions
    "version": package_info["version"][len("==") :],
KeyError: 'version'

adjust app so it can handle empty requests

Is your feature request related to a problem? Please describe.
Sometimes issues are empty, or they become empty after preprocessing. In this case an error occurs.

Describe the solution you'd like
We want to return nothing, not create an error.
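
A sketch of the guard described above, under assumptions: `predict_labels`, `preprocess`, and `classify` are hypothetical names standing in for the app's real functions, and "return nothing" is interpreted here as returning an empty list of labels.

```python
def predict_labels(title, body, preprocess, classify):
    """Return predicted label names, or an empty list for empty input.

    `preprocess` (str -> str) and `classify` (str -> iterable of labels)
    are placeholders for the app's own functions.
    """
    text = preprocess(f"{title or ''} {body or ''}").strip()
    if not text:
        # Issue was empty, or became empty after preprocessing.
        return []
    return list(classify(text))
```

The check runs after preprocessing, so it also covers issues that only contained content the preprocessing strips out.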

Add Markdown Descriptions to Each Notebook

As a Data Scientist and an end-user of the notebooks, it is easier to understand the objective of the notebook if we have a brief introduction header and a concluding footer.

Acceptance Criteria :

Polish the existing notebooks with -

  • A heading
  • An Introduction
  • A conclusion for cells that have visualizations or interesting insights
  • A final conclusion with future work overview.

Explore usability on operate-first

Is your feature request related to a problem? Please describe.
Now that the project is practically done and usable, we need to see which repos it can help out in. Most repos in operate-first have little to no labelled issue data, so we will have to explore to see what works.

Describe the solution you'd like
The pipeline should be run on a handful of different repos to see which gives the best-looking results for an issue-labeler.

Application cannot be managed by Kebechet due to it containing an unsupported package location in rhel:8 environment.

Kebechet cannot support maintaining this application as it contains a local
version of packages.

The package causing the issue is - src
Linked SHA - 554cb47

For more information, see Pipfile and Pipfile.lock.

Environment details

Kebechet version: 1.6.6
Python version: 3.8.8
Platform: Linux-4.18.0-305.10.2.el8_4.x86_64-x86_64-with-glibc2.2.5
pipenv version: pipenv, version 2020.11.15


/kind bug
/priority critical-urgent

Update pipeline

Is your feature request related to a problem? Please describe.
The Elyra pipeline should be updated to include the preprocessing step.

Describe the solution you'd like
In the .pipeline file, the notebooks/preprocess.ipynb notebook should be run after the data extraction and before the model training.

see how pretrained fastText model can improve performance

Is your feature request related to a problem? Please describe.
The current fastText model has to learn language from scratch, which is difficult or impossible with little training data.

Describe the solution you'd like
I would like there to be a method to use a pretrained fastText model, downloading either the generic Wikipedia model or creating an unsupervised pretrained model trained on github issues.

Additional context
Pretrained English model available at https://fasttext.cc/docs/en/english-vectors.html

introduce balanced negative sampling amongst issues

Is your feature request related to a problem? Please describe.
When negative samples are taken at random, the most popular labels overwhelm the dataset. If these labels are easy to predict, such as "bot", the binary classification problem becomes too trivial.

Describe the solution you'd like
In the model notebook, add a function that takes negative samples evenly from the other labels.
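
A possible shape for such a function, as a sketch only. The name `balanced_negative_sample` and the `(issue_id, set_of_labels)` input format are assumptions. It draws negatives round-robin, one label at a time, so a dominant label such as "bot" contributes no more than any other label until the smaller pools run dry.

```python
import random
from collections import defaultdict


def balanced_negative_sample(issues, target_label, n_samples, seed=0):
    """Pick up to `n_samples` issue ids without `target_label`,
    spread as evenly as possible across the other labels.

    `issues` is a list of (issue_id, set_of_labels) pairs (assumed format).
    Issues with no labels at all are never selected.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for issue_id, labels in issues:
        if target_label in labels:
            continue  # positives are not candidates
        for label in labels:
            by_label[label].append(issue_id)
    # Shuffle each label's pool once; sorting keeps the result reproducible.
    pools = {label: rng.sample(ids, len(ids)) for label, ids in sorted(by_label.items())}
    chosen, seen = [], set()
    while len(chosen) < n_samples and any(pools.values()):
        for pool in pools.values():
            # Take one not-yet-chosen issue from this label, if any remain.
            while pool:
                candidate = pool.pop()
                if candidate not in seen:
                    seen.add(candidate)
                    chosen.append(candidate)
                    break
            if len(chosen) == n_samples:
                break
    return chosen
```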

Kebechet pipfile-requirements manager: ValueError - caa120a9ae

Description

This is an automated issue generated by Kebechet. The pipfile-requirements manager threw an exception (ValueError) at
runtime. If you think this exception is a bug please open an issue upstream at https://github.com/thoth-station/kebechet
otherwise use the traceback below to help you fix whatever issues were encountered with your repository.

Traceback

Traceback (most recent call last):
  File "/home/user/kebechet/kebechet_runners.py", line 193, in run
    instance.run(**manager_configuration)
  File "/home/user/kebechet/managers/pipfile_requirements/pipfile_requirements.py", line 94, in run
    else sorted(self.get_pipfile_requirements(file_contents))
  File "/home/user/kebechet/managers/pipfile_requirements/pipfile_requirements.py", line 46, in get_pipfile_requirements
    raise ValueError(
ValueError: Package src does not use pinned version: {'editable': True, 'path': './'}

Kebechet version manager: FileNotFoundError - b50e189391

Description

This is an automated issue generated by Kebechet. The version manager threw an exception (FileNotFoundError) at
runtime. If you think this exception is a bug please open an issue upstream at https://github.com/thoth-station/kebechet
otherwise use the traceback below to help you fix whatever issues were encountered with your repository.

Traceback

Traceback (most recent call last):
  File "/home/user/kebechet/kebechet_runners.py", line 193, in run
    instance.run(**manager_configuration)
  File "/home/user/kebechet/managers/version/version.py", line 460, in run
    changelog = self._compute_changelog(
  File "/home/user/kebechet/managers/version/version.py", line 306, in _compute_changelog
    with open("CHANGELOG.md", "r+") as changelog_file:
FileNotFoundError: [Errno 2] No such file or directory: 'CHANGELOG.md'
