osslab-pku / gfi-bot
[Work in Progress] ML-powered 🤖 for finding and labeling good first issues in your GitHub project!
Home Page: https://gfibot.io
License: GNU General Public License v3.0
With the revisions in #32, it does not make sense to include the current information on the per-issue page.
Ideally, a user should be able to absorb more information from the per-issue page, but we do not need comprehensive coverage of everything (otherwise, the user may simply refer to the GitHub issue page).
My proposed revision is to present the issue description as expandable content for each issue item (similar to the design here). We may include the full issue description, but when the description gets too long, it should be truncated.
It is also possible to display other detailed information, but I currently do not have a good idea.
The names of the options should be more specific and self-explanatory. My suggestions:
These sorting and filtering options should be made more visible to end users, with a visual layout like:
Users can choose one option from all options in the "Sort By" line, and add or remove filtering conditions in the two "Filter By" lines. The second line filters by GitHub tags. Each option in the "Sort By" line should have a tooltip text explaining its meaning.
Currently, all documentation is severely outdated regarding collecting data, training models, understanding the code structure, and deploying the backend & frontend. My proposal is to create a separate DEVELOPMENT.md in the project root folder to explain how to run and deploy GFI-Bot, with the following sections:
- gfibot.collections
Then, all outdated content in README.md can be replaced by a link to DEVELOPMENT.md.
Currently, many open issues in some projects have a GFI probability of 99.99%, and some of these issues clearly should not be marked as GFI.
The performance metric of the model is also unusually high.
I examined the code and found two features that may be problematic. The first is 'created_at_timestamp', which is not one of the designed features and should not be included in X (see def get_x_y() in gfibot/model/utils.py). The second is 'rpt_gfi_ratio': when I drop this feature, the model performance metrics drop significantly.
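If both features turn out to be leaky, the exclusion step could be sketched as follows. This is only an illustration: the column names come from the issue, but the real get_x_y() in gfibot/model/utils.py may organize features quite differently.

```python
# Hedged sketch: drop the problematic columns before assembling X.
EXCLUDED_FEATURES = {"created_at_timestamp", "rpt_gfi_ratio"}

def build_feature_matrix(records):
    """Build X from per-issue feature dicts, skipping excluded columns.

    `records` is a list of {feature_name: value} dicts, one per issue;
    returns the matrix X and the ordered list of kept feature names.
    """
    names = sorted(k for k in records[0] if k not in EXCLUDED_FEATURES)
    X = [[float(r[k]) for k in names] for r in records]
    return X, names
```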
The problems can be solved by the following steps:
- Line 135 in 7ed0761
- gfi-bot/gfibot/model/dataloader.py, line 112 in 7ed0761
- gfi-bot/gfibot/model/dataloader.py, line 118 in 7ed0761: issues = [i for i in user.issues if i.closed_at <= t] should be created for calculating gfi_ratio and gfi_num later.
- gfi-bot/gfibot/data/dataset.py, line 205 in 7ed0761
After the above features are corrected, most prediction probabilities may end up close to 0 because of the imbalance between positive and negative instances in the training data. This can be addressed by balancing the training dataset with methods such as SMOTE or ADASYN. Then we can check whether the '99.99% probability' problem is solved.
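For balancing, the imbalanced-learn package provides SMOTE and ADASYN, which synthesize new minority samples. As a minimal, dependency-free sketch of the idea, here is naive random oversampling (duplicating minority samples until the classes are balanced); the function name and interface are hypothetical:

```python
import random

def random_oversample(X, y, seed=0):
    """Naively balance a binary (0/1) dataset by duplicating minority
    samples. A simple stand-in for SMOTE/ADASYN, which instead
    synthesize new samples by interpolating between neighbors."""
    rng = random.Random(seed)
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    minority_label = 1 if minority is pos else 0
    # Duplicate random minority samples until both classes have equal size
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    X_res = majority + minority + extra
    y_res = ([1 - minority_label] * len(majority)
             + [minority_label] * (len(minority) + len(extra)))
    return X_res, y_res
```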
Hello,
I recently downloaded the dataset.bson and resolved_issue.bson datasets from your project on Zenodo. However, I could not find detailed descriptions of the fields contained in these datasets within the repository documentation. I need this information to properly understand and utilize the dataset.
Could you please provide a detailed description of these data fields, or direct me to where I might find this information? It is crucial for my research.
Thank you!
Currently, the model does not take text into consideration. This affects the perceived quality and validity of recommended GFIs, as the model does not learn anything from text and only learns from historical features.
My proposal is to add two lightweight TF-IDF vectors (e.g., 50 dimensions) to learn from the title and description, respectively. Of course, additional effort needs to be spent on carefully parsing the text into bags of words (with stemming, etc.).
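In practice this would likely use scikit-learn's TfidfVectorizer with max_features=50 plus a stemmer; as a self-contained sketch of the computation (all names here are hypothetical), a tiny TF-IDF vectorizer looks like:

```python
import math
import re
from collections import Counter

def tfidf_vectors(docs, dim=50):
    """Map each document to a `dim`-dimensional TF-IDF vector over the
    most frequent tokens. Illustrative only: a production version would
    add stemming and use a library vectorizer."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Vocabulary: the `dim` most common tokens across all documents
    vocab = [w for w, _ in
             Counter(t for doc in tokenized for t in doc).most_common(dim)]
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    n = len(docs)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[w] / len(doc)) * math.log((1 + n) / (1 + df[w]) + 1)
            if doc else 0.0
            for w in vocab
        ])
    return vectors, vocab
```

One such vectorizer would be fit on titles and a second one on descriptions, with the two 50-dimensional vectors concatenated onto the existing historical features.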
GFI-Bot needs to implement:
As the first step, we need minimal working code for both frontend and backend.
For the frontend, we expect to have a basic React project showing a Home page, a navigation bar, and some copyright and about notices at the foot of each page. We also need a "Repositories" page to list currently registered repositories (in the gfibot.repos collection) and display basic statistics for those repositories. Since the number of repositories may become large, this page needs pagination. For the Home page, we need basic information about GFI-Bot and a three-column description like that on this page. There is no need to fill in text on the Home page; we will fill it in later.
For the backend, we expect to have some APIs to return currently registered repositories as a paginated list.
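The pagination logic such an API needs is straightforward; a minimal sketch (the function name, response shape, and backend framework are all assumptions, not the project's actual API):

```python
def paginate(items, page, per_page=10):
    """Return one 1-indexed page of `items` plus paging metadata,
    as a 'list registered repositories' endpoint might."""
    total = len(items)
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "page": page,
        "per_page": per_page,
        "total": total,
        "pages": -(-total // per_page),  # ceiling division
    }
```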
The main file changes should be made in the frontend
folder and the gfibot/backend
folder. You can also add tests, new dependencies, and new GitHub workflows (.github/workflows
) where necessary.
Dataset construction needs to be recorded in MongoDB with its history and current progress, which can be used for:
We need to show users more information in the frontend about the model training and evaluation results, including:
The training of RecGFI requires, for each issue, the overall development experience of every issue participant. For each participant, we choose to estimate this from their GitHub profile, with two key requirements:
Implementing an approach to collect a comprehensive GitHub profile while satisfying the above requirements can be hard. Therefore, we resort to collecting only user-created repositories, issues, and pull requests using the GitHub GraphQL API, because these statistics are both timestamped and support time-related queries (see the User API). The collected data should be saved in the gfibot.users MongoDB collection following the schema provided in schemas/users.json.
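As an illustration of such a time-bounded query (not the project's actual query; the exact fields the collector needs may differ), a user's issues since a given date can be fetched with GraphQL's filterBy argument:

```python
# Hedged sketch of a time-bounded GitHub GraphQL query for one user.
# `filterBy: {since: ...}` is supported on User.issues; repositories and
# pull requests would need their own, differently shaped queries.
USER_ISSUES_QUERY = """
query($login: String!, $since: DateTime!) {
  user(login: $login) {
    issues(first: 100, filterBy: {since: $since}) {
      totalCount
      nodes { createdAt closedAt }
    }
  }
}
"""

def build_payload(login, since_iso):
    """Assemble the JSON payload to POST to https://api.github.com/graphql."""
    return {"query": USER_ISSUES_QUERY,
            "variables": {"login": login, "since": since_iso}}
```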
gfibot/data/update.py
gfibot/data/graphql.py
tests/data/test_update.py
tests/data/test_graphql.py
schemas/users.json
You can add new dependencies, create new tests, or add new GraphQL query files where necessary.
Implement and add tests for incremental fetching of GitHub user stats for a given username using the GitHub GraphQL API in gfibot/data/graphql.py.
Implement and test the following function in gfibot/data/update.py to call your implementation and save the final data (adhering to the schema in schemas/users.json) in MongoDB.
def update_user(user: str, since: datetime) -> None:
# TODO: We need an efficient approach to fetch user profile from GitHub,
# we may use the GraphQL API with more user-related features than the REST API
raise NotImplementedError()
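The incremental part of this stub might eventually look something like the following sketch. Everything here beyond the issue text is hypothetical: the merge helper, the record keys, and the three collection names are illustrative, not the project's actual design.

```python
from datetime import datetime

def merge_incremental(existing, fetched, since):
    """Hypothetical sketch of the incremental step inside update_user():
    keep stored records created before `since` and append freshly fetched
    ones, so repeated runs only query GitHub for the new time window."""
    merged = {}
    for key in ("issues", "pulls", "repos"):
        old = [r for r in existing.get(key, []) if r["created_at"] < since]
        merged[key] = old + fetched.get(key, [])
    return merged
```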
You can run the update_repo() function of gfibot/data/update.py to test whether your contribution works!

To align with the requirements of a research paper, the following evaluations need to be conducted:
All evaluation data should be stored in the MongoDB database (for visualization in the frontend in the future).
Currently, gfibot.data.backend does not have any documentation. It would be helpful to have documentation describing the behavior of each exposed RESTful API.