ullaakut / astronomer

A tool to detect illegitimate stars from bot accounts on GitHub projects

License: MIT License

Go 99.55% Dockerfile 0.45%

bot-detection fraud-detection github github-api open-source

astronomer's Issues

Support windows paths for cache

Using the same GitHub personal token on Linux (Windows Subsystem for Linux) and on Windows on the same PC, the Linux build of astronomer works fine, but the Windows executable fails with an error like:

astronomer.exe username/repo

Beginning fetching process for repository username/repo
Pre-fetching all stargazers...ko
✖ failed to query stargazer data: unable to write user contribution data to cache: 
unable to create cache file: open data\username\repo\https-api-github-com-graphql-list-firstpage: 
The system cannot find the path specified.
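
One possible cause, which the report does not confirm, is that the nested cache directory is never created before the file is opened. The following is a minimal sketch under that assumption. The helper names writeCacheFile and roundTrip are hypothetical and are not taken from astronomer's code. It builds the path with filepath.Join, so the separator matches the host OS, and creates the parent directories first.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeCacheFile is a hypothetical helper, not astronomer's actual code.
// It joins the path segments with the OS-specific separator and creates
// the parent directories before writing the cache entry.
func writeCacheFile(cacheDir, owner, repo, entry string, body []byte) (string, error) {
	path := filepath.Join(cacheDir, owner, repo, entry)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return "", fmt.Errorf("unable to create cache directory: %w", err)
	}
	if err := os.WriteFile(path, body, 0o644); err != nil {
		return "", fmt.Errorf("unable to create cache file: %w", err)
	}
	return path, nil
}

// roundTrip writes one entry under a temporary directory and reads it back.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "astronomer-cache-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path, err := writeCacheFile(dir, "username", "repo",
		"https-api-github-com-graphql-list-firstpage", []byte("{}"))
	if err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	content, err := roundTrip()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cached:", content)
}
```

If the directories already exist, os.MkdirAll returns nil, so the call is safe to repeat for every cache write.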

Publish GitHub package

Create image shields endpoint

See https://shields.io/endpoint

Host it at astronomer.ullaakut.eu.

Code Reviews / Pull Requests

Hello. Nice project. I ran it on appsquickly/typhoon, which is still active but is slowly being sunsetted.

Was surprised that it only reported a 'B' score. Report as follows:

Beginning fetching process for repository appsquickly/Typhoon
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 2656
  > Selecting 800 random stargazers out of 2656
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              20234             B
Private contributions:               447               A
Created issues:                      12                C
Commits authored:                    227               C
Repositories:                        40                A
Pull requests:                       10                D
Code reviews:                        2                 E
Account age (days):                  2275              A
5th percentile:                      26                A
10th percentile:                     70                A
15th percentile:                     106               A
20th percentile:                     192               A
25th percentile:                     313               A
30th percentile:                     495               A
35th percentile:                     626               A
40th percentile:                     968               A
45th percentile:                     1181              A
50th percentile:                     1470              A
55th percentile:                     2192              A
60th percentile:                     2586              B
65th percentile:                     3969              B
70th percentile:                     5271              B
75th percentile:                     7115              B
80th percentile:                     10357             B
85th percentile:                     14953             C
90th percentile:                     34799             A
95th percentile:                     135676            A
----------------------------------------------------------
Overall trust:                                         B

✔ Analysis successful. 1000 users computed.

What does the pull-requests metric mean? That the project didn't have many pull requests? Or that the users who starred the project don't make many?

I gave trusted committers push access. <-- Maybe this is useful?

Again does that mean the users who starred the repo didn't do code reviews, or that we didn't?

Just sharing some #random feedback. Please close this issue once received. Again, very nice project.

Repos with between 201 and 219 stars all have 0% trust

Currently, we generate one report for the first 200 users and another one for the rest. For these repositories, the rest amounts to fewer than 20 users, so the second report is empty and takes priority during the computation.

This needs to be fixed asap.
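
One way to guard against this, sketched under assumptions: the report type, the minUsers threshold of 20, and the user-weighted average below are all hypothetical and are not taken from the trust package. The idea is only that a report computed on too few users is skipped instead of overriding the other one.

```go
package main

import "fmt"

// report is a hypothetical, simplified stand-in for a trust report.
type report struct {
	users int     // number of stargazers the report was computed on
	trust float64 // trust factor between 0 and 1
}

// minUsers is an assumed threshold below which a report is ignored.
const minUsers = 20

// overallTrust averages the trust of all sufficiently large reports,
// weighted by the number of users each one covers. The boolean result
// is false when no report is large enough to be used.
func overallTrust(reports ...report) (float64, bool) {
	var weighted float64
	var total int
	for _, r := range reports {
		if r.users < minUsers {
			continue // too small to be meaningful, so it must not take priority
		}
		weighted += r.trust * float64(r.users)
		total += r.users
	}
	if total == 0 {
		return 0, false
	}
	return weighted / float64(total), true
}

func main() {
	// 210 stars: the first 200 stargazers plus an empty 10-user remainder.
	trust, ok := overallTrust(report{users: 200, trust: 0.8}, report{users: 10})
	fmt.Println(trust, ok)
}
```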

Thank Renee French for the Go gopher

Use GitHub API v4

The v4 API is GraphQL based. So it will drastically cut down on the number of requests needed.

https://developer.github.com/v4/

Unit test trust.buildComparativeReport

Index out of range after failing to fetch

Beginning fetching process for repository icecrime/poule
Pre-fetching all stargazers...ok
  > All 216 stargazers will be scanned
This repository appears to have a low amount of stargazers. Trust calculations might not be accurate.
Fetching contributions for 216 users up to year 2013
 [=>------------------------------------------------------------] ETA: 6h13m23s Elapsed: 9m34s Progress: 3 %
Failed to fetch user contributions from GitHub API too many times.
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/ullaakut/astronomer/pkg/gql.FetchContributions(0xc00021bf30, 0xc000176300, 0xa, 0x10, 0x7dd, 0x0, 0x100, 0x14fc936, 0x32, 0xc0001bfbb8)
	/Users/ullaakut/Work/go/src/github.com/ullaakut/astronomer/pkg/gql/fetch.go:315 +0x1992
main.detectFakeStars(0xc00021bf30, 0x14e7be3, 0x7)
	/Users/ullaakut/Work/go/src/github.com/ullaakut/astronomer/main.go:106 +0x3f8
main.main()
	/Users/ullaakut/Work/go/src/github.com/ullaakut/astronomer/main.go:80 +0x4b9

Add GolangCI

Unit test trust.Compute

Unit test gql.FetchContributions

This needs httptest and slightly complex tests, so it will take more time than the rest.
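
A minimal sketch of the httptest approach, for illustration only. postQuery is a hypothetical stand-in for the function under test and does not mirror the real signature of gql.FetchContributions. A local stub server replaces the GitHub API so that both the success path and the error path can be exercised.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// postQuery is a hypothetical client function standing in for the code
// under test: it posts a GraphQL query and returns the raw response body.
func postQuery(client *http.Client, url, query string) (string, error) {
	resp, err := client.Post(url, "application/json", strings.NewReader(query))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// queryStub starts a local stub server that answers every request with the
// given status and body, then runs postQuery against it.
func queryStub(status int, body string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		io.WriteString(w, body)
	}))
	defer srv.Close()
	return postQuery(srv.Client(), srv.URL, `{"query":"{ viewer { login } }"}`)
}

func main() {
	body, err := queryStub(http.StatusOK, `{"data":{}}`)
	fmt.Println(body, err)
}
```

In a real test file, the same stub would live inside a TestXxx function, and the fetch function would need a way to point at srv.URL instead of the production endpoint.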

Documenting the algorithm and providing justification evidence

Thank you for this very interesting project. Here I share a few of my tests while using the project.

I first tested my personal project, which has about 3.9k stars; the result didn't seem so good.

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer changkun/modern-cpp-tutorial                                                                                          [22:00:10]
Beginning fetching process for repository changkun/modern-cpp-tutorial
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 3930
  > Selecting 800 random stargazers out of 3930
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              4132              E
Private contributions:               65                E
Created issues:                      9                 D
Commits authored:                    238               C
Repositories:                        37                A
Pull requests:                       6                 E
Code reviews:                        2                 E
Account age (days):                  1444              B
5th percentile:                      9                 A
10th percentile:                     24                A
15th percentile:                     59                A
20th percentile:                     85                B
25th percentile:                     111               C
30th percentile:                     157               C
35th percentile:                     194               D
40th percentile:                     328               C
45th percentile:                     436               C
50th percentile:                     541               D
55th percentile:                     770               D
60th percentile:                     899               D
65th percentile:                     1255              D
70th percentile:                     1579              D
75th percentile:                     2599              D
80th percentile:                     3652              D
85th percentile:                     5277              E
90th percentile:                     6836              E
95th percentile:                     14190             E
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dbilibili%26name%3Dkratos

Then, I picked another project from the GitHub trending page:

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer bilibili/kratos                                                                                                       [22:12:59]
Beginning fetching process for repository bilibili/kratos
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 5739
  > Selecting 800 random stargazers out of 5739
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              2536              E
Private contributions:               71                E
Created issues:                      6                 D
Commits authored:                    137               D
Repositories:                        30                A
Pull requests:                       6                 D
Code reviews:                        1                 E
Account age (days):                  1545              B
5th percentile:                      9                 A
10th percentile:                     25                A
15th percentile:                     43                A
20th percentile:                     55                C
25th percentile:                     74                D
30th percentile:                     106               D
35th percentile:                     146               D
40th percentile:                     188               D
45th percentile:                     245               D
50th percentile:                     349               D
55th percentile:                     490               D
60th percentile:                     638               E
65th percentile:                     832               E
70th percentile:                     1092              E
75th percentile:                     1577              E
80th percentile:                     2072              E
85th percentile:                     3117              E
90th percentile:                     5329              E
95th percentile:                     9192              E
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dbilibili%26name%3Dkratos

OK, then let's test Tensorflow.

$ docker run -t -e GITHUB_TOKEN=$GITHUB_TOKEN -v "/Users/changkun/dev/mct:/data/" ullaakut/astronomer tensorflow/tensorflow                                                                                                 [23:32:47]
Beginning fetching process for repository tensorflow/tensorflow
Pre-fetching all stargazers...ok
  > Selecting 200 first stargazers out of 131149
  > Selecting 800 random stargazers out of 131149
Fetching contributions for 1000 users up to year 2013
Building trust report...ok

Averages                             Score           Trust
--------                             -----           -----
Weighted contributions:              7495              D
Private contributions:               190               C
Created issues:                      18                B
Commits authored:                    198               D
Repositories:                        16                C
Pull requests:                       10                D
Code reviews:                        3                 D
Account age (days):                  1145              C
5th percentile:                      1                 E
10th percentile:                     2                 E
15th percentile:                     5                 E
20th percentile:                     10                E
25th percentile:                     22                E
30th percentile:                     32                E
35th percentile:                     40                E
40th percentile:                     59                E
45th percentile:                     76                E
50th percentile:                     114               E
55th percentile:                     153               E
60th percentile:                     217               E
65th percentile:                     368               E
70th percentile:                     707               E
75th percentile:                     1076              E
80th percentile:                     2109              E
85th percentile:                     3390              E
90th percentile:                     14580             D
95th percentile:                     30685             D
----------------------------------------------------------
Overall trust:                                         D

✔ Analysis successful. 1000 users computed.
GitHub badge available at https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fastronomer.ullaakut.eu%2Fshields%3Fowner%3Dtensorflow%26name%3Dtensorflow

Issues to the Algorithm

This repo proposes an algorithm that passes judgment on projects without any prior study of the algorithm's rationale. As a user of your algorithm, I would particularly expect the following evidence that the algorithm is accurate:

Show a theoretical analysis of the influence of each of the defined factors, and provide a regression analysis and evidence of the algorithm's statistical stability.
Run benchmarks on various projects to illustrate how the algorithm matches the theoretical analysis for the top 10 most valuable open source projects, such as golang/go, torvalds/linux, etc.

"Those random stargazers can then sometimes be responsible for slight changes in the results, but they usually represent a difference of 1% to 3%, which is negligeable." -- README.md

May I ask how you reached this conclusion? How large were your test samples? What were they? Etc.
Establish a user study. An important way of evaluating usability issues is to hold a user study. Typically, a single score cannot express many different aspects, and it is not easy to say whether the stars of a repo are seriously fake or unworthy. A quantitative analysis could ask, for example, how other users feel about the score provided by the algorithm: does the score match your expectation? Why? How could we help? Those are questions that should be seriously considered.

Add CI and goreleaser

Add a CI (probably Travis)
Integrate goreleaser to generate the binaries for each release
Document the use of those binaries

Make trust computation distributed

Make it so that user scans are sent to Astronomer's server, which collects all Astronomer scans from users and uses them to generate badges
Sign reports using a secret key in order to guarantee legitimacy
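
As an illustration of the signing step, here is a sketch that assumes HMAC-SHA256 over the serialized report. The issue does not say which scheme is intended, and the names signReport and checkReport are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signReport returns a hex-encoded HMAC-SHA256 of the serialized report.
// HMAC is an assumption here; the issue only asks for "a secret key".
func signReport(secret, report []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(report)
	return hex.EncodeToString(mac.Sum(nil))
}

// checkReport recomputes the signature and compares it in constant time.
func checkReport(secret, report []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(report)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // placeholder value for the demo
	report := []byte(`{"repository":"username/repo","trust":"B"}`)

	sig := signReport(secret, report)
	fmt.Println("valid:", checkReport(secret, report, sig))
	fmt.Println("tampered:", checkReport(secret, []byte(`{"trust":"A"}`), sig))
}
```

With a shared-secret scheme like this one, whoever signs must hold the same key as whoever verifies, which matters if reports are signed on users' machines. An asymmetric signature would avoid that, at the cost of more setup.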

Unit test signature package

SendReport(ctx *context.Context, report *trust.Report) error
Check(report *SignedReport) error

Take percentiles into account for overall trust

I mistakenly forgot to take the percentile values into account when computing the overall trust factor. Will do.

Mode to scan first 1000 users

Add a --scanFirstStars option to scan the first stars of a repository (up to 1000 by default; the amount can be changed with option -s).
In query.go, simply make the getCursors function return the last cursors if the scanFirstStars option is enabled.

This will make it easy to detect foul play in repositories that bought/botted their first stars and have since achieved organic growth.

Dependabot can't parse your go.mod

Dependabot couldn't parse the go.mod found at /go.mod.

The error Dependabot encountered was:

go: github.com/spf13/[email protected] requires
	github.com/grpc-ecosystem/[email protected] requires
	gopkg.in/[email protected]: invalid version: git fetch -f origin refs/heads/*:refs/heads/* refs/tags/*:refs/tags/* in /opt/go/gopath/pkg/mod/cache/vcs/9241c28341fcedca6a799ab7a465dd6924dc5d94044cbfabb75778817250adfc: exit status 128:
	fatal: The remote end hung up unexpectedly

View the update logs.

This should have a url of a badger pic

Use A/B/C/D/E/F instead of %s

99% 	A+
80-98%	A
60-80%	B
45-50%	C
35-45%	D
25-35%	E
0-25%	F

Something like that.
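
A sketch of such a mapping. The ranges listed above leave 50-60% unassigned and overlap at their boundaries, so this example makes an assumption: each range's lower bound is treated as the cut-off, which means anything from 45% up to 60% is graded C. The function name grade is hypothetical.

```go
package main

import "fmt"

// grade converts a trust percentage into a letter, using the lower bound
// of each proposed range as the cut-off (an assumption, see above).
func grade(percent float64) string {
	switch {
	case percent >= 99:
		return "A+"
	case percent >= 80:
		return "A"
	case percent >= 60:
		return "B"
	case percent >= 45:
		return "C"
	case percent >= 35:
		return "D"
	case percent >= 25:
		return "E"
	default:
		return "F"
	}
}

func main() {
	for _, p := range []float64{99.5, 85, 62, 55, 40, 30, 10} {
		fmt.Printf("%5.1f%% -> %s\n", p, grade(p))
	}
}
```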

Build web application to let users request scans

Build an API for Astronomer where it would run scans of a repository's stars and answer with the trust report.
Build a web application to let people request scans and get reports (@veliona)
Make sure that previously generated reports are kept somewhere and accessible from the web interface

Percentiles should not be shown for repos with fewer than 20 stars

Also in this example, the progress bar breaks the output, because the hack is not working for repos with fewer than 40 stars. It needs to be adapted.

Add command line flags & options

-d for detailed reports
--cachedir for specifying a custom cache directory
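
A sketch of how the two flags could be parsed with the standard library flag package. This is an assumption for illustration: the project may well use a different flag library, and the options struct, the parseOptions name, and the "./data" default are all hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// options holds the parsed command line; the field names are hypothetical.
type options struct {
	detailed bool
	cacheDir string
	repo     string
}

// parseOptions parses the arguments that follow the program name.
func parseOptions(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("astronomer", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // errors are returned instead of printed
	fs.BoolVar(&opts.detailed, "d", false, "print a detailed report")
	fs.StringVar(&opts.cacheDir, "cachedir", "./data", "custom cache directory (assumed default)")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	if fs.NArg() != 1 {
		return opts, fmt.Errorf("expected exactly one repository argument, got %d", fs.NArg())
	}
	opts.repo = fs.Arg(0)
	return opts, nil
}

func main() {
	opts, err := parseOptions([]string{"-d", "--cachedir", "/tmp/astronomer", "username/repo"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

The standard flag package accepts both -cachedir and --cachedir, but it stops parsing at the first non-flag argument, so flags have to come before the repository name.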

Make available on Homebrew

After I saw astronomer on HN, I put together this Homebrew tap to ease installation on OSX: https://github.com/dkanejs/homebrew-astronomer

It would be cool to get this into the official Homebrew itself, or to have you as the maintainer of the tap under your own GitHub namespace; this way you can also update the formulae with each release 👍

What do you think?

Take ratio of pro users into account for trust

Add percentile variance to standard output results

Compute variance between percentiles
Print it in normal mode
A good variance should be around 5-10%?

Unit test gql helpers

updateUsers(users []User, response listStargazersResponse, year int) []User
getCursors(ctx *context.Context, sg []stargazers, totalUsers uint) []string
buildRequestBody(ctx *context.Context, baseRequest string, pagination int) string
getCursor(cursors []string, page int, reverseOrder bool) string
pickRandomStringsExcept(s []string, picked []string, amount uint) []string
isBlacklisted(user string) bool
parseResponse(resp *http.Response) (*listStargazersResponse, []byte, error)

Astronomer should log a proper error if the GITHUB_TOKEN is not set
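
A minimal sketch of such a check, assuming it runs before any request is made. The githubToken helper and the error wording are hypothetical. The lookup function is passed in so that the check can be tested without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// errMissingToken is a hypothetical error message for the missing variable.
var errMissingToken = errors.New("GITHUB_TOKEN is not set: export a GitHub personal access token before running astronomer")

// githubToken reads the token through lookup, which has the same signature
// as os.LookupEnv, and rejects unset or blank values.
func githubToken(lookup func(string) (string, bool)) (string, error) {
	token, ok := lookup("GITHUB_TOKEN")
	if !ok || strings.TrimSpace(token) == "" {
		return "", errMissingToken
	}
	return token, nil
}

func main() {
	if _, err := githubToken(os.LookupEnv); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("token found")
}
```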

Pre-scan stargazers and add progress bar

Before scanning contributions, just scan all stargazers in order to know how many there will be to scan (this task is also a part of #10)
Display a progress bar while the scan is in progress

Add a fast mode

Add an option --fast (maybe turned on by default?) for big repositories (>2K stars) where Astronomer would
- Fetch the list of stargazers first without querying user data
- Select 50 random slices of 20 users within those stargazers
- Compute the statistics on those 1000 users

This would greatly reduce the scan time while remaining fairly accurate.
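
The selection step could be sketched as follows, assuming stargazers are listed in pages of 20 users and that a "slice" is one such page. pickRandomPages is a hypothetical helper and does not reflect how getCursors works.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickRandomPages returns up to n distinct page indices among the pages
// needed to list totalUsers stargazers at pageSize users per page.
func pickRandomPages(totalUsers, pageSize, n int, seed int64) []int {
	if totalUsers <= 0 || pageSize <= 0 || n <= 0 {
		return nil
	}
	pages := (totalUsers + pageSize - 1) / pageSize // ceiling division
	if n > pages {
		n = pages // small repository: every page is selected
	}
	rng := rand.New(rand.NewSource(seed))
	return rng.Perm(pages)[:n] // first n entries of a random permutation
}

func main() {
	// 50 pages of 20 users out of 5739 stargazers, as in the proposal.
	pages := pickRandomPages(5739, 20, 50, 42)
	fmt.Println(len(pages), "pages selected:", pages)
}
```

Because the last page of a repository can be partial, the sample may contain slightly fewer than 1000 users when that page is picked.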

Dependabot can't parse your go.mod

Dependabot couldn't parse the go.mod found at /go.mod.

The error Dependabot encountered was:

go: github.com/spf13/[email protected] requires
	github.com/grpc-ecosystem/[email protected] requires
	gopkg.in/[email protected]: invalid version: git fetch --unshallow -f origin in /opt/go/gopath/pkg/mod/cache/vcs/748bced43cf7672b862fbc52430e98581510f4f2c34fb30c0064b7102a68ae2c: exit status 128:
	fatal: The remote end hung up unexpectedly

View the update logs.

Unit test gql.FetchStargazers

This needs httptest and slightly complex tests, so it will take more time than the rest.

Unit test gql cache functions

getCache(ctx *context.Context, req *http.Request, pagination string) (*http.Response, error)
readCachedResponse(filename string, req *http.Request) (*http.Response, error)
putCache(ctx *context.Context, req *http.Request, pagination string, body []byte) error
cacheEntryFilename(ctx *context.Context, url string) string
listFilePagination(cursor string) string
contribFilePagination(cursor string, year int) string

Detect suspicious ranges of percentiles

Remove the computation of the 65/85/95th percentiles as it's done at the moment
Add a new step which computes every 5th percentile (5, 10, 15 and so on) and detect anomalies within ranges (for example if percentiles 20, 25 and 30 are all abnormally low or abnormally high, it could very well indicate illegitimate stars)
Find a good way to represent this in the trust report
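
The percentile step could look like the sketch below. It assumes the nearest-rank definition of a percentile, which the issue does not specify, and the name everyFifthPercentile is hypothetical. Anomaly detection would then compare neighbouring entries of the result against reference values.

```go
package main

import (
	"fmt"
	"sort"
)

// everyFifthPercentile returns the 5th, 10th, ..., 95th percentiles of
// scores using the nearest-rank method. Index 0 holds the 5th percentile
// and index 18 the 95th. The input slice is left unmodified.
func everyFifthPercentile(scores []float64) []float64 {
	if len(scores) == 0 {
		return nil
	}
	sorted := append([]float64(nil), scores...) // work on a copy
	sort.Float64s(sorted)

	result := make([]float64, 0, 19)
	for p := 5; p <= 95; p += 5 {
		rank := (p*len(sorted) + 99) / 100 // integer form of ceil(p/100 * n)
		result = append(result, sorted[rank-1])
	}
	return result
}

func main() {
	scores := []float64{9, 24, 59, 85, 111, 157, 194, 328, 436, 541}
	for i, v := range everyFifthPercentile(scores) {
		fmt.Printf("%dth percentile: %v\n", (i+1)*5, v)
	}
}
```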

Add a searchable leaderboard

I like trying to find new projects by searching for random things on GitHub and sorting the results by number of stars. I'd love to be able to do the same with trust score or, better yet, some sort of "trusted stars" metric which combines trust score with star count. Would this be possible given the data being curated by Astrolab?

Improvement to the algorithm

#42 Yeah, consider developer program members and users who are part of some organization!

Compute graph of user trust over time

Compute a basic graph of the evolution of user trustworthiness over time

Compute individual user trustworthiness
Use graph library to display chart in terminal
Do it only when -d option is enabled
