Dewey

Index of Clojure libraries available on github.

Analysis:

Analyzing Every Clojure Project on Github

Web frontends:

Rationale

The goal of this project is to make the clojure libraries available on github easier to programmatically list and inspect.

Deps.edn can procure dependencies directly from github. However, finding clojure libraries that are available via github can be more difficult compared to clojars. Clojars provides several data endpoints to list available libraries and metadata. Even though similar info is available from github, it's not quite as easy to obtain.

Getting the data

Pre-retrieved data can be found at releases.

What's included?

Each release includes the following files in .gz or tar.gz format:

deps-libs.edn: This is the best place to start if you're using the data. It's a map of library name to library info for all clojure github libraries that have deps.edn files on their default branch.
- Library info keys:
  - :description The github repo description.
  - :lib Lib coordinate that will be recognized by clojure cli tools. https://clojure.org/reference/deps_and_cli#find-versions
  - :topics The github topics for the repo.
  - :stars Number of github stars.
  - :url URL to page on github.
  - :versions Vector of lib coordinates based on the tags in the git repo.
deps directory: The deps.edn file for every clojure library that has a deps.edn file on their default branch. The folder structure is deps/<github username>/<github project>/deps.edn.
all-repos.edn: A vector of all clojure repositories on github that were found (including non deps.edn based projects). Each repository is represented by a map with all of the data returned via the github API https://docs.github.com/en/rest/repos/repos#get-a-repository.
deps-tags.edn: This is an intermediate file of pairs of github repo information and github tag information.
analysis.edn.gz: clj-kondo analysis for every repo. The kondo config turns off all linters and includes the :locals, :keywords, :arglists, and :protocol-impls analyses. See dewey indexer and clj-kondo docs. Available as of the 2022-07-25 release.

All the .edn or .edn.gz files can be read using com.phronemophobic.dewey.util/read-edn. For example:

(require 'com.phronemophobic.dewey.util)
(def data (com.phronemophobic.dewey.util/read-edn fname))
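Once deps-libs.edn is loaded, it can be queried like any other Clojure map. The following is a minimal sketch, assuming the library info keys listed above (:stars, :topics); the function name top-libs and the sample entries are hypothetical and only illustrate the shape of the data.

```clojure
;; Hypothetical stand-in for the map read from deps-libs.edn.
(def sample-libs
  '{io.github.example/alpha {:stars 10 :topics ["clojure" "cli"]}
    io.github.example/beta  {:stars 42 :topics ["clojure"]}
    io.github.example/gamma {:stars 3  :topics ["cli"]}})

(defn top-libs
  "Returns [lib-name stars] pairs for the n most-starred libraries,
  optionally restricted to those tagged with the given github topic."
  ([libs n] (top-libs libs n nil))
  ([libs n topic]
   (->> libs
        ;; keep everything when no topic is given
        (filter (fn [[_ info]]
                  (or (nil? topic)
                      (some #{topic} (:topics info)))))
        ;; most stars first; treat a missing star count as 0
        (sort-by (fn [[_ info]] (or (:stars info) 0)) >)
        (take n)
        (mapv (fn [[lib-name info]] [lib-name (:stars info)])))))

(top-libs sample-libs 2)       ; two most-starred libraries
(top-libs sample-libs 5 "cli") ; only libraries with the "cli" topic
```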

Analysis Data

clj-kondo analyses for each project are available in the releases under analysis.edn.gz. This file can be quite chonky. For an example of how to process the data, see the stats example.

The file contains a vector of maps, with each map containing the following keys:

:repo: A string name for the repo, eg. "phronmophobic/dewey".
:analyze-instant: The instant that the repo was analyzed, eg. #inst "2022-12-15T21:09:46.694-00:00".
:git/sha: The commit hash of the repository commit analyzed, eg. "69dc62aac32f8a2da0a47aaf1dc662f86ff05760".
:basis: A single repository can have multiple projects. Basis is the project file used to generate the source paths for clj-kondo to analyze, eg. "path/to/project.clj" or "path/to/deps.edn".
:analysis: The clj-kondo analysis.

The file is specially formatted edn so that it can be processed without reading the full contents into memory. The first line is [, the last line is ], and every line in between is a single map.
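The line-per-map layout means the file can be folded over one entry at a time. Below is a rough sketch of that idea using only clojure.edn and the JDK's gzip stream. The name reduce-analysis is made up for this example, and it assumes every entry is readable by the plain edn reader, which may not hold for every analysis.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '(java.util.zip GZIPInputStream))

(defn reduce-analysis
  "Reduces f over each analysis map in a gzipped, line-delimited edn file.
  Only one line is parsed at a time, so the whole file is never in memory."
  [f init path]
  (with-open [rdr (io/reader (GZIPInputStream. (io/input-stream path)))]
    (transduce (comp (map str/trim)
                     ;; skip the opening "[", the closing "]" and blank lines
                     (remove #{"" "[" "]"})
                     (map edn/read-string))
               (completing f)
               init
               (line-seq rdr))))

(comment
  ;; Count analysis entries per repo (the path is an assumption).
  (reduce-analysis (fn [acc {:keys [repo]}]
                     (update acc repo (fnil inc 0)))
                   {}
                   "analysis.edn.gz"))
```

Because the reducing function only sees one map at a time, memory use stays bounded by the largest single entry plus whatever the accumulator holds.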

Generating the dataset via the github API

To retrieve the data yourself, follow step 0 (Authentication) below and then run:

# creates releases/yyyy-MM-dd/all-repos.edn
clojure -X:update-clojure-repo-index

# downloads all deps files to releases/yyyy-MM-dd/<user>/<project>/deps.edn
# due to rate limits, takes around 3 hours (mostly sleeping).
clojure -X:download-deps

# downloads tags for each deps.edn clojure library to releases/yyyy-MM-dd/deps-tags.edn
clojure -X:update-tag-index

# creates an index of library name to library metadata in releases/yyyy-MM-dd/deps-libs.edn
clojure -X:update-available-git-libs-index

These commands must be run in order.

Finding Clojure Libraries Methodology

Github search is quirky and has certain limitations imposed by rate-limiting. Below is a short synopsis of how Dewey attempts to locate clojure projects on github within the limitations imposed by github's API.

Current Method

0. Authentication
1. Find all clojure repositories
2. Download all deps.edn files

0. Authentication

Dewey uses personal access tokens to make github API requests. You can obtain a personal access token by following these docs.

Once you have obtained your personal access token, save it to an edn file called "secrets.edn" in the project root directory using the following format:

{:github {:user "my-username"
          :token "my-token"}}
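As an illustration of how a token in this format could be turned into request headers, here is a small sketch. It is not taken from Dewey's source: the function name auth-headers is hypothetical, and it assumes the token is sent as a Bearer token in the Authorization header, which the github REST API accepts for personal access tokens.

```clojure
(require '[clojure.edn :as edn])

(defn auth-headers
  "Builds github API request headers from a secrets map shaped like
  {:github {:user \"...\" :token \"...\"}}."
  [secrets]
  (let [token (get-in secrets [:github :token])]
    (when-not (seq token)
      (throw (ex-info "Missing [:github :token] in secrets" {})))
    {"Authorization" (str "Bearer " token)
     "Accept"        "application/vnd.github+json"}))

(comment
  ;; Reading the file from the project root (assumed location).
  (auth-headers (edn/read-string (slurp "secrets.edn"))))
```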

1. Find all clojure libraries

Currently, the first step is to paginate through the results of the github repository search language:clojure sorted by stars in descending order. There's a 1,000 result limit for any specific search so after exhausting the results from language:clojure, we find repositories for specific numbers of stars starting at the star number from the last result. The search queries for these requests look like language:clojure stars:123, language:clojure stars:122, etc.
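The per-star queries can be generated mechanically. This is a minimal sketch of that step only, assuming the walk continues down to zero stars; star-queries is a name invented for this example, and pagination and rate limiting are left out.

```clojure
(defn star-queries
  "Returns the search queries for every star count from last-star-count
  down to 0, in descending order."
  [last-star-count]
  (for [stars (range last-star-count -1 -1)]
    (str "language:clojure stars:" stars)))

(take 2 (star-queries 123))
;; => ("language:clojure stars:123" "language:clojure stars:122")
```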

2. Download all deps.edn files

Once we have a list of clojure github repositories, we can then check each repository for its deps.edn file. Given a repository, the url for the deps.edn file looks like (str "https://raw.githubusercontent.com/" full-name "/" default-branch "/" fname).
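Wrapped in a function, the URL construction might look like the sketch below. It assumes each repository map carries the github API fields full_name and default_branch as keywords; the function name raw-file-url and the "main" branch in the usage line are assumptions made for illustration.

```clojure
(defn raw-file-url
  "Builds the raw.githubusercontent.com URL for fname at the tip of the
  repository's default branch."
  [{:keys [full_name default_branch]} fname]
  (str "https://raw.githubusercontent.com/" full_name "/" default_branch "/" fname))

;; The default branch name here is assumed, not looked up.
(raw-file-url {:full_name "phronmophobic/dewey" :default_branch "main"} "deps.edn")
;; => "https://raw.githubusercontent.com/phronmophobic/dewey/main/deps.edn"
```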

Current known limitations

There are some libraries that actually are clojure libraries, but aren't found when searching using language:clojure
Clojurescript only libraries are not currently targeted
Only checks tip of default branch.
Only 1,000 libraries max per star count. At the time of writing, this only matters for star counts less than 5.

Failed Strategies

Searching github code with "filename:deps.edn"

I thought just asking github for all the files named deps.edn might work. The roadblocks I ran into were:

Hitting secondary rate limits after 1-2 requests.
Receiving only 0-3 results even on successful requests.

Alternative Strategies

These are strategies that I didn't try, but might be good alternatives if the main strategy fails.

Scanning clojure repos by created or updated

As suggested by this stackoverflow answer, you can restrict a search by a date field, such as when a repo was created or updated. The search API currently limits results to a max of 1000, but if you search a small enough window of time, you can scan through all the libraries.

Relevant Github docs
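To make the time-window idea concrete, here is an untested-against-github sketch that only generates the queries. It assumes the created:FROM..TO range qualifier from github's search syntax and a fixed window size; created-window-queries is a hypothetical name, and in practice the window would need to shrink wherever a single window still returns more than 1000 results.

```clojure
(import '(java.time LocalDate))

(defn created-window-queries
  "Returns one search query per window of window-days days, covering the
  ISO dates start through end inclusive."
  [start end window-days]
  (let [start (LocalDate/parse start)
        end   (LocalDate/parse end)]
    (->> (iterate #(.plusDays ^LocalDate % window-days) start)
         (take-while #(not (.isAfter ^LocalDate % end)))
         (map (fn [^LocalDate from]
                (let [to (.plusDays from (dec window-days))
                      ;; clamp the final window to the requested end date
                      to (if (.isAfter to end) end to)]
                  (str "language:clojure created:" from ".." to)))))))

(created-window-queries "2020-01-01" "2020-01-10" 7)
;; => ("language:clojure created:2020-01-01..2020-01-07"
;;     "language:clojure created:2020-01-08..2020-01-10")
```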

Use github's GraphQL API

It's possible that github's GraphQL API might provide opportunities for improvement. However, it doesn't appear to have a way to filter by language or any other means of identifying repositories that are clojure related.

Future Work

Now that we've bothered to catalog all of the clojure repos on github, there are several interesting projects we can do that use the data:

~~Download and run static analysis across repos~~ Done! see analysis.edn.gz in releases.
~~Create a website that combines the clojars data API with dewey's data to make it easier to search for clojure libraries.~~ Done! see web search interface.
Integrate the data into tools and IDEs
- deps.edn editor that knows the available libraries and versions
- Find example usages for libraries or specific functions (for example)
Add support for other git hosting sites like gitlab.

License

Distributed under the Eclipse Public License version 1.0.

dewey's Issues

What is the rationale behind the `:lib` value format?

Hi, @phronmophobic.

I'm trying to understand the data that dewey provides. I noticed that there is a :lib key that - I presume - holds a unique identifier for each project, in a format like io.github.phronmophobic/dewey. Could you please explain the rationale for this format?

https://github.com/phronmophobic/dewey/blob/2318c3a0d08ea557e12278f2927e20b4b2fbe8a1/src/com/phronemophobic/dewey.clj#LL276C40-L276C75

Use scm attribute in clojars pom file to find repos outside of github

ns -> dep mapping

Heya, I wanted to make a deps autoloader that would catch errors about missing ns and auto import them using clojure 1.12's clojure.repl.deps/add-libs.

I think that information is only on Dewey's larger release artifacts, and downloading all those files is a tough proposition.

Ideally I'd like to end up with something like

{clojure.data.csv {:git/url "https://github.com/clojure/data.json"
                   :git/sha "c323f899a06653af9d66a8e0212b65d0ac6f7b7f"
                   :git/tag "v1.1.0"}
;; more ns -> dep mappings
}

Do you have any thoughts or suggestions on how to end up with this data? We talked briefly on slack and you already suggested new build artifacts and an API.
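One possible starting point, sketched under assumptions rather than taken from Dewey: the entries in analysis.edn.gz carry :repo and :analysis, and clj-kondo's analysis lists namespaces under :namespace-definitions with a :name each. Folding over those gives a namespace to repo index, which would still need to be joined with tag and sha data to reach the shape above. The name ns->repos is made up for this example.

```clojure
(defn ns->repos
  "Builds a map of namespace symbol to the set of repos that define it,
  from a sequence of analysis entries."
  [analysis-entries]
  (reduce (fn [index {:keys [repo analysis]}]
            (reduce (fn [index {ns-sym :name}]
                      ;; several repos may define the same namespace
                      (update index ns-sym (fnil conj #{}) repo))
                    index
                    (:namespace-definitions analysis)))
          {}
          analysis-entries))

(comment
  ;; Hypothetical entries, trimmed to the keys used here.
  (ns->repos [{:repo "a/b" :analysis {:namespace-definitions [{:name 'a.core}]}}]))
```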

Add `:lib` key to make it easier to analyze/process data

The deps-libs data includes a :lib key. Other data should as well.

Suggestion: document list of fields included in each pre-computed dump

Hi @phronmophobic,

Thank you for maintaining dewey. I'm trying to incorporate some of the provided data and retrieval pathways into my projects. Would you consider updating the docs to include the fields and their meanings included in each pre-computed dump? I think this would help make dewey more approachable and would remove the need to run the project and call com.phronemophobic.dewey.util/read-edn to see what's inside.

Pre-retrieved data formatting

I know that the formatting of Dewey's data dumps is still a work in progress and open to change, so I wanted to flag some concerns and potential areas for improvement.

It looks like at the moment the formatting of the pre-retrieved data dump files is not consistent:

analysis.edn.gz: first line is opening bracket [ followed by one project's analysis map {...} per line
all-repos.edn: single line containing ({...} ...)
deps-libs.edn: single line containing a map keyed by lib
deps-tags.edn: single line containing a vector of 2-element vectors:
- first element: map of repo information
- second element: vector of maps, where each map represents a git tag of the repo

These differences create a bit of friction when consuming the data. It would be nice to either unify the format somehow, provide clear documentation about the differences, or add some helpers like read-edn to assist those trying to consume the data (or a combination of all three).

Related: read-edn does not currently work as documented in the README, because it consumes too much CPU and takes too long for certain files, like analysis.edn.gz.

As for the formatting, it would be nice if all of the provided data were easy to join. Making the :lib key ubiquitous, as mentioned in #3, would be a step in the right direction. If all the data could be keyed by :lib, it would be better still, allowing lookups without iterating sequences to find the data to join. I don't know exactly how to achieve that without requiring the whole map to be loaded into memory first. Perhaps SQLite can be part of the solution?

Include latest sha in deps-libs.edn

We grab the latest commit of the default branch in default-branches.edn. We should add that info to deps-libs.edn since not every repo has a tagged commit to make a version from.

Reproducibility: include `:git/sha` or similar in `analysis.edn`

Hi @phronmophobic,

You've mentioned that some of the effort you are expending on Dewey right now is around the repeatability of the analysis process. I'd like to suggest that when a Clojure repo is analyzed, the resulting analysis data should include the repo's commit hash on which the analysis was performed. I think the benefits of this one additional piece of data are fairly clear, but I could expand on my rationale if the use case is not evident. Do you have any thoughts on this idea?

why "FiraCode" ?

Why is this in the list?

It has no Clojure-related keyword:
https://github.com/tonsky/FiraCode

Document `:basis` key in `analysis.edn`

The clj-kondo analysis dump can include one or more entries per repo, depending on how many bases (deps.edn, project.clj, etc.) the project has. The difference is represented through a top-level :basis key in each of the maps in the file.

Since the :basis key is not a part of clj-kondo's own output, I think it would be helpful to document it for Dewey users, and to explain its impact on the cardinality between repos and analysis maps in the pre-computed analysis.edn dumps. I initially assumed that each repo would have just a single analysis entry.
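To illustrate the cardinality, the following sketch groups analysis entries by repo and collects their bases. It assumes only the top-level :repo and :basis keys described in this issue; bases-by-repo and the sample entries are hypothetical.

```clojure
(defn bases-by-repo
  "Returns a map of repo name to the vector of project files (bases)
  that produced an analysis entry for that repo."
  [analysis-entries]
  (reduce (fn [acc {:keys [repo basis]}]
            (update acc repo (fnil conj []) basis))
          {}
          analysis-entries))

(bases-by-repo [{:repo "a/b" :basis "deps.edn"}
                {:repo "a/b" :basis "project.clj"}
                {:repo "c/d" :basis "deps.edn"}])
;; => {"a/b" ["deps.edn" "project.clj"], "c/d" ["deps.edn"]}
```

A repo with more than one basis therefore shows up as more than one map in the dump.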

Add `clojure/clojure` analysis

The current analysis data doesn't include https://github.com/clojure/clojure. We had a quick discussion about it on Slack, and you mentioned that it might be a good idea to add it. However, there were no specific commitments made regarding the timing. Just opening this issue as a reminder. Thanks, @phronmophobic!

Sorting and searching based on `updated` and `pushed`

Hi @phronmophobic

and thanks for many great libraries and code!

I was able to get 4431 repositories with 2 stars using this code:

(def base-request
  {:url          search-repos-url
   :method       :get
   :as           :json
   :query-params {:per_page 100
                  :sort     "updated"
                  :order    "desc"}})

and

(defn find-clojure-repos []
  (iteration
    (with-retries
      (fn [{:keys [url pushed_at last-response] :as k}]
        (prn (select-keys k [:url :pushed_at]))
        (let [req
              (cond
                ;; initial request
                (nil? k) (search-repos-request "language:clojure stars:2")
                ;; received next-url
                url (assoc base-request :url url)
                ;; received pushed_at timestamp
                pushed_at (search-repos-request (str "language:clojure stars:2 " "pushed:<=" pushed_at))
                :else (throw (Exception. (str "Unexpected key type: " (pr-str k)))))]
          (rate-limit-sleep! last-response)
          (let [response (http/request (with-auth req))
                prev-items (into #{} (get-in last-response [:body :items] []))]
            (-> response
                (assoc ::key k
                       ::request req)
                (update-in [:body :items]
                           (fn [items]
                             (vec (remove (partial contains? prev-items) items)))))))))
    :kf
    (fn [response]
      (let [url (-> response :links :next :href)]
        (if url
          {:last-response response
           :url           url}
          (when-let [pushed_at (some-> response
                                       :body
                                       :items
                                       last
                                       ;; want to continue from where we left off
                                       :pushed_at)]
            {:pushed_at     pushed_at
             :last-response response}))))))

You'll notice that pushed_at is used for searching here, while the results are sorted by updated.

The search is done using pushed:<= ... so that identical updated timestamps are supported (this also introduces the need to remove items already seen in the previous response, in order not to loop forever).

Is this something that you would like to integrate into dewey?

Thanks and kind regards!

Expanding the clj-kondo analysis config

Hi, @phronmophobic. Would you be open to expanding the config map which Dewey passes to clj-kondo?

My code is relying on {:var-definitions {:meta [:arglists]}} for handling some edge cases, particularly in Clojure core and the associated tests, because some of the definitions there are a little unusual. In some of the core definitions the arglist is defined only in meta and in some of the tests there are intentionally invalid definitions, and I check for empty :arglist-strs and empty :meta {:arglists ...} to remove those entries.

In general, I'm very much interested in delegating some of the data extraction requirements in my project to Dewey, but if there's a gap between my requirements and the provided data, I basically have to re-run the analysis on my end.

I guess this issue is two questions:

would you be open to adding {:var-definitions {:meta [:arglists]}} to the current analysis config?
do you have any thoughts about addressing such requirement gaps in general? perhaps it would be more pragmatic to ask clj-kondo for everything it is able to provide? I'm not sure if that would have other implications for performance or something else as you compile the pre-computed data sets