airflow-jobs's People

Contributors

ckxkexing, crystaldust, estrangezz, fivestarsky, gjftta, hexaemeronfsk, shanchenqi, ynang

airflow-jobs's Issues

Transfer ClickHouse data when initializing GitHub profiles

Currently the GitHub profile init DAG only stores data in OpenSearch, and a separate ck_transfer DAG then copies all GitHub profile data from OpenSearch to ClickHouse. When that DAG runs again, it copies duplicate data into ClickHouse. It would be better to transfer the data to ClickHouse at the same time, within a single DAG.

The design:
Each developer's profile has an updated_at field indicating when the profile was last modified. Selecting the OpenSearch documents whose updated_at is greater than the latest updated_at in ClickHouse, then inserting them into ClickHouse, is approximately sufficient.
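The incremental selection described above can be sketched as a query builder. This is a minimal illustration; the index and field paths (`raw_data.updated_at`) are assumptions, not the project's actual schema:

```python
# Hypothetical sketch: build the OpenSearch query body that selects only
# profiles modified after the newest updated_at already in ClickHouse.
# The field path "raw_data.updated_at" is an assumed name, not the real schema.

def build_incremental_query(latest_ck_updated_at: str) -> dict:
    """Return an OpenSearch query body selecting profiles newer than the
    given ClickHouse high-water mark (an ISO-8601 timestamp string)."""
    return {
        "query": {
            "range": {
                "raw_data.updated_at": {"gt": latest_ck_updated_at}
            }
        },
        # Sort ascending so rows can be inserted into ClickHouse in order.
        "sort": [{"raw_data.updated_at": "asc"}],
    }
```

The returned body would be passed to an opensearchpy `search` or scroll helper, and the hits inserted into ClickHouse.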

Control opensearch insertion batch

Some data, such as GitHub commits, pull requests, and issues, are fetched and inserted page by page, and the data in one page may not be as large as a full batch. Set up a batch buffer, collect page results into it, and insert the buffered data into OpenSearch whenever the batch is full.
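A minimal sketch of such a batch buffer, assuming the flush callback stands in for an opensearchpy bulk helper (class and parameter names are illustrative, not the project's actual API):

```python
class OpenSearchBatcher:
    """Collect per-page documents and flush them in fixed-size batches.

    `flush_fn` is a stand-in for a real bulk-insert call such as
    opensearchpy.helpers.bulk; here it just receives a list of docs.
    """

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def add_page(self, docs):
        """Append one page of results; flush whenever a full batch accumulates."""
        self.buffer.extend(docs)
        while len(self.buffer) >= self.batch_size:
            self.flush_fn(self.buffer[:self.batch_size])
            self.buffer = self.buffer[self.batch_size:]

    def close(self):
        """Flush any remaining partial batch at the end of the fetch loop."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

A fetch loop would call `add_page()` for every API page and `close()` once pagination ends, so small pages are coalesced instead of triggering one insert each.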

[FEATURE] A universal design of code review workflow

Open-source communities use many different code review/merging/maintenance mechanisms: the Apache Foundation's incubation system, the Linux kernel's mailing-list discussion and patch review/merging, and GitHub/GitLab's issue/PR/comment-based collaboration. We therefore need to design a universal code review workflow that captures the whole process of code being written, discussed/reviewed, modified, and merged, so that the different code collaboration systems can be mapped onto the universal template, and any future forms can be adapted as well.
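One possible shape for such a universal template is a common event model that each collaboration system is mapped onto. This is only a sketch of the idea; all names and the event vocabulary are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: every collaboration system (GitHub PRs,
# mailing-list patches, Apache incubation, ...) is normalized into a
# common sequence of events attached to one logical code change.

@dataclass
class ReviewEvent:
    actor: str       # who acted on the change
    action: str      # e.g. "submitted", "commented", "revised", "merged"
    timestamp: str   # ISO-8601 time of the event
    source: str      # originating system, e.g. "github_pr", "mailing_list"

@dataclass
class CodeChange:
    change_id: str
    events: List[ReviewEvent] = field(default_factory=list)

    def is_merged(self) -> bool:
        """A change is merged once any source system recorded a merge event."""
        return any(e.action == "merged" for e in self.events)
```

Each system-specific collector would translate its native records (PR comments, patch emails, etc.) into `ReviewEvent`s, so downstream analysis only needs to understand one schema.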

Default/Empty github API result

GitHub repositories might be removed after we have downloaded them, in which case requests for those repositories return NotFound. We should come up with a solution that provides default/empty GitHub resources in this case.

A current example: the get_github_pull_requests function in libs/util/github_api.py (@line 127) will fail if the response is NotFound.
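One way to provide the default is a small wrapper around the fetch call. This is a sketch only; the real code would catch the GitHub client's actual 404 exception type, which is modeled here as `LookupError`:

```python
def safe_github_fetch(fetch_fn, default=None):
    """Call fetch_fn; on a NotFound-style failure, return a default/empty
    result instead of propagating the error.

    `LookupError` stands in for whatever exception the project's GitHub
    client raises on a 404 response.
    """
    try:
        return fetch_fn()
    except LookupError:
        return [] if default is None else default
```

Functions like `get_github_pull_requests` could then route their request through this wrapper and receive an empty list when the repository has been removed.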

Add DAG to daily sync github profile from remote clickhouse service

While resolving issue #189, we found that the GitHub profile data can be stored with a new engine that removes duplication by keeping the (github id, updated_at) field tuple distinct. That tuple defines a snapshot of a profile at a particular time.

So after switching to the new engine, it becomes possible to sync GitHub profiles from the remote ClickHouse service by comparing search_key__updated_at, and the new engine will keep removing duplication.
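In ClickHouse, a deduplicating engine keyed on that tuple is typically `ReplacingMergeTree`. A hypothetical DDL sketch follows; the table and column names are assumptions about the project's schema, not its real definition:

```python
# Hypothetical DDL sketch: a ReplacingMergeTree table ordered by
# (id, search_key__updated_at), so re-inserted profile snapshots with the
# same key are deduplicated during background merges.
# All identifiers below are assumed, not taken from the project.
PROFILE_DDL = """
CREATE TABLE IF NOT EXISTS github_profile
(
    id UInt64,
    search_key__updated_at DateTime,
    raw_data String
)
ENGINE = ReplacingMergeTree
ORDER BY (id, search_key__updated_at)
"""
```

Note that `ReplacingMergeTree` deduplicates asynchronously at merge time, so the sync job can insert overlapping ranges without pre-filtering; exact-dedup reads would use `FINAL`.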

50x error

When the target server responds with a 50x error, the base function currently just returns an empty list. For modules that expect a profile, this causes errors.
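One alternative to silently returning an empty list is to retry and then fail loudly. A minimal sketch, where `do_request` is an assumed callable returning a `(status, body)` pair rather than the project's real base function:

```python
import time

def fetch_with_retry(do_request, retries=3, backoff=1.0):
    """Retry on 50x responses; raise instead of returning [] so that
    callers expecting a profile fail visibly rather than downstream.

    `do_request` is a stand-in returning (status_code, body).
    """
    status = None
    for attempt in range(retries):
        status, body = do_request()
        if status < 500:
            return body
        # Simple linear backoff between attempts.
        time.sleep(backoff * (attempt + 1))
    raise RuntimeError(f"server kept returning {status} after {retries} attempts")
```

Callers that can genuinely tolerate missing data could still catch the exception and substitute an empty result themselves.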

Remove unnecessary logs of opensearchpy

The opensearchpy library prints a line of log output on each request, like this:

[2023-04-25, 09:01:43 UTC] {base.py:270} INFO - POST https://dev-opensearch-node1:9200/gits/_delete_by_query [status:200 request:0.006s]

This produces a huge amount of useless text; consider turning the logs off via opensearchpy's logging settings.
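Since opensearchpy emits these lines through the standard `logging` module, raising the level of its logger should silence them. A sketch, assuming the transport logger is named "opensearch" (the quoted line comes from opensearchpy's base.py, which logs under that name):

```python
import logging

# Suppress opensearchpy's per-request INFO lines (the
# "POST https://... [status:200 request:0.006s]" messages) by raising
# the level of its transport logger above INFO.
logging.getLogger("opensearch").setLevel(logging.WARNING)
```

This keeps warnings and errors visible while dropping the per-request noise; it can be placed once in the DAG's common setup code.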

Handle renamed github repos

When a GitHub repository is renamed, the current git_track_repo DAG's issue timeline/comments task never finishes (and it keeps calling the API at a very high pace).
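The GitHub API transparently redirects requests for renamed repositories, so the `full_name` in the response no longer matches the one that was requested. A sketch of a rename check built on that observation (function name is illustrative):

```python
def detect_rename(requested_full_name: str, response_full_name: str):
    """Return the repo's new full name if the API response reveals a
    rename, else None.

    GitHub follows renames with a redirect, so comparing the requested
    "owner/repo" with the full_name field of the returned repository
    object exposes the rename. Comparison is case-insensitive because
    GitHub treats owner/repo names case-insensitively.
    """
    if response_full_name.lower() != requested_full_name.lower():
        return response_full_name
    return None
```

On a detected rename, the task could update its stored (owner, repo) bookkeeping and restart pagination under the new name instead of looping forever.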

Set different job intervals for daily sync DAGs

Daily sync DAGs need to be scheduled at different intervals to balance resource consumption. The policy is two levels of interval variables with different priority:

high: daily_sync_{DAG_NAME}_interval
middle: daily_sync_interval

If neither of the interval variables above is configured, do not trigger the DAG periodically.
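The lookup priority can be sketched as a small resolver. Here `get_variable` stands in for `airflow.models.Variable.get` with `default_var=None`; the function name is illustrative:

```python
def resolve_interval(dag_name, get_variable):
    """Resolve a DAG's schedule interval by priority:

    1. high:   daily_sync_{DAG_NAME}_interval
    2. middle: daily_sync_interval
    3. neither configured -> None, meaning do not schedule periodically.

    `get_variable` is a stand-in for Airflow's Variable.get that returns
    None when the variable is absent.
    """
    specific = get_variable(f"daily_sync_{dag_name}_interval")
    if specific is not None:
        return specific
    return get_variable("daily_sync_interval")
```

The DAG definition would then pass the result as `schedule_interval`, where `None` already means "manual trigger only" in Airflow.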

Make unified format of gits origin

Both https://github.com/OWNER/REPO.git and https://github.com/OWNER/REPO are valid git URLs. When the 'includes' variable for a daily sync specifies an origin that differs from the data already in OpenSearch — for example when the OpenSearch document's origin carries the '.git' suffix while 'includes' does not — there will be two different (owner, repo, origin) tuples.

If we then do a full-repo daily sync, the two tuples are treated as two separate code bases and synced separately, introducing redundant data into OpenSearch and from there into ClickHouse.

The solution is to make sure the '.git' suffix is eliminated before initializing or syncing data.
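The normalization is a one-line string transform; a minimal sketch (function name is illustrative):

```python
def normalize_origin(origin: str) -> str:
    """Map both accepted git URL forms onto one canonical origin by
    stripping any trailing slash and a trailing '.git' suffix, so that
    OWNER/REPO.git and OWNER/REPO yield identical (owner, repo, origin)
    tuples."""
    origin = origin.rstrip("/")
    if origin.endswith(".git"):
        origin = origin[: -len(".git")]
    return origin
```

Applying this in both the init path and the daily-sync path guarantees the two URL spellings can never diverge into two tuples.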
