airflow-jobs's People

Contributors

ckxkexing, crystaldust, estrangezz, fivestarsky, gjftta, hexaemeronfsk, shanchenqi, ynang

airflow-jobs's Issues

Transfer ClickHouse data when initializing GitHub profiles

Currently the GitHub profile init DAG only stores data in OpenSearch, and a separate ck_transfer DAG then copies all GitHub profile data from OpenSearch to ClickHouse. When that DAG runs again, it copies duplicate data into ClickHouse. It would be better to transfer the data to ClickHouse at the same time, within a single DAG.

The design:
Each developer's profile has an updated_at field indicating when the profile was last modified. Selecting the OpenSearch documents whose updated_at is greater than the latest updated_at in ClickHouse, then inserting them into ClickHouse, is approximately sufficient.
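The incremental selection described above can be sketched as a query builder. This is a minimal illustration; the index and field paths (`raw_data.updated_at`) are assumptions, not the project's actual schema:

```python
# Hypothetical sketch: build the OpenSearch query body that selects only
# profiles modified after the newest updated_at already in ClickHouse.
# The field path "raw_data.updated_at" is an assumed name, not the real schema.

def build_incremental_query(latest_ck_updated_at: str) -> dict:
    """Return an OpenSearch query body selecting profiles newer than the
    given ClickHouse high-water mark (an ISO-8601 timestamp string)."""
    return {
        "query": {
            "range": {
                "raw_data.updated_at": {"gt": latest_ck_updated_at}
            }
        },
        # Sort ascending so rows can be inserted into ClickHouse in order.
        "sort": [{"raw_data.updated_at": "asc"}],
    }
```

The returned body would be passed to an opensearchpy `search` or scroll helper, and the hits inserted into ClickHouse.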

Control opensearch insertion batch

Some data, such as GitHub commits, pull requests, and issues, are fetched and inserted page by page, and the data in one page may not be as large as a full batch. Set up a batch buffer, collect page results into it, and insert the buffered data into OpenSearch whenever the batch is full.
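A minimal sketch of such a batch buffer, assuming the flush callback stands in for an opensearchpy bulk helper (class and parameter names are illustrative, not the project's actual API):

```python
class OpenSearchBatcher:
    """Collect per-page documents and flush them in fixed-size batches.

    `flush_fn` is a stand-in for a real bulk-insert call such as
    opensearchpy.helpers.bulk; here it just receives a list of docs.
    """

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def add_page(self, docs):
        """Append one page of results; flush whenever a full batch accumulates."""
        self.buffer.extend(docs)
        while len(self.buffer) >= self.batch_size:
            self.flush_fn(self.buffer[:self.batch_size])
            self.buffer = self.buffer[self.batch_size:]

    def close(self):
        """Flush any remaining partial batch at the end of the fetch loop."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

A fetch loop would call `add_page()` for every API page and `close()` once pagination ends, so small pages are coalesced instead of triggering one insert each.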

[FEATURE] A universal design of code review workflow

Open-source communities use many different code review/merging/maintenance mechanisms: the Apache Foundation's incubation system, the Linux kernel's mailing-list discussion and patch review/merging, and GitHub/GitLab's issue/PR/comment-based collaboration. We therefore need to design a universal code review workflow that captures the whole process of code being written, discussed/reviewed, modified, and merged, so that the different code collaboration systems can be mapped onto the universal template, and any future forms can be adapted as well.
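One possible shape for such a universal template is a common event model that each collaboration system is mapped onto. This is only a sketch of the idea; all names and the event vocabulary are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: every collaboration system (GitHub PRs,
# mailing-list patches, Apache incubation, ...) is normalized into a
# common sequence of events attached to one logical code change.

@dataclass
class ReviewEvent:
    actor: str       # who acted on the change
    action: str      # e.g. "submitted", "commented", "revised", "merged"
    timestamp: str   # ISO-8601 time of the event
    source: str      # originating system, e.g. "github_pr", "mailing_list"

@dataclass
class CodeChange:
    change_id: str
    events: List[ReviewEvent] = field(default_factory=list)

    def is_merged(self) -> bool:
        """A change is merged once any source system recorded a merge event."""
        return any(e.action == "merged" for e in self.events)
```

Each system-specific collector would translate its native records (PR comments, patch emails, etc.) into `ReviewEvent`s, so downstream analysis only needs to understand one schema.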

Default/Empty github API result

GitHub repositories might be removed after we have downloaded them, in which case requests for those repositories return NotFound. We should come up with a solution that provides default/empty GitHub resources in this case.

A current example: the get_github_pull_requests function in libs/util/github_api.py (@line 127) will fail if the response is NotFound.
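One way to provide the default is a small wrapper around the fetch call. This is a sketch only; the real code would catch the GitHub client's actual 404 exception type, which is modeled here as `LookupError`:

```python
def safe_github_fetch(fetch_fn, default=None):
    """Call fetch_fn; on a NotFound-style failure, return a default/empty
    result instead of propagating the error.

    `LookupError` stands in for whatever exception the project's GitHub
    client raises on a 404 response.
    """
    try:
        return fetch_fn()
    except LookupError:
        return [] if default is None else default
```

Functions like `get_github_pull_requests` could then route their request through this wrapper and receive an empty list when the repository has been removed.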

Add DAG to daily sync github profile from remote clickhouse service

While resolving issue #189, we found that the GitHub profile data can be stored with a new engine that removes duplication by keeping the (github id, updated_at) field tuple distinct. That tuple defines a snapshot of a profile at a particular time.

So after switching to the new engine, it becomes possible to sync GitHub profiles from the remote ClickHouse service by comparing search_key__updated_at, and the new engine will keep removing duplication.
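In ClickHouse, a deduplicating engine keyed on that tuple is typically `ReplacingMergeTree`. A hypothetical DDL sketch follows; the table and column names are assumptions about the project's schema, not its real definition:

```python
# Hypothetical DDL sketch: a ReplacingMergeTree table ordered by
# (id, search_key__updated_at), so re-inserted profile snapshots with the
# same key are deduplicated during background merges.
# All identifiers below are assumed, not taken from the project.
PROFILE_DDL = """
CREATE TABLE IF NOT EXISTS github_profile
(
    id UInt64,
    search_key__updated_at DateTime,
    raw_data String
)
ENGINE = ReplacingMergeTree
ORDER BY (id, search_key__updated_at)
"""
```

Note that `ReplacingMergeTree` deduplicates asynchronously at merge time, so the sync job can insert overlapping ranges without pre-filtering; exact-dedup reads would use `FINAL`.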

50x error

When the target server responds with a 50x error, the base function currently just returns an empty list. For modules that expect a profile, this causes errors.
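One alternative to silently returning an empty list is to retry and then fail loudly. A minimal sketch, where `do_request` is an assumed callable returning a `(status, body)` pair rather than the project's real base function:

```python
import time

def fetch_with_retry(do_request, retries=3, backoff=1.0):
    """Retry on 50x responses; raise instead of returning [] so that
    callers expecting a profile fail visibly rather than downstream.

    `do_request` is a stand-in returning (status_code, body).
    """
    status = None
    for attempt in range(retries):
        status, body = do_request()
        if status < 500:
            return body
        # Simple linear backoff between attempts.
        time.sleep(backoff * (attempt + 1))
    raise RuntimeError(f"server kept returning {status} after {retries} attempts")
```

Callers that can genuinely tolerate missing data could still catch the exception and substitute an empty result themselves.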

Remove unnecessary logs of opensearchpy

The opensearchpy library prints a line of log output on each request, like this:

[2023-04-25, 09:01:43 UTC] {base.py:270} INFO - POST https://dev-opensearch-node1:9200/gits/_delete_by_query [status:200 request:0.006s]

This produces a huge amount of useless text; consider turning the logs off via opensearchpy's logging settings.
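Since opensearchpy emits these lines through the standard `logging` module, raising the level of its logger should silence them. A sketch, assuming the transport logger is named "opensearch" (the quoted line comes from opensearchpy's base.py, which logs under that name):

```python
import logging

# Suppress opensearchpy's per-request INFO lines (the
# "POST https://... [status:200 request:0.006s]" messages) by raising
# the level of its transport logger above INFO.
logging.getLogger("opensearch").setLevel(logging.WARNING)
```

This keeps warnings and errors visible while dropping the per-request noise; it can be placed once in the DAG's common setup code.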

Handle renamed github repos

When a GitHub repository is renamed, the current git_track_repo DAG's issue timeline/comments task never finishes (and it keeps calling the API at a very high pace).
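The GitHub API transparently redirects requests for renamed repositories, so the `full_name` in the response no longer matches the one that was requested. A sketch of a rename check built on that observation (function name is illustrative):

```python
def detect_rename(requested_full_name: str, response_full_name: str):
    """Return the repo's new full name if the API response reveals a
    rename, else None.

    GitHub follows renames with a redirect, so comparing the requested
    "owner/repo" with the full_name field of the returned repository
    object exposes the rename. Comparison is case-insensitive because
    GitHub treats owner/repo names case-insensitively.
    """
    if response_full_name.lower() != requested_full_name.lower():
        return response_full_name
    return None
```

On a detected rename, the task could update its stored (owner, repo) bookkeeping and restart pagination under the new name instead of looping forever.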

Set different job intervals for daily sync DAGs

Daily sync DAGs need to be scheduled at different intervals to balance resource consumption. The policy is two levels of interval variables with different priority:

high: daily_sync_{DAG_NAME}_interval
middle: daily_sync_interval

If neither of the interval variables above is configured, do not trigger the DAG periodically.
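The lookup priority can be sketched as a small resolver. Here `get_variable` stands in for `airflow.models.Variable.get` with `default_var=None`; the function name is illustrative:

```python
def resolve_interval(dag_name, get_variable):
    """Resolve a DAG's schedule interval by priority:

    1. high:   daily_sync_{DAG_NAME}_interval
    2. middle: daily_sync_interval
    3. neither configured -> None, meaning do not schedule periodically.

    `get_variable` is a stand-in for Airflow's Variable.get that returns
    None when the variable is absent.
    """
    specific = get_variable(f"daily_sync_{dag_name}_interval")
    if specific is not None:
        return specific
    return get_variable("daily_sync_interval")
```

The DAG definition would then pass the result as `schedule_interval`, where `None` already means "manual trigger only" in Airflow.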

Make unified format of gits origin

Both https://github.com/OWNER/REPO.git and https://github.com/OWNER/REPO are valid git URLs. When the 'includes' variable for a daily sync specifies an origin that differs from the data already in OpenSearch — for example when the OpenSearch document's origin carries the '.git' suffix while 'includes' does not — there will be two different (owner, repo, origin) tuples.

If we then do a full-repo daily sync, the two tuples are treated as two separate code bases and synced separately, introducing redundant data into OpenSearch and from there into ClickHouse.

The solution is to make sure the '.git' suffix is eliminated before initializing or syncing data.
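The normalization is a one-line string transform; a minimal sketch (function name is illustrative):

```python
def normalize_origin(origin: str) -> str:
    """Map both accepted git URL forms onto one canonical origin by
    stripping any trailing slash and a trailing '.git' suffix, so that
    OWNER/REPO.git and OWNER/REPO yield identical (owner, repo, origin)
    tuples."""
    origin = origin.rstrip("/")
    if origin.endswith(".git"):
        origin = origin[: -len(".git")]
    return origin
```

Applying this in both the init path and the daily-sync path guarantees the two URL spellings can never diverge into two tuples.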
