Comments (2)
I've started working on this. So far I've implemented functionality to identify and delete data files that are identical to an earlier pull.
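For readers following along, here is a minimal sketch of one way to spot data files that are identical to an earlier pull. It assumes "identical" means byte-for-byte equal contents and compares SHA-256 hashes. The function names (`file_sha256`, `find_duplicate_pulls`, `delete_duplicate_pulls`) are hypothetical and are not taken from the project's code.

```python
import hashlib
from pathlib import Path
from typing import List, Set


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_pulls(files_oldest_first: List[Path]) -> List[Path]:
    """Return files whose contents match any earlier file in the list.

    Assumption: the caller passes the files ordered from oldest to newest pull.
    """
    seen: Set[str] = set()
    duplicates: List[Path] = []
    for path in files_oldest_first:
        file_hash = file_sha256(path)
        if file_hash in seen:
            duplicates.append(path)
        else:
            seen.add(file_hash)
    return duplicates


def delete_duplicate_pulls(files_oldest_first: List[Path], dry_run: bool = True) -> List[Path]:
    """Delete duplicate files (only when dry_run is False) and return them."""
    duplicates = find_duplicate_pulls(files_oldest_first)
    if not dry_run:
        for path in duplicates:
            path.unlink()
    return duplicates
```

Defaulting to `dry_run=True` lets the list of candidates be reviewed before anything is removed.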
I've structured things so that some functionality can be reused in the next step, where I'll implement a DAG to delete some non-duplicated data, although I haven't settled on retention logic yet. Maybe it will be a `keep_last_n_data_versions`, or maybe it would be better to have a `keep_data_versions_from_past_n_days`, or maybe both. I'll think through cases.
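To make the two candidate rules concrete, here is a rough sketch of how they could combine. It is not the project's implementation: the function `select_versions_to_delete` is hypothetical, data versions are represented only by their pull timestamps, and it assumes that when both rules are set, a version is kept if either rule protects it.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def select_versions_to_delete(
    version_times: List[datetime],
    keep_last_n_data_versions: Optional[int] = None,
    keep_data_versions_from_past_n_days: Optional[int] = None,
    now: Optional[datetime] = None,
) -> List[datetime]:
    """Return the version timestamps (newest first) that no active rule protects."""
    # With no rule configured, delete nothing rather than everything.
    if keep_last_n_data_versions is None and keep_data_versions_from_past_n_days is None:
        return []
    now = now or datetime.now()
    newest_first = sorted(version_times, reverse=True)
    keep = set()
    if keep_last_n_data_versions is not None:
        # Protect the n most recent versions regardless of age.
        keep.update(newest_first[:keep_last_n_data_versions])
    if keep_data_versions_from_past_n_days is not None:
        # Protect every version pulled within the past n days.
        cutoff = now - timedelta(days=keep_data_versions_from_past_n_days)
        keep.update(t for t in newest_first if t >= cutoff)
    return [t for t in newest_first if t not in keep]
```

One case worth noting: with only the days-based rule, a data set that has not been refreshed recently could lose every version, which is an argument for pairing it with a count-based floor.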
from analytics_data_where_house.
After reviewing the sizes of XComs stored in the `airflow_metadata_db` and of logs in all non-scheduler logs directories, I see that the contents of the `/logs/scheduler` dir comprise 94% of the `/logs` disk usage. Upon inspecting a few scheduler log files, I see that the issue is there are ~25MB of logs per DAG per day, and it's overwhelmingly driven by this unnecessary warning that's slated to be removed in Airflow v2.5.2 (we're at v2.5.1 right now). So I'll settle for just clearing out old scheduler records right now.
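As an illustration of clearing out old scheduler logs, here is a minimal sketch. It assumes the scheduler log directory holds one subdirectory per day named `YYYY-MM-DD` (plus possibly a `latest` symlink), which should be checked against the actual layout before use. The function `clear_old_scheduler_logs` and its parameters are hypothetical, not part of Airflow or this project.

```python
import shutil
from datetime import date, datetime, timedelta
from pathlib import Path
from typing import List, Optional


def clear_old_scheduler_logs(
    scheduler_log_dir: Path,
    keep_days: int,
    today: Optional[date] = None,
    dry_run: bool = True,
) -> List[Path]:
    """Remove date-named log subdirectories older than keep_days; return what was selected."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    selected: List[Path] = []
    for entry in sorted(scheduler_log_dir.iterdir()):
        # Skip symlinks (such as a `latest` link) and plain files.
        if entry.is_symlink() or not entry.is_dir():
            continue
        try:
            log_date = datetime.strptime(entry.name, "%Y-%m-%d").date()
        except ValueError:
            continue  # Not a date-named directory, so leave it alone.
        if log_date < cutoff:
            if not dry_run:
                shutil.rmtree(entry)
            selected.append(entry)
    return selected
```

Something like `clear_old_scheduler_logs(Path("/logs/scheduler"), keep_days=7)` would list the directories that would be removed; passing `dry_run=False` actually deletes them.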
Related Issues (20)
- Implement a metadata collector for Census data sets
- Update pgAdmin to version 7.0
- Upgrade Airflow from 2.5.2 to 2.5.3
- Refactor _standardized stage scripts to clean col-values before making them into a composite key
- Remove the `report` schema and its models HOT 1
- Fix great_expectations workflow to be run from the airflow_scheduler container HOT 2
- Develop an interface to access data tables from an external notebook or other outside analysis environment
- Explore and document the Census API data catalog, endpoints, and metadata HOT 8
- Remove dev_utils module
- The default TimeZone of the system's postgres databases was left as UTC, so timestamptz columns need correction HOT 1
- Notes from great_expectations workflow experiments: replace now-obsolete CLI workflow HOT 3
- Update to Airflow v2.6.0 and update package versions installed in Airflow images HOT 1
- Update pgAdmin4 from v7.0 to v7.1
- Notes on postgres commands or recipes for efficiently cleaning column values HOT 4
- The clean-model formatting code fails when the composite-key list in the standardized model spans multiple lines
- Add tasks to register new tables as Great Expectations Data Sources
- Upgrade Airflow to v 2.6.1
- Add tasks to ingest metadata on Census API Dataset geographies
- Develop a prototype representation for a Census API dataset
- Define a schema for a Census API Dataset-Groups Metadata table