
pacta.data.preparation's Introduction

pacta.data.preparation

Lifecycle: stable R-CMD-check codecov CRAN status pacta.data.preparation status badge

The pacta.data.preparation R package provides utilities and functions used to process various data files into data required to run the PACTA for investors tools, such as RMI-PACTA/workflow.transition.monitor or RMI-PACTA/workflow.pacta.

Installation

You can install the development version of pacta.data.preparation from GitHub with:

# install.packages("pak")
pak::pak("RMI-PACTA/pacta.data.preparation")

Running data preparation

The primary way this package is intended to be used is through the workflow.data.preparation repo. See there for an example workflow of how to use the functions of this package to complete a data preparation process properly.

pacta.data.preparation's People

Contributors

cjyetman, dependabot[bot], jdhoffa

Watchers

Alex Axthelm

pacta.data.preparation's Issues

consider filtering production data to only those companies we have an ID and/or ISIN for

Our current production data is quite large and likely contains a lot of data about private companies and/or other companies that we will not be able to match to a company/entity ID or to any ISIN. It would be advantageous to remove such companies from the prepared dataset to reduce its size and complexity. This would likely be done in new_data_prep.R in the section where ABCD data is being prepared for output here.

add checks for crucial inputs to relevant functions

A fair and valid point. I would think in that case that having good initial checks in every relevant function is a suitable alternative. These errors (and warnings I guess, if applicable) should have informative and clear messages when crucial data is lacking.

I don't know if that already exists sufficiently? If it does then happy to close this, otherwise I would suggest we open a new issue in pacta.data.preparation to ensure something like
"input data in every exported data processing function in pacta.data.preparation is checked for crucial inputs".

NIT: Effective logging would help this further (i know i know) by making it possible to quickly ascertain where hiccups are happening for those who DON'T run this in RStudio.

Originally posted by @jdhoffa in RMI-PACTA/workflow.data.preparation#11 (comment)

Some of the functions in this repo have a data argument as well as additional arguments specifying particular values to be used from the data input. It would be advantageous for functions like this to first check whether the desired options can be achieved by verifying that they are available in the input data, and if not, to provide an informative error explaining why the function cannot work as expected.
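As a sketch of what such a check might look like (the function and argument names here are hypothetical, not part of the package's API):

```r
# Hypothetical helper: verify that requested values exist in a column of the
# input data before any processing happens, and fail with an informative
# message if they do not.
check_values_in_data <- function(data, column, values) {
  if (!column %in% names(data)) {
    stop("`data` must contain a `", column, "` column.", call. = FALSE)
  }
  missing_values <- setdiff(values, unique(data[[column]]))
  if (length(missing_values) > 0) {
    stop(
      "These requested `", column, "` values are not present in `data`: ",
      paste(missing_values, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}

# e.g. check_values_in_data(abcd, "technology", c("Coal", "RenewablesCap"))
```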

consider removing rows where scenario data is not available

https://github.com/RMI-PACTA/pacta.data.preparation/blob/ba0f8b8518afb2d00bfe5d9bff1a935418eaa5dd/R/dataprep_abcd_scen_connection.R#L267-L303

When the scenario data is left_joined with the ABCD data, it's possible/likely that some rows of the ABCD data will not match any rows in the scenario data by = c("scenario_geography", "year", "ald_sector", "technology"), and therefore the columns from the scenario data that are added (scenario_source, scenario, units, direction, fair_share_perc) will be filled with NA for those rows. Are these rows useful at all after this point?

I think we should carefully consider whether these lines with no scenario data are meaningful for any reason, and if not we should filter them out to potentially reduce the size of the data substantially. @jacobvjk @jdhoffa @AlexAxthelm

It's possible we do want at least one row of the ABCD data to be left in place even if no scenario data matches it, in which case we'll need something more sophisticated... though the scenario_geography and equity_market columns will make multiple rows distinct even while the rest of the data is duplicated?
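A rough sketch of the filtering being proposed, using scenario_source as the indicator of a failed match (company_id is an assumed identifier; this is not the package's actual code):

```r
library(dplyr)

# scenario_source is NA only where the left_join found no matching scenario
# row, so it can serve as the match indicator.
drop_unmatched <- function(abcd_scenario) {
  abcd_scenario %>% filter(!is.na(scenario_source))
}

# Alternatively, keep one representative unmatched row per
# company/sector/technology combination rather than dropping them all.
keep_one_unmatched <- function(abcd_scenario) {
  abcd_scenario %>%
    filter(is.na(scenario_source)) %>%
    distinct(company_id, ald_sector, technology, .keep_all = TRUE)
}
```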

related #7

create a list of important characteristics to be auto checked after data.prep runs

It would be nice to start compiling a list of data characteristics that should be checked once data.prep is done, and eventually write some code that will automatically check/report on them after every run.

I'll start with a few, and please add to it if you think of any...

  • number of unique funds in the fund database
  • verify that in every fund, the sum of the holdings' values adds up to the total value
  • verify that each isin in isin_to_fund_table points to a single fund, i.e. is unique
  • verify that all output RDS files are ungrouped (relates to RMI-PACTA/archive.pacta.data.preparation#218)
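To seed that effort, here is a rough sketch of what a few of these checks might look like (column names such as fund_isin, value, and fund_total_value are assumptions, not the package's actual schema):

```r
library(dplyr)

# Within each fund, the holdings' values should sum to the fund's total.
# Returns the offending funds, if any.
check_fund_totals <- function(fund_data, tolerance = 1e-8) {
  fund_data %>%
    group_by(fund_isin) %>%
    summarise(
      holdings_sum = sum(value, na.rm = TRUE),
      fund_total = first(fund_total_value),
      .groups = "drop"
    ) %>%
    filter(abs(holdings_sum - fund_total) > tolerance)
}

# Each ISIN in isin_to_fund_table should point to exactly one fund.
check_isin_uniqueness <- function(isin_to_fund_table) {
  stopifnot(anyDuplicated(isin_to_fund_table$isin) == 0)
}

# Output RDS files should not carry dplyr groups.
check_ungrouped <- function(path) {
  stopifnot(!inherits(readRDS(path), "grouped_df"))
}
```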

consider dealing with emissions factors in ABCD data more cautiously

here...
https://github.com/2DegreesInvesting/pacta.data.preparation/blob/04ed2a54fcb20589086937a2d58ed994871ec78f/run_pacta_data_preparation.R#L358-L364

and here...
https://github.com/2DegreesInvesting/pacta.data.preparation/blob/04ed2a54fcb20589086937a2d58ed994871ec78f/run_pacta_data_preparation.R#L413-L419

we set the emissions factors to 0 if the technology is in the user-specified list of zero_emission_factor_techs and the production values are greater than 0. This seems hella questionable and we should consider doing whatever this is trying to achieve in a more sensible way.
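For reference, the logic in question amounts to roughly the following dplyr sketch (column names are assumed from context, and this is a paraphrase rather than the script's actual code):

```r
library(dplyr)

# Current behavior, roughly: zero out the emission factor wherever the
# technology is in the user-specified zero_emission_factor_techs list and
# production is positive.
zero_emission_factors <- function(abcd, zero_emission_factor_techs) {
  abcd %>%
    mutate(
      emission_factor = if_else(
        technology %in% zero_emission_factor_techs & production > 0,
        0,
        emission_factor
      )
    )
}
```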

don't expand `scenario_geography` and `equity_market` until necessary

https://github.com/RMI-PACTA/pacta.data.preparation/blob/ba0f8b8518afb2d00bfe5d9bff1a935418eaa5dd/R/dataprep_abcd_scen_connection.R#L143-L151

Up until merging in the scenario data, the expansion of the data with the scenario_geography and equity_market columns drastically multiplies the number of rows, and the grouped calculations necessitated by these otherwise duplicated rows are a source of the incredibly long run times. Essentially, for every combination of id, technology, and year we multiply the rows by every combination of scenario_geography and equity_market and calculate duplicate data for all of them.

We should carefully consider whether this is actually necessary, and if not, calculate as much as we can before expanding to the scenario_geography and equity_market values. @jdhoffa @jacobvjk @AlexAxthelm
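One possible shape of the fix, as a hedged sketch (identifiers here are assumptions, not the package's code): aggregate per id/technology/year first on the small data, and only cross with the scenario_geography and equity_market combinations when the scenario join actually needs them.

```r
library(dplyr)
library(tidyr)

# Aggregate once per id/technology/year on the un-expanded data, then expand
# to every scenario_geography/equity_market combination afterwards, so the
# grouped calculations run on far fewer rows.
defer_expansion <- function(abcd, geographies, equity_markets) {
  abcd %>%
    group_by(id, technology, year) %>%
    summarise(production = sum(production, na.rm = TRUE), .groups = "drop") %>%
    crossing(scenario_geography = geographies, equity_market = equity_markets)
}
```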

related #7

investigate differences between new AR format and old AR format

Given the recent changes at AR, we will probably eventually need to switch to their new data format. We should investigate the differences between the old and new formats and assess whether we can make the switch.

library(tidyverse)
library(pacta.data.preparation)

ar_data_path <- "~/Documents/Data/Asset Resolution/2022-08-15_AR_2021Q4"
ar_advanced_company_indicators_path <- file.path(ar_data_path, "2022-08-24_AR_2021Q4_RMI-Company-Indicators.xlsx")
masterdata_debt_path <- file.path(ar_data_path, "2022-10-05_rmi_masterdata_debt_2021q4.csv")
masterdata_ownership_path <- file.path(ar_data_path, "2022-08-15_rmi_masterdata_ownership_2021q4.csv")

ar_advanced_company_indicators <- import_ar_advanced_company_indicators(ar_advanced_company_indicators_path, fix_names = TRUE)
masterdata_debt <- readr::read_csv(masterdata_debt_path, na = "", show_col_types = FALSE)
masterdata_ownership <- readr::read_csv(masterdata_ownership_path, na = "", show_col_types = FALSE)


# -------------------------------------------------------------------------

ownership_data <- 
  ar_advanced_company_indicators %>% 
  filter(consolidation_method == "Equity Ownership") %>% 
  filter(value_type == "production") %>%
  filter(
    asset_sector == "Aviation" & activity_unit %in% c("pkm", "tkm") |
      asset_sector == "Cement" & activity_unit == "t cement" |
      asset_sector == "Coal" & activity_unit == "t coal" |
      asset_sector == "HDV" & activity_unit == "# vehicles" |
      asset_sector == "LDV" & activity_unit == "# vehicles" |
      asset_sector == "Oil&Gas" & activity_unit == "GJ" |
      asset_sector == "Power" & activity_unit == "MW" |
      asset_sector == "Shipping" & activity_unit == "dwt km" |
      asset_sector == "Steel" & activity_unit == "t steel"
  ) %>% 
  pivot_wider(names_from = "year", values_fill = 0) %>% 
  group_by(asset_sector) %>% 
  summarise(new_2016 = sum(`2016`, na.rm = TRUE), new_n = length(unique(company_id)))

masterdata_ownership %>% 
  filter(company_id %in% ar_advanced_company_indicators$company_id) %>% 
  group_by(sector) %>% 
  summarise(old_2016 = sum(`_2016`, na.rm = TRUE), old_n = length(unique(company_id))) %>% 
  full_join(ownership_data, by = c(sector = "asset_sector")) %>% 
  mutate(
    new_2016 = if_else(is.na(new_2016), as.numeric(0), as.numeric(new_2016)),
    new_n = if_else(is.na(new_n), as.numeric(0), as.numeric(new_n))
  ) %>% 
  mutate(diff = old_2016 - new_2016) %>% 
  mutate(percent_diff = round(diff / old_2016 * 100, digits = 2))

#> # A tibble: 9 × 7
#>   sector   old_2016 old_n new_2016 new_n     diff percent_diff
#>   <chr>       <dbl> <int>    <dbl> <dbl>    <dbl>        <dbl>
#> 1 Aviation  6.01e12  1820  6.01e12  1820 -3.12e-2         0   
#> 2 Cement    5.01e 9  2070  4.67e 9  2035  3.46e+8         6.91
#> 3 Coal      1.09e10  2051  1.09e10  2061 -1.47e-2         0   
#> 4 HDV       0         575  0         575  0             NaN   
#> 5 LDV       1.39e 8   394  1.39e 8   435 -3.73e-1         0   
#> 6 Oil&Gas   7.11e11  4563  7.11e11  4563 -5.32e-2         0   
#> 7 Power     1.30e 7 36846  1.30e 7 36875 -3.26e-1         0   
#> 8 Shipping  2.56e14 11167  2.56e14 11169  2.95e+9         0   
#> 9 Steel     2.73e 9  1105  2.73e 9  1105 -5.76e-3         0


# -------------------------------------------------------------------------

fin_control_data <- 
  ar_advanced_company_indicators %>% 
  filter(consolidation_method == "Financial Control") %>% 
  filter(value_type == "production") %>%
  filter(
    asset_sector == "Aviation" & activity_unit %in% c("pkm", "tkm") |
      asset_sector == "Cement" & activity_unit == "t cement" |
      asset_sector == "Coal" & activity_unit == "t coal" |
      asset_sector == "HDV" & activity_unit == "# vehicles" |
      asset_sector == "LDV" & activity_unit == "# vehicles" |
      asset_sector == "Oil&Gas" & activity_unit == "GJ" |
      asset_sector == "Power" & activity_unit == "MW" |
      asset_sector == "Shipping" & activity_unit == "dwt km" |
      asset_sector == "Steel" & activity_unit == "t steel"
  ) %>% 
  pivot_wider(names_from = "year", values_fill = 0) %>% 
  group_by(asset_sector) %>% 
  summarise(new_2016 = sum(`2016`, na.rm = TRUE), new_n = length(unique(company_id)))

masterdata_debt %>% 
  filter(company_id %in% ar_advanced_company_indicators$company_id) %>% 
  group_by(sector) %>% 
  summarise(old_2016 = sum(`_2016`, na.rm = TRUE), old_n = length(unique(company_id))) %>% 
  full_join(fin_control_data, by = c(sector = "asset_sector")) %>% 
  mutate(
    new_2016 = if_else(is.na(new_2016), as.numeric(0), as.numeric(new_2016)),
    new_n = if_else(is.na(new_n), as.numeric(0), as.numeric(new_n))
  ) %>% 
  mutate(diff = old_2016 - new_2016) %>% 
  mutate(percent_diff = round(diff / old_2016 * 100, digits = 2))

#> # A tibble: 9 × 7
#>   sector   old_2016 old_n new_2016 new_n     diff percent_diff
#>   <chr>       <dbl> <int>    <dbl> <dbl>    <dbl>        <dbl>
#> 1 Aviation  3.24e12  1302  5.94e12  1609 -2.70e12        -83.3
#> 2 Cement    2.89e 9  1459  4.27e 9  1685 -1.39e 9        -48.0
#> 3 Coal      6.31e 9  1349  1.12e10  1629 -4.87e 9        -77.2
#> 4 HDV       0         283  0         426  0              NaN  
#> 5 LDV       9.36e 7   194  1.16e 8   300 -2.24e 7        -23.9
#> 6 Oil&Gas   3.06e11  2864  6.85e11  3864 -3.79e11       -124. 
#> 7 Power     6.12e 6 29582  1.28e 7 34012 -6.68e 6       -109. 
#> 8 Shipping  1.70e14  9844  2.53e14 10518 -8.30e13        -48.9
#> 9 Steel     1.66e 9   757  2.70e 9   924 -1.03e 9        -62.1

Explore multi-threading `{asset}_abcd_scenario.rds` generating function

With RMI-PACTA/archive.pacta.data.preparation#81 closed, there is an opportunity to use multi-threading to speed up these time- and memory-intensive processes.

In particular, we might be able to spread each process to calculate {asset}_abcd_scenario_{scenario_name}.rds across multiple CPUs.
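As a sketch of one approach using the base {parallel} package; prepare_abcd_scenario(), scenario_names, and asset_type below are placeholders standing in for the real data-prep pieces, not the package's API:

```r
library(parallel)

# Placeholders for illustration only.
scenario_names <- c("WEO2021", "GECO2021")
asset_type <- "equity"
prepare_abcd_scenario <- function(scenario_name) {
  data.frame(scenario = scenario_name)  # stand-in for the real computation
}

# Spread each scenario's computation across cores (forked processes; on
# Windows, a parLapply() cluster would be needed instead of mclapply()).
results <- mclapply(
  scenario_names,
  function(scenario_name) {
    out <- prepare_abcd_scenario(scenario_name)
    saveRDS(
      out,
      file.path(tempdir(), paste0(asset_type, "_abcd_scenario_", scenario_name, ".rds"))
    )
    out
  },
  mc.cores = max(1L, detectCores() - 1L)
)
```

Whether this actually helps will depend on memory pressure: each forked process needs its own working set, so the memory-intensive steps may cap how many scenarios can run concurrently.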

Programmatically determine which economic variable to connect between ABCD and Scenario

Yes, exactly.... the primary point is that we can see in the current ABCD data that a given sector may have metric == "company level capacity" AND metric == "company level production", and we should be cautious about that and make sure we choose the appropriate one. In the current data, Shipping technologies are the only ones in that situation, so we can get away with ignoring it for now since we aren't using Shipping data to prepare 2021Q4, but eventually we should deal with this possibility (for all sectors) in a programmatic way.

Originally posted by @cjyetman in https://github.com/2DegreesInvesting/pacta.data.preparation/issues/94#issuecomment-1239104486
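One possible programmatic rule, sketched under the assumption that "company level production" is preferred whenever both metrics are present for a sector (column names follow the issue text; which metric is actually appropriate may well differ by sector, e.g. Power uses capacity in MW, so this rule is illustrative only):

```r
library(dplyr)

# For each sector, keep "company level production" rows when that metric
# exists; otherwise fall back to "company level capacity".
pick_metric <- function(abcd) {
  abcd %>%
    group_by(ald_sector) %>%
    filter(
      if ("company level production" %in% metric) {
        metric == "company level production"
      } else {
        metric == "company level capacity"
      }
    ) %>%
    ungroup()
}
```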
