
pacta.data.preparation's Introduction

pacta.data.preparation

Lifecycle: stable R-CMD-check codecov CRAN status pacta.data.preparation status badge

The pacta.data.preparation R package provides utilities and functions used to process various data files into data required to run the PACTA for investors tools, such as RMI-PACTA/workflow.transition.monitor or RMI-PACTA/workflow.pacta.

Installation

You can install the development version of pacta.data.preparation from GitHub with:

# install.packages("pak")
pak::pak("RMI-PACTA/pacta.data.preparation")

Running data preparation

The primary way this package is intended to be used is through the workflow.data.preparation repo. See there for an example workflow of how to use the functions of this package to complete a data preparation process properly.

pacta.data.preparation's People

Contributors

cjyetman, dependabot[bot], jdhoffa

Watchers

Alex Axthelm

pacta.data.preparation's Issues

consider filtering production data to only those companies we have an ID and/or ISIN for

Our current production data is quite large and likely contains a lot of data about private companies and/or other companies that we will not be able to match to a company/entity ID or to any ISIN. It would be advantageous to remove such companies from the prepared dataset to reduce its size and complexity. This would likely be done in new_data_prep.R in the section where ABCD data is being prepared for output here.

add checks for crucial inputs to relevant functions

A fair and valid point. I would think in that case that having good initial checks in every relevant function is a suitable alternative. These errors (and warnings I guess, if applicable) should have informative and clear messages when crucial data is lacking.

I don't know if that already exists sufficiently? If it does then happy to close this, otherwise I would suggest we open a new issue in pacta.data.preparation to ensure something like
"input data in every exported data processing function in pacta.data.preparation is checked for crucial inputs".

NIT: Effective logging would help this further (i know i know) by making it possible to quickly ascertain where hiccups are happening for those who DON'T run this in RStudio.

Originally posted by @jdhoffa in RMI-PACTA/workflow.data.preparation#11 (comment)

Some of the functions in this repo have a data argument as well as additional arguments specifying particular values to be used from the data input. It would be advantageous for functions like this to first check whether the desired options can be achieved by verifying that they are available in the input data, and if not, to provide an informative error explaining why the function cannot work as expected.
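As a sketch of what such a check might look like (the function and argument names here are hypothetical, not part of the package's API):

```r
# Hypothetical helper: verify that requested values exist in a column of the
# input data before any processing happens, and fail with an informative
# message if they do not.
check_values_in_data <- function(data, column, values) {
  if (!column %in% names(data)) {
    stop("`data` must contain a `", column, "` column.", call. = FALSE)
  }
  missing_values <- setdiff(values, unique(data[[column]]))
  if (length(missing_values) > 0) {
    stop(
      "These requested `", column, "` values are not present in `data`: ",
      paste(missing_values, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(data)
}

# e.g. check_values_in_data(abcd, "technology", c("Coal", "RenewablesCap"))
```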

consider removing rows where scenario data is not available

https://github.com/RMI-PACTA/pacta.data.preparation/blob/ba0f8b8518afb2d00bfe5d9bff1a935418eaa5dd/R/dataprep_abcd_scen_connection.R#L267-L303

When the scenario data is left_joined with the ABCD data, it's possible/likely that some rows of the ABCD data will not match any rows in the scenario data by = c("scenario_geography", "year", "ald_sector", "technology"), and therefore the columns from the scenario data that are added (scenario_source, scenario, units, direction, fair_share_perc) will be filled with NA for those rows. Are these rows useful at all after this point?

I think we should carefully consider whether these lines with no scenario data are meaningful for any reason, and if not we should filter them out to potentially reduce the size of the data substantially. @jacobvjk @jdhoffa @AlexAxthelm

It's possible we do want at least one row of the ABCD data to be left in place even if no scenario data matches it, in which case we'll need something more sophisticated... though the scenario_geography and equity_market columns will make multiple rows distinct even while the rest of the data is duplicated?
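A rough sketch of the filtering being proposed, using scenario_source as the indicator of a failed match (company_id is an assumed identifier; this is not the package's actual code):

```r
library(dplyr)

# scenario_source is NA only where the left_join found no matching scenario
# row, so it can serve as the match indicator.
drop_unmatched <- function(abcd_scenario) {
  abcd_scenario %>% filter(!is.na(scenario_source))
}

# Alternatively, keep one representative unmatched row per
# company/sector/technology combination rather than dropping them all.
keep_one_unmatched <- function(abcd_scenario) {
  abcd_scenario %>%
    filter(is.na(scenario_source)) %>%
    distinct(company_id, ald_sector, technology, .keep_all = TRUE)
}
```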

related #7

create a list of important characteristics to be auto checked after data.prep runs

It would be nice to start compiling a list of data characteristics that should be checked once data.prep is done, and eventually write some code that will automatically check/report on them after every run.

I'll start with a few, and please add to it if you think of any...

  • number of unique funds in the fund database
  • verify that in every fund, the sum of the holdings' values adds up to the total value
  • verify that each isin in isin_to_fund_table points to a single fund, i.e. is unique
  • verify that all output RDS files are ungrouped (relates to RMI-PACTA/archive.pacta.data.preparation#218)
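To seed that effort, here is a rough sketch of what a few of these checks might look like (column names such as fund_isin, value, and fund_total_value are assumptions, not the package's actual schema):

```r
library(dplyr)

# Within each fund, the holdings' values should sum to the fund's total.
# Returns the offending funds, if any.
check_fund_totals <- function(fund_data, tolerance = 1e-8) {
  fund_data %>%
    group_by(fund_isin) %>%
    summarise(
      holdings_sum = sum(value, na.rm = TRUE),
      fund_total = first(fund_total_value),
      .groups = "drop"
    ) %>%
    filter(abs(holdings_sum - fund_total) > tolerance)
}

# Each ISIN in isin_to_fund_table should point to exactly one fund.
check_isin_uniqueness <- function(isin_to_fund_table) {
  stopifnot(anyDuplicated(isin_to_fund_table$isin) == 0)
}

# Output RDS files should not carry dplyr groups.
check_ungrouped <- function(path) {
  stopifnot(!inherits(readRDS(path), "grouped_df"))
}
```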

consider dealing with emissions factors in ABCD data more cautiously

here...
https://github.com/2DegreesInvesting/pacta.data.preparation/blob/04ed2a54fcb20589086937a2d58ed994871ec78f/run_pacta_data_preparation.R#L358-L364

and here...
https://github.com/2DegreesInvesting/pacta.data.preparation/blob/04ed2a54fcb20589086937a2d58ed994871ec78f/run_pacta_data_preparation.R#L413-L419

we set the emissions factors to 0 if the technology is in the user-specified list of zero_emission_factor_techs and the production values are greater than 0. This seems hella questionable and we should consider doing whatever this is trying to achieve in a more sensible way.
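For reference, the logic in question amounts to roughly the following dplyr sketch (column names are assumed from context, and this is a paraphrase rather than the script's actual code):

```r
library(dplyr)

# Current behavior, roughly: zero out the emission factor wherever the
# technology is in the user-specified zero_emission_factor_techs list and
# production is positive.
zero_emission_factors <- function(abcd, zero_emission_factor_techs) {
  abcd %>%
    mutate(
      emission_factor = if_else(
        technology %in% zero_emission_factor_techs & production > 0,
        0,
        emission_factor
      )
    )
}
```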

don't expand `scenario_geography` and `equity_market` until necessary

https://github.com/RMI-PACTA/pacta.data.preparation/blob/ba0f8b8518afb2d00bfe5d9bff1a935418eaa5dd/R/dataprep_abcd_scen_connection.R#L143-L151

Up until merging in the scenario data, the expansion of the data with the scenario_geography and equity_market columns drastically multiplies the number of rows, and the grouped calculations necessitated by these otherwise duplicated rows are a source of the incredibly long run times. Essentially, for every combination of id, technology, and year we multiply the rows by every combination of scenario_geography and equity_market and calculate duplicate data for all of them.

We should carefully consider whether this is actually necessary, and if not, calculate as much as we can before expanding to the scenario_geography and equity_market values. @jdhoffa @jacobvjk @AlexAxthelm
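One possible shape of the fix, as a hedged sketch (identifiers here are assumptions, not the package's code): aggregate per id/technology/year first on the small data, and only cross with the scenario_geography and equity_market combinations when the scenario join actually needs them.

```r
library(dplyr)
library(tidyr)

# Aggregate once per id/technology/year on the un-expanded data, then expand
# to every scenario_geography/equity_market combination afterwards, so the
# grouped calculations run on far fewer rows.
defer_expansion <- function(abcd, geographies, equity_markets) {
  abcd %>%
    group_by(id, technology, year) %>%
    summarise(production = sum(production, na.rm = TRUE), .groups = "drop") %>%
    crossing(scenario_geography = geographies, equity_market = equity_markets)
}
```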

related #7

investigate differences between new AR format and old AR format

Given the recent changes at AR, we will probably eventually need to switch to their new data format. We should investigate the differences between the old and new formats and assess whether we can make the switch.

library(tidyverse)
library(pacta.data.preparation)

ar_data_path <- "~/Documents/Data/Asset Resolution/2022-08-15_AR_2021Q4"
ar_advanced_company_indicators_path <- file.path(ar_data_path, "2022-08-24_AR_2021Q4_RMI-Company-Indicators.xlsx")
masterdata_debt_path <- file.path(ar_data_path, "2022-10-05_rmi_masterdata_debt_2021q4.csv")
masterdata_ownership_path <- file.path(ar_data_path, "2022-08-15_rmi_masterdata_ownership_2021q4.csv")

ar_advanced_company_indicators <- import_ar_advanced_company_indicators(ar_advanced_company_indicators_path, fix_names = TRUE)
masterdata_debt <- readr::read_csv(masterdata_debt_path, na = "", show_col_types = FALSE)
masterdata_ownership <- readr::read_csv(masterdata_ownership_path, na = "", show_col_types = FALSE)


# -------------------------------------------------------------------------

ownership_data <- 
  ar_advanced_company_indicators %>% 
  filter(consolidation_method == "Equity Ownership") %>% 
  filter(value_type == "production") %>%
  filter(
    asset_sector == "Aviation" & activity_unit %in% c("pkm", "tkm") |
      asset_sector == "Cement" & activity_unit == "t cement" |
      asset_sector == "Coal" & activity_unit == "t coal" |
      asset_sector == "HDV" & activity_unit == "# vehicles" |
      asset_sector == "LDV" & activity_unit == "# vehicles" |
      asset_sector == "Oil&Gas" & activity_unit == "GJ" |
      asset_sector == "Power" & activity_unit == "MW" |
      asset_sector == "Shipping" & activity_unit == "dwt km" |
      asset_sector == "Steel" & activity_unit == "t steel"
  ) %>% 
  pivot_wider(names_from = "year", values_fill = 0) %>% 
  group_by(asset_sector) %>% 
  summarise(new_2016 = sum(`2016`, na.rm = TRUE), new_n = length(unique(company_id)))

masterdata_ownership %>% 
  filter(company_id %in% ar_advanced_company_indicators$company_id) %>% 
  group_by(sector) %>% 
  summarise(old_2016 = sum(`_2016`, na.rm = TRUE), old_n = length(unique(company_id))) %>% 
  full_join(ownership_data, by = c(sector = "asset_sector")) %>% 
  mutate(
    new_2016 = if_else(is.na(new_2016), as.numeric(0), as.numeric(new_2016)),
    new_n = if_else(is.na(new_n), as.numeric(0), as.numeric(new_n))
  ) %>% 
  mutate(diff = old_2016 - new_2016) %>% 
  mutate(percent_diff = round(diff / old_2016 * 100, digits = 2))

#> # A tibble: 9 × 7
#>   sector   old_2016 old_n new_2016 new_n     diff percent_diff
#>   <chr>       <dbl> <int>    <dbl> <dbl>    <dbl>        <dbl>
#> 1 Aviation  6.01e12  1820  6.01e12  1820 -3.12e-2         0   
#> 2 Cement    5.01e 9  2070  4.67e 9  2035  3.46e+8         6.91
#> 3 Coal      1.09e10  2051  1.09e10  2061 -1.47e-2         0   
#> 4 HDV       0         575  0         575  0             NaN   
#> 5 LDV       1.39e 8   394  1.39e 8   435 -3.73e-1         0   
#> 6 Oil&Gas   7.11e11  4563  7.11e11  4563 -5.32e-2         0   
#> 7 Power     1.30e 7 36846  1.30e 7 36875 -3.26e-1         0   
#> 8 Shipping  2.56e14 11167  2.56e14 11169  2.95e+9         0   
#> 9 Steel     2.73e 9  1105  2.73e 9  1105 -5.76e-3         0


# -------------------------------------------------------------------------

fin_control_data <- 
  ar_advanced_company_indicators %>% 
  filter(consolidation_method == "Financial Control") %>% 
  filter(value_type == "production") %>%
  filter(
    asset_sector == "Aviation" & activity_unit %in% c("pkm", "tkm") |
      asset_sector == "Cement" & activity_unit == "t cement" |
      asset_sector == "Coal" & activity_unit == "t coal" |
      asset_sector == "HDV" & activity_unit == "# vehicles" |
      asset_sector == "LDV" & activity_unit == "# vehicles" |
      asset_sector == "Oil&Gas" & activity_unit == "GJ" |
      asset_sector == "Power" & activity_unit == "MW" |
      asset_sector == "Shipping" & activity_unit == "dwt km" |
      asset_sector == "Steel" & activity_unit == "t steel"
  ) %>% 
  pivot_wider(names_from = "year", values_fill = 0) %>% 
  group_by(asset_sector) %>% 
  summarise(new_2016 = sum(`2016`, na.rm = TRUE), new_n = length(unique(company_id)))

masterdata_debt %>% 
  filter(company_id %in% ar_advanced_company_indicators$company_id) %>% 
  group_by(sector) %>% 
  summarise(old_2016 = sum(`_2016`, na.rm = TRUE), old_n = length(unique(company_id))) %>% 
  full_join(fin_control_data, by = c(sector = "asset_sector")) %>% 
  mutate(
    new_2016 = if_else(is.na(new_2016), as.numeric(0), as.numeric(new_2016)),
    new_n = if_else(is.na(new_n), as.numeric(0), as.numeric(new_n))
  ) %>% 
  mutate(diff = old_2016 - new_2016) %>% 
  mutate(percent_diff = round(diff / old_2016 * 100, digits = 2))

#> # A tibble: 9 × 7
#>   sector   old_2016 old_n new_2016 new_n     diff percent_diff
#>   <chr>       <dbl> <int>    <dbl> <dbl>    <dbl>        <dbl>
#> 1 Aviation  3.24e12  1302  5.94e12  1609 -2.70e12        -83.3
#> 2 Cement    2.89e 9  1459  4.27e 9  1685 -1.39e 9        -48.0
#> 3 Coal      6.31e 9  1349  1.12e10  1629 -4.87e 9        -77.2
#> 4 HDV       0         283  0         426  0              NaN  
#> 5 LDV       9.36e 7   194  1.16e 8   300 -2.24e 7        -23.9
#> 6 Oil&Gas   3.06e11  2864  6.85e11  3864 -3.79e11       -124. 
#> 7 Power     6.12e 6 29582  1.28e 7 34012 -6.68e 6       -109. 
#> 8 Shipping  1.70e14  9844  2.53e14 10518 -8.30e13        -48.9
#> 9 Steel     1.66e 9   757  2.70e 9   924 -1.03e 9        -62.1

Explore multi-threading `{asset}_abcd_scenario.rds` generating function

With RMI-PACTA/archive.pacta.data.preparation#81 closed, there is an opportunity to use multi-threading to speed up these time- and memory-intensive processes.

In particular, we might be able to spread each process to calculate {asset}_abcd_scenario_{scenario_name}.rds across multiple CPUs.
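As a sketch of one approach using the base {parallel} package; prepare_abcd_scenario(), scenario_names, and asset_type below are placeholders standing in for the real data-prep pieces, not the package's API:

```r
library(parallel)

# Placeholders for illustration only.
scenario_names <- c("WEO2021", "GECO2021")
asset_type <- "equity"
prepare_abcd_scenario <- function(scenario_name) {
  data.frame(scenario = scenario_name)  # stand-in for the real computation
}

# Spread each scenario's computation across cores (forked processes; on
# Windows, a parLapply() cluster would be needed instead of mclapply()).
results <- mclapply(
  scenario_names,
  function(scenario_name) {
    out <- prepare_abcd_scenario(scenario_name)
    saveRDS(
      out,
      file.path(tempdir(), paste0(asset_type, "_abcd_scenario_", scenario_name, ".rds"))
    )
    out
  },
  mc.cores = max(1L, detectCores() - 1L)
)
```

Whether this actually helps will depend on memory pressure: each forked process needs its own working set, so the memory-intensive steps may cap how many scenarios can run concurrently.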

Programmatically determine which economic variable to connect between ABCD and Scenario

Yes, exactly.... the primary point is that we can see in the current ABCD data that a given sector may have metric == "company level capacity" AND metric == "company level production", and we should be cautious about that and make sure we choose the appropriate one. In the current data, Shipping technologies are the only ones in that situation, so we can get away with ignoring it for now since we aren't using Shipping data to prepare 2021Q4, but eventually we should deal with this possibility (for all sectors) in a programmatic way.

Originally posted by @cjyetman in https://github.com/2DegreesInvesting/pacta.data.preparation/issues/94#issuecomment-1239104486
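One possible programmatic rule, sketched under the assumption that "company level production" is preferred whenever both metrics are present for a sector (column names follow the issue text; which metric is actually appropriate may well differ by sector, e.g. Power uses capacity in MW, so this rule is illustrative only):

```r
library(dplyr)

# For each sector, keep "company level production" rows when that metric
# exists; otherwise fall back to "company level capacity".
pick_metric <- function(abcd) {
  abcd %>%
    group_by(ald_sector) %>%
    filter(
      if ("company level production" %in% metric) {
        metric == "company level production"
      } else {
        metric == "company level capacity"
      }
    ) %>%
    ungroup()
}
```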
