ropensci / allodb

An R package for biomass estimation at extratropical forest plots.

Home Page: https://docs.ropensci.org/allodb/

License: GNU General Public License v3.0

R 100.00%

allodb's Issues

Go public

@teixeirak and @gonzalezeb,

allodb is now public (following email thread). Here are the changes I made to adapt to this openness.

Added a few documents into .github/. These are fairly standard documents adapted from the tidyverse. For example, the document ISSUE_TEMPLATE is automatically used to guide users every time they open a new issue on GitHub.
Added some badges to reflect development status. The build status updates automatically every time we push a new commit to the master branch.
Tweaked the website a little bit -- more changes will come as I improve the template for all fgeo websites.

TODO for you

In .github/ please review and feel free to edit and remove documents as needed.
In DESCRIPTION please review authors and edit as needed.
In _pkgdown.yml please check if the link to "Learn more" is the one you want (now http://www.forestgeo.si.edu/).
Please review the labels in https://github.com/forestgeo/allodb/labels. Can I change these labels for more standard ones? They would look closer to these ones (both in name and colour):
Close this issue when you are done.

Confirm type of new columns

@gonzalezeb, please confirm the type of new columns:

        New column    should be
-------------------------------
sample_size           integer
site_dbh_unit         character  
equation_form         character
equation_allometry    character

Determine explicitly the type of each column of the master data

To ensure column types are interpreted as expected, write a function to output column types to be passed to col_types (see ?readr::read_csv). For example, I used this approach in fgeo.tool::type_vft (https://forestgeo.github.io/fgeo.tool/reference/type_fgeo.html). Erika documented the type of each column in the metadata tables.

The code below outputs a list that can be passed to col_types of readr::read_csv(). Before I can do this I need help from @gonzalezeb. I asked her to clean the column names of the master data (#31).

# Import and clean --------------------------------------------------------

path_to_data <- here::here("data-raw/allotemp_main.csv")
master <- readr::read_csv(path_to_data, col_types = types_allodb_master)



# This prints to screen the contents of types_allodb_master (see below)
library(tidyverse)

types <- map(master, class) %>%
  enframe() %>%
  unnest() %>%
  mutate(
    type = case_when(
      value == "character" ~ "c",
      value == "integer" ~ "i",
      value == "numeric" ~ "d"
    ),
    type = paste0(name, " = '", type, "',")
  ) %>%
  pull(type)

cat(types)


# Determine column type explicitly to avoid surprises (see readr::read_csv)
# c = character,
# i = integer,
# n = number,
# d = double,
# l = logical,
# D = date,
# T = date time,
# t = time,
# ? = guess,
# _/- to skip the column.
types_allodb_master <- list(
  # xxx
)

Once this is done I need to ask @gonzalezeb to confirm the type of each variable.

Fix encoding

Some data has the wrong encoding. devtools::check() throws these warnings:

Following this post, below is my best attempt to fix the problem. But the solution is not good enough: At best, the non-ASCII characters are removed. What I want is to replace them with the correct character.

Suzanne said the encoding is "latin1" (https://goo.gl/KZiVbQ). But the conversion from latin-ascii doesn't work well enough (see below).

Maybe if I receive the data in .csv format? And I can read it with the right encoding? Something like this: read.csv(data, encoding = "latin1").

My suboptimal solution so far

library(tidyverse)
library(stringi)
library(allodb)

WSG %>% 
  mutate(encode = stri_enc_mark(species)) %>% 
  filter(encode != "ASCII") %>% 
  transmute(
    original = species,
    with_stri = stri_trans_general(species, "latin-ascii"),
    with_iconv = iconv(species, "latin1", "ASCII", sub = "")
  )

# # A tibble: 7 x 3
#                         original                        with_stri              with_iconv
#                            <chr>                            <chr>                   <chr>
# 1                     bigll3Ã‚Â¡                       bigll3A,A¡                  bigll3
# 2                     pequeÃƒÂ±a                       pequeAfA±a                  pequea
# 3       sp. Ã¢â‚¬ËœhairyÃ¢â‚¬â„¢       sp. A¢a,¬EoehairyA¢a,¬a,,¢               sp. hairy
# 4                   Ã¢â‚¬Ëœgiant                    A¢a,¬Eoegiant                   giant
# 5   dewevrei (De Wild.) J.LÃâ‚¬     dewevrei (De Wild.) J.LA-a,¬ dewevrei (De Wild.) J.L
# 6 normandii AubrÃâ‚¬Å’Â©v. & Pe normandii AubrA-a,¬A'A(C)v. & Pe   normandii Aubrv. & Pe
# 7 pellegrinianum (J.LÃâ‚¬Å’Â©on pellegrinianum (J.LA-a,¬A'A(C)on   pellegrinianum (J.Lon
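
If the strings were UTF-8 text that got decoded as latin1 at some point, the damage can sometimes be reversed by round-tripping the bytes instead of transliterating them. The sketch below illustrates that idea under that assumption only, with a made-up example string rather than the WSG data; `fix_once()` is a hypothetical helper. The rows above look as if they were mis-decoded more than once (and possibly via Windows-1252), so a single pass may not be enough for them.

```r
# Assumption: `garbled` was UTF-8 text mis-read as latin1 exactly once.
# "\u00c3\u00b1" is what the letter "\u00f1" (n with tilde) looks like
# after that mistake.
garbled <- "peque\u00c3\u00b1a"

fix_once <- function(x) {
  # Write the characters back out as latin1 bytes ...
  bytes <- iconv(x, from = "UTF-8", to = "latin1")
  # ... and declare that those bytes are really UTF-8.
  Encoding(bytes) <- "UTF-8"
  bytes
}

fixed <- fix_once(garbled)
```

If the data can be delivered again as .csv, declaring the encoding when reading, for example readr::read_csv(path, locale = readr::locale(encoding = "latin1")), may avoid the repair step altogether.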

Include warning/ error messages when unreliable allometries are applied

Jim Lutz recommends that the R code should generate warning messages when unreliable allometries are applied.

There are several issues that need to be decided:

What are the criteria that should generate a warning / error message? Here's the start of a list:

attempt to apply allometric equation outside of the DBH range for which it was developed
attempt to apply a generic equation to a species for which it's a terrible fit (see item 2)
application of an allometry that is of poor quality (e.g., low number of individuals sampled) but still the best available. For this, we need to decide criteria.

Should the code allow bad allometries to be applied? A couple examples:

if the user selects the generic equation option, which is very bad for bristlecone pine, should there be a warning message: “you can use this general equation, but it's going to be really bad for bristlecone pine”, or should the expert-selected allometry automatically be applied?
if an allometry is assigned outside the DBH range for which it was developed, should this generate a warning (i.e., message, code still runs) or an error (i.e., code won't run unless fixed)?

How is this implemented in the database and code?

I think that one way to implement this in the database is to add a warning_notes field to both allodb_equations.csv (warnings about the allometric equation itself) and allodb_site_species.csv (warnings about application of an allometric equation to a certain tree). These can be used to manually add warnings that should be displayed when the allometry is applied.
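
As a starting point for the first criterion (DBH outside the range used to develop the equation), here is a minimal sketch. The function name `check_dbh_range()` and the `strict` switch are hypothetical; the sketch only shows how the same check could produce either a warning or an error, which is the decision still open above.

```r
# Flag stems whose dbh falls outside the range an equation was built for.
# `strict = FALSE` warns and lets the code run; `strict = TRUE` stops.
check_dbh_range <- function(dbh, dbh_min_cm, dbh_max_cm, strict = FALSE) {
  outside <- dbh < dbh_min_cm | dbh > dbh_max_cm
  # Unknown dbh or unknown limits cannot be judged, so they are not flagged.
  outside[is.na(outside)] <- FALSE

  if (any(outside)) {
    msg <- sprintf(
      "%d stem(s) have dbh outside the range [%s, %s] cm of this allometry.",
      sum(outside), dbh_min_cm, dbh_max_cm
    )
    if (strict) stop(msg, call. = FALSE) else warning(msg, call. = FALSE)
  }
  invisible(outside)
}

# Made-up stems checked against a made-up range of 10 to 100 cm.
flagged <- suppressWarnings(check_dbh_range(c(5, 50, 120, NA), 10, 100))
```

Text from a warning_notes field could be appended to `msg` in the same place.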

Consider how to estimate error

We don't currently have any way to estimate the error on biomass estimates for individual trees, and I'm not sure it's possible in the context of what we're doing.

@ervanSTRI, do you have any advice on this?

Invitation to compile a data base of allometric equations (allodb)

Hi Ervan, Erika and Krista,

(@ervanSTRI, @gonzalezeb, @teixeirak)

I'm following up the idea of building a function to calculate biomass for sites in the CTFS-ForestGEO network. I now need a table with one allometric equation per site and I would like to start with tropical forests.
-- Ervan, I remember you offered such a table; can you share it with me now?
-- Erika and Krista, I hope to meet you in the upcoming weeks.

BACKGROUND

I wrote a prototype function that calculates biomass using either default equations or user's equations -- whichever is most specific. The prototype is explained here: https://forestgeo.github.io/bmss/articles/biomass.html. Now the function works with a dummy data set.

These are my next steps:
(1) For site-level equations, replace the dummy data set by the real data set for tropical forests (Ervan);
(2) Same for temperate forests (Erika and Krista);
(3) For species-level equations, replace the dummy data set by the real data set for tropical forests (Ervan);
(4) Same for temperate forests (Erika and Krista).

Cheers,
Mauro

Allometry fitting methods

From: Helene Muller-Landau
Sent: Tuesday, February 13, 2018 4:14 PM

Fitted equation.
All other equations that were tested, including ones not chosen.
Whether the fits are done with biomass as the response variable or log(biomass) as the response variable. (For standard regression, this is the difference between minimizing sums of squares of deviations in biomass vs. minimizing sums of squares of deviations in log biomass.)
Whether the fitting was done as a traditional regression (minimizing sums of squares in y) or as a model 2 regression or PCA or something else.
If the response variable is log(biomass), then whether the correction factor is already included in the reported fitted equation or not, and either way, whether the RSE needed to calculate it is reported.
Any other details of the statistical methods - e.g., the criteria for choosing the preferred model.
The dbh range of the data.
The number of data points (trees).
Measurement methods for biomass.

best,
Helene

On Tue, Feb 13, 2018 at 3:07 PM, Teixeira, Kristina A. wrote:
Hi Helene,
Erika’s digging into adding allometry fitting methods to her database. What are the key features of fitting methods that you’d recommend she be sure to capture? Thanks, K.
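
To make the point about the correction factor concrete: when log(biomass) is the response, back-transformed predictions are biased low, and a common correction multiplies them by exp(RSE^2 / 2), where RSE is the residual standard error of the log-scale fit. The sketch below uses simulated data (not data from any real allometry) to show where each quantity comes from.

```r
# Simulated trees, for illustration only.
set.seed(1)
trees <- data.frame(dbh = runif(50, min = 5, max = 80))
trees$biomass <- exp(-2 + 2.5 * log(trees$dbh) + rnorm(50, sd = 0.3))

# Traditional regression with log(biomass) as the response.
fit <- lm(log(biomass) ~ log(dbh), data = trees)

# Residual standard error of the log-scale fit and the correction factor.
rse <- summary(fit)$sigma
correction_factor <- exp(rse^2 / 2)

# Prediction for a 30 cm tree, without and with the correction.
naive <- exp(predict(fit, newdata = data.frame(dbh = 30)))
corrected <- naive * correction_factor
```

This is why the database needs to record both whether the reported equation already includes the factor and the RSE: without either one, the correction cannot be applied or checked.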

Document source and purpose of files in data-raw

The following files live in data-raw. Could you please document their source and purpose?

See the following files in the directory data-raw/is_this_to_document_or_relocate/:

"Densidad_Promedio_por_Especie_Yasuni.xls"
"Pasoh_Wood_Density.xlsx"
"WoodDens_20130917_Panama.xls"

If you are unsure whether these files should live in data-raw or elsewhere, talk to me or see http://r-pkgs.had.co.nz/data.html.

Units problem to tackle

An issue (more of a reminder to myself) to tackle is to convert biomass units from the original equations to the final output unit we want allodb to give (kg, Mg). That conversion factor (convert from inches, mm to cm, etc.) should be incorporated in the equation. Of special attention is the DBH used to build the original allometry. We have two options:

leave it as is and the site will have to convert
or better, alert the site that their input dbh should be in cm and we "rewrite" all equations to reflect that
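
The second option could be sketched as a wrapper that takes dbh in cm, converts it to the unit the original equation expects, and converts the result to kg. The names `standardize_equation()`, `cm_per_unit` and `kg_per_unit`, and the example equation, are hypothetical.

```r
# Conversion factors to the target units (cm for dbh, kg for biomass).
cm_per_unit <- c(mm = 0.1, cm = 1, inch = 2.54)
kg_per_unit <- c(g = 0.001, kg = 1, Mg = 1000, lb = 0.45359237)

# Wrap an original equation so that it takes dbh in cm and returns kg.
standardize_equation <- function(eqn, dbh_unit, biomass_unit) {
  function(dbh_cm) {
    dbh_original <- dbh_cm / cm_per_unit[[dbh_unit]]
    eqn(dbh_original) * kg_per_unit[[biomass_unit]]
  }
}

# Made-up equation developed with dbh in inches and biomass in pounds.
eqn_in_lb <- function(dbh) 2 * dbh^2
eqn_cm_kg <- standardize_equation(eqn_in_lb, "inch", "lb")

biomass_kg <- eqn_cm_kg(25.4)  # 25.4 cm is 10 inches
```

Keeping the original equation untouched and converting around it makes it easier to check each entry against its source publication.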

Invite collaborators: A record.

(Just to keep track of when and who has been invited.)

From: Teixeira, Kristina A.
Sent: Wednesday, January 24, 2018 1:40:02 PM (UTC-05:00) Eastern Time (US & Canada)
To: Lepore, Mauro; Rutishauser, Ervan; Muller-Landau, Helene; Davies, Stuart J.; McMahon, Sean; Arellano, Gabriel; Nathan G. Swenson; Wright, Joe; Jim Lutz
Cc: Gonzalez, Erika B.
Subject: ForestGEO biomass allometry data pub

Hi all,

As most/all of you already know, Mauro is preparing a new function to calculate biomass for the CTFS R package, and Erika is working to compile the best available biomass allometries and wood density data for all ForestGEO sites. Specifically, she’s consulting site PIs (particularly for temperate sites) as to the best allometries for their site, compiling species-specific allometries for temperate sites, compiling wood density data for tropical species, etc. We are aiming to formally document this and publish it as a data paper in Ecology, which will provide citable documentation of our methods and present a platform for future updates. We plan to invite all site PIs who contribute allometries or wood density data as coauthors.

At this early stage, we’d like to set up a videoconference for those of you who have already been contributing to this discussion and/or have significant interest and expertise in the subject in order to give an overview of our plans and get your comments. Those of you on this email are people who I think will be most interested in this effort, but please don’t feel obligated to engage, and feel free to add in anyone I’ve missed that you think would like to engage on this level.

We’re hoping to set up a meeting sometime within the next couple weeks. Erika will set up a poll.

We are managing this project in Github, and anyone who would like to have access to the repository should let Mauro know.

all my best,

Krista

Shrubs biomass calculation in temperate forests

This issue is in relation to issue #38, about the use of dba (diameter at stem base) in shrub allometries.
I have summarized what I think is the correct way to present these calculations. I will need Mauro’s help to convert this into a function that depends only on DBH.

Problem (this paragraph can be included in the description of the function):

Most available equations to calculate shrub biomass in temperate regions (Smith et al. 1983; Lutz et al. 2014; Halpern and Millet 1996) use the diameter at the base of the stem (15 cm above ground or close to the ground) as the independent variable. Given that at CTFS-ForestGEO plots the standardized diameter measurement for woody stems is at breast height (DBH, 1.37 m), we wrote a function that uses stem DBH as input to calculate biomass for some shrub species (see site.species table). We suggest the following:

Potential solution (where I need help):

Step 1= Calculate the basal area of each stem within a tree
BA <- (pi / 4) * dataset$dbh^2
Step 2= Sum the basal area of each stem to get the basal area per tree
tree.sum.BA <- tapply(BA, dataset$tag, sum, na.rm = TRUE)
Step 3= Calculate the contribution of each stem to the summed basal area of its tree
BA.contribution <- BA / tree.sum.BA[as.character(dataset$tag)]

a) If an allometric equation uses DBA (diameter at base) as the independent variable then:
Step 4= calculate the diameter at the base of the shrub, assuming area is preserved
Step 5= calculate AGB using the basal diameter equation (a * DBA^b or exp(a + b * ln(DBA)))
Step 6= Redistribute the biomass of the main stem to other stems, using the basal area contribution

b) If DBH is the independent variable in the equation (assuming that only the diameter of the main stem was measured in a shrub), then:
Step 4= identify the main stem (stem with the largest DBH).
Step 5= calculate AGB using DBH of main stem
Step 6= redistribute the biomass to other stems within the tree, using the basal contribution of each stem.
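
Case (a) could be sketched as follows. The stems, the coefficients `a` and `b`, and the reading of "area preserving" as "the basal area at the base equals the summed basal area of the stems at breast height" are all assumptions made for illustration.

```r
# Made-up stems: shrub A has two stems, shrub B has one. dbh in cm.
dataset <- data.frame(
  tag = c("A", "A", "B"),
  dbh = c(3, 4, 5),
  stringsAsFactors = FALSE
)
tags <- as.character(dataset$tag)

# Steps 1-3: basal area per stem, per shrub, and each stem's share.
BA <- (pi / 4) * dataset$dbh^2
tree_sum_BA <- tapply(BA, dataset$tag, sum, na.rm = TRUE)
BA_contribution <- as.numeric(BA / tree_sum_BA[tags])

# Step 4 (assumption): diameter at base with the same total basal area.
DBA <- sqrt(4 * tree_sum_BA / pi)

# Step 5: biomass per shrub, with hypothetical coefficients a and b.
a <- 0.1
b <- 2.5
agb_shrub <- a * DBA^b

# Step 6: redistribute each shrub's biomass to its stems.
agb_stem <- as.numeric(agb_shrub[tags]) * BA_contribution
```

With these numbers both shrubs get a basal diameter of 5 cm, because 3^2 + 4^2 = 5^2, which is a quick way to check the area-preserving step by hand.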

Build a prototype shiny app to query the database

Erika pushed her dataset Allometries_Temperate sites.xlsx.

Consider building a template for other databases that researchers may want to offer for web browsing.

Store different representations of missing values in a new column?

The master data contains different representations of missing values, which are described here:

Now there is a problem. If we specify all possible representations of missing values (e.g. via the argument na to read_csv()), then we lose information about what kind of missing value each one is.

A simple solution, I think, is to represent the kind of missing value as a new column. The original representations will all be coerced to NA but we could identify what kind of NA it is using the new column.

Here are the columns that have some representation of missing values, and the corresponding kind:

$`wsg`
[1] "NRA"

$wsg_id
[1] NA

$wsg_specificity
[1] NA

$c
[1] NA

$d
[1] NA

$dbh_min_cm
[1] "NI"

$dbh_max_cm
[1] NA   "NI"

$sample_size
[1] NA    "NRA"

$equation_id
[1] NA

$regression_model
[1] NA

$other_equations_tested
[1] NA    "NRA"

$log_biomass
[1] NA

$bias_corrected
[1] NA

$bias_correction_factor
[1] NA    "NRA"

$notes_fitting_model
[1] NA

$development_species
[1] NA

$ref_id
[1] NA

$wsg_source
[1] NA

$ref_wsg_id
[1] NA

$original_data_availability
[1] NA

$notes_to_consider
[1] NA

$warning
[1] NA
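
The new-column idea could be sketched like this for a single column that was read as character. `split_missing()` and the column name `na_kind` are hypothetical; "NI" and "NRA" are the codes that appear in the listing above.

```r
# Split one character column into a cleaned value and the kind of
# missing value it originally held.
split_missing <- function(x, na_codes = c("NI", "NRA")) {
  kind <- rep(NA_character_, length(x))
  kind[is.na(x)] <- "NA"          # already a plain NA in the raw data
  coded <- x %in% na_codes        # NA %in% ... is FALSE, so no overlap
  kind[coded] <- x[coded]

  value <- x
  value[coded] <- NA              # all representations become NA

  data.frame(value = value, na_kind = kind, stringsAsFactors = FALSE)
}

# Made-up values in the style of dbh_max_cm.
result <- split_missing(c("12", "NI", NA, "NRA"))
```

After this step `value` can be converted to its proper type (e.g. with as.numeric()) without losing the distinction between the codes.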

how to generate new equation_id's

How do I generate a new equation_id at the same time that I am:

adding a new site, therefore likely new equations
adding a new equation for an existing site

This is a direct question for @maurolepore

Handle height allometries

Some sites have local height allometries, which can be used to improve estimates.

We need to plan the structure for incorporating these in the database.

This is something that @gonzalezeb, @maurolepore, and @teixeirak should discuss in person.

Tropical: Table of site and equation – based on E and wood-density

Making a table of site and equation – based on E and wood-density – is possible. Currently, such an equation is in Ervan’s code. Example from comp.AGB():

+info

2018-02-27 meeting at SCBI: A report

On 2018-02-27 Erika and Krista and I met at SCBI. These are the topics we covered:

Workflow (Erika and Mauro)
- We reviewed Erika's workflow
- We ensured that RStudio and GitHub are connected and that Erika's login-credentials don't need to be entered every time she pulls or pushes.
Structure of the allodb project (Erika and Mauro)
- We walked over the details of the structure of the project that should be changed to restore the structure of an R package, so we enjoy the benefits that R packages provide -- such as accessing metadata via help files.
- We agreed that I would make some changes over the following 2 days. Erika will be able to review those changes before accepting them.
Pending issues (Erika, Krista and Mauro)
We discussed and updated open issues.
FastField (Erika and Mauro)
We discussed work I've done with Jess Shue in preparation for using the FastField software to collect census data.
Overview of fgeo (All of Krista's lab)
I introduced Krista's lab to the fgeo package -- the single place to look for all things related to ForestGEO's software.

Try Allometries_Temperate sites.xlsx with the code I have

I’ll study the table and try it with the code I have.

Write function to output column types to be passed to `col_types`

Assess wood density data situation, make plan

touch base with Nate S about the origins of CTFS wood density data file
@ervanSTRI, could you please tell us about the wood density values you're using?
touch base with Amy Zanne/ Jerome Chave to learn status of wood density database (we hear that Jerome has a postdoc working on this and that Amy is also involved)
write code to extract relevant data from wood density database
(consider this vignette of the BIOMASS package and code by Ervan).
try to track down other sources that contributed to database (otherwise they probably have to be dropped). A comparison of what's in the Zanne/ Chave global database versus the CTFS wood density file may help to identify site-specific data that have made it into our database.
add any available CTFS - specific data (with documentation)

Chat with Daniel Falster -- author of The Biomass And Allometry Database (BAAD)

https://twitter.com/mauro_lepore/status/968817868262526978

I'll be chatting with Daniel Falster later in March (2018). Daniel is the author of the database baad, published in Ecology.

I plan to ask Daniel general questions about his experience building baad -- mostly from the software side of things. @teixeirak and @gonzalezeb, if you want me to ask him some questions from the scientific side of things, please list them in this issue.

task for Erika

Include this in references:

Miles, P.D., Smith, W.B. (2009) Specific Gravity and Other Properties of Wood and Bark for 156 Tree Species Found in North America. Research Note NRS-38. Newtown Square, PA: U.S. Dept. of Agriculture, Forest Service, Northern Research Station.

Install packages for development

@gonzalezeb

I think you should install tools for package development. Even if you don't use them directly, you may want to run code that I wrote -- which often uses such tools.

Update R and all of your packages. And expect to keep doing so frequently

To do this easily see this article.

Install Rtools

Windows: Install Rtools. This is not an R package! It is “a collection of resources for building packages for R under Microsoft Windows, or for building R itself”. Go to https://cran.r-project.org/bin/windows/Rtools/ and install as instructed.

-- From http://usethis.r-lib.org/articles/articles/usethis-setup.html

Install packages you will likely come across in code that I wrote

devtools
remotes
roxygen2
here
usethis
testthat
tidyverse
pkgdown

I use these packages a lot and you may come across scripts that you need to run that require them. If you install them now, your workflow won't be interrupted later.

Change type of variables that are likely not characters but double

Now the type of the column wsg is defined as character. This is likely a bug and it should be of type double. @gonzalezeb can you confirm?

2017-11-07-meeting: Notes before and after the meeting

https://bookdown.org/forestgeoguest/2017-11-07-meeting/

Table author's affiliations.

(Pulled from #12 (comment))

As we'll have a ton of coauthors on this paper, it will help to create a spreadsheet with name, email, affiliation. I have one from our GCB review that can be modified.

Update reference datasets.

A reference for each dataset is stored and automatically tested for changes whenever we run all tests. I've just updated the reference datasets. A quick look at the differences between the old and new references suggests the changes are intentional.

This feature can be very useful in helping detect unintentional changes. @gonzalezeb let me know if at some point you want to learn how to use it yourself. You'll need to install devtools and testthat.

Test match between names of allodb_master and type_allodb_master()

This was a silent issue. I should write a test to expose it should it happen again (which is most likely).

Deal with disconnects between equations that switch when trees cross a size threshold

The database will allow application of different equations to trees of different size (e.g., switch from a highly specific allometry for small sizes to a more generic one at sizes above that sampled for the specific one). This is desirable. However, when a tree crosses a size threshold, estimates of woody growth and ANPP will be seriously flawed. To address this, we need to incorporate code that forces the application of only one allometric equation -- even outside its range -- to trees that grow over such a threshold.
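
One possible way to force a single equation per tree is to fix the choice using one reference measurement and reuse it for every census. The sketch below uses each tree's first recorded dbh as that reference, which is an assumption and not something decided in this issue; the threshold, the equation labels and the census data are made up.

```r
# Hypothetical rule: below `threshold` cm use the specific equation,
# otherwise the generic one.
choose_equation <- function(dbh, threshold = 10) {
  ifelse(dbh < threshold, "specific", "generic")
}

# Made-up censuses, ordered by census within each tag.
census <- data.frame(
  tag = c("t1", "t1", "t2", "t2"),
  census = c(1, 2, 1, 2),
  dbh = c(8, 12, 20, 23),
  stringsAsFactors = FALSE
)

# Choosing per census lets tree t1 switch equations between censuses.
census$per_census <- choose_equation(census$dbh)

# Assumption: fix the choice using each tree's first recorded dbh.
# match() returns the position of the first row of each tag.
first_dbh <- census$dbh[match(census$tag, census$tag)]
census$fixed <- choose_equation(first_dbh)
```

Tree t1 crosses the 10 cm threshold between censuses, so the per-census choice changes equation while the fixed choice does not.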

Document all data in data/

Create R/data.R and create the roxygen skeleton to document all datasets in data/. Then ask Erika to fill the gaps in the documentation, reveal the missing datasets, and the mismatches between the data and the data_metadata -- which BTW may need to be specified programmatically so it doesn't go out of sync (via some function that would compare the names of the columns in each data set with the values of "Field" in its corresponding metadata-dataset).
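
The comparison function mentioned above could look something like this sketch. `compare_with_metadata()` is a hypothetical name and the two small tables are made up; only the "Field" column comes from the description above.

```r
# Report columns that lack a metadata entry and metadata entries that
# have no matching column.
compare_with_metadata <- function(data, metadata) {
  list(
    undocumented = setdiff(names(data), metadata$Field),
    not_in_data = setdiff(metadata$Field, names(data))
  )
}

# Made-up dataset and metadata that are out of sync.
equations <- data.frame(
  equation_id = "a1", dbh_min_cm = 1, region = "x",
  stringsAsFactors = FALSE
)
equations_metadata <- data.frame(
  Field = c("equation_id", "dbh_min_cm", "notes_to_consider"),
  stringsAsFactors = FALSE
)

gaps <- compare_with_metadata(equations, equations_metadata)
```

Run inside a test, two empty vectors would mean the data and its metadata agree.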

Populate equation_id

Follows https://github.com/forestgeo/allodb/issues/36#issuecomment-423217920.

Here is a good way to generate random ids in R: ids::random_id().
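
Whichever generator is used, new ids must not collide with ids already in the table. The base-R sketch below is a stand-in that shows only that collision check; `new_equation_ids()` and the 6-character hexadecimal format are assumptions, not the format allodb uses.

```r
# Generate `n` random ids that differ from each other and from the
# ids already in use.
new_equation_ids <- function(n, existing = character(0), n_chars = 6) {
  hex <- c(0:9, letters[1:6])
  out <- character(0)
  while (length(out) < n) {
    candidate <- paste(sample(hex, n_chars, replace = TRUE), collapse = "")
    # Keep the candidate only if it is not taken yet.
    if (!candidate %in% c(existing, out)) {
      out <- c(out, candidate)
    }
  }
  out
}

# Made-up existing id.
ids <- new_equation_ids(5, existing = "abc123")
```

The same function covers both situations asked about earlier: a new site with several new equations (n > 1) and one new equation for an existing site (n = 1).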

[Disregard]

Disregard this issue. Put here in error.

Deal with missing species-codes

Relates to https://github.com/forestgeo/allodb/issues/36.

For a few sites whose species lists I got from the ForestGEO website, I don't have species codes.

Here are some ways I think we may deal with this issue:

Throw a warning indicating the species that match the user's data for which allodb has no species-code. The match may be by species name -- if the user provides species data -- or by site or region, in which case we can list all species in that site or region with unknown species-code.
Provide some way of filling the gap, e.g. ask the user to provide a table with species-names and species-codes.

Could the opposite problem happen? That is, can the user provide codes that don't match any code in allodb? We could deal with this in a way similar to that described above.
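
The first option (warn and name the species) could be sketched as below, matching by species name. `species_code()`, the lookup table and the code "acru" are made up for illustration.

```r
# Look up species codes and warn about species without one.
species_code <- function(species, lookup) {
  code <- lookup$code[match(species, lookup$species)]
  unknown <- unique(species[is.na(code)])
  if (length(unknown) > 0) {
    warning(
      "No species-code in allodb for: ", paste(unknown, collapse = ", "),
      call. = FALSE
    )
  }
  code
}

# Made-up lookup: one species has a code, one is listed without a code.
lookup <- data.frame(
  species = c("Acer rubrum", "Quercus alba"),
  code = c("acru", NA),
  stringsAsFactors = FALSE
)

user_species <- c("Acer rubrum", "Quercus alba", "Fagus grandifolia")
codes <- suppressWarnings(species_code(user_species, lookup))
```

For the second option, a user-supplied table with the same two columns could be row-bound to `lookup` before the match.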

In `equations`, fix row with missing values

Ensure all equations contain valid R code

I plan to write a test to automatically check that each equation can be evaluated -- i.e. that it contains valid R code. Once that is done, we'll be able to press Control + Shift + T to run this (and all other) test, and we'll get an informative error message if some equation isn't valid R code.
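
A minimal sketch of such a check: parse and evaluate each equation with a dummy dbh and see whether a single finite number comes back. `can_evaluate()` is a hypothetical helper, the equations are made up, and the sketch assumes dbh is the only variable an equation uses.

```r
# TRUE for each equation that parses and evaluates to one finite number.
can_evaluate <- function(eqn, dbh = 10) {
  vapply(eqn, function(x) {
    out <- tryCatch(
      # Evaluate with `dbh` defined and only base R functions visible.
      eval(parse(text = x), list(dbh = dbh), baseenv()),
      error = function(e) NULL
    )
    is.numeric(out) && length(out) == 1 && is.finite(out)
  }, logical(1), USE.NAMES = FALSE)
}

equations <- c(
  "0.1 * dbh^2.5",          # valid
  "exp(1 + 2 * ln(dbh))",   # parses, but ln() is not an R function
  "0.1 * (dbh^2"            # does not parse: missing parenthesis
)
valid <- can_evaluate(equations)
```

Inside a test file the result could be wrapped as testthat::expect_true(all(valid)), with the failing equations listed in the message.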

Store reference data for regression tests.

@gonzalezeb,

FYI I have stored a copy of each dataset as a reference to test for unexpected changes. You can see the datasets HERE; they are named with the format ref-datasetname.

Although poorly, this feature is documented in the help file of testthat::expect_known_output(). I suggest you don't worry about this for now -- it is better to talk about this in person and I can show you how it works.

I leave this here as a reminder for future discussion, but I close the issue because it requires no more action.

Export and document data

This issue describes the process by which I export and document data. You can track this process searching for commits tagged with the number of this issue (#29).

Collect data relevant to assessing accuracy of allometries

To help assess accuracy of allometries, this table should contain the following information:

min DBH sampled
max DBH sampled
n trees sampled
fitting method (regression type)
bias correction, if applicable
method used to collect data (traditional cutting, drying, weighing; LiDAR)
note whether original source data are available (compiling these data is beyond our current scope, but it would be helpful to know where they're available)

It has all but 1-2. @gonzalezeb, please add any that aren't in there yet.

Extract all unique pieces in the allometric equations, so I can ask Erika what each one means.

2.1. allodb: extract all unique pieces in the allometric equations, so I can ask Erika what each one means. For splitting symbols, use * + ( ) [ ] ^ - CAPITAL-LETTERS exp ln
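
One way to approach this is to pull out every name-like piece (variables and function names) instead of splitting on each symbol. The sketch below does that with a regular expression; the equations are made up, and the pattern is an assumption that would, for example, also pick up the "e" in a number written as 1e-3.

```r
# Made-up equations in the style of the database.
equations <- c("a * DBH^b", "exp(a + b * ln(DBH))")

# Extract every run that starts with a letter or underscore and
# continues with letters, digits, underscores or dots.
pattern <- "[A-Za-z_][A-Za-z0-9_.]*"
pieces <- unique(unlist(regmatches(equations, gregexpr(pattern, equations))))
```

The resulting vector is the list of names to ask Erika about.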

standardize categories in variable_biomass_component

Define a process to screen allometries

We'll want to set bounds on what should be considered a reasonable biomass estimate and flag any allometries that fall outside of this range.

The first step is to define how those bounds should be determined.

Write tutorial to use the BIOMASS package

Suggested by Ervan (via @).

A simple step-by-step tutorial guiding them through, after having installed a package with the functions + objects (i.e. wood density and site info) is probably the best way to go.

--Ervan

For this, maybe use the BIOMASS package which is already available.

Incorporate the "general" equations that are available (check if these equations are already in Erika's tables)

Ervan suggested that the most specific equation (i.e. closer geographically or of higher taxonomic resolution) may not always be the best equation. Instead, the best might be a generic equation. [Here generic does not mean “of taxonomic genus level”; it means general: an equation that has been produced based on measuring many trees.] He said that there are generic equations for three regions: North America, Europe and China. For each region there may be multiple equations: one per taxonomic group.

Normalize database

One important aspect of making the database easy to work with and maintainable in the long run is to normalize it. Eventually, we will need to move in that direction. Our current not-normalized structure already seems to be exposing some issues. For us to assess how urgent it is to normalize the database, I'll document the issues I'm noticing here.

Remove likely useless files

As I restructure the allodb project I would like to delete useless files. This issue documents the files I would like to remove but I need to first confirm with @gonzalezeb.

Fix type_allodb_master() to reflect change in column names

@gonzalezeb, please fix type_allodb_master() (file R/type_allodb_master.R) to reflect the recent change in column names of master.

From something like this:

type_allodb_master <- function() {
  list(
    ...
    <old name> = <old type>,
    ...
  )
}

To something like this:

type_allodb_master <- function() {
  list(
    ...
    <new name> = <new type>,
    ...
  )
}

Reference for what names have changed

library(allodb)

> master <- read_csv_as_chr(here::here("data-raw/allodb_master.csv"))
> 
> setdiff(names(master), names(allodb:::type_allodb_master()))
[1] "biogeographic_zone" "region"             "proxy_species"     
[4] "notes_on_species"  
> setdiff(names(allodb:::type_allodb_master()), names(master))
[1] "development_species" "notes_to_consider"

Tropical: Add to biomass() a parameter to input wood density

Ervan suggests that the biomass() function needs a parameter to input wood density because new wood density data becomes available frequently and users may want to incorporate it. That is why he prefers to compute biomass not with a fixed equation but as a function of wood density.
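
To illustrate what a wood-density parameter buys, the sketch below assumes the equation in question is the pantropical model of Chave et al. (2014), which takes dbh (cm), wood density (g/cm^3) and the environmental stress index E, and returns above-ground biomass in kg. That choice, the function name `biomass_tropical()` and the input values are assumptions for illustration, not the design of biomass().

```r
# Assumption: pantropical model of Chave et al. (2014).
# dbh in cm, wsg (wood density) in g/cm^3, E = environmental stress index.
biomass_tropical <- function(dbh, wsg, E) {
  exp(
    -1.803 - 0.976 * E + 0.976 * log(wsg) +
      2.673 * log(dbh) - 0.0299 * log(dbh)^2
  )
}

# Made-up values: a default wood density and an updated one for the
# same 30 cm tree at the same site.
agb_default <- biomass_tropical(dbh = 30, wsg = 0.60, E = 0)
agb_updated <- biomass_tropical(dbh = 30, wsg = 0.75, E = 0)
```

Because wood density is an argument and not baked into a fixed per-species equation, new density values change the estimate without any change to the stored equations.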

reinstate DBH range in site-species table

@gonzalezeb -

We previously had DBH range in the site-species table, and this commit removed it. I think we need it. The purpose would be to allow assignment of different allometric equations to a single species based on size. For example, we may trust a local/species-specific equation for only part of the possible size range.

I'm envisioning that the equation is selected based on the dbh range in this sheet. The min and max dbh in the equations sheet would only be for reference and to give warnings if there's an attempt to apply an allometry outside the range for which it was developed.
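
Selecting an equation from the dbh range in the site-species table could be sketched like this. `pick_equation()`, the table contents and the convention that the lower limit is inclusive and the upper limit exclusive are assumptions.

```r
# Made-up site-species rows: two equations for one species, split by size.
site_species <- data.frame(
  species = c("Acer rubrum", "Acer rubrum"),
  equation_id = c("eq_small", "eq_large"),
  dbh_min_cm = c(1, 10),
  dbh_max_cm = c(10, 200),
  stringsAsFactors = FALSE
)

# Return the equation whose dbh range contains each stem, or NA if
# no range (or more than one) matches.
pick_equation <- function(dbh, table) {
  vapply(dbh, function(d) {
    hit <- which(d >= table$dbh_min_cm & d < table$dbh_max_cm)
    if (length(hit) == 1) table$equation_id[hit] else NA_character_
  }, character(1))
}

picked <- pick_equation(c(5, 50, 500), site_species)
```

The min and max dbh in the equations sheet would then be consulted separately, only to warn when the picked equation is applied outside the range it was developed for.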

If this doesn't make sense or if you disagree, let's discuss in person.

Clean names of master data

@gonzalezeb,

The master allotemp_main.csv data has names that are difficult to work with. Can you change its names to not have "(" or ")" or spaces? Ideally all names should start with a letter and include only letters, numbers and "_", not "." nor spaces.

If this is not possible let's discuss.
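
For reference, the naming rule above can be expressed as a small function. `clean_names()` is a hypothetical helper written in base R, and the input names are made up; prefixing "x" to names that do not start with a letter is an assumption.

```r
# Keep only letters, numbers and "_"; make sure names start with a letter.
clean_names <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # "(", ")", ".", spaces -> "_"
  x <- gsub("^_+|_+$", "", x)         # drop leading/trailing "_"
  ifelse(grepl("^[A-Za-z]", x), x, paste0("x", x))
}

cleaned <- clean_names(c("DBH (cm)", "site.name", "2nd col"))
```

Applying it to names(master) right after reading the file would be a fallback if the names cannot be changed in the .csv itself.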