
scadsanalysis's Issues

Filtering vignette

From SKM:

I think it would be good in the supplement to provide numbers for how many communities fell under these cut offs (i.e. the number filtered out).

I thought it would be nice to be able to send people directly to the code that does this data filtering. I'm uncertain what the best way to do that is: adding it to the supplement, or linking to the repo. It depends in part on the journal rules.

FS too small for percentile to be very applicable

FS smaller than approx. 50-150 elements:

  • The percentile value is less meaningful, especially for extremely small FS (being in the 99th percentile of 5 numbers is not as informative as being in the 99th percentile of 20,000 numbers)
  • The percentile value is sensitive to decisions about < vs <= (see #39)

Consider filtering these out before proceeding to interpret %ile outcomes (see the sketch below).
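A minimal sketch of the issue (not the project's actual count_below, just the generic calculation):

```r
# Percentile rank of an observed value within the summary statistics of the
# FS draws; `strict` toggles the < vs <= choice discussed in #39.
percentile_rank <- function(obs, fs_values, strict = FALSE) {
  if (strict) 100 * mean(fs_values < obs) else 100 * mean(fs_values <= obs)
}

# With only 5 elements, percentiles can only take values in steps of 20,
# and a tie moves the answer by a full step depending on < vs <=:
small_fs <- c(1, 2, 2, 3, 4)
percentile_rank(2, small_fs, strict = TRUE)   # 20
percentile_rank(2, small_fs, strict = FALSE)  # 60
```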

Cleanup steps

  • fix tests
  • incorporate new functions
  • incorporate manips
  • fix cache locking
  • use resources more efficiently
  • tidy reports

Size of fs

I think there's a way to figure out the size of the feasible set from S and N and the p-table.
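If the FS is the set of partitions of N into exactly S positive parts, its size follows the standard partition recurrence P(n, k) = P(n - 1, k - 1) + P(n - k, k). A hedged sketch (assuming that's what the p-table tabulates; realistic S and N will overflow doubles, so a real implementation presumably works on logs or big integers):

```r
# Number of partitions of n into exactly s positive parts, via
# P(n, k) = P(n - 1, k - 1) + P(n - k, k), with P(0, 0) = 1.
fs_size <- function(n, s) {
  p <- matrix(0, nrow = n + 1, ncol = s + 1)  # p[i + 1, k + 1] holds P(i, k)
  p[1, 1] <- 1
  for (i in 1:n) {
    for (k in 1:s) {
      rest <- if (i - k >= 0) p[i - k + 1, k + 1] else 0
      p[i + 1, k + 1] <- p[i, k] + rest
    }
  }
  p[n + 1, s + 1]
}

fs_size(10, 3)  # 8: there are 8 ways to write 10 as a sum of 3 positive parts
```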

Setting seeds

Added a function to allow passing seeds to sample_fs_wrapper. Down the line and in CATS, this should (a) help with reproducibility and (b) make it possible to delete the actual FS draws from the drake cache and still recreate the exact results. This matters because the actual FS draws make the cache massive, and will introduce a whole new level of size issues when moving to TS (another jump in scale).
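Roughly the pattern (sample_fs_wrapper's real signature may differ; draw_fs here is a hypothetical stand-in for the underlying sampler):

```r
# Record the seed with each batch of draws so the draws themselves are
# disposable: delete them from the cache, rerun with the same seed, and
# get back exactly the same FS samples.
sample_fs_seeded <- function(s, n, n_samples, seed) {
  set.seed(seed)
  replicate(n_samples, draw_fs(s, n), simplify = FALSE)
}
```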

Increase nb samples

I ran into issues with 10k samples before (#13), but I think I need to find a way to draw many more samples. I increased from 2500 to 10000 samples for MACDB and saw some subtle but important changes in the qualitative outcomes. Unfortunately I moved too fast and restarted with 30k samples before I copied the cache, so I'll update in a bit with specifics on what drove the change. At the macro scale, what happens is that increased sampling eventually captures more of the extremes of the FS, so the observed vector - even if it is weird - is no longer outside the range of variation of the samples, and no longer gets a percentile value of exactly 0 or 100. This reduces the absorbance of the absorbing boundary at 0 and 100 and increases our ability to detect variation in the extent of the deviation.

The high proportion of 0 and 100 values also probably makes it difficult to tell how much S, N, and characteristics of the feasible set affect percentile values, both visually and statistically. So this is positive in the long run.
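A toy illustration of the boundary effect (not project code): take an observed statistic at the true 99.9th percentile and see how often it lands outside the sampled range at different sample sizes.

```r
set.seed(1)
obs <- qnorm(0.999)  # a genuinely extreme observed value
for (n in c(2500, 10000, 30000)) {
  # proportion of runs where obs exceeds every sample, i.e. gets pinned at 100
  pinned <- mean(replicate(200, obs > max(rnorm(n))))
  cat(n, "samples: P(percentile == 100) ~", pinned, "\n")
}
# At 2500 samples obs gets pinned at 100 fairly often (~8% of runs);
# by 10k+ samples essentially never.
```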

Resampling

Rarefaction is already done.

Jackknife: as I currently understand it, resample a subset of the individuals in the observed sample and treat this as its own sample. Then run the analytical pipeline. Use numerous resamplings per observed community. The question is whether the subsamples also deviate....

This is from a relatively quick run-through of resampling methods; be sure the rationale is right. (Does subsampling something we think is error-prone really help?)

Note that subsampling that results in a very small community may cause power to break down, which is expected. This should only happen for a predictable group of borderline-small communities, though.

This may be computationally intensive, because each observed SAD will propagate into n resampled SADs, each of which will need sampling.
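A hedged sketch of the subsampling step (assuming the SAD is stored as a vector of species abundances):

```r
# Resample a fraction of the individuals without replacement and rebuild
# the SAD for the subsample; species with no retained individuals drop out.
subsample_sad <- function(sad, frac = 0.5) {
  individuals <- rep(seq_along(sad), times = sad)  # one entry per individual
  kept <- sample(individuals, size = floor(frac * length(individuals)))
  sort(as.integer(table(kept)), decreasing = TRUE)
}

# Many resamplings per observed community, each run through the pipeline:
# resamples <- replicate(100, subsample_sad(observed_sad), simplify = FALSE)
```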

Questions for HY/SKME

General

Submission details

  • Formatting: font? page numbers?
  • Corresponding author marked correctly?
  • Address for correspondence?
  • COI statement on front page?
  • Conflicted/suggested reviewers? I have a blank here. Both in terms of, do we provide this list or is it obvious to the editor, and if it's important that we provide it, who's on each list? Would assume Ethan, Locey, Kitzes, Newman can't review, but all are in acknowledgements/heavily cited in the text.

Data

  • I want to be sure we attribute things properly in the data accessibility statement, text, and refs. The data for the analysis are (mostly) the same as Baldridge (2016), the exception being that I accessed Misc Abund from figshare. That version may match the data used in Baldridge (2016); I re-accessed it because it's a more heterogeneous dataset than the others and I wanted to be able to see clearly what I was getting (instead of the processed versions). For the others, I downloaded the .csvs from weecology/sad-comparison. The data on GitHub are identical to what is on Zenodo. Details at #52
  • Is it OK that I'm getting the data from GitHub and figshare, rather than Zenodo? (GH data exactly matches Zenodo, for what it's worth)
  • Is it OK if we then re-archive on Zenodo?

Hao

  • Have I got the right affiliation? What's your preferred email?
  • Is there anything you'd like to put in the acknowledgements?
  • Check description of sampling algorithm now in Methods
  • Should algorithm vignette be a supplement?

More to sign off on / close the loop:

  • Minor language changes in comments (flagged)
  • Updates to figures & figure legends

Morgan

  • Signing off on language changes in comments (flagged)

Some figure changes:

  • Removed fig. 1 (distribution of datasets in sXn space) and put it in the supplement
  • Changes to fig. 4 to add breadth index - too dense now?

Try kde rather than percentile

  • Construct KDE of summary stat from fs
  • Scale to sum to 1
  • Density at obs value

This gives a better sense of how weird it is than just the percentile (rank).
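Roughly (assuming fs_stats is the vector of the summary statistic over FS draws):

```r
# Low density at the observed value = weird, regardless of which side of
# the distribution it falls on.
kde_weirdness <- function(obs, fs_stats) {
  kde <- density(fs_stats)              # KDE of the FS summary statistic
  p <- kde$y / sum(kde$y)               # scale so the grid sums to 1
  p[which.min(abs(kde$x - obs))]        # (approximate) mass at the obs value
}
```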

Intersections of datasets

If different datasets give different results, it would be nice to compare the subsets that have comparable S and N. I expect S and N, at large scale, to constrain the expected/possible outcomes. I'm actually not even sure how much the datasets intersect in terms of S and N, so visualizing that is the first step.

Reproducible example with subset

It takes a long time & a lot of compute to run the full analysis on all the datasets.

Create a smaller example that can run in a reasonable amount of time on a laptop.

This requires a smaller p-table, or having the p-table available for download somewhere.

Possibly do it with just one dataset? MCDB might be the most tractable (smallest p-table, relatively few communities).

Dataset details

From http://www.esapubs.org/archive/ecol/E093/155/appendix-A.pdf (Appendix A of White et al 2012, quoting here from the pdf):

Birds (BBS)

We used community data collected in 2009 from 2,769 routes of the Breeding Bird Survey (Sauer et al. 2011) (BBS) and 1,999 counts of the Christmas Bird Count (National Audubon Society 2002) (CBC). BBS routes are 40 km long, each consisting of 50 three minute point counts, 800 m apart, sampled annually in June. The 2009 data include a total of 1,819,908 individuals representing 347 species of diurnal landbirds, with individual routes averaging 657.2 ± 323.9 individuals (range = 53 – 3,504) and 46 ± 13 species (range = 10 – 81).

Trees (FIA and Gentry)

We also used two existing data sets of species abundance for communities of trees, the USFS Forest Inventory Analysis program (U.S. Department of Agriculture 2010, Woudenberg et al. 2010, http://apps.fs.fed.us/fiadb-downloads/datamart.html) (FIA), and the Alwyn H. Gentry Forest Transect Data Set (Phillips and Miller 2002) (referred to herein as ‘Gentry’). We used one year of data (calendar year of sampling varies among plots) for FIA phase 2 plots that were sampled using the standardized methodology implemented in 1999 [see the FIA National Core Field Guide for more information (U.S. Department of Agriculture 2010)]. The standard plot consists of four 24.0-foot (7.32 m) radius subplots, on which trees 5.0 inches (12.7 cm) and greater in diameter are identified to species and measured. We used species abundance data for 10,355 FIA plots, encompassing a total of 380,581 individuals and 236 species, with plots averaging 36.8 ± 12.5 individuals (range = 11 – 118) and 11.4 ± 1.6 species (range = 10 – 21). The Gentry data were collected from 226 0.1-hectare sites throughout the world, with each site sampled once over the course of a 22 year period. At each site, all plants with stem diameters of 2.5 cm or greater were identified and measured along ten 2 × 50 m transects. It should be noted that, due to difficulties in the taxonomy and identification of tropical trees, some species in the Gentry dataset are identified only as morpho-species (unique within sites), and species’ names vary among sites due to both typographical errors and synonymy problems. Since we only analyzed data within a site, these issues do not affect our analyses, but they artificially elevate the count of species in the Gentry dataset and therefore the number of species included in the overall analysis. We used data from 222 sites, including 67,405 individuals representing approximately 7,300 species, with individual sites averaging 303.6 ± 115.6 individuals (range = 44 – 779) and approximately 91.4 ± 59.7 species (range = 10 – 250).

MCDB

We used species abundance data for the 103 sites included in the Mammal Community Database (Thibault et al. 2011) (MCDB) that included at least 10 species (mean richness = 13.6 ± 4.0 species; range = 10 – 34). These data have been compiled from various published sources and therefore have not been collected using a standardized protocol across sites. As a result, these data are species-level abundances of small mammals that were captured using various levels of sampling effort spread across varying amounts of time and space. Despite these limitations, these data represent, to our knowledge, the largest collection of mammal community data ever analyzed in one study. The data encompass a total of 380 mammal species and 94,866 individuals (mean abundance per site = 921.0 ± 1,434.9; range = 19 – 10,085).

Misc Abund

From https://github.com/weecology/sad-comparison/blob/master/chapter1.md (Baldridge's dissertation, ch. 1):

Data on Actinopterygii, Reptilia, Coleoptera, Arachnida, and Amphibia, were mined from literature by Baldridge and are publicly available [@Baldridge2013] (see Table 1 for details). These data were collected at the level of the site defined in the publication if raw data were available at that scale, and at the scale of the entire study otherwise. Time scales of collection for this data depended on the study but was typically one or a few years.

Off by one errors in percentile?

If you have found 100% of the feasible set, can you have a percentile value of 0 or 100?

Ugh, percentiles are actually kind of sketchily defined at the edges. With <=, you can get 100 (when value = max(values)), but I think not 0. There is potential for odd edge behavior (quick demo below).
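Quick toy demo of why:

```r
vals <- c(3, 1, 4, 1, 5)
100 * mean(vals <= max(vals))  # 100: the maximum always maps to 100 under <=
100 * mean(vals <= min(vals))  # 40 here: the minimum counts itself, so never 0
100 * mean(vals <  min(vals))  # 0 only happens with strict <
```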

Edits checklist

Main text

  • Description of sampling algo in methods
  • Format in text refs
  • Word counts
  • Nb of refs
  • Nb of figs

Refs

  • Citing data as from Baldridge (2016)?
  • Format refs section
  • Reference for Misc Abund database (formatting)

Supplements

  • Edits to S1
  • Edits to S2
  • Check supplements for refs
  • Supplemental figure legends
  • Supplemental table captions
  • Add sampling algorithm as supplement (will change numbering)

Submission peripherals

February revisions

Analysis

  • Jackknife resampling #56
  • Shannon diversity #57
  • Proportion off (~effect size) #58
  • Log ratio?
  • Distinguishability
  • Number of rare species #59

Writing

  • Ratio of N to S effects on percentile
  • (Speculation/conversation on:) how distinguishability assumptions might affect results
  • Expand discussion of complex-systems-in-ecology, deeper roots of SAD work
  • Expand ecological intuition in both intro & discussion of results - may be bolstered by number of rare species analysis

Details

  • Change first sentence
  • Define hollow curve x and y axes
  • Update S and N ranges
  • Link to codebase! 🙃
  • Add details of sampling algorithm to main text

Edge behavior of percentile (esp for small FS)

In clean-and-tests I changed count_below (the percentile-finding function) to count the number of values <=, not <, a focal value. This was just to match what we'd put in the manuscript.

This mostly didn't change things qualitatively - EXCEPT that it tends to increase %ile values.

How to import SADs

  • by dataset
  • by dataset x site
  • should statevars be separate?
  • many small objects vs. few large objects
  • generally inclined towards many small; will this break drake?

Memory management

10k samples for BBS create data frames that are too large for R. I have reduced to 2500 to see how that goes.

The same problem will definitely arise for FIA, to a greater degree: there are 2773 BBS sites and >100k FIA sites. I think I will need to break FIA into at least 10 and possibly more smaller chunks (sketch below), which raises the question of whether it even makes sense to include all of them.
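The chunking idea, sketched (fia_site_ids and run_pipeline are hypothetical placeholders):

```r
# Process FIA in batches small enough to hold in memory, writing each
# chunk's results to disk rather than accumulating one giant data frame.
chunk_size <- 10000
chunks <- split(fia_site_ids, ceiling(seq_along(fia_site_ids) / chunk_size))
for (i in seq_along(chunks)) {
  res <- run_pipeline(chunks[[i]])
  saveRDS(res, sprintf("results/fia_chunk_%02d.rds", i))
}
```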

Filter MCDB to single year

Many of the datasets in MCDB as imported from Baldridge/White are aggregated over many years of data.

Looking at the MCDB itself, some of this is happening because the data as it is reported in MCDB is already combined over the duration of the study. There are about 70 timeseries in the MCDB, with observations at multiple time points.

With a little bit of fiddling I can run the cats analysis on the MCDB timeseries data. But this is starting to wander into the broader scaling/aggregation question: does the SAD generated by pooling data across all years give the same, or systematically different, results as the SADs generated from each year separately? I think this is an important question, and one to address more systematically than as a one-off for this particular dataset.

Parallelization

You can't run everything in parallel at the scale of the site because this breaks SLURM (too many requests sent too close together).

  1. Bunch by dataset?
  2. Is there a way to tell drake/batchtools to break things into chunks? (See the sketch below.)
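One possibility for (2): let drake dispatch a fixed pool of persistent clustermq workers, so SLURM sees a handful of long-lived jobs instead of one request per site. The template file name here is a placeholder for the cluster's actual config:

```r
library(drake)
options(clustermq.scheduler = "slurm",
        clustermq.template = "slurm_clustermq.tmpl")
# 8 persistent workers share all the targets, instead of one job per target
make(plan, parallelism = "clustermq", jobs = 8)
```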

Tracking dataset run status

  • BBS: thru cdf 2/20/20
  • FIA: thru cdf 2/20/20
  • Gentry: thru cdf 2/20/20
  • MACDB: thru cdf 2/20/20
  • MCDB: thru cdf 2/20/20
  • Misc: thru cdf 2/20/20
  • Portal manips: thru cdf 2/20/20
  • Portal plants: thru cdf 2/20/20

updated 2/20/20 1pm

Prior to submitting

  • incorporate Hao's edits
  • change results table to be broken out by dataset
  • ks test for fia-not fia
  • figure & figure legend edits
  • add algorithm supplement & renumber
  • add links to Zenodo download to filtering vignette
  • merge to master, figure out why master fails but clean-and-tests does not
  • mention biorxiv in cover letter
  • HY, SKME sign off on final version
  • biorxiv upload
  • submit

Some decisions

  • Not doing badges
  • Will do transparent review
  • Not recommending reviewers

After submitting notes

  • Keep preprint updated with new versions
  • When ready to archive on Zenodo, include language noting that these data are provided for reproducibility, but that for other uses it is best to access them from the original sources.

Originally posted by @diazrenata in #49 (comment)

Datasets

Biological datasets

Open datasets from White et al 2012:

  • Gentry
  • FIA
  • MCDB

From Baldridge:

  • Misc. Abundance Database

Out of curiosity:

  • Condit (BCI and other tropical forest plots)
  • NEON? (In progress)

Non-biological datasets

  • Linux distros?
  • Data from non-bio paper

Nsamples vs results

Especially with FIA, we're seeing considerably fewer than 10,000 unique samples. Visualize whether there's any correspondence between results and nsamples (sketch below).
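Sketch of that plot (column names in all_di are assumptions):

```r
library(ggplot2)
# percentile vs. number of unique FS samples actually obtained, by dataset
ggplot(all_di, aes(x = n_unique_samples, y = percentile)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_wrap(~ dataset)
```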

Sampling plan

  • Mammals: mammals p-table
  • FIA, Gentry, BBS: wide p-table
  • Misc Abund: needs filtering to <50k individuals; use long p-table

Moonshots include >50k individuals and the BCI data. Not sure whether it is possible to get a p-table for those communities.

Questions

  1. Are empirical species abundance distributions unusual (in skewness or Simpson evenness) compared to their feasible set?
  2. Is our conclusion sensitive to the addition of cryptic rare species?

Analysis updates

  • Use <= for skewness and < for evenness: this is conservative about designating ties as extreme
  • At least filter out FS for which there are fewer than 20 unique values for the summary statistics; with fewer than 20 unique values it is impossible to be in the 95th/5th percentile (filter sketched below).
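The filter, sketched (treating fs_stat_list as a list of per-community summary-statistic vectors is an assumption):

```r
# Keep only communities whose FS produced at least 20 unique values of the
# summary statistic; below that, the 95th/5th percentile is unreachable.
keep <- vapply(fs_stat_list, function(x) length(unique(x)) >= 20, logical(1))
fs_stat_list <- fs_stat_list[keep]
```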

Do S and N predict percentile?

S and N shape the size of the FS and its characteristics. Do they have a detectable knock-on relationship with %ile? See #20 for discussion of beta and zero-one-inflated beta regressions, and the sketch below.

Deeper than that, I think things are way too correlated to tease out. See sadspace.
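For reference, one hedged sketch of what the zero-one-inflated beta regression could look like (BEINF in gamlss handles mass at exactly 0 and 1; the s0/n0 column names are assumptions, and this is one option, not the approach settled in #20):

```r
library(gamlss)  # BEINF: beta distribution inflated at 0 and 1
all_di$p <- all_di$percentile / 100  # rescale percentile to [0, 1]
m <- gamlss(p ~ log(s0) + log(n0), family = BEINF, data = all_di)
summary(m)
```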

Analysis for manuscript

Working on the pipeline/analysis that has wound up in the manuscript.

  • Get datasets: download from weecology repo/figshare
  • Process datasets: filter out small/large communities, communities with N = S, etc. Max N = 40,720; this ceiling applies in Misc Abund. For FIA small, 10,000 samples are drawn from the 66,000 sites with 3 <= S <= 9.
  • Processing datasets: subsample the small FIA communities
  • Singletons (now in supplement)
  • Sample FS for each community
  • Calculate skew and evenness for every FS draw and observed
  • Compare observed values to distributions from FS
  • Calculate 95% ratio for FS
  • Combine results from each dataset into overall data frame (all_di.csv)
  • Render overall figures

The clean-and-tests branch will be for the version of the analysis that matches the manuscript draft.

There are some old/stale reports and analyses that I don't at present know what to do with. I will probably delete them from the version in clean-and-tests, which may eventually morph into the default branch.

I may need to write up an explainer on how the drake pipelines work.

Will also need an interlude on the other ROV approaches I tried (supplement)

Possible correlates of percentile

  • S, N, S/N
  • some measure of spread in the FS distribution - not just range, as range is sensitive to whether you happen on the very flat/very skewed extremes; SD or something similar
  • mean/median of the FS distribution? If the central tendency is super skewed/super flat.
  • dataset

then, manips

Data citation and archiving

Details:

Data

  • I'm not 100% on how to cite the data or how to phrase the data accessibility statement. The data for the analysis are (mostly) the same as Baldridge (2016), the exception being that I accessed Misc Abund from figshare. I downloaded the .csvs from weecology/sad-comparison. The data on GitHub are identical to what is on Zenodo. This crops up in the data accessibility statement and when we discuss the source of the data. Currently:

Data accessibility statement: All data used are available publicly via Zenodo and figshare. Upon publication, all code and data will be archived and made publicly available via Zenodo.

In Methods: does this need to say more explicitly that the data were accessed from the repo for the 2016 paper (except Misc. Abund, which I got from figshare)?

We used a compilation of community abundance data for trees, birds, mammals, and miscellaneous other taxa that has been used in recent macroecological explorations of the SAD (White et al 2012 , Baldridge 2016, Baldridge 2015).

In refs, citing both the Baldridge paper and "Data from" that paper:

Baldridge, E. (2015). Miscellaneous Abundance Database. figshare. https://doi.org/10.6084/m9.figshare.95843.v4
Baldridge, E., Harris, D.J., Xiao, X. & White, E.P. (2016). An extensive comparison of species-abundance distribution models. PeerJ, 4, e2823.
Baldridge, E., Harris, D.J., Xiao, X. & White, E.P. (2016). Data from: An extensive comparison of species-abundance distribution models. Zenodo. https://zenodo.org/record/166725
White, E.P., Thibault, K.M. & Xiao, X. (2012). Characterizing species abundance distributions across taxa and ecosystems using a simple maximum entropy model. Ecology, 93, 1772–1778.

  • Is it OK that I'm getting the data from GitHub and figshare? (GH data exactly matches Zenodo, for what it's worth; it seems excessive and not fully transparent to retrofit the code so it downloads from Zenodo instead of GH at this stage...?)

Workers time out after 1 hour

Possible reasons:

  • HPG resource limits?
  • Clustermq giving up after a couple of milliseconds?
  • Units of time requested?
  • Syntax for changing upper limits?

It would help if the top-level job errored instead of just lurking in limbo.
