diazrenata / scadsanalysis
Research compendium comparing species abundance distributions to their feasible sets.
License: MIT License
From SKM:
I think it would be good in the supplement to provide numbers for how many communities fell under these cut offs (i.e. the number filtered out).
I thought it would be nice to be able to send people directly to the code that does this data filtering. I'm not sure of the best way to do that: whether by adding it to the supplement or via a link to the repo. It depends in part on the journal rules.
FS smaller than approx. 50-150 elements:
Consider filtering these out before you proceed with interpreting %ile outcomes
I think there's a way to figure out the size of the feasible set from S and N and the p-table.
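The size of the feasible set for a given S and N is the number of integer partitions of N into exactly S positive parts, which is what the p-table tabulates. As an illustration (the repo itself is in R; this is a minimal Python sketch, and `n_partitions` is a hypothetical helper, not a function from the repo):

```python
def n_partitions(n, k):
    """Number of partitions of n into exactly k positive parts,
    i.e. the size of the feasible set for N = n individuals and S = k species."""
    # p[i][j] = partitions of i into exactly j parts
    p = [[0] * (k + 1) for _ in range(n + 1)]
    p[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # either the smallest part is 1 (drop it: p[i-1][j-1]),
            # or every part is > 1 (subtract 1 from each of the j parts: p[i-j][j])
            p[i][j] = p[i - 1][j - 1] + p[i - j][j]
    return p[n][k]

# partitions of 10 into 3 parts: 8+1+1, 7+2+1, 6+3+1, 6+2+2,
# 5+4+1, 5+3+2, 4+4+2, 4+3+3 -> 8
print(n_partitions(10, 3))
```

Communities whose FS size falls under the chosen cutoff (approx. 50-150 elements) could then be filtered out before computing percentiles.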
Added a function to allow passing seeds to sample_fs_wrapper. Down the line, and in CATS, this should a) help with reproducibility and b) make it possible to delete the actual FS draws from the drake cache and still recreate the exact results. This matters because the actual FS draws make the cache massive and will introduce a whole new level of size issues when moving to TS (another jump in scale).
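The seed idea can be sketched as follows. This is a Python toy stand-in, not the actual sample_fs_wrapper interface, and it draws compositions rather than true FS samples; the point is only that storing the seed is enough to recreate a draw exactly, so the draw itself can be deleted from the cache:

```python
import random

def sample_fs(s, n, seed):
    """Toy stand-in: draw one random composition of n individuals
    into s positive parts, deterministically from a seed."""
    rng = random.Random(seed)
    # choose s-1 distinct cut points in 1..n-1 to split the n individuals
    cuts = sorted(rng.sample(range(1, n), s - 1))
    bounds = [0] + cuts + [n]
    return [bounds[i + 1] - bounds[i] for i in range(s)]

# the same seed always reproduces the same draw, so only seeds need caching
assert sample_fs(4, 20, seed=42) == sample_fs(4, 20, seed=42)
```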
I ran into issues with 10k samples before (#13), but I think I need to find a way to draw many more samples. I increased from 2500 to 10000 samples for MACDB and saw subtle but important changes in the qualitative outcomes. Unfortunately I moved too fast and restarted with 30k samples before I copied the cache, so I'll update in a bit with specifics on what drove the change. At the macro scale, what happens is that increased sampling eventually captures more of the extremes of the FS, so the observed vector, even if it is weird, is no longer outside the range of variation of the samples and no longer has a percentile value of 0 or 100. This weakens the absorbing boundary at 0 or 100 and increases our ability to detect variation in the extent of the deviation.
The high proportion of 0 and 100 values also probably makes it difficult to tell how much S, N, and characteristics of the feasible set affect percentile values, both visually and statistically. So this is positive in the long run.
Rarefaction is already done.
Jackknife: to my understanding at the moment, resample a subset of the individuals in the observed sample and treat this as its own sample, then run the analytical pipeline. Use numerous resamplings per observed community. The question is whether the subsamples also deviate.
This is from a relatively quick run through resampling methods; be sure the rationale is right. (Does subsampling something we think is error-prone really help?)
Note that subsampling that results in a very small community may cause power to break down, which is expected. This should only happen for a predictable group of borderline-small communities, though.
This may be computationally intensive, because each observed SAD will propagate into n resampled SADs, each of which will need sampling.
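The resampling step above can be sketched like this (Python illustration; `subsample_sad` is a hypothetical name, not from the repo). Each observed SAD is expanded into individuals, a fraction is drawn without replacement, and the surviving abundance vector is rebuilt:

```python
import random
from collections import Counter

def subsample_sad(abundances, frac, seed):
    """Draw frac of the individuals (without replacement) and return
    the abundance vector of the species present in the subsample."""
    rng = random.Random(seed)
    # expand the SAD into one entry per individual, labelled by species index
    individuals = [sp for sp, n in enumerate(abundances) for _ in range(n)]
    kept = rng.sample(individuals, int(frac * len(individuals)))
    return sorted(Counter(kept).values())

sad = [50, 20, 10, 5, 1]  # one observed community, 86 individuals
resamples = [subsample_sad(sad, 0.5, seed=i) for i in range(100)]
# each resample has at most the original richness and ~half the individuals;
# each then needs its own pass through the pipeline, hence the compute cost
```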
Submission details
Data
More to sign off on/close the loop -
Some figure changes:
This gives a better sense of how weird it is than just the percentile (rank).
If different datasets give different results, it would be nice to compare the subsets that have comparable S and N. I expect S and N, at large scale, to constrain the expected/possible outcomes. I'm actually not even sure how much the datasets intersect in terms of S and N, so visualizing that is the first step.
It takes a long time & a lot of compute to run the full analysis on all the datasets.
Create a smaller example that can run in a reasonable amount of time on a laptop.
This requires a smaller p-table, or having the p-table available for download somewhere.
Possibly doing it with just one dataset? MCDB might be the most tractable (smallest p-table, relatively few communities.)
From http://www.esapubs.org/archive/ecol/E093/155/appendix-A.pdf (Appendix A of White et al 2012, quoting here from the pdf):
We used community data collected in 2009 from 2,769 routes of the Breeding Bird Survey (Sauer et al. 2011) (BBS) and 1,999 counts of the Christmas Bird Count (National Audubon Society 2002) (CBC). BBS routes are 40 km long, each consisting of 50 three minute point counts, 800 m apart, sampled annually in June. The 2009 data include a total of 1,819,908 individuals representing 347 species of diurnal landbirds, with individual routes averaging 657.2 ± 323.9 individuals (range = 53 – 3,504) and 46 ± 13 species (range = 10 – 81).
We also used two existing data sets of species abundance for communities of trees, the USFS Forest Inventory Analysis program (U.S. Department of Agriculture 2010, Woudenberg et al. 2010, http://apps.fs.fed.us/fiadb-downloads/datamart.html) (FIA), and the Alwyn H. Gentry Forest Transect Data Set (Phillips and Miller 2002) (referred to herein as ‘Gentry’). We used one year of data (calendar year of sampling varies among plots) for FIA phase 2 plots that were sampled using the standardized methodology implemented in 1999 [see the FIA National Core Field Guide for more information (U.S. Department of Agriculture 2010)]. The standard plot consists of four 24.0-foot (7.32 m) radius subplots, on which trees 5.0 inches (12.7 cm) and greater in diameter are identified to species and measured. We used species abundance data for 10,355 FIA plots, encompassing a total of 380,581 individuals and 236 species, with plots averaging 36.8 ± 12.5 individuals (range = 11 – 118) and 11.4 ± 1.6 species (range = 10 – 21). The Gentry data were collected from 226 0.1-hectare sites throughout the world, with each site sampled once over the course of a 22-year period. At each site, all plants with stem diameters of 2.5 cm or greater were identified and measured along ten 2 × 50 m transects. It should be noted that, due to difficulties in the taxonomy and identification of tropical trees, some species in the Gentry dataset are identified only as morpho-species (unique within sites), and species’ names vary among sites due to both typographical errors and synonymy problems. Since we only analyzed data within a site, these issues do not affect our analyses, but they artificially elevate the count of species in the Gentry dataset and therefore the number of species included in the overall analysis.
We used data from 222 sites, including 67,405 individuals representing approximately 7,300 species, with individual sites averaging 303.6 ± 115.6 individuals (range = 44 – 779) and approximately 91.4 ± 59.7 species (range = 10 – 250).
We used species abundance data for the 103 sites included in the Mammal Community Database (Thibault et al. 2011) (MCDB) that included at least 10 species (mean richness = 13.6 ± 4.0 species; range = 10 – 34). These data have been compiled from various published sources and therefore have not been collected using a standardized protocol across sites. As a result, these data are species-level abundances of small mammals that were captured using various levels of sampling effort spread across varying amounts of time and space. Despite these limitations, these data represent, to our knowledge, the largest collection of mammal community data ever analyzed in one study. The data encompass a total of 380 mammal species and 94,866 individuals (mean abundance per site = 921.0 ± 1,434.9; range = 19 – 10,085).
From https://github.com/weecology/sad-comparison/blob/master/chapter1.md Baldridge's dissertation ch. 1
Data on Actinopterygii, Reptilia, Coleoptera, Arachnida, and Amphibia were mined from the literature by Baldridge and are publicly available [@Baldridge2013] (see Table 1 for details). These data were collected at the level of the site defined in the publication if raw data were available at that scale, and at the scale of the entire study otherwise. Time scales of collection for these data depended on the study but were typically one or a few years.
If you have found 100% of the feasible set, can you have a percentile value of 0 or 100?
Ugh, percentiles are actually somewhat sketchily defined at the edges. You can have 100 (when value = max(values)), but I think not 0. There is the potential for odd edge behavior.
In clean-and-tests
I changed count_below (the percentile-finding function) to count the number of values <=, not <, a focal value. This was just to match what we'd put in the manuscript.
This mostly didn't change things qualitatively. EXCEPT. It tends to increase %ile values.
Already being calculated by the pipeline.
by dataset
by dataset x site
Should statevars be separate?
Many small objects vs. few large objects?
Generally inclined towards many small; will this break drake?
10k samples for BBS create dataframes that are too large for R. I have reduced to 2500 to see how that goes.
The same problem will definitely arise for FIA, to a greater degree. There are 2773 BBS sites and >100k FIA sites. I think I will need to break FIA into at least 10, possibly more, smaller chunks, which raises the question of whether it even makes sense to include all of them.
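The chunking itself is mechanical; a Python sketch (the repo is R, and the chunk size here is an arbitrary illustration, not a recommendation):

```python
def chunk(sites, size):
    """Split a site list into consecutive chunks of at most `size` sites,
    so each chunk can become its own pipeline target."""
    return [sites[i:i + size] for i in range(0, len(sites), size)]

site_ids = list(range(103000))   # stand-in for >100k FIA site ids
chunks = chunk(site_ids, 10300)  # ~10 chunks, as floated above
# nothing lost, nothing duplicated
assert sum(len(c) for c in chunks) == len(site_ids)
```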
Many of the datasets in MCDB as imported from Baldridge/White are aggregated over many years of data.
Looking at the MCDB itself, some of this is happening because the data as it is reported in MCDB is already combined over the duration of the study. There are about 70 timeseries in the MCDB, with observations at multiple time points.
With a little bit of fiddling I can run the cats analysis on the MCDB timeseries data. But the question is starting to wander off into this scaling/aggregation question: does the SAD generated by pooling data from all years give us the same or systematically different results than the SADs generated from each year separately? I think this is an important question, and one to address more systematically than as a one-off for this particular dataset.
You can't run everything in parallel at the scale of the site because this breaks SLURM (too many requests sent too close together).
updated 2/20/20 1pm
Originally posted by @diazrenata in #49 (comment)
Open datasets from White et al 2012:
Out of curiosity:
Especially with FIA, we're seeing considerably fewer than 10000 unique samples. Visualize whether there's any correspondence between results and nsamples.
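Counting the unique draws is a one-liner; a Python sketch with an illustrative helper name (`n_unique` is not from the repo):

```python
def n_unique(samples):
    """Number of distinct sampled vectors among the FS draws."""
    return len({tuple(s) for s in samples})

# toy draws: 4 samples, only 2 distinct vectors
draws = [[3, 2, 1], [3, 2, 1], [4, 1, 1], [3, 2, 1]]
assert n_unique(draws) == 2
```

Plotting this count against the percentile results per site would show whether low uniqueness tracks the odd outcomes.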
Mammals - mammals p table
FIA, gentry, bbs - wide p table
Misc abund - need to filter to <50k individuals, use long p table
Moonshots include >50k individuals, BCI data. Not sure if it is possible to get a p table for those communities.
😠
S and N shape the size of the FS and its characteristics. Do they have a detectable knock-on relationship with %ile? See #20 for discussion of beta and zero-one-inflated beta regressions.
Deeper than that, I think things are way too correlated to tease out. See sadspace.
Temporal scale of MCDB and Misc Abund
Filtering cutoffs (see also #43)
This should spin off into a new repo. Use MATSS to get some timeseries; start with Portal plants and a subset of BBS? Then track %ile thru time per site.
Qs:
See https://github.com/diazrenata/scadsplants/blob/master/analysis/reports/dis_ts.md
Working on the pipeline/analysis that has wound up in the manuscript.
The clean-and-tests branch will be for the version of the analysis that matches the manuscript draft.
There are some old/stale reports and analyses that I don't at present know what to do with. I will probably delete them from the version in clean-and-tests, which may eventually morph into the default branch.
I may need to write up an explainer on how the drake pipelines work.
Will also need an interlude on the other ROV approaches I tried (supplement)
then, manips
Details:
Data
Data accessibility statement: All data used are available publicly via Zenodo and figshare. Upon publication, all code and data will be archived and made publicly available via Zenodo.
In methods - does this need to more explicitly say that the data were accessed from the repo for the 2016 paper (except Misc. Abund, which I got from figshare):
We used a compilation of community abundance data for trees, birds, mammals, and miscellaneous other taxa that has been used in recent macroecological explorations of the SAD (White et al. 2012, Baldridge 2016, Baldridge 2015).
In refs, citing both the Baldridge paper and "Data from" that paper:
Baldridge, E. (2015). Miscellaneous Abundance Database (MiscAbundanceDB_main). figshare. https://doi.org/10.6084/m9.figshare.95843.v4
Baldridge, E., Harris, D.J., Xiao, X. & White, E.P. (2016). An extensive comparison of species-abundance distribution models. PeerJ, 4, e2823.
Baldridge, E., Harris, D.J., Xiao, X. & White, E.P. (2016). Data from: An extensive comparison of species-abundance distribution models. Zenodo. Available at: https://zenodo.org/record/166725.
White, E.P., Thibault, K.M. & Xiao, X. (2012). Characterizing species abundance distributions across taxa and ecosystems using a simple maximum entropy model. Ecology, 93, 1772–1778.
Possible reasons:
It would help