
scadsanalysis's Issues

Filtering vignette

From SKM:

I think it would be good in the supplement to provide numbers for how many communities fell under these cut offs (i.e. the number filtered out).

I thought it would be nice to be able to send people directly to the code that does this data filtering. I'm uncertain what the best way to do that is: adding it to the supplement, or linking to the repo. It depends in part on the journal rules.

FS too small for percentile to be very applicable

FS smaller than approx. 50-150 elements:

  • The percentile value is less meaningful, especially for extremely small FS (being in the 99th percentile of 5 numbers is not as informative as being in the 99th percentile of 20,000 numbers)
  • The percentile value is sensitive to decisions about < vs <= (see #39)

Consider filtering these out before proceeding to interpret %ile outcomes (see the sketch below).
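A minimal sketch of the issue (not the project's actual count_below, just the generic calculation):

```r
# Percentile rank of an observed value within the summary statistics of the
# FS draws; `strict` toggles the < vs <= choice discussed in #39.
percentile_rank <- function(obs, fs_values, strict = FALSE) {
  if (strict) 100 * mean(fs_values < obs) else 100 * mean(fs_values <= obs)
}

# With only 5 elements, percentiles can only take values in steps of 20,
# and a tie moves the answer by a full step depending on < vs <=:
small_fs <- c(1, 2, 2, 3, 4)
percentile_rank(2, small_fs, strict = TRUE)   # 20
percentile_rank(2, small_fs, strict = FALSE)  # 60
```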

Cleanup steps

  • fix tests
  • incorporate new functions
  • incorporate manips
  • fix cache locking
  • use resources more efficiently
  • tidy reports

Size of fs

I think there's a way to figure out the size of the feasible set from S and N and the p-table.
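If the FS is the set of partitions of N into exactly S positive parts, its size follows the standard partition recurrence P(n, k) = P(n - 1, k - 1) + P(n - k, k). A hedged sketch (assuming that's what the p-table tabulates; realistic S and N will overflow doubles, so a real implementation presumably works on logs or big integers):

```r
# Number of partitions of n into exactly s positive parts, via
# P(n, k) = P(n - 1, k - 1) + P(n - k, k), with P(0, 0) = 1.
fs_size <- function(n, s) {
  p <- matrix(0, nrow = n + 1, ncol = s + 1)  # p[i + 1, k + 1] holds P(i, k)
  p[1, 1] <- 1
  for (i in 1:n) {
    for (k in 1:s) {
      rest <- if (i - k >= 0) p[i - k + 1, k + 1] else 0
      p[i + 1, k + 1] <- p[i, k] + rest
    }
  }
  p[n + 1, s + 1]
}

fs_size(10, 3)  # 8: there are 8 ways to write 10 as a sum of 3 positive parts
```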

Setting seeds

Added a function to allow passing seeds to sample_fs_wrapper. Down the line and in CATS, this should (a) help with reproducibility and (b) make it possible to delete the actual FS draws from the drake cache and still recreate the exact results. This matters because the actual FS draws make the cache massive, and will introduce a whole new level of size issues when moving to TS (another jump in scale).
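Roughly the pattern (sample_fs_wrapper's real signature may differ; draw_fs here is a hypothetical stand-in for the underlying sampler):

```r
# Record the seed with each batch of draws so the draws themselves are
# disposable: delete them from the cache, rerun with the same seed, and
# get back exactly the same FS samples.
sample_fs_seeded <- function(s, n, n_samples, seed) {
  set.seed(seed)
  replicate(n_samples, draw_fs(s, n), simplify = FALSE)
}
```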

Increase nb samples

I ran into issues with 10k samples before (#13), but I think I need to find a way to draw many more samples. I increased from 2500 to 10000 samples for MACDB and saw some subtle but important changes in the qualitative outcomes. Unfortunately I moved too fast and restarted with 30k samples before I copied the cache, so I'll update in a bit with specifics on what drove the change. At the macro scale, what happens is that increased sampling eventually captures more of the extremes of the FS, so the observed vector - even if it is weird - is no longer outside the range of variation of the samples, and no longer gets a percentile value of exactly 0 or 100. This reduces the absorbance of the absorbing boundary at 0 and 100 and increases our ability to detect variation in the extent of the deviation.

The high proportion of 0 and 100 values also probably makes it difficult to tell how much S, N, and characteristics of the feasible set affect percentile values, both visually and statistically. So this is positive in the long run.
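A toy illustration of the boundary effect (not project code): take an observed statistic at the true 99.9th percentile and see how often it lands outside the sampled range at different sample sizes.

```r
set.seed(1)
obs <- qnorm(0.999)  # a genuinely extreme observed value
for (n in c(2500, 10000, 30000)) {
  # proportion of runs where obs exceeds every sample, i.e. gets pinned at 100
  pinned <- mean(replicate(200, obs > max(rnorm(n))))
  cat(n, "samples: P(percentile == 100) ~", pinned, "\n")
}
# At 2500 samples obs gets pinned at 100 fairly often (~8% of runs);
# by 10k+ samples essentially never.
```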

Resampling

Rarefaction is already done.

Jackknife: as I currently understand it, resample a subset of the individuals in the observed sample and treat this as its own sample. Then run the analytical pipeline. Use numerous resamplings per observed community. The question is whether the subsamples also deviate....

This is from a relatively quick run-through of resampling methods; be sure the rationale is right. (Does subsampling something we think is error-prone really help?)

Note that subsampling that results in a very small community may cause power to break down, which is expected. This should only happen for a predictable group of borderline-small communities, though.

This may be computationally intensive, because each observed SAD will propagate into n resampled SADs, each of which will need sampling.
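A hedged sketch of the subsampling step (assuming the SAD is stored as a vector of species abundances):

```r
# Resample a fraction of the individuals without replacement and rebuild
# the SAD for the subsample; species with no retained individuals drop out.
subsample_sad <- function(sad, frac = 0.5) {
  individuals <- rep(seq_along(sad), times = sad)  # one entry per individual
  kept <- sample(individuals, size = floor(frac * length(individuals)))
  sort(as.integer(table(kept)), decreasing = TRUE)
}

# Many resamplings per observed community, each run through the pipeline:
# resamples <- replicate(100, subsample_sad(observed_sad), simplify = FALSE)
```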

Questions for HY/SKME

General

Submission details

  • Formatting: font? page numbers?
  • Corresponding author marked correctly?
  • Address for correspondence?
  • COI statement on front page?
  • Conflicted/suggested reviewers? I have a blank here. Both in terms of, do we provide this list or is it obvious to the editor, and if it's important that we provide it, who's on each list? Would assume Ethan, Locey, Kitzes, Newman can't review, but all are in acknowledgements/heavily cited in the text.

Data

  • I want to be sure we attribute things properly in the data accessibility statement, text, and refs. The data for the analysis are (mostly) the same as Baldridge (2016), the exception being that I accessed Misc Abund from figshare. That version may match the data used in Baldridge (2016); I re-accessed it because it's a more heterogeneous dataset than the others and I wanted to be able to see clearly what I was getting (instead of the processed versions). For the others, I downloaded the .csvs from weecology/sad-comparison. The data on GitHub are identical to what is on Zenodo. Details at #52
  • Is it OK that I'm getting the data from GitHub and figshare, rather than Zenodo? (GH data exactly matches Zenodo, for what it's worth)
  • Is it OK if we then re-archive on Zenodo?

Hao

  • Have I got the right affiliation? What's your preferred email?
  • Is there anything you'd like to put in the acknowledgements?
  • Check description of sampling algorithm now in Methods
  • Should algorithm vignette be a supplement?

More to sign off on / close the loop:

  • Minor language changes in comments (flagged)
  • Updates to figures & figure legends

Morgan

  • Signing off on language changes in comments (flagged)

Some figure changes:

  • Removed fig. 1 (distribution of datasets in sXn space) and put it in the supplement
  • Changes to fig. 4 to add breadth index - too dense now?

Try kde rather than percentile

  • Construct KDE of summary stat from fs
  • Scale to sum to 1
  • Density at obs value

This gives a better sense of how weird it is than just the percentile (rank).
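Roughly (assuming fs_stats is the vector of the summary statistic over FS draws):

```r
# Low density at the observed value = weird, regardless of which side of
# the distribution it falls on.
kde_weirdness <- function(obs, fs_stats) {
  kde <- density(fs_stats)              # KDE of the FS summary statistic
  p <- kde$y / sum(kde$y)               # scale so the grid sums to 1
  p[which.min(abs(kde$x - obs))]        # (approximate) mass at the obs value
}
```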

Intersections of datasets

If different datasets give different results, it would be nice to compare the subsets that have comparable S and N. I expect S and N, at large scale, to constrain the expected/possible outcomes. I'm actually not even sure how much the datasets intersect in terms of S and N, so visualizing that is the first step.

Reproducible example with subset

It takes a long time & a lot of compute to run the full analysis on all the datasets.

Create a smaller example that can run in a reasonable amount of time on a laptop.

This requires a smaller p-table, or having the p-table available for download somewhere.

Possibly do it with just one dataset? MCDB might be the most tractable (smallest p-table, relatively few communities).

Dataset details

From http://www.esapubs.org/archive/ecol/E093/155/appendix-A.pdf (Appendix A of White et al 2012, quoting here from the pdf):

Birds (BBS)

We used community data collected in 2009 from 2,769 routes of the Breeding Bird Survey (Sauer et al. 2011) (BBS) and 1,999 counts of the Christmas Bird Count (National Audubon Society 2002) (CBC). BBS routes are 40 km long, each consisting of 50 three minute point counts, 800 m apart, sampled annually in June. The 2009 data include a total of 1,819,908 individuals representing 347 species of diurnal landbirds, with individual routes averaging 657.2 ± 323.9 individuals (range = 53 – 3,504) and 46 ± 13 species (range = 10 – 81).

Trees (FIA and Gentry)

We also used two existing data sets of species abundance for communities of trees, the USFS Forest Inventory Analysis program (U.S. Department of Agriculture 2010, Woudenberg et al. 2010, http://apps.fs.fed.us/fiadb-downloads/datamart.html) (FIA), and the Alwyn H. Gentry Forest Transect Data Set (Phillips and Miller 2002) (referred to herein as ‘Gentry’). We used one year of data (calendar year of sampling varies among plots) for FIA phase 2 plots that were sampled using the standardized methodology implemented in 1999 [see the FIA National Core Field Guide for more information (U.S. Department of Agriculture 2010)]. The standard plot consists of four 24.0-foot (7.32 m) radius subplots, on which trees 5.0 inches (12.7 cm) and greater in diameter are identified to species and measured. We used species abundance data for 10,355 FIA plots, encompassing a total of 380,581 individuals and 236 species, with plots averaging 36.8 ± 12.5 individuals (range = 11 – 118) and 11.4 ± 1.6 species (range = 10 – 21). The Gentry data were collected from 226 0.1-hectare sites throughout the world, with each site sampled once over the course of a 22 year period. At each site, all plants with stem diameters of 2.5 cm or greater were identified and measured along ten 2 × 50 m transects. It should be noted that, due to difficulties in the taxonomy and identification of tropical trees, some species in the Gentry dataset are identified only as morpho-species (unique within sites), and species’ names vary among sites due to both typographical errors and synonymy problems. Since we only analyzed data within a site, these issues do not affect our analyses, but they artificially elevate the count of species in the Gentry dataset and therefore the number of species included in the overall analysis. We used data from 222 sites, including 67,405 individuals representing approximately 7,300 species, with individual sites averaging 303.6 ± 115.6 individuals (range = 44 – 779) and approximately 91.4 ± 59.7 species (range = 10 – 250).

MCDB

We used species abundance data for the 103 sites included in the Mammal Community Database (Thibault et al. 2011) (MCDB) that included at least 10 species (mean richness = 13.6 ± 4.0 species; range = 10 – 34). These data have been compiled from various published sources and therefore have not been collected using a standardized protocol across sites. As a result, these data are species-level abundances of small mammals that were captured using various levels of sampling effort spread across varying amounts of time and space. Despite these limitations, these data represent, to our knowledge, the largest collection of mammal community data ever analyzed in one study. The data encompass a total of 380 mammal species and 94,866 individuals (mean abundance per site = 921.0 ± 1,434.9; range = 19 – 10,085).

Misc Abund

From https://github.com/weecology/sad-comparison/blob/master/chapter1.md (Baldridge's dissertation, ch. 1):

Data on Actinopterygii, Reptilia, Coleoptera, Arachnida, and Amphibia, were mined from literature by Baldridge and are publicly available [@Baldridge2013] (see Table 1 for details). These data were collected at the level of the site defined in the publication if raw data were available at that scale, and at the scale of the entire study otherwise. Time scales of collection for this data depended on the study but was typically one or a few years.

Off by one errors in percentile?

If you have found 100% of the feasible set, can you have a percentile value of 0 or 100?

Ugh, percentiles are actually kind of sketchily defined at the edges. With <=, you can get 100 (when value = max(values)), but I think not 0. There is potential for odd edge behavior (quick demo below).
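Quick toy demo of why:

```r
vals <- c(3, 1, 4, 1, 5)
100 * mean(vals <= max(vals))  # 100: the maximum always maps to 100 under <=
100 * mean(vals <= min(vals))  # 40 here: the minimum counts itself, so never 0
100 * mean(vals <  min(vals))  # 0 only happens with strict <
```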

Edits checklist

Main text

  • Description of sampling algo in methods
  • Format in text refs
  • Word counts
  • Nb of refs
  • Nb of figs

Refs

  • Citing data as from Baldridge (2016)?
  • Format refs section
  • Reference for Misc Abund database (formatting)

Supplements

  • Edits to S1
  • Edits to S2
  • Check supplements for refs
  • Supplemental figure legends
  • Supplemental table captions
  • Add sampling algorithm as supplement (will change numbering)

Submission peripherals

February revisions

Analysis

  • Jackknife resampling #56
  • Shannon diversity #57
  • Proportion off (~effect size) #58
  • Log ratio?
  • Distinguishability
  • Number of rare species #59

Writing

  • Ratio of N to S effects on percentile
  • (Speculation/conversation on:) how distinguishability assumptions might affect results
  • Expand discussion of complex-systems-in-ecology, deeper roots of SAD work
  • Expand ecological intuition in both intro & discussion of results - may be bolstered by number of rare species analysis

Details

  • Change first sentence
  • Define hollow curve x and y axes
  • Update S and N ranges
  • Link to codebase! 🙃
  • Add details of sampling algorithm to main text

Edge behavior of percentile (esp for small FS)

In clean-and-tests I changed count_below (the percentile-finding function) to count the number of values <=, not <, a focal value. This was just to match what we'd put in the manuscript.

This mostly didn't change things qualitatively - EXCEPT that it tends to increase %ile values.

How to import SADs

  • by dataset
  • by dataset x site
  • should statevars be separate?
  • many small objects vs. few large objects
  • generally inclined towards many small; will this break drake?

Memory management

10k samples for BBS create data frames that are too large for R. I have reduced to 2500 to see how that goes.

The same problem will definitely arise for FIA, to a greater degree: there are 2773 BBS sites and >100k FIA sites. I think I will need to break FIA into at least 10 and possibly more smaller chunks (sketch below), which raises the question of whether it even makes sense to include all of them.
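The chunking idea, sketched (fia_site_ids and run_pipeline are hypothetical placeholders):

```r
# Process FIA in batches small enough to hold in memory, writing each
# chunk's results to disk rather than accumulating one giant data frame.
chunk_size <- 10000
chunks <- split(fia_site_ids, ceiling(seq_along(fia_site_ids) / chunk_size))
for (i in seq_along(chunks)) {
  res <- run_pipeline(chunks[[i]])
  saveRDS(res, sprintf("results/fia_chunk_%02d.rds", i))
}
```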

Filter MCDB to single year

Many of the datasets in MCDB as imported from Baldridge/White are aggregated over many years of data.

Looking at the MCDB itself, some of this is happening because the data as it is reported in MCDB is already combined over the duration of the study. There are about 70 timeseries in the MCDB, with observations at multiple time points.

With a little bit of fiddling I can run the cats analysis on the MCDB timeseries data. But this is starting to wander into the broader scaling/aggregation question: does the SAD generated by pooling data across all years give the same, or systematically different, results as the SADs generated from each year separately? I think this is an important question, and one to address more systematically than as a one-off for this particular dataset.

Parallelization

You can't run everything in parallel at the scale of the site because this breaks SLURM (too many requests sent too close together).

  1. Bunch by dataset?
  2. Is there a way to tell drake/batchtools to break things into chunks? (See the sketch below.)
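One possibility for (2): let drake dispatch a fixed pool of persistent clustermq workers, so SLURM sees a handful of long-lived jobs instead of one request per site. The template file name here is a placeholder for the cluster's actual config:

```r
library(drake)
options(clustermq.scheduler = "slurm",
        clustermq.template = "slurm_clustermq.tmpl")
# 8 persistent workers share all the targets, instead of one job per target
make(plan, parallelism = "clustermq", jobs = 8)
```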

Tracking dataset run status

  • BBS: thru cdf 2/20/20
  • FIA: thru cdf 2/20/20
  • Gentry: thru cdf 2/20/20
  • MACDB: thru cdf 2/20/20
  • MCDB: thru cdf 2/20/20
  • Misc: thru cdf 2/20/20
  • Portal manips: thru cdf 2/20/20
  • Portal plants: thru cdf 2/20/20

updated 2/20/20 1pm

Prior to submitting

  • incorporate Hao's edits
  • change results table to be broken out by dataset
  • ks test for fia-not fia
  • figure & figure legend edits
  • add algorithm supplement & renumber
  • add links to Zenodo download to filtering vignette
  • merge to master, figure out why master fails but clean-and-tests does not
  • mention biorxiv in cover letter
  • HY, SKME sign off on final version
  • biorxiv upload
  • submit

Some decisions

  • Not doing badges
  • Will do transparent review
  • Not recommending reviewers

After submitting notes

  • Keep preprint updated with new versions
  • When ready to archive on Zenodo, include language noting that these data are provided for reproducibility, but that for other uses it is best to access them from the original sources.

Originally posted by @diazrenata in #49 (comment)

Datasets

Biological datasets

Open datasets from White et al 2012:

  • Gentry
  • FIA
  • MCDB

From Baldridge:

  • Misc. Abundance Database

Out of curiosity:

  • Condit (BCI and other tropical forest plots)
  • NEON? (In progress)

Non-biological datasets

  • Linux distros?
  • Data from non-bio paper

Nsamples vs results

Especially with FIA, we're seeing considerably fewer than 10,000 unique samples. Visualize whether there's any correspondence between results and nsamples (sketch below).
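Sketch of that plot (column names in all_di are assumptions):

```r
library(ggplot2)
# percentile vs. number of unique FS samples actually obtained, by dataset
ggplot(all_di, aes(x = n_unique_samples, y = percentile)) +
  geom_point(alpha = 0.2) +
  geom_smooth() +
  facet_wrap(~ dataset)
```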

Sampling plan

  • Mammals: mammals p-table
  • FIA, Gentry, BBS: wide p-table
  • Misc Abund: needs filtering to <50k individuals; use long p-table

Moonshots include >50k individuals and the BCI data. Not sure whether it is possible to get a p-table for those communities.

Questions

  1. Are empirical species abundance distributions unusual (in skewness or Simpson evenness) compared to their feasible set?
  2. Is our conclusion sensitive to the addition of cryptic rare species?

Analysis updates

  • Use <= for skewness and < for evenness: this is conservative about designating ties as extreme
  • At least filter out FS for which there are fewer than 20 unique values for the summary statistics; with fewer than 20 unique values it is impossible to be in the 95th/5th percentile (filter sketched below).
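The filter, sketched (treating fs_stat_list as a list of per-community summary-statistic vectors is an assumption):

```r
# Keep only communities whose FS produced at least 20 unique values of the
# summary statistic; below that, the 95th/5th percentile is unreachable.
keep <- vapply(fs_stat_list, function(x) length(unique(x)) >= 20, logical(1))
fs_stat_list <- fs_stat_list[keep]
```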

Do S and N predict percentile?

S and N shape the size of the FS and its characteristics. Do they have a detectable knock-on relationship with %ile? See #20 for discussion of beta and zero-one-inflated beta regressions, and the sketch below.

Deeper than that, I think things are way too correlated to tease out. See sadspace.
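For reference, one hedged sketch of what the zero-one-inflated beta regression could look like (BEINF in gamlss handles mass at exactly 0 and 1; the s0/n0 column names are assumptions, and this is one option, not the approach settled in #20):

```r
library(gamlss)  # BEINF: beta distribution inflated at 0 and 1
all_di$p <- all_di$percentile / 100  # rescale percentile to [0, 1]
m <- gamlss(p ~ log(s0) + log(n0), family = BEINF, data = all_di)
summary(m)
```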

Analysis for manuscript

Working on the pipeline/analysis that has wound up in the manuscript.

  • Get datasets: download from weecology repo/figshare
  • Process datasets: filter out small/large communities, communities with N = S, etc. Max N = 40,720; this ceiling applies in Misc Abund. For FIA small, 10,000 samples are drawn from the 66,000 sites with 3 <= S <= 9.
  • Processing datasets: subsample the small FIA communities
  • Singletons (now in supplement)
  • Sample FS for each community
  • Calculate skew and evenness for every FS draw and observed
  • Compare observed values to distributions from FS
  • Calculate 95% ratio for FS
  • Combine results from each dataset into overall data frame (all_di.csv)
  • Render overall figures

The clean-and-tests branch will be for the version of the analysis that matches the manuscript draft.

There are some old/stale reports and analyses that I don't at present know what to do with. I will probably delete them from the version in clean-and-tests, which may eventually morph into the default branch.

I may need to write up an explainer on how the drake pipelines work.

Will also need an interlude on the other ROV approaches I tried (supplement)

Possible correlates of percentile

  • S, N, S/N
  • some measure of spread in the FS distribution - not just range, as range is sensitive to whether you happen on the very flat/very skewed extremes; SD or something similar
  • mean/median of the FS distribution? If the central tendency is super skewed/super flat.
  • dataset

then, manips

Data citation and archiving

Details:

Data

  • I'm not 100% on how to cite the data or how to phrase the data accessibility statement. The data for the analysis are (mostly) the same as Baldridge (2016), the exception being that I accessed Misc Abund from figshare. I downloaded the .csvs from weecology/sad-comparison. The data on GitHub are identical to what is on Zenodo. This crops up in the data accessibility statement and when we discuss the source of the data. Currently:

Data accessibility statement: All data used are available publicly via Zenodo and figshare. Upon publication, all code and data will be archived and made publicly available via Zenodo.

In Methods: does this need to say more explicitly that the data were accessed from the repo for the 2016 paper (except Misc. Abund, which I got from figshare)?

We used a compilation of community abundance data for trees, birds, mammals, and miscellaneous other taxa that has been used in recent macroecological explorations of the SAD (White et al 2012 , Baldridge 2016, Baldridge 2015).

In refs, citing both the Baldridge paper and "Data from" that paper:

Baldridge, E. (2015). Miscellaneous Abundance Database. figshare. https://doi.org/10.6084/m9.figshare.95843.v4
Baldridge, E., Harris, D.J., Xiao, X. & White, E.P. (2016). An extensive comparison of species-abundance distribution models. PeerJ, 4, e2823.
Baldridge, E., Harris, D.J., Xiao, X. & White, E.P. (2016). Data from: An extensive comparison of species-abundance distribution models. Zenodo. https://zenodo.org/record/166725
White, E.P., Thibault, K.M. & Xiao, X. (2012). Characterizing species abundance distributions across taxa and ecosystems using a simple maximum entropy model. Ecology, 93, 1772–1778.

  • Is it OK that I'm getting the data from GitHub and figshare? (GH data exactly matches Zenodo, for what it's worth; it seems excessive and not fully transparent to retrofit the code so it downloads from Zenodo instead of GH at this stage...?)

Workers time out after 1 hour

Possible reasons:

  • HPG resource limits?
  • Clustermq giving up after a couple of milliseconds?
  • Units of time requested?
  • Syntax for changing upper limits?

It would help if the top-level job errored instead of just lurking in limbo.
