
pyobis's Introduction

pyobis


Python client for the OBIS API (https://api.obis.org/).

Source on GitHub at iobis/pyobis

What is it?

pyobis is a Python package that helps users fetch data from the OBIS API, which harvests occurrence records from thousands of datasets and makes them available as a single integrated dataset.

The Ocean Biodiversity Information System (OBIS) is a global open-access data and information clearing-house on marine biodiversity for science, conservation, and sustainable development, maintained by IOOS.

Main Features

Here are just a few of the things pyobis can do:

  • Easy handling of OBIS data, easy fetching without handling the raw API response directly.
  • Built-in functions for occurrence, taxon, node, checklist and dataset endpoints of OBIS API.
  • Provides easy export of data to Pandas DataFrame, and helps researchers focus more on analysis rather than data mining.

For examples of how to use this repo, see the Jupyter notebooks in the /notebooks/ directory. NOTE: GitHub's notebook display does not show interactive plots; open the notebooks in a Jupyter hub (e.g. Colab, Binder) for the full experience.

Installation

Install from PyPI

pip install pyobis

Install from conda-forge

pyobis can be installed from the conda-forge channel with:

conda install pyobis --channel conda-forge

More information here

Install latest development version from GitHub

pip install git+https://github.com/iobis/pyobis.git

Or install an editable development version from GitHub for local development. System prerequisites: python3, conda

# fetch code
git clone [email protected]:iobis/pyobis.git
cd pyobis
# install
python -m pip install -r requirements.txt
python -m pip install -r requirements-dev.txt
python -m pip install -e .
# test your installation
python -m pytest
# test and generate a coverage report
python -m pytest -rxs --cov=pyobis tests

Documentation

The official documentation is hosted on GitHub Pages https://iobis.github.io/pyobis.

Library API

pyobis is split up into modules for each of the groups of API methods.

  • checklist - Checklist. Generate a checklist of species under a taxa, IUCN Red List, or most recently added species.
  • dataset - Dataset. Get metadata of datasets (including datasetid, owner, institution, number of records, etc) for a queried spatiotemporal region or taxa.
  • nodes - Nodes. Get records or activities for an OBIS node.
  • occurrences - Occurrence. Fetch occurrence records, geopoints, lookup by scientificname, and extensions (e.g. DNADerivedData, MeasurementOrFacts, etc.)
  • taxa - Taxonomic names. Get taxon records with taxonid or scientificname, and scientific name annotations by the WoRMS team.

You can import the entire library, or each module individually as needed.

Usage Guide

For a detailed usage guide with information about inputs, outputs, and module functions, please read the Usage Guide.

Sample analysis

Sample analyses and visualizations of data fetched with pyobis are available as Jupyter notebooks in the /notebooks/ directory. For the full experience of the interactive plots (e.g. geoplots), open the notebooks in a Jupyter hub (e.g. Google Colab, Binder, or a local installation).

Meta

Further Reading

  • In case you face data quality issues, please look at OBIS QC repo
  • For issues with the package itself, feel free to open an issue here!

pyobis's People

Contributors

7yl4r, ayushanand18, dependabot[bot], ocefpaf, pieterprovoost, pre-commit-ci[bot], pushkar2112, sckott


pyobis's Issues

move tests alongside code modules?

This is a matter of preference, but I prefer to have the test code in the same directories as the module code. Would it be okay with everyone to move the tests alongside the code instead of in a /test/ directory? Here is a demonstration of the options:

  1. tests separate
    .
    ├── /pyobis/
    │   ├── occurrences
    │   │   └── occurrences.py
    │   ├── obisutils.py
    └── tests
        ├── test_occurrence.py
        └── test_obisutils.py
    
  2. tests alongside
    .
    ├── /pyobis/
        ├── occurrences
        │   ├── occurrences.py
        │   └── occurrences_test.py
        ├── obisutils.py
        └── obisutils_test.py
    

I prefer (2). What do y'all think?

notebook about sampling bias?

A potentially helpful notebook to include here is a notebook which illustrates the concept of sampling bias. This notebook will present the issues and leave solving them to be dealt with elsewhere.

Things that could be shown:

  • spatial bias:
    • example: species "X" is pelagic, but records are mostly near-shore
    • include a link to biodiversity index notebooks (work-in-progress here)
  • taxa distribution bias:
    • show biases towards certain taxa vs expected taxa distribution
  • temporal bias

replace setup.py with setup.cfg|pyproject.toml

  • Clarify listing of GBIF instead of OBIS in package description ref (ref)
    • add link to pyGBIF?
    • use a notebook leveraging both pyGBIF & pyOBIS?

Might be worth reviewing the rest of setup.py while you are in there.

obisissues.py is unused?

obisissues is a module which is essentially just a lookup table for error codes coming from the API. I don't think we are using this, and I am also uncertain if this module is up-to-date. We could remove the module, but we may also want to retain it for future use. Some text noting this at the top of the file might be sufficient, or we could keep it in a separate branch. I am not sure what is the best practice for retaining legacy code.

which vertical datum do we use in OBIS?

Overview

Some endpoints in OBIS ask for an additional z coordinate over (x,y) for a location. I hope z means vertical datum here. So, which particular datum do we follow - WGS 1984 or NAD 1983?

One of the endpoints that asks for (x, y, z) is occurrences/tile/{x}/{y}/{z}

A sample query:

https://api.obis.org/v3/occurrence/tile/1/1/1?scientificname=mola%20mola

showing suggestions for possible results for a search query using undocumented API request

Overview

Many functions in the pyobis package require the user to provide exact matches for scientificname, taxonid, or datasetid. While some researchers are interested in one particular record, many who are just starting their project may not know the exact id or scientific name to query for.

Notes

OBIS Mapper utilizes some endpoints to do the exact thing - suggest id or scientific name based on a query term.

Questions

Should we include a function in the occurrences module for searching a species by keyword, say search_keyword(term="query-term-you'd-like-to-search-for")?

And a separate search_keyword(term="query-term-you'd-like-to-search-for") function in the dataset module for searching datasets by name?

Add `.get_mapper_url()` function?

Sometimes it's nice to make sure you're getting the response you would expect, directly from the OBIS mapper - https://mapper.obis.org/

I would like to see an additional response where we can simply return the url for the mapper based on our search configuration. You can use the same url configuration we use for the occurrences.search function but with the following structure

"https://mapper.obis.org/?" + occurrence_search_config.replace("occurrence?", "")
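A minimal sketch of how such a helper might work, assuming the mapper accepts the same query string as the occurrence API endpoint (the function name `get_mapper_url` is hypothetical, not part of pyobis):

```python
from urllib.parse import urlsplit

def get_mapper_url(api_url: str) -> str:
    """Translate an occurrence API URL into an OBIS mapper URL by
    reusing its query string. Hypothetical helper, shown as a sketch."""
    query = urlsplit(api_url).query
    return "https://mapper.obis.org/?" + query

print(get_mapper_url("https://api.obis.org/v3/occurrence?taxonid=1363"))
# → https://mapper.obis.org/?taxonid=1363
```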

master -> main

while we're doing all this work, mind renaming master to main?

incomplete fetch of checklist from `/checklist` endpoint

Overview

When we try to generate an OBIS checklist using the pyobis.checklist -> list() function, it returns a checklist of at most 10 records. This is because the OBIS API by default returns only 10 records per query.
To fetch subsequent records, we need to pass a skip parameter with the number of records already fetched.

For example, let us look at this request

https://api.obis.org/v3/checklist?size=10&skip=10&taxonid=1363

It fetches the next 10 records after the first 10 have been fetched.

To reproduce

Run

from pyobis.checklist import ChecklistQuery
ChecklistQuery().list(taxonid=1363)["total"] # total records
len(ChecklistQuery().list(taxonid=1363)["results"]) # total fetched

Note: This is not something mentioned in the documentation, and I got this insight thanks to OBIS Mapper.
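The paging arithmetic a fix would need can be sketched as a pure helper: given the `total` reported by the API and a page `size`, yield the `skip` value and page size for each request (the real fetch loop would pass each `skip` to the /checklist endpoint; `page_offsets` is a hypothetical name):

```python
def page_offsets(total: int, size: int = 10):
    """Yield (skip, size) pairs covering `total` records in pages of `size`."""
    for skip in range(0, total, size):
        yield skip, min(size, total - skip)

# e.g. 25 total records in pages of 10:
print(list(page_offsets(25, 10)))  # → [(0, 10), (10, 10), (20, 5)]
```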

add notebooks to docs

Is there a way to include the jupyter notebooks in the docs? A rendered output would be nice, but a list of links & short descriptions would be good too.

Add capability to query for multiple taxons using `taxonids` at the same time

Overview

At present, all functions that query by taxonid require it to be a single valid integer (as listed in WoRMS). It is not possible to cherry-pick multiple taxonids in a single query.

It would be a powerful feature to support arrays of taxonids, as we already do for scientificname. It would save both time and effort.

[Bug] pyobis.occurrences: all records not being returned by search()

Overview

The occurrences.search() function returns only the first set of records (limited to 10 rows) from OBIS. But, we need to fetch all occurrence records from the API.

Steps to reproduce this bug

  • Install pyobis
  • Run
    res = occurrence.search() # we can also query for any particular scientificname but we get the same issue
    df = pandas.DataFrame(res["results"])
    print(len(df)) # it will show only 10, however the total number of records can be determined by executing res["total"]
    

Add `.get_search_url()` function to all modules

For each of the modules, we should have the ability to return the API url that is configured with our identified search.

For example,

res = occ.search(taxonid="1363", startdepth=0, enddepth=30, geometry="POLYGON ((-180 -30, 180 -30, 180 30, -180 30, -180 -30))")

Then,

res.get_search_url()

would return

https://api.obis.org/v3/occurrence?taxonid=1363&startdepth=0&enddepth=30&geometry=POLYGON%20%28%28-180%20-30%2C%20180%20-30%2C%20180%2030%2C%20-180%2030%2C%20-180%20-30%29%29

This gives the user a way to investigate how their query might be not providing the expected responses.
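One way such a helper could build the URL is with `urllib.parse.urlencode`, which handles the percent-encoding shown above (a sketch; `build_search_url` is a hypothetical name, not current pyobis API):

```python
from urllib.parse import urlencode

def build_search_url(**params) -> str:
    """Return the occurrence endpoint plus percent-encoded query
    parameters, roughly what a .get_search_url() could expose."""
    return "https://api.obis.org/v3/occurrence?" + urlencode(params)

url = build_search_url(taxonid="1363", startdepth=0, enddepth=30)
print(url)
# → https://api.obis.org/v3/occurrence?taxonid=1363&startdepth=0&enddepth=30
```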

Related to #69

occ.search(...,mof=True) & occ.search(...,mof=False) are inconsistent

The output of occ.search is structured differently if mof is True.


occ.search(scientificname="Egregia menziesii", hasextensions="MeasurementOrFacts")


vs

occ.search(scientificname="Egregia menziesii", mof=True, hasextensions="MeasurementOrFacts")



When mof=False, the total number of results is included in a column, but when mof=True it is as if res = res["results"] has been performed.


The output should not change in this way when mof is True vs False.

occurrence record weirdnesses

While working on this notebook to analyze migration patterns of species over time and plot them over a world map, I found that many records contained unexpected eventDate (screenshot attached). Should we drop these fields?

[Bug] pyobis.occurrences: MoF throws error for species with no occurrence records

Overview

When accessing MoF records for species that do not have corresponding occurrence records, we encounter an unexpected error.

Steps to reproduce

  • Install pyobis
  • Run
import pandas as pd
from pyobis import occurrences as occ
res = occ.search(taxonid=124705, mof=True, hasextensions='MeasurementOrFact')
pd.concat(res)

Error

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    res = occ.search(taxonid=124705, mof=True, hasextensions='MeasurementOrFact')
  File "/mnt/d/programming/pyobis/pyobis/occurrences/occurrences.py", line 96, in search
    a = pd.merge(pd.DataFrame(out["results"]),mofNormalized,on='id',how='inner')
  File "/home/ayush/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/home/ayush/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 700, in __init__
    ) = self._get_merge_keys()
  File "/home/ayush/.local/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1097, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "/home/ayush/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 1840, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'id'
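The traceback suggests the merge runs even when the occurrence search returned no results, so there is no `id` column to merge on. A sketch of a guard (pure Python; the helper name and the early-return behavior are assumptions, not the current implementation):

```python
def merge_mof(results):
    """Guard sketch: bail out with an empty result when the occurrence
    search returned nothing, instead of attempting pd.merge on 'id'."""
    if not results:  # e.g. out["results"] == []
        return []
    # ...the real code would pd.merge(pd.DataFrame(results), mofNormalized, on="id")
    return results

print(merge_mof([]))  # → []
```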

dataset notebook to show off a data provider's data

grab datasets from list of ids:

datasets_of_interest = [
    "UUID_1",
    "UUID_2",
]
  • spatial distribution
  • temporal distribution
  • taxonomic distribution
  • what MoFs are included?
  • what makes this dataset unique?

Add capability to download Full OBIS Export of presence records?

OBIS provides a full export of presence records [1] in both csv [2] and parquet [3]. Would it be feasible to add something to pyobis that goes out and efficiently grabs the whole export?

I'm sure folks smarter than me know of ways to parallelize the call or do something fancy with the available services to download in a reasonable time frame.

[1] - https://obis.org/data/access/
[2] - https://obis-datasets.ams3.digitaloceanspaces.com/exports/obis_20220710.csv.zip
[3] - https://obis-datasets.ams3.digitaloceanspaces.com/exports/obis_20220710.parquet

Could also be a waste of time for this package.

Document units and direction for use of depth search parameters

If I wanted to do an occurrence search with a filter by depth, the docs don't currently tell me:

  1. What units should I use?
  2. What direction are we using?

For example, if I wanted to search between 10 and 50 meters depth, should I use:
start_depth = -10
end_depth = -50

Please update the docs to describe this.

To share some information, the Climate and Forecast conventions is what made me think of this nuance. The guidance is to specifically include units and direction with vertical coordinates. I don't think we need to handle conversions in the code, but we should be documenting what the expectation is.
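If the docs end up specifying meters, positive downward (the Darwin Core minimum/maximumDepthInMeters convention — an assumption here, to be verified against the OBIS docs), a small validation helper could encode that expectation:

```python
def depth_params(start_m: float, end_m: float) -> dict:
    """Sketch assuming depths are in meters, positive downward.
    Hypothetical helper; verify the convention against the OBIS docs."""
    if start_m < 0 or end_m < start_m:
        raise ValueError("expected 0 <= startdepth <= enddepth, in meters")
    return {"startdepth": start_m, "enddepth": end_m}

print(depth_params(10, 50))  # → {'startdepth': 10, 'enddepth': 50}
```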

data caching?

a nice feature to have would be to allow retention of cached data so that data does not need to be redownloaded.
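A minimal sketch of what this could look like: an on-disk JSON cache keyed by a hash of the request URL, so repeated queries skip the download (the cache location, helper name, and `fetch` callable are all assumptions for illustration):

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp(prefix="pyobis_cache_")  # hypothetical location

def cached_fetch(url: str, fetch):
    """Return the cached response for `url` if present; otherwise call
    `fetch(url)` (any callable returning JSON-serializable data) and
    store the result on disk keyed by a hash of the URL."""
    key = hashlib.sha256(url.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = fetch(url)
    with open(path, "w") as f:
        json.dump(data, f)
    return data

# demo with a fake fetcher: the second call is served from the cache
calls = []
def fake_fetch(url):
    calls.append(url)
    return {"total": 1}

cached_fetch("https://api.obis.org/v3/checklist?taxonid=1363", fake_fetch)
result = cached_fetch("https://api.obis.org/v3/checklist?taxonid=1363", fake_fetch)
print(len(calls), result)  # fetched only once
```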

Add capability to process DNADerivedData to `occurrences` module

Overview

The pyobis package in its current state does not support fetching DNADerivedData using the occurrences module.

I believe there can be two ways to implement it:

  • add it as a parameter in the search function like MeasurementOrFacts
  • create a new function for it say dna( self, **args, **kwargs ) and enable necessary processing and utility methods since search function is already heavy with so many features.

occurrence search using datasetid

I'm trying to rework an old notebook to use this package.

I have this piece of code:

from pyobis.occurrences import OccQuery

datasetid = '2ae2a2bd-8412-405b-8a9f-b71adc41d4c5'

occ = OccQuery()
dataset = occ.search(datasetid = datasetid)

but it didn't work - it took too long.

Here are the expected details from my other process:
OBIS Dataset page: https://obis.org/dataset/2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
API request: https://api.obis.org/v3/occurrence?datasetid=2ae2a2bd-8412-405b-8a9f-b71adc41d4c5
Found 698900 records.

create visualization module (or separate package)

  • pyobis.visualize
    • create separate repo?
      • keeps pyobis dependencies light
      • makes viz library potentially usable w/ GBIF & other DwC data

text from proposal:

Developing an umbrella module inside the pyobis package that can be used directly
by the users to visualize data through existing modules. For eg., we can build a
top-level module, say pyobis.visualize which ingests JSON data from say
pyobis.occurrence and visualizes it based on some parameters like scientificname,
basisOfRecords, etc.

curated column subsets?

Working with occurrence data I feel overwhelmed by the number of columns.

Would it be a good idea to allow for easy subsetting of columns?

Here is some code in which I have done that:

# === select a subset of columns for improved table legibility :
shortened_df = df[[
    # time-related columns:
    "date_year", "endDayOfYear", "verbatimEventDate", "startDayOfYear", "dateIdentified", "eventTime", "date_mid",
    "eventDate", "month", "date_start", "date_end", "day", "year",
    # row identifier columns:
    "recordNumber", "ownerInstitutionCode", "parentEventID", "identifiedBy", "eventID", "collectionID", "organismID", "recordedBy", "datasetID", "category", "datasetName",
    "institutionCode", "occurrenceID", "collectionCode", "dataset_id", "id", "modified", "catalogNumber", "fieldNumber",
    "institutionID",
    # additional remarks:
    "occurrenceRemarks", "taxonRemarks", "eventRemarks", "samplingProtocol",
    "typeStatus", "preparations", "establishmentMeans", "dynamicProperties", "type",
    # occurrence specifics:
    "individualCount", "occurrenceStatus", "originalScientificName", "absence",
    "terrestrial", "basisOfRecord", "dropped",
]]

We could include arrays of curated column lists so that a user could do something like:

df = df[pyobis.column_subset.taxonomic + pyobis.column_subset.temporal]

To drop everything but the curated list of "taxonomic" and "temporal" columns. Thoughts?
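The idea above could be as simple as a module-level mapping of curated column groups (group names and membership here are illustrative, taken from the listing above; `column_subset` is a hypothetical pyobis attribute):

```python
# curated column groups a hypothetical pyobis.column_subset could expose
column_subset = {
    "temporal": ["eventDate", "date_start", "date_end", "year", "month", "day"],
    "identifiers": ["occurrenceID", "datasetID", "institutionCode", "id"],
}

wanted = column_subset["temporal"] + column_subset["identifiers"]
# df = df[[c for c in wanted if c in df.columns]]  # tolerate absent columns
print(wanted)
```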

Undocumented `countryid` parameter for `occurrences/grid` endpoint

Overview

While studying the network logs of the OBIS mapper tool, I noticed an undocumented input parameter countryid being used to gather occurrence records published by a particular country via the /occurrence/grid endpoint of the OBIS API. This parameter is absent from the Swagger docs.

Screenshot of the network log


Response

It returns all the occurrence records for a particular country from its countryid code.

Questions

Should we add this parameter to pyobis?

Maintenance and pyobis future

Hi @sckott, are you still interested in maintaining pyobis? There is a growing community of Pythonistas that are using OBIS and would like to update and curate this package. Would you be interested in adding more committers and/or moving it to another organization?

create 1st demo jupyter notebook

  • what are "use cases" for each notebook?
    • outline different use-cases for pyobis
  • search for relevant papers using OBIS|GBIF data & reproduce results in jupyter
  • see how species are moving downwards due to climate change.
    • search MoF for depth measurements & report on % that have it.
    • relevant paper
  • see how species are shifting north/south due to climate change.
  • can potentially publish notebooks to JOSS
  • we should also set up these notebooks to work with mybinder.org
  • should the notebooks go in this repo or should we place them elsewhere?
    • create a "notebooks" directory in the root dir
  • add this notebook

ongoing sunfish analysis in dev on this branch

Use CI push to PyPI

  • Use CI push to PyPI
    • discussed w/ Filipe : used to do it w/ Travis CI, need to switch to gh-actions

Expand README to include more information

Overview

Since pyobis is an open-source project, there is a strong need to expand the README to include information related to, but not limited to:

  • Data quality issue reporting: if a user finds issues with the data fetched through pyobis, their natural instinct will be to open an issue here. The README must therefore point to the right place to report data issues.
  • Data-related queries: if a user has questions about the data returned by pyobis, say about the fields returned with occurrence records, it would be wise to include links to the OBIS main page, contact page, or a forum.
  • Update short examples contained in the README
  • Add link to usage_guide.ipynb and to /notebooks/ folder for sample use case-analysis
  • Other changes: Changing the links that have gone obsolete, adding Further Reading material, adding link to CONTRIBUTING.md guide.

Please suggest if there are any other things that need to be added to the README.

/docs/ directory update (or remove?)

/docs/ looks to contain sphinx documentation. I am not familiar with building documentation like this. Can we add the docs build to a github action? Should we update to another tooling?

[Bug] pyobis.occurrences: MoF results DataFrame shows duplicate columns.

Overview

When we query MoF records using occurrences.search(mof=True, hasextensions='MeasurementOrFact'), the result contains duplicate columns -> ['scientificName', 'eventID']. This issue is a direct result of the implementation at line 95.

Steps to reproduce

  • Install pyobis
  • Run
    df = pd.concat(occurrences.search(mof=True, hasextensions='MeasurementOrFact'))
    df.columns # shows duplicate columns with the same content, named scientificName_x and scientificName_y; similar for eventID
    

GSoC meeting planning & agenda

In June we will begin a set of weekly meetings to work on a project supported by Google Summer of Code. This issue is here to discuss plans for the first of these meetings. Expected attendees include me (Tylar Murray), Matt (@MathewBiddle), and Ayush (@ayushanand18). Let's discuss scheduling on the slack #gsoc_pyobis channel.

agenda

  • introductions
    • what you are hoping to get out of involvement in this project
  • discuss TDD as described in the workflow 1-pager
  • begin working on strategic plan as described in the management 1-pager
  • set up todo items before next meeting

improve MoF accessibility

GSoC objective

  • Make MoF response more efficient
    • make it return a list of dataframes so it's more usable
    • currently the response is nested: each occurrence has a nested measurement-or-fact table
    • some trick to use pandas to pull out the nested part (json_normalize?)
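The flattening step can be sketched in pure Python: one flat row per (occurrence, measurement) pair, similar to what `pandas.json_normalize` with a `record_path` would produce (the `mof` key name and helper are assumptions about the response shape):

```python
def flatten_mof(records):
    """Pull nested MeasurementOrFact rows out of each occurrence record,
    merging the parent occurrence fields into every measurement row."""
    rows = []
    for rec in records:
        base = {k: v for k, v in rec.items() if k != "mof"}
        for m in rec.get("mof") or [{}]:
            rows.append({**base, **m})
    return rows

occ = [{"id": "a", "mof": [{"measurementType": "depth", "measurementValue": 5}]}]
print(flatten_mof(occ))
# → [{'id': 'a', 'measurementType': 'depth', 'measurementValue': 5}]
```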
