pangaea-data-publisher / pangaeapy Goto Github PK

View Code? Open in Web Editor NEW

28.0 28.0 17.0 3.17 MB

PANGAEA Python Client

Home Page: https://www.pangaea.de/

License: GNU General Public License v3.0

Python 96.08% XSLT 3.85% Shell 0.07%

jupyter-notebook pangaea python researchdata

pangaeapy's People

Contributors

Stargazers

Watchers

Forkers

huberrob aarthi02 coronath3beer iris-hinrichs marcoalbaett qaysabouhousien mopyn pgierz scottclowe ernestug dokempf fspreck-indiscale mmuesly

pangaeapy's Issues

HDF5 libraries are required on OS

Since HDF5 libraries are required for the package (see pip error below), it would make sense to add this information to the README. I think it would also be reasonable to add some information on how to install these libraries. For Ubuntu, I found the command pretty quick (sudo apt-get install libhdf5-serial-dev netcdf-bin libnetcdf-dev), but maybe it's more complicated for other OS.

Cheers

Defaulting to user installation because normal site-packages is not writeable
Collecting pangaeapy
  Downloading pangaeapy-1.0.13-py3-none-any.whl (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 5.4 MB/s eta 0:00:00
Requirement already satisfied: lxml>=4.9.1 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (4.9.2)
Requirement already satisfied: requests>=2.26.0 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (2.28.2)
Requirement already satisfied: pandas>=1.3.5 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (1.5.3)
Requirement already satisfied: numpy>=1.21.0 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (1.23.5)
Collecting netcdf4~=1.5.6
  Downloading netCDF4-1.5.8.tar.gz (767 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.0/767.0 kB 17.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Package hdf5 was not found in the pkg-config search path.
      Perhaps you should add the directory containing `hdf5.pc'
      to the PKG_CONFIG_PATH environment variable
      No package 'hdf5' found
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-rj5wrjo3/netcdf4_55387b07616a42cd9e4ebf580deadf8c/setup.py", line 419, in <module>
          _populate_hdf5_info(dirstosearch, inc_dirs, libs, lib_dirs)
        File "/tmp/pip-install-rj5wrjo3/netcdf4_55387b07616a42cd9e4ebf580deadf8c/setup.py", line 360, in _populate_hdf5_info
          raise ValueError('did not find HDF5 headers')
      ValueError: did not find HDF5 headers
      reading from setup.cfg...
      
          HDF5_DIR environment variable not set, checking some standard locations ..
      checking /home/ruemmler/include ...
      hdf5 headers not found in /home/ruemmler/include
      checking /usr/local/include ...
      hdf5 headers not found in /usr/local/include
      checking /sw/include ...
      hdf5 headers not found in /sw/include
      checking /opt/include ...
      hdf5 headers not found in /opt/include
      checking /opt/local/include ...
      hdf5 headers not found in /opt/local/include
      checking /usr/include ...
      hdf5 headers not found in /usr/include
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.```

stop loading metadata when id is not valid or dataset does not exist

Remove incorrect terms "parent" and "child" from API

This is confusing to users as we "officially" do not use terms like "parent" or "child" for datasets. With preparing our community workshop, we noticed that those terms are used in API and error messages.

As not all collections are parents and for external users a "parent" means nothing and is not a term used outside of PANGAEA, please replace all usage of "parent" by "collection". "Child datasets" are fine from the persepctive of a dataset with an "In:" citation, but from perspective of pangaeapy it makes no difference. So "normal" datasets should be "dataset" not "child".

So all error messages should be like "cannot download datas as it is collection".

Enable caching

We should be able to cache pangaeapy objects for e.g. offline processing. Therefore, data as well as metadata should be stored e.g. as pickle in a given cache directory..

Move setup.py etc. to top level

See, e.g., #37. The current structure of this repo doesn't allow for a simple

pip install git+https://github.com/pangaea-data-publisher/pangaeapy

since the top level directory of this package is not pip installable. This could be fixed by moving the setup.py, LICENSE.md, etc., to the root of this repo.

no information on data access rights via query

Dear pangaeapy team,

the pangaeapy query does not show whether access rights are needed for a particular data set. The download of a data set with access rights gives an empty object, which is confusing if the information on access rights is not given.
Example:
https://www.pangaea.de/advanced/search.php?q=B18_2012 shows that access rights are needed, if I am logged into PANGAEA.
| "

<div class="citation"><a href="https://doi.pangaea.de/10.1594/PANGAEA.931734\" target="_self" class="dataset-link">Freitag, J; Kipfstuhl, S; Weißbach, S et al. (2021): Density profile of the B18_2012 firn core<table class="result" summary="Dataset reference and size" cellspacing="0" cellpadding="0"><td class="title">Size:<td class="content">5535 data points<div class="datasetid"><a href="https://doi.pangaea.de/10.1594/PANGAEA.931734\" target="_self" class="dataset-link doi-link">https://doi.org/10.1594/PANGAEA.931734 – <a href="https://doi.pangaea.de/10.1594/PANGAEA.931734?format=textfile\" target="_self" class="dataset-download-link">Download <span class="glyphicon glyphicon-lock" title="access rights needed"/> – <span title="The score is a measurement of relevancy (see TF-IDF algorithm). The value is not absolute and only suitable to compare hits from the same result set.">Score: 10.09

This info is not shown if I am not logged in. It would be great to get this information always, independent of being logged in.

since pangaeapy does not know whether I am logged in:
test1=pan.PanQuery("B18_2012")
test1.result[0]['html']
'

Freitag, J; Kipfstuhl, S; Weißbach, S et al. (2021): Density profile of the B18_2012 firn core

Size:	5535 data points

https://doi.org/10.1594/PANGAEA.931734 – Score: 10.09

Please adapt query with information on access rights.

best wishes,
Kathrin

Coverage (Lat/Lon) columns are not exported in DWCA

See: 10.1594/PANGAEA.853606 which has a 'Latitude', 'Longitude', 'Event', 'Date/Time' etc. but dwca_exporter does not include this in the matrix

`exporter` folder is missing `init.py`

This leads to issues with installs directly from GitHub, e.g. pip install git+https://github.com/pangaea-data-publisher/pangaeapy, where subfolder exporter is ignored during install. Importing the module with from pangaeapy.pandataset import PanDataSet leads to a ModuleNotFoundError: No module named 'pangaeapy.exporter.

pip install pangaeapy does not have this issue and import completes successfully.

Further info:

pip install git+https://github.com/pangaea-data-publisher/pangaeapy installs version 0.0.3
pip install pangaeapy installs version 0.0.4

Enable authorisation

Pangaea users which are logged in can retrieve a authorisation token via the user page at https://www.pangaea.de/user/

pangaeapy should include a auth_token in the panDataset constructor.

Connection error

Hi, I'm getting a connection error when trying to access any dataset in PANGAEA. This has been happening for the last couple of weeks and I wasn't sure if it was my internet connection. Seeing how this connection error persists, I wanted to raise this issue here.

I am getting the following error just by executing these two lines

from pangaeapy.src.pandataset import PanDataSet
PanDataSet('10.1594/PANGAEA.57849')

Traceback (most recent call last):
  File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "C:\Users\Python\venv\lib\site-packages\urllib3\connection.py", line 362, in connect
    self.sock = ssl_wrap_socket(
  File "C:\Users\Python\venv\lib\site-packages\urllib3\util\ssl_.py", line 384, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Users\Python\Python39\lib\ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\Users\Python\Python39\lib\ssl.py", line 1040, in _create
    self.do_handshake()
  File "C:\Users\Python\Python39\lib\ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Python\venv\lib\site-packages\requests\adapters.py", line 439, in send
    resp = conn.urlopen(
  File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "C:\Users\Python\venv\lib\site-packages\urllib3\util\retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='doi.pangaea.de', port=443): Max retries exceeded with url: /10.1594/PANGAEA.57849?format=citation_text&charset=UTF-8 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\Python\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-4-dcc1b675b247>", line 1, in <module>
    PanDataSet('10.1594/PANGAEA.57849')
  File "C:\Users\Python\venv\lib\site-packages\pangaeapy\src\pandataset.py", line 404, in __init__
    self.setMetadata()
  File "C:\Users\Python\venv\lib\site-packages\pangaeapy\src\pandataset.py", line 746, in setMetadata
    self._setCitation()
  File "C:\Users\Python\venv\lib\site-packages\pangaeapy\src\pandataset.py", line 737, in _setCitation
    r=requests.get(citationURL)
  File "C:\Users\Python\venv\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Python\venv\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Python\venv\lib\site-packages\requests\sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Python\venv\lib\site-packages\requests\sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Python\venv\lib\site-packages\requests\adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='doi.pangaea.de', port=443): Max retries exceeded with url: /10.1594/PANGAEA.57849?format=citation_text&charset=UTF-8 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)')))

Rewrite parts of data download to simplify especially error handling: Use content negotiation

I reviewed the current code to downaload datasets and figured out that it does a lot of if/then/else and parses XML files to figure out if datasets are freely accessible, or if they are parents. This is done for the reason because it needs to guess datatype. It also looks like the code wants to not hammer PANGAEA with useless requests. But this is no problem at all. The response that a content type is not supported is cheap and the http status code comes fast. I'd do the data download like that:

Use the plain DOI as URL for the download (both works: "https://doi.pangaea.de" but also "https://doi.org" and other variants). Previously with a doi.org URL no download was possible as "format=" parameter gets lost.
Set Authentication: Bearer token if available (see below). No need to check if it is login protected before. Just send always if available.
Set Accept: text/tab-separated-values as header. This enables content negotiation. As this header does NOT look like a plain stupid browser, the PANGAEA code will switch to real "REST mode" and for example respond with correct headers instead of redirects to login page if the dataset is password protected and the credentials do not match. So you don't need to do best guesses when you were redirected and you get the HTML login page. A real REST client will get correct status code to know: "unauthorized".

This should always return the normal tab-separated-values format. No need to cross-check content-type in response or anything like that. The download code should only look at status code:

200 (OK): All went well, you can be sure it is a tab-delimited matrix in PANGAEA format
401 (Unauthorized): Dataset is protected and access rights do not match the bearer token or there's no bearer token at all (e.g., wrong user) or no bearer token at all. This can be reported as error message.
406 (Not acceptable): The format in Accept header cannot be fulfilled. This happens when it is a parent or another type of collection or a static URL dataset with a different media type
404 (Not Found): Dataset does not exist
429 (Too many requests): Wait a few seconds
5xx: some server error, especially 503 means "PANGAEA is down". Report this as hard error to user.

If you want to get the native PANGAEA metadata in panmd format, please DO NOT use oai-pmh (I think pangaear dors this not sure about pangaeapy). The native PANGAEA metadata can and should also be retrieved by content negotiation: Accept: application/vnd.pangaea.metadata+xml

And finally to get the citation string use: Accept: text/x-bibliography (the default charset is always UTF-8). The current code does not parse any charset parameter on the content-type.

Add possibility to choose xarray as data container instead of pandas dataframes

It would be nice to allow the user to choose between pandas dataframes and xarrays as data container. Pandas should be the default container..

The setData method could then look like this:

def setData(self, addEventColumns=True, container='pandas'):

to choose xarray the method should be called like setData(container='xarray') ..

Empty HTTP when retrieving many (> 300) datasets with `include_data=False`

Apparently, when retrieving many of datasets in a short time with the include_data=False argument (i.e., only retrieving metadata), the Received too many requests (for data) error (429)...waiting 30s treatment doesn't work correctly. Only an empty HTTP response is returned instead, resulting in an empty PanDataSet object:

from pangaeapy.pandataset import PanDataSet
dois = [
    "'doi:10.1594/PANGAEA.880129'",
   ...
]  # list of ≈ 300 different (valid, of course) Pangaea DOIs
datasets = [PanDataSet(doi) for doi in dois]  # Will result in a 429 and a retry correctly after ≈ 200 dois

versus

datasets_metadata = [PanDataSet(doi, include_data=False) for doi in dois]

Where the latter will result in No HTTP response object received for: 924666 - outputs for the last ≈ 100 dois, and subsequently the last ≈ 100 elements of datasets_no_metadata being empty.

bug in definition of path to file pan_mappings.json

Operating system: Linux

When using method .to_netcdf following error occurs:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/pangaeapy/src\mappings\pan_mappings.json'

Obviously, the path is not interpreted the right way.

Preserve old occurenceIDs which used to start with 1 instead of 0

From a message from GBIF:

As far as I can see, the Pangea datasets that were paused had their catalogue numbers changed and some occurrenceIDs added.

For example, in this dataset: https://www.gbif.org/dataset/8d91c862-f762-11e1-a439-00145eb45e9a, the occurrence https://www.gbif.org/occurrence/332989554 (with the catalogue number 7108134_2) seem to correspond to the record with the catalogue number 7108134_1 and the occurrenceID 1_4 in the updated Darwin Core Archive: https://digir.pangaea.de/dwca/get?doi=10.1594/PANGAEA.231616.

Another example would be this dataset: https://www.gbif.org/dataset/52c460e5-c322-4bc7-9eea-b4beee2b8920.

fix projects attributes

the now return XML obects but should return text

Typo Δδ13C

Hi Robert,
there is a typo in the last Parameter on the list https://github.com/pangaea-data-publisher/pangaeapy/blob/master/pangaeapy/src/data/parameter_mapping.csv
δ13C (PANGAEA ID 795), should be Δδ13C
Best,
Astrid

Add support for references like `Related to`, `Further details`

As far as I understand, there is currently now way to access the Related to information in a PanDataset object. E.g. https://doi.pangaea.de/10.1594/PANGAEA.921340 in panmd format, they are stored in references like

<md:reference dataciteRelType="References" group="210" id="ref105185" relationType="Related to" relationTypeId="12" typeId="ref">
<md:author id="ref105185.author67841">
<md:lastName>Pisternick</md:lastName>
<md:firstName>Timo</md:firstName>
<md:orcid>0000-0001-5396-9075</md:orcid>
</md:author>
<md:author id="ref105185.author71358">
<md:lastName>Lilkendey</md:lastName>
<md:firstName>Julian</md:firstName>
<md:eMail>[email protected]</md:eMail>
<md:orcid>0000-0003-3165-1079</md:orcid>
</md:author>
<md:author id="ref105185.author77017">
<md:lastName>Audit-Manna</md:lastName>
<md:firstName>Anishta</md:firstName>
</md:author>
<md:author id="ref105185.author67842">
<md:lastName>Dumur Neelayya</md:lastName>
<md:firstName>Danishta</md:firstName>
</md:author>
<md:author id="ref105185.author67843">
<md:lastName>Neehaul</md:lastName>
<md:firstName>Yashvin</md:firstName>
</md:author>
<md:author id="ref105185.author42600">
<md:lastName>Moosdorf</md:lastName>
<md:firstName>Nils</md:firstName>
<md:eMail>[email protected]</md:eMail>
<md:orcid>0000-0003-2822-8261</md:orcid>
</md:author>
<md:year>2020</md:year>
<md:title>Submarine groundwater springs are characterized by distinct fish communities</md:title>
<md:source id="ref105185.journal16702" relatedTermIds="33943" type="journal">Marine Ecology</md:source>
<md:URI>https://doi.org/10.1111/maec.12610</md:URI>
</md:reference>
<md:reference dataciteRelType="IsDocumentedBy" group="640" id="ref105174" relationType="Further details" relationTypeId="17" typeId="ref1" typeName="peer reviewed">
<md:author id="ref105174.author77000">
<md:lastName>Bellwood</md:lastName>
<md:firstName>D R</md:firstName>
</md:author>
<md:author id="ref105174.author74889">
<md:lastName>Hughes</md:lastName>
<md:firstName>Terry P</md:firstName>
<md:orcid>0000-0002-5257-5063</md:orcid>
</md:author>
<md:author id="ref105174.author77001">
<md:lastName>Folke</md:lastName>
<md:firstName>C</md:firstName>
<md:orcid>0000-0002-4050-3281</md:orcid>
</md:author>
<md:author id="ref105174.author77002">
<md:lastName>Nyström</md:lastName>
<md:firstName>M</md:firstName>
<md:orcid>0000-0003-3608-2426</md:orcid>
</md:author>
<md:year>2004</md:year>
<md:title>Confronting the coral reef crisis</md:title>
<md:source id="ref105174.journal11658" relatedTermIds="34013" type="journal">Nature</md:source>
<md:volume>429(6994)</md:volume>
<md:URI>https://doi.org/10.1038/nature02691</md:URI>
<md:pages>827-833</md:pages>
</md:reference>
<md:reference dataciteRelType="IsDocumentedBy" group="640" id="ref105137" relationType="Further details" relationTypeId="17" typeId="ref2" typeName="report">
<md:author id="ref105137.author76961">
<md:lastName>Cappo</md:lastName>
<md:firstName>M</md:firstName>
</md:author>
<md:author id="ref105137.author76962">
<md:lastName>Harvey</md:lastName>
<md:firstName>Euan</md:firstName>
</md:author>
<md:author id="ref105137.author76963">
<md:lastName>Malcolm</md:lastName>
<md:firstName>H</md:firstName>
</md:author>
<md:author id="ref105137.author76964">
<md:lastName>Speare</md:lastName>
<md:firstName>P</md:firstName>
</md:author>
<md:year>2003</md:year>
<md:title>Potential of video techniques to monitor diversity, abundance and size of fish in studies of marine protected areas</md:title>
<md:source>In: Beumer, J.P.; Grant, A.; and Smith, D.C. (eds.),  Aquatic protected areas. What works best and how do we know? World Congress on Aquatic Protected Areas proceedings, Cairns, Australia, August 2002</md:source>
<md:pages>455-464</md:pages>
</md:reference>
<md:reference dataciteRelType="IsDocumentedBy" group="640" id="ref105175" relationType="Further details" relationTypeId="17" typeId="ref1" typeName="peer reviewed">
<md:author id="ref105175.author77003">
<md:lastName>Cole</md:lastName>
<md:firstName>Andrew J</md:firstName>
</md:author>
<md:author id="ref105175.author35958">
<md:lastName>Pratchett</md:lastName>
<md:firstName>M S</md:firstName>
<md:orcid>0000-0002-1862-8459</md:orcid>
</md:author>
<md:author id="ref105175.author70763">
<md:lastName>Jones</md:lastName>
<md:firstName>Geoffrey P</md:firstName>
<md:orcid>0000-0002-6244-1245</md:orcid>
</md:author>
<md:year>2008</md:year>
<md:title>Diversity and functional importance of coral-feeding fishes on tropical coral reefs</md:title>
<md:source id="ref105175.journal5630" type="journal">Fish and Fisheries</md:source>
<md:volume>9(3)</md:volume>
<md:URI>https://doi.org/10.1111/j.1467-2979.2008.00290.x</md:URI>
<md:pages>286-307</md:pages>
</md:reference>

that could be parsed; maybe just for relationType, URI, and title in a first iteration.

Are there any plans of implementing this? Would you be open to a suggestion via a pull request?

Don't cache datasets if matrix is empty

PanDataSet.getEventsAsFrame() not exporting Method/Device field

When exporting the events of a pangaea dataset, the Method/Device field is returning empty (None).
For instance:

from pangaeapy.pandataset import PanDataSet
ds = PanDataSet('10.1594/PANGAEA.71218')
ds.getEventsAsFrame().device

Returns the events of that dataset, with the column "device" as None's.

fix licence attributes and add to doku

the now return XML obects but should return text
property is missing in doku

Nested datasets do not load

Dear Pangaea Colleagues,

I've been experimenting with the PyPangaea interface to directly load data without needing to use the web interface. Very nice I've been running into some difficulty, however. It seems some datasets are "nested", and contain several child datasets inside. An example is the MARGO Sea Surface Temperature reconstruction for the Last Glacial Maximum (10.1594/PANGAEA.760904). If I try to access the child datasets, I am unable to actually get a pandas table back. Any hints?

Here is what I have tried:

>>> import pangaeapy as panpy
>>> pds = panpy.PanDataSet("10.1594/PANGAEA.760904")
>>> pds.title
'Various paleoclimate proxy parameters compiled within the MARGO project'
>>> pds.citation
'Barrows, Timothy T; Chen, Min-Te; de Vernal, Anne; Eynaud, Frédérique; Hillaire-Marcel, Claude; Kiefer, Thorsten; Lee, Kyung Eun; Marret, Fabienne; Henry, Maryse; Juggins, Stephen; Londeix, Laurent; Mangin, Sylvie; Matthiessen, Jens; Radi, Taoufik; Rochon, André; Solignac, Sandrine; Turon, Jean-Louis; Waelbroeck, Claire; Weinelt, Mara (2011): Various paleoclimate proxy parameters compiled within the MARGO project. PANGAEA, https://doi.org/10.1594/PANGAEA.760904'
>>> pds.children
['doi:10.1594/PANGAEA.227326','doi:10.1594/PANGAEA.127383','doi:10.1594/PANGAEA.227620','doi:10.1594/PANGAEA.227319','doi:10.1594/PANGAEA.227318','doi:10.1594/PANGAEA.103069','doi:10.1594/PANGAEA.103070']

So far, so good. Let's try to grab the last "sub-dataset":

>>> # Weinelt, M (2004): Compilation of global planktic foraminifera LGM SST data.
>>> pds_planktic_foraminifera = panpy.PanDataSet(pds.children[-1])
>>> # Above does not work, it is the same as:
>>> pds_planktic_foraminifera = panpy.PanDataSet("doi:10.1594/PANGAEA.103070")
>>> # Maybe remove doi, yet this does not work either:
>>> pds_planktic_foraminifera = panpy.PanDataSet("10.1594/PANGAEA.103070")
>>> # Maybe just the dataset ID
>>> pds_planktic_foraminifera = panpy.PanDataSet("103070")
>>> pds_planktic_foraminifera.title # <-- empty
>>> pds_planktic_foraminifera.abstract # <-- empty
>>> pds_planktic_foraminifera.data # <-- empty

If tracebacks are helpful, I can post those as well, but I guess this is something easy enough to reproduce locally without copy/pasting walls of error text...

Thanks,
Paul

NetCDF HDF installation errors

Install fails with message 'HDF5_DIR environment variable not set, checking some standard locations ..'

seems to be an issue with python 3.7:
Unidata/netcdf4-python#901

Preserve QC flags

PANGAEA QC flags should be preserved an e.g. added as additional columns which are named like the flagged column but with a '_qc' exctension. This column shall contain the original flags..

PANGAEA contact details in dwca exporter

Hi,
in the metadata file please add "first name", "last name" and "position" of PANGAEA staff to both the "Resource contacts" and "Metadata providers".
Thanks a lot!
A

actionable DOIs are not recognized in constructor

This works:
PanDataSet('10.1594/PANGAEA.902215')

but this doesn't work:
PanDataSet('https://doi.pangaea.de/10.1594/PANGAEA.902215')

Darwin Core export: wrong observation type for living benthos or infauna

https://issues.pangaea.de/browse/PGI-4773
Darwin Core export gives wrong observation type in case living benthos or infauna is reported in sediment

https://www.gbif.org/dataset/81b78f9a-f762-11e1-a439-00145eb45e9a

downloaded data has no unit

Dear pangaeapy team,

I was wondering whether it is on purpose, that the parameter of a downoaded file don't have a unit.
Example:
ds = PanDataSet('10.1594/PANGAEA.890530')
ds.data.columns
Index(['Depth', 'Age', 'DBD', 'TC', 'TN', 'Corg dens', 'Event', 'Latitude',
'Longitude', 'Elevation'],
dtype='object')
If I download the same data set via the PANGAEA webpage, short name plus unit is provided.

best wishes,
Kathrin

Allow more recent version of Python Requests

I'm having version conflicts when installing Pangeapy together with another dependency of my projects which requires requests[socks] 2.28.1 whereas Pangeapy requires requests~=2.26.0. Is there any reason not to upgrade the dependency?

PANGAEA full citatation in dwca exporter

Not to forget that there needs to be a full citation incl. DOI of the original PANGAEA dataset somewhere in the metadata of exported data.

Attribute 'citation' is None

Attribute 'citation' of 'pangaeapy.pandataset.PanDataSet' is 'None'.
This is what I did:
import pangaeapy as pd
doi_num = 946390
pds = pd.PanDataSet(doi_num)
pds.citation

I tested two different data sets to make sure that the behaviour is independent of data set.
A workaround for me is to execute this before accessing the attribute 'citation':
pds._setCitation()

Suggestion: the call to the method '_setCitation' should be made centrally in the class 'pangaeapy.pandataset.PanDataSet' where it is appropriate.

Add a changelog

This package is missing a changelog telling users what parts of the API were added, changed, removed, deprecated, fixed...

A good idea would be to follow https://keepachangelog.com/en/1.1.0/ and use semantic versioning.

Multiple Events

By now, missing lat,lon, event columns are only added to the data matrix in ase only one event is represented in the dataset. Multiple events have to be supported also

Reading Comment information from pangaea dataset

I have been looking for a way to read the information given in "Comment" for pangaea datasets (example: https://doi.pangaea.de/10.1594/PANGAEA.800288), but as far as I can see this is not yet possible.

Is this a feature you are planning on implementing?

Resolve attribute error preventing PanDataSet from accessing data

Dear Pangaea team,
having generated a query using PanQuery, I have run into issues accessing numerous data sets using PanDataSet. I am avoiding issue 27 by checking that the data sets I access are of type 'child'.
However, problems persist, with the most common error returned being:
AttributeError: 'NoneType' object has no attribute 'text'
Below, I provide a reproducible example. In this query, 43 out of 212 return an error, 155 succeed (the remaining 14 are of type 'parent' and thus aren't tested). Of the 43 failures to return the dataset, only two errors relate to access restrictions, all other are attribute errors.

from pangaeapy import panquery as pq
from pangaeapy import PanDataSet as pdset
bb = [-11, 48, -10.5, 48.5]
my_query = pq.PanQuery("method:CTD/Rosette", bbox = bb, limit = 500)
my_query.totalcount

Reality check, this should return 212 results:

successes = 0
errors = 0
for i in range(my_query.totalcount):
  if my_query.result[i]["type"] == 'child':
    # Checks if the data set is a parent data set, in which case we skip it,  
    # only "child" data sets containing actual data matter:
    uid = int(my_query.result[i]["URI"].split('.')[-1])
    try:
      pdset(uid)
      successes += 1
    except Exception as e: 
      print(uid)
      print(e) 
      errors += 1

My results are:
errors
43
successes
155

Add "Supplement to" information explicitly

It is currently only available implicitly in the citation attribute. See my suggestion in #48.

Add keywords to PanDataset

It would be good to have the keywords (actual, not technical or auto-generated) in the dataset object, too. This could be a simple list of keyword strings, but could also be a list of dictionaries. E.g., for https://doi.pangaea.de/10.1594/PANGAEA.957810, it could be either

["Benguela Upwelling System", "elemental stoichiometry", "marine carbon cycle", "Sea surface partial pressure of CO2"]

[
  {
    "name": "Benguela Upwelling System",
    "id":  "keywords.term72123",
    "type": "fromDatabase"
  },
  ...
]

The former is a bit easier to read while the latter is more easy to extend, so I've a slight tendency toward the list of dictionaries.

@huberrob Which way would you prefer? I'm happy to provide a pull request for the implementation.

add abstract/description property to PanDataSet

No module named 'pangaeapy'

Tryed to install via classic pip and returns the error:
ERROR: Could not install packages due to an OSError: [Errno 2] Arquivo ou diretório inexistente: '/home/victor/anaconda3/envs/pyleo/lib/python3.10/site-packages/numpy-1.24.1.dist-info/METADATA'

In a second attempt I've tried to install via git 'pip install git+https://github.com/pangaea-data-publisher/pangaeapy' and returned the error:
'ERROR: git+https://github.com/pangaea-data-publisher/pangaeapy does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.'

relations for collections missing

Hi Robert,
would it be possible to also get the relations (related literature) for collections (previously parents)?
When I do e.g.
ds = pg.PanDataSet(839065) # Is child dataset
ds.relations
[{'id': 'ref66016',
'title': 'Late Holocene primary productivity and sea surface temperature variations in the northeastern Arabian Sea: Implications for winter monsoon variability',
'uri': 'https://doi.org/10.1002/2013PA002579',
'type': 'Related to'}]

However:
dn = pg.PanDataSet('10.1594/PANGAEA.839067') # Is parent record.
dn.relations
[]

Thanks!
Astrid

Date/Time field not correctly read anymore

I have been reading a dataset (https://doi.org/10.1594/PANGAEA.937536) that includes a Date/Time column. Until recently (at least Aug 30, 2023 but I believe I ran the notebook since), it was correctly read as 2008-04-23T17:15:00 (for the first data point). Now, only the year appears to be read. To reproduce the issue:
pandata = pg.PanDataSet('doi:10.1594/PANGAEA.937536') then print(pandata.data['Date/Time'][0]) (returns 2008-01-01 00:00:00).

Thank you!!