
pangaeapy's People

Contributors

aarthi02, dokempf, egor93, fspreck-indiscale, huberrob, iris-hinrichs, markusstocker, qaysabouhousien, scottclowe, uschindler


pangaeapy's Issues

HDF5 libraries are required on OS

Since HDF5 libraries are required for the package (see pip error below), it would make sense to add this information to the README, along with some guidance on how to install these libraries. For Ubuntu, I found the command quickly (sudo apt-get install libhdf5-serial-dev netcdf-bin libnetcdf-dev), but it may be more involved on other operating systems.

Cheers

Defaulting to user installation because normal site-packages is not writeable
Collecting pangaeapy
  Downloading pangaeapy-1.0.13-py3-none-any.whl (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 5.4 MB/s eta 0:00:00
Requirement already satisfied: lxml>=4.9.1 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (4.9.2)
Requirement already satisfied: requests>=2.26.0 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (2.28.2)
Requirement already satisfied: pandas>=1.3.5 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (1.5.3)
Requirement already satisfied: numpy>=1.21.0 in /home/ruemmler/.local/lib/python3.11/site-packages (from pangaeapy) (1.23.5)
Collecting netcdf4~=1.5.6
  Downloading netCDF4-1.5.8.tar.gz (767 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.0/767.0 kB 17.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Package hdf5 was not found in the pkg-config search path.
      Perhaps you should add the directory containing `hdf5.pc'
      to the PKG_CONFIG_PATH environment variable
      No package 'hdf5' found
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-rj5wrjo3/netcdf4_55387b07616a42cd9e4ebf580deadf8c/setup.py", line 419, in <module>
          _populate_hdf5_info(dirstosearch, inc_dirs, libs, lib_dirs)
        File "/tmp/pip-install-rj5wrjo3/netcdf4_55387b07616a42cd9e4ebf580deadf8c/setup.py", line 360, in _populate_hdf5_info
          raise ValueError('did not find HDF5 headers')
      ValueError: did not find HDF5 headers
      reading from setup.cfg...
      
          HDF5_DIR environment variable not set, checking some standard locations ..
      checking /home/ruemmler/include ...
      hdf5 headers not found in /home/ruemmler/include
      checking /usr/local/include ...
      hdf5 headers not found in /usr/local/include
      checking /sw/include ...
      hdf5 headers not found in /sw/include
      checking /opt/include ...
      hdf5 headers not found in /opt/include
      checking /opt/local/include ...
      hdf5 headers not found in /opt/local/include
      checking /usr/include ...
      hdf5 headers not found in /usr/include
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
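For Ubuntu/Debian, the sequence described above might look like the following sketch (package names as given in the issue; other distributions and operating systems will differ):

```shell
# Install the HDF5/netCDF system libraries first (Ubuntu/Debian),
# so that pip can build the netCDF4 dependency afterwards.
sudo apt-get install libhdf5-serial-dev netcdf-bin netcdf-dev libnetcdf-dev
pip install pangaeapy
```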

Remove incorrect terms "parent" and "child" from API

This is confusing to users, as we "officially" do not use terms like "parent" or "child" for datasets. While preparing our community workshop, we noticed that those terms are used in the API and in error messages.

Not all collections are parents, and for external users "parent" means nothing; it is not a term used outside of PANGAEA. Please replace all usage of "parent" with "collection". "Child datasets" are fine from the perspective of a dataset with an "In:" citation, but from the perspective of pangaeapy it makes no difference, so "normal" datasets should be called "dataset", not "child".

So all error messages should read like "cannot download data as it is a collection".

Enable caching

We should be able to cache pangaeapy objects, e.g. for offline processing. Therefore, data as well as metadata should be stored, e.g. as pickle files, in a given cache directory.
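A minimal sketch of such a pickle cache, assuming hypothetical helper names (`cache_object`, `load_cached`) rather than any existing pangaeapy API:

```python
import os
import pickle


def cache_object(obj, cache_dir, dataset_id):
    """Store a picklable pangaeapy object under its dataset id."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{dataset_id}.pkl")
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path


def load_cached(cache_dir, dataset_id):
    """Load a previously cached object, e.g. for offline processing."""
    path = os.path.join(cache_dir, f"{dataset_id}.pkl")
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Any picklable object (data frame, metadata dict) works with this scheme; a real implementation would also want cache invalidation.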

Move setup.py etc. to top level

See, e.g., #37. The current structure of this repo doesn't allow for a simple

pip install git+https://github.com/pangaea-data-publisher/pangaeapy

since the top level directory of this package is not pip installable. This could be fixed by moving the setup.py, LICENSE.md, etc., to the root of this repo.

no information on data access rights via query

Dear pangaeapy team,

the pangaeapy query does not show whether access rights are needed for a particular data set. The download of a data set with access rights gives an empty object, which is confusing if the information on access rights is not given.
Example:
https://www.pangaea.de/advanced/search.php?q=B18_2012 shows that access rights are needed, if I am logged into PANGAEA.
The result HTML then contains a lock marker:

Freitag, J; Kipfstuhl, S; Weißbach, S et al. (2021): Density profile of the B18_2012 firn core – Size: 5535 data points – https://doi.org/10.1594/PANGAEA.931734 – Download <span class="glyphicon glyphicon-lock" title="access rights needed"/> – Score: 10.09

This info is not shown if I am not logged in, and pangaeapy does not know whether I am logged in:

    test1 = pan.PanQuery("B18_2012")
    test1.result[0]['html']
    # the returned html only contains "Size: 5535 data points", without the lock marker

It would be great to always get this information, independent of being logged in. Please adapt the query result to include information on access rights.

    best wishes,
    Kathrin
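Until the query result carries an explicit field, a client-side stopgap could check the result HTML for the lock markup shown above. This only works when the lock span is actually present in the html field (which, per the issue, depends on the login state); the helper name is hypothetical:

```python
def needs_access_rights(result_html: str) -> bool:
    # Heuristic: the lock icon markup in PANGAEA search-result HTML
    # ('glyphicon-lock', title "access rights needed") marks protected datasets.
    return "glyphicon-lock" in result_html
```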

    `exporter` folder is missing `__init__.py`

This leads to issues with installs directly from GitHub, e.g. pip install git+https://github.com/pangaea-data-publisher/pangaeapy, where the subfolder exporter is ignored during install. Importing the module with from pangaeapy.pandataset import PanDataSet then leads to ModuleNotFoundError: No module named 'pangaeapy.exporter'.

    pip install pangaeapy does not have this issue and import completes successfully.

    Further info:

    • pip install git+https://github.com/pangaea-data-publisher/pangaeapy installs version 0.0.3
    • pip install pangaeapy installs version 0.0.4

    Connection error

    Hi, I'm getting a connection error when trying to access any dataset in PANGAEA. This has been happening for the last couple of weeks and I wasn't sure if it was my internet connection. Seeing how this connection error persists, I wanted to raise this issue here.

    I am getting the following error just by executing these two lines

    from pangaeapy.src.pandataset import PanDataSet
    PanDataSet('10.1594/PANGAEA.57849')
    
    Traceback (most recent call last):
      File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
        httplib_response = self._make_request(
      File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
        self._validate_conn(conn)
      File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 978, in _validate_conn
        conn.connect()
      File "C:\Users\Python\venv\lib\site-packages\urllib3\connection.py", line 362, in connect
        self.sock = ssl_wrap_socket(
      File "C:\Users\Python\venv\lib\site-packages\urllib3\util\ssl_.py", line 384, in ssl_wrap_socket
        return context.wrap_socket(sock, server_hostname=server_hostname)
      File "C:\Users\Python\Python39\lib\ssl.py", line 500, in wrap_socket
        return self.sslsocket_class._create(
      File "C:\Users\Python\Python39\lib\ssl.py", line 1040, in _create
        self.do_handshake()
      File "C:\Users\Python\Python39\lib\ssl.py", line 1309, in do_handshake
        self._sslobj.do_handshake()
    ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "C:\Users\Python\venv\lib\site-packages\requests\adapters.py", line 439, in send
        resp = conn.urlopen(
      File "C:\Users\Python\venv\lib\site-packages\urllib3\connectionpool.py", line 726, in urlopen
        retries = retries.increment(
      File "C:\Users\Python\venv\lib\site-packages\urllib3\util\retry.py", line 439, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='doi.pangaea.de', port=443): Max retries exceeded with url: /10.1594/PANGAEA.57849?format=citation_text&charset=UTF-8 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)')))
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "C:\Users\Python\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3437, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-4-dcc1b675b247>", line 1, in <module>
        PanDataSet('10.1594/PANGAEA.57849')
      File "C:\Users\Python\venv\lib\site-packages\pangaeapy\src\pandataset.py", line 404, in __init__
        self.setMetadata()
      File "C:\Users\Python\venv\lib\site-packages\pangaeapy\src\pandataset.py", line 746, in setMetadata
        self._setCitation()
      File "C:\Users\Python\venv\lib\site-packages\pangaeapy\src\pandataset.py", line 737, in _setCitation
        r=requests.get(citationURL)
      File "C:\Users\Python\venv\lib\site-packages\requests\api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Users\Python\venv\lib\site-packages\requests\api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Users\Python\venv\lib\site-packages\requests\sessions.py", line 530, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Users\Python\venv\lib\site-packages\requests\sessions.py", line 643, in send
        r = adapter.send(request, **kwargs)
      File "C:\Users\Python\venv\lib\site-packages\requests\adapters.py", line 514, in send
        raise SSLError(e, request=request)
    requests.exceptions.SSLError: HTTPSConnectionPool(host='doi.pangaea.de', port=443): Max retries exceeded with url: /10.1594/PANGAEA.57849?format=citation_text&charset=UTF-8 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1123)')))
    

    Rewrite parts of data download to simplify especially error handling: Use content negotiation

I reviewed the current code to download datasets and found that it does a lot of if/then/else and parses XML files to figure out whether datasets are freely accessible, or whether they are parents. This is done because the code needs to guess the data type. It also looks like the code tries not to hammer PANGAEA with useless requests, but that is no problem at all: a response saying a content type is not supported is cheap, and the HTTP status code comes back fast. I'd do the data download like this:

    • Use the plain DOI as the URL for the download (both work: "https://doi.pangaea.de" but also "https://doi.org" and other variants). Previously, with a doi.org URL no download was possible, as the "format=" parameter gets lost.
    • Set an Authorization: Bearer token header if a token is available (see below). No need to check beforehand whether the dataset is login protected; just send it whenever available.
    • Set Accept: text/tab-separated-values as a header. This enables content negotiation. Because this header does NOT look like a plain browser request, the PANGAEA code will switch to real "REST mode" and, for example, respond with correct headers instead of redirecting to the login page when the dataset is password protected and the credentials do not match. So you don't need to guess whether you were redirected to the HTML login page; a real REST client gets the correct status code and knows: "unauthorized".

    This should always return the normal tab-separated-values format. No need to cross-check content-type in response or anything like that. The download code should only look at status code:

    • 200 (OK): All went well, you can be sure it is a tab-delimited matrix in PANGAEA format
    • 401 (Unauthorized): The dataset is protected and the access rights do not match the bearer token (e.g., wrong user), or there is no bearer token at all. This can be reported as an error message.
    • 406 (Not Acceptable): The format in the Accept header cannot be fulfilled. This happens when the dataset is a parent or another type of collection, or a static-URL dataset with a different media type.
    • 404 (Not Found): Dataset does not exist
    • 429 (Too many requests): Wait a few seconds
    • 5xx: some server error; 503 in particular means "PANGAEA is down". Report this as a hard error to the user.

If you want to get the native PANGAEA metadata in panmd format, please DO NOT use OAI-PMH (I think pangaear does this; not sure about pangaeapy). The native PANGAEA metadata can and should also be retrieved by content negotiation: Accept: application/vnd.pangaea.metadata+xml

And finally, to get the citation string, use: Accept: text/x-bibliography (the default charset is always UTF-8; the current code does not parse any charset parameter on the content type).
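The header construction and the status-code table above can be sketched as plain functions; the function names and the returned action strings are illustrative, not pangaeapy internals:

```python
def build_headers(token=None):
    """Headers for the proposed content-negotiation download (sketch)."""
    headers = {"Accept": "text/tab-separated-values"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def interpret_status(code):
    """Map HTTP status codes to the actions listed above."""
    if code == 200:
        return "ok"              # tab-delimited matrix in PANGAEA format
    if code == 401:
        return "unauthorized"    # protected dataset, token missing or wrong
    if code == 406:
        return "not-a-dataset"   # collection/parent or non-tabular media type
    if code == 404:
        return "not-found"
    if code == 429:
        return "retry-later"     # wait a few seconds, then try again
    if 500 <= code < 600:
        return "server-error"    # 503 in particular: PANGAEA is down
    return "unexpected"
```

A single GET on the plain DOI URL with `build_headers(...)` then only needs `interpret_status` on the response, with no XML parsing or redirect guessing.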

    Add possibility to choose xarray as data container instead of pandas dataframes

It would be nice to allow the user to choose between pandas DataFrames and xarray as the data container. pandas should remain the default container.

    The setData method could then look like this:

    def setData(self, addEventColumns=True, container='pandas'):

To choose xarray, the method would then be called like setData(container='xarray').
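Internally, the dispatch on the proposed `container` argument might look like this hypothetical helper, with xarray imported lazily so it stays an optional dependency:

```python
def as_container(df, container="pandas"):
    """Hypothetical dispatch for the proposed `container` argument."""
    if container == "pandas":
        return df  # default: hand back the pandas DataFrame unchanged
    if container == "xarray":
        import xarray as xr  # optional dependency, imported only when needed
        return xr.Dataset.from_dataframe(df)
    raise ValueError(f"unknown container: {container!r}")
```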

Empty HTTP response when retrieving many (> 300) datasets with `include_data=False`

Apparently, when retrieving many datasets in a short time with the include_data=False argument (i.e., only retrieving metadata), the Received too many requests (for data) error (429)...waiting 30s treatment doesn't work correctly. Only an empty HTTP response is returned instead, resulting in an empty PanDataSet object:

    from pangaeapy.pandataset import PanDataSet
    dois = [
        "'doi:10.1594/PANGAEA.880129'",
       ...
    ]  # list of ≈ 300 different (valid, of course) Pangaea DOIs
    datasets = [PanDataSet(doi) for doi in dois]  # Will result in a 429 and a retry correctly after ≈ 200 dois

    versus

    datasets_metadata = [PanDataSet(doi, include_data=False) for doi in dois]

Where the latter will result in No HTTP response object received for: 924666 - outputs for the last ≈ 100 DOIs, and subsequently the last ≈ 100 elements of datasets_metadata being empty.
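A fix would need the metadata path to apply the same backoff as the data path, including the empty-response case. A minimal sketch, where `fetch` is any callable returning a `(status_code, body)` pair (that shape is an assumption for illustration, not the pangaeapy internals):

```python
import time


def fetch_with_retry(fetch, retries=3, wait=30):
    """Retry when the server throttles (429) or returns an empty response."""
    for _ in range(retries):
        status, body = fetch()
        if status == 429 or body is None:
            time.sleep(wait)  # back off before retrying
            continue
        return status, body
    raise RuntimeError("no usable response after retries")
```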

    bug in definition of path to file pan_mappings.json

    Operating system: Linux

When using the method .to_netcdf, the following error occurs:
    FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/pangaeapy/src\mappings\pan_mappings.json'

    Obviously, the path is not interpreted the right way.
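The mixed separators ('.../src\mappings\pan_mappings.json') suggest a hard-coded Windows-style "\" in a string literal. A portable sketch of the path construction (function name hypothetical):

```python
import os


def mappings_path(package_dir):
    # Build the path with os.path.join instead of embedding a backslash
    # separator in a string literal, so it works on Linux and Windows alike.
    return os.path.join(package_dir, "mappings", "pan_mappings.json")
```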

Preserve old occurrenceIDs which used to start with 1 instead of 0

    From a message from GBIF:

    As far as I can see, the Pangea datasets that were paused had their catalogue numbers changed and some occurrenceIDs added.

    For example, in this dataset: https://www.gbif.org/dataset/8d91c862-f762-11e1-a439-00145eb45e9a, the occurrence https://www.gbif.org/occurrence/332989554 (with the catalogue number 7108134_2) seem to correspond to the record with the catalogue number 7108134_1 and the occurrenceID 1_4 in the updated Darwin Core Archive: https://digir.pangaea.de/dwca/get?doi=10.1594/PANGAEA.231616.

    Another example would be this dataset: https://www.gbif.org/dataset/52c460e5-c322-4bc7-9eea-b4beee2b8920.

    Add support for references like `Related to`, `Further details`

As far as I understand, there is currently no way to access the Related to information in a PanDataSet object. E.g. for https://doi.pangaea.de/10.1594/PANGAEA.921340 in panmd format, they are stored in references like

    <md:reference dataciteRelType="References" group="210" id="ref105185" relationType="Related to" relationTypeId="12" typeId="ref">
    <md:author id="ref105185.author67841">
    <md:lastName>Pisternick</md:lastName>
    <md:firstName>Timo</md:firstName>
    <md:orcid>0000-0001-5396-9075</md:orcid>
    </md:author>
    <md:author id="ref105185.author71358">
    <md:lastName>Lilkendey</md:lastName>
    <md:firstName>Julian</md:firstName>
    <md:eMail>[email protected]</md:eMail>
    <md:orcid>0000-0003-3165-1079</md:orcid>
    </md:author>
    <md:author id="ref105185.author77017">
    <md:lastName>Audit-Manna</md:lastName>
    <md:firstName>Anishta</md:firstName>
    </md:author>
    <md:author id="ref105185.author67842">
    <md:lastName>Dumur Neelayya</md:lastName>
    <md:firstName>Danishta</md:firstName>
    </md:author>
    <md:author id="ref105185.author67843">
    <md:lastName>Neehaul</md:lastName>
    <md:firstName>Yashvin</md:firstName>
    </md:author>
    <md:author id="ref105185.author42600">
    <md:lastName>Moosdorf</md:lastName>
    <md:firstName>Nils</md:firstName>
    <md:eMail>[email protected]</md:eMail>
    <md:orcid>0000-0003-2822-8261</md:orcid>
    </md:author>
    <md:year>2020</md:year>
    <md:title>Submarine groundwater springs are characterized by distinct fish communities</md:title>
    <md:source id="ref105185.journal16702" relatedTermIds="33943" type="journal">Marine Ecology</md:source>
    <md:URI>https://doi.org/10.1111/maec.12610</md:URI>
    </md:reference>
    <md:reference dataciteRelType="IsDocumentedBy" group="640" id="ref105174" relationType="Further details" relationTypeId="17" typeId="ref1" typeName="peer reviewed">
    <md:author id="ref105174.author77000">
    <md:lastName>Bellwood</md:lastName>
    <md:firstName>D R</md:firstName>
    </md:author>
    <md:author id="ref105174.author74889">
    <md:lastName>Hughes</md:lastName>
    <md:firstName>Terry P</md:firstName>
    <md:orcid>0000-0002-5257-5063</md:orcid>
    </md:author>
    <md:author id="ref105174.author77001">
    <md:lastName>Folke</md:lastName>
    <md:firstName>C</md:firstName>
    <md:orcid>0000-0002-4050-3281</md:orcid>
    </md:author>
    <md:author id="ref105174.author77002">
    <md:lastName>Nyström</md:lastName>
    <md:firstName>M</md:firstName>
    <md:orcid>0000-0003-3608-2426</md:orcid>
    </md:author>
    <md:year>2004</md:year>
    <md:title>Confronting the coral reef crisis</md:title>
    <md:source id="ref105174.journal11658" relatedTermIds="34013" type="journal">Nature</md:source>
    <md:volume>429(6994)</md:volume>
    <md:URI>https://doi.org/10.1038/nature02691</md:URI>
    <md:pages>827-833</md:pages>
    </md:reference>
    <md:reference dataciteRelType="IsDocumentedBy" group="640" id="ref105137" relationType="Further details" relationTypeId="17" typeId="ref2" typeName="report">
    <md:author id="ref105137.author76961">
    <md:lastName>Cappo</md:lastName>
    <md:firstName>M</md:firstName>
    </md:author>
    <md:author id="ref105137.author76962">
    <md:lastName>Harvey</md:lastName>
    <md:firstName>Euan</md:firstName>
    </md:author>
    <md:author id="ref105137.author76963">
    <md:lastName>Malcolm</md:lastName>
    <md:firstName>H</md:firstName>
    </md:author>
    <md:author id="ref105137.author76964">
    <md:lastName>Speare</md:lastName>
    <md:firstName>P</md:firstName>
    </md:author>
    <md:year>2003</md:year>
    <md:title>Potential of video techniques to monitor diversity, abundance and size of fish in studies of marine protected areas</md:title>
    <md:source>In: Beumer, J.P.; Grant, A.; and Smith, D.C. (eds.),  Aquatic protected areas. What works best and how do we know? World Congress on Aquatic Protected Areas proceedings, Cairns, Australia, August 2002</md:source>
    <md:pages>455-464</md:pages>
    </md:reference>
    <md:reference dataciteRelType="IsDocumentedBy" group="640" id="ref105175" relationType="Further details" relationTypeId="17" typeId="ref1" typeName="peer reviewed">
    <md:author id="ref105175.author77003">
    <md:lastName>Cole</md:lastName>
    <md:firstName>Andrew J</md:firstName>
    </md:author>
    <md:author id="ref105175.author35958">
    <md:lastName>Pratchett</md:lastName>
    <md:firstName>M S</md:firstName>
    <md:orcid>0000-0002-1862-8459</md:orcid>
    </md:author>
    <md:author id="ref105175.author70763">
    <md:lastName>Jones</md:lastName>
    <md:firstName>Geoffrey P</md:firstName>
    <md:orcid>0000-0002-6244-1245</md:orcid>
    </md:author>
    <md:year>2008</md:year>
    <md:title>Diversity and functional importance of coral-feeding fishes on tropical coral reefs</md:title>
    <md:source id="ref105175.journal5630" type="journal">Fish and Fisheries</md:source>
    <md:volume>9(3)</md:volume>
    <md:URI>https://doi.org/10.1111/j.1467-2979.2008.00290.x</md:URI>
    <md:pages>286-307</md:pages>
    </md:reference>

    that could be parsed; maybe just for relationType, URI, and title in a first iteration.

    Are there any plans of implementing this? Would you be open to a suggestion via a pull request?
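The suggested first iteration (relationType, URI, title) could be a few lines with the standard library; the `md` namespace URI below is an assumption based on typical panmd documents, and the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

MD = "{http://www.pangaea.de/MetaData}"  # assumed panmd namespace URI


def parse_references(xml_text):
    """First iteration: pull relationType, title, and URI per md:reference."""
    root = ET.fromstring(xml_text)
    return [
        {
            "relationType": ref.get("relationType"),
            "title": ref.findtext(MD + "title"),
            "URI": ref.findtext(MD + "URI"),  # None for references without a URI
        }
        for ref in root.iter(MD + "reference")
    ]
```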

    PanDataSet.getEventsAsFrame() not exporting Method/Device field

    When exporting the events of a pangaea dataset, the Method/Device field is returning empty (None).
    For instance:

    from pangaeapy.pandataset import PanDataSet
    ds = PanDataSet('10.1594/PANGAEA.71218')
    ds.getEventsAsFrame().device

Returns the events of that dataset, with the column "device" containing only None values.

    Nested datasets do not load

    Dear Pangaea Colleagues,

I've been experimenting with the pangaeapy interface to directly load data without needing to use the web interface. Very nice! I've been running into some difficulty, however. It seems some datasets are "nested" and contain several child datasets inside. An example is the MARGO Sea Surface Temperature reconstruction for the Last Glacial Maximum (10.1594/PANGAEA.760904). If I try to access the child datasets, I am unable to actually get a pandas table back. Any hints?

    Here is what I have tried:

    >>> import pangaeapy as panpy
    >>> pds = panpy.PanDataSet("10.1594/PANGAEA.760904")
    >>> pds.title
    'Various paleoclimate proxy parameters compiled within the MARGO project'
    >>> pds.citation
    'Barrows, Timothy T; Chen, Min-Te; de Vernal, Anne; Eynaud, Frédérique; Hillaire-Marcel, Claude; Kiefer, Thorsten; Lee, Kyung Eun; Marret, Fabienne; Henry, Maryse; Juggins, Stephen; Londeix, Laurent; Mangin, Sylvie; Matthiessen, Jens; Radi, Taoufik; Rochon, André; Solignac, Sandrine; Turon, Jean-Louis; Waelbroeck, Claire; Weinelt, Mara (2011): Various paleoclimate proxy parameters compiled within the MARGO project. PANGAEA, https://doi.org/10.1594/PANGAEA.760904'
    >>> pds.children
    ['doi:10.1594/PANGAEA.227326','doi:10.1594/PANGAEA.127383','doi:10.1594/PANGAEA.227620','doi:10.1594/PANGAEA.227319','doi:10.1594/PANGAEA.227318','doi:10.1594/PANGAEA.103069','doi:10.1594/PANGAEA.103070']

    So far, so good. Let's try to grab the last "sub-dataset":

    >>> # Weinelt, M (2004): Compilation of global planktic foraminifera LGM SST data.
    >>> pds_planktic_foraminifera = panpy.PanDataSet(pds.children[-1])
    >>> # Above does not work, it is the same as:
    >>> pds_planktic_foraminifera = panpy.PanDataSet("doi:10.1594/PANGAEA.103070")
    >>> # Maybe remove doi, yet this does not work either:
    >>> pds_planktic_foraminifera = panpy.PanDataSet("10.1594/PANGAEA.103070")
    >>> # Maybe just the dataset ID
    >>> pds_planktic_foraminifera = panpy.PanDataSet("103070")
    >>> pds_planktic_foraminifera.title # <-- empty
    >>> pds_planktic_foraminifera.abstract # <-- empty
    >>> pds_planktic_foraminifera.data # <-- empty

    If tracebacks are helpful, I can post those as well, but I guess this is something easy enough to reproduce locally without copy/pasting walls of error text...

    Thanks,
    Paul

    Preserve QC flags

PANGAEA QC flags should be preserved and e.g. added as additional columns, named like the flagged column but with a '_qc' extension. This column shall contain the original flags.
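A sketch of that column layout, assuming a hypothetical input shape where `flags` maps a flagged column name to its original flag values:

```python
import pandas as pd


def attach_qc_flags(data: pd.DataFrame, flags: dict) -> pd.DataFrame:
    """Sketch: keep original PANGAEA QC flags in '<column>_qc' companions."""
    out = data.copy()
    for column, values in flags.items():
        out[column + "_qc"] = values  # flags sit next to the flagged column
    return out
```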

    PANGAEA contact details in dwca exporter

    Hi,
    in the metadata file please add "first name", "last name" and "position" of PANGAEA staff to both the "Resource contacts" and "Metadata providers".
    Thanks a lot!
    A

    downloaded data has no unit

    Dear pangaeapy team,

I was wondering whether it is on purpose that the parameters of a downloaded file don't have units.
    Example:
    ds = PanDataSet('10.1594/PANGAEA.890530')
    ds.data.columns
Index(['Depth', 'Age', 'DBD', 'TC', 'TN', 'Corg dens', 'Event', 'Latitude',
       'Longitude', 'Elevation'],
      dtype='object')
    If I download the same data set via the PANGAEA webpage, short name plus unit is provided.

    best wishes,
    Kathrin
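Mirroring the web download would mean labeling columns with the short name plus unit. A one-line sketch (helper name hypothetical; the "short name [unit]" bracket convention is assumed from PANGAEA's tab-delimited downloads):

```python
def column_label(short_name, unit=None):
    # Combine short name and unit the way the web download labels columns.
    return f"{short_name} [{unit}]" if unit else short_name
```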

    Allow more recent version of Python Requests

I'm having version conflicts when installing pangaeapy together with another dependency of my project which requires requests[socks] 2.28.1, whereas pangaeapy requires requests~=2.26.0. Is there any reason not to upgrade the dependency?

    Attribute 'citation' is None

    Attribute 'citation' of 'pangaeapy.pandataset.PanDataSet' is 'None'.
    This is what I did:
    import pangaeapy as pd
    doi_num = 946390
    pds = pd.PanDataSet(doi_num)
    pds.citation

    I tested two different data sets to make sure that the behaviour is independent of data set.
    A workaround for me is to execute this before accessing the attribute 'citation':
    pds._setCitation()

    Suggestion: the call to the method '_setCitation' should be made centrally in the class 'pangaeapy.pandataset.PanDataSet' where it is appropriate.

    Multiple Events

By now, missing lat, lon, and event columns are only added to the data matrix in case only one event is represented in the dataset. Multiple events also have to be supported.

    Resolve attribute error preventing PanDataSet from accessing data

    Dear Pangaea team,
    having generated a query using PanQuery, I have run into issues accessing numerous data sets using PanDataSet. I am avoiding issue 27 by checking that the data sets I access are of type 'child'.
    However, problems persist, with the most common error returned being:
    AttributeError: 'NoneType' object has no attribute 'text'
Below, I provide a reproducible example. In this query, 43 out of 212 return an error, 155 succeed (the remaining 14 are of type 'parent' and thus aren't tested). Of the 43 failures to return the dataset, only two errors relate to access restrictions; all others are attribute errors.

    from pangaeapy import panquery as pq
    from pangaeapy import PanDataSet as pdset
    bb = [-11, 48, -10.5, 48.5]
    my_query = pq.PanQuery("method:CTD/Rosette", bbox = bb, limit = 500)
    my_query.totalcount
    

    Reality check, this should return 212 results:

    successes = 0
    errors = 0
    for i in range(my_query.totalcount):
      if my_query.result[i]["type"] == 'child':
        # Checks if the data set is a parent data set, in which case we skip it,  
        # only "child" data sets containing actual data matter:
        uid = int(my_query.result[i]["URI"].split('.')[-1])
        try:
          pdset(uid)
          successes += 1
        except Exception as e: 
          print(uid)
          print(e) 
          errors += 1
    

    My results are:
    errors
    43
    successes
    155

    Add keywords to PanDataset

    It would be good to have the keywords (actual, not technical or auto-generated) in the dataset object, too. This could be a simple list of keyword strings, but could also be a list of dictionaries. E.g., for https://doi.pangaea.de/10.1594/PANGAEA.957810, it could be either

    ["Benguela Upwelling System", "elemental stoichiometry", "marine carbon cycle", "Sea surface partial pressure of CO2"]
    

    or

    [
      {
        "name": "Benguela Upwelling System",
        "id":  "keywords.term72123",
        "type": "fromDatabase"
      },
      ...
    ]
    

The former is a bit easier to read, while the latter is easier to extend, so I have a slight tendency toward the list of dictionaries.

    @huberrob Which way would you prefer? I'm happy to provide a pull request for the implementation.
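A small helper could even support both proposed shapes, keeping the simple string form usable while allowing the richer dict form (helper name hypothetical):

```python
def keyword_names(keywords):
    """Accept either proposed form: plain strings or dicts with a 'name' key."""
    return [k["name"] if isinstance(k, dict) else k for k in keywords]
```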

    No module named 'pangaeapy'

Tried to install via classic pip, which returns the error:
    ERROR: Could not install packages due to an OSError: [Errno 2] Arquivo ou diretório inexistente: '/home/victor/anaconda3/envs/pyleo/lib/python3.10/site-packages/numpy-1.24.1.dist-info/METADATA'

In a second attempt I tried to install via git ('pip install git+https://github.com/pangaea-data-publisher/pangaeapy') and it returned the error:
    'ERROR: git+https://github.com/pangaea-data-publisher/pangaeapy does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.'

    relations for collections missing

    Hi Robert,
    would it be possible to also get the relations (related literature) for collections (previously parents)?
    When I do e.g.
    ds = pg.PanDataSet(839065) # Is child dataset
    ds.relations
    [{'id': 'ref66016',
    'title': 'Late Holocene primary productivity and sea surface temperature variations in the northeastern Arabian Sea: Implications for winter monsoon variability',
    'uri': 'https://doi.org/10.1002/2013PA002579',
    'type': 'Related to'}]

    However:
    dn = pg.PanDataSet('10.1594/PANGAEA.839067') # Is parent record.
    dn.relations
    []

    Thanks!
    Astrid

    Date/Time field not correctly read anymore

    I have been reading a dataset (https://doi.org/10.1594/PANGAEA.937536) that includes a Date/Time column. Until recently (at least Aug 30, 2023 but I believe I ran the notebook since), it was correctly read as 2008-04-23T17:15:00 (for the first data point). Now, only the year appears to be read. To reproduce the issue:
    pandata = pg.PanDataSet('doi:10.1594/PANGAEA.937536') then print(pandata.data['Date/Time'][0]) (returns 2008-01-01 00:00:00).

    Thank you!!
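A quick way to check that the timestamp itself parses correctly, independent of pangaeapy, is to feed the expected first value to pandas with an explicit format (a year-only or ambiguous parse would instead collapse to 2008-01-01 00:00:00):

```python
import pandas as pd


def parse_pangaea_timestamp(value):
    # Parse the full ISO timestamp explicitly rather than relying on
    # format inference, which can silently drop the time components.
    return pd.to_datetime(value, format="%Y-%m-%dT%H:%M:%S")
```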
