Use Base julia semantics about gbif.jl HOT 10 CLOSED

rafaqz commented on August 16, 2024

Use Base julia semantics

from gbif.jl.

Comments (10)

tpoisot commented on August 16, 2024

Starting with a 0-length array (which was the case at some point) is obviously a terrible idea for performance reasons, so this will not happen.

While I do get your point about the semantics, I think we can have some flexibility for now - the package works, the documentation is updated, the code has been reviewed a few times. If we want to change all of that, with all that implies for both the user-facing code and the internals, there needs to be a discussion prior to opening issues, or at least suggestions. Not only about read! (which I don't think would make sense for most users), but about how to rename the number of effective records and the total number of records.

from gbif.jl.

rafaqz commented on August 16, 2024

As I mentioned in read! I was hoping to use GBIF.jl in teaching examples, and for documentation examples in GeoData.jl, e.g. for extract. It's the perfect package to use to demonstrate getting some point observations, and extracting variables for them. But code examples cant really contain base julia methods with non-standard semantics.

I was hoping to make some improvements to the package, and willing to do all of the work, and wouldn't do anything that reduces the performance.

from gbif.jl.

tpoisot commented on August 16, 2024

Not that I am unwilling to entertain the idea - but I need a concrete proposal, and I need this before we introduce read! (which I think is not a user-friendly name, but we'll discuss this later).

I am really grateful that you want to make improvements to the package, but as it's developed and heavily used in the lab, I want to make sure that we discuss things before writing code, so as not to waste your time writing it and our time reviewing it.

If you could reply with a proposed fix coupled to each of your issues, we can discuss this before moving on to a PR.

from gbif.jl.

rafaqz commented on August 16, 2024

Methods to modify:

length: feels like it should be the number of records downloaded already
size: is defined in base as (length,) for an array. We should do the same here.
view(::GBIFRecords): probably shouldn't be defined here. In base it returns a zero dimensional SubArray

There are two main ways to do this.

Use everything as is, and just change some names, so

size (the length of the array) becomes e.g. total_occurrences .
length (the number of fetched records) stays as length
view becomes e.g records which calls view(o.occuurences, length(o)).
Array interface methods like Array/vec/collect etc could all call records.
getindex is a little odd... does it index o.occurrences or records(o) ? I'm not sure.

Generally in this case array interface methods like length have to be manipulated to give the illusion of a shorter array of GBIFRecord than we actually have.

Swap how the occurrences array works so it starts small and grows with new data.

The occurrences vector is initialized with the length of the first request. Each successive request uses push!. This isn't slow compared to how long a HTTP request takes.
GBIFRecords is then <: AbstractArray and can be a direct forwarding of the whole interface to o.occurrences. It can just be indexed directly as none of the indices are undef. So view isn't needed at all, but you can use view on it in the normal way to take views of indices.
total_occurrences is then a field we store in the GBIFRecords object, to track how many there are in total, and how meany are left to download.

In this case the array interface makes sense for GBIFRecords as-is - it's just a vector or GBIFRecord, with some metadata - the original request and the total_occurrences field.

from gbif.jl.

rafaqz commented on August 16, 2024

Any comments on this? It would be good to clarify this before the upcoming course. We really should be using GBIF.jl in it.

from gbif.jl.

rafaqz commented on August 16, 2024

For a real world demonstration, this will be in the GeoData.jl docs and used in teaching:

# A basic species distribution modelling workflow
using GeoData, GBIF, CSV, Plots 

# Load occurrences for the Mountain Pygmy Possum
t = taxon("Burramys parvus")
records = occurrences(t, :limit => 300)

# Get the rest of the occurrences, we need to do this manually with a loop.
# Note: GBIF.jl uses non-standard semantics for `size`. This is comparing 
# the occurrances already downloaded with the total occurrances.
while length(records) < size(records)
    occurrences!(records)
end

# Extract the longitude/latitude value to a Vector of Tuple
coords = map(r -> (r.longitude, r.latitude), records)

# Get BioClim layers and subset to south-east australia
A = GeoStack(WorldClim{BioClim}, (1, 3, 7, 12))
SE_aus = A[X=Between(138, 155), Y=Between(-40, -25), Band=1]

# Plot BioClim predictors and scatter occurrence points on all subplots
p = plot(SE_aus);
foreach(i -> scatter!(p, coords; subplot=i, legend=:none), 1:4)
display(p)

# Extract predictor variables and write to CSV
predictors = extract(SE_aus, coords)
CSV.write("burramys_parvus_predictors.csv", predictors)

From my perspective it would be best to instead of the while loop do something like:

read!(occurences)

But if you were to use a while loop or anything similar, it would be something like:

while length(records) < total_occurences(records)
    occurrences!(records)
end

So that the non-standard meaning of size doesn't have to be explained. view(records) would give an error, for the same reason.

from gbif.jl.

tpoisot commented on August 16, 2024

What about count for total_occurrences?

from gbif.jl.

tpoisot commented on August 16, 2024

Re. read!, I don't think it's a good idea, because we have a number of cases where we do something at the end of every page of request.

from gbif.jl.

tpoisot commented on August 16, 2024

And as a third point, I don't think the Array should grow. We know its size at the beginning, and back when it did grew with each page, the requests were getting slower and slower.

from gbif.jl.

tpoisot commented on August 16, 2024

So I'd vote for option 1, with maybe count instead of total_occurrences - go ahead with a PR?

from gbif.jl.

Use Base julia semantics about gbif.jl HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent