Comments (9)
If you want a smaller response from the server, you'd have to use the (alpha) v2 API.
from pyinaturalist.
Right, the v2 API would help with this. I have an open issue to add support for that (#155), which I haven't gotten around to yet, but plan to... soonish? Meanwhile, you could use plain python requests to try that out, if you want.
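To give an idea of what that could look like, here is a minimal standard-library sketch. The v2 observations endpoint is real, but the exact `fields` syntax and the field names chosen here are assumptions to verify against the v2 API docs:

```python
from urllib.parse import urlencode

# Sketch: query the (alpha) v2 API directly, requesting only the fields needed.
# The dotted `fields` values below are assumptions -- check the v2 API docs
# for the exact syntax currently supported.
BASE_URL = "https://api.inaturalist.org/v2/observations"
params = {
    "user_id": "my_username",  # placeholder username
    "per_page": 200,
    "fields": "id,observed_on,taxon.id,taxon.name,taxon.rank",
}
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
# Then fetch with any HTTP client, e.g.:
#   import json, urllib.request
#   results = json.load(urllib.request.urlopen(url))["results"]
```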
For repeated downloads, pyinaturalist has some built-in caching that helps with this (briefly mentioned in the docs here). Let me know if you'd like help changing the settings for that.
For minimizing disk usage, what you're already doing (removing the info you don't need and compressing it) is probably the easiest option. Another option would be to use a more space-efficient format like parquet. Or even SQLite, or pretty much anything other than JSON, would likely be an improvement; the downsides are that the files are no longer human-readable/editable, and it adds a couple extra steps to read and write observation data. I have a separate library here that helps with this kind of thing, and I could give some examples if needed: https://github.com/pyinat/pyinaturalist-convert
Thanks a lot for giving tons of great info!
And sorry about the discussion threads.
Most GitHub repositories I've seen don't use them... so I always forget they exist until somebody reminds me, like you did.
Not sure if this deserves a new issue, since it is not a bug.
This is the only "related" subject I found... although it is not really related either.
I am downloading all my user occurrences as JSON files, which I store for later processing. I have the feeling they are much bigger than I need, basically because of the large "identifications" section inside each "result".
For example, if I only download one result (per_page=1), I get a JSON file of 912 lines.
The whole single "result" (one observation item) takes 904 of those lines.
The "taxon" section inside "result" takes 59 lines.
The "identifications" section inside "result" takes 616 lines, despite having only one identification in this example! (So the "identifications" section takes an even larger share of the JSON file size when there are several identifications per observation item.)
For what I need, I am OK with downloading just the stuff inside the "taxon" section.
Is it possible to somehow avoid downloading the "identifications" section, especially when I do a get_observations(user_id='my_username', page='all') request?
That would reduce my JSON downloads to less than 1/3 of their current size (which is important, since I need to do this weekly for several users at my institution).
Thanks a lot
@abubelinha
@abubelinha Yeah, full observation responses are fairly verbose, mainly because they include all the information you see on the observation pages on inaturalist.org. The bulk of it, as you noticed, is from the identifications and full taxonomy details for each identification.
Are you mainly concerned about network bandwidth, or disk space?
Are you mainly concerned about network bandwidth, or disk space?
Well, mainly about disk space (I rsync this folder 2-4 times a day between home and work).
But also script time, since I plan to repeat the same downloads regularly.
Of course, I can reprocess the JSON once it is on my disk and remove that part (I am also compressing the JSON).
But if you know how to skip identifications in the API response, the whole script would take less time and save bandwidth (and energy).
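For the local reprocessing step, a small sketch of what that trimming could look like (the keep-list here is just an example):

```python
import json

# Sketch: keep only the parts of each result that are actually needed,
# dropping the bulky "identifications" section before saving to disk.
KEEP_KEYS = {"id", "observed_on", "taxon"}  # adjust to what you need

def trim_results(response):
    """Return response['results'] with each observation reduced to KEEP_KEYS."""
    return [
        {key: value for key, value in obs.items() if key in KEEP_KEYS}
        for obs in response["results"]
    ]

# Tiny fake response for illustration
response = {"results": [{
    "id": 1,
    "observed_on": "2023-01-01",
    "taxon": {"name": "Quercus robur", "rank": "species"},
    "identifications": [{"taxon": {"name": "Quercus robur"}}],
}]}
trimmed = trim_results(response)
print(json.dumps(trimmed))  # identifications are gone
```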
I don't see any iNaturalist API options to get a "summarized" API return (other than using only_id, which would return only the IDs and nothing else).
As you are the expert, I preferred to ask here, just in case pyinaturalist already had an option for this.
Oh, you mean the fields parameter!
Yes, that seems to be exactly what I need.
I guess you will include that option in pyinaturalist when the v2 API becomes stable, won't you?
Thank you so much.
P.S., you're always welcome to create new issues, or discussion threads for more open-ended questions. Usually those are easier for me to catch up on than comments on closed issues.
Continued in #155
Related Issues (20)
- Add undocumented GET /taxa/lifelist_metadata endpoint
- Fix type annotations in API docs
- Add lifelist metadata to response in ObservationController.life_list()
- ImportError: cannot import name 'RequestRate' from 'pyrate_limiter'
- Put long param sections in dropdowns
- AnnotationController.create() - allow adding annotations by label instead of ID
- Drop support for python 3.7
- binder down?
- possible issue with some endpoints
- TimeoutError: The write operation timed out
- Using Dry Run throws a key error
- Adding 'Notes' to observations
- A big shoutout!
- Checking if the token provided is valid
- Observations to/from Pandas DataFrame
- GUANOMetadata Support in Audio files
- Create/update observations with Observation objects
- Error `WARNING Parameters missing or invalid:1/1 cannot come before 1/1"` using pyinaturalist
- HTTP 429 Rate Limit error on reading observations
- Feature request: support for font-awesome icons inplace of emojis for user interface