
plants's Introduction

Project Cover

Plants project 🌴

Next milestone:

A web app with interactive graphs and a dashboard showing the climatic properties of the different occurrences of a plant species.

Next steps:

  • decide on the data organization: SQL or bucket?
  • set up Apache Airflow to schedule the workflow that retrieves the climatic data for all the observations from the API (which is limited to 500 requests per day); see the sketch after this list
  • develop a GUI and a web service, i.e. a web application to visualize the collected data. Options include:
    • Plotly Dash (Python) - free service
    • Tableau online or server
    • Amazon quickSight
    • Power BI
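
A minimal sketch of what the Airflow piece could look like, assuming Airflow 2.x and the requests library; the API URL is a placeholder and the in-memory coordinate list stands in for a database query.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

DAILY_BUDGET = 500  # the climate API allows only ~500 requests per day
CLIMATE_API = "https://api.example.com/climate"  # placeholder URL

def fetch_batch():
    # In the real pipeline this would read pending observations from the
    # database; a hypothetical in-memory list stands in for it here.
    pending = [(45.489682, 6.98267), (49.001282, 8.398373)]
    for lat, lon in pending[:DAILY_BUDGET]:
        resp = requests.get(CLIMATE_API, params={"lat": lat, "lon": lon}, timeout=30)
        resp.raise_for_status()
        # ...store resp.json() for this observation

with DAG(
    dag_id="retrieve_climatic_data",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # one batch per day keeps the pipeline inside the quota
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_batch", python_callable=fetch_batch)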

Open topics:

  • SQL database on an AWS RDS instance, or an AWS bucket? What is the best data organization?
  • data pipelines
  • should the climatic data be saved as raw JSON in the database, or is it better to parse it and extract the fields?
  • should the climatic table reference the plant observation database?

Journal

  • 1.04-13.04.24
    Set up on AWS an EC2 computing instance running Ubuntu, connected to a PostgreSQL database hosted on an AWS RDS instance. Managed to connect to the database through the CLI from my local machine (the database is not public; it is hosted in the private subnet of the EC2 instance).

  • 14.01.24
    Decision: a new table with the parsed climatic data will be linked to every instance of the observation table, so that the data can be analyzed more easily (see the schema sketch after the journal).
    Finally managed to create a docker-compose setup with two containers, one for the database and one for the pgAdmin web interface, to have a GUI for accessing the database. Created the corresponding guide and adapted the Makefiles for simplicity of operations (for example starting the containers).

  • 06.01.24
    Retrieved multiple instances and plotted the temperature for the different months on a graph. Decided that I want to store the climatic data in a table linked to the database of observations. Decided then that I want to move the database into a container for simplicity, and to organize the retrieval of climatic information with a cron job on a server in the future. Tried to containerize it but didn't succeed yet.

  • 02.12.23
    Added a function for a scatter plot with the observation temperature values for one month (for example May, about 15 observations); a sketch of such a function follows the journal.

  • 25/26.11.23 - project structure and table in database for climatic data
    Set up the project structure: source code in src/, main.py as entry point, config.py for keys; set up the venv and a Makefile; drafted the structure of the README; created a table in the database to store the climatic data for the observations (responses saved as JSON).

  • 11.11.23 - started the project
    bla bla
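
The schema sketch referenced in the 14.01.24 entry, a minimal illustration assuming psycopg2; table and column names are assumptions, not the repo's actual schema.

import psycopg2

# Illustrative DDL: one parsed-climate row per observation and month,
# keeping the raw API payload alongside the extracted fields.
DDL = """
CREATE TABLE IF NOT EXISTS climatic_data (
    id              SERIAL PRIMARY KEY,
    observation_id  INTEGER NOT NULL REFERENCES observations (id),
    month           SMALLINT NOT NULL CHECK (month BETWEEN 1 AND 12),
    temp_mean       REAL,
    temp_record_min REAL,
    temp_record_max REAL,
    raw_response    JSONB,
    UNIQUE (observation_id, month)
);
"""

with psycopg2.connect("dbname=plants user=plants") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)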
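The plotting sketch referenced in the 02.12.23 entry, assuming matplotlib; the data values below are made up for illustration (temperatures in Kelvin, as returned by the API).

import matplotlib.pyplot as plt

def plot_month_temperatures(observations, month_name="May"):
    """Scatter plot of the mean temperature of each observation for one month."""
    xs = range(len(observations))
    ys = [obs["temp_mean"] for obs in observations]
    plt.scatter(xs, ys)
    plt.xlabel("observation")
    plt.ylabel("mean temperature (K)")
    plt.title(f"Observation temperatures in {month_name}")
    plt.show()

# toy data; real values would come from the climatic_data table
plot_month_temperatures([{"temp_mean": 283.69}, {"temp_mean": 285.10}, {"temp_mean": 281.40}])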

Miscellaneous

Data and APIs

PostgreSQL Database

Setting up an AWS RDS instance and connecting to it
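
A minimal sketch of that setup, assuming the psycopg2 and sshtunnel packages: since the database sits in a private subnet, the connection from a local machine is tunnelled through the public EC2 host. All host names, user names, and paths below are placeholders.

import psycopg2
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ("ec2-xx-xx-xx-xx.eu-central-1.compute.amazonaws.com", 22),  # public EC2 host (placeholder)
    ssh_username="ubuntu",
    ssh_pkey="~/.ssh/my-ec2-key.pem",
    remote_bind_address=("plants-db.xxxxxxxx.eu-central-1.rds.amazonaws.com", 5432),
) as tunnel:
    conn = psycopg2.connect(
        host="127.0.0.1",
        port=tunnel.local_bind_port,  # local end of the tunnel to RDS
        dbname="plants",
        user="plants",
        password="...",
    )
    print(conn.server_version)  # quick sanity check of the connection
    conn.close()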

11.11.23 Request to ChatGPT

So, I have an ongoing project I want to develop. The project is the following: I want to develop a platform that combines plant data with climatic and geographic data. The process should be the following: I take the data of PlantNet, which is a plant recognition app, and especially I use their data to see where the observations of a specific plant are located in the world (coordinates). So for example I do that for a single plant. Then, thanks to an API, I want to get the climatic data of the different locations of the plant. Then, with some machine learning, I want to know what the climatic conditions are where this plant successfully grows.

Now, I need some help with the architecture of it all. I want to develop my project in Python and first start with just one plant. I can either download the dataset from PlantNet (it is a .csv file, about 1,700,000 occurrences in total, about 10-15 fields per row) or make an API request to the database. Once I collect the data and, for example, get all the rows corresponding to the observations of the same plant species, I can then ask the climatic API for data for every single location... so I have to store this data somewhere. At the end I perform machine learning (I think something like linear regression) and I should be able to get a climatic description of where this plant thrives. So once I have this data I can start building my own database, and then proceed (not now, but in the future) to apply that to all the plant species observed in the PlantNet dataset.

Now I would need some help with the architecture, how to set up the database and the pipelines: should I use .csv files or is there something more efficient? How do I manage the pipeline of data into Python, and where should I temporarily store, for example, all the climatic data that I need for the machine learning calculations? How do I build my own database, and so on? And should I work with containers, or is that not so important at the beginning?

I'm of course just starting, so I would start simple and then build things up.

Thank you

11.11.23 - The response:

  1. Data Collection: PlantNet Data: You have two options for obtaining PlantNet data:

Download Dataset: If the dataset is available for download, you can start by downloading it and working with it locally.

API Requests: If there's an API available, you can make requests directly from your Python application. This approach allows you to fetch real-time data and potentially keep your local data more up-to-date.

  2. Data Storage: Initial Plant Data: For storing the plant data (from PlantNet), you can use a database. Considering the structured nature of the data, a relational database like PostgreSQL or SQLite could be a good choice.

Climatic Data: You can also use a database for storing climatic data. However, since climatic data is likely to be time-series data, a time-series database like InfluxDB might be suitable.

  3. Data Processing Pipeline: ETL (Extract, Transform, Load): Design an ETL process to clean and transform the raw data before storing it in the database. This can be done using a tool like Apache Airflow or a simple custom Python script.

  4. Machine Learning: Model Training and Prediction: Use Python libraries such as scikit-learn or TensorFlow for machine learning. Train your model on the climatic and geographic data, and make predictions on where the plant is likely to thrive.

Model Storage: Store the trained model in a format that allows easy retrieval within your application. Pickle or joblib can be used for this purpose. (A sketch of this step follows the response.)

  5. Additional Considerations: Temporary Storage: For temporary storage during data processing, you might use a local cache or a temporary table in your database.

Containerization: While not crucial initially, containerization (using tools like Docker) can help with reproducibility and deployment consistency as your project grows.

Scalability: Consider the scalability of your architecture. If you plan to scale up, tools like Apache Kafka for data streaming or cloud services like AWS S3 and Google Cloud Storage can be considered.

  6. Future Steps: Expanding to All Plant Species: As you mentioned, once you have a prototype working for one plant, you can expand it to handle all plant species in the PlantNet dataset.

Frontend and User Interface: Consider developing a user interface for interacting with your platform, allowing users to explore and visualize the data.

Conclusion: Start small, build iteratively, and refine your architecture as needed. Utilize Python's rich ecosystem of libraries and tools for each step of the process. Remember to document your work as you progress. Good luck with your project!
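
A minimal sketch of the machine-learning step suggested in point 4, using scikit-learn and joblib; the features, labels, and file name are made up for illustration.

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical features per observation: [mean temperature (K), pressure (hPa)]
X = np.array([[283.7, 1012.0], [285.1, 1009.0], [281.0, 1015.0]])
y = np.array([1.0, 1.0, 0.0])  # made-up suitability scores

model = LinearRegression().fit(X, y)
joblib.dump(model, "plant_climate_model.joblib")  # persist the trained model

reloaded = joblib.load("plant_climate_model.joblib")
print(reloaded.predict([[284.0, 1011.0]]))  # predict for a new location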


plants's Issues

Adjust repo

I suggest:

  • Creating a .gitignore file
  • Removing explicit api_keys
  • Addressing other repo-related issues

Define data structure for bucket 🪴

Purpose:

We want to define a data structure to save the different occurrences in the bucket.

Using the plant classification system

Since we are analysing plants, and specifically knowing that the data operations will be species-based, the most natural way to classify them is to use the plant classification system. It is possible to use the values of the .csv database from PlantNet (see a snippet of it below) to form the path of the instance in the bucket.

An occurrence would then be stored in the bucket, for example, like this: <kingdom>/<phylum>/<class>/<order>/<family>/<genus>/<species>/<occurrence_id>.json
If we then want to apply some AI/ML, this structure should facilitate the operations, since plants from the same family share some similar properties, and plants from the same genus even more so.
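
A minimal sketch of how such a key could be built from a row of the .csv, assuming Python's csv module; the file name "occurrences.csv" is a placeholder for the PlantNet export.

import csv

def bucket_key(row: dict) -> str:
    """Build the taxonomy-based bucket key for one occurrence row."""
    levels = ("kingdom", "phylum", "class", "order", "family", "genus", "species")
    parts = [row[level].replace(" ", "_") for level in levels]
    return "/".join(parts) + f"/{row['occurrenceID']}.json"

with open("occurrences.csv", newline="") as f:
    first_row = next(csv.DictReader(f, delimiter="\t"))  # the export is tab-separated
print(bucket_key(first_row))
# -> Plantae/Tracheophyta/Magnoliopsida/Caryophyllales/Caryophyllaceae/Gypsophila/Gypsophila_repens/o-1011560681.json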

Object type to store data of single occurrences?

I would suggest .json, since it's lightweight, easy to read, and it is actually the format in which we get the climatic data responses from the API.

In this file we would first save the information from the PlantNet database, probably converted to JSON key-value pairs.

  • Here are the first three lines of the .csv (tab-separated):
gbifID	datasetKey	occurrenceID	kingdom	phylum	class	order	family	genus	species	infraspecificEpithet	taxonRank	scientificName	verbatimScientificName	verbatimScientificNameAuthorship	countryCode	locality	stateProvince	occurrenceStatus	individualCount	publishingOrgKey	decimalLatitude	decimalLongitude	coordinateUncertaintyInMeters	coordinatePrecision	elevation	elevationAccuracy	depth	depthAccuracy	eventDate	day	month	year	taxonKey	speciesKey	basisOfRecord	institutionCode	collectionCode	catalogNumber	recordNumber	identifiedBy	dateIdentified	license	rightsHolder	recordedBy	typeStatus	establishmentMeans	lastInterpreted	mediaType	issue
3949793303	7a3679ef-5582-4aaa-81f0-8c2545cafc81	o-1011560681	Plantae	Tracheophyta	Magnoliopsida	Caryophyllales	Caryophyllaceae	Gypsophila	Gypsophila repens		SPECIES	Gypsophila repens L.	Gypsophila repens L.	L.	FR			PRESENT	1	da86174a-a605-43a4-a5e8-53d484152cd3	45.489682	6.98267			2329.0	0.0			2021-07-22T10:04:40	22	7	2021	5384464	5384464	HUMAN_OBSERVATION							CC_BY_4_0					2023-08-26T02:21:33.883Z	StillImage	COUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES
3949793308	7a3679ef-5582-4aaa-81f0-8c2545cafc81	o-1011561022	Plantae	Tracheophyta	Magnoliopsida	Cornales	Cornaceae	Cornus	Cornus mas		SPECIES	Cornus mas L.	Cornus mas L.	L.	DE			PRESENT	1	da86174a-a605-43a4-a5e8-53d484152cd3	49.001282	8.398373			156.16	0.0			2021-07-31T10:00:31	31	7	2021	3082263	3082263	HUMAN_OBSERVATION							CC_BY_4_0					2023-08-26T02:21:33.886Z	StillImage	COORDINATE_ROUNDED;COUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES
  • which could, for one instance, be translated into JSON:
{
  "plantnet_data": {
    "gbifID": "3949793303",
    "datasetKey": "7a3679ef-5582-4aaa-81f0-8c2545cafc81",
    "occurrenceID": "o-1011560681",
    "kingdom": "Plantae",
    "phylum": "Tracheophyta",
    "class": "Magnoliopsida",
    "order": "Caryophyllales",
    "family": "Caryophyllaceae",
    "genus": "Gypsophila",
    "species": "Gypsophila repens",
    "infraspecificEpithet": "",
    "etc": "..."
    }
}
  • After that we could add the climatic data (12 separate requests, one per month), essentially a copy of the API response, organised by month:
{
    "climatic_data": {
        "1": {
            "temp": {
                "record_min": 269.85,
                "record_max": 296.05,
                "average_min": 274.17,
                "average_max": 291.97,
                "median": 283.71,
                "mean": 283.69,
                "p25": 281.05,
                "p75": 286.18,
                "st_dev": 3.98,
                "num": 3953
            },
            "pressure": {
                "min": 988,
                "max": 1036,
                "etc": "..."              
            }
        },
        "2": {
            "etc": "..."
        }
    }
}

@wenneton what do you think? 🌱
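
A minimal sketch of how one occurrence document could be assembled and written to the bucket, assuming boto3 and the bucket_key() helper sketched above; the bucket name is a placeholder.

import json

import boto3

def upload_occurrence(row: dict, climate_by_month: dict, bucket: str = "plants-occurrences"):
    """Merge the PlantNet fields and the monthly climate data, then write to S3."""
    doc = {"plantnet_data": row, "climatic_data": climate_by_month}
    boto3.client("s3").put_object(
        Bucket=bucket,  # placeholder bucket name
        Key=bucket_key(row),  # from the key-building sketch above
        Body=json.dumps(doc).encode("utf-8"),
        ContentType="application/json",
    )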
