
peskas.timor.data.pipeline's People

Contributors

efcaguab, langbart

Forkers

sozinhovillgit

peskas.timor.data.pipeline's Issues

Species geographic information is limited

Currently, the species for which taxonomic information is downloaded are determined by filtering those for which FishBase has a record of a (museum) specimen from a particular country. Because of that, several species are missing for Timor, and we currently also accept species that have specimens recorded in Indonesia. However, some species are still missed and there might be a few false positives as well.

An ideal approach would be to consult occurrence databases like GBIF instead of FishBase to determine whether a species should be included in our length-weight relationship analysis.
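
A minimal sketch, assuming the rgbif package, of how a GBIF occurrence count could drive the inclusion decision (the species name is illustrative; "TL" is the ISO code for Timor-Leste):

  library(rgbif)

  # Count occurrences of a species recorded in Timor-Leste; limit = 0 means we
  # only fetch the record count, not the records themselves
  n_occurrences <- occ_search(
    scientificName = "Lutjanus argentimaculatus",
    country = "TL",
    limit = 0
  )$meta$count

  # Include the species in the length-weight analysis only if GBIF knows of
  # occurrences in the country (the threshold could of course be higher)
  include_species <- n_occurrences > 0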

Download metadata tables from Airtable

Now that we're using Airtable, some extra functionality needs to be added so that metadata tables are retrieved from that source instead of Google Sheets.
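
A rough sketch, not the final implementation, of retrieving a table from the Airtable REST API with httr (the base id and table name are placeholders; AIRTABLE_KEY is assumed to be the environment variable already used elsewhere in the pipeline, and pagination via the offset field is not handled here):

  library(httr)

  get_airtable_table <- function(base_id, table_name,
                                 api_key = Sys.getenv("AIRTABLE_KEY")) {
    response <- GET(
      url = paste0("https://api.airtable.com/v0/", base_id, "/", table_name),
      add_headers(Authorization = paste("Bearer", api_key))
    )
    stop_for_status(response)
    # Each record comes back as a list with an "id" and a "fields" element
    content(response, as = "parsed")$records
  }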

Finding the taxonomic rank for Sharks, Tunas, and Rays is not ideal

Sharks (Selachimorpha), Rays (Rajiformes), and Tunas (Thunnini) get special treatment in the code that retrieves length-weight and length-length relationships. The reason is that these do not correspond to a specific taxonomic order/family, but rather to super-orders or a tribe. As such, it's not easy to find the species that compose these groups in FishBase.

Currently, we search species by common name ("shark" and "tuna"), but that misses some species that do not have that word in their name and includes some that do not belong to these groups (e.g. the "tuna" search includes Akihito futuna, which is a goby, and Gymnosarda unicolor, which is closely related but is actually a bonito).

A more appropriate treatment of these catch groups could look like:

  • No hard-coding of group names
  • A table (Airtable?) that relates a superorder to the corresponding orders (or a tribe to the corresponding genera). This could potentially be part of the fao_table (a sketch of such a table is shown below)
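
A sketch of what such a table could look like if it lived in Airtable or alongside the fao_table (column names and the exact set of orders/genera are illustrative, not decided):

  library(tibble)

  catch_group_taxa <- tribble(
    ~catch_group, ~taxonomic_level, ~taxon,
    "shark",      "order",          "Carcharhiniformes",
    "shark",      "order",          "Lamniformes",
    "shark",      "order",          "Orectolobiformes",
    "ray",        "order",          "Rajiformes",
    "ray",        "order",          "Myliobatiformes",
    "tuna",       "genus",          "Thunnus",
    "tuna",       "genus",          "Katsuwonus",
    "tuna",       "genus",          "Auxis"
  )

With a table like this, the FishBase queries could be driven by the order or genus column instead of hard-coded common names.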

Improve validation of prices

Currently, the only validation of revenue (and prices) is removing values that are too large or too small in absolute terms. Once we have the catch data we can use it to estimate the price of a catch and identify revenues that might be too large or too small for that catch.

This is important because there was some confusion during data collection that caused some prices to be recorded in different ways (total, per weight, per individual). This smarter validation will allow us to filter out inadequate cases more easily.
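
Once catch weights are available, a minimal sketch of that check could look like the following (the data frame and column names, as well as the bounds, are placeholders):

  library(dplyr)

  landings <- landings %>%
    mutate(
      price_per_kg = trip_value / catch_kg,
      # Flag revenues that are implausible for the recorded catch weight
      alert_price = price_per_kg < 0.1 | price_per_kg > 100
    )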

Validate data

Data validation is a big part of the pipeline. There are many items that require validation:

  • IMEI (#16)
  • Boat registration
  • Trip duration
  • Gear and mesh size
  • Number of fishers
  • #81

For each of these items, and for each record, the validation pseudocode looks roughly like this:

this_record <- read_record()
flag <- retrieve_flag(this_record)
record_ok <- test_record(this_record)

if (!record_ok) {
  fixed_record <- fix_record(this_record)
  fix_ok <- test_record(fixed_record)
  if (!fix_ok && !flag) add_flag()
} else if (record_ok && flag) {
  mark_flag_as_fixed()
}
# otherwise the record is fine and not flagged, so nothing needs to happen

The flags will be stored in a Google Sheets table, which should be edited accordingly: https://docs.google.com/spreadsheets/d/1aquZSimR2okURO08q1lmoVUwNZvwwSwbpqblE4fMKk8

Tuning of validation algorithm

We need to improve the effectiveness of the catch-parameters validation algorithm. At the moment, its tolerance threshold seems to be too low.

Among the possible solutions:

  • increase the tolerance threshold k
  • multiply k by a coefficient associated with the gear type (e.g. max for FAD, min for HL); see the sketch below
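
A sketch of the second option, with illustrative gear codes and coefficient values that would need to be tuned (in the pipeline the coefficient would be looked up from each record's gear type):

  # Base tolerance threshold used by the outlier detection
  k <- 3

  # Hypothetical gear-specific multipliers: gears that land larger catches
  # (e.g. FADs) get more tolerance than gears that land small catches
  # (e.g. hand lines)
  gear_coefficient <- c(FAD = 2.0, GN = 1.5, HL = 1.0)

  # Effective threshold for a record caught with a FAD
  k_effective <- k * gear_coefficient[["FAD"]]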

Ingest trips and tracks from Pelagic

Peskas must be able to ingest data from Pelagic. This includes:

  • Trip data, daily.
  • A mechanism to detect when existing trips have been modified.
  • Location data, only as new trips are recorded. Once location data has been retrieved, it does not need to be re-downloaded unless information about the trip has been updated.
  • Save the data in a cloud storage service.

Improve README

The README lacks clarity and is not enough to bring someone new up to speed on how the code works. It should include information about:

  • The config file
  • The package philosophy
  • Versioning of cloud files
  • Logging information
  • Using GitHub Actions to trigger the workflow

Ingest data from legacy landings survey

Some of the landings data comes from an old survey that was used prior to 2019, when the survey was redesigned. The legacy landings should be ingested in a very similar way to the recent landings: the ingestion should use survey-related functions like retrieve_survey(), retrieve_survey_data(), and retrieve_survey_metadata().

Some parameters about the legacy landings have already been inserted in the configuration file (inst/conf.yml). See below.

  landings_legacy:
      api: kobohr
      survey_id:
      token: !expr Sys.getenv('KOBO_TOKEN')
      file_prefix: timor-landings-v1
      version:
        preprocess: latest

A key parameter missing is the survey_id, which we still need to figure out. We also need to ensure that the kobo token has permissions to retrieve data from that survey. The survey is hosted by https://kobo.humanitarianresponse.info and the peskas account should have access to it.
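
While that parameter is missing, a small check like the following could make the ingestion fail early instead of silently requesting a non-existent survey (a sketch, assuming the configuration is read with the config package as the logs suggest):

  legacy_params <- config::get(file = "inst/conf.yml")$landings_legacy

  if (is.null(legacy_params$survey_id)) {
    stop("landings_legacy survey_id is not set in inst/conf.yml yet")
  }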

Just as with the recent survey data ingestion, we need to:

  • Retrieve the survey artefacts (data and metadata) from Kobo
  • Upload the survey artefacts (data and metadata) into cloud storage
  • Incorporate it in the workflow (a dummy step already exists)

Export anonymised data to WorldFish dataverse and/or GARDIAN

As part of the funding requirements, the data needs to be made available. Once the data has been cleaned and curated, we need to anonymise it and export it to a public repository.

The priority and details of the implementation still need to be discussed.

feat/fish_parameters

Take the catch types data from Airtable (from the metadata jobs) and use it to retrieve the length-weight relationship parameters and maximum lengths for the fish species belonging to each family that are present in Timor or neighbouring countries.
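
A rough sketch of that job, assuming the rfishbase package behaves as described (the family name is illustrative; in the real job it would come from the Airtable catch types table):

  library(rfishbase)

  # All species FishBase lists for a given family
  family_species <- species_list(Family = "Lutjanidae")

  # Length-weight relationship parameters (a, b) for those species
  lw_params <- length_weight(family_species)

  # Population characteristics table, which includes maximum recorded lengths
  max_lengths <- popchar(family_species)

Filtering to species present in Timor or neighbouring countries would then happen on top of these tables (see the related issue on species geographic information).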

Data pipeline workflow fails when triggered by a new release

The data pipeline workflow (https://github.com/WorldFishCenter/peskas.timor.data.pipeline/blob/main/.github/workflows/data-pipeline.yaml) fails when it's triggered by a new GitHub release. See this run for an example: https://github.com/WorldFishCenter/peskas.timor.data.pipeline/actions/runs/678473765

The reason is that the workflow attempts to name the image using the tags from Surgo/docker-smart-tag-action@v1, but because the release generates a large number of tags, appending them creates an invalid image name.

Solving this issue probably involves rethinking how the images are being named and tagged.

Validate IMEI

Related to #15, we are going to start validating the IMEI. We can start by reusing some code from the total catch estimation:

  library(dplyr)

  survey <- survey %>%
    mutate(
      boat_imei = as.numeric(boat_imei_r),
      # If it was negative, make it positive, as it was probably a typo
      boat_imei = if_else(boat_imei < 0, boat_imei * -1, boat_imei),
      # Optimistically, we need at least 5 digits to work with
      boat_imei = if_else(boat_imei < 9999, NA_real_, boat_imei),
      # Back to character for further treatment
      boat_imei = as.character(boat_imei),
      imei_regex = paste0(boat_imei, "$")
    )

Print bounds of outlier detection algorithms to be able to see what they are

Currently, LocScaleB() is used to determine the bounds beyond which we consider a value to be an outlier. However, we don't have a good idea of what those bound values are or whether they change over time. Ideally, we want that data to be stored somewhere in the bucket, but for the time being, just printing the bound values to the console so we can check the logs should suffice.
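
For the time being, something like the following would do (a sketch, assuming the logger package already used for the pipeline's logs, and that the object returned by LocScaleB() exposes the lower and upper bounds in its bounds element; the input vector and k are illustrative):

  out <- univOutl::LocScaleB(x = landings$weight, k = 3)

  logger::log_info(
    "Outlier bounds for catch weight: lower = {out$bounds[[1]]}, upper = {out$bounds[[2]]}"
  )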

Test validity of IMEI installs over time

pt_validate_vms_installs() used to check that an IMEI was uniquely associated with a single boat. However, it later became evident that some devices were moved between boats, particularly early in the installation period. Consequently, this test (see code below) was deactivated.

v <- vms_installs_table %>%
  dplyr::mutate(
    device_event_date = lubridate::as_date(.data$device_event_date),
    createdTime = lubridate::ymd_hms(.data$createdTime),
    created_date = lubridate::as_date(.data$created_date)
  )

ok_boat_installs <- v %>%
  dplyr::group_by(.data$device_imei) %>%
  dplyr::summarise(n_boats = dplyr::n_distinct(.data$boat_id),
                   .groups = "drop")

if (any(ok_boat_installs$n_boats > 1))
  stop("detected a vms device in more than one boat")

A new test is needed in which the consistency of devices and boats is still validated. The test needs to acknowledge that IMEIs can be associated with multiple boats, just not at the same time.
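
A sketch of a possible replacement, following the style of the snippet above: an IMEI may legitimately move between boats over time, but should not map to more than one boat on the same installation event date (using the event date as a simple proxy for "at the same time"):

  conflicting_installs <- v %>%
    dplyr::group_by(device_imei, device_event_date) %>%
    dplyr::summarise(n_boats = dplyr::n_distinct(boat_id), .groups = "drop") %>%
    dplyr::filter(n_boats > 1)

  if (nrow(conflicting_installs) > 0)
    stop("detected a vms device installed on more than one boat at the same time")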

Fixing catch groups codes

Ideally, there should be a match between the catch codes in the survey data and those in the metadata tables. Currently, we have the following situations to resolve:

  1. Legacy landings data include some catch codes that are missing from the metadata tables. These are "205", "418", "207", "208", "200", "216", "204", "206", "201", "210", "148", and "39_1". Observations labelled with these codes make up 5.01% of the legacy landings data. As it is unclear where to retrieve the information associated with these ghost codes, dropping these observations seems to be the best solution at the moment.

  2. There are some catch types that were virtually never recorded in the surveys. Specifically, 55, 52, 53, 40, 39, 48, 54, 57, 47, 56, and 51 appear never to have been recorded in the legacy landings; 52, 57, 47, 44, and 56 appear never to have been recorded in the recent landings; and 52, 57, 47, and 56 appear never to have been recorded across all the data. Perhaps that is something to discuss more deeply with Joctan. The codes in question are:

39: Butterflyfish
40: Cardinalfish
44: Cobia
47: Bannerfish
48: Milkfish
51: Remora
52: Tripodfish
53: Wolf herring
54: Stingrays
55: Sicklefish
56: Lobster
57: Sea cucumber

  3. Most of the "seaweed" catches in the recent landings are labelled as "seaweed" instead of the associated code "58". I think we could just add a line of code in pt_nest_species() to replace it (see the sketch below).
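
For point 3, the fix could be as small as the line sketched below (the data frame and column names are placeholders; where exactly it goes inside pt_nest_species() still needs to be decided):

  catch_data <- catch_data %>%
    dplyr::mutate(species = dplyr::if_else(species == "seaweed", "58", species))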

Environment variable assignment fails

When the workflow is running in the main branch, the environment variable R_CONFIG_ACTIVE is supposed to take the value "production". To do that, we use the allenevans/set-env action, which retrieves the value from a previous step that checks the branch where the code is running. See below for how this works:

      - name: Set variables
        id: setvars
        run: |
          if [[ "${{github.base_ref}}" == "main" || "${{github.ref}}" == "refs/heads/main" ]]; then
            echo "::set-output name=r-config::production"
          else
            echo "::set-output name=r-config::default"
          fi
      - name: Set outputs as environment variables
        uses: allenevans/[email protected]
        with:
          R_CONFIG_ACTIVE: ${{ steps.setvars.outputs.r-config }}

The problem is that allenevans/set-env fails to set the environment variable, and therefore the code runs in development mode even when running from the main branch. The step emits the warning below:

Warning: Unexpected input(s) 'R_CONFIG_ACTIVE', valid inputs are ['']
Run allenevans/[email protected]
  with:
    R_CONFIG_ACTIVE: production
  env:
    KOBO_TOKEN: ***
    GCP_SA_KEY: ***
    AIRTABLE_KEY: ***

Follow this link for an example of an action run in which this happens https://github.com/WorldFishCenter/peskas.timor.data.pipeline/actions/runs/854950038

I suspect the solution might be as easy as declaring all the environment variables in the step rather than at the top level; declaring the global environment variables in two different places might not be allowed (although I was unable to find any documentation about this).

Sometimes survey retrieval appears to be successful when in fact it isn't

As an example see the logs from https://github.com/WorldFishCenter/peskas.timor.data.pipeline/pull/8/checks?check_run_id=2190423335:

Run Rscript -e 'peskas.timor.data.pipeline::ingest_timor_landings()'
INFO [2021-03-25 05:02:33] Loading configuration file...
INFO [2021-03-25 05:02:33] Using configutation: default
DEBUG [2021-03-25 05:02:33] Running with parameters list(landings = list(api = "kobohr", survey_id = 344563, token = "***", file_prefix = "timor-landings-v2"), landings_legacy = list(api = "kobohr", survey_id = NULL, token = "***", file_prefix = "timor-landings-v1"))
DEBUG [2021-03-25 05:02:33] Running with parameters list(google = list(options = list(project = "peskas", bucket = "timor-dev", service_account_key = "***"), key = "gcs"))
INFO [2021-03-25 05:02:33] Downloading survey metadata as timor-landings-v2_metadata__20210325050233_34392a7__.json...
SUCCESS [2021-03-25 05:02:34] Metadata download succeeded
INFO [2021-03-25 05:02:34] Downloading survey csv data as timor-landings-v2_raw__20210325050233_34392a7__.csv...
SUCCESS [2021-03-25 05:04:35] Survey csv data download succeeded
INFO [2021-03-25 05:04:35] Downloading survey json data as timor-landings-v2_raw__20210325050233_34392a7__.json...
SUCCESS [2021-03-25 05:06:36] Survey json data download succeeded
INFO [2021-03-25 05:06:36] Uploading files to cloud...
2021-03-25 05:06:36 -- File size detected as 72.3 Kb
2021-03-25 05:06:37 -- File size detected as 157 bytes
2021-03-25 05:06:37 -- File size detected as 157 bytes
SUCCESS [2021-03-25 05:06:38] File upload succeded

The file sizes are extremely small, so the API request probably failed and the response content (probably an error status message) is being saved as the data or metadata.

Fixing this bug probably requires:

  • Inspecting the response content and checking that its format corresponds with expectations
  • Inspecting the response code and logging it so that it's easier to understand the source of the errors (see the sketch below)
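
A sketch of what both checks could look like, assuming the requests are made with httr (the size threshold is illustrative):

  library(httr)
  library(logger)

  check_kobo_response <- function(response) {
    log_info("Kobo API returned status {status_code(response)}")
    if (http_error(response)) {
      stop("Kobo API request failed with status ", status_code(response))
    }
    # A genuine data download should not be a tiny error payload
    if (length(content(response, as = "raw")) < 1000) {
      stop(
        "Kobo API response is suspiciously small; content: ",
        content(response, as = "text", encoding = "UTF-8")
      )
    }
    invisible(response)
  }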

Add capabilities to import data from KOBO

Peskas must be able to import catch data from Kobo on a regular basis. This feature includes:

  • Code to import catch data, both structured (csv format) and unstructured (json format)
  • Code to import survey metadata (translations, choices, etc)
  • Save the data into a cloud storage bucket (either Amazon S3 or Google Cloud Storage). Data imports must be version controlled and distinguish between development and production environments.
  • Workflow code to run the Kobo import job on a daily basis. The code could run in GitHub Actions
  • Code to log events accurately
  • Workflow to distinguish between images from different branches

Model catch prices and volumes

Catch prices and volumes are an important component of the data. However, many errors are introduced when enumerators record this information. We aim to address that by modelling price and weight. Using the model predictions, we can then identify records that are unlikely to be correct or that should be reviewed.

This model should be re-trained on a monthly basis. Depending on the model complexity and training time, it can run on GitHub runners or on cloud providers' computing instances (e.g. Google Cloud Run or Amazon EC2/Batch).

Improve validation of PDS trips

The information obtained from the PDS trips is used to calculate the most important statistics from Timor. Unfortunately, the quality of the data is patchy.

For instance, the model we're currently using indicates that the average boat makes 330-360 landings per year. This is very unlikely to be the case and might be caused by trips in the system that are not accurate.

We need to identify these spurious trips using more stringent automated checks than the ones we currently use.

Raw data preprocessing pipeline step returns exit code 137

The preprocessing job in the pipeline's GitHub Actions workflow returns the error "Process completed with exit code 137". It seems that the Docker container running in GitHub Actions runs out of memory and the process is killed. A possible solution is to split the preprocessing into two separate jobs (and therefore two separate containers) and then merge the final processed data.

Kobo ingestion limited at 30,000 records

Kobo limits the number of submissions retrieved via the API to 30,000; if more are required, a multi-page request is needed. Currently, there are more than 30,000 submissions in the landings data, but only the first 30,000 are retrieved.

ingest_landings() needs to be updated so that all records are downloaded, whether in the tabular or the list format. A possible approach is sketched below.
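
A very rough sketch of paginated retrieval, assuming the data endpoint accepts start and limit query parameters and a Token authorization header (this needs to be confirmed against the Kobo API actually used by ingest_landings(); the URL and token are placeholders):

  library(httr)

  get_all_submissions <- function(data_url, token, page_size = 30000) {
    submissions <- list()
    start <- 0
    repeat {
      response <- GET(
        data_url,
        query = list(start = start, limit = page_size),
        add_headers(Authorization = paste("Token", token))
      )
      stop_for_status(response)
      page <- content(response, as = "parsed")
      if (length(page) == 0) break
      submissions <- c(submissions, page)
      start <- start + page_size
    }
    submissions
  }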
