asreview / synergy-dataset

SYNERGY - Open machine learning dataset on study selection in systematic reviews

License: Creative Commons Zero v1.0 Universal
About one third of the abstracts in the Wilson dataset are missing (3 of which are inclusions).
I think it would be worthwhile to see if we could figure out what causes this missingness.
One of the authors (Dr. Hannah Ewald) named the following possible causes:
- 572/1090 missing abstracts are from the years 1912 to 1989.
- The rest come from both Embase and Medline, and I can't see any systematic error (i.e. there are mixed page numbers, all names from A to Z, all indexed as journal articles, different languages and journals). Although it's odd for such a high number, maybe they just don't have abstracts.
I am awaiting a response from the first author (Dr. Christian Appenzeller-Herzog), who is currently out of office.
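In the meantime, a minimal sketch for checking whether the missing abstracts cluster by publication year, as in the 572/1090 figure above. The file name and the "abstract"/"year" column names are assumptions, not the repo's actual tooling:

```python
# Hypothetical check: do blank abstracts cluster before 1990?
import pandas as pd

df = pd.read_csv("Wilson_2018.csv")  # hypothetical export of the dataset
blank = df["abstract"].fillna("").str.strip() == ""
# Count blank-abstract records pre-1990 (True) vs 1990 onward (False).
print(df.loc[blank, "year"].lt(1990).value_counts())
```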
It turns out there is a discrepancy of 4 inclusions between the ptsd-dataset in this repo (38 inclusions) and the systematic review we linked it to (34 inclusions).
For the ptsd dataset, up until now we refer to https://doi.org/10.1080/00273171.2017.1412293, a systematic review (1) reporting 34 inclusions. It turns out that the .ris files on the corresponding OSF page contain 38 inclusions.
This number belongs to another systematic review (2) on the same dataset, http://dx.doi.org/10.1080/10705511.2016.1247646. On the corresponding OSF page, https://osf.io/6vdfk/, however, there are no .ris files uploaded (yet).
I think we have to refer to one or the other paper. For the 34-inclusions paper, we need to update the dataset by recoding the 4 inclusions to exclusions (see the comment by @J535D165 below). For the paper with 38 inclusions, the OSF page should be updated (@Rensvandeschoot), and all information in the documentation on systematic review 1 should be replaced with information on systematic review 2.
Any thoughts?
Thanks for your contribution, @terrymyc!
A couple of additions and remarks are listed below:
~~Based on the paper, the team excluded 4 more papers. 34 of the 38 papers were described in the paper. The papers listed below have to be excluded (right, @Rensvandeschoot @GerbrichFerdinands?):~~
The RIS files are now available on OSF. Can you connect them to your code and remove the ones in the GitHub repository?
Thanks for your statistics so far. Can you count the number of duplicate items as well? Please don't make things too complicated; a simple check for duplicate abstracts, for example, would be enough.
It turns out that @qubixes is also doing some work on the dataset statistics. This is implemented in an extension for asreview https://github.com/asreview/asreview-statistics. It might be interesting to have a look. It would be nice to integrate the functionality with this repo (not for now).
Originally posted by @J535D165 in #13 (comment)
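A minimal sketch of the duplicate check requested above; the CSV path and the "title"/"abstract" column names are assumptions:

```python
# Hypothetical duplicate count on lowercased titles and abstracts.
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical path
dup_abstracts = df["abstract"].dropna().str.lower().duplicated().sum()
dup_titles = df["title"].dropna().str.lower().duplicated().sum()
print(f"duplicate abstracts: {dup_abstracts}, duplicate titles: {dup_titles}")
```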
The files were shared directly with us; for this reason, no files are found at the OSF URL. I removed this entry as I'm cleaning source files out of the repo.
{
  "dataset_id": "Kwok_2020",
  "url": "https://raw.githubusercontent.com/asreview/systematic-review-datasets/master/datasets/Kwok_2020/output/Kwok_2020.csv",
  "reference": "https://doi.org/10.3390/v12010107",
  "link": "https://doi.org/10.17605/OSF.IO/5S27M",
  "license": "CC-BY Attribution 4.0 International",
  "title": "Virus Metagenomics in Farm Animals: A Systematic Review",
  "authors": [
    "Kwok, K. T. T.",
    "Nieuwenhuijse, D. F.",
    "Phan, M. V. T.",
    "Koopmans, M. P. G."
  ],
  "year": 2020,
  "topic": "Virus Metagenomics",
  "final_inclusions": true,
  "title_abstract_inclusions": false
}
I see here that you only use label2 to decide on inclusion. Can you motivate this?
Our screening is also done by two people, and we tend to include a record if at least one reviewer wants to include it. You seem to do this differently.
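For reference, a sketch of the "include if at least one reviewer includes" rule described above; the reviewer column names are hypothetical:

```python
# Hypothetical two-reviewer labels; 1 = include, 0 = exclude.
import pandas as pd

df = pd.DataFrame({"label1": [1, 0, 0, 1], "label2": [0, 0, 1, 1]})
df["included"] = df[["label1", "label2"]].max(axis=1)  # logical OR over reviewers
print(df)
```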
I received an `ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)` error when trying to run `synergy_dataset get` on macOS Ventura 13.6.6.
It is a known issue on macOS that I was able to solve by running `open /Applications/Python\ 3.9/Install\ Certificates.command` in a terminal.
This is a common issue on macOS where Python is not able to verify the SSL certificate provided by the server. To fix it, we use the Install Certificates.command script that ships with the python.org installer for macOS.
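An alternative workaround that is sometimes used on macOS (an assumption, not the project's documented fix) is to point Python's SSL machinery at certifi's CA bundle before downloading:

```python
# Set the CA bundle environment variables before any HTTPS request is made.
import os
import certifi

os.environ["SSL_CERT_FILE"] = certifi.where()       # read by the ssl module
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()  # read by requests
```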
https://www.ncbi.nlm.nih.gov/books/NBK44538/, see Appendices C and D.
Code for data collection: https://github.com/mcallaghan/rapid-screening/blob/master/analysis/get_data/scrape_pbr.py
Used in publication: https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/s13643-020-01521-4
I started to have a look at our full SR database.
Maybe I should start by describing what I have here, and then we can iteratively think about what would be worth including. Perhaps we can also have another call.
We have 299 "projects" in Distiller in total. Quite a few of them are "tests" or other garbage. It is hard to say how many, but judging by the project names, there are probably at least 100 projects that should not be looked at at all, leaving about 200 projects.
Each of them has at least one "level", where a level can mean different things:

- "title screening"
- "abstract screening"
- "title + abstract screening"
- "full text screening"
- "data extraction"
- "abstract screening 1" vs "abstract screening 2"
- …

There is no clear nomenclature or metadata for this, but we often use the word "abstract" in the name of a level to indicate abstract screening.
The total number of "levels" (including garbage projects) is 1226. So in total we have 1226 cases in which "humans have decided to exclude x papers out of y" (sometimes x, y, or both are 0).
I filtered the levels for those that have "abstract" in the level name. These SHOULD all be about abstract screening, but there might be more under other names. This leaves 126 rows.
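A minimal sketch of that filtering step, assuming the level overview is exported to a CSV with a "level" column (the file name and column name are assumptions):

```python
# Keep only levels whose name contains "abstract" (case-insensitive).
import pandas as pd

levels = pd.read_csv("distiller_levels.csv")  # hypothetical export
abstract_levels = levels[levels["level"].str.contains("abstract", case=False, na=False)]
print(len(abstract_levels))  # should match the 126 rows mentioned above
```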
For your information, I have pasted some of the "statistics" I get for these below. Looking at the first row:

- It relates to an EFSA question, EFSA-Q-2012-00234, about Leishmaniosis. (From this you can find the EFSA output, https://efsa.onlinelibrary.wiley.com/doi/epdf/10.2903/sp.efsa.2014.EN-466, where page 19 gives a summary of the SR. Sometimes we also publish the concrete included/excluded references, but not always.)
- It covers a level/phase of the systematic review called "Title and abstract screening - Study eligibility form: Title and abstract screening" (so I think this is a "real" abstract screening, probably worth adding to your database or using in a simulation).
- It started with 961 references.
- We excluded 877 and included 84.
| project | level | References Added | Unreviewed | Some Reviews | Included | Excluded | Conflict | Fully Reviewed | Saved, Unsubmitted |
|----------------------------------------------|----------------------------------------------------------------------------------------------------------|------------------|------------|--------------|----------|----------|----------|----------------|--------------------|
| AHAW_EFSA-Q-2012-00234_Leishmaniosis | Title and abstract screening - Study eligibility form: Title and abstract screening | 961 | 0 | 0 | 84 | 877 | 0 | 961 | 0 |
| AHAW_EFSA-Q-2012-00234_Leishmaniosis | Full paper screening - Study eligibility form: Full paper screening of unclear title and abstract papers | | 0 | 0 | 23 | 61 | 0 | 84 | 0 |
| AHAW_EFSA-Q-2013-00546_EBL | Title Abstract screening - Title and abstract screening EBL | 5181 | 0 | 0 | 255 | 4926 | 0 | 5181 | 0 |
| AHAW_EFSA-Q-2013-00835_leishmaniasis | relevance - First stage screening (title and abstract) | 182 | 0 | 0 | 14 | 168 | 0 | 182 | 0 |
| AHAW_EFSA-Q-2013-00918_pox | Screening 1 - POX Screening 1 (title&abstract) | 86 | 0 | 0 | 37 | 49 | 0 | 86 | 0 |
| AHAW_EFSA-Q-2013-01034_PPR | Screening - PPR Screening 1 (title&abstract) | 1076 | 0 | 0 | 243 | 833 | 0 | 1076 | 0 |
| AHAW_EFSA-Q-2014-00187- VBD-review-GEOG-DIST | Title and abstract screening - Tittle and abstract screening | 816 | 15 | 0 | 255 | 521 | 12 | 801 | 0 |
| AHAW_EFSA-Q-2015-00160_PED | Title and abstract screening PED - Title and abstract screening PED | 1609 | 0 | 0 | 246 | 1363 | 0 | 1609 | 0 |
| AHAW_EFSA-Q-2016-00160_Bluetongue | Level 1 - Q3 screening title and abstracts | 287 | 0 | 0 | 103 | 184 | 0 | 287 | 0 |
| AHAW_EFSA-Q-2018-00141_ASF | ASF screening - ASF Title abstract Screening | 1512 | 0 | 0 | 89 | 1422 | 1 | 1512 | 0 |
| AHAW_EFSA-Q-2018-00269_AI_Monitoring | Title abstract screening - Title abstract screening | 47 | 47 | 0 | 0 | 0 | 0 | 0 | 0 |
| AHAW_EFSAQ201400187_DACRAH2_GeoDistribution | Title and abstract screening - Tittle and abstract screening | 5433 | 0 | 0 | 982 | 4451 | 0 | 5433 | 0 |
| AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ1 | ti/abstract screening - MIR_Tittle and abstract screening | 1756 | 0 | 0 | 679 | 1077 | 0 | 1756 | 0 |
| AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ2 | Level 1 - R0_Tittle and abstract screening | 145 | 0 | 0 | 107 | 38 | 0 | 145 | 0 |
| AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ3 | Level 1 - VecComp_Tittle and abstract screening | 703 | 27 | 0 | 327 | 349 | 0 | 676 | 0 |
| AMU_EFSA-Q-2015-00592_crowd | screening - Title and abstract screening | 371 | 0 | 0 | 25 | 346 | 0 | 371 | 0 |
| AMU_EFSA-Q-2016-00294_MLT- SR | Level 1 - LEVEL1 screening title and abstracts | 953 | 0 | 0 | 257 | 696 | 0 | 953 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014G+NS | Title and abstract screening - STEP 1 (Title and/or abstract): GRAM-POSITIVE - NON-SPORULATING BACTERIA | 875 | 113 | 393 | 16 | 353 | 0 | 369 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014G+S | Screening Title and Abstract - STEP 1 (Title and/or abstract): GRAM-POSITIVE -SPORULATING BACTERIA | 447 | 0 | 421 | 17 | 9 | 0 | 26 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014V | Title and Abstract screening - STEP 1 (Title and/or abstract): Viruses used for plant protection | 77 | 0 | 77 | 0 | 0 | 0 | 0 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014Y | Title and Abstract screening - STEP 1 (Title and/or abstract): YEASTS | 488 | 0 | 477 | 11 | 0 | 0 | 11 | 0 |
| BIOCONTAM_EFSA-Q-2014-00536_EAEC_Trial | Title and abstracts - Title and abstract screening | 240 | 0 | 100 | 106 | 34 | 0 | 140 | 0 |
| BIOCONTAM_EFSA-Q-2015-00028_DIOX_FARM | Level 1 _title and abstract - DIOXIN _ FARM / Title and abstract screening | 4202 | 0 | 0 | 503 | 3699 | 0 | 4202 | 0 |
| BIOCONTAM_EFSA-Q-2015-00028_DIOX_NP06C | Level 1 - RPA_IEH_updated / Title and abstract screening | 6101 | 0 | 0 | 2218 | 3883 | 0 | 6101 | 0 |
| BIOCONTAM_EFSA-Q-2015-00028_DIOX_NP07C | Level 1 - DIOXIN _TOXICOLOGY MODELS / Title and abstract screening | 4906 | 0 | 0 | 633 | 4273 | 0 | 4906 | 0 |
So one contribution to you could be these (at least 126) abstract screenings from our database, including their metadata. Some might be "half done", but you can see that from the numbers for references added, included, excluded, and conflicts; I would say nearly all are complete. I have automated all extractions, so the volume of SRs makes no difference to me.
The column with labels does not have the same name across all datasets. Some datasets use `final_included` (see cell 8 in this notebook), others use `label_included`, for example the Cohen datasets (see cell 4 in this notebook). Maybe there are more variations, but I didn't find any so far.
Then, the README of this repo says the following:

> To indicate labelling decisions, one can use "included" or "label_included". The latter label called "included" is needed to indicate the final included publications in the simulations.
It would be nice to make the name of the label column more consistent throughout all datasets.
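A minimal sketch of how such a harmonization could look; the variant list covers only the names mentioned above and may well be incomplete:

```python
# Rename whichever known label-column variant is present to "label_included".
import pandas as pd

VARIANTS = ["final_included", "label_included", "included"]

def normalize_label_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the label column renamed to 'label_included'."""
    for name in VARIANTS:
        if name in df.columns:
            return df.rename(columns={name: "label_included"})
    raise KeyError(f"no known label column found, expected one of {VARIANTS}")
```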
Did you use all these datasets in some form for simulation studies? And do you have numbers on how much effort could potentially be saved by using ASReview for a larger number of reviews?
This may not be strictly necessary for active learning, but it makes the data more meaningful and accessible on its own. In a structured format, it can be read by scripts without needing to go back to the source of the data.
I would very much prefer the inclusion criteria to be a list of criteria, each phrased as a boolean question; this is important for a project I am working on. It would also help to have a domain field giving the general field of the research, so researchers can be selective.
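A hypothetical sketch of how a dataset entry could carry the requested fields; "domain" and "inclusion_criteria" are not part of the current SYNERGY metadata schema and only illustrate the suggestion:

```python
# Hypothetical extension of a dataset entry (field names are assumptions).
entry = {
    "dataset_id": "Kwok_2020",
    "topic": "Virus Metagenomics",
    "domain": "Veterinary virology",  # hypothetical: general research field
    "inclusion_criteria": [           # hypothetical: boolean-question format
        "Does the study apply virus metagenomics?",
        "Does the study concern farm animals?",
    ],
}
```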