asreview / synergy-dataset Goto Github PK

SYNERGY - Open machine learning dataset on study selection in systematic reviews

License: Creative Commons Zero v1.0 Universal

Python 100.00%

systematic-reviews-datasets utrecht-university dataset graphs machine-learning citation-network natural-language-processing prisma scholarly-articles research

synergy-dataset's Introduction

ASReview: Active learning for Systematic Reviews

Systematically screening large amounts of textual data is time-consuming and often tiresome. The rapidly evolving field of Artificial Intelligence (AI) has allowed the development of AI-aided pipelines that assist in finding relevant texts for search tasks. A well-established approach to increasing efficiency is screening prioritization via Active Learning.

The Active learning for Systematic Reviews (ASReview) project, published in Nature Machine Intelligence implements different machine learning algorithms that interactively query the researcher. ASReview LAB is designed to accelerate the step of screening textual data with a minimum of records to be read by a human with no or very few false negatives. ASReview LAB will save time, increase the quality of output and strengthen the transparency of work when screening large amounts of textual data to retrieve relevant information. Active Learning will support decision-making in any discipline or industry.

ASReview software implements three different modes:

Oracle Screen textual data in interaction with the active learning model. The reviewer is the 'oracle', making the labeling decisions.
Exploration Explore or demonstrate ASReview LAB with a completely labeled dataset. This mode is suitable for teaching purposes.
Simulation Evaluate the performance of active learning models on fully labeled data. Simulations can be run in ASReview LAB or via the command line interface with more advanced options.

Installation

The ASReview software requires Python 3.8 or later. Detailed step-by-step instructions to install Python and ASReview are available for Windows and macOS users.

pip install asreview

Upgrade ASReview with the following command:

pip install --upgrade asreview

To install ASReview LAB with Docker, see Install with Docker.

How it works

Getting started

Getting Started with ASReview LAB.

Citation

If you wish to cite the underlying methodology of the ASReview software, please use the following publication in Nature Machine Intelligence:

van de Schoot, R., de Bruin, J., Schram, R. et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell 3, 125–133 (2021). https://doi.org/10.1038/s42256-020-00287-7

For citing the software, please refer to the specific release of the ASReview software on Zenodo https://doi.org/10.5281/zenodo.3345592. The menu on the right can be used to find the citation format of prevalence.

For more scientific publications on the ASReview software, go to asreview.ai/papers.

Contact

For an overview of the team working on ASReview, see ASReview Research Team. ASReview LAB is maintained by Jonathan de Bruin and Yongchao Terry Ma.

The best resources to find an answer to your question or ways to get in contact with the team are:

Documentation - asreview.readthedocs.io
Newsletter - asreview.ai/newsletter/subscribe
Quick tour - ASReview LAB quick tour
Issues or feature requests - ASReview issue tracker
FAQ - ASReview Discussions
Donation - asreview.ai/donate
Contact - [email protected]

License

The ASReview software has an Apache 2.0 LICENSE. The ASReview team accepts no responsibility or liability for the use of the ASReview tool or any direct or indirect damages arising out of the application of the tool.

synergy-dataset's People

Contributors

Stargazers

Watchers

synergy-dataset's Issues

Contribution of SRs from EFSA

I started to have a look at our full SR database.

Maybe I start to describe what I have here, and then we can iteratively think about what would be worth to include.
Maybe we can have as well an other call.

We have in total 299 "projects" in Distiller.
Quite some of them are "tests" or other garbage.
Hard to say how many, but by looking at the project names, there might be at least 100 project which should not be looked at at all.
so 200 projects remaining.

Each of it has at least "one level", in which a level could mean different things:

"title screening"
"abstract screening"
"title + abstract screening"
"full text screening"
"data extraction"
"abstract screening 1" vs "abstract screening 2".
......
.....

There is no "clear nomenclature" or metadata on this, but often we use the word "abstract" in the name of the level to indicate "abstract screening"

The number of "levels" in total (including garbage projects) is:
1226

So in total we have 1226 times , that
"humans have decided to exclude x papers out of y"

(sometimes x or y or x and y are 0)

I filtered the levels by the ones which have "abstract" in the "level name".
These SHOULD be all about abstract screening, but we might have more.

This leaves then 126 rows.

I just pasted here for you information, some of the "statistics" I get for these.

We can see that the first row is:

related to an EFSA question: EFSA-Q-2012-00234 and was about "Leishmaniosis" . (from this you could find the EFSA output: https://efsa.onlinelibrary.wiley.com/doi/epdf/10.2903/sp.efsa.2014.EN-466 where on page 19 you find a summary of the SR. Sometimes we publish as well the concrete references included / excluded, but not always
we look at a level/phase of the systematic review called "Title and abstract screening - Study eligibility form: Title and abstract screening"
(so I think, this is a "real" abstract screening, probably worth to add to your database or to be used in a simulation)
it started with 961 references
we excluded 877 and included 84

|                                      project |                                                                                                    level | References Added | Unreviewed | Some Reviews | Included | Excluded | Conflict | Fully Reviewed | Saved, Unsubmitted |
|----------------------------------------------|----------------------------------------------------------------------------------------------------------|------------------|------------|--------------|----------|----------|----------|----------------|--------------------|
|         AHAW_EFSA-Q-2012-00234_Leishmaniosis |                      Title and abstract screening - Study eligibility form: Title and abstract screening |              961 |          0 |            0 |       84 |      877 |        0 |            961 |                  0 |
|         AHAW_EFSA-Q-2012-00234_Leishmaniosis | Full paper screening - Study eligibility form: Full paper screening of unclear title and abstract papers |                  |          0 |            0 |       23 |       61 |        0 |             84 |                  0 |
|                   AHAW_EFSA-Q-2013-00546_EBL |                                              Title Abstract screening - Title and abstract screening EBL |             5181 |          0 |            0 |      255 |     4926 |        0 |           5181 |                  0 |
|         AHAW_EFSA-Q-2013-00835_leishmaniasis |                                                   relevance - First stage screening (title and abstract) |              182 |          0 |            0 |       14 |      168 |        0 |            182 |                  0 |
|                   AHAW_EFSA-Q-2013-00918_pox |                                                           Screening 1 - POX Screening 1 (title&abstract) |               86 |          0 |            0 |       37 |       49 |        0 |             86 |                  0 |
|                   AHAW_EFSA-Q-2013-01034_PPR |                                                             Screening - PPR Screening 1 (title&abstract) |             1076 |          0 |            0 |      243 |      833 |        0 |           1076 |                  0 |
| AHAW_EFSA-Q-2014-00187- VBD-review-GEOG-DIST |                                             Title and abstract screening - Tittle and abstract screening |              816 |         15 |            0 |      255 |      521 |       12 |            801 |                  0 |
|                   AHAW_EFSA-Q-2015-00160_PED |                                      Title and abstract screening PED - Title and abstract screening PED |             1609 |          0 |            0 |      246 |     1363 |        0 |           1609 |                  0 |
|            AHAW_EFSA-Q-2016-00160_Bluetongue |                                                               Level 1 - Q3 screening title and abstracts |              287 |          0 |            0 |      103 |      184 |        0 |            287 |                  0 |
|                   AHAW_EFSA-Q-2018-00141_ASF |                                                             ASF screening - ASF Title abstract Screening |             1512 |          0 |            0 |       89 |     1422 |        1 |           1512 |                  0 |
|         AHAW_EFSA-Q-2018-00269_AI_Monitoring |                                                      Title abstract screening - Title abstract screening |               47 |         47 |            0 |        0 |        0 |        0 |              0 |                  0 |
|  AHAW_EFSAQ201400187_DACRAH2_GeoDistribution |                                             Title and abstract screening - Tittle and abstract screening |             5433 |          0 |            0 |      982 |     4451 |        0 |           5433 |                  0 |
|       AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ1 |                                                ti/abstract screening - MIR_Tittle and abstract screening |             1756 |          0 |            0 |      679 |     1077 |        0 |           1756 |                  0 |
|       AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ2 |                                                               Level 1 - R0_Tittle and abstract screening |              145 |          0 |            0 |      107 |       38 |        0 |            145 |                  0 |
|       AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ3 |                                                          Level 1 - VecComp_Tittle and abstract screening |              703 |         27 |            0 |      327 |      349 |        0 |            676 |                  0 |
|                  AMU_EFSA-Q-2015-00592_crowd |                                                                 screening - Title and abstract screening |              371 |          0 |            0 |       25 |      346 |        0 |            371 |                  0 |
|                AMU_EFSA-Q-2016-00294_MLT- SR |                                                           Level 1 - LEVEL1 screening title and abstracts |              953 |          0 |            0 |      257 |      696 |        0 |            953 |                  0 |
|      BIOCONTAM_EFSA-Q-2014-00189_QPS2014G+NS |  Title and abstract screening - STEP 1 (Title and/or abstract): GRAM-POSITIVE - NON-SPORULATING BACTERIA |              875 |        113 |          393 |       16 |      353 |        0 |            369 |                  0 |
|       BIOCONTAM_EFSA-Q-2014-00189_QPS2014G+S |       Screening Title and Abstract - STEP 1 (Title and/or abstract): GRAM-POSITIVE -SPORULATING BACTERIA |              447 |          0 |          421 |       17 |        9 |        0 |             26 |                  0 |
|         BIOCONTAM_EFSA-Q-2014-00189_QPS2014V |         Title and Abstract screening - STEP 1 (Title and/or abstract): Viruses used for plant protection |               77 |          0 |           77 |        0 |        0 |        0 |              0 |                  0 |
|         BIOCONTAM_EFSA-Q-2014-00189_QPS2014Y |                                   Title and Abstract screening - STEP 1 (Title and/or abstract):  YEASTS |              488 |          0 |          477 |       11 |        0 |        0 |             11 |                  0 |
|       BIOCONTAM_EFSA-Q-2014-00536_EAEC_Trial |                                                       Title and abstracts - Title and abstract screening |              240 |          0 |          100 |      106 |       34 |        0 |            140 |                  0 |
|        BIOCONTAM_EFSA-Q-2015-00028_DIOX_FARM |                               Level 1 _title and abstract - DIOXIN _ FARM / Title and abstract screening |             4202 |          0 |            0 |      503 |     3699 |        0 |           4202 |                  0 |
|       BIOCONTAM_EFSA-Q-2015-00028_DIOX_NP06C |                                                 Level 1 - RPA_IEH_updated / Title and abstract screening |             6101 |          0 |            0 |     2218 |     3883 |        0 |           6101 |                  0 |
|       BIOCONTAM_EFSA-Q-2015-00028_DIOX_NP07C |                                       Level 1 - DIOXIN _TOXICOLOGY MODELS / Title and abstract screening |             4906 |          0 |            0 |      633 |     4273 |        0 |           4906 |                  0 |

So one contribution to you could be the (at least 126) abstract screenings from our database, including its meta-data:

title
authors
DOI (not always present)
year (not always present)
journal (not always present)
label (excluded / included)

Some might be "half done", but that you could see from the numbers of "total papers", "included", "excluded" , "conflict".
I would say "nearly all" are complete.

I have automated all extractions, so the "volume of SRs" does not make any difference for me.

original systematic review for the PTSD dataset

It turns out there is a discrepancy of 4 inclusions between the ptsd-dataset in this repo (38 inclusions) and the systematic review we linked it to (34 inclusions).

For the ptsd dataset, up until now we refer to https://doi.org/10.1080/00273171.2017.1412293, a systematic review (1) reporting 34 inclusions. It turns out that the .ris files in the corresponding osf-page contain 38 inclusions.

This number belongs to another systematic review (2) (on the same dataset), http://dx.doi.org/10.1080/10705511.2016.1247646. On the corresponding osf-page, https://osf.io/6vdfk/ however, there are no .ris files uploaded (yet).

I think we have refer to one or the other paper. For the 34 inclusions paper we need to update the dataset by recoding the 4 inclusions to exclusions (see comment by @J535D165 below). For the paper with 38 inclusions, the osf-page should be updated (@Rensvandeschoot), and all information in the documentation on systematic review 1 should be replaced with information on systematic review 2.

Any thoughts?

Thanks @terrymyc! Thanks for your contribution.

A couple of additions and remarks are listed below:

Exclude 4 more papers

~Based on the paper, the team excluded 4 more papers. 34 of the 38 papers were described in the paper. The papers listed below have to be excluded (isn't it? @Rensvandeschoot @GerbrichFerdinands ): ~

~~Sterling, M., Hendrikz, J., & Kenardy, J. (2010). Compensation claim lodgement and health outcome developmental trajectories following whiplash injury: A prospective study. Pain, 150(1), 22-28.~~
Hou, W. K., Law, C. C., Yin, J., & Fu, Y. T. (2010). Resource Loss, Resource Gain, and Psychological Resilience and Dysfunction Following Cancer Diagnosis: A Growth Mixture Modeling Approach. Health Psychology, 29(5), 484-495. doi:10.1037/a0020809
Mason, S. T., Corry, N., Gould, N. F., Amoyal, N., Gabriel, V., Wiechman-Askay, S., . . . Fauerbach, J. A. (2010). Growth curve trajectories of distress in burn patients. Journal of Burn Care and Research, 31(1), 64-72. doi:10.1097/BCR.0b013e3181cb8ee6
Pérez, S., Conchado, A., Andreu, Y., Galdón, M. J., Cardeña, E., Ibáñez, E., & Durá, E. (2016). Acute stress trajectories 1 year after a breast cancer diagnosis. Supportive Care in Cancer, 24(4), 1671-1678. doi:10.1007/s00520-015-2960-x

Connect to RIS files on OSF

The RIS files are now available on OSF. Can you connect them to your code and remove the ones in the Github repository?

Count duplicates

Thanks for your statistics so far. Can you count the number of duplicate items as well? Please don't make things too complicated, count check of duplicate abstracts for example.

It turns out that @qubixes is also doing some work on the dataset statistics. This is implemented in an extension for asreview https://github.com/asreview/asreview-statistics. It might be interesting to have a look. It would be nice to integrate the functionality with this repo (not for now).

Originally posted by @J535D165 in #13 (comment)

Kwok dataset doesn't have files on persistent location

Files were shared directly with us. For this reason, no files are found in the OSF url. I removed this entry as I'm cleaning the repo from source files.

{
  "dataset_id": "Kwok_2020",
  "url": "https://raw.githubusercontent.com/asreview/systematic-review-datasets/master/datasets/Kwok_2020/output/Kwok_2020.csv",
  "reference": "https://doi.org/10.3390/v12010107",
  "link": "https://doi.org/10.17605/OSF.IO/5S27M",
  "license": "CC-BY Attribution 4.0 International",
  "title": "Virus Metagenomics in Farm Animals: A Systematic Review",
  "authors": [
    "Kwok, K. T. T.", 
    "Nieuwenhuijse, D. F.", 
    "Phan, M. V. T.", 
    "Koopmans, M. P. G."
    ],
  "year": 2020,
  "topic": "Virus Metagenomics",
  "final_inclusions": true,
  "title_abstract_inclusions": false
}

Additional datapoints: domain and inclusion criteria

This may not be very necessary for active learning, but it makes the data more meaningful, and accessible on its own. In a structured format, it can be read using scripts without needing to go to the source of the data.

Would very much prefer if the inclusion criteria is a list of the criteria all in boolean question format. This is important for a project I am working on.

And domain, the general field of the research so researchers can be selective.

Deal with different reviewers regarding inclusion / exclusion ?

I see here, that you only use label2 to decide on inclusion. Can you motivate this ?

https://github.com/asreview/systematic-review-datasets/blob/d10528b2a26fc7f87caa7148adbcf03f5351fb1c/datasets/Cohen_EBM/clean_cohen.py#L35

We have as well the case, that our screening gets done by 2 people.
We tend to include, if at least one reviewer wants to include.

You seem to do this differently.

wilson dataset - 1/3 of abstracts is missing

About 1/3 of the abstracts in the wilson dataset is missing (of which 3 inclusions).

I think it would be worthwhile to see if we could figure out what causes this missingness.
One of the authors (Dr. Hannah Ewald) named the following possible causes:

572/1090 missing abstracts are from the years 1912 to 1989.

The rest comes from both Embase and Medline and I can’t see any systematic error (i.e. there is mixed page numbers, all names from A to Z, all indexed as journal articles, different languages and journals). Although it’s odd in such a high number, maybe they just don’t have abstracts.

I am awaiting response from the first author (dr. Christian Appenzeller-Herzog) who is currently out of office.

In the Kwok data the label is final_included (see cell 8 in this notebook),
It looks like that most datasets have a label column called label_included. For example in the Cohen datasets (see cell 4 in this notebook.

Maybe there are more variations, but I didn't find them this far.

Then, the README of this repo says the following:

To indicate labelling decisions, one can use "included" or "label_included". The latter label called "included" is needed to indicate the final included publications in the simulations.

It would be nice to make the name of the label column more consistent throughout all datasets.

Suggested dataset - Particle bean radiation therapies for cancer

https://www.ncbi.nlm.nih.gov/books/NBK44538/ see appendix C and D.

Code for data collection: https://github.com/mcallaghan/rapid-screening/blob/master/analysis/get_data/scrape_pbr.py
Used in publication: https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/s13643-020-01521-4