asreview / synergy-dataset

SYNERGY - Open machine learning dataset on study selection in systematic reviews

License: Creative Commons Zero v1.0 Universal
About one third of the abstracts in the Wilson dataset are missing (3 of which are inclusions).
I think it would be worthwhile to see if we could figure out what causes this missingness.
One of the authors (Dr. Hannah Ewald) named the following possible causes:
- 572/1090 missing abstracts are from the years 1912 to 1989.
- The rest come from both Embase and Medline, and I can't see any systematic error (i.e. there are mixed page numbers, all names from A to Z, all indexed as journal articles, different languages and journals). Although it's odd for such a high number, maybe they just don't have abstracts.
I am awaiting a response from the first author (Dr. Christian Appenzeller-Herzog), who is currently out of office.
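In the meantime, a minimal sketch for checking whether the missing abstracts cluster by publication year, as in the 572/1090 figure above. The file name and the "abstract"/"year" column names are assumptions, not the repo's actual tooling:

```python
# Hypothetical check: do blank abstracts cluster before 1990?
import pandas as pd

df = pd.read_csv("Wilson_2018.csv")  # hypothetical export of the dataset
blank = df["abstract"].fillna("").str.strip() == ""
# Count blank-abstract records pre-1990 (True) vs 1990 onward (False).
print(df.loc[blank, "year"].lt(1990).value_counts())
```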
It turns out there is a discrepancy of 4 inclusions between the ptsd-dataset in this repo (38 inclusions) and the systematic review we linked it to (34 inclusions).
For the ptsd dataset, up until now we refer to https://doi.org/10.1080/00273171.2017.1412293, a systematic review (1) reporting 34 inclusions. It turns out that the .ris files on the corresponding OSF page contain 38 inclusions.
This number belongs to another systematic review (2) on the same dataset, http://dx.doi.org/10.1080/10705511.2016.1247646. On the corresponding OSF page, https://osf.io/6vdfk/, however, there are no .ris files uploaded (yet).
I think we have to refer to one or the other paper. For the 34-inclusions paper, we need to update the dataset by recoding the 4 inclusions to exclusions (see the comment by @J535D165 below). For the paper with 38 inclusions, the OSF page should be updated (@Rensvandeschoot), and all information in the documentation on systematic review 1 should be replaced with information on systematic review 2.
Any thoughts?
Thanks for your contribution, @terrymyc!
A couple of additions and remarks are listed below:
~~Based on the paper, the team excluded 4 more papers. 34 of the 38 papers were described in the paper. The papers listed below have to be excluded (right, @Rensvandeschoot @GerbrichFerdinands?):~~
The RIS files are now available on OSF. Can you connect them to your code and remove the ones in the GitHub repository?
Thanks for your statistics so far. Can you count the number of duplicate items as well? Please don't make things too complicated; a simple check for duplicate abstracts, for example, would be enough.
It turns out that @qubixes is also doing some work on the dataset statistics. This is implemented in an extension for asreview https://github.com/asreview/asreview-statistics. It might be interesting to have a look. It would be nice to integrate the functionality with this repo (not for now).
Originally posted by @J535D165 in #13 (comment)
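A minimal sketch of the duplicate check requested above; the CSV path and the "title"/"abstract" column names are assumptions:

```python
# Hypothetical duplicate count on lowercased titles and abstracts.
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical path
dup_abstracts = df["abstract"].dropna().str.lower().duplicated().sum()
dup_titles = df["title"].dropna().str.lower().duplicated().sum()
print(f"duplicate abstracts: {dup_abstracts}, duplicate titles: {dup_titles}")
```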
The files were shared directly with us; for this reason, no files are found at the OSF URL. I removed this entry as I'm cleaning source files out of the repo.
{
  "dataset_id": "Kwok_2020",
  "url": "https://raw.githubusercontent.com/asreview/systematic-review-datasets/master/datasets/Kwok_2020/output/Kwok_2020.csv",
  "reference": "https://doi.org/10.3390/v12010107",
  "link": "https://doi.org/10.17605/OSF.IO/5S27M",
  "license": "CC-BY Attribution 4.0 International",
  "title": "Virus Metagenomics in Farm Animals: A Systematic Review",
  "authors": [
    "Kwok, K. T. T.",
    "Nieuwenhuijse, D. F.",
    "Phan, M. V. T.",
    "Koopmans, M. P. G."
  ],
  "year": 2020,
  "topic": "Virus Metagenomics",
  "final_inclusions": true,
  "title_abstract_inclusions": false
}
I see here that you only use label2 to decide on inclusion. Can you motivate this?
Our screening is also done by two people, and we tend to include a record if at least one reviewer wants to include it. You seem to do this differently.
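For reference, a sketch of the "include if at least one reviewer includes" rule described above; the reviewer column names are hypothetical:

```python
# Hypothetical two-reviewer labels; 1 = include, 0 = exclude.
import pandas as pd

df = pd.DataFrame({"label1": [1, 0, 0, 1], "label2": [0, 0, 1, 1]})
df["included"] = df[["label1", "label2"]].max(axis=1)  # logical OR over reviewers
print(df)
```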
I received an `ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)` error when trying to run `synergy_dataset get` on macOS Ventura 13.6.6.
It is a known issue on macOS that I was able to solve by running `open /Applications/Python\ 3.9/Install\ Certificates.command` in a terminal.
This is a common issue on macOS where Python is not able to verify the SSL certificate provided by the server. To fix it, we use the Install Certificates.command script that ships with the python.org installer for macOS.
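An alternative workaround that is sometimes used on macOS (an assumption, not the project's documented fix) is to point Python's SSL machinery at certifi's CA bundle before downloading:

```python
# Set the CA bundle environment variables before any HTTPS request is made.
import os
import certifi

os.environ["SSL_CERT_FILE"] = certifi.where()       # read by the ssl module
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()  # read by requests
```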
https://www.ncbi.nlm.nih.gov/books/NBK44538/, see Appendices C and D.
Code for data collection: https://github.com/mcallaghan/rapid-screening/blob/master/analysis/get_data/scrape_pbr.py
Used in publication: https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/s13643-020-01521-4
I started to have a look at our full SR database.
Maybe I should start by describing what I have here, and then we can iteratively think about what would be worth including. Perhaps we can also have another call.
We have 299 "projects" in Distiller in total. Quite a few of them are "tests" or other garbage. It is hard to say how many, but judging by the project names, there are probably at least 100 projects that should not be looked at at all, leaving about 200 projects.
Each of them has at least one "level", where a level can mean different things:

- "title screening"
- "abstract screening"
- "title + abstract screening"
- "full text screening"
- "data extraction"
- "abstract screening 1" vs "abstract screening 2"
- …

There is no clear nomenclature or metadata for this, but we often use the word "abstract" in the name of a level to indicate abstract screening.
The total number of "levels" (including garbage projects) is 1226. So in total we have 1226 cases in which "humans have decided to exclude x papers out of y" (sometimes x, y, or both are 0).
I filtered the levels for those that have "abstract" in the level name. These SHOULD all be about abstract screening, but there might be more under other names. This leaves 126 rows.
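A minimal sketch of that filtering step, assuming the level overview is exported to a CSV with a "level" column (the file name and column name are assumptions):

```python
# Keep only levels whose name contains "abstract" (case-insensitive).
import pandas as pd

levels = pd.read_csv("distiller_levels.csv")  # hypothetical export
abstract_levels = levels[levels["level"].str.contains("abstract", case=False, na=False)]
print(len(abstract_levels))  # should match the 126 rows mentioned above
```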
For your information, I have pasted some of the "statistics" I get for these below. Looking at the first row:

- It relates to an EFSA question, EFSA-Q-2012-00234, about Leishmaniosis. (From this you can find the EFSA output, https://efsa.onlinelibrary.wiley.com/doi/epdf/10.2903/sp.efsa.2014.EN-466, where page 19 gives a summary of the SR. Sometimes we also publish the concrete included/excluded references, but not always.)
- It covers a level/phase of the systematic review called "Title and abstract screening - Study eligibility form: Title and abstract screening" (so I think this is a "real" abstract screening, probably worth adding to your database or using in a simulation).
- It started with 961 references.
- We excluded 877 and included 84.
| project | level | References Added | Unreviewed | Some Reviews | Included | Excluded | Conflict | Fully Reviewed | Saved, Unsubmitted |
|----------------------------------------------|----------------------------------------------------------------------------------------------------------|------------------|------------|--------------|----------|----------|----------|----------------|--------------------|
| AHAW_EFSA-Q-2012-00234_Leishmaniosis | Title and abstract screening - Study eligibility form: Title and abstract screening | 961 | 0 | 0 | 84 | 877 | 0 | 961 | 0 |
| AHAW_EFSA-Q-2012-00234_Leishmaniosis | Full paper screening - Study eligibility form: Full paper screening of unclear title and abstract papers | | 0 | 0 | 23 | 61 | 0 | 84 | 0 |
| AHAW_EFSA-Q-2013-00546_EBL | Title Abstract screening - Title and abstract screening EBL | 5181 | 0 | 0 | 255 | 4926 | 0 | 5181 | 0 |
| AHAW_EFSA-Q-2013-00835_leishmaniasis | relevance - First stage screening (title and abstract) | 182 | 0 | 0 | 14 | 168 | 0 | 182 | 0 |
| AHAW_EFSA-Q-2013-00918_pox | Screening 1 - POX Screening 1 (title&abstract) | 86 | 0 | 0 | 37 | 49 | 0 | 86 | 0 |
| AHAW_EFSA-Q-2013-01034_PPR | Screening - PPR Screening 1 (title&abstract) | 1076 | 0 | 0 | 243 | 833 | 0 | 1076 | 0 |
| AHAW_EFSA-Q-2014-00187- VBD-review-GEOG-DIST | Title and abstract screening - Tittle and abstract screening | 816 | 15 | 0 | 255 | 521 | 12 | 801 | 0 |
| AHAW_EFSA-Q-2015-00160_PED | Title and abstract screening PED - Title and abstract screening PED | 1609 | 0 | 0 | 246 | 1363 | 0 | 1609 | 0 |
| AHAW_EFSA-Q-2016-00160_Bluetongue | Level 1 - Q3 screening title and abstracts | 287 | 0 | 0 | 103 | 184 | 0 | 287 | 0 |
| AHAW_EFSA-Q-2018-00141_ASF | ASF screening - ASF Title abstract Screening | 1512 | 0 | 0 | 89 | 1422 | 1 | 1512 | 0 |
| AHAW_EFSA-Q-2018-00269_AI_Monitoring | Title abstract screening - Title abstract screening | 47 | 47 | 0 | 0 | 0 | 0 | 0 | 0 |
| AHAW_EFSAQ201400187_DACRAH2_GeoDistribution | Title and abstract screening - Tittle and abstract screening | 5433 | 0 | 0 | 982 | 4451 | 0 | 5433 | 0 |
| AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ1 | ti/abstract screening - MIR_Tittle and abstract screening | 1756 | 0 | 0 | 679 | 1077 | 0 | 1756 | 0 |
| AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ2 | Level 1 - R0_Tittle and abstract screening | 145 | 0 | 0 | 107 | 38 | 0 | 145 | 0 |
| AHAW_EFSA_Q_-2014-00187-VECTORNET-OBJ3 | Level 1 - VecComp_Tittle and abstract screening | 703 | 27 | 0 | 327 | 349 | 0 | 676 | 0 |
| AMU_EFSA-Q-2015-00592_crowd | screening - Title and abstract screening | 371 | 0 | 0 | 25 | 346 | 0 | 371 | 0 |
| AMU_EFSA-Q-2016-00294_MLT- SR | Level 1 - LEVEL1 screening title and abstracts | 953 | 0 | 0 | 257 | 696 | 0 | 953 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014G+NS | Title and abstract screening - STEP 1 (Title and/or abstract): GRAM-POSITIVE - NON-SPORULATING BACTERIA | 875 | 113 | 393 | 16 | 353 | 0 | 369 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014G+S | Screening Title and Abstract - STEP 1 (Title and/or abstract): GRAM-POSITIVE -SPORULATING BACTERIA | 447 | 0 | 421 | 17 | 9 | 0 | 26 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014V | Title and Abstract screening - STEP 1 (Title and/or abstract): Viruses used for plant protection | 77 | 0 | 77 | 0 | 0 | 0 | 0 | 0 |
| BIOCONTAM_EFSA-Q-2014-00189_QPS2014Y | Title and Abstract screening - STEP 1 (Title and/or abstract): YEASTS | 488 | 0 | 477 | 11 | 0 | 0 | 11 | 0 |
| BIOCONTAM_EFSA-Q-2014-00536_EAEC_Trial | Title and abstracts - Title and abstract screening | 240 | 0 | 100 | 106 | 34 | 0 | 140 | 0 |
| BIOCONTAM_EFSA-Q-2015-00028_DIOX_FARM | Level 1 _title and abstract - DIOXIN _ FARM / Title and abstract screening | 4202 | 0 | 0 | 503 | 3699 | 0 | 4202 | 0 |
| BIOCONTAM_EFSA-Q-2015-00028_DIOX_NP06C | Level 1 - RPA_IEH_updated / Title and abstract screening | 6101 | 0 | 0 | 2218 | 3883 | 0 | 6101 | 0 |
| BIOCONTAM_EFSA-Q-2015-00028_DIOX_NP07C | Level 1 - DIOXIN _TOXICOLOGY MODELS / Title and abstract screening | 4906 | 0 | 0 | 633 | 4273 | 0 | 4906 | 0 |
So one contribution to you could be these (at least 126) abstract screenings from our database, including their metadata. Some might be "half done", but you can see that from the numbers for references added, included, excluded, and conflicts; I would say nearly all are complete. I have automated all extractions, so the volume of SRs makes no difference to me.
The column with labels does not have the same name across all datasets. Some datasets use `final_included` (see cell 8 in this notebook), others use `label_included`, for example the Cohen datasets (see cell 4 in this notebook). Maybe there are more variations, but I didn't find any so far.
Then, the README of this repo says the following:

> To indicate labelling decisions, one can use "included" or "label_included". The latter label called "included" is needed to indicate the final included publications in the simulations.
It would be nice to make the name of the label column more consistent throughout all datasets.
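A minimal sketch of how such a harmonization could look; the variant list covers only the names mentioned above and may well be incomplete:

```python
# Rename whichever known label-column variant is present to "label_included".
import pandas as pd

VARIANTS = ["final_included", "label_included", "included"]

def normalize_label_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the label column renamed to 'label_included'."""
    for name in VARIANTS:
        if name in df.columns:
            return df.rename(columns={name: "label_included"})
    raise KeyError(f"no known label column found, expected one of {VARIANTS}")
```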
Did you use all these datasets in some form for simulation studies? And do you have numbers on how much effort could potentially be saved by using ASReview for a larger number of reviews?
This may not be strictly necessary for active learning, but it makes the data more meaningful and accessible on its own. In a structured format, it can be read by scripts without needing to go back to the source of the data.
I would very much prefer the inclusion criteria to be a list of criteria, each phrased as a boolean question; this is important for a project I am working on. It would also help to have a domain field giving the general field of the research, so researchers can be selective.
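A hypothetical sketch of how a dataset entry could carry the requested fields; "domain" and "inclusion_criteria" are not part of the current SYNERGY metadata schema and only illustrate the suggestion:

```python
# Hypothetical extension of a dataset entry (field names are assumptions).
entry = {
    "dataset_id": "Kwok_2020",
    "topic": "Virus Metagenomics",
    "domain": "Veterinary virology",  # hypothetical: general research field
    "inclusion_criteria": [           # hypothetical: boolean-question format
        "Does the study apply virus metagenomics?",
        "Does the study concern farm animals?",
    ],
}
```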