
citesource's Introduction

CiteSource

R-CMD-check Status License: GPL v3

About the Package

CiteSource was developed to give researchers the ability to examine the utility and efficacy of literature resources and search methodologies. The idea behind CiteSource is simple: allow users to deduplicate citation records while maintaining customizable metadata about each citation.

Development

Development of this project began as part of the Evidence Synthesis Hackathon at the Evidence Synthesis & Meta-Analysis in R Conference (ESMARConf) 2022. To learn more about this conference and hackathon, please visit https://esmarconf.org/

License

CiteSource was created under the GNU General Public License (>= v3).

Features

Customizable Metadata Tags

Users can provide customizable metadata in three fields: cite_source, cite_string, and cite_label. Metadata can include anything from a resource name (e.g. Web of Science, LENS.org, PubMed), a method (database search, handsearching, citation snowballing), a variation used within a method (WoS string #1, WoS string #2, WoS string #3), a research phase (search, Ti/Ab screening, full-text screening), or a unique group of citations (benchmarking articles, articles from a previous review, articles with a specific author affiliation).

Record Merging

The CiteSource deduplication process is better described as record merging, because the customizable metadata from duplicate records is maintained through the creation of a single, primary record. Beyond merging the customizable metadata, the primary record is built using the most complete metadata available among the duplicate records (currently the DOI and Abstract fields). The ASySD package, developed by Kaitlyn Hair, serves as the backbone of this process.

Table and Plot Visualizations

Once records are deduplicated, users are able to easily create plots and tables to answer specific questions or to simply explore the data in an effort to develop new hypotheses. Examples of analysis may include how many unique records a specific source contributed or how traditional methods of searching fare against a new AI discovery tool in finding relevant articles. Users may want to understand the overlap in records between two different search strings or evaluate the impact of including Google Scholar in a review. Before searching, a user may even develop a targeted search to better understand the topical coverage across databases that they intend to search, and once the search has been developed, how a particular source, string, or method performed in discovering benchmarking articles.

Getting Started

Installation

Install CiteSource from GitHub with the remotes package:
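
# install.packages("remotes")  # if the remotes package is not already installed
remotes::install_github("ESHackathon/CiteSource")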

Vignettes

Vignettes covering various use cases can be found on the CiteSource web page.

Feedback

Be sure to check out our discussion page to engage with us or to learn more about the various use cases for CiteSource. You can provide comments/suggestions or suggest a vignette for a specific use case.

citesource's People

Contributors

actions-user, chriscpritchard, drmattg, kaitlynhair, lukaswallrich, nealhaddaway, rootsandberries, tnriley


Forkers

tnriley

citesource's Issues

Error in dedup_citations--no applicable method

I'm running through Kaitlyn's example code with a different set of files (files are in test data: ASFAb_70.ris, Greenfile_40.ris, Scopus_100.ris, WoS_90.ris). When I get to dedup_citations I get the following:

> dedup_results <- dedup_citations(citations, merge_citations = TRUE)
[1] "formatting data..."
[1] "identifying potential duplicates..."
 Error in UseMethod("mutate") : 
no applicable method for 'mutate' applied to an object of class "NULL"

Angle axis labels in comparison bar plot

The label names I chose for my comparison bar plot are 'included', 'screened' and 'search'. Given that I'm comparing across eight sources, the labels are getting smushed together. Can we angle the labels at the bottom of the bar plot to make them more readable?
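
Assuming the comparison bar plot is a ggplot2 object, a minimal sketch of angling the labels (my_plot is a hypothetical placeholder for the returned plot object, not an existing CiteSource name):

library(ggplot2)

# Sketch: rotate the x-axis labels 45 degrees so long label names do not overlap
my_plot +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))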

Records with no metadata added

When we were testing the new files and vignette, we noticed an odd problem (which I'm now realizing was also occurring when I ran my sample files). When the charts are created, there is a group of records that seem to be untagged with a source name (just called source_). It's unclear where they're coming from, other than that they come from the Screening and Final files (not the source files), and they appear to have no other metadata associated with them, like titles, etc. When I looked at the various dataframes created along the way, I noticed a bunch of rows of NAs appearing in the n_unique file.

Not sure how to better explain this!

read_citations warning

I get this warning when using read_citations: "Warning message:
In if (is.na(cite_sources)) { :
the condition has length > 1 and only the first element will be used"

Everything works okay - I was just confused as to why I get the warning.

file.exists(paste0(here::here(), "/tests/testthat/data"))
list.files(paste0(here::here(), "/tests/testthat/data"))
files <- list.files(paste0(here::here(), "/tests/testthat/data"), full.names = TRUE)
risfiles <- read_citations(files = c(files[1], files[2]),
                           cite_sources = c("Ovid", "WoS"))
head(risfiles)
names(risfiles)
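
The warning appears because is.na(cite_sources) returns one value per file, and if() only uses the first. A hedged sketch of a vector-safe check (illustrating the general pattern, not necessarily the fix applied in the package):

cite_sources <- c("Ovid", "WoS")

# is.na() is vectorised, so if (is.na(cite_sources)) warns and only tests "Ovid"
is.na(cite_sources)
#> [1] FALSE FALSE

# Vector-safe alternative: collapse to a single logical before testing
if (all(is.na(cite_sources))) {
  cite_sources <- NULL
}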

About Text

For the About -> Use Cases

CiteSource provides users with the ability to deduplicate references while maintaining customizable metadata. Instead of the traditional deduplication approach, where records are removed and only one duplicate record is retained, duplicate records are merged into a single record. This single record maintains user-customized metadata in two fields, Source and Tag. In the process of merging records, select metadata fields are also automatically compared (currently DOI & Abstract), and the field from the record with the most complete metadata is used to create the merged record. Below are a few examples of general use cases, followed by specific examples.

Source/Method Analysis
Analyze the number and percent of overlapping & unique citations between multiple .ris files

  • Database/Platform/Index
  • Methodology
  • Search string/strategy

Examples:

  • Databases/Database: GreenFile vs. CAB Direct vs. Aquatic Sciences and Fisheries Abstracts (ASFA) vs. Water Resources Abstracts
  • Platform/Indexes: Web of Science- Science Citation Index Expanded (YR-YR) vs. Core Collection vs. "ALL Databases" (YR-YR)
  • Search Engine/Database: Google scholar vs. Web of Science
  • Methodology/Methodology: Hand searching vs. citation chasing vs. naive string
  • String/strategy: ASFA string 1 vs. ASFA string 2 vs. ASFA string 3

Visual example placeholder (percent and overlap plots)

Stage/Topic Analysis
Analyze the number and percent of overlapping & unique citations between multiple .ris files to see changes over review stages (initial search results, post-screening, final included sources), OR to drill down and understand overlapping/unique content as it relates to topics or variables (for instance, if citations are tagged during screening for a systematic map, the user can better understand how each database contributed to the literature base - methodology, geography, etc.)

  • Review Phase (change in corpus over time/stages)
  • Topic based

Examples:
Databases/Database (Stage Analysis):

GreenFile (Search, Screen, Final)
Aquatic Sciences and Fisheries Abstracts (Search, Screen, Final)

Visual example placeholder

Metadata Enhancement
CiteSource provides you with the ability to create a single record that includes preferred metadata for selected fields, based on metadata attributes (filled or empty, length). At some point in the future, users may be able to select Source data as preferred over these rules.

Basic logic for metadata selection (filled/empty + length)

IF one record's DOI field contains text and the others' do not,
THEN the metadata from the record with text is used for the merged record.

IF one record's ABSTRACT field contains more text than the other records' ABSTRACT fields,
THEN the metadata from the record with the most text is used for the merged record.
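
A minimal sketch of this filled/length logic for a single field across two duplicate records (illustrative only; the function and record objects below are hypothetical, not CiteSource internals):

# Illustrative sketch only -- not CiteSource's internal implementation
field_length <- function(x) if (is.na(x)) 0L else nchar(x)

# Keep whichever value is filled, and prefer the longer text when both are filled
pick_field <- function(a, b) if (field_length(b) > field_length(a)) b else a

record_1 <- list(doi = NA, abstract = "Short abstract.")
record_2 <- list(doi = "10.1000/example", abstract = "A longer, more complete abstract.")

merged <- list(
  doi      = pick_field(record_1$doi, record_2$doi),
  abstract = pick_field(record_1$abstract, record_2$abstract)
)
# merged$doi comes from record_2; merged$abstract is the longer abstract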

Blank cluster on bar plot due to citelabel files


Files without cite_source tags should not appear in the bar chart. In this bar plot, the blank cluster represents the screened and final references, which are only tagged in the cite_label field.

Bar plot facet specification

Allow users to specify facet variables. Currently, the bar plot is set up to facet on the source. We should provide an option to facet on the label.
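
One way the option could look, assuming the bar plot is a ggplot object underneath (facet_by and my_plot are hypothetical names used for illustration):

# Sketch: facet the bar plot by a user-chosen variable instead of always by source
facet_by <- "cite_label"

my_plot +
  ggplot2::facet_wrap(ggplot2::vars(.data[[facet_by]]))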

Do we need a "deduplication tab" in the Shiny?

Originally there was a tab for "deduplication" - I was wondering whether we can do without this and include it in the "file upload" tab as a final step (does the user even need to interact with the deduplication function...? Probably yes...). It might make for a smoother workflow between tabs if we do that (so all of the getting-the-data-in-and-tidy-for-plotting is done in a single tab).

Bar Plot function update

Getting the following error on the vignette (when using Sam's sample project data)

my_contributions <- plot_contributions(n_unique, center = TRUE)
Error in mutate_at(., vars(!!bars), ~forcats::fct_relevel(.x, bar_order)) :
could not find function "mutate_at"

It appears from the dplyr documentation that mutate_at() has been superseded; the current idiom is across() inside mutate().
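
For reference, a sketch of the current dplyr idiom, assuming bars holds the column name as a string (not necessarily the exact change made in the package):

# Superseded:
# data_sum <- mutate_at(data_sum, vars(!!bars), ~forcats::fct_relevel(.x, bar_order))

# Current idiom: across() inside mutate()
data_sum <- data_sum %>%
  dplyr::mutate(dplyr::across(dplyr::all_of(bars),
                              ~ forcats::fct_relevel(.x, bar_order)))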

Ordering factors for bar plot

Provide the ability for users to specify the factor order on the bar plot. Currently, the bar plot reads final, screened, and search from left to right.

Clarify / change return value for dedup citations when manual_dedup = FALSE

Based on the documentation, dedup_citations returns "A list of 2 dataframes - unique citations and citations to be manually deduplicated if option selected" ... but that is rather ambiguous when manual_dedup = FALSE. I would have expected a single dataframe to be returned in that case, which would make processing easier - could we change to that and clarify the documentation? @kaitlynhair if you prefer to always return a list, maybe that can be expressed more clearly, and the empty list item should probably be NULL rather than FALSE?

Merge branches and clean up some issues

  • rename import parameters - cite_source, cite_string, cite_label

  • inherit synthesisr parameter specifications in Roxygen help

  • create a function to specify data for plot as a combination of the various label fields

  • consider how to show multi-label comparison (e.g. clustered bar chart as per shiny app)

  • @apaxton89: create grouped bar chart function (see bottom of page here: https://estech.shinyapps.io/citesource/)

Error in dedup_citations

I am trying to get the dedup() function to work with the imported dataset from read_citations() (in the Shiny - my local version, but it doesn't work outside of the Shiny either) and I am hitting an error. Even though there is a column called cite_label, it gives this error:

dedup_citations(CiteSource) # name of the dataframe returned by read_citations()
[1] "formatting data..."
[1] "identifying potential duplicates..."
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
x Column `cite_label` doesn't exist.

Any idea what I am missing here - probably something obvious, but I am not getting anywhere with it! If I run the function step by step, it seems to work...

Bar plot sizing


When we have a larger number of sources the bar plot can appear pretty squeezed. Is it possible to specify a standard width for the clusters and/or bars?
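
A rough sketch of one approach, assuming a ggplot2 bar chart underneath (plot_data and its columns are hypothetical toy data, not CiteSource output):

library(ggplot2)

# Toy data standing in for a per-source/label count summary
plot_data <- data.frame(
  cite_source = rep(paste0("Source_", 1:8), each = 2),
  cite_label  = rep(c("search", "final"), times = 8),
  n           = c(40, 12, 35, 9, 50, 20, 22, 7, 18, 5, 60, 25, 30, 10, 45, 15)
)

# Fix the column width, then save at a width that scales with the number of sources
p <- ggplot(plot_data, aes(x = cite_source, y = n, fill = cite_label)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8))

n_sources <- length(unique(plot_data$cite_source))
ggsave("contributions.png", p, width = 1.5 * n_sources, height = 5)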

Parallelize read_citations

Currently, read_citations is fairly slow - once @kaitlynhair has figured out how to best parallelize deduplication in a platform-independent way, we should copy that over to read_citations.

Add metadata filters

Add the ability for the user to filter records. This would make it possible to analyze various other data points about what was found/included:

  • Year of publication (ability to select a range)
  • Open Access/non Open Access (metadata would need to be preprocessed - some databases (WoS?) include this information in the metadata, but not all - IF included in the metadata, this may require a crosswalk between various resources...)
  • Traditional publications (articles/book chapters) / Gray lit (gov docs, theses, reports, etc.)

Wrong number of citations after deduplication

In the working_example vignette, 50 entries are imported from EconLit - but after deduplication, there are 52 entries that have EconLit in their cite_source. (For some reason, @TNRiley got 53 in his test.)

dedup_results$unique %>% filter(stringr::str_detect(cite_source, "Econ")) %>% count(cite_source) %>% summarise(sum(n))

Add labels to bar in comparison bar plot

I've been testing this in my real-world example (working great and so interesting!!). Given the large number of records from some sources, some of the bars in the bar plot become almost invisible. Can we add the number of records as a label above (and below) the bars in this plot?
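
Assuming a ggplot2 bar chart with the counts available in the plotted data, a minimal sketch (my_plot and the n column are hypothetical placeholders):

# Sketch: print the record count just above each bar (dodge width should match the bars)
my_plot +
  ggplot2::geom_text(ggplot2::aes(label = n),
                     position = ggplot2::position_dodge(width = 0.8),
                     vjust = -0.3, size = 3)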

Create data import/export vignette

Currently, we export a RIS with all labels, sources etc - in a future version, users should be able to re-upload that without assigning tags and deduplicating again.

Possible implementation: add prefix to one of the custom RIS fields in export, check whether that exists in upload and then ask user whether they want to skip other import steps.

Extension: allow users to reupload the file and then add additional sources.

Bar Plot - condition length error

Kaitlyn and I spent about an hour last week trying to work this one through; she is getting a warning, and I'm getting an error. We checked our package versions and verified that they matched during this process.

When running the kh_example in the vignettes

plot_contributions(n_unique, center = TRUE, bar_order = c('search','screen','final'))

returns

Error in if (bar_order != "any") data_sum <- data_sum %>% dplyr::ungroup() %>% :
the condition has length > 1


The same error is returned when I run the working_example

my_contributions <- plot_contributions(n_unique, center = TRUE,
bar_order = c('search', 'Screened', 'Final'))

it returns

Error in if (bar_order != "any") data_sum <- data_sum %>% dplyr::ungroup() %>% :
the condition has length > 1
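
The error comes from comparing the whole bar_order vector against "any" inside if(). A sketch of a vector-safe check (the general pattern, not necessarily the exact fix in plot_contributions):

bar_order <- c("search", "screen", "final")

# bar_order != "any" has length 3, so if() errors in R >= 4.2 (and warned before that)
bar_order != "any"
#> [1] TRUE TRUE TRUE

# Vector-safe alternative: compare the whole object at once
if (!identical(bar_order, "any")) {
  # reorder the factor levels here, e.g.
  # data_sum <- data_sum %>% dplyr::ungroup() %>% ...
}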


Overlap heatmap - view % to the first decimal

Putting together the vignette language, I noticed that when comparing EconLit to WoSE, the % overlap heatmap shows a 16% - 0% overlap. This makes sense, as the overlap of EconLit sources is 8/50 = 16%; however, due to the large number of WoSE sources, we get back 0% (8/2550 = .3%).

I think we should consider changing the calculation to provide the % down to the first decimal, thoughts?
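
A quick sketch of the change, using the numbers from the example above:

# Rounding to one decimal keeps small overlaps visible instead of collapsing to 0%
overlap <- 8 / 2550
round(100 * overlap, 1)
#> [1] 0.3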


Source name

When uploading two .ris files with the following names

  1. WoS
  2. WoS_Early

The result on the heatmap (I can't remember if this also occurred on the upset plot) was:

  1. source_WoS
  2. source_E

We had run this with 10+ files at first, so it was difficult to figure out what source_E was, but we believe it is due to the repeated "WoS" prefix and that only the unique string after the _ is being displayed.

Add set size to heatmaps / co-occurrence matrices

By default, the co-occurrence matrices are now sorted based on the number of records, with the largest source on top (but the user can turn that off so that the ordering in the data is preserved). In addition, it might be nice to show the size of each source as in the example below? However, that is not trivial, so it's not something I can implement at the moment. If we want it, the best way might be to use the superheat package.


Write out final RIS file

Once deduplication is done, a final RIS file should be written out. All sources that the record was found in should be included in the database field - possibly as a JSON, or simply comma separated?
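
A sketch of the comma-separated option in base R (illustrative values only, not the package's export code):

# Sketch: collapse all sources a merged record was found in into one database-field string
record_sources <- c("Web of Science", "Scopus", "ASFA")
db_field <- paste(record_sources, collapse = ", ")
db_field
#> [1] "Web of Science, Scopus, ASFA"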

Input file labelling

How to label a source in a file:

  • File name (default)
  • Allow users to label individual RIS fields manually
  • Extract from RIS field

Circular bar chart showing source contribution

I thought we could try to produce a circular bar chart like this one (https://www.r-graph-gallery.com/circular-barplot.html) to visualise contributions and overlap for each source.

The metric would be number of records and the bars would be number of shared databases per record for each source.

So each source has n(sources) bars, with each bar representing records overlapping with n other sources from 1:n(sources).

What I'm not certain on is whether it should be bar height or bar width that is the dependent variable (n records). Width seems more intuitive, so each source would show its contribution in volume of records with the circumference of the circle (bigger sources being a greater span around), and the width of each bar would be the volume overlapping n other sources. Sources contributing a lot of unique records would then have a flat profile, and those contributing little would be higher. Larger sources would be wider and smaller narrower.

The alternative is bar height, where each source has an equal circumference of the circle, but bigger sources are higher overall, and more unique sources have a steeper profile in the direction of uniqueness. That might be harder to spot uniqueness because steepness between n=0 shared and n=1 shared could be hard to differentiate but would mean very different things (lots of unique records, versus lots overlapping with 1 other source).
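
A rough sketch of the idea with ggplot2 and coord_polar() (toy numbers; not a worked-out design for either the width or height variant):

library(ggplot2)

# Toy summary: for each source, how many of its records are shared with 0, 1, or 2
# other sources (numbers are made up for illustration)
overlap_counts <- data.frame(
  source    = rep(c("WoS", "Scopus", "ASFA"), each = 3),
  n_shared  = factor(rep(0:2, times = 3)),
  n_records = c(40, 25, 10, 30, 35, 15, 20, 10, 5)
)

# Stacked bars wrapped around a circle; bar height carries the record counts here
ggplot(overlap_counts, aes(x = source, y = n_records, fill = n_shared)) +
  geom_col(width = 1) +
  coord_polar() +
  labs(fill = "Shared with n other sources")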

installation Error using remotes package

Trying to install CiteSource using the remotes package - both Sarah and I are now getting this error

Sarah had previously been able to install using the remotes package

remotes::install_github("ESHackathon/CiteSource")

  • installing source package 'CiteSource' ...
    ** using staged installation
    ** R
    Error in parse(outFile) :
    C:/Users/trevor.riley/AppData/Local/Temp/1/Rtmp6r6m51/R.INSTALLe10675f881/CiteSource/R/plots.R:222:1: unexpected '}'
    221: }
    222: }
    ^
    ERROR: unable to collate and parse R files for package 'CiteSource'
  • removing 'C:/Users/trevor.riley/AppData/Local/Programs/R/R-4.2.1/library/CiteSource'

Select a license

At the end of the meeting we discussed licensing - currently the project has defaulted to MIT, but I think in general terms we were looking to move to GPLv3. This would be useful to decide on before any code is committed, as it is difficult to change later since everyone would need to agree.

Internal data formatting process

Go from a long data format (stacked results, or full results before deduplication) to a wide data format with one record per row and sources as columns.
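
A minimal sketch of that reshaping with tidyr (column names are illustrative, not the package's internal ones):

library(tidyr)

# Long format: one row per (record, source) pair, as stacked before deduplication
long <- data.frame(
  record_id = c(1, 1, 2, 3),
  source    = c("WoS", "Scopus", "WoS", "Scopus"),
  found     = TRUE
)

# Wide format: one row per record, one TRUE/FALSE column per source
wide <- pivot_wider(long,
                    names_from  = source,
                    values_from = found,
                    values_fill = FALSE)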

Vignette

I added a vignette folder and file. I think I'm still not clear on what we've settled on in terms of importing, etc. but is this general workflow looking reasonable?

About the package

CiteSource provides users with the ability to deduplicate references while maintaining customizable metadata. Instead of the traditional deduplication method where records are removed and only one record is selected to be retained, CiteSource retains each duplicate record while merging metadata into a single main record. This main record maintains user-customized metadata in two fields, "Source" and "Tag". In the merging process, select metadata fields are also automatically compared (currently DOI & Abstract) and the most complete metadata is used in the main record.

Installation

Use the following code to install CiteSource.
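
# install.packages("remotes")  # if the remotes package is not already installed
remotes::install_github("ESHackathon/CiteSource")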


Import files from multiple sources

Currently, users can import multiple RIS files into CiteSource, which will be labelled with source information such as database, platform, and a search ID. The latter can be used to specify search parameters.


my_records <- read_citations(c("asfa.ris", "econlit.ris", "greenfile.ris"),
                             database = c("ASFA", "EconLit", "GreenFILE"),
                             platform = c("ProQuest", "EBSCO", "EBSCO"),
                             search_ids = c("Search1", "Search2", "Search3"))

Deduplicate while maintaining source information

CiteSource allows users to merge duplicates while maintaining information ...


unique_citations <- dedup_citations(my_records)

data_sources <- source_comparison(unique_citations)

Source or method analysis

When teams are selecting databases for inclusion in a review, it can be extremely difficult to determine the best resources and the return on investment in terms of the time it takes to apply searches. This is especially true in environmental research, which is often cross-disciplinary. By tracking where/how each citation was found, the evidence synthesis community could in turn track the efficacy of various databases and identify the most relevant resources for a given research topic. This idea can be extended to search string comparison as well as strategy and methodology comparison.

Plot overlap as a heatmap matrix


my_heatmap <- plot_source_overlap(data_sources, 
                                  plot_type = "percentages")
my_heatmap 

Plot overlap as an upset plot


my_upset_plot <- plot_source_overlap_upset(data_sources)
my_upset_plot

Review stage analysis

Once title and abstract screening is complete, or once the final literature has been selected, users can analyze the contributions of each Source/Method to better understand its impact on the review. By using the "Source" data along with the "Tag" data, users can analyze the number of overlapping/unique records from each source or method.

Assess contribution of sources by review stage
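
A sketch based on the plot_contributions() call used elsewhere in this document (the exact arguments may differ in the released function):

my_contributions <- plot_contributions(n_unique, center = TRUE)
my_contributions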


Documentation and output

Generate a search summary table


Export deduplicated files


Shiny/GUI

Focusing the Shiny interface around the following use cases, as I believe this covers what we've talked about.

Single merged record
The ability to create a single record and include preferred metadata for selected fields based on the metadata attributes (filled or not, length) is its own unique use case. This was the "metadata enhancement" use case that was in the Google doc.

Examples:
IF one abstract contains text and the other does not, choose the metadata from the complete record for the final merged record.
IF one record's author field contains more text than the other's, choose the metadata from the more complete record.

High level analysis
Databases/Platforms/Indexes analysis

  • crossover
  • uniqueness
  • (No string information included)

Examples:
Databases/Database: GreenFile vs. CAB Direct vs. Aquatic Sciences and Fisheries Abstracts (ASFA) vs. Water Resources Abstracts
Platform/Indexes: Web of Science- Science Citation Index Expanded (YR-YR) vs. Core Collection vs. "ALL Databases" (YR-YR)
Internal Publisher: ProQuest (ASFA) vs. ProQuest Earth, Atmospheric & Aquatic Science Database (EAAS)
Search Engine/Database: Google Scholar vs. ASFA or Web of Science, etc.

Mid level
Single database - Multi String/strategy analysis

  • compare search results against known seed articles/post title abstract/final included articles
  • (there is a lot of potential to analyze best cases in string development, etc. using this)

Example:
ASFA
Search 1 vs Search 2 vs Search 3

Deeper level (this is an area I have a hard time envisioning specific use cases for and see as very niche)
Multiple database - multiple string
Same use cases as mid level, but the ability to analyze across databases as well

Examples:
ASFA string 1 vs EAAS string 1
ASFA string 1 vs EAAS string 2
ASFA string 3 vs EAAS string 2
ASFA string 1, 3 vs EAAS string 2

Summary data for flow diagrams

  • number of results
  • sources of databases
  • duplicates removed
  • number of unique removed vs. crossover (I'm sure there could be some cool research question as to quality and availability?)
  • etc.

File upload limited in size (Shiny)

By default, Shiny limits file uploads to 5MB per file. You can modify this limit by using the shiny.maxRequestSize option. For example, adding options(shiny.maxRequestSize=30*1024^2) to the top of server.R would increase the limit to 30MB.
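
# In server.R (or at app start-up): raise the per-file upload limit to 30 MB
options(shiny.maxRequestSize = 30 * 1024^2)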

Add real progress bars on app

Particularly for file upload and deduplication, the app needs a while - loading bars would be great to keep the user engaged.

Co-occurrence matrix

We need to produce a co-occurrence matrix with sources as columns/rows and the number of shared records across source pairs.

Not sure if co-occurrence or frequency matrix is the right word - I can't find anything that shows frequencies rather than correlations or p-values, but maybe that's because it's very basic...

But something that looks like this:

Embed source data into record level RIS data

Associate source information (source database + search string/search name identifier) with records within a single RIS import.

This could be added as a categorical identifier in an additional column in the main data.frame, and could be embedded in the RIS outputs within custom field 1 (C1).

For the search history data, we embed within search_history_start{...}search_history_end, so we could use the same but just paste a snippet from the main search history JSON string - i.e. just the search id or an array of the database name and search string. If we embed the search history in the first record of every RIS file then there's probably no need to repeat information and bloat the RIS, so I would suggest inserting search_source_start{...}search_source_end into C1 (alongside the search history for the first record).

The content to insert could be, for example:
"record_info": { "id_1": "https://www.doi.org/10.1897/687-asdg9-88.10" }

This could be linked internally to the id within the search history JSON and externally to a DOI or other URL. If absent, it could be replaced with the file name:

"record_info": { "id_1": filename }

Add option to name bottom axis in comparison bar plot

Would it be possible to add an argument to the plot_contributions function to name the bottom axis something other than cite_label? For example, plot_contributions(n_unique, center = TRUE, x_axis = "Stage of Review"). This is not super important - maybe something for the future.
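
Until such an argument exists, the axis title could be overridden on the returned ggplot object (a sketch; my_contributions is the plot object from the examples above):

# Workaround sketch: rename the bottom axis on the returned plot object
my_contributions +
  ggplot2::xlab("Stage of Review")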
