Extraction pipelines vaccine hesitancy

Repository for the Lagrange Scholarship Project about Vaccine Hesitancy - Extraction pipelines

In this project, we developed two high-precision rule-based extraction pipelines that classify text with respect to vaccination behaviors and experiences. The items we tracked are (i) adherence to the recommended or an alternative vaccination schedule and (ii) mentions of positive or negative experiences with adverse events following immunization (AEFI).

The two pipelines share the same workflow and work at the sentence level. Each is made up of a filter and a classifier. The filter identifies sentences that contain information relevant to the item under consideration, using a combination of rules based on the occurrence of certain keywords with specific syntactic dependencies. The classifier then assigns the appropriate label to each sentence that passes the filter.
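The filter-then-classify workflow can be sketched as follows. This is only an illustration of the idea, not the code in this repository: the keyword sets, the `dobj` rule, the label names and the `Token` class are all hypothetical, and the sentences are parsed by hand so that the example runs without spaCy.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """One word of a parsed sentence (a hand-made stand-in for a spaCy token)."""
    text: str
    dep: str   # dependency label linking the word to its head
    head: int  # index of the head word (the root points to itself)


# Hypothetical keyword lists, for illustration only.
SCHEDULE_NOUNS = {"schedule", "shots", "vaccines"}
SCHEDULE_VERBS = {"delayed", "skipped", "followed"}
ALTERNATIVE_VERBS = {"delayed", "skipped"}


def filter_sentence(tokens: List[Token]) -> Optional[Token]:
    """Filter: return the verb if a keyword noun is the direct object of a keyword verb."""
    for tok in tokens:
        head = tokens[tok.head]
        if (tok.dep == "dobj"
                and tok.text.lower() in SCHEDULE_NOUNS
                and head.text.lower() in SCHEDULE_VERBS):
            return head
    return None


def classify_sentence(tokens: List[Token]) -> Optional[str]:
    """Classifier: label a sentence that passed the filter, else return None."""
    verb = filter_sentence(tokens)
    if verb is None:
        return None
    if verb.text.lower() in ALTERNATIVE_VERBS:
        return "alternative"
    return "recommended"


# "We delayed the shots"
delayed = [Token("We", "nsubj", 1), Token("delayed", "ROOT", 1),
           Token("the", "det", 3), Token("shots", "dobj", 1)]
# "We followed the schedule"
followed = [Token("We", "nsubj", 1), Token("followed", "ROOT", 1),
            Token("the", "det", 3), Token("schedule", "dobj", 1)]
# "The schedule is long" (keyword present, but not in the required dependency)
irrelevant = [Token("The", "det", 1), Token("schedule", "nsubj", 2),
              Token("is", "ROOT", 2), Token("long", "acomp", 2)]

for sentence in (delayed, followed, irrelevant):
    print(classify_sentence(sentence))
```

Requiring both a keyword and a specific dependency relation is what keeps the filter precise: the third sentence mentions "schedule" but is discarded because the word is not the object of a scheduling verb.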

The rules of the pipelines are handcrafted and were developed by inspecting a dataset of vaccination-related comments collected from a popular parenting forum (BabyCenter.com https://community.babycenter.com/).

Due to the Terms of Use of the forum, we cannot make the dataset of user posts and comments available. We release only the resulting interaction network.

Requirements

python   (3.7.4)

spacy    (2.2.3)
pandas   (0.25.1)
numpy    (1.17.2)
nltk     (3.4.5)
networkx (2.3)
pickle   (standard library)

To download the spaCy language model, run from a shell:

$ python -m spacy download en_core_web_sm-2.2.5 --direct

Structure of the repository

Experiences_AEFI contains the keywords used to filter sentences relevant to experiences with adverse events following immunization
Vaccination_schedule contains the keywords used to filter sentences relevant to vaccination scheduling
data contains the interaction network
output contains the results of the two pipelines
test contains a list of sentences and the corresponding dependency trees. It is useful for checking whether the spaCy dependency parser returns the expected parse
utils contains files useful for the pipelines

AEFI_pipeline_functions.py contains the script that defines the extraction pipeline for experiences of adverse reactions following immunization
Dependency_tree_functions.py contains the scripts that represent the dependency parse of a sentence as a network (using the networkx library). It also contains functions that search for information by navigating the dependency tree
Experiences AEFI : commentclassification.ipynb is the notebook in which the pipeline of experiences of adverse reactions following immunization is applied to the sample of comments located in the data folder
Schedule_pipeline_functions.py contains the scripts defining the vaccination scheduling pipeline
Vaccination schedule : comment classification.ipynb is the notebook in which the pipeline is applied to the sample of comments located in the data folder
test_dependency_parsing.ipynb is the notebook in which the dependency parser is tested and compared with the expected behavior
text_elaboration.py contains the scripts for basic text preprocessing
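As a rough idea of what representing a dependency parse as a network looks like, here is a minimal sketch with networkx. It does not reproduce the functions in Dependency_tree_functions.py: `build_tree`, `children_with_dep` and `subtree_words` are hypothetical names, and the parse is written by hand instead of being produced by spaCy.

```python
import networkx as nx

# Hand-written parse of "We delayed the shots":
# (token index, word, dependency label, index of the head word)
PARSE = [
    (0, "We", "nsubj", 1),
    (1, "delayed", "ROOT", 1),
    (2, "the", "det", 3),
    (3, "shots", "dobj", 1),
]


def build_tree(parse):
    """Build a directed graph with one node per word and edges from head to child."""
    graph = nx.DiGraph()
    for idx, word, dep, _head in parse:
        graph.add_node(idx, word=word, dep=dep)
    for idx, _word, _dep, head in parse:
        if idx != head:  # the root points to itself, so it gets no incoming edge
            graph.add_edge(head, idx)
    return graph


def children_with_dep(graph, node, dep):
    """Return the words directly attached to `node` with dependency label `dep`."""
    return [graph.nodes[child]["word"]
            for child in graph.successors(node)
            if graph.nodes[child]["dep"] == dep]


def subtree_words(graph, node):
    """Return the words of the subtree rooted at `node`, in sentence order."""
    indices = sorted(nx.descendants(graph, node) | {node})
    return [graph.nodes[i]["word"] for i in indices]


tree = build_tree(PARSE)
print(children_with_dep(tree, 1, "dobj"))  # direct objects of "delayed"
print(subtree_words(tree, 3))              # the noun phrase headed by "shots"
```

Storing the word and the dependency label as node attributes keeps navigation simple: a rule only needs to follow edges and compare labels.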
