A tool for domain experts to find recent and relevant public discourse on topics they are familiar with
The increasing accessibility of scientific articles and the surrounding public discourse is generally beneficial to society. A tradeoff of this increased public consumption of knowledge, in formats traditionally meant for domain experts, is the rise of misinformation. Articles in scientific journals often describe specific facts and precise outcomes under specific conditions, and their validity and generalizability are usually understood only by a few experts. On the other hand, social media such as Reddit and Twitter allow anyone (often anonymously) to post articles and comment on their contents, and it is in these forums that misunderstandings and incorrect information are conveyed and spread. The wide audience of these forums, coupled with increased public interest in scientific topics (for example, in relation to the Covid-19 pandemic), has made it imperative that experts be able to find and engage with such posts.
This project, pitched for Brainhack Toronto 2021, seeks to create a live feed of active and relevant public discussions on widely used social media forums. While the initial focus of this project is to detect discussions that revolve around brain imaging, the tools to be developed here should in principle be useful for other scientific fields.
A tentative implementation of the project is as follows:
For Brainhack Toronto 2021, we'll communicate through a Brainhack Toronto Discord channel.
For Brainhack 2021:
- Post and comment detection on Reddit via Reddit API
- Abstract and keyword detection via the Crossref API
- Classification of post relevancy based on abstract keywords
- Post data to central repository (Firebase)
- Web application to view recent posts
Future ideas:
- Extend to other forums (Twitter?)
- Sentiment detection of discussions
- Mobile applications to display discussion feed
- Analysis of scientific information spread across social networks
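The relevancy classification step listed for Brainhack 2021 could start from something as simple as keyword overlap. The following is a minimal sketch, not the project's actual implementation: the function names, the 0.5 threshold, and the example data are hypothetical, and it assumes single-word keywords that have already been extracted from an abstract.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))


def relevance_score(post_text, abstract_keywords):
    """Return the fraction of abstract keywords that appear in the post text."""
    keywords = {k.lower() for k in abstract_keywords}
    if not keywords:
        return 0.0
    return len(keywords & tokenize(post_text)) / len(keywords)


def is_relevant(post_text, abstract_keywords, threshold=0.5):
    """Flag a post as relevant when enough keywords match (threshold is an assumption)."""
    return relevance_score(post_text, abstract_keywords) >= threshold


# Hypothetical example data, for illustration only
keywords = ["fmri", "cortex", "connectivity"]
post = "New fMRI study maps connectivity changes in the visual cortex"
score = relevance_score(post, keywords)
```

A score like this is cheap to compute for every incoming post, which suits a live feed, but it ignores synonyms and multi-word keywords, so it is best read as a baseline.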
- Web Application (not deployed yet)
This web application allows users to search Reddit posts across all subreddits or within a specific subreddit, search the most recent posts or posts in a given time window, or search posts with specific keywords.
python run_app_reddit_search.py
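The three search modes described above (by subreddit, by time window, by keyword) can be illustrated with a small filter over already-fetched posts. This is a sketch under stated assumptions: `filter_posts` and the post dictionary layout (`title`, `subreddit`, `created_utc` as a Unix timestamp) are hypothetical and are not taken from `run_app_reddit_search.py`.

```python
from datetime import datetime, timezone


def filter_posts(posts, subreddit=None, start=None, end=None, keywords=None):
    """Return the posts that match every criterion that was given, newest first."""
    selected = []
    for post in posts:
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        if subreddit and post["subreddit"].lower() != subreddit.lower():
            continue
        if start and created < start:
            continue
        if end and created > end:
            continue
        title = post["title"].lower()
        # A post matches if its title contains at least one keyword
        if keywords and not any(k.lower() in title for k in keywords):
            continue
        selected.append(post)
    return sorted(selected, key=lambda p: p["created_utc"], reverse=True)


# Hypothetical posts, for illustration only
posts = [
    {"title": "Brain imaging of sleep", "subreddit": "neuroscience", "created_utc": 1636000000},
    {"title": "Climate report released", "subreddit": "science", "created_utc": 1636100000},
    {"title": "fMRI and memory", "subreddit": "science", "created_utc": 1636200000},
]
recent_science = filter_posts(
    posts, subreddit="science", start=datetime(2021, 11, 1, tzinfo=timezone.utc)
)
```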
- Reddit Posts Search & Store
Store a user's search results into a PostgreSQL database.
For a demonstration, run the following command:
python demo_reddit_search.py
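To show the idea of persisting search results without requiring a running database server, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The table name, columns, and example rows are hypothetical and do not reflect the schema used by `demo_reddit_search.py`. Note that `INSERT OR IGNORE` is SQLite syntax; PostgreSQL expresses the same idea with `ON CONFLICT DO NOTHING`.

```python
import sqlite3


def store_posts(conn, posts):
    """Create the table if needed and insert posts, skipping ones already stored."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reddit_posts (
               post_id TEXT PRIMARY KEY,
               subreddit TEXT NOT NULL,
               title TEXT NOT NULL,
               created_utc INTEGER NOT NULL
           )"""
    )
    conn.executemany(
        "INSERT OR IGNORE INTO reddit_posts "
        "VALUES (:post_id, :subreddit, :title, :created_utc)",
        posts,
    )
    conn.commit()


# In-memory database and hypothetical search results, for illustration only
conn = sqlite3.connect(":memory:")
results = [
    {"post_id": "abc123", "subreddit": "science", "title": "fMRI and memory", "created_utc": 1636200000},
    {"post_id": "def456", "subreddit": "neuroscience", "title": "Brain imaging of sleep", "created_utc": 1636000000},
]
store_posts(conn, results)
store_posts(conn, results)  # storing the same results again adds no duplicate rows
stored_count = conn.execute("SELECT COUNT(*) FROM reddit_posts").fetchone()[0]
```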
- Reddit Post Recommender
For a given Reddit post, this recommender returns the 5 most similar posts based on the content of the post title.
See the Jupyter notebook reddit_recommender.ipynb.
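One common way to rank posts by title similarity is TF-IDF weighting with cosine similarity. The sketch below illustrates that approach with the standard library only; it is an assumption about how such a recommender could work, not a description of what the notebook does, and the function names and example titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Split a title into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def tfidf_vectors(titles):
    """Build one sparse TF-IDF vector (a dict) per title, using smoothed IDF."""
    docs = [Counter(tokenize(t)) for t in titles]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once
    df = Counter(term for doc in docs for term in doc)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def recommend(titles, query_index, top_n=5):
    """Return up to top_n titles most similar to titles[query_index]."""
    vectors = tfidf_vectors(titles)
    query = vectors[query_index]
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors) if i != query_index]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [titles[i] for _, i in scored[:top_n]]


# Hypothetical titles, for illustration only
titles = [
    "fMRI reveals memory networks in the hippocampus",
    "Hippocampus memory networks change with age",
    "New telescope images distant galaxy",
    "Memory consolidation during sleep",
]
top_two = recommend(titles, 0, top_n=2)
```

Titles are short, so a few shared words dominate the ranking; IDF weighting helps by giving words that appear in fewer titles more weight than words that appear in many.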
- Reddit Post Topic (Flair Tag) Classification
This classification model predicts the topic (flair tag) of Reddit posts based on the content of the post title.
For simplicity and demonstration, the present model performs a binary classification on posts with Biology and Environment flair tags.
See the Jupyter notebook reddit_classification.ipynb.
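As an illustration of binary flair classification from titles, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing. The choice of naive Bayes is an assumption made for this example; the notebook may use a different model. The class name and the tiny training set are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Split a title into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())


class NaiveBayesTitleClassifier:
    """Multinomial naive Bayes over title words, with add-one smoothing."""

    def fit(self, titles, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for title, label in zip(titles, labels):
            self.word_counts[label].update(tokenize(title))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, title):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of each word
            score = math.log(n / total)
            for word in tokenize(title):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training titles, for illustration only
train_titles = [
    "Gene editing restores vision in mice",
    "New protein structure found in human cells",
    "Bacteria evolve resistance to antibiotics",
    "Arctic sea ice reaches record low",
    "Deforestation accelerates carbon emissions",
    "Ocean warming threatens coral reefs",
]
train_labels = ["Biology"] * 3 + ["Environment"] * 3
model = NaiveBayesTitleClassifier().fit(train_titles, train_labels)
prediction = model.predict("Coral reefs decline as ocean warming continues")
```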
Contributors of all backgrounds and experiences are welcome.
- Python
For simplicity and consistency, you can create a conda environment using the following command:
conda create --name brainfeed python=3.7
After creating the environment, you can activate it by running:
conda activate brainfeed
- Python packages
You will need the following packages:
- habanero (for Crossref)
- PRAW (for Reddit)
- firebase-admin (for Firebase / Firestore)
- spyder (optional, a Python IDE)
You can install the required packages with this command, after activating the conda environment:
pip install habanero==1.0.0 praw==7.5.0 firebase-admin==5.1.0
conda install spyder=5.1.5
- A Reddit account
Set up a Reddit account, and create a script app using the "Create app" button in your Reddit account's app preferences. More details on this can be found at: https://github.com/reddit-archive/reddit/wiki/OAuth2
- A Firebase project
Create here: https://console.firebase.google.com/
- Clone this repository
git clone git@github.com:yohanyee/brainfeed.git
- Activate the conda environment
conda activate brainfeed
- Copy the praw.ini_TEMPLATE_DO_NOT_ENTER_INFO_HERE file to your config directory and rename it to praw.ini (see https://praw.readthedocs.io/en/stable/getting_started/configuration/prawini.html). Then fill in your Reddit authentication information, following Reddit's guidelines for the user_agent field. Make sure this file is not publicly visible.
- Initialize the Firebase SDK (create a service account and download the private key)
See https://firebase.google.com/docs/admin/setup/#initialize-sdk
- Add an environment variable called GOOGLE_APPLICATION_CREDENTIALS pointing to the location of this private key (which should not be publicly visible):
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/.config/service-account-file.json"
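Before starting any code that talks to Firebase, it can help to confirm that the environment variable is set and points to a real file. The helper below is a hypothetical convenience check, not part of this repository or of the Firebase SDK; it only inspects the environment and the file system.

```python
import os
from pathlib import Path


def find_service_account_key(env=os.environ):
    """Return the path to the service account key, or raise if it is not usable."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    key_file = Path(path).expanduser()
    if not key_file.is_file():
        raise FileNotFoundError(f"No service account key found at {key_file}")
    return key_file


def describe_credentials(env=os.environ):
    """Return a short status message instead of raising, for quick diagnostics."""
    try:
        return f"Using service account key at {find_service_account_key(env)}"
    except (RuntimeError, FileNotFoundError) as err:
        return f"Credentials problem: {err}"


status = describe_credentials({})  # an empty environment, for illustration
```

Passing the environment as an argument keeps the check easy to try with a plain dictionary; in normal use the default `os.environ` is what matters.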
- Reddit API: https://www.reddit.com/dev/api
  - Python API for Reddit (PRAW): https://praw.readthedocs.io/en/stable/index.html
- Twitter API: https://developer.twitter.com/en/docs/twitter-api
- Crossref API: https://www.crossref.org/documentation/retrieve-metadata/
  - Python API for Crossref (Habanero): https://github.com/sckott/habanero
- Altmetric API: https://www.altmetric.com/products/altmetric-api/
- Firebase documentation: https://firebase.google.com/docs
- Firebase Admin SDK reference: https://firebase.google.com/docs/reference/admin
- Firebase Admin Python SDK: https://github.com/firebase/firebase-admin-python
- Firestore quickstart guide: https://firebase.google.com/docs/firestore/quickstart#python