Create networks of subreddits.
Before using:
- install Stanford SNAP (https://snap.stanford.edu/snappy/)
- install BeautifulSoup (pip install beautifulsoup4)
- install Markdown (https://pypi.python.org/pypi/Markdown)
Data:
- http://files.pushshift.io/reddit/subreddits/
  - Download; place in `data`; leave as `subreddits.gz`
- http://files.pushshift.io/reddit/submissions/
  - Download bz2 files for desired submissions; leave zipped; place in their own folder in `data`
- http://files.pushshift.io/reddit/comments/
  - Download bz2 files for comments from the same time frame as the submissions; leave zipped; place in their own folder in `data`
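The PushShift dumps are newline-delimited JSON, one object per line. A minimal sketch of streaming records out of `subreddits.gz` without decompressing it to disk (the function name `iter_subreddits` is illustrative, not part of the repo):

```python
import gzip
import json

def iter_subreddits(path):
    """Stream subreddit records from a PushShift dump.

    Assumes the file is gzip-compressed, newline-delimited JSON,
    which is the format of the PushShift subreddits dump.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

The bz2 submission and comment dumps can be read the same way with `bz2.open` in place of `gzip.open`.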
Outputs:
`python 00_parse_subreddits.py`
-> graph with approx. 1M nodes representing subreddits. No edges. Node attributes:
- `created_utc` (`int`): UTC timestamp of creation date
- `subscribers` (`int`): # of subscribers to subreddit
- `description` (`str`): Plaintext of subreddit description
- `id` (`str`): `t5_` + `id` is equal to the `name` field.
- `lang` (`str`): Subreddit language. Most subreddits are in English (`en`).
- `name` (`str`): Subreddit's identifier in the PushShift dataset.
- `public_description` (`str`): Plaintext of brief subreddit description.
- `submit_text` (`str`): Text shown to users about to make a submission.
- `title` (`str`): HTML title tag.
- `url` (`str`): Subreddit name, e.g. `/r/politics`
- `desc_subreddits` (`str`): Space-separated list of subreddits mentioned in this subreddit's `description` field.
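The `desc_subreddits` attribute can be derived by scanning the description for `/r/name` mentions. A sketch of one way to do it; the regex here is an assumption about how mentions are matched, not the repo's exact rule:

```python
import re

# Matches /r/name or r/name after a space, "(", or start of string.
# The allowed characters and 2-21 length limit follow Reddit's naming
# rules; this pattern is an assumption, not the repo's exact one.
SUBREDDIT_RE = re.compile(r"(?:^|[\s(])/?r/([A-Za-z0-9_]{2,21})")

def desc_subreddits(description):
    """Return a space-separated list of subreddits mentioned in text."""
    seen = []
    for name in SUBREDDIT_RE.findall(description or ""):
        name = name.lower()
        if name not in seen:  # keep first occurrence, drop duplicates
            seen.append(name)
    return " ".join(seen)
```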
`01_Submissions and Comments to Tables.ipynb`
will use the output of `00_parse_subreddits.py` and the PushShift comment/submission data to write all relevant submission and comment data into tab-separated .txt files. Submission and comment files contain: author; subreddit/post/comment IDs; upvotes, downvotes, scores, and gold; creation timestamps; and associated text for NLP.

`02_Tables to Post-Comment TNEANets.ipynb`
will use the .txt files to create one TNEANet per subreddit containing nodes for all of the comments and posts captured from that subreddit. Node attributes:
- `score` (`int`): Comment or post score
- `gilded` (`int`): Number of times the post or comment received Reddit Gold
- `created_utc` (`int`): Post or comment creation timestamp; seconds after Jan 01 1970 00:00 UTC
- `author` (`str`): Reddit username of post or comment author, in lowercase
- `text` (`str`): Title of post (plaintext) or text of comment (Markdown)
- `id` (`str`): Reddit ID of the post (starts with `t3_`) or comment (starts with `t1_`)
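The tab-separated tables can be written with the standard `csv` module. A minimal sketch, assuming a hypothetical column layout (the notebooks' actual columns may differ):

```python
import csv

# Hypothetical column layout for the comment table; the notebook's
# actual columns may differ from this example.
COMMENT_COLS = ["id", "author", "subreddit", "created_utc",
                "score", "gilded", "body"]

def write_comment_table(path, comments):
    """Write comment dicts to a tab-separated .txt table."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=COMMENT_COLS,
                           delimiter="\t", extrasaction="ignore")
        w.writeheader()
        w.writerows(comments)
```

Reading the table back is the mirror image with `csv.DictReader(f, delimiter="\t")`.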
`03_Tables to User-User Graphs.ipynb`
will use the .txt files to create two TNEANets and two TNGraphs representing comments between users on Reddit. The TNEANets contain one directed edge from comment author to comment recipient (parent commenter or parent poster) for each comment captured in the .txt table. The TNGraphs disallow multi-edges, so if user A has made multiple comments in response to user B, there will only be one A->B edge. The `_nodelete` TNEANet does not create a node for the `[deleted]` placeholder user, nor does it add edges for any comments whose author or recipient is `[deleted]` or whose text is `[deleted]` or `[removed]`. The `_nodelete` TNGraph does not create a node for the `[deleted]` placeholder user, but it does add edges for comments whose text is `[deleted]` or `[removed]`.
- TNEANet node attributes:
  - `username` (`str`): The user the node represents
- TNEANet edge attributes:
  - `score` (`int`)
  - `gilded` (`int`)
  - `created_utc` (`int`)
  - `comment_id` (`str`)
  - `subreddit` (`str`)
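The TNGraph multi-edge collapse and the `_nodelete` placeholder rule can be sketched in plain Python; the dict keys here (`author`, `parent_author`) are hypothetical, standing in for columns of the .txt tables:

```python
def user_edges(comments, drop_deleted=True):
    """Collapse comments into unique directed (author -> recipient) edges.

    `comments` is an iterable of dicts with hypothetical keys
    'author' and 'parent_author'. With drop_deleted=True this mirrors
    the _nodelete rule: skip any comment whose author or recipient is
    the [deleted] placeholder user.
    """
    edges = set()
    for c in comments:
        src, dst = c["author"], c["parent_author"]
        if drop_deleted and "[deleted]" in (src, dst):
            continue
        # a set disallows multi-edges, like a TNGraph: repeated A->B
        # comments yield a single edge
        edges.add((src, dst))
    return edges
```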