This repository holds code for a sentiment analysis of Bitcoin-related publications in traditional media (large, popular, print-based sources). The codebase consists of two tools: a web-scraping suite, and the analysis code that applies sentiment analysis libraries to the scraped corpus.
The scraper has a basic interactive terminal interface through which a user can choose a source and keywords to scrape. The available sources are NYT, CNN, BBC and Reuters. For the chosen source, the tool performs a keyword search, collects the hyperlinks of the matching articles, and then extracts the article text from each linked page.
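The last two steps (collecting links, then extracting article text) can be sketched with the standard library alone. This is a minimal illustration, not the repository's scraper: the names `LinkCollector`, `ParagraphExtractor`, `collect_links` and `extract_text` are hypothetical, it assumes article text sits in `<p>` tags, and it works on HTML strings that have already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute hrefs from <a> tags on a search-results page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


class ParagraphExtractor(HTMLParser):
    """Gather the text found inside <p> tags of an article page."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def extract_text(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    # Keep non-empty paragraphs, one per line.
    return "\n".join(p.strip() for p in parser.paragraphs if p.strip())
```

In practice each source needs its own rules for telling article links apart from navigation links, which this sketch does not attempt.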
To derive sentiment scores from the scraped text, two 'out-of-the-box', unsupervised methods are employed: the VADER and TextBlob sentiment libraries. The two differ in detail but work similarly: each looks up pre-assigned polarity values for the words it recognises in an article and combines them into a score. Both methods also take some context into account (e.g. negation).
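To make the idea of lexicon-based scoring with negation concrete, here is a deliberately tiny sketch. It is not VADER's or TextBlob's algorithm: the lexicon, the negation list and the function `toy_polarity` are all made up for illustration. With the real libraries, the usual entry points are `SentimentIntensityAnalyzer().polarity_scores(text)` (whose result includes a `"compound"` score) for VADER and `TextBlob(text).sentiment.polarity` for TextBlob.

```python
import re

# Hypothetical mini-lexicon for illustration only; the real libraries
# ship lexicons with thousands of entries and many more rules.
LEXICON = {"good": 0.7, "great": 0.8, "surge": 0.5,
           "bad": -0.7, "crash": -0.8, "fraud": -0.9}
NEGATIONS = {"not", "no", "never"}


def toy_polarity(text):
    """Mean polarity of known words, flipping a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            value = LEXICON[tok]
            # Very crude context handling: "not good" counts as negative.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                value = -value
            scores.append(value)
    # Articles with no known words are treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0
```

Words outside the lexicon contribute nothing, which is one reason word choice can shift scores between sources.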
2053 articles with keyword = ‘bitcoin’
- BBC (318 stories – avg. length 451 words)
- NYT (402 stories – avg. length 1011 words)
- CNN (720 stories – avg. length 379 words)
- Reuters (602 stories – avg. length 544 words)
Time range from 2011 to May, 2019
- BBC: June, 2011 – May, 2019
- NYT: January, 2012 – May, 2019
- CNN: August, 2012 – May, 2019
- Reuters: April, 2012 – May, 2019
Each article receives a sentiment polarity score. Article scores are then aggregated in rolling time windows (monthly and fortnightly) to create smoothed time series, which are plotted against Bitcoin prices. Visualisations of publishing frequency and comparison plots between VADER and TextBlob scores are also produced.
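The rolling aggregation can be sketched as follows. This is one possible reading, under stated assumptions: the window is trailing (it ends on the date being plotted), the aggregate is a plain mean, and the function name `rolling_mean` is hypothetical. The README does not say whether the repository's windows are trailing or centred.

```python
from datetime import date, timedelta


def rolling_mean(dated_scores, window_days):
    """For each publication day, average the scores of all articles
    published in the window_days ending on (and including) that day."""
    window = timedelta(days=window_days)
    days = sorted({day for day, _ in dated_scores})
    out = []
    for day in days:
        in_window = [s for d, s in dated_scores if day - window < d <= day]
        out.append((day, sum(in_window) / len(in_window)))
    return out


# Made-up scores, purely to show the call shape.
example = [(date(2019, 1, 1), 1.0),
           (date(2019, 1, 10), 0.0),
           (date(2019, 1, 20), -1.0)]
fortnightly = rolling_mean(example, 14)
monthly = rolling_mean(example, 30)
```

A wider window averages over more articles and so gives a smoother but less responsive series. With pandas, a time-indexed `Series.rolling("14D").mean()` would do a similar job.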
Sentiment scores are calculated using both VADER and TextBlob polarity scoring. For each data source below, the first plot shows the VADER scores, while the second shows those derived via TextBlob. Cursory Granger causality testing indicated that, for these sources, the BTC price drives news sentiment more than news sentiment drives the price.
The main difference between the two methods is that TextBlob produces consistently lower polarity scores than VADER. Interestingly, the size of this gap is not the same from source to source (e.g. Reuters scores from the two methods are much closer to each other than those from other sources). This suggests that some styles of journalism produce more volatile scores from these unsupervised scorers than others, presumably because of word choice.