Giter VIP home page Giter VIP logo

nlp_equitymarkets_10k's Introduction

Alpha Generation and NLP Analysis in the Equity Markets

Applying NLP framework to study intrinsic risk associated with 10K filings risk disclosures from S&P500 companies.

Overview

  • Scraping 10-K filings for equity return alpha and applying an NLP framework to study intrinsic risk associated with volatility in corporate risk disclosures,
  • In each 10K annual filing, companies include information about the most significant risks that apply to the company or to its securities.
  • Firms are mandated by the SEC to report information such as “Risk Factors” in their 10-K and detailed insection Item 1A.
  • Risks highlighted from year to year include both business and market-related nature. Items found in this section could highlight broader market risk as well as unsystematic risk.
  • The respective financial statements are scrapped for financial terms and analyzed for sentiment.
  • The sentiment is placed into a list using a bag of words approach and analyzed screened for "negative" term frequency as found in it's annual 10K filings.
  • The respective change in a firms' “riskiness” of its business and/or industry is calculated with TF-IDF (term frequency inverse document frequency) for negative language(as per financial dictionaries applied from University research settings). The cosine symmetry between each firm's corpus of 10K filing negative terms is calculated for each respective year and sorted for relative risk.
  • This analysis utilized research in textual analysis for financial applications and data provided by Loughran-McDonald lexicon, which assigns a value to words based on the financial services industry context.
  • In this analysis, I reviewed the S&P 500 companies and sought to uncover unique risk profiles for each company and its respective GICS sector.
  • The end-use is designed to provide alpha application by assessing intrinsic risk, some of which may be underappreciated by the market.
  • In further analysis, I will look to see where this can add value as a factor in pricing options, particularly in comparing implied vols and historic volatility.

Code and Resources Used

Python Version: 3.7
edgar Version: edgar 5.4.1

Environment: Google Colab, Atom IDE, Jupyter notebook

Packages: pandas, numpy, statsmodels, sklearn, matplotlib, NLTK, spaCy, beautifulsoup, edgar, yfinance

Data Cleaning

  • Scraped the wiki with BeautifulSoup for the S&P companies, tickers, CIK identifier and GICS Sector information - List of S&P 500 companies
  • Read the wiki html table into a pandas dataframe
  • Reformat CIK identifier to a readable version for the sec-edgar scrapping module
  • Create a dictionary for the company name and respective CIK code in order to correctly iterate through the edgar module parameters.
  • Parsed the text scrapped from edgar library and used regex to find sections of the financial statements pertaining to the Risk Factors(1A)
  • Returned just the spanned text from section 1A and where files were not formatted correctly or the document version was an ammendment and that section didn't exist, the company was pulled from the dataset.

EDA

Below are a few highlights:

alt text alt text

Model Building

  • Grouped the dataset by the datapoints per year for each stock
  • Tokenized and stemmed each token from the document to find the YoY change with spaCy
  • Used TfidfVectorizer and applied a financial dictionary derived using the Loughran-McDonald dictionary.
  • Applied cosine similarity on the TDIF vectorized dataset for the yearly comparison for each respective company's parsed 10K filing.

Model Performance

The cumulative return of the long non-negative changing sentiment firms and short the increased negative sentiment changing firms was nearly 10% over the 5-year investment horizon.

alt text

nlp_equitymarkets_10k's People

Contributors

gregjasonroberts avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.