Giter VIP home page Giter VIP logo

michalgm / corpwatchapi Goto Github PK

View Code? Open in Web Editor NEW
48.0 5.0 11.0 2.09 MB

The CorpWatch API uses automated parsers to extract the subsidiary relationship information from Exhibit 21 of companies' 10-K filings with the SEC and provides a free, well-structured interface for programs to query and process the data.

Home Page: http://api.corpwatch.org/

License: GNU General Public License v3.0

Perl 99.69% Shell 0.31%

corpwatchapi's Introduction

Corpwatch API Backend

This is the backend data processing code that fetches, parses, processes, and stores SEC 10-k filings for use by the CorpWatch API: http://api.corpwatch.org/

Please note that this project is currently unfunded, and that the developers have not actively worked on the code since 2010

Setup

These instructions have not fully been tested. If you find something is missing, please file an issue.

Install Dependencies

Choose one of the following.

If you have superuser access...

# Install dependencies - here's a (possibly incomplete) list of packages required by the project to run on `Ubuntu 18.04`.
apt-get update
apt-get install mysql-server libwww-mechanize-perl libdbi-perl libcompress-raw-zlib-perl libdatetime-format-builder-perl libdatetime-format-iso8601-perl libdatetime-format-strptime-perl libhtml-element-extended-perl libparallel-forkmanager-perl libhtml-tableextract-perl libhtml-treebuilder-xpath-perl liblwp-protocol-https-perl liblwp-mediatypes-perl libtext-unidecode-perl libtime-modules-perl libxml-simple-perl libdbd-mysql-perl

# Create a mysql database schema (see below on how to override default db settings):
sudo mysql -u root -e "create database edgarapi_live; grant all on edgarapi_live.* to 'edgar'@'localhost' identified by 'edgar'"

If you prefer to install dependencies in your home directory...

  1. Install Miniconda and then execute:
    conda create -n cwapi              # Create a conda environment
    conda activate cwapi               # Activate into this environment
    conda install -c anaconda mysql -y # Install MySQL
  2. Set up MySQL server.
    1. Add environment variables. You can customize the paths as you want:
      export MYSQL_HOME=~/ex21/data/mysql
      export MYSQL_UNIX_PORT=$MYSQL_HOME/socket
      export MYSQL_TCP_PORT=3306
    2. Initialize MySQL server:
      mysqld --user=$(whoami) --datadir=$MYSQL_HOME --basedir=$MYSQL_HOME --log-error=$MYSQL_HOME/mysql.err --pid-file=$MYSQL_HOME/mysql.pid --socket=$MYSQL_HOME/socket --port=3306 --initialize 
      The last line will contain a randomly-generated password for the root user in MySQL:
      2019-09-12T06:04:42.207964Z 1 [Note] A temporary password is generated for root@localhost: fob!Ao6cpgGD
      
      If you are experiencing error /usr/lib64/libstdc++.so.6: version GLIBCXX_3.4.21' not found, you need to first set one more environment variable. First, find a C++ standard library file in your conda installation directory. In my case of using miniconda3at~/miniconda3`, I execute:
      find ~/miniconda3/ -name "libstdc++.so*"
      which gives me a list of file paths, many of which are residing in ~/miniconda3/pkgs/libstdcxx-ng-8.2.0-hdf63c60_1/lib. Thus, I set the environment variable as:
      export LD_LIBRARY_PATH=~/miniconda3/pkgs/libstdcxx-ng-8.2.0-hdf63c60_1/lib:$LD_LIBRARY_PATH
      and then repeat the mysqld command.
    3. Spin up MySQL server:
      mysqld
      Now open another terminal (tmux pane).
    4. MySQL requires root password to be changed upon first login. Do this by:
      mysqladmin --socket=$MYSQL_HOME/socket --user root --password fob!Ao6cpgGD password
    and follow the prompts. 5. Connect to MySQL server as root:
    mysql --socket=$MYSQL_HOME/socket --verbose -u root -p
    and then enter the password you just set to. 6. Under the mysql> prompt, execute the following commands:
    CREATE DATABASE edgarapi_live; --  Create a database for CorpWatch API to use
    CREATE USER 'edgar'@'localhost' IDENTIFIED BY 'edgar'; -- Add a user to the MySQL server
    GRANT ALL PRIVILEGES ON edgarapi_live.* TO 'edgar'@'localhost';
    FLUSH PRIVILEGES;
    exit;
  3. Set up Perl.
    1. Install Perlbrew via \curl -L https://install.perlbrew.pl | bash.
    2. Follow instructions to add the line to your ~/.bashrc, then restart the shell.
    3. Prepare Perl environment:
      perlbrew install    perl-5.16.0
      perlbrew switch     perl-5.16.0
      perlbrew lib create perl-5.16.0@cwapi # Like creating a virtual env, but for Perl.
      perlbrew use        perl-5.16.0@cwapi
      perlbrew install-cpanm
      cpanm install HTML::TableExtract HTML::Parser HTML::TreeBuilder HTML::ElementTable DBI DBD::mysql JSON Time::ParseDate Term::ProgressBar Parallel::ForkManager Text::Unidecode IO::Socket::SSL

Usage

You can parse EX-21s from scratch, or you can pre-populate your MySQL database with the CorpWatch DB dumps.

Pre-populate with CorpWatch DB dumps

Simply,

wget http://api.corpwatch.org/documentation/db_dump/corpwatch_api_tables_mysql.tar.gz
tar -xvzf corpwatch_api_tables_mysql.tar.gz # Unzip the archive file
pv corpwatch_api_tables_mysql/*.sql | mysql --socket=$MYSQL_HOME/socket -u edgar -p edgarapi_live

and enter password for MySQL user edgar (in this example of mine, it's edgar) when prompted.

Alternatively, you can...

Parse Filings from Scratch

# Populate the schema with the table definitions:
mysql -u edgar -p edgarapi_live < mysql_database_structure.sql

# Import the static data tables:
mysql -u edgar -p edgarapi_live < data_tables.sql

# Run the update script and be very patient
./update_data.sh

By default, the code will fetch all filings from Q1 2003 to the present. This can be modified by adjusting @years_available in common.pl

Environment Variables

The following environment variables can be used to configure the db connection, or just set your schema/user to use the defaults:

Environment Variable Default Value
EDGAR_DB_HOST localhost
EDGAR_DB_NAME edgarapi_live
EDGAR_DB_USER edgar
EDGAR_DB_PASSWORD edgar

Data processing overview

  • cleanup_state.pl - Clean up database state, remove any orphaned data from incomplete runs, etc.
  • fetch_10ks.pl - download SEC filings
  • fetch_filer_headers.pl - fetches html header files for filings, to be parsed by parse_headers.pl
  • parse_headers.pl - extracts company meta data from the headers of 10-K filings
  • update_cik_name_lookup.pl - downloads a list of former and alternative names for companies and stores them in the table cik_name_lookup
  • relationship_wrapper.pl - manages the execution of multiple copies of the section21Header processing script in order to get around a memory leak in a perl library, and at the same time take advantage of multiple processors on the host machine. The script it executes, sec21_headers.pl, is the core of the subsidiary parser - it processes the Exhibit 21 filings to try to pull out subsidiary names, locations, and hierarchy using a bunch of crazy regexs and stopwords
  • clean_relationships.pl - cleans the subsidiary relationships data that has been parsed from the 10-K Section 21 filings. It also cleans the names in the filers and cik_name_lookup table. The names of companies in each of the tables are normalized so that they can be matched, and the location codes are mapped to UN country codes where possible.
  • populate_companies.pl - repopulate the companies_* tables using the information that has been parsed from the filings.

Table descriptions (incomplete)

Primary Tables

Table Name Description
companies meta information about company entities (defines cw_id)
company_locations address or location tags for companies
company_names company name variants
company_relations parent-child relationships between companies
filings info about filing records

Intermediate tables used in processing

Table Name Description
filers companies that appeared as filers on 10-K forms
filing_tables information about the parsing of the filings
relationships raw relationships as parsed from Section 21 filings
croc_companies lists of companies from Crocodyl, matched to cik ids
cik_name_lookup master list names and CIK ids from EDGAR
not_company_names strings that appear in parsed data that are definitely NOT companies

Static data tables

Table Name Description
sic_codes definitions of ~500 SIC industry codes
sic_sectors definition of middle level SIC industry sectors
stock_codes ticker symbol and name for 3354 companies
un_countries official list of UN country names and codes
un_country_aliases alternate country and location names
un_country_subdivisions list of UN states, provinces, etc
region_codes translation table for SEC to UN country and region codes
unlocode other locations (metropolitan areas, etc)
word_freq table of frequencies of words appearing in company names to help fuzzy matching

Known issues

  • There is a query in parse_headers.pl (second query after 'Hiding bad ciks' that takes a very long time to run, even with minimal data. I'm guessing that we need to populate the tables with some additional data maybe to minimize the number of joins, as it seems to not long on a populated database
  • sec21_headers.pl outputs a lot of errors relating to the html table extraction library. It seems to still be parsing the tables, though I have not tested thoroughly.

corpwatchapi's People

Contributors

michalgm avatar skyebend avatar tslmy avatar

Stargazers

 avatar  avatar  avatar Jay Horton avatar alexander erofeev avatar Jarl Robert Pedersen Heer avatar Zane Selvans avatar Alex Engel avatar Yoav Tzfati avatar  avatar David Andreoletti avatar Derek Snow avatar  avatar Michael Tiemann avatar Tl Yim avatar Paul Stey avatar Gregory Foster avatar Stephen Abbott Pugh avatar Maurice Dalton avatar Hannah Jade avatar Shyam Peri avatar Foundation avatar Kevin Dorfer avatar Ricardo avatar  avatar juaquins avatar WcW avatar  avatar Ben Creasy avatar Elizabeth S avatar Tharsis Souza avatar Sergio De Raco avatar  avatar  avatar Yun avatar Arthur Pham avatar  avatar Jesse avatar Angel Peralta avatar ѕυвƒιѕѕιση avatar Quinton Pike avatar Pramendra Singh Dadhwal avatar Thao Vo avatar Tamir Al avatar Clarence Chio avatar  avatar Michał Bojanowski avatar  avatar

Watchers

James Cloos avatar  avatar  avatar  avatar VASP Holdings avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.