
gamechanger-crawlers's Introduction


Introduction

Over 15 thousand documents govern how the Department of Defense (DoD) operates. These documents live in different repositories, often on different networks; they are discoverable to different communities, are updated independently, and evolve rapidly. No single capability has ever existed for navigating this vast universe of governing requirements and guidance documents, leaving the Department unable to make evidence-based, data-driven decisions. Today GAMECHANGER offers a scalable solution: an authoritative corpus that forms a single trusted repository of all statutory and policy-driven requirements, built on Artificial Intelligence (AI)-enabled technologies.


Vision

Fundamentally changing the way in which the DoD navigates its universe of requirements and makes decisions

Mission

GAMECHANGER aspires to be the Department’s trusted solution for evidence-based, data-driven decision-making across the universe of DoD requirements by:

  • Building the DoD’s authoritative corpus of requirements and policy to drive search, discovery, understanding, and analytic capabilities
  • Operationalizing cutting-edge technologies, algorithms, models and interfaces to automate and scale the solution
  • Fusing best practices from industry, academia, and government to advance innovation and research
  • Engaging the open-source community to build generalizable and replicable technology

License & Contributions

See LICENSE.md (and INTENT.md for the licensing intent) and CONTRIBUTING.md.

How to Set Up a Local Environment for Development

The following steps should be done in a macOS or Linux environment (including WSL on Windows).

  1. Install Google Chrome and ChromeDriver
  2. Install Miniconda or Anaconda (Miniconda is much smaller)
  3. Create a Python 3.6 conda environment for the GAMECHANGER crawlers:
    conda create -n gc-crawlers python=3.6
  4. Clone the repository and change into its directory:
    git clone https://github.com/dod-advana/gamechanger-crawlers.git
    cd gamechanger-crawlers
  5. Activate the conda environment and install requirements:
    conda activate gc-crawlers
    pip install --upgrade pip setuptools wheel
    pip install -r ./docker/minimal-requirements.txt
  6. That's it.
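
As a quick sanity check after the steps above, the following is a minimal sketch (not part of the repository) that reports whether the interpreter, a Python package, and an executable are visible from the active environment. The function name `check_environment` is hypothetical, and the defaults are assumptions: `scrapy` is checked because the quickstart invokes it, `chromedriver` because step 1 installs it, and "at least Python 3.6" is a looser check than the exact version pinned in step 3.

```python
import importlib.util
import shutil
import sys


def check_environment(required_python=(3, 6),
                      modules=("scrapy",),
                      executables=("chromedriver",)):
    """Return a dict mapping each check name to True (found) or False (missing).

    Hypothetical helper: the default module and executable names are
    assumptions based on the setup steps, not a list taken from the repo.
    """
    # Compare only (major, minor) against the required minimum version.
    report = {"python": sys.version_info[:2] >= tuple(required_python)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported.
        report["module:" + name] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when the executable is not on PATH.
        report["executable:" + name] = shutil.which(name) is not None
    return report


for check, ok in check_environment().items():
    print(check, "OK" if ok else "MISSING")
```

A `MISSING` line only means the item was not found from the current shell; for example, ChromeDriver may be installed but not on `PATH`.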

Quickstart Guide: Running a Crawler

  1. Follow the environment setup guide above if you have not already
  2. Change to the gamechanger crawlers directory and export the repository path to the PYTHONPATH environment variable:
    cd /path/to/gamechanger-crawlers
    export PYTHONPATH="$(pwd)"
  3. Create an empty directory for the crawler file outputs:
    CRAWLER_DATA_ROOT=/path/to/download/location
    mkdir -p "$CRAWLER_DATA_ROOT"
  4. Create an empty previous manifest file:
    touch "$CRAWLER_DATA_ROOT/prev-manifest.json"
  5. Run the desired crawler spider from the gamechanger-crawlers directory (this example uses executive_orders_spider.py):
    scrapy runspider dataPipelines/gc_scrapy/gc_scrapy/spiders/executive_orders_spider.py \
      -a download_output_dir="$CRAWLER_DATA_ROOT" \
      -a previous_manifest_location="$CRAWLER_DATA_ROOT/prev-manifest.json" \
      -o "$CRAWLER_DATA_ROOT/output.json"
  6. After the crawler finishes running, all downloaded files should be in the crawler output directory ($CRAWLER_DATA_ROOT).
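
To confirm what a run produced, a small sketch like the one below can summarize the output directory. It assumes only that the `-o ...output.json` feed is a single JSON array, which is how Scrapy's JSON feed exporter writes items; it makes no assumption about the fields inside each item. The function name `summarize_crawl` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def summarize_crawl(data_root, feed_name="output.json"):
    """Count feed items and list the files directly under the output directory.

    Hypothetical helper: `feed_name` matches the `-o` argument used in the
    quickstart command; nothing is assumed about the structure of each item.
    """
    root = Path(data_root)
    feed_path = root / feed_name
    items = []
    # A missing or empty feed file is treated as "no items" rather than an error.
    if feed_path.is_file() and feed_path.stat().st_size > 0:
        items = json.loads(feed_path.read_text(encoding="utf-8"))
    # Only top-level files are listed; subdirectories are not traversed.
    files = sorted(p.name for p in root.iterdir() if p.is_file())
    return {"item_count": len(items), "files": files}
```

For example, `summarize_crawl("/path/to/download/location")` returns a dictionary with the number of items in `output.json` and the file names found next to it. If the feed file was cut short by an interrupted run, `json.loads` will raise an error instead of returning a partial count.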

gamechanger-crawlers's People

Contributors

ademouy, amaruca141, antsega, ashermuse, brandonherzog, dakotahavel, domcritchlow, kroop-olivia-bah, matthew-kersting, melkiga, mstyslinger, takao8, vat99, vctrstrm


gamechanger-crawlers's Issues

Exception when running dod issuances crawler

Running the DoD issuances crawler raises a TypeError about missing arguments. What values should I use for these arguments, or am I missing something?

Also, will the parsed data be made publicly available? That would save contributors some effort.

/Users/ilya/.virtualenvs/torch/lib/python3.8/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.esd.whs.mil'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(
Traceback (most recent call last):
  File "/Users/ilya/pycharm_projects/rabin/gamechanger-crawlers/dataPipelines/gc_crawler/dod_issuances/cli.py", line 51, in <module>
    run()
  File "/Users/ilya/.virtualenvs/torch/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/ilya/.virtualenvs/torch/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/ilya/.virtualenvs/torch/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/ilya/.virtualenvs/torch/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/ilya/pycharm_projects/rabin/gamechanger-crawlers/dataPipelines/gc_crawler/dod_issuances/cli.py", line 46, in run
    for json_doc in results:
  File "/Users/ilya/pycharm_projects/rabin/gamechanger-crawlers/dataPipelines/gc_crawler/exec_model.py", line 89, in iter_validated_output_json
    for json_doc in self.iter_output_json():
  File "/Users/ilya/pycharm_projects/rabin/gamechanger-crawlers/dataPipelines/gc_crawler/exec_model.py", line 84, in iter_output_json
    for doc in self.iter_output_docs():
  File "/Users/ilya/pycharm_projects/rabin/gamechanger-crawlers/dataPipelines/gc_crawler/exec_model.py", line 79, in iter_output_docs
    for doc in self._parser.parse_docs_from_page(page_link, page_text):
  File "/Users/ilya/pycharm_projects/rabin/gamechanger-crawlers/dataPipelines/gc_crawler/dod_issuances/models.py", line 145, in parse_docs_from_page
    doc = Document(
TypeError: __init__() missing 3 required positional arguments: 'display_doc_type', 'display_org', and 'display_source'
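
The last line of the traceback is the standard Python error for calling a constructor without all of its required positional arguments. The sketch below reproduces that kind of failure with a hypothetical stand-in class; it is NOT the repository's actual `Document` class, the `doc_name` parameter and `try_build` helper are invented for illustration, and the placeholder values do not answer which values the crawler should pass.

```python
class Document:
    """Hypothetical stand-in, NOT the repository's actual Document class."""

    def __init__(self, doc_name, display_doc_type, display_org, display_source):
        # All four parameters are required: none of them has a default value.
        self.doc_name = doc_name
        self.display_doc_type = display_doc_type
        self.display_org = display_org
        self.display_source = display_source


def try_build(**fields):
    """Return (document, None) on success or (None, error message) on TypeError."""
    try:
        return Document(**fields), None
    except TypeError as err:
        return None, str(err)


# Omitting the three display_* fields triggers the same kind of TypeError.
_, error = try_build(doc_name="example")
print(error)

# Supplying every required field succeeds. The values are placeholders only.
doc, error = try_build(
    doc_name="example",
    display_doc_type="placeholder type",
    display_org="placeholder org",
    display_source="placeholder source",
)
print(doc.display_org, error)
```

In the reported case, this pattern suggests that the call in dod_issuances/models.py builds a `Document` without three fields that its constructor requires, so either the call site or the class definition would need to change.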
