
vabusinesses.org's Introduction

Virginia Businesses

Website for Virginia State Corporation Commission data.


Running locally

./docker-run.sh to start, ./docker-stop.sh to stop.

Running tests

E2E and functional tests are in /deploy/tests/, and can all be run with /deploy/tests/run-all.sh. From outside of the Docker container, they should be invoked with /run-tests.sh.

vabusinesses.org's People

Contributors

dependabot[bot], snyk-bot, ttavenner, waldoj


Forkers

djeraseit

vabusinesses.org's Issues

Create an API

At a minimum, it should be able to:

  • return JSON for a given corporation ID
  • search against business names

Rotate between Elasticsearch indices

Right now, we have downtime baked into the indexing process, because we clear out the index and then repopulate it. It would make more sense to populate a parallel index and, when that's finished, drop the live one and rename the parallel one (e.g., with vabusinesses and vabusinesses_wip: drop vabusinesses and rename vabusinesses_wip to vabusinesses).
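The rotation can be sketched as follows. This is a toy simulation with a plain dict standing in for the cluster; Elasticsearch has no true in-place rename, and in practice the same swap is done atomically with the index-aliases API, but the sequencing is the same. The index names are the ones proposed in the issue.

```python
def rotate_indices(cluster, live="vabusinesses", wip="vabusinesses_wip"):
    """Promote the freshly built parallel index, dropping the stale live one."""
    if wip not in cluster:
        raise KeyError(f"{wip} has not been built yet")
    cluster.pop(live, None)            # drop the live index
    cluster[live] = cluster.pop(wip)   # "rename" the parallel index to live
    return cluster

cluster = {"vabusinesses": ["stale records"]}
cluster["vabusinesses_wip"] = ["fresh records"]   # repopulated in parallel
rotate_indices(cluster)
```

With an alias, searches would point at the alias the whole time and the gap between the pop and the reassignment disappears entirely.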

Add functionality to search by municipality

People should be able to narrow their search by municipality.

  • get a list of all municipalities (just cities and counties right now) in Virginia
  • obtain GeoJSON for every one of those municipalities
  • pre-index all of those shapes
  • document somewhere how to re-index those shapes
  • add an Elasticsearch interface to search using that GeoJSON
  • add all Virginia towns, too

Set up a parallel industry ID

The SCC's industry ID is nearly useless. Create a process to provide a more granular identifier: a lookup table, plugins for different data sources, etc.

Eliminate "99999999999" shares from 2_corporate

As a placeholder (for what, I don't know), 2_corporate lists 99999999999 shares for businesses. Elasticsearch complains about this and fails to import the record. I don't know why Elasticsearch complains, but it's just as well—they don't really have 100 billion shares.

When the value of this field equals 99999999999, just replace it with no value.
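A minimal sketch of that replacement, applied to each record before it's sent to Elasticsearch. The field name total_shares is taken from the Elasticsearch error messages elsewhere in these issues.

```python
PLACEHOLDER = "99999999999"

def scrub_placeholder_shares(record, field="total_shares"):
    """Replace the 99999999999 placeholder with no value before indexing."""
    if str(record.get(field)) == PLACEHOLDER:
        record[field] = None
    return record

record = scrub_placeholder_shares({"total_shares": "99999999999"})
```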

Rename the Elasticsearch index

Right now it's named business, singular, while we use businesses, plural, everywhere else. Rename the index to prevent problems arising from this.

Document the API

Set up a page on vabusinesses.org that explains the API, lists its methods, provides sample queries, and demonstrates its utility.

Get Elasticsearch to handle high numbers of shares

There are many 2_corporate businesses with very high numbers of shares. Frankly, I don't believe the claimed figures—I think it's an error on the SCC's part. We're seeing very specific numbers, like 98,900,777,000, 98,900,900,000, and 6,250,000,000. Elasticsearch's error is this:

{"create":{"_index":"business","_type":"2","_id":"nehGiQvZTPisjZzQzPOHpA","status":400,"error":"MapperParsingException[failed to parse [total_shares]]; nested: NumberFormatException[For input string: \"06250000000\"]; "}}

The nut of this is:

failed to parse [total_shares]]; nested: NumberFormatException

Figure out why Elasticsearch doesn't like this number and fix it.

(Related: #38.)
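One clue: the quoted input string parses fine as a 64-bit long, so a plausible cause is that total_shares is mapped as a 32-bit integer and 6,250,000,000 overflows it—Java's parser reports overflow as a NumberFormatException. Remapping the field as long would admit these values. Alternatively, if the figures really are SCC errors, they can be nulled past a cutoff. A sketch of the latter, where the 10-billion cutoff is an assumption, not anything the SCC documents:

```python
def normalize_shares(raw, cutoff=10**10):
    """Strip zero-padding and null out implausibly large share counts."""
    try:
        value = int(raw.lstrip("0") or "0")
    except (AttributeError, ValueError):
        return None        # non-string or non-numeric input
    return value if value < cutoff else None

normalize_shares("06250000000")   # 6250000000
normalize_shares("98900777000")   # None, over the assumed cutoff
```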

Registered names have a single-character state

8_registered_names.json and 8_registered_names.csv only include the first character of the state name. But I can't see why—the table map for that column looks like this:

- name:        res-state
  alt_name:    state
  description: State of Requestor
  group:       address
  type:        A
  start:       332
  length:      2
  search: 
    match:     exact

Figure out why the second character is getting lopped off and fix it.
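A common cause of a lopped-off character is an off-by-one between the 1-based column positions that fixed-width layouts are usually documented with and 0-based string indexing (or a slice written as start:start+length-1). A sketch of the extraction, assuming the table map's start position is 1-based:

```python
def extract_field(line, start, length):
    """Cut a fixed-width field out of a record line.

    `start` is 1-based, as in the table map; convert it to a 0-based
    slice index before cutting."""
    return line[start - 1:start - 1 + length]

# A fabricated record line with "VA" at 1-based position 332.
line = " " * 331 + "VA" + " " * 10
extract_field(line, 332, 2)  # "VA"
```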

Errors when adding some Elasticsearch maps

Type 2:

{"error":"MapperParsingException[No handler for type [int] declared on field [total_shares]]","status":400}

Type 3:

{"error":"MapperParsingException[No type specified for property [coordinates]]","status":400}

Type 4:

{"error":"MapperParsingException[No handler for type [int] declared on field [shares_auth]]","status":400}

Type 9:

{"error":"MapperParsingException[No type specified for property [coordinates]]","status":400}
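All four errors point at the mapping definitions rather than the data: Elasticsearch has no "int" type (its numeric types include "integer" and "long"), and every property, coordinates included, needs an explicit type. A sketch of corrected fragments—the field names come from the errors above, "long" is chosen because some share counts overflow a 32-bit integer, and "geo_point" is an assumption about how the coordinates are stored:

```json
{
  "properties": {
    "total_shares": { "type": "long" },
    "shares_auth":  { "type": "long" },
    "coordinates":  { "type": "geo_point" }
  }
}
```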

Some Roanoke businesses show up in the city and the county

"Windy Lane Associates Limited Partnership" (L014822) is showing up in both Roanoke County and Roanoke city's data. (That isn't true for all Roanoke city businesses, so we can rule out a failure to carve out the city from the county boundary data.) Its address is two miles from the city/county boundary, so it seems unlikely that this is just some minor error in the boundary alignment for the two.

Figure out what's going on here and fix it.

Provide a name for downloaded CSV files

Right now it's download.csv, which is obviously not good.

Challenge: suggest a name to the browser without forcing a download. That is, if the user agent can natively display CSV files, there's no reason to force it to download the file to the desktop.
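This is what the Content-Disposition header's "inline" disposition is for: per RFC 6266 the filename parameter is allowed on inline responses, so a browser that can render CSV displays it, while one that can't (or a user who saves it) still gets a sensible name. Support varies a bit by browser. A sketch of the header; the filename is an assumption about what the site would use:

```python
def csv_disposition_header(filename="businesses.csv"):
    """Suggest a filename without forcing a download.

    "inline" lets a capable user agent display the CSV; the filename
    parameter still names the file if the user saves it."""
    return f'Content-Disposition: inline; filename="{filename}"'

csv_disposition_header()
```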

Elasticsearch's PHP client returns NULL

Actually running a search returns absolutely nothing—not true, not false, but NULL. But Elasticsearch's logging reports that the search was run and the correct results were returned by Elasticsearch.

The bit in question is this:

$results = $client->search($params);
var_dump($results);

That displays NULL every time. Same if I run var_dump($client->search($params)) (so we know it's not some variable weirdness). var_dump($client) looks just fine, too—nothing appears to be wrong there. I've restarted Elasticsearch, renamed $client, pared down $params until it included only the index name, and yelled. None of these things have made any difference. Midway through the debugging process, I upgraded from v1.0.2 (or something like that) to v1.2.0, the current release.

At this point, I'm 90% sure that I'm not doing something stupid, but that there's a bug in elasticsearch-php. I'm going to step away from this for a while, return to it with a fresh perspective, and if it persists, I'll open an issue on the elasticsearch-php repository. The closest issue that I can find is multisearch returning NULL, but I'm dubious that's related.

Downloads are limited to ~9,000 records

Because we're holding the records in memory, there's a finite number of records that we can output, and that number is rather small. Figure out how to remove this limit. Ideally, we'd stream JSON straight out of Elasticsearch to the browser, rather than passing it through a PHP array and back to JSON again.
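The shape of the fix is a generator that emits output a batch at a time instead of accumulating everything first—on the Elasticsearch side this is what the scroll API provides. A language-agnostic sketch (in Python for brevity; `pages` is a plain iterable standing in for successive scroll responses, so the sketch is self-contained):

```python
import csv
import io

def stream_csv(pages, fieldnames):
    """Yield CSV text chunk by chunk instead of building one big array."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    yield buf.getvalue()               # header row first
    for page in pages:
        buf.seek(0)
        buf.truncate()                 # reuse the buffer for each batch
        writer.writerows(page)
        yield buf.getvalue()

# Fabricated example batch; in production each page would be a scroll response.
pages = [[{"corp-id": "L014822", "corp-name": "Windy Lane Associates"}]]
chunks = list(stream_csv(pages, ["corp-id", "corp-name"]))
```

Memory use is then bounded by the batch size, not the result-set size.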

Log update and indexing output

Right now, we're appending the usual > /dev/null 2>&1 to the invocations of update.sh and index.sh in the crontab, but that makes it impossible to debug problems. (For instance, #37 would be much improved by having Elasticsearch's output.) It seems like we should preserve the output to a log file.

Make sure not to append indefinitely. Elasticsearch generates hundreds of MB of output, with no way to dial back the verbosity, so that could get out of control. Instead, wipe the log each time, and write to the file anew.

Bedford is missing

Bedford reverted from a city to a town, and it no longer seems to have a GNIS ID. As a result, we haven't indexed its geodata, for lack of an identifier. Figure out what to do about this.

Let people just search businesses

Let people search not just by data type (LLC, Inc., etc.), but by human-readable terms (e.g., "businesses," "registered agents") that extend the search to the correct file type or types.

Parse YAML

Create a YAML parser that will turn the YAML into a PHP array and a JS object.

Try to fix stupid dates

There are a lot of expiry dates that are invalid dates, which causes Elasticsearch to reject the entire record. In the field expiration_date we're seeing dates like 2025-00-00, 9999-00-99, 2054-12-34, and 2057-09-31, which are all really stupid in their own ways.

At a minimum, we need to perform a sanity check on dates and, if they're invalid, replace them with null values. Better, try to adjust them to something rational. For instance, if the year is more than X years in the future, replace the date with a null value (thus eliminating dates in the year 9999). Or if the date has a valid month but an invalid day, replace the day with some default value (e.g., 01). Ditto for the month.

I hope there's some kind of a Python library that solves this.
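For what it's worth, libraries like python-dateutil will parse loose formats but raise on impossible dates like these rather than repairing them, so some hand-rolled repair is probably needed either way. A minimal sketch; the 50-year horizon, the 1600 floor, the day/month fallback of 01, and the reference date are all assumptions:

```python
from datetime import date

def sanitize_date(raw, max_future_years=50, today=date(2014, 7, 1)):
    """Return a clean ISO date string, a repaired one, or None."""
    try:
        year, month, day = (int(p) for p in raw.split("-"))
    except (AttributeError, ValueError):
        return None
    if not 1600 <= year <= today.year + max_future_years:
        return None                    # drops year-9999 placeholders
    if not 1 <= month <= 12:
        month = 1                      # e.g. 2025-00-00
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return date(year, month, 1).isoformat()  # e.g. 2057-09-31

sanitize_date("2057-09-31")  # "2057-09-01"
sanitize_date("9999-00-99")  # None
```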

Provide CSV downloads

We can't provide CSV downloads now, because each of the three types of business JSON records is different, so we wind up with nonsense results, because the columns differ. The solution is probably to winnow down the included columns to only those that all three types of records share.
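The winnowing step amounts to a key intersection across one sample record of each type. A sketch; the field names below are illustrative, not the actual schemas:

```python
def shared_columns(record_types):
    """Columns present in every record type, in the first type's order."""
    common = set.intersection(*(set(r) for r in record_types.values()))
    first = next(iter(record_types.values()))
    return [k for k in first if k in common]

# Illustrative samples only; real records have many more fields.
samples = {
    "2_corporate": {"corp-id": "", "corp-name": "", "corp-total-shares": ""},
    "9_llc":       {"corp-id": "", "corp-name": ""},
    "3_lp":        {"corp-id": "", "corp-name": "", "coordinates": ""},
}
shared_columns(samples)  # ['corp-id', 'corp-name']
```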

Create a browse interface

Once records are being indexed by Elasticsearch (#3), create an interface to allow people to browse through the records.

Create index types for all data types

At the moment, the business index contains no types. I don't understand why, because the bulk script specifies the type, which should create it if it doesn't exist. Anyhow, create an Elasticsearch type for each data type, so that it's possible to search for just one type of data.

Also, I'm dubious that everything is being indexed within search, given that the index looks like this:

{
   "business":{
      "lp":{
         "properties":{
            "corp-asmt-ind":{
               "type":"string"
            },
            "corp-city":{
               "type":"string"
            },
            "corp-id":{
               "type":"string"
            },
            "corp-inc-date":{
               "type":"string"
            },
            "corp-ind-code":{
               "type":"string"
            },
            "corp-merger-ind":{
               "type":"string"
            },
            "corp-name":{
               "type":"string"
            },
            "corp-per-dur":{
               "type":"string"
            },
            "corp-po-eff-date":{
               "type":"string"
            },
            "corp-ra-city":{
               "type":"string"
            },
            "corp-ra-eff-date":{
               "type":"string"
            },
            "corp-ra-loc":{
               "type":"string"
            },
            "corp-ra-name":{
               "type":"string"
            },
            "corp-ra-state":{
               "type":"string"
            },
            "corp-ra-status":{
               "type":"string"
            },
            "corp-ra-street1":{
               "type":"string"
            },
            "corp-ra-street2":{
               "type":"string"
            },
            "corp-ra-zip":{
               "type":"string"
            },
            "corp-state":{
               "type":"string"
            },
            "corp-state-inc":{
               "type":"string"
            },
            "corp-status":{
               "type":"string"
            },
            "corp-status-date":{
               "type":"string"
            },
            "corp-stock-class":{
               "type":"string"
            },
            "corp-stock-ind":{
               "type":"string"
            },
            "corp-stock-share-auth":{
               "type":"string"
            },
            "corp-street1":{
               "type":"string"
            },
            "corp-street2":{
               "type":"string"
            },
            "corp-total-shares":{
               "type":"string"
            },
            "corp-zip":{
               "type":"string"
            }
         }
      }
   }
}

I think only the very last type of data indexed is actually being retained—everything else is getting blown away.

Greg's Garage is missing

There's a business—"Greg's Garage, LLC" (S258665)—that simply isn't listed in Elasticsearch. It's present in cisbemon.txt, and it's present in 9_llc.csv, but Elasticsearch has no record of it. This could be indicative of a larger problem, and so it's particularly important to figure out what's gone wrong here.

My guess is that it's an indexing problem, but Elasticsearch's import process is so verbose that it's not like I could spot an error if it threw one.

Display file sizes

Some of these are enormous—people should know what they're getting into.

Deal with invalid shares_auth values

In 4_amendments, we're seeing shares_auth values that are not numbers, as the field requires. We're seeing values like PREFER, PREF2 P, PVPREF, and CONVP. I have no idea what these mean. Figure out whether we should replace these values with null values, or whether we need to preserve these records and eliminate the numeric constraint on this field.
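If the replace-with-null option wins, the fix is a one-liner at import time; the alternative (preserving the codes) would instead mean remapping the field as a string. A sketch of the former:

```python
def sanitize_shares_auth(raw):
    """Return shares_auth as an int, or None for codes like "PREFER"."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None

sanitize_shares_auth("PREFER")  # None
sanitize_shares_auth("5000")    # 5000
```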

Cache variables in memcached

  • last updated date
  • $tables (the YAML itself)
  • $sort_order (extracted from YAML)
  • $valid_fields (extracted from YAML)
  • the list of file types, numbers, and their associated files
