
vabusinesses.org's Introduction

Virginia Businesses

Website for Virginia State Corporation Commission data.


Running locally

./docker-run.sh to start, ./docker-stop.sh to stop.

Running tests

E2E and functional tests are in /deploy/tests/, and can all be run with /deploy/tests/run-all.sh. From outside of the Docker container, they should be invoked with /run-tests.sh.

vabusinesses.org's People

Contributors

dependabot[bot], snyk-bot, ttavenner, waldoj


Forkers

djeraseit

vabusinesses.org's Issues

Create an API

At a minimum, it should be able to:

  • return JSON for a given corporation ID
  • search against business names

Rotate between Elasticsearch indices

Right now, we have downtime baked into the indexing process, because we clear out the index and then repopulate it. It would make more sense to populate a parallel index and, when that's finished, drop the live one and rename the parallel one (e.g., with vabusinesses and vabusinesses_wip: drop vabusinesses and rename vabusinesses_wip to vabusinesses).
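The rotation can be sketched as follows. This is a toy simulation with a plain dict standing in for the cluster; Elasticsearch has no true in-place rename, and in practice the same swap is done atomically with the index-aliases API, but the sequencing is the same. The index names are the ones proposed in the issue.

```python
def rotate_indices(cluster, live="vabusinesses", wip="vabusinesses_wip"):
    """Promote the freshly built parallel index, dropping the stale live one."""
    if wip not in cluster:
        raise KeyError(f"{wip} has not been built yet")
    cluster.pop(live, None)            # drop the live index
    cluster[live] = cluster.pop(wip)   # "rename" the parallel index to live
    return cluster

cluster = {"vabusinesses": ["stale records"]}
cluster["vabusinesses_wip"] = ["fresh records"]   # repopulated in parallel
rotate_indices(cluster)
```

With an alias, searches would point at the alias the whole time and the gap between the pop and the reassignment disappears entirely.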

Add functionality to search by municipality

People should be able to narrow their search by municipality.

  • get a list of all municipalities (just cities and counties right now) in Virginia
  • obtain GeoJSON for every one of those municipalities
  • pre-index all of those shapes
  • document somewhere how to re-index those shapes
  • add an Elasticsearch interface to search using that GeoJSON
  • add all Virginia towns, too

Set up a parallel industry ID

The SCC's industry ID is nearly useless. Create a process to provide a more granular identifier: a lookup table, plugins for different data sources, etc.

Eliminate "99999999999" shares from 2_corporate

As a placeholder (for what, I don't know), 2_corporate lists 99999999999 shares for businesses. Elasticsearch complains about this and fails to import the record. I don't know why Elasticsearch complains, but it's just as well—they don't really have 100 billion shares.

When the value of this field equals 99999999999, just replace it with no value.
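A minimal sketch of that replacement, applied to each record before it's sent to Elasticsearch. The field name total_shares is taken from the Elasticsearch error messages elsewhere in these issues.

```python
PLACEHOLDER = "99999999999"

def scrub_placeholder_shares(record, field="total_shares"):
    """Replace the 99999999999 placeholder with no value before indexing."""
    if str(record.get(field)) == PLACEHOLDER:
        record[field] = None
    return record

record = scrub_placeholder_shares({"total_shares": "99999999999"})
```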

Rename the Elasticsearch index

Right now it's named business, singular, while we use businesses, plural, everywhere else. Rename the index to prevent problems arising from this.

Document the API

Set up a page on vabusinesses.org that explains the API, lists its methods, provides sample queries, and demonstrates its utility.

Get Elasticsearch to handle high numbers of shares

There are many 2_corporate businesses with very high numbers of shares. Frankly, I don't believe the claimed figures—I think it's an error on the SCC's part. We're seeing very specific numbers, like 98,900,777,000, 98,900,900,000, and 6,250,000,000. Elasticsearch's error is this:

{"create":{"_index":"business","_type":"2","_id":"nehGiQvZTPisjZzQzPOHpA","status":400,"error":"MapperParsingException[failed to parse [total_shares]]; nested: NumberFormatException[For input string: \"06250000000\"]; "}}

The nut of this is:

failed to parse [total_shares]]; nested: NumberFormatException

Figure out why Elasticsearch doesn't like this number and fix it.

(Related: #38.)
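One clue: the quoted input string parses fine as a 64-bit long, so a plausible cause is that total_shares is mapped as a 32-bit integer and 6,250,000,000 overflows it—Java's parser reports overflow as a NumberFormatException. Remapping the field as long would admit these values. Alternatively, if the figures really are SCC errors, they can be nulled past a cutoff. A sketch of the latter, where the 10-billion cutoff is an assumption, not anything the SCC documents:

```python
def normalize_shares(raw, cutoff=10**10):
    """Strip zero-padding and null out implausibly large share counts."""
    try:
        value = int(raw.lstrip("0") or "0")
    except (AttributeError, ValueError):
        return None        # non-string or non-numeric input
    return value if value < cutoff else None

normalize_shares("06250000000")   # 6250000000
normalize_shares("98900777000")   # None, over the assumed cutoff
```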

Registered names have a single-character state

8_registered_names.json and 8_registered_names.csv only include the first character of the state name. But I can't see why—the table map for that column looks like this:

- name:        res-state
  alt_name:    state
  description: State of Requestor
  group:       address
  type:        A
  start:       332
  length:      2
  search: 
    match:     exact

Figure out why the second character is getting lopped off and fix it.
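A common cause of a lopped-off character is an off-by-one between the 1-based column positions that fixed-width layouts are usually documented with and 0-based string indexing (or a slice written as start:start+length-1). A sketch of the extraction, assuming the table map's start position is 1-based:

```python
def extract_field(line, start, length):
    """Cut a fixed-width field out of a record line.

    `start` is 1-based, as in the table map; convert it to a 0-based
    slice index before cutting."""
    return line[start - 1:start - 1 + length]

# A fabricated record line with "VA" at 1-based position 332.
line = " " * 331 + "VA" + " " * 10
extract_field(line, 332, 2)  # "VA"
```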

Errors when adding some Elasticsearch maps

Type 2:

{"error":"MapperParsingException[No handler for type [int] declared on field [total_shares]]","status":400}

Type 3:

{"error":"MapperParsingException[No type specified for property [coordinates]]","status":400}

Type 4:

{"error":"MapperParsingException[No handler for type [int] declared on field [shares_auth]]","status":400}

Type 9:

{"error":"MapperParsingException[No type specified for property [coordinates]]","status":400}
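All four errors point at the mapping definitions rather than the data: Elasticsearch has no "int" type (its numeric types include "integer" and "long"), and every property, coordinates included, needs an explicit type. A sketch of corrected fragments—the field names come from the errors above, "long" is chosen because some share counts overflow a 32-bit integer, and "geo_point" is an assumption about how the coordinates are stored:

```json
{
  "properties": {
    "total_shares": { "type": "long" },
    "shares_auth":  { "type": "long" },
    "coordinates":  { "type": "geo_point" }
  }
}
```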

Some Roanoke businesses show up in the city and the county

"Windy Lane Associates Limited Partnership" (L014822) is showing up in both Roanoke County and Roanoke city's data. (That isn't true for all Roanoke city businesses, so we can rule out a failure to carve out the city from the county boundary data.) Its address is two miles from the city/county boundary, so it seems unlikely that this is just some minor error in the boundary alignment for the two.

Figure out what's going on here and fix it.

Provide a name for downloaded CSV files

Right now it's download.csv, which is obviously not good.

Challenge: suggest a name to the browser without forcing a download. That is, if the user agent can natively display CSV files, there's no reason to force it to download the file to the desktop.
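This is what the Content-Disposition header's "inline" disposition is for: per RFC 6266 the filename parameter is allowed on inline responses, so a browser that can render CSV displays it, while one that can't (or a user who saves it) still gets a sensible name. Support varies a bit by browser. A sketch of the header; the filename is an assumption about what the site would use:

```python
def csv_disposition_header(filename="businesses.csv"):
    """Suggest a filename without forcing a download.

    "inline" lets a capable user agent display the CSV; the filename
    parameter still names the file if the user saves it."""
    return f'Content-Disposition: inline; filename="{filename}"'

csv_disposition_header()
```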

Elasticsearch's PHP client returns NULL

Actually running a search returns absolutely nothing—not true, not false, but NULL. But Elasticsearch's logging reports that the search was run and the correct results were returned by Elasticsearch.

The bit in question is this:

$results = $client->search($params);
var_dump($results);

That displays NULL every time. Same if I run var_dump($client->search($params)) (so we know it's not some variable weirdness). var_dump($client) looks just fine, too—nothing appears to be wrong there. I've restarted Elasticsearch, renamed $client, pared down $params until it included only the index name, and yelled. None of these things have made any difference. Midway through the debugging process, I upgraded from v1.0.2 (or something like that) to v1.2.0, the current release.

At this point, I'm 90% sure that I'm not doing something stupid, but that there's a bug in elasticsearch-php. I'm going to step away from this for a while, return to it with a fresh perspective, and if it persists, I'll open an issue on the elasticsearch-php repository. The closest issue that I can find is multisearch returning NULL, but I'm dubious that's related.

Downloads are limited to ~9,000 records

Because we're holding the records in memory, there's a finite number of records that we can output, and that number is rather small. Figure out how to remove this limit. Ideally, we'd stream JSON straight out of Elasticsearch to the browser, rather than passing it through a PHP array and back to JSON again.
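The shape of the fix is a generator that emits output a batch at a time instead of accumulating everything first—on the Elasticsearch side this is what the scroll API provides. A language-agnostic sketch (in Python for brevity; `pages` is a plain iterable standing in for successive scroll responses, so the sketch is self-contained):

```python
import csv
import io

def stream_csv(pages, fieldnames):
    """Yield CSV text chunk by chunk instead of building one big array."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    yield buf.getvalue()               # header row first
    for page in pages:
        buf.seek(0)
        buf.truncate()                 # reuse the buffer for each batch
        writer.writerows(page)
        yield buf.getvalue()

# Fabricated example batch; in production each page would be a scroll response.
pages = [[{"corp-id": "L014822", "corp-name": "Windy Lane Associates"}]]
chunks = list(stream_csv(pages, ["corp-id", "corp-name"]))
```

Memory use is then bounded by the batch size, not the result-set size.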

Log update and indexing output

Right now, we're appending the usual > /dev/null 2>&1 to the invocations of update.sh and index.sh in the crontab, but that makes it impossible to debug problems. (For instance, #37 would be much improved by having Elasticsearch's output.) It seems like we should preserve the output to a log file.

Make sure not to append indefinitely. Elasticsearch generates hundreds of MB of output, with no way to dial back the verbosity, so that could get out of control. Instead, wipe the log each time, and write to the file anew.

Bedford is missing

Bedford reverted from a city to a town, and it no longer seems to have a GNIS ID. As a result, we haven't indexed its geodata, for lack of an identifier. Figure out what to do about this.

Let people just search businesses

Let people search not just by data type (LLC, Inc., etc.), but by human-readable terms (e.g., "businesses," "registered agents") that extend the search to the correct file type or types.

Parse YAML

Create a YAML parser that will turn the YAML into a PHP array and a JS object.

Try to fix stupid dates

There are a lot of expiry dates that are invalid dates, which causes Elasticsearch to reject the entire record. In the field expiration_date we're seeing dates like 2025-00-00, 9999-00-99, 2054-12-34, and 2057-09-31, which are all really stupid in their own ways.

At a minimum, we need to perform a sanity check on dates and, if they're invalid, replace them with null values. Better, try to adjust them to something rational. For instance, if the year is more than X years in the future, replace the date with a null value (thus eliminating dates in the year 9999). Or if the date has a valid month but an invalid day, replace the day with some default value (e.g., 01). Ditto for the month.

I hope there's some kind of a Python library that solves this.
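For what it's worth, libraries like python-dateutil will parse loose formats but raise on impossible dates like these rather than repairing them, so some hand-rolled repair is probably needed either way. A minimal sketch; the 50-year horizon, the 1600 floor, the day/month fallback of 01, and the reference date are all assumptions:

```python
from datetime import date

def sanitize_date(raw, max_future_years=50, today=date(2014, 7, 1)):
    """Return a clean ISO date string, a repaired one, or None."""
    try:
        year, month, day = (int(p) for p in raw.split("-"))
    except (AttributeError, ValueError):
        return None
    if not 1600 <= year <= today.year + max_future_years:
        return None                    # drops year-9999 placeholders
    if not 1 <= month <= 12:
        month = 1                      # e.g. 2025-00-00
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return date(year, month, 1).isoformat()  # e.g. 2057-09-31

sanitize_date("2057-09-31")  # "2057-09-01"
sanitize_date("9999-00-99")  # None
```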

Provide CSV downloads

We can't provide CSV downloads now, because each of the three types of business JSON records is different, so we wind up with nonsense results, because the columns differ. The solution is probably to winnow down the included columns to only those that all three types of records share.
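The winnowing step amounts to a key intersection across one sample record of each type. A sketch; the field names below are illustrative, not the actual schemas:

```python
def shared_columns(record_types):
    """Columns present in every record type, in the first type's order."""
    common = set.intersection(*(set(r) for r in record_types.values()))
    first = next(iter(record_types.values()))
    return [k for k in first if k in common]

# Illustrative samples only; real records have many more fields.
samples = {
    "2_corporate": {"corp-id": "", "corp-name": "", "corp-total-shares": ""},
    "9_llc":       {"corp-id": "", "corp-name": ""},
    "3_lp":        {"corp-id": "", "corp-name": "", "coordinates": ""},
}
shared_columns(samples)  # ['corp-id', 'corp-name']
```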

Create a browse interface

Once records are being indexed by Elasticsearch (#3), create an interface to allow people to browse through the records.

Create index types for all data types

At the moment, the business index contains no types. I don't understand why, because the bulk script specifies the type, which should create it if it doesn't exist. Anyhow, create an Elasticsearch type for each data type, so that it's possible to search for just one type of data.

Also, I'm dubious that everything is being indexed within search, given that the index looks like this:

{
   "business":{
      "lp":{
         "properties":{
            "corp-asmt-ind":{
               "type":"string"
            },
            "corp-city":{
               "type":"string"
            },
            "corp-id":{
               "type":"string"
            },
            "corp-inc-date":{
               "type":"string"
            },
            "corp-ind-code":{
               "type":"string"
            },
            "corp-merger-ind":{
               "type":"string"
            },
            "corp-name":{
               "type":"string"
            },
            "corp-per-dur":{
               "type":"string"
            },
            "corp-po-eff-date":{
               "type":"string"
            },
            "corp-ra-city":{
               "type":"string"
            },
            "corp-ra-eff-date":{
               "type":"string"
            },
            "corp-ra-loc":{
               "type":"string"
            },
            "corp-ra-name":{
               "type":"string"
            },
            "corp-ra-state":{
               "type":"string"
            },
            "corp-ra-status":{
               "type":"string"
            },
            "corp-ra-street1":{
               "type":"string"
            },
            "corp-ra-street2":{
               "type":"string"
            },
            "corp-ra-zip":{
               "type":"string"
            },
            "corp-state":{
               "type":"string"
            },
            "corp-state-inc":{
               "type":"string"
            },
            "corp-status":{
               "type":"string"
            },
            "corp-status-date":{
               "type":"string"
            },
            "corp-stock-class":{
               "type":"string"
            },
            "corp-stock-ind":{
               "type":"string"
            },
            "corp-stock-share-auth":{
               "type":"string"
            },
            "corp-street1":{
               "type":"string"
            },
            "corp-street2":{
               "type":"string"
            },
            "corp-total-shares":{
               "type":"string"
            },
            "corp-zip":{
               "type":"string"
            }
         }
      }
   }
}

I think only the very last type of data indexed is actually being retained—everything else is getting blown away.

Greg's Garage is missing

There's a business—"Greg's Garage, LLC" (S258665)—that simply isn't listed in Elasticsearch. It's present in cisbemon.txt, and it's present in 9_llc.csv, but Elasticsearch has no record of it. This could be indicative of a larger problem, and so it's particularly important to figure out what's gone wrong here.

My guess is that it's an indexing problem, but Elasticsearch's import process is so verbose that it's not like I could spot an error if it threw one.

Display file sizes

Some of these are enormous—people should know what they're getting into.

Deal with invalid shares_auth values

In 4_amendments, we're seeing shares_auth values that are not numbers, as the field requires. We're seeing values like PREFER, PREF2 P, PVPREF, and CONVP. I have no idea what these mean. Figure out whether we should replace these values with null values, or whether we need to preserve these records and eliminate the numeric constraint on this field.
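If the replace-with-null option wins, the fix is a one-liner at import time; the alternative (preserving the codes) would instead mean remapping the field as a string. A sketch of the former:

```python
def sanitize_shares_auth(raw):
    """Return shares_auth as an int, or None for codes like "PREFER"."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None

sanitize_shares_auth("PREFER")  # None
sanitize_shares_auth("5000")    # 5000
```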

Cache variables in memcached

  • last updated date
  • $tables (the YAML itself)
  • $sort_order (extracted from YAML)
  • $valid_fields (extracted from YAML)
  • the list of file types, numbers, and their associated files
