- Run the `batch-geocoding` terraform in the gcp-terraform monorepo
  - Make sure the boot image is current.

    ```
    boot_disk { initialize_params { image = "arcgis-server-geocoding" } }
    ```
- Update `deployment.yml` from the terraform repo with the private/internal ip address of the compute vm created above. If it is different than what is in your `deployment.yml`, run `kubectl apply -f deployment.yml` again to correct the kubernetes cluster.
A Windows virtual machine runs ArcGIS Server with the geocoding services to support the geocoding jobs. A docker container runs a python script that executes the geocoding against data uploaded to google cloud.
- Use the CLI to split the address data into chunks. By default they will be created in the `data\partitioned` folder.

  ```
  python -m cli create partitions --input-csv=../data/2022.csv --separator=\| --column-names=category --column-names=partial-id --column-names=address --column-names=zone
  ```

  The CSV will contain 4 fields without a header row. They will be pipe delimited without quoting. They will be in the order `system-area`, `system-id`, `address`, `zip-code`. This CLI command will merge `system-area` and `system-id` into an `id` field and rename `zip-code` to `zone` (see the sketch after this list).
- Use the CLI to upload the files to the cloud so they are accessible to the kubernetes cluster containers

  ```
  python -m cli upload
  ```
- Use the CLI to create `yml` job files to apply to the kubernetes cluster nodes to start the jobs. By default the job specifications will be created in the `jobs` folder.

  ```
  python -m cli create jobs
  ```

  To start the jobs, you must apply the `jobs/job_*.yml` files to the cluster. Run this command for each `job.yml` file that you created.

  ```
  kubectl apply -f job.yaml
  ```
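For clarity, the column handling described in the partition step amounts to roughly the transformation below. This is a minimal sketch with pandas, not the CLI's actual implementation; the chunk size, the dash used to join the two system fields, and the output naming are placeholders.

```python
# Sketch of the partition step's column handling (assumes pandas is installed).
# The real logic lives in the CLI; the chunk size, join character, and paths
# here are illustrative only.
import pandas as pd

frame = pd.read_csv(
    "../data/2022.csv",
    sep="|",
    header=None,
    names=["system-area", "system-id", "address", "zip-code"],
    dtype=str,
)

# merge the two system fields into a single id and rename zip-code to zone
frame["id"] = frame["system-area"] + "-" + frame["system-id"]
frame = frame.rename(columns={"zip-code": "zone"})[["id", "address", "zone"]]

# write fixed-size partitions, similar to the files placed in data/partitioned
chunk_size = 100_000
for index, start in enumerate(range(0, len(frame), chunk_size)):
    frame.iloc[start : start + chunk_size].to_csv(
        f"../data/partitioned/partition_{index}.csv", index=False
    )
```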
cloud logging log viewer filters for watching the jobs:

- python geocoding process

  ```
  resource.type="k8s_container"
  resource.labels.project_id="ut-dts-agrc-geocoding-dev"
  resource.labels.location="us-central1-a"
  resource.labels.cluster_name="cloud-geocoding"
  resource.labels.namespace_name="default"
  resource.labels.pod_name:"geocoder-job-"
  resource.labels.container_name="geocoder-client"
  ```
- web api process

  ```
  resource.type="k8s_container"
  resource.labels.project_id="ut-dts-agrc-geocoding-dev"
  resource.labels.location="us-central1-a"
  resource.labels.cluster_name="cloud-geocoding"
  resource.labels.namespace_name="default"
  resource.labels.pod_name:"webapi-api-"
  resource.labels.container_name="webapi-api"
  ```

The kubernetes workloads view is another place to check on the running jobs.
Download the csv outputs from cloud storage and place them in `data/geocoded-results`. `gsutil` can be run from the root of the project to download all the files.

```
gsutil -m cp "gs://ut-dts-agrc-geocoding-dev-result/*.csv" ./../data/geocoded-results
```
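If you prefer to stay in python, the same download can be done with the google-cloud-storage client. This is a sketch, not part of the CLI; it assumes the library is installed, application default credentials are configured, and it reuses the bucket name from the command above.

```python
# Sketch: download all result csvs with the google-cloud-storage client
# instead of gsutil. Assumes `pip install google-cloud-storage` and
# application default credentials.
from pathlib import Path

from google.cloud import storage

destination = Path("data/geocoded-results")
destination.mkdir(parents=True, exist_ok=True)

client = storage.Client(project="ut-dts-agrc-geocoding-dev")

for blob in client.list_blobs("ut-dts-agrc-geocoding-dev-result"):
    if blob.name.endswith(".csv"):
        blob.download_to_filename(str(destination / Path(blob.name).name))
```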
It is a good idea to make sure the addresses that were not found did not fail because of something else.

```
python -m cli post-mortem
```
This will create the following files:

- `all_errors.csv`: all of the unmatched addresses from the geocoded results
- `api_errors.csv`: a subset of `all_errors.csv` where the message is not a normal api response
- `all_errors_job.csv`: all of the unmatched addresses from the geocoded results, but in a format that can be processed by the cluster
- `incomplete_errors.csv`: typically errors that have null parts. This should be inspected because other errors can get mixed in here
- `not_found.csv`: all the addresses that 404'd as not found by the api. `post-mortem normalize` will run these addresses through sweeper.
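A quick way to gauge what went wrong is to tally the messages in those files before deciding which failures matter. This is a throwaway snippet, not a CLI command; it assumes the post-mortem files land in `data/postmortem` (next to `all_errors_job.csv` referenced below) and keep a `message` column, which may differ from the real column names.

```python
# Sketch: summarize the post-mortem output to see which failure modes dominate.
# The data/postmortem location and the `message` column are assumptions;
# adjust to match the files the CLI actually writes.
import pandas as pd

for name in ("all_errors", "api_errors", "incomplete_errors", "not_found"):
    frame = pd.read_csv(f"data/postmortem/{name}.csv")
    print(f"{name}: {len(frame)} rows")

    if "message" in frame.columns:
        print(frame["message"].value_counts().head(10), end="\n\n")
```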
It is recommended to run `all_errors_job.csv` through the cluster again and post-mortem those results to get a more accurate picture of the geocoding job. Make sure to update the job to allow for `--ignore-failures` or it will most likely fast fail.
- Create the job for the postmortem and upload the data to geocode the error results again.

  ```
  python -m cli create jobs --input-jobs=./../data/postmortem --single=all_errors_job.csv
  ```
- Upload the data for the job

  ```
  python -m cli upload --single=./../data/postmortem/all_errors_job.csv
  ```
- Apply the job in the kubernetes cluster

  ```
  kubectl apply -f ./../jobs/job_all_errors_job.yml
  ```
- When that job has completed you can download the results with `gsutil`

  ```
  gsutil cp -n "gs://ut-dts-agrc-geocoding-dev-result/*-all_errors_job.csv" ./../data/geocoded-results
  ```
- Finally, rebase the results back into the original data with the cli

  ```
  python -m cli post-mortem rebase --single="*-all_errors_job.csv"
  ```
Now, the original data is updated with this new run's results to fix any hiccups with the original geocode attempt.
The second post mortem round is to see if the addresses of the records that did not match can be corrected using the sweeper project.
- We need to remove the results of the first round so they do not get processed. Delete the `*-all_errors_job.csv` files from the `data/geocoded-results` folder.
- Post mortem the results to get the current state.

  ```
  python -m cli post-mortem
  ```
- Try to fix the unmatched addresses with sweeper.

  ```
  python -m cli post-mortem normalize
  ```
- Create a job for the normalized addresses

  ```
  python -m cli create jobs --input-jobs=./../data/postmortem --single=normalized.csv
  ```
- Upload the data for the job

  ```
  python -m cli upload --input-folder=./../data/postmortem --single=normalized.csv
  ```
- Apply the job in the kubernetes cluster

  ```
  kubectl apply -f ./../jobs/job_normalized.yml
  ```
- When that job has completed you can download the results with `gsutil`

  ```
  gsutil cp "gs://ut-dts-agrc-geocoding-dev-result/*-normalized.csv" ./../data/geocoded-results
  ```
- Rebase the results back into the original data with the cli

  ```
  python -m cli post-mortem rebase --single="*-normalized.csv" --message="sweeper modified input address from original"
  ```
- Finally, remove the normalized csv and run post mortem one last time to synchronize with reality

  ```
  python -m cli post-mortem
  ```
The geocode results will be enhanced from spatial data. The cli is used to create the gdb for this processing. The layers are defined in `enhance.py` and are copied from the OpenSGID.
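Conceptually, an enhancement step amounts to a point-in-polygon join between the geocoded points and a polygon layer pulled from the OpenSGID. The sketch below shows that idea with geopandas; the actual CLI works against the file geodatabase and the layer list in `enhance.py`, and the column names, CRS, and exported `counties.geojson` file here are assumptions for illustration only.

```python
# Conceptual sketch of one enhancement step: attach polygon attributes
# (e.g. a county name) to each geocoded point. Assumes geopandas is installed,
# the results have x/y columns in EPSG:26912, and a polygon layer has already
# been exported from the OpenSGID to counties.geojson. None of these names
# come from enhance.py itself.
import geopandas as gpd
import pandas as pd

results = pd.read_csv("data/geocoded-results/partition_0.csv")
points = gpd.GeoDataFrame(
    results,
    geometry=gpd.points_from_xy(results["x"], results["y"]),
    crs="EPSG:26912",
)

counties = gpd.read_file("counties.geojson").to_crs("EPSG:26912")

# a left join keeps unmatched points; each matched point picks up the polygon attributes
enhanced = gpd.sjoin(points, counties[["name", "geometry"]], how="left", predicate="within")
enhanced.drop(columns=["geometry", "index_right"]).to_csv(
    "data/geocoded-results/partition_0_1.csv", index=False
)
```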
- The geocoded results will need to be renamed to be compatible with a file geodatabase.

  ```
  python -m cli rename
  ```
- Create the enhancement geodatabase

  ```
  python -m cli create enhancement-gdb
  ```
- Enhance the csv's in the `data\geocoded-results` folder. Depending on the number of enhancement layers, you will end up with a `partition_number_step_number.csv`.

  ```
  python -m cli enhance
  ```
- Merge all the data back together into one `data\results\all.csv` (see the sketch after this list)

  ```
  python -m cli merge
  ```
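The merge at the end is essentially a concatenation of the per-partition csvs into one file. A rough pandas equivalent is sketched below; the real `merge` command may do more, such as selecting only the final enhancement step for each partition or applying extra cleanup.

```python
# Sketch: concatenate the enhanced partition csvs into one results file,
# roughly what `python -m cli merge` produces. The glob here grabs every csv
# in the folder; the real command likely picks only the last step of each partition.
from pathlib import Path

import pandas as pd

parts = sorted(Path("data/geocoded-results").glob("*.csv"))
merged = pd.concat((pd.read_csv(part) for part in parts), ignore_index=True)

Path("data/results").mkdir(parents=True, exist_ok=True)
merged.to_csv("data/results/all.csv", index=False)
```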
- RDP into the machine and install the most current locators. (google drive link or gcp bucket)
  - Create a Cloud NAT and Router to get internet access to the machine
  - Might as well install windows updates
  - Update the locators (there is a geolocators shortcut on the desktop to where they are)
    - I typically compare the dates and grab the `.loc` and `.lox` files from the web api machine and copy them over
- Save a snapshot `arcgis-server-geocoding-M-YYYY`
  - Source disk is `cloud-geocoding-v1`
  - Region is `us-central1`
  - Delete the other snapshots besides the original
- Create an image from the snapshot `arcgis-server-geocoding-M-YYYY`
The geocoding job docker image installs the `geocode.py` file into the container and installs the python dependencies for the script. When the `job.yml` files are applied to the cluster, the `geocode.py` script is executed which starts the geocoding job.

Any time the `geocode.py` file is modified or you want to update the python dependencies, the docker image needs to be rebuilt and pushed to gcr. With `src/docker-geocode-job` as your current working directory...

```
docker build . --tag webapi/geocode-job &&
docker tag webapi/geocode-job:latest gcr.io/ut-dts-agrc-geocoding-dev/api.mapserv.utah.gov/geocode-job:latest &&
docker push gcr.io/ut-dts-agrc-geocoding-dev/api.mapserv.utah.gov/geocode-job:latest
```
To locally test `geocode.py` try a command like

```
python geocode.py geocode partition_7.csv --from-bucket=../../../data/partitioned --output-bucket=./ --testing=true
```
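For orientation, `geocode.py` boils down to reading a partition, geocoding each row against the web api deployment in the cluster, and writing the results to the output bucket. Below is a heavily simplified sketch of that loop; the `API_HOST` variable, the endpoint shape, and the result columns are assumptions based on the UGRC web api, not the script's actual code.

```python
# Heavily simplified sketch of the geocode job loop. The real geocode.py
# handles cloud storage buckets, retries, --ignore-failures, and so on.
# API_HOST, the endpoint shape, and the output columns are assumptions.
import csv
import os
from urllib.parse import quote

import requests

API_HOST = os.getenv("API_HOST", "http://webapi-api")  # internal web api service (assumed)


def geocode(address, zone):
    """Ask the web api for a match and return (score, x, y) or None."""
    response = requests.get(f"{API_HOST}/api/v1/geocode/{quote(address)}/{quote(zone)}", timeout=30)

    if response.status_code != 200:
        return None

    result = response.json().get("result", {})
    location = result.get("location", {})

    return result.get("score"), location.get("x"), location.get("y")


# the partition files follow the id, address, zone layout created earlier
with open("partition_7.csv", newline="") as reader, open("partition_7_out.csv", "w", newline="") as writer:
    rows = csv.DictReader(reader)
    output = csv.writer(writer)
    output.writerow(["id", "score", "x", "y"])

    for row in rows:
        match = geocode(row["address"], row["zone"]) or ("", "", "")
        output.writerow([row["id"], *match])
```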