brubsby / solarpaneldatawrangler
License: GNU General Public License v3.0
Below is the stack trace. It always seems to happen right at 3 AM, maybe when Mapbox updates their servers. I've added one automatic retry, but maybe I need to try exponential backoff and more retries. Just tracking the issue in case anybody has any guidance.
I was going to try a retry method similar to this, but the mapbox package hides away all the requests details, so it's not possible to specify the session, etc.
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 360, in _error_catcher
yield
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 442, in read
data = self._fp.read(amt)
File "/usr/lib/python3.5/http/client.py", line 448, in read
n = self.readinto(b)
File "/usr/lib/python3.5/http/client.py", line 488, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.5/ssl.py", line 791, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.5/ssl.py", line 575, in read
v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 750, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 494, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 459, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 378, in _error_catcher
raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_inference.py", line 54, in <module>
image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 190, in stitch_image_at_coordinate
images.append(get_image_for_coordinate((column, row),))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 178, in get_image_for_coordinate
image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 156, in gather_and_persist_imagery_at_coordinate
retina=(ZOOM_FACTOR > 0))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/mapbox/services/static.py", line 94, in image
res = self.session.get(uri)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 686, in send
r.content
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 828, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 753, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
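A minimal sketch of what the exponential backoff could look like, wrapping the Static API call itself since the mapbox package doesn't expose its requests session. The map ID, argument names, and delay values here are assumptions, not the project's actual code:

```python
import time

import requests
from mapbox import Static

static = Static()  # reads MAPBOX_ACCESS_TOKEN from the environment


def get_static_image_with_retry(lon, lat, zoom, retries=5, base_delay=1.0):
    """Retry the Mapbox Static API call with exponential backoff on connection errors."""
    for attempt in range(retries):
        try:
            response = static.image('mapbox.satellite', lon=lon, lat=lat, z=zoom)
            response.raise_for_status()
            return response
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError) as error:
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print("Request failed (%s), retrying in %.1fs" % (error, delay))
            time.sleep(delay)
```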
Unsure what caused it; will track here. Looks like maybe a file got corrupted? We should probably regenerate the imagery at this tile if this happens.
Traceback (most recent call last):
File "run_entire_process.py", line 68, in <module>
run_inference.run_classification(args.classification_checkpoint, args.segmentation_checkpoint, BATCHES_BETWEEN_DELETE)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/run_inference.py", line 113, in run_classification
image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 201, in stitch_image_at_coordinate
images.append(get_image_for_coordinate((column, row),))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 189, in get_image_for_coordinate
image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 166, in gather_and_persist_imagery_at_coordinate
slices_per_side=grid_size)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 85, in slice_image
out = double_image_size(out)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 100, in double_image_size
return image.resize((image.size[0] * 2, image.size[0] * 2), filter)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/Image.py", line 1804, in resize
self.load()
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/ImageFile.py", line 238, in load
len(b))
OSError: image file is truncated (82 bytes not processed)
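A hedged sketch of the regenerate-on-truncation idea: force a full decode when opening a tile, and if PIL reports a truncated file, delete it and fetch the tile again. The imported names are taken from the trace above; the file-naming details are assumptions:

```python
import os

from PIL import Image

# Names as they appear in the trace above; this would live in or alongside imagery.py.
from imagery import FINAL_ZOOM, gather_and_persist_imagery_at_coordinate


def load_or_regenerate(path, slippy_coordinate):
    """Open a tile image, re-fetching it once if the file on disk is truncated."""
    try:
        image = Image.open(path)
        image.load()  # force a full decode so truncation surfaces here, not later
        return image
    except OSError:
        # Corrupt or partial file: remove it and gather the imagery again.
        if os.path.exists(path):
            os.remove(path)
        return gather_and_persist_imagery_at_coordinate(slippy_coordinate,
                                                        final_zoom=FINAL_ZOOM)
```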
I have the imagery directory excluded from the project, but PyCharm still runs out of memory even when I'm not running inference. I can either try to fix this or move the data directory somewhere else and make it configurable.
Also, I could periodically delete imagery that falls below a confidence threshold, but I need to keep the 3x3 neighborhood around any image that's above the threshold, so that complicates things.
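A sketch of the keep-set calculation under that scheme: keep every tile above the threshold plus its 3x3 neighborhood, by dilating the positive set one tile in each direction. The shape of the confidence lookup is an assumption:

```python
def tiles_to_keep(tile_confidences, threshold=0.5):
    """Return the set of (column, row) tiles whose imagery should be kept.

    tile_confidences: dict mapping (column, row) -> model confidence.
    """
    keep = set()
    for (column, row), confidence in tile_confidences.items():
        if confidence >= threshold:
            # Keep the positive tile and its eight neighbors (the 3x3 around it).
            for column_offset in (-1, 0, 1):
                for row_offset in (-1, 0, 1):
                    keep.add((column + column_offset, row + row_offset))
    return keep
```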
Could be as simple as sending out an Excel sheet of locations, or as complicated as rolling our own verification software.
Some existing solutions:
MapRoulette
OpenGridMap
This will probably make deploying the code less fragile for contributors and simplify things.
Query existing OSM solar panel locations and persist them, so those locations can be excluded from search/human verification
Currently we save PV locations from OSM into the database but don't use them for filtering. One possible place to filter would be in maproulette.py, so that we don't create tasks for image tiles that already contain PV nodes.
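A sketch of what that filter might look like in maproulette.py, assuming a hypothetical helper that returns the (column, row) tiles derived from the persisted OSM PV nodes:

```python
def filter_tiles_for_tasks(candidate_tiles, osm_pv_tiles):
    """Drop candidate tiles that already contain an OSM solar panel node.

    candidate_tiles: iterable of (column, row) tiles the model flagged as positive.
    osm_pv_tiles: tiles containing persisted OSM PV nodes (hypothetical solardb query).
    """
    osm_pv_tiles = set(osm_pv_tiles)
    return [tile for tile in candidate_tiles if tile not in osm_pv_tiles]
```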
Currently, a new session is created for every method call into solardb.py, which is probably too frequent, especially for methods that get called often.
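One common fix, assuming solardb.py uses SQLAlchemy as it appears to, is a single module-level sessionmaker plus a session_scope context manager, so the hot-path methods can share one session per unit of work instead of opening a new one per call. A sketch (the connection string is a placeholder):

```python
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# One engine and sessionmaker for the whole module instead of one per method call.
engine = create_engine('sqlite:///solar.db')  # placeholder connection string
Session = sessionmaker(bind=engine)


@contextmanager
def session_scope():
    """Provide a transactional scope that callers can share across several operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```

High-frequency callers could then take an optional session argument and only open their own scope when none is passed in.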
This issue isn't as critical, but it would be nice to have one script that could run all parts of the project from beginning to end. It may not be feasible, but it could aid usability, since a lot of the first prototype had to be fixed manually.
In the DeepSolar supplemental information PDF, there's some pseudo-code outlining an algorithm to group tiles together if the model thinks they are contiguous. This would be helpful for reducing the number of human verification tasks necessary.
However, I'm not sure how best to store this information. Maybe a nullable foreign key on slippy_tiles to a panel_groups table.
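A sketch of that schema in SQLAlchemy terms; only slippy_tiles appears in the issue, so the class and column names here are assumptions:

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class PanelGroup(Base):
    """One contiguous group of positive tiles, per the DeepSolar grouping idea."""
    __tablename__ = 'panel_groups'
    id = Column(Integer, primary_key=True)
    tiles = relationship('SlippyTile', back_populates='panel_group')


class SlippyTile(Base):
    __tablename__ = 'slippy_tiles'
    id = Column(Integer, primary_key=True)
    column = Column(Integer)
    row = Column(Integer)
    # Nullable: most tiles never belong to a group.
    panel_group_id = Column(Integer, ForeignKey('panel_groups.id'), nullable=True)
    panel_group = relationship('PanelGroup', back_populates='tiles')
```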
I can't currently imagine exactly what the tests will look like, but we should write some.
I realized that people who find the project solely through this repo might not follow my Twitter or Open Climate Fix, so it might be a good idea to include a section pointing people to other ways they can help, along with an explanation of why we're doing this.
Querying large numbers of rows sorted by centroid_distance (which is how we plan to gather data) is really slow, so we need to add an index on this column.
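If the model is declared in SQLAlchemy, the index can be added declaratively; a sketch, with the real model (and its other columns) living in solardb.py:

```python
from sqlalchemy import Column, Float, Integer
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class SlippyTile(Base):
    __tablename__ = 'slippy_tiles'
    id = Column(Integer, primary_key=True)
    # index=True lets ORDER BY / range queries on centroid_distance use an index.
    centroid_distance = Column(Float, index=True)


# For a database that already exists, the equivalent one-off SQL would be:
#   CREATE INDEX ix_slippy_tiles_centroid_distance ON slippy_tiles (centroid_distance);
```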
Possibly by grouping things into classes, using dunder method naming, etc.
A walkthrough/quickstart section of the README would be helpful for people who want to run this code for their own city, or just as a guide to getting started with contributing to the code.
I didn't originally add docstrings throughout this codebase, but Sean Sall suggested they might help readability with multiple contributors. So feel free to add one docstring or many and submit a pull request against this issue!
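As a starting point, a plain triple-quoted summary plus parameter/return lines is probably enough; the example below is purely illustrative (the described behavior is a guess, not taken from the code):

```python
def get_image_for_coordinate(coordinate):
    """Return the imagery tile for a slippy coordinate, gathering it first if needed.

    (Illustrative docstring only; the real behavior lives in imagery.py.)

    :param coordinate: (column, row) slippy tile coordinate.
    :return: a PIL.Image for that tile.
    """
```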
Optimize slower parts of this code (I've tried to parallelize the inner grid calculation and persistence, but after much effort it still wasn't working correctly).
For a given polygon, we could source statistics about solar PV adoption from various places, compare these to the existing OSM PV data to find the areas with the biggest disparity, and target those first.
Possible sources of summary data:
https://console.cloud.google.com/marketplace/details/bigquery-public-data/project-sunroof
Currently clustering is restartable, but it doesn't add new positive cases to old clusters; it just makes new clusters.
The lowest value for centroid_distance in Austin is 542119.486376057; it should be close to 0 instead.
MapRoulette is pretty good as a verification tool for our task, but we could still try out others, including Tasking Manager 3. I've reached out to a couple of people to try to get project manager status; if this doesn't work out, we could always host our own instance.
Improve gather_city_shapes.gather by looking at all the results returned and choosing the first one that is a polygon or multi-polygon. Currently it just picks the first result and relies on other methods to detect incorrect shapes.
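A sketch of that selection, assuming the Nominatim results come back as GeoJSON-like dicts (as they do when polygon GeoJSON output is requested):

```python
def pick_polygon_result(results):
    """Return the first Nominatim result whose geometry is a Polygon or MultiPolygon.

    results: list of result dicts, each with a 'geojson' key.
    Returns None so the caller can keep its existing bad-shape handling as a fallback.
    """
    for result in results:
        geometry = result.get('geojson', {})
        if geometry.get('type') in ('Polygon', 'MultiPolygon'):
            return result
    return None
```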
In the DeepSolar supplemental information PDF, they detail a clever method they used to restrict the search area to urban areas. They used NASA nightlight imagery as a proxy for population and building density, and selected all areas in the US with an intensity over 128/255. They then calculated, via their random US satellite image sampling numbers, that this covered 95% of all solar panels in the US.
I think the most difficult part of doing this would be mapping each pixel precisely to a lat/lon range, but other than that it seems tractable. It also seems like a very computationally cheap way to slowly expand the search radius compared to polygon math.
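The mapping is simple arithmetic if the nightlight raster is a plain equirectangular grid covering the whole globe, which the published NASA Earth at Night composites are; a sketch under that assumption:

```python
def pixel_to_lat_lon_bounds(row, col, width, height):
    """Map a pixel (row, col) in a global equirectangular raster to its lat/lon bounds.

    Assumes the raster spans longitude [-180, 180] left to right and
    latitude [90, -90] top to bottom.
    Returns (lat_max, lat_min, lon_min, lon_max) for that pixel's footprint.
    """
    degrees_per_col = 360.0 / width
    degrees_per_row = 180.0 / height
    lon_min = -180.0 + col * degrees_per_col
    lat_max = 90.0 - row * degrees_per_row
    return lat_max, lat_max - degrees_per_row, lon_min, lon_min + degrees_per_col
```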
Output from my machine just now when batch deleting imagery during inference
23.02 tiles/s | 18.29 avg tiles/s
18.17 tiles/s | 18.29 avg tiles/s
Starting extraneous imagery cleanup/deletion
Calculation for expanded positive coords for Cambridge, Massachusetts completed
Deleted 0 non-solar panel containing imagery tiles for Cambridge, Massachusetts
Calculation for expanded positive coords for San Antonio, Texas completed
Deleted 75783 non-solar panel containing imagery tiles for San Antonio, Texas
Deleted 0 non-solar panel containing imagery tiles for San Antonio, Texas
Deletion finished
1.08 tiles/s | 18.20 avg tiles/s
8.43 tiles/s | 18.15 avg tiles/s
11.74 tiles/s | 18.12 avg tiles/s
12.93 tiles/s | 18.10 avg tiles/s
23.13 tiles/s | 18.12 avg tiles/s
A couple of guesses as to why this is:
I think that, due to some combination of SIGINT-ing processes and bugs, it's possible to collect a lot of imagery that isn't tracked in the database. It would therefore be helpful to have a job that lists all imagery filenames, removes the ones that are tracked in the database, and then deletes the rest.
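A sketch of that cleanup job, assuming a hypothetical query that returns the image filenames (basenames) the database knows about:

```python
import os


def delete_untracked_imagery(imagery_dir, tracked_filenames, dry_run=True):
    """Delete image files on disk that the database has no record of.

    tracked_filenames: iterable of basenames known to the database (hypothetical query).
    dry_run: when True, only report what would be deleted.
    """
    tracked = set(tracked_filenames)
    deleted = 0
    for name in os.listdir(imagery_dir):
        if name not in tracked:
            if not dry_run:
                os.remove(os.path.join(imagery_dir, name))
            deleted += 1
    print("%s %d untracked imagery files"
          % ("Would delete" if dry_run else "Deleted", deleted))
```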
NAIP 0.6 m data might be sufficient for classification (0.5 m data is supposedly the cutoff).
Lots of sources here from Jack Kelly et al.
Currently this is the slowest part of the process, taking up the majority of the time per tile. However, CPU utilization bounces around and never reaches 100% on all cores, and GPU utilization swings between 30% and 1%.
A couple of ideas for areas of improvement:
This will help new users of the project deploy it easily as it continues to get more complicated. With the addition of rtree, it's not as simple as pip install -r requirements.txt anymore.