brubsby / solarpaneldatawrangler
License: GNU General Public License v3.0
Below is the stack trace. It always seems to happen right at 3 AM, maybe when Mapbox updates their servers. I've added one automatic retry, but maybe I need to try exponential backoff and more retries. Just tracking the issue in case anybody has any guidance.
I was going to try a retry method similar to this, but the mapbox package hides away all the requests details, so it's not possible to specify the session, etc.
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 360, in _error_catcher
yield
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 442, in read
data = self._fp.read(amt)
File "/usr/lib/python3.5/http/client.py", line 448, in read
n = self.readinto(b)
File "/usr/lib/python3.5/http/client.py", line 488, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.5/ssl.py", line 791, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.5/ssl.py", line 575, in read
v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 750, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 494, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 459, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 378, in _error_catcher
raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_inference.py", line 54, in <module>
image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 190, in stitch_image_at_coordinate
images.append(get_image_for_coordinate((column, row),))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 178, in get_image_for_coordinate
image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 156, in gather_and_persist_imagery_at_coordinate
retina=(ZOOM_FACTOR > 0))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/mapbox/services/static.py", line 94, in image
res = self.session.get(uri)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 686, in send
r.content
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 828, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 753, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
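A minimal sketch of what the exponential backoff could look like, wrapping the Static API call itself since the mapbox package doesn't expose its requests session. The map ID, argument names, and delay values here are assumptions, not the project's actual code:

```python
import time

import requests
from mapbox import Static

static = Static()  # reads MAPBOX_ACCESS_TOKEN from the environment


def get_static_image_with_retry(lon, lat, zoom, retries=5, base_delay=1.0):
    """Retry the Mapbox Static API call with exponential backoff on connection errors."""
    for attempt in range(retries):
        try:
            response = static.image('mapbox.satellite', lon=lon, lat=lat, z=zoom)
            response.raise_for_status()
            return response
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError) as error:
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print("Request failed (%s), retrying in %.1fs" % (error, delay))
            time.sleep(delay)
```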
Unsure what caused it; will track here. Looks like maybe a file got corrupted? We should probably regenerate the imagery at this tile if this happens.
Traceback (most recent call last):
File "run_entire_process.py", line 68, in <module>
run_inference.run_classification(args.classification_checkpoint, args.segmentation_checkpoint, BATCHES_BETWEEN_DELETE)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/run_inference.py", line 113, in run_classification
image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 201, in stitch_image_at_coordinate
images.append(get_image_for_coordinate((column, row),))
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 189, in get_image_for_coordinate
image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 166, in gather_and_persist_imagery_at_coordinate
slices_per_side=grid_size)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 85, in slice_image
out = double_image_size(out)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 100, in double_image_size
return image.resize((image.size[0] * 2, image.size[0] * 2), filter)
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/Image.py", line 1804, in resize
self.load()
File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/ImageFile.py", line 238, in load
len(b))
OSError: image file is truncated (82 bytes not processed)
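A hedged sketch of the regenerate-on-truncation idea: force a full decode when opening a tile, and if PIL reports a truncated file, delete it and fetch the tile again. The imported names are taken from the trace above; the file-naming details are assumptions:

```python
import os

from PIL import Image

# Names as they appear in the trace above; this would live in or alongside imagery.py.
from imagery import FINAL_ZOOM, gather_and_persist_imagery_at_coordinate


def load_or_regenerate(path, slippy_coordinate):
    """Open a tile image, re-fetching it once if the file on disk is truncated."""
    try:
        image = Image.open(path)
        image.load()  # force a full decode so truncation surfaces here, not later
        return image
    except OSError:
        # Corrupt or partial file: remove it and gather the imagery again.
        if os.path.exists(path):
            os.remove(path)
        return gather_and_persist_imagery_at_coordinate(slippy_coordinate,
                                                        final_zoom=FINAL_ZOOM)
```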
I have the imagery directory excluded from the project, but PyCharm still runs out of memory even when I'm not running inference. I can either try to fix this or move the data directory somewhere else and make it configurable.
Also, I could periodically delete imagery that falls below a confidence threshold, but I need to keep the 3x3 neighborhood around any image that's above the threshold, so that complicates things.
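A sketch of the keep-set calculation under that scheme: keep every tile above the threshold plus its 3x3 neighborhood, by dilating the positive set one tile in each direction. The shape of the confidence lookup is an assumption:

```python
def tiles_to_keep(tile_confidences, threshold=0.5):
    """Return the set of (column, row) tiles whose imagery should be kept.

    tile_confidences: dict mapping (column, row) -> model confidence.
    """
    keep = set()
    for (column, row), confidence in tile_confidences.items():
        if confidence >= threshold:
            # Keep the positive tile and its eight neighbors (the 3x3 around it).
            for column_offset in (-1, 0, 1):
                for row_offset in (-1, 0, 1):
                    keep.add((column + column_offset, row + row_offset))
    return keep
```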
Could be as simple as sending out an Excel sheet of locations, or as complicated as rolling our own verification software.
Some existing solutions:
MapRoulette
OpenGridMap
This will probably make deploying the code less fragile for contributors and simplify things.
Query existing OSM solar panel locations and persist them, so those locations can be excluded from search/human verification
Currently we save PV locations from OSM into the database but don't use them for filtering. One possible place to filter would be in maproulette.py, so that we don't create tasks for image tiles that already contain PV nodes.
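A sketch of what that filter might look like in maproulette.py, assuming a hypothetical helper that returns the (column, row) tiles derived from the persisted OSM PV nodes:

```python
def filter_tiles_for_tasks(candidate_tiles, osm_pv_tiles):
    """Drop candidate tiles that already contain an OSM solar panel node.

    candidate_tiles: iterable of (column, row) tiles the model flagged as positive.
    osm_pv_tiles: tiles containing persisted OSM PV nodes (hypothetical solardb query).
    """
    osm_pv_tiles = set(osm_pv_tiles)
    return [tile for tile in candidate_tiles if tile not in osm_pv_tiles]
```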
Currently, a new session is created for every method call into solardb.py, which is probably too frequent, especially for methods that get called often.
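One common fix, assuming solardb.py uses SQLAlchemy as it appears to, is a single module-level sessionmaker plus a session_scope context manager, so the hot-path methods can share one session per unit of work instead of opening a new one per call. A sketch (the connection string is a placeholder):

```python
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# One engine and sessionmaker for the whole module instead of one per method call.
engine = create_engine('sqlite:///solar.db')  # placeholder connection string
Session = sessionmaker(bind=engine)


@contextmanager
def session_scope():
    """Provide a transactional scope that callers can share across several operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
```

High-frequency callers could then take an optional session argument and only open their own scope when none is passed in.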
This issue isn't as critical, but it would be nice to have one script that could run all parts of the project from beginning to end. It may not be feasible, but it could aid usability, since a lot of the first prototype had to be fixed manually.
In the DeepSolar supplemental information PDF, there's some pseudo-code outlining an algorithm to group tiles together if the model thinks they are contiguous. This would be helpful for reducing the number of human verification tasks necessary.
However, I'm not sure how best to store this information. Maybe a nullable foreign key on slippy_tiles to a panel_groups table.
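A sketch of that schema in SQLAlchemy terms; only slippy_tiles appears in the issue, so the class and column names here are assumptions:

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()


class PanelGroup(Base):
    """One contiguous group of positive tiles, per the DeepSolar grouping idea."""
    __tablename__ = 'panel_groups'
    id = Column(Integer, primary_key=True)
    tiles = relationship('SlippyTile', back_populates='panel_group')


class SlippyTile(Base):
    __tablename__ = 'slippy_tiles'
    id = Column(Integer, primary_key=True)
    column = Column(Integer)
    row = Column(Integer)
    # Nullable: most tiles never belong to a group.
    panel_group_id = Column(Integer, ForeignKey('panel_groups.id'), nullable=True)
    panel_group = relationship('PanelGroup', back_populates='tiles')
```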
I can't currently imagine exactly what the tests will look like, but we should write some.
I realized that people who find the project solely through this repo might not follow my Twitter or Open Climate Fix, so it might be a good idea to include a section pointing people to other ways they can help, along with an explanation of why we're doing this.
Querying large numbers of rows sorted by centroid_distance (which is how we plan to gather data) is really slow, so we need to add an index on this column.
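If the model is declared in SQLAlchemy, the index can be added declaratively; a sketch, with the real model (and its other columns) living in solardb.py:

```python
from sqlalchemy import Column, Float, Integer
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class SlippyTile(Base):
    __tablename__ = 'slippy_tiles'
    id = Column(Integer, primary_key=True)
    # index=True lets ORDER BY / range queries on centroid_distance use an index.
    centroid_distance = Column(Float, index=True)


# For a database that already exists, the equivalent one-off SQL would be:
#   CREATE INDEX ix_slippy_tiles_centroid_distance ON slippy_tiles (centroid_distance);
```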
Possibly by grouping things into classes, using dunder method naming, etc.
A walkthrough/quickstart section of the README would be helpful for people who want to run this code for their own city, or just as a guide to getting started with contributing to the code.
I didn't originally add docstrings throughout this codebase, but Sean Sall suggested they might help readability with multiple contributors. So feel free to add one docstring or many and submit a pull request against this issue!
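As a starting point, a plain triple-quoted summary plus parameter/return lines is probably enough; the example below is purely illustrative (the described behavior is a guess, not taken from the code):

```python
def get_image_for_coordinate(coordinate):
    """Return the imagery tile for a slippy coordinate, gathering it first if needed.

    (Illustrative docstring only; the real behavior lives in imagery.py.)

    :param coordinate: (column, row) slippy tile coordinate.
    :return: a PIL.Image for that tile.
    """
```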
Optimize slower parts of this code (I've tried to parallelize the inner grid calculation and persistence, but after much effort it still wasn't working correctly).
For a given polygon, we could source statistics about solar PV adoption from various places, compare these to the existing OSM PV data to find the areas with the biggest disparity, and target those first.
Possible sources of summary data:
https://console.cloud.google.com/marketplace/details/bigquery-public-data/project-sunroof
Currently clustering is restartable, but it doesn't add new positive cases to old clusters; it just makes new clusters.
The lowest value for centroid_distance in Austin is 542119.486376057; it should be close to 0 instead.
MapRoulette is pretty good as a verification tool for our task, but we could still try out others, including Tasking Manager 3. I've reached out to a couple of people to try to get project manager status; if this doesn't work out, we could always host our own instance.
Improve gather_city_shapes.gather by looking at all the results returned and choosing the first one that is a polygon or multi-polygon. Currently it just picks the first result and relies on other methods to detect incorrect shapes.
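A sketch of that selection, assuming the Nominatim results come back as GeoJSON-like dicts (as they do when polygon GeoJSON output is requested):

```python
def pick_polygon_result(results):
    """Return the first Nominatim result whose geometry is a Polygon or MultiPolygon.

    results: list of result dicts, each with a 'geojson' key.
    Returns None so the caller can keep its existing bad-shape handling as a fallback.
    """
    for result in results:
        geometry = result.get('geojson', {})
        if geometry.get('type') in ('Polygon', 'MultiPolygon'):
            return result
    return None
```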
In the DeepSolar supplemental information PDF, they detail a clever method they used to restrict the search area to urban areas. They used NASA nightlight imagery as a proxy for population and building density, and selected all areas in the US with an intensity over 128/255. They then calculated, via their random US satellite image sampling numbers, that this covered 95% of all solar panels in the US.
I think the most difficult part of doing this would be mapping each pixel precisely to a lat/lon range, but other than that it seems tractable. It also seems like a very computationally cheap way to slowly expand the search radius compared to polygon math.
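The mapping is simple arithmetic if the nightlight raster is a plain equirectangular grid covering the whole globe, which the published NASA Earth at Night composites are; a sketch under that assumption:

```python
def pixel_to_lat_lon_bounds(row, col, width, height):
    """Map a pixel (row, col) in a global equirectangular raster to its lat/lon bounds.

    Assumes the raster spans longitude [-180, 180] left to right and
    latitude [90, -90] top to bottom.
    Returns (lat_max, lat_min, lon_min, lon_max) for that pixel's footprint.
    """
    degrees_per_col = 360.0 / width
    degrees_per_row = 180.0 / height
    lon_min = -180.0 + col * degrees_per_col
    lat_max = 90.0 - row * degrees_per_row
    return lat_max, lat_max - degrees_per_row, lon_min, lon_min + degrees_per_col
```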
Output from my machine just now when batch deleting imagery during inference
23.02 tiles/s | 18.29 avg tiles/s
18.17 tiles/s | 18.29 avg tiles/s
Starting extraneous imagery cleanup/deletion
Calculation for expanded positive coords for Cambridge, Massachusetts completed
Deleted 0 non-solar panel containing imagery tiles for Cambridge, Massachusetts
Calculation for expanded positive coords for San Antonio, Texas completed
Deleted 75783 non-solar panel containing imagery tiles for San Antonio, Texas
Deleted 0 non-solar panel containing imagery tiles for San Antonio, Texas
Deletion finished
1.08 tiles/s | 18.20 avg tiles/s
8.43 tiles/s | 18.15 avg tiles/s
11.74 tiles/s | 18.12 avg tiles/s
12.93 tiles/s | 18.10 avg tiles/s
23.13 tiles/s | 18.12 avg tiles/s
A couple of guesses as to why this is:
I think that, due to some combination of SIGINT-ing processes and bugs, it's possible to collect a lot of imagery that isn't tracked in the database. It would therefore be helpful to have a job that lists all imagery filenames, removes the ones that are tracked in the database, and then deletes the rest.
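A sketch of that cleanup job, assuming a hypothetical query that returns the image filenames (basenames) the database knows about:

```python
import os


def delete_untracked_imagery(imagery_dir, tracked_filenames, dry_run=True):
    """Delete image files on disk that the database has no record of.

    tracked_filenames: iterable of basenames known to the database (hypothetical query).
    dry_run: when True, only report what would be deleted.
    """
    tracked = set(tracked_filenames)
    deleted = 0
    for name in os.listdir(imagery_dir):
        if name not in tracked:
            if not dry_run:
                os.remove(os.path.join(imagery_dir, name))
            deleted += 1
    print("%s %d untracked imagery files"
          % ("Would delete" if dry_run else "Deleted", deleted))
```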
NAIP 0.6 m data might be sufficient for classification (0.5 m data is supposedly the cutoff).
Lots of sources here from Jack Kelly et al.
Currently this is the slowest part of the process, taking up the majority of the time per tile. However, CPU utilization bounces around and never reaches 100% on all cores, and GPU utilization swings between 30% and 1%.
A couple of ideas for areas of improvement:
This will help new users of the project deploy it easily as it continues to get more complicated. With the addition of rtree, it's not as simple as pip install -r requirements.txt anymore.