
solarpaneldatawrangler's People

Contributors

clockwerx, sallamander, typicaltyler

solarpaneldatawrangler's Issues

Run_inference script dies every night at 3am due to Connection reset by peer

Below is the stack trace. It always seems to happen right at 3 a.m., maybe when Mapbox updates their servers. I've added one automatic retry, but I may need to try exponential backoff and more retries. I'm just tracking the issue in case anybody has any guidance.

I was going to try a retry approach similar to this, but the mapbox package hides away the requests details, so it's not possible to specify the session, etc. (see the sketch after the stack trace).

  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 360, in _error_catcher
    yield
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 442, in read
    data = self._fp.read(amt)
  File "/usr/lib/python3.5/http/client.py", line 448, in read
    n = self.readinto(b)
  File "/usr/lib/python3.5/http/client.py", line 488, in readinto
    n = self.fp.readinto(b)
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.5/ssl.py", line 929, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.5/ssl.py", line 791, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.5/ssl.py", line 575, in read
    v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 750, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 494, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 459, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/urllib3/response.py", line 378, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_inference.py", line 54, in <module>
    image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 190, in stitch_image_at_coordinate
    images.append(get_image_for_coordinate((column, row),))
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 178, in get_image_for_coordinate
    image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 156, in gather_and_persist_imagery_at_coordinate
    retina=(ZOOM_FACTOR > 0))
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/mapbox/services/static.py", line 94, in image
    res = self.session.get(uri)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/sessions.py", line 686, in send
    r.content
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 828, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/requests/models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
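
Here is a minimal sketch of the kind of retry wrapper I had in mind, applied around our own call site instead of inside the mapbox session. The fetch argument stands in for whatever performs the request (for example, a lambda wrapping the static.image call in imagery.py); the retry counts and delays are illustrative.

import random
import time

from requests.exceptions import ChunkedEncodingError, ConnectionError


def get_with_retries(fetch, max_retries=5, base_delay=1.0):
    """Call fetch() with exponential backoff on connection resets.

    fetch is any zero-argument callable that performs the Mapbox request.
    Wrapping our own call site avoids needing access to the session that
    the mapbox package keeps internal.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except (ChunkedEncodingError, ConnectionError):
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

The existing imagery call would then be wrapped as, for example, get_with_retries(lambda: static_client.image(...)).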

Random OSError while running inference

I'm unsure what caused it, so I'll track it here. It looks like a file may have gotten corrupted; we should probably regenerate imagery at this tile if this happens (see the sketch after the traceback).

Traceback (most recent call last):
  File "run_entire_process.py", line 68, in <module>
    run_inference.run_classification(args.classification_checkpoint, args.segmentation_checkpoint, BATCHES_BETWEEN_DELETE)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/run_inference.py", line 113, in run_classification
    image = np.array(imagery.stitch_image_at_coordinate((tile.column, tile.row)))
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 201, in stitch_image_at_coordinate
    images.append(get_image_for_coordinate((column, row),))
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 189, in get_image_for_coordinate
    image = gather_and_persist_imagery_at_coordinate(slippy_coordinate, final_zoom=FINAL_ZOOM)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 166, in gather_and_persist_imagery_at_coordinate
    slices_per_side=grid_size)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 85, in slice_image
    out = double_image_size(out)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/imagery.py", line 100, in double_image_size
    return image.resize((image.size[0] * 2, image.size[0] * 2), filter)
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/Image.py", line 1804, in resize
    self.load()
  File "/home/tyler/PycharmProjects/SolarPanelDataWrangler/venv/lib/python3.5/site-packages/PIL/ImageFile.py", line 238, in load
    len(b))
OSError: image file is truncated (82 bytes not processed)
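
A rough sketch of handling this at the stitching call site. The stitch and regenerate callables are stand-ins: stitch would be the existing stitch_image_at_coordinate, while regenerate is a hypothetical helper that would delete the corrupted cached tiles and re-download them.

import numpy as np


def load_tile_image(tile, stitch, regenerate, max_attempts=2):
    """Stitch the image for a tile, regenerating its cached files once if
    PIL raises OSError because a file on disk is truncated.

    stitch and regenerate are callables taking a (column, row) coordinate;
    regenerate is hypothetical and would delete and re-fetch the imagery.
    """
    coordinate = (tile.column, tile.row)
    for attempt in range(max_attempts):
        try:
            return np.array(stitch(coordinate))
        except OSError:
            if attempt == max_attempts - 1:
                raise
            regenerate(coordinate)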

PyCharm getting slow and crashing (out of memory) when I have 4 million images saved

I have the imagery directory excluded from the project, but PyCharm still runs out of memory even when I'm not running inference. I can either try to fix this or move the data directory somewhere else and make it configurable (see the sketch below).

I could also periodically delete imagery that falls below a confidence threshold, but I need to keep the 3x3 neighborhood around any image above the threshold as well, which complicates things.
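
If the data directory does get moved, a minimal sketch of making its location configurable. The SPDW_IMAGERY_DIR variable name and the fallback path are hypothetical, not existing settings.

import os

# Hypothetical environment variable; fall back to the in-repo directory so
# existing behavior is unchanged when it isn't set.
IMAGERY_DIR = os.environ.get(
    "SPDW_IMAGERY_DIR",
    os.path.join(os.path.dirname(os.path.abspath(__file__)), "imagery"),
)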

OSM PV filtering

Currently we save PV locations from OSM into the database but don't use them for filtering. One possible place for filtering would be in maproulette.py, so that we don't create tasks for image tiles that already contain PV nodes. A sketch follows.
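
A minimal sketch of that filter, assuming the OSM PV locations already saved in the database can be turned into a set of slippy coordinates; the function name and inputs are illustrative.

def filter_tiles_with_existing_pv(candidate_tiles, osm_pv_coordinates):
    """Drop candidate tiles that already contain an OSM-mapped PV node, so
    MapRoulette tasks aren't created for panels that are already mapped.

    osm_pv_coordinates is assumed to be a set of (column, row) slippy
    coordinates derived from the PV locations saved in the database.
    """
    return [tile for tile in candidate_tiles
            if (tile.column, tile.row) not in osm_pv_coordinates]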

Correct SQLAlchemy session handling

Currently a session is created for every method call into solardb.py, which is probably too frequent, especially for methods that get called often.
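
One option is the usual SQLAlchemy session-scope context manager, so higher-level callers can open one session and pass it through many solardb calls. A sketch; the sqlite URL is a placeholder for whatever engine solardb.py already builds.

from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder engine; solardb.py would keep using its own.
engine = create_engine("sqlite:///solar.db")
Session = sessionmaker(bind=engine)


@contextmanager
def session_scope():
    """Provide a transactional scope around a series of operations, so one
    session can be shared across many solardb calls in a loop."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

Call sites would then do "with session_scope() as session:" and pass that session into the solardb functions they call repeatedly.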

Create a single script to run the whole process

This issue isn't strictly necessary, but it would be nice to have one script that runs all parts of the project from beginning to end. It may not be feasible, but it would aid usability, since a lot of the first prototype had to be run and fixed manually.

Positive classification grouping

In the DeepSolar supplemental information PDF, there's some pseudo-code outlining an algorithm to group tiles together when the model thinks they are contiguous. This would help reduce the number of human verification tasks necessary.

However, I'm not sure how best to store this information. Maybe a nullable foreign key on slippy_tiles to a panel_groups table.
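
For the grouping itself, here is a minimal sketch using simple 8-connectivity between positively classified slippy coordinates. This is my reading of the idea, not a port of the DeepSolar pseudo-code.

from collections import deque


def group_positive_tiles(positive_coords):
    """Group positively classified tiles into contiguous clusters.

    positive_coords is an iterable of (column, row) slippy coordinates the
    model classified as containing panels. Tiles belong to the same group
    if they touch horizontally, vertically, or diagonally. Returns a list
    of sets of coordinates, one set per group.
    """
    remaining = set(positive_coords)
    groups = []
    while remaining:
        seed = remaining.pop()
        group = {seed}
        queue = deque([seed])
        while queue:
            col, row = queue.popleft()
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    neighbor = (col + dc, row + dr)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        group.add(neighbor)
                        queue.append(neighbor)
        groups.append(group)
    return groups

Each resulting set could then become a row in a panel_groups table, with the nullable foreign key on slippy_tiles pointing at it.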

Add tests

I can't currently imagine exactly what the tests will look like, but we should write some.

DeepSolar inference

Modify DeepSolar to run inference on arbitrary tile batches (it currently just runs on the test set). Also tracked in the DeepSolar repo as #1.

Update readme to walkthrough entire process

A walkthrough/quickstart section of the README would be helpful for people who want to run this code for their own city, or simply as a guide to getting started with contributing to the code.

Add docstrings throughout

I didn't originally add docstrings throughout this codebase, but Sean Sall suggested they might help readability now that there are multiple contributors. Feel free to add one docstring or many and submit a pull request against this issue!

General Optimization

Optimize the slower parts of this code. (I've tried to parallelize the inner grid calculation and persistence, but after much effort it still wasn't working correctly.)

Make clustering restartable

Clustering is currently restartable, but it doesn't add new positive cases to existing clusters; it just makes new clusters.

Investigate Tasking Manager over MapRoulette

MapRoulette is pretty good as a verification tool for our task, but we could still try out other ones, including Tasking Manager 3. I've reached out to a couple of people to try to get project manager status; if this doesn't work out, we could always host our own instance.

Improve gather_city_shapes.gather

Improve gather_city_shapes.gather by looking at all results returned and choosing the first one that is a polygon or multipolygon. Currently it just picks the first result and relies on other methods to detect incorrect shapes. A sketch follows.
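
A sketch of that selection, assuming the geocoder results carry a GeoJSON geometry under a "geojson" key (as Nominatim does when polygon geometries are requested); the exact response shape depends on how gather currently calls the API.

def first_polygon_result(results):
    """Return the first search result whose geometry is a Polygon or
    MultiPolygon, rather than blindly taking results[0].

    Assumes each result is a dict with a GeoJSON geometry under 'geojson'.
    Returns None if no result has a usable polygon.
    """
    for result in results:
        geometry = result.get("geojson", {})
        if geometry.get("type") in ("Polygon", "MultiPolygon"):
            return result
    return None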

Add way to filter search area based off nightlight intensity

In the DeepSolar supplemental information PDF, they detail a clever method for filtering the search area down to urban areas. They used NASA nightlight imagery as a proxy for population and building density and selected all areas in the US with an intensity over 128/255. They then calculate, via their random US satellite image sampling numbers, that this covered 95% of all solar panels in the US.

I think the most difficult part of doing this would be mapping each pixel precisely to a lat/lon range, but otherwise it seems tractable. It also seems like a very computationally cheap way to slowly expand the search radius compared to polygon math.
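
A sketch of the pixel-to-coordinate mapping, under the assumption that the nightlight raster is a single global equirectangular image spanning longitude -180..180 and latitude 90..-90; a GeoTIFF's affine transform would make this exact.

def pixel_to_lat_lon_bounds(x, y, width, height):
    """Map a pixel (x, y) in a global equirectangular nightlight raster to
    the (north, west, south, east) lat/lon box it covers.

    Assumes row 0 sits at the north pole and column 0 at longitude -180,
    which is only an assumption about how the imagery is laid out.
    """
    lon_per_pixel = 360.0 / width
    lat_per_pixel = 180.0 / height
    west = -180.0 + x * lon_per_pixel
    north = 90.0 - y * lat_per_pixel
    return (north, west, north - lat_per_pixel, west + lon_per_pixel)


def bright_enough(pixel_value, threshold=128):
    """DeepSolar-style urban filter: keep areas with intensity over 128/255."""
    return pixel_value > threshold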

Performance slows after batch deleting imagery during inference

Output from my machine just now, when batch deleting imagery during inference:

23.02 tiles/s | 18.29 avg tiles/s
18.17 tiles/s | 18.29 avg tiles/s
Starting extraneous imagery cleanup/deletion
Calculation for expanded positive coords for Cambridge, Massachusetts completed
Deleted 0 non-solar panel containing imagery tiles for Cambridge, Massachusetts
Calculation for expanded positive coords for San Antonio, Texas completed
Deleted 75783 non-solar panel containing imagery tiles for San Antonio, Texas
Deleted 0 non-solar panel containing imagery tiles for San Antonio, Texas
Deletion finished
1.08 tiles/s | 18.20 avg tiles/s
8.43 tiles/s | 18.15 avg tiles/s
11.74 tiles/s | 18.12 avg tiles/s
12.93 tiles/s | 18.10 avg tiles/s
23.13 tiles/s | 18.12 avg tiles/s

A couple of guesses as to why this happens:

  • maybe changing a bunch of data in sqlite means that indexes have to be rebuilt for certain queries (see the sketch after this list)
  • maybe pausing the API calls to Mapbox for a period of time breaks some sort of ongoing connection to their servers (even though they're separate requests)
  • maybe I'm deleting too much imagery before it's done being used, so it has to re-query a bunch right away (logging how many API queries happen per batch might show whether this is the case)
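
If the first guess is the culprit, one cheap thing to try is refreshing SQLite's planner statistics (and optionally reclaiming freed pages) right after the bulk delete. A sketch; the database filename is a placeholder for whatever file solardb actually points at.

import sqlite3


def refresh_sqlite_stats(db_path="solar.db"):
    """Re-gather the query planner's statistics after a big batch delete.

    ANALYZE is cheap; VACUUM also reclaims the freed pages but can be slow
    on large databases, so it may be worth trying separately.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("ANALYZE")
        conn.execute("VACUUM")
    finally:
        conn.close()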

Create cleanup job to delete untracked imagery

I think that, due to some combination of SIGINT-ing processes and bugs, it's possible to accumulate a lot of imagery that isn't tracked in the database. It would therefore be helpful to have a job that gathers all imagery filenames, removes the ones that are tracked in the database, and then deletes the rest. A sketch follows.
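
A sketch of that job, assuming the tracked filenames can be pulled out of the database as a set of paths relative to the imagery directory; the dry_run default is just a safety choice for a first pass.

import os


def delete_untracked_imagery(imagery_dir, tracked_filenames, dry_run=True):
    """Delete image files on disk that the database doesn't know about.

    tracked_filenames is assumed to be the set of filenames (relative to
    imagery_dir) recorded in the database. With dry_run=True the function
    only reports what would be removed.
    """
    removed = []
    for root, _, files in os.walk(imagery_dir):
        for name in files:
            path = os.path.join(root, name)
            relative = os.path.relpath(path, imagery_dir)
            if relative not in tracked_filenames:
                removed.append(path)
                if not dry_run:
                    os.remove(path)
    return removed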

Optimize run_inference preprocessing/inference pipeline

Currently this is the slowest part of the process, taking up the majority of the time per tile. However, CPU utilization bounces around and never reaches 100% on all cores, and GPU utilization fluctuates between 1% and 30%.

A couple of ideas for areas of improvement:

  • only one image is currently run in each inference batch; bigger batches could give a speedup
  • no multiprocessing is used for image resizing (see the sketch after this list)
  • feed_dict is used for inference (I've read this is not optimal here)
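
For the resizing point, a minimal sketch of moving the PIL load/resize into a worker pool. The file-path interface and the 299x299 size are illustrative only; the real pipeline works on in-memory tiles.

from multiprocessing import Pool

import numpy as np
from PIL import Image


def load_and_resize(path, size=(299, 299)):
    """Load one tile from disk, resize it, and return it as an array.
    Runs in a worker process; the size here is illustrative only."""
    with Image.open(path) as image:
        return np.array(image.resize(size, Image.BILINEAR))


def resize_batch(paths, workers=4):
    """Resize a batch of tiles in parallel instead of serially on the main
    process, which is one way to keep the GPU better fed during inference."""
    with Pool(processes=workers) as pool:
        return pool.map(load_and_resize, paths)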

Dockerize deployment of the project

This will help new users deploy the project easily as it continues to get more complicated.

With the addition of rtree, it's no longer as simple as pip install -r requirements.txt.
