
aiocogeo's Introduction

aiocogeo


Installation

pip install aiocogeo

# With S3 filesystem
pip install aiocogeo[s3]

Usage

COGs are opened using the COGReader asynchronous context manager:

from aiocogeo import COGReader

async with COGReader("http://cog.tif") as cog:
    ...

Several filesystems are supported:

  • HTTP/HTTPS (http://, https://)
  • S3 (s3://)
  • File (/)

Metadata

Generating a rasterio-style profile for the COG:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/lzw_cog.tif") as cog:
    print(cog.profile)

>>> {'driver': 'GTiff', 'width': 10280, 'height': 12190, 'count': 3, 'dtype': 'uint8', 'transform': Affine(0.6, 0.0, 367188.0,
       0.0, -0.6, 3777102.0), 'blockxsize': 512, 'blockysize': 512, 'compress': 'lzw', 'interleave': 'pixel', 'crs': 'EPSG:26911', 'tiled': True, 'photometric': 'rgb'}

Lower Level Metadata

A COG is composed of several IFDs, each with many TIFF tags:

from aiocogeo.ifd import IFD
from aiocogeo.tag import Tag

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/lzw_cog.tif") as cog:
    for ifd in cog:
        assert isinstance(ifd, IFD)
        for tag in ifd:
            assert isinstance(tag, Tag)

Each IFD contains more granular metadata about the image than what is included in the profile. For example, finding the tilesize for each IFD:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/lzw_cog.tif") as cog:
    for ifd in cog:
        print(ifd.TileWidth.value, ifd.TileHeight.value)

>>> 512 512
    128 128
    128 128
    128 128
    128 128
    128 128

More advanced use cases may need access to tag-level metadata:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/lzw_cog.tif") as cog:
    first_ifd = cog.ifds[0]
    assert first_ifd.tag_count == 24

    for tag in first_ifd:
        print(tag)

>>> Tag(code=258, name='BitsPerSample', tag_type=TagType(format='H', size=2), count=3, length=6, value=(8, 8, 8))
    Tag(code=259, name='Compression', tag_type=TagType(format='H', size=2), count=1, length=2, value=5)
    Tag(code=257, name='ImageHeight', tag_type=TagType(format='H', size=2), count=1, length=2, value=12190)
    Tag(code=256, name='ImageWidth', tag_type=TagType(format='H', size=2), count=1, length=2, value=10280)
    ...

Image Data

The reader also has methods for reading internal image tiles and performing partial reads. Currently only jpeg, lzw, deflate, packbits, and webp compressions are supported.

Image Tiles

Reading the top left tile of an image at native resolution:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/webp_cog.tif") as cog:
    x = y = z = 0
    tile = await cog.get_tile(x, y, z)

    ifd = cog.ifds[z]
    assert tile.shape == (ifd.bands, ifd.TileHeight.value, ifd.TileWidth.value)

Partial Read

You can read a portion of the image by specifying a bounding box in the native CRS of the image and an output shape:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/webp_cog.tif") as cog:
    assert cog.epsg == 26911
    partial_data = await cog.read(bounds=(368461,3770591,368796,3770921), shape=(512,512))
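
To make the relationship between a bounding box and image pixels concrete, here is a minimal sketch of the arithmetic for a north-up image with square pixels. The function name `bounds_to_window` is hypothetical and not part of aiocogeo, and the origin and pixel size used in the example call are taken from the lzw_cog.tif profile shown earlier; whether webp_cog.tif shares that transform is an assumption.

```python
def bounds_to_window(bounds, origin_x, origin_y, pixel_size):
    """Convert (minx, miny, maxx, maxy) in CRS units to a pixel window.

    Assumes a north-up image with square pixels. Returns
    (col_off, row_off, width, height) as floats.
    """
    minx, miny, maxx, maxy = bounds
    col_off = (minx - origin_x) / pixel_size   # pixels right of the left edge
    row_off = (origin_y - maxy) / pixel_size   # pixels below the top edge
    width = (maxx - minx) / pixel_size
    height = (maxy - miny) / pixel_size
    return col_off, row_off, width, height


# Hypothetical inputs: the bounds from the read above, with the origin and
# pixel size printed in the lzw_cog.tif profile earlier in this README.
window = bounds_to_window(
    (368461, 3770591, 368796, 3770921), 367188.0, 3777102.0, 0.6
)
print(window)
```

The requested output shape (512, 512) is independent of this window; the reader resamples whatever pixels it fetches to that shape.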

Internal Masks

If the COG has an internal mask, the returned array will be a masked array:

import numpy as np

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/naip_image_masked.tif") as cog:
    assert cog.is_masked

    tile = await cog.get_tile(0,0,0)
    assert np.ma.is_masked(tile)

Configuration

Configuration options are exposed through environment variables:

  • INGESTED_BYTES_AT_OPEN - defines the number of bytes in the first GET request at file opening (defaults to 16KB)
  • HEADER_CHUNK_SIZE - chunk size used to read header (defaults to 16KB)
  • ENABLE_BLOCK_CACHE - determines if image blocks are cached in memory (defaults to TRUE)
  • ENABLE_HEADER_CACHE - determines if COG headers are cached in memory (defaults to TRUE)
  • HTTP_MERGE_CONSECUTIVE_RANGES - determines if consecutive ranges are merged into a single request (defaults to FALSE)
  • BOUNDLESS_READ - determines if internal tiles outside the bounds of the IFD are read (defaults to TRUE)
  • BOUNDLESS_READ_FILL_VALUE - determines the value used to fill boundless reads (defaults to 0)
  • LOG_LEVEL - determines the log level used by the package (defaults to ERROR)
  • VERBOSE_LOGS - enables verbose logging, designed for use when LOG_LEVEL=DEBUG (defaults to FALSE)
  • AWS_REQUEST_PAYER - set to requester to enable reading from S3 RequesterPays buckets.
  • ZOOM_LEVEL_STRATEGY - mimics GDAL's ZOOM_LEVEL_STRATEGY creation option:
    • AUTO or 50 (default) upsamples or downsamples the zoom level whose resolution is closest to the desired resolution.
    • LOWER or 100 always upsamples the zoom level immediately below the desired resolution (requesting less data).
    • UPPER or 0 always downsamples the zoom level immediately above the desired resolution (requesting more data).
    • Another integer from 0 through 100: if the desired resolution is more than this percentage of the way from the zoom level immediately below to the zoom level immediately above, then upsample the zoom level immediately below, else downsample the zoom level immediately above. For example, 1 is the same as UPPER unless the COG's resolution is very close to the zoom level below, e.g. due to floating point imprecision.

Refer to aiocogeo/config.py for more details about configuration options.
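
As an illustration of how environment-driven options with the defaults listed above could be parsed, here is a minimal sketch. It is not the contents of aiocogeo/config.py; the helper names `_env_bool` and `load_config` are hypothetical, and only a subset of the options is shown.

```python
import os


def _env_bool(environ, name, default):
    """Interpret an environment value as a boolean ("TRUE"/"FALSE")."""
    return environ.get(name, default).upper() == "TRUE"


def load_config(environ=os.environ):
    """Build a config dict using the defaults documented in the list above."""
    return {
        # 16KB expressed in bytes
        "INGESTED_BYTES_AT_OPEN": int(environ.get("INGESTED_BYTES_AT_OPEN", "16384")),
        "HEADER_CHUNK_SIZE": int(environ.get("HEADER_CHUNK_SIZE", "16384")),
        "ENABLE_BLOCK_CACHE": _env_bool(environ, "ENABLE_BLOCK_CACHE", "TRUE"),
        "ENABLE_HEADER_CACHE": _env_bool(environ, "ENABLE_HEADER_CACHE", "TRUE"),
        "HTTP_MERGE_CONSECUTIVE_RANGES": _env_bool(
            environ, "HTTP_MERGE_CONSECUTIVE_RANGES", "FALSE"
        ),
        "BOUNDLESS_READ": _env_bool(environ, "BOUNDLESS_READ", "TRUE"),
        "BOUNDLESS_READ_FILL_VALUE": int(environ.get("BOUNDLESS_READ_FILL_VALUE", "0")),
        "LOG_LEVEL": environ.get("LOG_LEVEL", "ERROR"),
    }


# Passing a plain dict instead of os.environ makes the behaviour easy to inspect.
defaults = load_config({})
overridden = load_config({"ENABLE_BLOCK_CACHE": "false", "LOG_LEVEL": "DEBUG"})
print(defaults["INGESTED_BYTES_AT_OPEN"], overridden["ENABLE_BLOCK_CACHE"])
```

In practice the variables would be exported in the shell or set in `os.environ` before the package reads them; when exactly aiocogeo reads them is not stated here.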

CLI

$ aiocogeo --help
Usage: aiocogeo [OPTIONS] COMMAND [ARGS]...

Options:
  --install-completion [bash|zsh|fish|powershell|pwsh]
                                  Install completion for the specified shell.
  --show-completion [bash|zsh|fish|powershell|pwsh]
                                  Show completion for the specified shell, to
                                  copy it or customize the installation.

  --help                          Show this message and exit.

Commands:
  create-tms  Create OGC TileMatrixSet.
  info        Read COG metadata.

aiocogeo's People

Contributors

dmahr1, geospatial-jeff

aiocogeo's Issues

Fix reading of byte formatted tags with text

Example:
Tag(code=305, name='Software', tag_type=TagType(format='c', size=1), count=21, length=21, value=(b'T', b'r', b'i', b'm', b'b', b'l', b'e', b' ', b'G', b'e', b'r', b'm', b'a', b'n', b'y', b' ', b'G', b'm', b'b', b'H', b'\x00'))
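
A minimal sketch of one possible fix, using only the standard library: either join the single-byte values after unpacking, or unpack with the `s` struct format so the value arrives as one bytes object. The helper name `decode_ascii_tag` is hypothetical and the raw bytes are reconstructed from the example above.

```python
import struct

# Raw bytes of the Software tag from the example (20 characters + NUL terminator).
raw = b"Trimble Germany GmbH\x00"

# Unpacking with the 'c' format yields one single-byte object per character,
# which is what produces the tuple shown in the issue.
as_chars = struct.unpack("<21c", raw)


def decode_ascii_tag(value):
    """Join single-byte values, drop trailing NUL padding, and decode to str."""
    return b"".join(value).rstrip(b"\x00").decode("ascii")


# Unpacking with the 's' format yields a single bytes object instead.
(as_bytes,) = struct.unpack("<21s", raw)

software = decode_ascii_tag(as_chars)
print(software)
```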

fix tag value type divergence

Tag values are currently typed as Union[Any, Tuple[Any]]. This causes lots of downstream issues because the type is unclear. It would make the code much cleaner if we removed the Union and only used a single type for tag values. This would also let us add mypy to pre-commit.

remove `run_in_background`

Decompression and postprocessing will never really block the main thread, so it's causing more harm than good.

Add rasterio/tiler extra

Add an extra which includes code to do dynamic tiling with aiocogeo (ex. pip install aiocogeo[tiler])

  • This would be an extra because rasterio is required for coordinate system logic, and I don't want to include it as a core dependency.
  • We should aim to implement a similar interface to rio_tiler.io.base.BaseReader.

Checkout COGDumper

👋 @geospatial-jeff
The subject looks interesting 😄

Not sure what's your idea but if we want to go full async maybe we can use some of the code from https://github.com/mapbox/COGDumper to go GDAL Free ...

COGDumper is not smart and doesn't do any spatial stuff, but if we add pyproj we might be able to:

  • get geospatial info from the COG
  • fetch only the internal tile (and overview tiles) for a specific .read request.

Read COG metadata (IFD/tags)

I'm thinking something like:

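from typing import Dict, List


class Tag:
    ...

class IFD:
    tags: Dict[str, Tag]


class COGTiff:
    fpath: str
    ifds: List[IFD]

    async def __aenter__(self):
        # Request first 16kb, parse ifd and their tags.
        self.ifds = []  # placeholder for the parsed IFDs with their tags
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Needed so the class can be used with `async with`.
        ...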

Usage would be:

async with COGTiff('https://coolsat.com/cog.tif') as cog:
    await cog.read_tile()

We can reuse most of the code from COGDumper but I'd like to make it more object-oriented to make the interface a little easier to use.

I like the AbstractReader used by COGDumper; for now let's focus on making it work for http files and then introduce pluggable readers.

Define tags as semi-private

Made some improvements with #4 and #9 but I'm still not a huge fan.

I think it would be best to switch tags to semi-private attributes and expose important metadata through properties like here. At minimum we should have properties for the rasterio profile. A few reasons:

  1. I don't think most users care about the metadata contained on each Tag object (or even care about all of the defined tags)
  2. Keeping Tag defined on the IFD still resolves #9.
  3. Properties of course will be more user friendly to use (ifd.width instead of ifd.ImageWidth.value).
  4. Making Tag semi-private prevents confusion (ex. ifd.Compression vs ifd.compression is confusing)

rio-tiler integration

Work began in #68 to support tiling with aiocogeo. The next step is to extend rio-tiler's BaseReader instead of defining our own class so aiocogeo can be (kind of) compatible with applications that already use rio-tiler.

Run cpu bound code in background

There is a lot of cpu bound code which blocks the main thread at higher concurrencies like decompression, resampling, and numpy operations. We should look to use something like starlette.concurrency.run_in_threadpool or aiofiles.os.wrap which both use asyncio.loop.run_in_executor to run code in the background without blocking the main thread.

It would be worth benchmarking the difference between a ProcessPoolExecutor and ThreadPoolExecutor (process would definitely be faster but by how much?).
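
A minimal sketch of the `run_in_executor` pattern mentioned above, using a `ThreadPoolExecutor` and zlib as a stand-in for tile decompression. The function names are hypothetical and this is not aiocogeo's implementation; swapping in a `ProcessPoolExecutor` would be the starting point for the benchmark suggested here.

```python
import asyncio
import zlib
from concurrent.futures import ThreadPoolExecutor


def decompress_tile(payload: bytes) -> bytes:
    """CPU-bound stand-in for tile decompression."""
    return zlib.decompress(payload)


async def decompress_many(payloads, executor):
    loop = asyncio.get_running_loop()
    # Each call is handed to the executor so the event loop stays free
    # to service other coroutines while the work runs.
    futures = [loop.run_in_executor(executor, decompress_tile, p) for p in payloads]
    return await asyncio.gather(*futures)


def run_demo():
    # Four synthetic 64 KiB "tiles", compressed and then decoded in the pool.
    tiles = [bytes([i]) * 65536 for i in range(4)]
    payloads = [zlib.compress(t) for t in tiles]
    with ThreadPoolExecutor(max_workers=2) as executor:
        decoded = asyncio.run(decompress_many(payloads, executor))
    return tiles, decoded


tiles, decoded = run_demo()
print(len(decoded), "tiles decoded")
```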

Support more compressions

It would be great to add support for other compressions. Cross referencing the compressions supported by imagecodecs to rio-cogeo profiles, we should support:

  • lzma
  • packbits
  • lerc

We should also support no compression, although I don't think this is very common.

Make header size configurable with environment variable

https://github.com/geospatial-jeff/async-cog-reader/blob/e3b613717291be7d247359480bd8e2f2cd2fe60a/async_cog_reader/constants.py#L3

GDAL docs:

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).

Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems

Refactor partial read for internal masks

Doing a partial read when an internal mask is present is different enough from no mask to warrant refactoring the partial read into two methods. This should also make it easier to support internal masks when merging range requests (#29)

Cache only COG header

https://cogeotiff.slack.com/archives/C01DE57GLHE/p1603130953009500

Summary

Consider a case where N unique tile requests are made to a single COG. Despite the ENABLE_CACHE environment variable being enabled, all requests would be cache misses. Thus at least 2 * N range requests would need to be made to the COG. But if the COG header were cached separately, then only 1 + N range requests would need to be made.

Details

I plan to incorporate aiocogeo within a traditional tile server middleware that handles regular z/x/y.png requests. These currently read PNG tiles that are stored as a tile pyramid in bucket storage. This dated architecture is space inefficient but very performant. I'm hoping to achieve the space savings of COGs (via YCbCr JPEG compression + GDAL mask bands) without a meaningful increase in latency. One way to eliminate that latency is by caching the header in redis or another fast cache available to many servers. For example:

  • Client requests Mercator tile (z, x, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 1 (one of many COG servers).
    • COG server 1 checks redis (or another cache) for cog.tif header, but it's not found. CACHE MISS.
    • COG server 1 makes range request to cog.tif header.
    • COG server 1 caches cog.tif header in redis.
    • COG server 1 makes range request for (z, x, y) tile data.
    • COG server 1 performs postprocessing and returns (z, x, y) tile data to client.
  • Client requests Mercator tile (z, x + 1, y) for cog.tif in cloud bucket storage. Networking layer routes it to COG server 2.
    • COG server 2 checks redis for cog.tif header and it's found. CACHE HIT.
    • COG server 2 makes range request for (z, x + 1, y) tile data.
    • COG server 2 performs postprocessing and returns (z, x + 1, y) tile data to client.

The two header operations in the first request (the range request for the cog.tif header and caching it in redis) are not necessary for the second request.
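
The request counts from the summary (2 * N without a shared header cache, 1 + N with one) can be checked with a small simulation. This is only a sketch: `FakeBucket` stands in for cloud storage, a plain dict stands in for redis, and none of these names come from aiocogeo.

```python
class FakeBucket:
    """Counts range requests; stands in for cloud bucket storage."""

    def __init__(self):
        self.range_requests = 0

    def read_header(self, key):
        self.range_requests += 1
        return f"header-of-{key}"

    def read_tile(self, key, z, x, y):
        self.range_requests += 1
        return f"{key}:{z}/{x}/{y}"


def serve_tile(bucket, header_cache, key, z, x, y):
    """Serve one tile, consulting the shared header cache when one is given."""
    header = None if header_cache is None else header_cache.get(key)
    if header is None:
        header = bucket.read_header(key)      # cache miss: fetch the header
        if header_cache is not None:
            header_cache[key] = header        # share it with other servers
    return bucket.read_tile(key, z, x, y)


def count_requests(n_tiles, shared_cache):
    bucket = FakeBucket()
    cache = {} if shared_cache else None
    for x in range(n_tiles):
        serve_tile(bucket, cache, "cog.tif", 10, x, 0)
    return bucket.range_requests


print(count_requests(5, shared_cache=False), count_requests(5, shared_cache=True))
```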

Read COG tile

Once #2 is ready to go, we need a method to use IFD/tag metadata to read a given tile. COGDumper uses cogdumper.cog_tiles.COGTiff.read_tile, which looks to read just a single XYZ tile based on the tile's coordinate with respect to the (top left?) of the image from the appropriate overview.

As @vincentsarago pointed out, we could use pyproj to:

  • get geospatial info from the COG
  • fetch only the internal tile (and overview tiles) for a specific .read request.

I think it would be nice to implement something similar to rasterio.windows where we can use pyproj to map a particular bounding box to the corresponding XYZ tiles in the COG, but I'm definitely open to other ideas. This brings some questions.

  • How this will work with rio-tiler-v2 -- if at all. If the COGTiff class can implement a similar interface to rasterio.io.DatasetReader it could be passed in as the src_dst to rio_tiler.reader._read but I'm not sure if that is feasible.

Boundless reads

Confirm that boundless reads work (reading a map tile which isn't fully covered by image tiles).

Should just have to add exception handling here to catch TileNotFoundError, and create a mask for the missing portion of the map tile.

Also let the user define what value is used to fill empty pixels.

cc @vincentsarago

Add cog validator

aiocogeo supports a much smaller subset of COG types than gdal, so it would be good to have a way to validate if an image can be read.

Improve IFD/tag composition

Tags are really just metadata about the IFD and it's annoying to access them like:

ifd.tag['TagName'].value

Would be easier to do:

ifd.TagName.value
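
One way to get that attribute-style access is to keep the tag dictionary and delegate missing attributes to it with `__getattr__`. The classes below are a simplified sketch for illustration, not the `Tag` and `IFD` classes shipped in aiocogeo.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Tag:
    name: str
    value: Any


@dataclass
class IFD:
    tags: Dict[str, Tag] = field(default_factory=dict)

    def __getattr__(self, name: str) -> Tag:
        # Only called when normal attribute lookup fails, so `tags` itself
        # and regular methods are unaffected.
        try:
            return self.__dict__["tags"][name]
        except KeyError:
            raise AttributeError(name) from None


ifd = IFD(tags={"ImageWidth": Tag("ImageWidth", 10280)})
print(ifd.ImageWidth.value)
```

The trade-off is that the available tags are not visible in the class definition, which is the concern raised in the issue about defining explicit IFD attributes.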

Merge consecutive Requests

Linked to #21: GDAL merges consecutive requests (horizontal tiles, with band interleaving I think) up to 2 MB (configurable).

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading it will progressively increase the chunk size up to 2 MB to improve download performance. Starting with GDAL 2.3, the GDAL_INGESTED_BYTES_AT_OPEN configuration option can be set to impose the number of bytes read in one GET call at file opening (can help performance to read Cloud optimized geotiff with a large header).

Ref: https://gdal.org/user/virtual_file_systems.html#virtual-file-systems
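
A minimal sketch of merging adjacent byte ranges under a size cap, assuming ranges are (start, end) pairs with an exclusive end. The function name and the 2 MB default are illustrative choices based on the GDAL description above, not aiocogeo's implementation.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def merge_consecutive_ranges(
    ranges: List[Range], max_size: int = 2 * 1024 * 1024
) -> List[Range]:
    """Merge ranges that touch end-to-start, keeping each merged range <= max_size."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged:
            prev_start, prev_end = merged[-1]
            # Merge only when the ranges are contiguous and the result fits the cap.
            if start == prev_end and end - prev_start <= max_size:
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged


print(merge_consecutive_ranges([(0, 100), (100, 250), (400, 500), (250, 300)]))
```

After one merged request, the response body still has to be split back into the individual tiles using the original offsets.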

Large tag value offsets

The first 16KB of the header should contain all IFDs, but large tag values which don't fit in the 4-byte value field of each 12-byte IFD entry may be stored anywhere in the file (even after image data), in which case we'll need to do another range request into the file to fetch the tag value.
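
To illustrate when a second range request is needed, the sketch below parses a classic (non-BigTIFF) 12-byte IFD entry: 2 bytes tag code, 2 bytes field type, 4 bytes count, and a 4-byte field holding either the value itself or an offset to it. The function name and returned dict layout are hypothetical, and only a few field types are listed.

```python
import struct

# Byte size per TIFF field type code: BYTE, ASCII, SHORT, LONG, RATIONAL.
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8}


def parse_ifd_entry(entry: bytes, byteorder: str = "<"):
    """Split a 12-byte IFD entry and report whether the value is stored inline."""
    code, field_type, count, value_field = struct.unpack(f"{byteorder}HHI4s", entry)
    length = TYPE_SIZES[field_type] * count
    if length <= 4:
        # The value fits in the entry itself; no extra request is needed.
        return {"code": code, "length": length, "inline": True,
                "raw": value_field[:length]}
    # Otherwise the 4 bytes hold an offset to the value elsewhere in the file.
    (offset,) = struct.unpack(f"{byteorder}I", value_field)
    return {"code": code, "length": length, "inline": False, "offset": offset}


# ImageWidth (256), SHORT, count 1: 2 bytes, stored inline.
image_width = parse_ifd_entry(struct.pack("<HHIHH", 256, 3, 1, 10280, 0))
# BitsPerSample (258), SHORT, count 3: 6 bytes, stored at an offset (1234 is made up).
bits_per_sample = parse_ifd_entry(struct.pack("<HHII", 258, 3, 3, 1234))
print(image_width["inline"], bits_per_sample["inline"])
```

If the offset plus length falls outside the bytes already fetched, the reader has to issue another range request for that span.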

add logging

Would be really useful for debugging purposes to have more verbosity on reads.

I often use CPL_DEBUG and CPL_CURL_VERBOSE within GDAL to see how much data and how many GET/LIST/HEAD requests GDAL is making.

Side note: maybe having an internal variable to host this could be cool:

async with COGReader("https://async-cog-reader-test-data.s3.amazonaws.com/webp_cog.tif") as cog:
    x = y = z = 0
    tile = await cog.get_tile(x, y, z)

print(cog.requests)
{
    count: 3,   
    size: TotalSizeOfRequest
    get: [
       'offset1-offset2', sizeOfRequest1,
       'offset3-offset4',  sizeOfRequest2,
       'offset5-offset6',  sizeOfRequest3
   ]
} 

Add config option to enable/disable block cache

Add config option to enable/disable block cache (GDAL has this as well). Also the cache is causing tests to fail, so it would be nice to disable caching during tests. That specific test case works by itself but fails during build because the same tiles are requested and cached in a different test case which throws off the number of requests.

Aiocache supports this through cache_read and cache_write kwargs injected to the cache key generator (https://aiocache.readthedocs.io/en/latest/decorators.html#cached)

Define explicit IFD attributes for supported tags

When the interface is finished we should define explicit IFD attributes for the supported tiff tags, a few reasons:

  • Having a huge LUT containing a bunch of tags indicates that the library supports all of those tags when we really only want to support a small subset of tags defined in the TIFF spec which are necessary for partial reads.
  • As currently written, it's not explicitly defined which tiff tags are attached to an IFD. This makes the code much harder to understand and maintain. A user/developer should be able to look at the IFD class definition and know exactly which tiff tags it supports and how they are accessed.

Caching merged range requests

Ref #23

I think there are a few options which could work:

  • Cache individual tiles after the ranged request. This has the benefit of caching the tile regardless of how it was requested (merged vs. unmerged), but adds complexity because we need to check if all of the tiles encapsulated by a specific merged request are cached before doing the request, skipping the merged request and pulling tiles directly from the cache if this is the case.
  • Cache the range request itself, using start/end as the cache key. This is easier to implement but won't cache the same tile across merged and unmerged requests. Another downside is we will only get a cache hit if the exact same range request is performed (ex. if you have two ranges A->D and B->D there will not be a cache hit even though 75% of the imagery is the same between the two requests).
  • Another solution is to cache with some sort of range key so we never request the same byte from the image more than once. This would of course be useful for every range request we perform and would be implemented on the lower-level Filesystem which is a nice design pattern, but I don't think aiocache has support for this.

There is also an argument to be made that choosing a caching strategy which works across both merged and unmerged requests is unnecessary, since (I think?) most users would be exclusively using either merged or unmerged range requests.
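
The second option (keying the cache on the exact start/end of the range request) and its main downside can be sketched in a few lines. `RangeCache` is a hypothetical stand-in using a plain dict rather than aiocache.

```python
class RangeCache:
    """Caches range requests keyed on their exact (start, end) offsets."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable performing the real range request
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, start, end):
        key = (start, end)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        data = self._fetch(start, end)
        self._store[key] = data
        return data


blob = bytes(range(256))
cache = RangeCache(lambda start, end: blob[start:end])

cache.get(0, 100)    # miss: first time this exact range is requested
cache.get(0, 100)    # hit: identical range
cache.get(25, 100)   # miss: 75% of the bytes overlap, but the key differs
print(cache.hits, cache.misses)
```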

Add STAC filesystem

The STAC filesystem would search the item's assets for COGs and return potentially several http or s3 readers

Reduce memory usage

Aiocogeo uses ~4x more memory than rio-tiler when reading a single tile:

Line #    Mem usage    Increment   Line Contents
================================================
    44    115.3 MiB    115.3 MiB   @profile
    45                             def main():
    46    125.7 MiB     10.4 MiB       asyncio.run(_aiocogeo())
    47    128.5 MiB      2.8 MiB       rio_tile()

The culprit is the call to skimage.transform.resize when resampling the image:

Line #    Mem usage    Increment   Line Contents
================================================
   292    118.9 MiB    118.9 MiB       @profile
   293                                 def _postprocess(
   294                                     self, arr: NpArrayType, img_tiles: TileMetadata, out_shape: Tuple[int, int]
   295                                 ) -> NpArrayType:
   296                                     """Wrapper around ``_clip_array`` and ``_resample`` to postprocess the partial read"""
   297    118.9 MiB      0.0 MiB           return self._resample(
   298    126.5 MiB      7.6 MiB               self._clip_array(arr, img_tiles), img_tiles=img_tiles, out_shape=out_shape
   299                                     )
