nci / gsky Goto Github PK

Distributed Scalable Geospatial Data Server

Makefile 0.15% Go 37.30% Smarty 24.40% M4 0.01% CSS 0.19% HTML 0.25% JavaScript 0.04% Shell 1.11% PLpgSQL 3.23% Dockerfile 0.10% C 0.21% C++ 33.01%

geospatial golang grpc postgis remote-sensing wcs wms wps

gsky's People

Contributors

Stargazers

Watchers

Forkers

prl900 stp900 g-bull bje900 edisonguo asivapra myfreebrain flyingliang icpac-igad mapsgeek haoran1012 nishadhka redshiftkeying ggichuru

gsky's Issues

OOM for large polygon

https://github.com/nci/gsky/blob/master/processor/tile_grpc.go#L27 sets the output channel to contain up to 100 rasters without blocking. In most WMS settings, due to small requested polygon size, there is virtually no chance we end up even nearly 100 rasters returned from gdal workers. But in WCS settings, users often want to study large region of interest. For example, a WCS request sent by @juan-guerschman to study Australia as a whole in full resolution, which is a fairly common scenario. http://gsky ip/ows?SERVICE=WCS&service=WCS&crs=EPSG:4326&format=GeoTIFF&request=GetCoverage&height=7451&width=9580&version=1.0.0&bbox=110,-45,155,-10&coverage=global:c6:monthly_frac_cover&time=2018-03-01T00:00:00.000Z

In this scenario, Gsky fills the channel irrespective of the physical memory size. If the server doesn't have enough memory, ows process will get hit by OOM kill.

A quick fix will be like this: we know as inputs the requested width and height. We then try to initialize the channel as a function of total free memory divided by (width x height). If memory is too low, we reject the request.

The above fix is a greedy algorithm that will not find optimum in terms of concurrent processing and memory allocation but might be good enough in practice. This problem is essentially a packing problem (https://en.wikipedia.org/wiki/Bin_packing_problem), which can be NP-hard. Finding a good long-term solution will be left to future work.

WMS returns PNG images when requesting JPEG and GIF formats

WMS 1.3.0 GetCapabilities document advertises that the GetMap request can return images in the image/png, image/jpeg and image/gif formats. GetMap requests with parameter FORMAT=image/jpeg or FORMAT=image/gif returns images in PNG format (and Content-Type: image/png in the response header).

High Availability and Performance Boost of Gsky Workers

There are rooms for improvements for the current GSKY gRPC and workers at multiple places for high availability and performance boost.
1. If a gdal-process dies, the gRPC server currently does not restart the gdal-process and there is no way to restart the dead process other than restarting the entire gPRC server because gRPC server establishes a unix domain socket when it starts the gdal-process.

2. There has been a proposal for c++-based implementation of workers to boost performance. I studied the source of the c++ implementation. Although it was long thought that the overhead of go workers was due to the thread context switch when calling to gdal C function from go, I found the performance gap between c++ and go workers may be due to an architectural difference, at least in theory. Let's compare and contrast the main architectural difference of current go and the proposed c++ workers:

Current go workers - FastCGI style
Essentially the gRPC server acts as process supervisor and forwards gRPC requests to a pool of gdal processes via unix domain sockets. This is highly similar to FastCGI in the web server world. The drawback is that we have to marshal and unmarshal protocol buffer messages of large quantity of the image data between the main server and the gdal processes via the unix domain sockets hence the overhead.
Proposed c++ workers - share listener fd then fork to share nothing
The main server creates a server socket listener and then forks (i.e. linux fork()) a pool of child processes sharing the listener fd. The gRPC in each child process then accepts the shared listener fd and directly processes gRPC requests and sends back responses. This essentially eliminates the needs of an extra CGI layer hence no marshaling/unmarshaling overhead for large amount of image data. The drawback is that we won't be able to easily plug in third party processes like the FastCGI style. In the long run, this will be problem because we want to execute arbitrary client's processes. But in the short term, this shouldn't be a concern.

Having said that, the c++ workers are still in it's infancy and the algorithmic correctness needs to be verified. For the foreseeable short run, it is possible to enhance the current go workers to minimise or even close the performance gap using the following socket sharding strategy:

Shared-nothing architecture with socket sharding
Linux Kernel 3.9 has SO_REUSEPORT for socket options. This feature allows load balancing TCP servers binding to same listening address and port in the kernel space. More importantly, one can launch identical gRPC servers directly bundled with a gdal process therefore eliminating the needs of IPC between the main process supervisory server and gdal processes. This is the essence of the architecture of the c++ workers. Concretely speaking, we proceed as follows:
a. We treat gdal process as internal processes, which means we call gdal processes like a normal go function call. And we bundle a gRPC server on top of it to take requests. If the process is a third party process, we fall back to FastCGI style.
b. Before we launch the gRPC server, we setup the socket to enable SO_REUSEPORT and self fork a pool of processes. Later when gRPC server starts, socket sharding will automatically happen behind the scene in the kernel space.

The socket sharding approach provides both high availability and performance boost due to the elimination of IPC between gRPC server and gdal processes. It also scales horizontally with number of cores/machines. At this stage, we consider this go implementation a reference implementation. Later when we actually move to c++ workers, we will have solid strong baseline to compare against.

mas_intersect_polygons() doesn't handle edge cases for time range overlaps

PostgresSQL OVERLAPS bounds are non-inclusive at the top end: start <= time < end. The po_stamp_min and po_stamp_max columns form a range of [t_min, t_max) in the MAS database. Let [t0, t1) denote the requested time range from gsky. We have two possible edge cases for time range overlaps:

t1 == t_min, therefore [t0, t1 + 1 second) will intersect [t_min, t_max) -> this is handled by the current mas_intersect_polygon() code
t0 == t_max. mas_intersect_polygon() will not be able to handle this case because of the t_max is on the half-open side of the range therefore no overlapping.

Pulling timestamps from MAS

Currently GSKY uses hard-coded time generators to generate timestamps for the WMS and WCS GetCapabilities http operations. This requires hard coding the generation rules for each dataset as different datasets have different timesteps. However, the timestamps are stored in MAS by the crawler. We can pull these timestamps from MAS instead of generating them by hand.

performance bottleneck of mas_intersects() for complicated polygons

@juan-guerschman discovered this performance issue when polygon is defined with large number of points. Note that this should not be considered corner cases, because this can happen frequently as far as region boundaries are concerned. For example, the boundary of a small town in Australia can be defined with thousands of polygon points. A typical town's boundary can reach about 55,000+ polygon points.

Therefore, exact polygon intersection is infeasible in such a case. Instead, one should seek for approximate algorithms to smooth out the polygon first before we intersect all the data files against it. A classic algorithm to smooth out the polygon is called Douglas-Peucker algorithm. PostGIS has corresponding function called ST_Simplify() that implements this algorithm. ST_Simplify() itself can be performance bottleneck if there are too many polygon points. Therefore, one should use ST_RemoveRepeatedPoints() to remove the points within a tolerance distance first before we call ST_Simplify(). An extensive discussion regarding this matter is here: https://www.standardco.de/squeezing-performance-out-maps-with-lots-of-polygons-leaflet-postgis

Note:

The tolerance level of ST_Simplify() can be set according to zooming levels. For WGS84, the table is tolerance vs zoom levels are here:
https://gist.github.com/lukasmartinelli/19132751535fae4f8175dfa45bb1459c

 zoom_level |    tolerance     
------------|------------------
          0 |  78271.516953125
          1 | 39135.7584765625
          2 | 19567.8792382812
          3 | 9783.93961914062
          4 | 4891.96980957031
          5 | 2445.98490478516
          6 | 1222.99245239258
          7 | 611.496226196289
          8 | 305.748113098145
          9 | 152.874056549072
         10 | 76.4370282745361
         11 | 38.2185141372681
         12 |  19.109257068634
         13 | 9.55462853431702
         14 | 4.77731426715851

The tolerance level of ST_RemoveReaptedPoints() is the distance in the unit of reference coordinate system. For example, for WGS84, the tolerance level can be set according to the precision of decimal degrees.

shard_creation.sh doesn't check for the existence of shard

Currently shard_creation.sh in mas/db doesn't check for the existence of the shard before creating it. This causes https://github.com/nci/gsky/blob/master/mas/db/shard_create.sh#L19 to run to wipe out everything in the existing schema by dropping all the existing tables. It will be desirable to check for the existence of the shard to ensure multiple runs of shard_creation.sh safe.

tile_grpc vulnerable to DDOS

Apparently tile_grpc has no timeout in place. Clients can potentially request large polygons in either WMS or WCS concurrently several times to overwhelm the grpc workers. If the client is designed to malicious, it wont take long to DDOS the grpc workers.

Suggest including “current” attribute in the WMS GetCaps Dimension element

Perhaps consider including the “current” attribute in the Dimension element for each layer, set to True.

Each layer in the WMS 1.3.0 GetCapabilities document has a Dimension element for time, with the “default” attribute set to “current” to indicate that the service should return the most recent data if the TIME parameter is not provided in GetMap requests (note that the spec is not explicit as to whether the “default” attribute must equal a literal time corresponding to a time value in the Dimension element text or whether the keyword “current” can be used to indicate the latest time in a list of times contained in the Dimension element text). The WMS 1.3.0 spec states that the “current” attribute in the Dimension element should be set to True or “1” to indicate that requests can be made for the most current data, i.e. TIME=current in the GetMap request. Given that omitting the TIME parameter in GetMap requests for this service is equivalent to including TIME=current, it would be a more complete GetCapabilities document if the “current” attribute set to True is included in the Dimension elements.

Streamlining crawling and mas ingestion

Currently the crawling and mas ingestion process require manual and ad-hoc steps. This is error prone for non-technical users. It would be good to streamline the crawl such that non-technical users need to only run a main script to recursively crawl a directory for certain types of files. It would be also good to streamline the process of mas ingestion such that non-technical users need to only supply a shard name and the crawler output files.

In addition, non-technical users will also benefit from documentations of crawler and mas to gain usage guidance.

Gsky sends separate MAS requests per namespace

https://github.com/nci/gsky/blob/master/processor/tile_indexer.go#L73 hits MAS end point for each namespace. This is unnecessary as MAS end point accepts packing multiple namespaces in a single request. https://github.com/nci/gsky/blob/master/mas/api/api.go#L72

GetCapabilities doesn't refresh dates

There have been dynamically generated dates such as pulling dates from MAS. But GSKY right now only loads the dates when loading the config files. Thus the WMS/WCS GetCapabilities will not return the up-to-date dates.

mas timestamps issues

The mas timestamps feature now has two bugs:

In utils/config.go, mas time generator uses since http parameter to indicate end date but mas api accepts until
Some datasets such as chirps have empty namespace. If namespace is empty/null, MAS should return all the timestamps regardless of namespaces.

Caching gRPC outputs shared between tiles

Neighbouring tiles often share a subset of underlying data files. The current Gsky code base treats each tile request independently and therefore doesn't cache the the gRPC outputs corresponding to the shared underlying data files. This is a pronounced performance problem in terria as terria virtually always requests a large region of neighbouring tiles. The problem is worsened for high resolution data and the zooming in operations as these operations span a fairly large amount of shared data files.

To fix this, one can compute a hash for several request parameters that uniquely identify the request in question before calling the gRPC worker (i.e. https://github.com/nci/gsky/blob/master/processor/tile_grpc.go#L188) and cache/return the gRPC outputs accordingly. For example, the key can be calculated like the following using a few request objects right before line 188.

h := fnv.New32a()
h.Write([]byte(BBox2WKT(geot)))
geoStamp := fmt.Sprintf("%s_%v_%v", g.Path, g.TimeStamp.UnixNano(), int64(h.Sum32()))

The actual cache solution is a strategic decision to make. We can implement:

in memory cache using a hash table that supports concurrent read/write
using local file system where file name is the key and file content is the cached results
using a dedicated cache server such as memcache and redis.

@bje- Any suggestions on caching strategies?

mas data ingestion operations require root

Currently the shard_x.sh scripts have "runuser postgres" statements in them. runuser requires setuid privilege which often means root. This is undesirable if the scripts are run by data collection/management teams as they're not going to have root like sysadmin does.

Towards Distributed WCS

The current WCS functionality does not scale beyond single ows node. The clients who use WCS often want high or even full resolution data. This can result in very large output images beyond the memory limit of single node. Therefore, we need to build facility that coordinates a cluster of ows nodes to compute large WCS requests. One variable implementation is as follows:

On the master ows node, we split the requested bounding box into a series of smaller bounding boxes. Doing so will both reduce memory pressure and boost performance due to concurrent processing among the splits on the gRPC nodes.
We then distribute portions of the splits to the worker ows nodes.
The master ows node waits for the completion of worker nodes and merge the results.

The following experimental results demonstrate the effectiveness of the above proposed strategy. The environment of this experiment is a single node with 8 physical cores and 16GB memory. The ows cluster consists of 4 ows processes running on the same node. The same node also has 8 gRPC workers. Our baseline ows is also built in this environment to ensure all the experimental conditions are the same in order have a fair comparison. The test data is Geoglam monthly fractional cover. The WCS request is as follows:
http://<gsky server ip>/ows/?SERVICE=WCS&service=WCS&crs=EPSG:4326&format=geotiff&request=GetCoverage&height=<height>&width=<width>&version=1.0.0&bbox=-179,-80,180,80&coverage=global:c6:monthly_frac_cover&time=2018-01-01T00:00:00.000Z

The bounding box -179, -80, 180, 80 virtually covers the entire planet. The resultant number of data files for such a bounding box is 816.

The following table shows the processing time between the baseline and the proposed solution. Processing time is the round-trip time of the WCS request.

Wdith	Height	Baseline (secs)	Proposed (secs)
2000	2000	12.191	10.822
5000	5000	44.216	13.364
10000	10000	OOM	19.650
20000	20000	OOM	38.755
40000	40000	OOM	104.930
80000	80000	OOM	336.439
121717	54247	OOM	345.774

Note:

OOM stands for out of memory.
Given bbox, 12717x54247 is the full resolution of the dataset i.e. the theoretical maximum width and height.

Capturing nodata value during crawling

It will be useful to include the nodata value during the crawling process. For example, we may skip the grpc entirely for the data files full of nodata. This will dramatically boost performance for datasets where ocean areas are marked as nodata.

WCS DescribeCoverage doesn't display netcdf file option

WCS DescribeCoverage doesn't display the netcdf file option even though it is accepted as valid input option.

E.g.,
http://130.56.242.16/ows?service=WCS&version=1.0.0&coverage=LS5:NBAR:TRUE&request=DescribeCoverage

<supportedFormats>
<formats>GeoTIFF</formats>
</supportedFormats>

http://130.56.242.16/ows?crs=EPSG%3A4326&service=WCS&format=NetCDF&request=GetCoverage&height=256&width=256&version=1.0.0&BBox=148%2C-37%2C151%2C-34&Coverage=LS7%3ANBAR%3ATRUE&time=1999-08-05T00%3A00%3A00.000Z

Inefficiently sending wcf output files over the network

Reading the entire temp geotiff/netcdf file into memory https://github.com/nci/gsky/blob/master/utils/output_encoders.go#L290 and then send the file contents over http connection is not efficient in terms of speed and memory. Instead, one should use io.copy() to copy the fd of the temp file into network socket. The implementation of io.copy uses sendfile() syscall golang/go#10948, which is far more efficient than manually reading the file and send it over the network.

shard_create.sh doesn't guard against different gpath for same shard code

The combination of shard code and gpath in the public.shards table must be unique. Now consider this scenario: when a user calls shard_create.sh with an existing shard code but the user supplied gpath is different from the gpath associated with the existing shard code in MAS database. Apparently, shard_create.sh assumes shard existed and skips shard creation. If the user then proceed with data ingestion, the data will be ingested into the same shard but with a different gpath. Therefore, the newly ingested data will never be retrieved later.

Ambiguity as to whether WMS 1.1.1 is supported

It is unclear whether WMS 1.1.1 supported. Issuing a GetCapabilities request without a VERSION parameter returns the following error message suggesting 1.1.1 is supported:

This server can only accept WMS requests compliant with version 1.1.1 and 1.3.0: /ows?request=GetCapabilities&service=WMS

Submitting a GetCapabilities request with VERSION=1.1.1 returns a 1.3.0 statement.

Unable to make WCS GetCoverage requests using OGC:CRS84

For GSKY service endpoint http://130.56.242.16/ows, WCS GetCoverage requests with CRS=OGC:CRS84 (a supported CRS in the DescribeCoverage statement) fail with (abridged) error message:

Request ..... should contain a valid ISO 'crs/srs' parameter.

Note that successful GetCoverage requests can be made using the EPSG:4326 CRS, provided in the /CoverageDescription/CoverageOffering/domainSet/spatialDomain/gml:RectifiedGrid element of the DescribeCoverage statement, although this CRS is not stated as supported.

Example request with CRS=OGC:CRS84 (fail):
http://130.56.242.16/ows?SERVICE=WCS&VERSION=1.0.0&REQUEST=GetCoverage&COVERAGE=global:c5:frac_cover&CRS=OGC:CRS84&BBOX=-1.8,-0.9,1.8,0.9&FORMAT=GeoTIFF&WIDTH=10&HEIGHT=10

Same request with CRS=EPSG:4326 (success):
http://130.56.242.16/ows?SERVICE=WCS&VERSION=1.0.0&REQUEST=GetCoverage&COVERAGE=global:c5:frac_cover&CRS=EPSG:4326&BBOX=-1.8,-0.9,1.8,0.9&FORMAT=GeoTIFF&WIDTH=10&HEIGHT=10

WMS 1.3.0 GetCaps XML Schema validation errors

WMS 1.3.0 GetCapabilities document has the following XML Schema validation errors:

/WMS_Capabilities/Service/ContactInformation/ContactPersonPrimary/ContactPerson element is missing
/WMS_Capabilities/Service/ContactInformation/ContactAddress/PostCode element is missing

WMS GetCaps POST request fails

WMS 1.3.0 GetCapabilities request using the POST method with the URL provided in the /WMS_Capabilities/Capability/Request/GetCapabilities/DCPType/HTTP/Post/OnlineResource element of the GetCaps document fails with message:

Not a valid OWS request. URL /ows does not contain a valid 'request' parameter.

POST body:

<?xml version="1.0" ?>
<ows:GetCapabilities xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ows="http://www.opengis.net/ows"
    xsi:schemaLocation="http://www.opengis.net/ows http://schemas.opengis.net/ows/1.0.0/owsGetCapabilities.xsd">
    <ows:AcceptVersions>
        <ows:Version>1.3.0</ows:Version>
    </ows:AcceptVersions>
</ows:GetCapabilities>

Does the POST request for the GetCapabilities operation need to be supported? If not, suggest removing POST OnlineResource element from the GetCaps statement.

gsky silently exits if address already in use

If two copies of gsky are started on the same port number, the server exits silently with no explanation of the problem. Furthermore, it exits with an exit code of 0.

Towards generic WPS polygon drill

Apparently, the WPS polygon drill facility is hard coded to work with Geoglam monthly fractional cover and CHIRPS 2.0 datasets only as well as hard coded csv output templates. As discussed with clients over multiple meetings, polygon drill needs to generalised to any datasets that GSKY can serve. This requires implementing the following changes into the current polygon drill facility:

Generalise WPS execute() in ows such that it can work with arbitrary number of data sources with arbitrary start/end dates and number of bands (i.e. namespaces). Essentially this is data layer with arbitrary number of bands. In contrast, a data layer can only have either 1 or 3 bands in order to render a colour image such as a png image.
Generalise drill_merger such that it can work with arbitrary number of bands (i.e. namespaces) instead of hard coded 4 bands currently.
Generalise drill_merger such that it can evaluate one or more mathematical expressions (e.g. a golang expression) for the bands on the fly. Currently geoglam computes total cover on the fly by taking a sum between two bands. If we generalise beyond geoglam, we will need to evaluate general expressions. For example, an expression of band sum phot_veg + nphot_veg will be parsed and evaluated by GSKY on the fly to compute geoglam total cover once this feature is implemented.
Generalise the csv output templates on a per data source per process basis. Each data source for each process will have different units of measurements, colours, titles, abstracts, number of csv columns, etc. Thus we will need one template per data source per process.
Generalise drill.go worker such that it can handle any data types that GSKY ows can handle and seek approximate algorithms for computing the statistics for large datasets.
Generalise ows.go such that users can specify start datetime and end datetime for the time period of polygon drill.

WMS not working in QGIS

When adding WMS layer in QGIS using full GetMap request (http://gsky.nci.org.au/ows?layers=LS7%3ANBAR%3ATRUE&styles=&crs=EPSG%3A4326&bgcolor=0xFFFFFF&height=256&width=256&version=1.3.0&bbox=-35.3075%2C149.12441666666666%2C-35.1%2C151&time=1999-08-05T00%3A00%3A00.000Z&transparent=FALSE), an error message is displayed when trying to add any available layer. Seems to be trying to make a GetLegend request based on the part of the error message visible (error message gets cut off... will try and put screenshot if able):

Error downloading http://gsky.nci.org.au/ows?service=WMS&request=GetLeg

(Note: tested from QGIS 2.14.8-Essen and 2.18.2 Las Palmas)

MAS API doesn't properly check postgres connection open state

Currently MAS API doesn't properly check for the actual connection open state, which will cause issues for the downstream code such as gsky tile_indexer.go. Essentially sql.Open() does lazy Open() in a sense that database connection isn't open until the first sql query. Thus to test if the connection is okay, we will have to actually issue a simple query assert the connection open state.

GSKY docker image

Despite the fact that GSKY ships with fully automated build scripts in terms of ./configure make make install. The build and installation of GSKY dependencies in a fresh environment can still be a daunting task. For end users who want to quickly check out GSKY, they need a docker image like many other open source projects do. For our developers, it would be also beneficial to containerize GSKY for a variety purposes such as automated testing, quick dev environment setup, and easy deployment.

Apart from the technical side that creates the docker image, we also need to be aware that NCI needs to

Create an organizational account on docker hub to host GSKY images and made them publicly accessible.
Find a location to host sample data for at least a minimum GSKY demo. There are a few ways we can host the sample data
a. Host the sample data on NCI servers
The problem with a. is the coordination of the ops team and a few human-involved approval processes.
b. Use github's LBF facility to track the sample data like other source code files
The free version only has 1G bandwidth per month, which is too limited.
c. Create a seed GSKY image bundled with sample data and push this image to docker hub
The following PR took this approach. A seed GSKY image tagged v0 bundles with the sample data.
Therefore, all the subsequent image built from the seed image will also have access to the sample
data.

Default TIME not working for WMS GetMap requests

Blank images are returned if the TIME parameter is not provided in GetMap requests for the twelve Landsat 5, 7 and 8 layers (at http://gsky.nci.org.au/ows). The “default” attribute in the layer’s Dimension element in the GetCapabilities statement has the value of “current”, indicating the service should return the most recent layer data if the TIME parameter is omitted.

Large WMS or WCS query doesn't scale beyond single gRPC worker node

The load balancer https://github.com/nci/gsky/blob/master/ows.go#L224 randomly picks a worker node during the initialization of wms or wcs pipeline. Thus the pipeline sticks to the same worker node during its lifetime. In case we have large volume of intersected files returned from tile_indexer for a large requested polygon, https://github.com/nci/gsky/blob/master/processor/tile_grpc.go#L46 will become a bottleneck as we only have connection to single worker node here. Large volume of intersected files can come from either a large requested polygon or aggregation from long period of time (e.g. cloud removal using the past 3 months of data)

In order to scale beyond single worker node, we need the load balancer within tile_grpc.go before line 46.

An extreme example of large polygon request is -179,-80,180,80 which covers the entire world.

http://gsky ip address/ows/geoglam?SERVICE=WCS&service=WCS&crs=EPSG:4326&format=GeoTIFF&request=GetCoverage&height=500&width=1000&version=1.0.0&bbox=-179,-80,180,80&coverage=global:c6:monthly_frac_cover&time=2018-03-01T00:00:00.000Z

The above query results in processing 271 fractional cover files with 3 bands each. The experiment has two worker nodes. Each worker node contains identical worker code and identical number of worker processes. With some quick-n-dirty experimentation, I have the following performance benchmark data:

5 runs with the original code (i.e. baseline in seconds):
7.281
7.235
7.179
7.201
7.212

5 runs with the proposed load balancer within tile_grpc.go before line 46:
We load balance two worker nodes in round-robin fashion
3.986
3.862
3.845
3.887
3.862

We can see about 50% linear speedup, which is to be expected. The code for the experimentation contains hard-coded worker addresses and hacks for a quick proof of concept. A proper code solution is required.

Polygon intersection fails in edge cases

This issue is uncovered by @juan-guerschman. In a typical WCS usage, users usually want to study a large region of interest. If the corresponding polygon is large enough, mas_intersects() fails to fully intersect all the files in question. This causes incorrect rendering on Gsky side. The reason mas_intersects() fails is because this function first computes a segmentation of the requested polygon before computing its convex hull in order for a good intersection. The segmentation can essentially be viewed as an interpolation method to smooth out the original polygon to facility convex hull computation. Thus the smoothness is a function of number of segments. By default it segments polygon with precision up to (xmax - xmin) / 2 (i.e. two segments). This works for small polygon but doesn't have enough precision for large polygon. Therefore we need number of segments configurable on a per layer basis.

For example, the following request gives incorrect rendering:
http://gsky ip/ows/?service=WCS&crs=EPSG:4326&format=GeoTIFF&request=GetCoverage&height=500&width=1000&version=1.0.0&bbox=-10,-50,180,50&coverage=global:c6:monthly_frac_cover&time=2018-03-01T00:00:00.000Z

The correct rendering should be:

size limit for WCS too low

Typical request for WCS from the geoglam instance is something like:

"I need a subset of the data for my study area in the native resolution (500 meters)"
The "study area" can be as large as a whole continent, or a whole country.

Current limit for WCS requests (~4MB) is too limiting and I doubt it will be of any practical use with the exception of very small subsets.

As an example, the whole of Australia will be an output of height=7451 & width=9580 (~70MB)

WMS GetLegendGraphic requests fail

The WMS GetCapabilities statement provides a LegendURL for all layers (at http://gsky.nci.org.au/ows), although GetLegendGraphic requests fail with a 500 status response for all layers except the two DEA Intertidal Extents layers. Error message is:

open : no such file or directory

ows trailing slash problem

If a user requests http:///ows, golang http server will automatically send a 302 redirect to the client side to redirect to http:///ows/. Note the trailing slash. This will cause incompatibility in URLs configured for current terriajs servers. This problem is also documented here: https://golang.org/pkg/net/http/ (type ServeMux section)

gRPC worker issues

There have been several issues/bugs in the gRPC workers since last major update.

The builtin processes do not outperform external processes over unix domain socket connections. This has been verified empirically. For WMS/WCS, a request of 2000x2000 image on average takes about 13 seconds using builtin processes while 12 seconds using external processes. For WPS, a request of 162 dataset files on average takes about 35 seconds using builtin processes while 14 seconds using external processes.
https://github.com/nci/gsky/blob/master/worker/gdalprocess/drill.go#L275 Surprisingly, gdal can give negative for offset x and y values. We will need to reset them to zero.
Apparently the load balancing for gRPC workers in tile_grpc.go and drill_grpc.go is done in a round robin fashion. Round robin can potentially have issues if each http request has very small number of gRPC calls. By round robin, all of the calls will go to the first few gRPC workers. To fix this, we need to use a random integer as the starting point (i.e. index of the first worker) then round robin starts from there.

Improving polygon drill grpc performance

Currently, drill_grpc.go uses default grpc buffer size which 4MB and the concurrency limit over worker node pool is fixed to 16. The combination of these two is a performance bottleneck as drilling over long time series of high dimensional data is a very compute intensive task.

WMS GetMap requests restricted to 256x256 pixels

For the four Low Tide and High Tide layers, and the twelve Landsat 5, 7 and 8 layers (at http://gsky.nci.org.au/ows), GetMap requests are only successful when WIDTH and HEIGHT values are set to 256. Other values result in highly distorted images.

Following is an example of a successful GetMap request for the "DEA Low Tide Composite 25m v2.0 true colour" layer, WIDTH and HEIGHT set to 256:

http://gsky.nci.org.au/ows?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&LAYERS=hltc:low:tc&STYLES=&WIDTH=256&HEIGHT=256&FORMAT=image/png&CRS=EPSG:4326&DPI=120&MAP_RESOLUTION=120&FORMAT_OPTIONS=dpi:120&TRANSPARENT=TRUE&BBOX=-22.71167821964186828,150.01979382711141398,-22.26083150676329581,150.47064053998997224

Same request, with WIDTH and HEIGHT set to 500, image is corrupted:

http://gsky.nci.org.au/ows?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetMap&LAYERS=hltc:low:tc&STYLES=&WIDTH=500&HEIGHT=500&FORMAT=image/png&CRS=EPSG:4326&DPI=120&MAP_RESOLUTION=120&FORMAT_OPTIONS=dpi:120&TRANSPARENT=TRUE&BBOX=-22.71167821964186828,150.01979382711141398,-22.26083150676329581,150.47064053998997224

Note that the two Intertidal Extents Model layers do not have this problem.

WMS GetFeatureInfo requests fail

GetFeatureInfo requests fail for all layers (at http://gsky.nci.org.au/ows), returning a 502 status response with error message:

The proxy server received an invalid response from an upstream server

Following is an example GetFeatureInfo request that fails, generated from QGIS:

http://gsky.nci.org.au/ows?SERVICE=WMS&VERSION=1.3.0&REQUEST=GetFeatureInfo&BBOX=-38.82254232573848185,145.65095678205847207,-38.07574253210549386,146.43161651595693229&CRS=EPSG:4326&WIDTH=830&HEIGHT=794&LAYERS=item:stddev&STYLES=&FORMAT=image/png&QUERY_LAYERS=item:stddev&INFO_FORMAT=text/html&I=492&J=486&FEATURE_COUNT=10

GSKY documentation updates

GSKY documentation needs updates to reflect the current state of development. The updates include

README.md,
ows.go (comments on top of the file)
config_json.md
static/index.html

provide option for compressed geotiffs or NetCDFs in form WCS requests

Saves space and internet transfer, maybe even make default?

Towards streaming tile processing model

Currently Gsky processes an HTTP request (i.e. WMS and WCS) by storing all the tiles from gRPC calls first and then proceed to downstream of the pipeline such as tile_merger.go. This is a batch processing algorithm whose memory requirements will not scale beyond the physical server memory.

Observing that we can first group the tile requests by its corresponding region (i.e. polygon coordinates) and then divide the inputs into shards whose keys are regions. We then send the shards of tile requests to the gRPC workers.Once we obtain all the gPRC output results for a particular shard, this shard will be sent asynchronously to the merger algorithm and then become ready for GC.

This algorithm is a streaming tile processing model in contrast to the current batch processing model. This streaming processing model has the following advantages:

The streaming processing model that allows us to process the volume of data beyond the size of physical server memory. We also allow processing shards concurrently so that the theoretical performance of our streaming processing model is at least no worse batch processing model. In practice, we often observe better performance with streaming processing model for two reasons: a) the concurrency among polygon shards b) interleave merger computation with gRPC IO.
The concurrency of shards is controled by PolygonShardConcLimit. A typical range of value between 5 to 10 scales well for both small and large requests. By varying this shard concurrency value, we can trade off space and time.

I also set up the following experiment benchmark the extremeness of the streaming processing against the batch processing baseline.
There are three gRPC servers with identical hardware for 16G memory and 8 cpus. The Gsky ows server has 8G memory and 4 cpus. The baseline codebase is #91, which has the first iteration of performance improvements.

The WCS query has latlon of -179,-80,180,80 which covers the entire world. We vary height and width to simulate increasing demands of memory and cpu loads. One example query used in the experiment is as follows:
http://<gsky server>/ows/geoglam?SERVICE=WCS&service=WCS&crs=EPSG:4326&format=GeoTIFF&request=GetCoverage&height=6000&width=6000&version=1.0.0&bbox=-179,-80,180,80&coverage=global:c6:monthly_frac_cover&time=2018-03-01T00:00:00.000Z

Experimental results:

Baseline (batch tile processing model):
height x width   response time (seconds) 
500x1000         3.187s
2000x2000        6.181s
4000x4000        OOM
6000x6000        OOM
8000x8000        OOM
10000x10000      OOM

Streaming tile processing model:
height x width   response time (seconds) 
500x1000         2.814s
2000x2000        4.841s
4000x4000        15.773s
6000x6000        33.824s
8000x8000        57.941s
10000x10000      gRPC timed out

The experimental results clearly demonstrate the strength of the streaming tile processing model given an already strong baseline.

Invalid CRS defined in WMS GetCapabilities statement

The CRS code EPSG:WGS84(DD) specified by the GetCapabilities document in element /WMS_Capabilities/Capability/Layer/CRS is not supported by GetMap requests. The following error is returned in the response with 400 status:

…should contain a valid ISO 'crs/srs' parameter

This CRS code appears to have been adopted from GeoServer and is a throwback to an early version of WCS. The code does not conform to the OGC WMS 1.3.0 spec for the EPSG namespace, which states in section 6.7.3.3:

“An “EPSG” CRS label comprises the “EPSG” prefix, the colon, and a numeric code.”

raster scaling issue

This bug is uncovered by @juan-guerschman.
Apparently raster_scaler.go doesn't clip any negative values. This creates problems for converting into uint8 and therefore renders colour images incorrectly.

Improving Gsky crawler

There are two potential improvements in Gsky crawler:

Currnetly the crawler calls gdalinfo for each subdataset in the data file in sequence. This is an IO bottleneck for a data file with lots of subdatasets. We may use goroutine to concurrently call gdalinfo on these subdatasets.
Currently if there is an error in calling gdalinfo on a subdataset in a data file, the error message is suppressed. This is not good for troubleshooting the crawler results. We may output the error messages to stderr so that we can log the errors while not conflicting with stdout for crawling outputs.

Update ServiceProvider details

The ServiceProvider details in https://github.com/nci/gsky/blob/master/templates/WPS_GetCapabilities.tpl need to be updated.

Legend graphics should stay out of gsky code base

Apparently the legend graphics are stored under static/legend folders. But these legends are part of the published datasets and should not be considered part of GSKY code base. Therefore, it will be data vendor's responsibility to maintain these legends and probably GSKY admin's responsibility to maintain a location (i.e. a directory) on a server that stores these legends during server deployment process.

Forking small packages under nci

There are a few small utility packages used by gsky. These repos do not have large number of maintainers compared to big github projects. If the repos of these utility packages are lost, our build process will be interrupted. In a sense of code availability/security, therefore, we should fork them under nci. A good example is https://github.com/nci/go.procmeminfo

wps_exception.tpl not found

Currently WPS doesn't have wps_exception.tpl template file. Thus https://github.com/nci/gsky/blob/master/ows.go#L510 will certainly fail. In fact, we don't need the exception template but directly write any error messages to the client side.

GetCapabilities request VERSION parameter should be optional

GetCapabilities requests require the VERSION parameter to successfully return a GetCapabilities document. VERSION parameter is optional according to following excerpt from section 6.2.4 of the OGC WMS 1.3.0 specification:

In response to a GetCapabilities request (for which the VERSION parameter is optional)
that does not specify a version number, the server shall respond with the highest version it supports.