Geospatial, compressed, range-readable, LAZ-compatible point cloud format.

Home Page: https://copc.io

License: MIT License

copcio.github.io's Introduction

Cloud Optimized Point Cloud Specification – 1.0

Version
Introduction
Notation
Implementation
Differences from EPT
Example Data
Reader Implementation Notes
Credits
Pronunciation
Discussion
Structural Changes to Draft Specification
COPC Software Implementations
Validation

Version

This document defines Cloud Optimized Point Cloud (COPC) version 1.0.

This document is available as a PDF at copc-specification-1.0.pdf.

Introduction

A COPC file is a LAZ 1.4 file that stores point data organized in a clustered octree. It contains a VLR that describes the octree organization of the data, which are stored in LAZ 1.4 chunks.

Data organization of COPC is modeled after the EPT data format, but COPC clusters the storage of the octree as variably-chunked LAZ data in a single file. This allows the data to be consumed sequentially by any reader that can handle variably-chunked LAZ 1.4 (LASzip, for example), or as a spatial subset by readers that interpret the COPC hierarchy. More information about the differences between EPT data and COPC can be found below.

Notation

Some of the file format is described using C-language fixed width integer types. Groups of entities are denoted with a C-language struct, though all data is packed on byte boundaries and encoded as little-endian values, which may not be the case for a C program that uses the same notation.
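As a rough illustration of what "packed on byte boundaries" and "little-endian" mean in practice, the sketch below uses Python's standard `struct` module. The two-field record is hypothetical and is not part of COPC; it only shows how a packed layout can differ in size from the layout a compiler might choose.

```python
import struct

# Hypothetical record (NOT a COPC structure): uint8_t followed by uint64_t.
PACKED = struct.Struct("<BQ")  # "<" = little-endian, no padding between fields
NATIVE = struct.Struct("@BQ")  # "@" = native byte order and native alignment

# The packed layout used by this document is always 1 + 8 = 9 bytes.
print(PACKED.size)

# The native layout is platform dependent and is often larger due to padding.
print(NATIVE.size)

# Little-endian stores the least significant byte first.
raw = struct.pack("<I", 1)
print(raw.hex())
```

A C program reading COPC structures therefore cannot rely on `sizeof` matching the on-disk size unless it disables padding or decodes field by field.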

Implementation

Three key aspects distinguish an organized COPC LAZ file from an unorganized LAZ 1.4 file:

It MUST contain ONLY LAS PDRFs 6, 7, or 8 formatted data
It MUST contain a COPC info VLR
It MUST contain a COPC hierarchy VLR

LAS PDRFs 6, 7, or 8

COPC files MUST contain data with ONLY ASPRS LAS Point Data Record Format 6, 7, or 8. See the ASPRS LAS specification for details.

`info` VLR

| User ID | Record ID |
| ------- | --------- |
| `copc`  | `1`       |

The info VLR MUST exist.

The info VLR MUST be the first VLR in the file (must begin at offset 375 from the beginning of the file).

The info VLR is 160 bytes, described by the following structure. All `reserved` elements MUST be set to 0.

struct CopcInfo
{

  // Actual (unscaled) X coordinate of center of octree
  double center_x;

  // Actual (unscaled) Y coordinate of center of octree
  double center_y;

  // Actual (unscaled) Z coordinate of center of octree
  double center_z;

  // Perpendicular distance from the center to any side of the root node.
  double halfsize;

  // Space between points at the root node.
  // This value is halved at each octree level
  double spacing;

  // File offset to the first hierarchy page
  uint64_t root_hier_offset;

  // Size of the first hierarchy page in bytes
  uint64_t root_hier_size;

  // Minimum of GPSTime
  double gpstime_minimum;

  // Maximum of GPSTime
  double gpstime_maximum;

  // Must be 0
  uint64_t reserved[11];
};
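The structure above can be decoded with a single format string. The following is a minimal sketch, assuming the 160 bytes of VLR payload have already been read into memory; the names `COPC_INFO` and `parse_copc_info` are illustrative and do not come from any COPC library.

```python
import struct
from typing import NamedTuple

# 5 doubles, 2 uint64, 2 doubles, 11 reserved uint64: 20 fields * 8 bytes = 160.
COPC_INFO = struct.Struct("<5d2Q2d11Q")


class CopcInfo(NamedTuple):
    center_x: float
    center_y: float
    center_z: float
    halfsize: float
    spacing: float
    root_hier_offset: int
    root_hier_size: int
    gpstime_minimum: float
    gpstime_maximum: float


def parse_copc_info(data: bytes) -> CopcInfo:
    """Decode the 160-byte payload of the COPC info VLR."""
    if len(data) != COPC_INFO.size:
        raise ValueError(f"expected {COPC_INFO.size} bytes, got {len(data)}")
    fields = COPC_INFO.unpack(data)
    # The specification requires every reserved entry to be 0.
    if any(fields[9:]):
        raise ValueError("reserved entries must be 0")
    return CopcInfo(*fields[:9])
```

Rejecting non-zero reserved entries is a strictness choice made for this sketch; a lenient reader might prefer to ignore them.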

`hierarchy` VLR

| User ID | Record ID |
| ------- | --------- |
| `copc`  | `1000`    |

The hierarchy VLR MUST exist.

Like EPT, COPC stores hierarchy information to allow a reader to locate points that are in a particular octree node. Also like EPT, the hierarchy MAY be arranged in a tree of pages, but SHALL always consist of at least ONE hierarchy page.

The VLR data consists of one or more hierarchy pages. Each hierarchy data page is written as follows:

The VoxelKey corresponds to the naming of EPT data files.

struct VoxelKey
{
  // A value < 0 indicates an invalid VoxelKey
  int32_t level;
  int32_t x;
  int32_t y;
  int32_t z;
};
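This document does not spell out how a key maps to a region of space. The sketch below assumes the usual EPT convention, in which the root cube is split in half along each axis at every level and `x`, `y`, `z` count cells from the minimum corner. Treat the formulas as an interpretation under that assumption; the helper names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float, float]


class VoxelKey(NamedTuple):
    level: int
    x: int
    y: int
    z: int


def node_bounds(key: VoxelKey, center: Point, halfsize: float) -> Tuple[Point, Point]:
    """Return (min corner, max corner) of the cube addressed by `key`."""
    side = 2.0 * halfsize / (1 << key.level)  # edge length at this level
    mins = tuple(c - halfsize + i * side for c, i in zip(center, key[1:]))
    maxs = tuple(m + side for m in mins)
    return mins, maxs


def child_keys(key: VoxelKey) -> List[VoxelKey]:
    """The eight potential children of a node (assumed EPT convention)."""
    return [
        VoxelKey(key.level + 1, 2 * key.x + dx, 2 * key.y + dy, 2 * key.z + dz)
        for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
    ]


def spacing_at_level(root_spacing: float, level: int) -> float:
    """CopcInfo::spacing is halved at each octree level."""
    return root_spacing / (1 << level)
```

With `center` and `halfsize` taken from the info VLR, a reader can use `node_bounds` to decide which nodes intersect a query region before fetching any point data.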

An entry corresponds to a single key/value pair in an EPT hierarchy, but contains additional information to allow direct access and decoding of the corresponding point data.

struct Entry
{
  // EPT key of the data to which this entry corresponds
  VoxelKey key;

  // Absolute offset to the data chunk if the pointCount > 0.
  // Absolute offset to a child hierarchy page if the pointCount is -1.
  // 0 if the pointCount is 0.
  uint64_t offset;

  // Size of the data chunk in bytes (compressed size) if the pointCount > 0.
  // Size of the hierarchy page if the pointCount is -1.
  // 0 if the pointCount is 0.
  int32_t byteSize;

  // If > 0, represents the number of points in the data chunk.
  // If -1, indicates the information for this octree node is found in another hierarchy page.
  // If 0, no point data exists for this key, though may exist for child entries.
  int32_t pointCount;
};

The entries of a hierarchy page are consecutive. The number of entries in a page can be determined by taking the size of the page (contained in the parent page as Entry::byteSize or in the COPC info VLR as CopcInfo::root_hier_size) and dividing by the size of an Entry (32 bytes).

struct Page
{
    Entry entries[page_size / 32];
};
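A hierarchy page can be decoded as a flat run of 32-byte entries. The sketch below assumes the page bytes are already in memory; `parse_page` and `entry_kind` are illustrative names, and the three-way classification follows the `pointCount` rules given in the Entry comments.

```python
import struct
from typing import List, NamedTuple

# VoxelKey (4 x int32) + offset (uint64) + byteSize (int32) + pointCount (int32).
ENTRY = struct.Struct("<4iQii")  # 16 + 8 + 4 + 4 = 32 bytes


class Entry(NamedTuple):
    level: int
    x: int
    y: int
    z: int
    offset: int
    byte_size: int
    point_count: int


def parse_page(page: bytes) -> List[Entry]:
    """Split a hierarchy page into its entries (page size / 32 of them)."""
    if len(page) % ENTRY.size != 0:
        raise ValueError("page size is not a multiple of 32 bytes")
    return [Entry(*fields) for fields in ENTRY.iter_unpack(page)]


def entry_kind(entry: Entry) -> str:
    """Classify an entry by its pointCount field."""
    if entry.point_count == -1:
        return "child page"  # offset/byte_size locate another hierarchy page
    if entry.point_count == 0:
        return "empty node"  # no chunk here, children may still hold points
    if entry.point_count > 0:
        return "data chunk"  # offset/byte_size locate a compressed LAZ chunk
    raise ValueError(f"invalid pointCount {entry.point_count}")
```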

Differences from EPT

COPC has no ept.json. The information from ept.json is stored in the LAS file header and LAS VLRs.
COPC currently provides no support for ept-sources.json. File metadata support may be added in the future.
COPC only supports the LAZ point format and does not support binary point arrangements.
COPC chunks store only point data as LAZ. EPT, when stored as LAZ, uses complete LAZ files including the LAS header and perhaps VLRs.

Example Data

The venerable Autzen Stadium file commonly used in PDAL and other open source testing scenarios is available as an 80 MB COPC file at https://github.com/PDAL/data/blob/master/autzen/autzen-classified.copc.laz

View it in your browser at https://viewer.copc.io/?copc=https://s3.amazonaws.com/hobu-lidar/autzen-classified.copc.laz

SoFi Stadium is available as a 2.3 GB COPC file at https://hobu-lidar.s3.amazonaws.com/sofi.copc.laz

View it in your browser at https://viewer.copc.io/?copc=https://s3.amazonaws.com/hobu-lidar/sofi.copc.laz

The data are courtesy of US Army Corps of Engineers Remote Sensing & GIS Center of Expertise / National Center for Airborne Laser Mapping.

Millsite is available as a 1.9 GB COPC file at https://s3.amazonaws.com/data.entwine.io/millsite.copc.laz

View it in your browser at https://viewer.copc.io/?copc=https://s3.amazonaws.com/data.entwine.io/millsite.copc.laz

The data are from the USGS 3DEP Millsite Reservoir Collection.

Reader Implementation Notes

COPC is designed so that a reader needs to know little about the structure of a LAZ file. By reading the first 589 bytes (375 for the header + 54 for the COPC VLR header + 160 for the COPC VLR), the software can verify that the file is a COPC file and determine the point data record format and point data record length, both of which are necessary to create a LAZ decompressor.

Readers should:

verify that the first four bytes of the file contain the ASCII characters "LASF".
verify that the 4 bytes starting at offset 377 contain the characters copc.
verify that the bytes at offsets 393 and 394 contain the values 1 and 0, respectively (this is the COPC version number, 1).
determine the point data record format by reading the byte at offset 104, masking off the two high bits, which are used by LAZ to indicate compression, and can be ignored.
determine the point data record length by reading two bytes at offset 105.
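The checks in this list can be sketched as one function over the first 589 bytes of the file. The function name and error messages are illustrative, and rejecting formats other than 6, 7, or 8 is an extra step taken from the Implementation section, not from the list itself.

```python
import struct
from typing import Tuple


def check_copc_header(head: bytes) -> Tuple[int, int]:
    """Validate the first 589 bytes and return (PDRF, point record length)."""
    if len(head) < 589:
        raise ValueError("need at least 589 bytes (375 + 54 + 160)")
    if head[0:4] != b"LASF":
        raise ValueError("missing LASF file signature")
    if head[377:381] != b"copc":
        raise ValueError("first VLR does not have user ID 'copc'")
    if head[393] != 1 or head[394] != 0:
        raise ValueError("bytes at offsets 393 and 394 are not 1 and 0")
    # The two high bits are used by LAZ to flag compression; mask them off.
    pdrf = head[104] & 0x3F
    (record_length,) = struct.unpack_from("<H", head, 105)
    if pdrf not in (6, 7, 8):
        raise ValueError(f"PDRF {pdrf} is not allowed in COPC")
    return pdrf, record_length
```

The returned pair is what a reader needs to set up a LAZ decompressor, and the 160 bytes starting at offset 429 hold the info VLR payload.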

The octree hierarchy is arranged in pages. The COPC VLR provides information describing the location and size of the root hierarchy page. The root hierarchy page can be used to traverse to child pages. Each entry in a hierarchy page refers to a child hierarchy page, an octree node data chunk, or an empty octree node. The size and file offset of each data chunk are provided in the hierarchy entries, allowing the chunks to be read directly for decoding.
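The traversal described above might look like the sketch below. It assumes a caller-supplied `read(offset, size)` function (hypothetical; it could wrap a local file or an HTTP range request) and starts from the root page location given in the info VLR.

```python
import struct
from typing import Callable, Dict, Tuple

ENTRY = struct.Struct("<4iQii")  # 32-byte hierarchy entry, little-endian
Key = Tuple[int, int, int, int]  # (level, x, y, z)
Node = Tuple[int, int, int]      # (offset, byteSize, pointCount)


def walk_hierarchy(
    read: Callable[[int, int], bytes], root_offset: int, root_size: int
) -> Dict[Key, Node]:
    """Load every hierarchy page reachable from the root page."""
    nodes: Dict[Key, Node] = {}
    pending = [(root_offset, root_size)]
    while pending:
        page_offset, page_size = pending.pop()
        page = read(page_offset, page_size)
        for level, x, y, z, offset, byte_size, count in ENTRY.iter_unpack(page):
            if count == -1:
                # This node's information lives in another hierarchy page.
                pending.append((offset, byte_size))
            else:
                # A data chunk (count > 0) or an empty node (count == 0).
                nodes[(level, x, y, z)] = (offset, byte_size, count)
    return nodes
```

This version eagerly loads the whole hierarchy. A streaming client would more likely fetch child pages lazily, only when a query reaches the corresponding part of the octree.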

Server Implementation Notes

For streaming of COPC files from HTTP servers to work, the server hosting COPC files needs to be configured to allow:

HTTP range requests
Cross-origin requests - if other hosts should be allowed to access the data (in web browsers)
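On the client side, a range request is an ordinary GET with a `Range` header. The sketch below only builds the request with Python's standard library and sends nothing; the URL is a placeholder, and `range_request` is an illustrative helper.

```python
from urllib.request import Request


def range_request(url: str, offset: int, size: int) -> Request:
    """Build a GET request for `size` bytes starting at `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    last = offset + size - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Placeholder URL; the first 589 bytes cover the LAS header and the info VLR.
req = range_request("https://example.com/data.copc.laz", 0, 589)
print(req.get_header("Range"))
```

A server that supports range requests normally answers with status 206 (Partial Content). A 200 response carrying the whole file suggests that ranges are not enabled.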

Credits

COPC was designed in July–November 2021 by Andrew Bell, Howard Butler, and Connor Manning of Hobu, Inc. Entwine and Entwine Point Tile were also designed and developed by Connor Manning of Hobu, Inc.

Pronunciation

There is no official pronunciation of COPC. Here are some possibilities:

cah-pick – ka pIk
co-pick – kö pIk
cop-see – kap si
cop-pick – kap pIk
see oh pee see – si o pi si

Discussion

Use Case

Cloud Optimized GeoTIFF has shown the utility and convenience of taking a dominant container format for geospatial raster data and optionally augmenting its organization to allow incremental "range-read" support over HTTP. With the mantra of "It's just a TIFF" allowing ubiquitous usage of the data content, combined with the flexibility of supporting partial reads over the internet, COG has found a sweet spot. Its reward is the ongoing rapid conversion of significant raster data holdings to COG-organized content to enable convenient cloud consumption of the data throughout the GIS industry.

What is the COG for point clouds? It would need to be similar in fit and scope to COG:

Support incremental partial reads over HTTP
Provide good compression
Allow dimension-selective reads
Provide all metadata and supporting information
Support an EPT-style octree organization for data streaming

"Just a LAZ"

LAZ (LASzip) is the ubiquitous geospatial point cloud format. It is an augmentation of ASPRS LAS that utilizes an arithmetic encoder to efficiently compress the point content. It has seen a number of revisions, but the latest supports dimension-selective access and provides all of the metadata support that normal LAS provides. Importantly, multiple software implementations (laz-rs, laz-perf, and LASzip) provide LAZ compression and decompression, and laz-perf and laz-rs include compilation to JavaScript, which is used by all JavaScript clients when consuming LAZ content.

Put EPT in LAZ

The EPT content organization supports LAZ in its current "exploded" organization. Exploded in this context means that each chunk of data at each octree level is stored as an individual LAZ file (or simple blob, or a zstd-compressed blob). One consequence of the exploded organization is that large EPT trees of data can mean collections of millions of files. In non-cloud situations, EPT's cost when moving or deleting data can be significant. As with the raster map tilesets of the late 2000s, lots of little files are a problem.

LAZ provides a feature that allows concatenation of the individual LAZ files into a single LAZ file. This is the concept of a dynamically-sized chunk table. It is a feature that Martin Isenburg envisioned for quad-tree organized data, but it could work the same for an octree.

Structural Changes to Draft Specification

Removed count from Page struct
Changed Record ID of COPC hierarchy EVLR from 1234 to 1000
Require reserved entries of the COPC VLR to have the value 0
Require the COPC VLR to be located immediately after the header at offset 375.
Increase the size of the COPC VLR data structure to 160 bytes.
Add laz_vlr_offset, laz_vlr_size, wkt_vlr_offset, wkt_vlr_size, eb_vlr_offset, eb_vlr_size to the COPC VLR, replacing 6 reserved entries.
PDRF must be 6, 7, or 8
Add extents VLR with UserID of copc and record ID of 10000.
VLR UserIDs switched from entwine to copc
Removed laz_vlr_offset, laz_vlr_size, wkt_vlr_offset, wkt_vlr_size, eb_vlr_offset, eb_vlr_size, root_hier_offset, root_hier_size from the COPC info VLR. Added 8 reserved entries.
Describe hierarchy entries for empty octree nodes.
Add back root_hier_offset and root_hier_size in COPC info VLR. Removed 2 reserved entries.
Remove extents VLR and put gpstime_minimum and gpstime_maximum in info VLR.

Validation

An online validator of COPC files is available at https://validate.copc.io. Simply drag-n-drop a (reasonably-sized) COPC file onto the page and it will validate the header information and optionally allow you to visualize it.
A C++ utility for verifying COPC metadata is available at https://github.com/hobuinc/copcverify

copcio.github.io's Issues

entwine/1000

Instead of entwine/1234 for the VLR, it might be useful to use entwine/1000, which would allow us a convenient semver opportunity should we ever need to bump things.

Chunk Table position

It seems that the chunk table position in the illustration does not match how the LAZ file is actually organized.

In the illustration it's positioned between the VLRs and the points, but in reality the chunk table is written after the points; between the VLRs and the points there are 8 bytes that contain the offset to the start of the chunk table.

Dimension statistics

Nowadays the Entwine builder adds detailed dimension statistics to its schema (see here for example), including minimum, maximum, mean, stddev, and variance. Currently this is sort of an undocumented extension intended to eventually be codified as an optional (in order to be backward-compatible) extension to the EPT specification. Would it be worth specifying a statistics VLR to capture this information? I think the "number of points by return" array in the LAS 1.4 header sets a precedent for this kind of thing.

An example might be:

struct BucketItem
{
  double value; // Maybe int64_t here - presumably this is only for integral dimensions
  uint64_t count;
};

struct CopcStatistics
{
  double minimum;
  double maximum;
  double mean;
  double stddev;
  double variance;
  uint64_t number_of_buckets; // 0 for most dimensions, but this one is nice for Classification counts
  BucketItem buckets[];
};

These statistics would then be stored in the order that the dimensions appear in the point data record format header, followed by statistics for extra-bytes dimensions in the order that they appear. Is there enough demand for this type of information to put it in the spec? Enough to require it?

One example of their usage I found is wonder-sk/point-cloud-experiments#60: "QGIS implementation is able to read those stats and use them to set up renderers (min/max values are especially useful to correctly set ranges for rendering)".

If anyone has other concrete use-cases where such data would be used, that would be useful.

Update Structural Changes to Draft Specification

I think this is outdated:

Add laz_vlr_offset, laz_vlr_size, wkt_vlr_offset, wkt_vlr_size, eb_vlr_offset, eb_vlr_size to the COPC VLR, replacing 6 reserved entries.

Require VLR order

It would be nice if the "important" VLRs were all at the beginning of the file to reduce round-trips when first reading the data over a network connection. Suggest LASZIP VLR/COPC VLR/EB VLR.

Clarify VLR/EVLR representation

It seems to me that the COPC spec currently talks about VLRs but actually conflates VLRs and EVLRs, which may be problematic since the LAS spec clearly distinguishes between those.

Obviously, the info record cannot be an EVLR due to its location requirement, but it might be good to know whether hierarchy and extents must be represented as VLRs, EVLRs, or can be either.

(For anyone wondering, the Autzen sample currently has info and extents as "classic" VLRs, while its hierarchy is an EVLR.)

Remaining TODOs

We (rockrobotic) would love to help on getting the spec and work finalized. Would it be helpful to use the issue queue and assignees or just a task list here?

Require reserved/padding bytes in COPC VLR to be 0

span

There's a chance here to correct the choice of EPT by not carrying over span as the concept of resolution/density/spacing.

Entwine/EPT chose this term, which represents the grid width of the voxelization of the data nodes, because of its strict voxelized resolution guarantees at each node depth. However this isn't the only sampling method possible, and Poisson, randomized, FIFO, and other non-gridded distributions may be possible. Also, EPT had the specified limitation that it must be voxelized via a power of 2, which doesn't need to be true for COPC.

In general, span is a very confusing metric that needs a lot of explanation, even though what it represents is actually simple. See for comparison the definition of the Potree equivalent to this field by @m-schuetz:

spacing: (Number) Space between points at the root node. This value is halved at each octree level.

While this definition is directly convertible to a span, its meaning is much more concrete and easy to understand.

I would propose replacing the integral span field in the COPC header with a double-precision field representing the spacing or distance between points at the root node, to more clearly represent the density of the data at each depth without the indirection of span.

Possible terms include spacing, distance, resolution, and the description should indicate that this is not necessarily a 100% deterministic value: in a given node some points may end up being closer together or farther apart than this value depending on the sampling method.

Add a note about zero-point hierarchy nodes

It is possible to have a node level with zero points, but there are points below it. 0-sized chunks are not valid in LAZ. A note describing how to advertise this to hierarchy consumers is required.

Add WKT offset to COPC VLR

It would be nice if the WKT VLR could be located easily without searching. Adding an offset in the COPC VLR to the actual WKT data would support this.

How to validate a COPC file?

Hi! I'm working on converting las files to copc.laz format and have successfully generated the output files. However, I am looking for ways to validate the generated COPC files.

I came across this link: https://github.com/copcio/copcio.github.io?tab=readme-ov-file#validation. But the https://validate.copc.io/ option doesn't show any output, and I am not sure how to use the C++ version. So, are there any alternative ways to validate COPC files (maybe using a Python API or the pdal CLI, similar to pdal info)?

Thanks.

Lock Version 1.0.0

Are we at a point where we can lock the spec for v1.0.0?

All issues are resolved here (sans new example file). If we cut the 1.0.0 version, then other libs would be able to fully implement the spec and ensure future changes.

Sum error in implementation notes

In the implementation note you wrote

COPC is designed so that a reader needs to know little about the structure of a LAZ file. By reading the first 549 bytes (375 for the header + 54 for the COPC VLR header + 160 for the COPC VLR), the software can verify that the file is a COPC file and determine the point data record format and point data record length, both of which are necessary to create a LAZ decompressor.

375+54+160 = 589 not 549

Mime/Content Type?

Are there any thoughts/conventions for MIME/Content-Type values for COPC files?

LAS is registered as application/vnd.las
LAZ is registered as application/vnd.laszip

Maybe application/vnd.laszip; profile=copc ?

Related issues for COGTIFF: cogeotiff/cog-spec#13 opengeospatial/CloudOptimizedGeoTIFF#1

One week freeze to declare 1.0.0 release

This is the one-week call to bring up any significant issues with COPC before we declare it released. Please post any items on this ticket, and we can create new tickets should the need arise. With no significant findings or items, we will declare the release final on 29 NOV 2021.

Dimension type VLR?

@m-schuetz's post #19 (comment) about desiring dimension type information brings up the question of whether it might be more generally useful.

My thought on the topic is COPC files are LAS/LAZ, and the dimension types and mappings of LAS files are generally known. Extra byte dimensions are rarer, and the type mapping of those dimensions is indeed missing.

Can any reader client developers make a stronger case for this? @connormanning @CCInc ?

Redundant fields in COPC header

The COPC offsets below are redundant info which is already contained in the LAS VLRs directly:

  uint64_t laz_vlr_offset;      // File offset of the *data* of the LAZ VLR
  uint64_t laz_vlr_size;        // Size of the *data* of the LAZ VLR.
  uint64_t wkt_vlr_offset;      // File offset of the *data* of the WKT VLR if it exists, 0 otherwise
  uint64_t wkt_vlr_size;        // Size of the *data* of the WKT VLR if it exists, 0 otherwise
  uint64_t eb_vlr_offset;       // File offset of the *data* of the extra bytes VLR if it exists, 0 otherwise
  uint64_t eb_vlr_size;         // Size of the *data* of the extra bytes VLR if it exists, 0 otherwise

What is the use-case for this data being duplicated? It seems error-prone to have the same data in multiple spots, and also seems to go against the push behind COPC - to live as a spatially organized structure inside a format that we don't have to invent. Here we are reinventing (a subset of) VLR header information in our own header format.

As for implementing the spec, if you are using a COPC or LAS library like the ones I linked above, you have the VLR information packaged for you for free. And if not, you are going to be manually parsing out the LAS header, the bodies of the VLRs (even if not their headers due to this duplication), walking and parsing the hierarchy, and others. The few lines of VLR header parsing are comparatively trivial.

Both current implementations, copc-lib and copc.js, read the VLR headers on initialization and then ignore these COPC header values in favor of the VLR headers.

And as for reading the data, the magnitudes here are on the order of a few hundred bytes and would generally only be parsed once per reader instance, so I'm skeptical of a big performance impact.

So is there a concrete use-case for this data to exist? I would consider removing it to keep the format simple and less error-prone in the absence of a concrete need.

Standardize file name structure

What do we think about standardizing the file name as [filename].copc.laz to represent this file at a high level/easily?

We know there will be a VLR that defines the type as a COPC file, but that doesn't help some processors quickly know how to act (https://github.com/potree/potree/blob/develop/src/Potree.js#L139-L211)

It would be possible to build in the handling that does the check but as I understand it, it would require us to read the header for the VLR and then parse that (not ideal) to do the switch.

questions on spec

in CopcData, why int64_t span and not uint64_t ?
in Entry, why int32_t byteSize and not uint32_t ? And given that it might be size of the child hierarchy page, and that Page::count is 64 bit, shouldn't that be a 64 bit integer ?
in Entry, why int32_t pointCount, and not uint32_t ?
in Entry, int32_t pointCount: why not have it a uint32_t and use 0 as a special value instead of -1 ?

COPC example file point format doesn't fit the specs

The current test file is in point format 3 which doesn't fit the specs.
See PDAL/data#3 for a proposed fix.

Example Dataset

For EPT, it is pretty easy to understand the structure and the parsing, but this will be a little more verbose, should we add a sample dataset or a link to a sample dataset?

Remove 'extents' VLR as a required COPC entity

extents by itself isn't particularly useful in practice. Knowing the min/max only helps you for a few dimensions, and if you're a renderer, you want much more than just that.

I think we should drop extents and put forward a definition for a copc stats VLR that is more fully featured but OPTIONAL. Importantly, the stats VLR could be something that could be added as an EVLR long after the file is created instead of during the COPC file creation process.

Principles of the stats VLR are pretty much as @connormanning suggested in #19, but we should do something specified as JSON, with a corresponding JSON Schema that allows implementations freedom to augment with auxiliary information if they desire.

Clarify that a COPC file is read only

If a COPC file is modified, i.e. read and streamed into an output file with some modifications (to retain points of interest, for example) or simply decompressed, all the COPC VLRs and EVLRs should be removed; otherwise the file becomes an invalid, corrupted COPC file.

It may seem obvious, but existing tools are not COPC-aware and are supposed to preserve VLRs and EVLRs when writing a file. In the current state, much software is likely to produce corrupted COPC files when processing a COPC file. I'm thinking, for example, of las2las or laszip from LAStools, which preserve all VLRs and EVLRs and are not COPC-aware yet. Many more existing tools are likely affected in the same way. Consequently, one can easily generate invalid COPC files with an irrelevant EPT hierarchy.

I suggest explicitly mentioning somewhere in the spec that a COPC file is read-only and that any writer that is not COPC-aware should always strip the COPC VLRs and EVLRs when the input is a COPC file. That won't solve the problem for existing tools, but it will at least state the requirement explicitly.