Cloud Optimized Point Cloud Specification – 1.0

©2021, Hobu, Inc. All rights reserved.

Table of contents

  1. Version
  2. Introduction
  3. Notation
  4. Implementation
    1. LAS PDRF 6, 7, or 8
    2. info VLR
    3. hierarchy VLR
  5. Differences from EPT
  6. Example Data
  7. Reader Implementation Notes
  8. Credits
  9. Pronunciation
  10. Discussion
  11. Structural Changes to Draft Specification
  12. COPC Software Implementations
  13. Validation

Version

This document defines Cloud Optimized Point Cloud (COPC) version 1.0.

This document is available as a PDF at copc-specification-1.0.pdf.

Introduction

A COPC file is a LAZ 1.4 file that stores point data organized in a clustered octree. It contains a VLR that describes the octree organization of data that are stored in LAZ 1.4 chunks.

The info VLR and the LAZ chunk table allow COPC readers to select and seek through the file.

Data organization of COPC is modeled after the EPT data format, but COPC clusters the storage of the octree as variably-chunked LAZ data in a single file. This allows the data to be consumed sequentially by any reader that can handle variably-chunked LAZ 1.4 (LASzip, for example), or as a spatial subset by readers that interpret the COPC hierarchy. More information about the differences between EPT data and COPC can be found below.

Notation

Some of the file format is described using C-language fixed width integer types. Groups of entities are denoted with a C-language struct, though all data is packed on byte boundaries and encoded as little-endian values, which may not be the case for a C program that uses the same notation.
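Because the fields are packed and little-endian, a portable reader usually decodes them byte by byte instead of casting a buffer to a struct. The following is a minimal sketch of such helpers. The function names are illustrative and not part of the specification, and the sketch assumes the host `double` is a 64-bit IEEE 754 value whose byte order matches that of the host's integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper names; the specification defines no API. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Decode a little-endian 16-bit unsigned integer. */
static inline uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode a little-endian 32-bit unsigned integer. */
static inline uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode a little-endian 64-bit unsigned integer. */
static inline uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

/* Decode a little-endian IEEE 754 double by reinterpreting its bits. */
static inline double read_f64_le(const uint8_t *p)
{
    const uint64_t bits = read_u64_le(p);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Decoding this way works regardless of the host's byte order and of any padding a compiler might insert into a struct.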

Implementation

Three requirements distinguish an organized COPC LAZ file from an unorganized LAZ 1.4 file:

  • It MUST contain ONLY LAS PDRFs 6, 7, or 8 formatted data
  • It MUST contain a COPC info VLR
  • It MUST contain a COPC hierarchy VLR

LAS PDRFs 6, 7, or 8

COPC files MUST contain data with ONLY ASPRS LAS Point Data Record Format 6, 7, or 8. See the ASPRS LAS specification for details.

info VLR

| User ID | Record ID |
| ------- | --------- |
| copc    | 1         |

The info VLR MUST exist.

The info VLR MUST be the first VLR in the file (must begin at offset 375 from the beginning of the file).

The info VLR is 160 bytes described by the following structure. reserved elements MUST be set to 0.

struct CopcInfo
{

  // Actual (unscaled) X coordinate of center of octree
  double center_x;

  // Actual (unscaled) Y coordinate of center of octree
  double center_y;

  // Actual (unscaled) Z coordinate of center of octree
  double center_z;

  // Perpendicular distance from the center to any side of the root node.
  double halfsize;

  // Space between points at the root node.
  // This value is halved at each octree level
  double spacing;

  // File offset to the first hierarchy page
  uint64_t root_hier_offset;

  // Size of the first hierarchy page in bytes
  uint64_t root_hier_size;

  // Minimum of GPSTime
  double gpstime_minimum;

  // Maximum of GPSTime
  double gpstime_maximum;

  // Must be 0
  uint64_t reserved[11];
};
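As an illustration of the layout above, the following sketch decodes the 160-byte info VLR payload field by field. The function name `copc_info_parse` and its return convention are assumptions made for this example. Rejecting a payload with non-zero reserved entries is a strict choice made here, not something the specification requires of readers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COPC_INFO_SIZE 160

struct CopcInfo {
    double center_x, center_y, center_z;
    double halfsize;
    double spacing;
    uint64_t root_hier_offset;
    uint64_t root_hier_size;
    double gpstime_minimum;
    double gpstime_maximum;
    uint64_t reserved[11];
};

static inline uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

/* Assumes the host double is IEEE 754 binary64. */
static inline double read_f64_le(const uint8_t *p)
{
    const uint64_t bits = read_u64_le(p);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Decode the 160-byte payload at buf (hypothetical helper).
   Returns 0 on success, -1 if any reserved entry is non-zero. */
int copc_info_parse(const uint8_t *buf, struct CopcInfo *out)
{
    assert(buf != NULL && out != NULL);
    out->center_x = read_f64_le(buf + 0);
    out->center_y = read_f64_le(buf + 8);
    out->center_z = read_f64_le(buf + 16);
    out->halfsize = read_f64_le(buf + 24);
    out->spacing = read_f64_le(buf + 32);
    out->root_hier_offset = read_u64_le(buf + 40);
    out->root_hier_size = read_u64_le(buf + 48);
    out->gpstime_minimum = read_f64_le(buf + 56);
    out->gpstime_maximum = read_f64_le(buf + 64);
    for (int i = 0; i < 11; ++i) {
        out->reserved[i] = read_u64_le(buf + 72 + 8 * i);
        if (out->reserved[i] != 0)
            return -1;
    }
    return 0;
}
```

The field offsets add up to 72 bytes of named fields plus 88 bytes of reserved entries, which matches the stated 160-byte size.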

hierarchy VLR

| User ID | Record ID |
| ------- | --------- |
| copc    | 1000      |

The hierarchy VLR MUST exist.

Like EPT, COPC stores hierarchy information to allow a reader to locate points that are in a particular octree node. Also like EPT, the hierarchy MAY be arranged in a tree of pages, but SHALL always consist of at least ONE hierarchy page.

The VLR data consists of one or more hierarchy pages. Each hierarchy data page is written as follows:

The VoxelKey corresponds to the naming of EPT data files.

struct VoxelKey
{
  // A value < 0 indicates an invalid VoxelKey
  int32_t level;
  int32_t x;
  int32_t y;
  int32_t z;
};
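To show how a key relates to space, the sketch below computes the bounding cube of a node from its VoxelKey, the octree center, and the halfsize stored in the info VLR. It assumes the EPT convention that a node at level d occupies cell (x, y, z) of a 2^d by 2^d by 2^d grid laid over the root cube. The `Box` type and the `voxel_bounds` function are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

struct VoxelKey { int32_t level, x, y, z; };

/* Hypothetical axis-aligned box type for this example. */
struct Box { double min[3]; double max[3]; };

/* Bounds of the node identified by key inside the root cube
   described by center and halfsize. */
struct Box voxel_bounds(struct VoxelKey key, const double center[3],
                        double halfsize)
{
    assert(key.level >= 0 && key.level < 31);
    /* Edge length of a node at this level: the root edge divided by 2^level. */
    const double side = 2.0 * halfsize / (double)(1u << key.level);
    const int32_t idx[3] = { key.x, key.y, key.z };
    struct Box b;
    for (int i = 0; i < 3; ++i) {
        b.min[i] = (center[i] - halfsize) + side * idx[i];
        b.max[i] = b.min[i] + side;
    }
    return b;
}
```

For example, with a center of (0, 0, 0) and a halfsize of 8, the key 1-1-0-1 covers x from 0 to 8, y from -8 to 0, and z from 0 to 8.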

An entry corresponds to a single key/value pair in an EPT hierarchy, but contains additional information to allow direct access and decoding of the corresponding point data.

struct Entry
{
  // EPT key of the data to which this entry corresponds
  VoxelKey key;

  // Absolute offset to the data chunk if the pointCount > 0.
  // Absolute offset to a child hierarchy page if the pointCount is -1.
  // 0 if the pointCount is 0.
  uint64_t offset;

  // Size of the data chunk in bytes (compressed size) if the pointCount > 0.
  // Size of the hierarchy page if the pointCount is -1.
  // 0 if the pointCount is 0.
  int32_t byteSize;

  // If > 0, represents the number of points in the data chunk.
  // If -1, indicates the information for this octree node is found in another hierarchy page.
  // If 0, no point data exists for this key, though may exist for child entries.
  int32_t pointCount;
};
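The three cases encoded by pointCount can be made explicit when decoding an entry. The following sketch decodes one 32-byte entry from a byte buffer and classifies it. The names `entry_decode`, `entry_kind`, and the `EntryKind` enumeration are illustrative only. Treating a pointCount below -1 as invalid is an assumption of this example, since the specification does not define such values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COPC_ENTRY_SIZE 32

struct VoxelKey { int32_t level, x, y, z; };
struct Entry {
    struct VoxelKey key;
    uint64_t offset;
    int32_t byteSize;
    int32_t pointCount;
};

/* Hypothetical classification of an entry. */
enum EntryKind { ENTRY_EMPTY, ENTRY_DATA, ENTRY_CHILD_PAGE, ENTRY_INVALID };

static inline int32_t read_i32_le(const uint8_t *p)
{
    const uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    int32_t v;
    memcpy(&v, &u, sizeof v);
    return v;
}

static inline uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

/* Decode one packed 32-byte entry. */
struct Entry entry_decode(const uint8_t *p)
{
    assert(p != NULL);
    struct Entry e;
    e.key.level = read_i32_le(p + 0);
    e.key.x = read_i32_le(p + 4);
    e.key.y = read_i32_le(p + 8);
    e.key.z = read_i32_le(p + 12);
    e.offset = read_u64_le(p + 16);
    e.byteSize = read_i32_le(p + 24);
    e.pointCount = read_i32_le(p + 28);
    return e;
}

/* Map pointCount to the case it encodes. */
enum EntryKind entry_kind(const struct Entry *e)
{
    if (e->key.level < 0 || e->pointCount < -1)
        return ENTRY_INVALID;
    if (e->pointCount == -1)
        return ENTRY_CHILD_PAGE;
    if (e->pointCount == 0)
        return ENTRY_EMPTY;
    return ENTRY_DATA;
}
```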

The entries of a hierarchy page are consecutive. The number of entries in a page can be determined by taking the size of the page (contained in the parent page as Entry::byteSize or in the COPC info VLR as CopcInfo::root_hier_size) and dividing by the size of an Entry (32 bytes).

struct Page
{
    Entry entries[page_size / 32];
};
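The entry count calculation can be sketched as follows. The function names are hypothetical. Rejecting a page whose size is zero or not a multiple of 32 is an assumption of this example about how a reader might treat malformed input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COPC_ENTRY_SIZE 32

static inline int32_t read_i32_le(const uint8_t *p)
{
    const uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    int32_t v;
    memcpy(&v, &u, sizeof v);
    return v;
}

/* Number of entries in a page of page_size bytes, or -1 if the size
   is zero or not a multiple of the 32-byte entry size. */
int64_t page_entry_count(uint64_t page_size)
{
    if (page_size == 0 || page_size % COPC_ENTRY_SIZE != 0)
        return -1;
    return (int64_t)(page_size / COPC_ENTRY_SIZE);
}

/* Sum of the points referenced directly by one page. Entries that
   point to child pages (pointCount == -1) or to empty nodes are skipped. */
int64_t page_total_points(const uint8_t *page, uint64_t page_size)
{
    assert(page != NULL);
    const int64_t n = page_entry_count(page_size);
    if (n < 0)
        return -1;
    int64_t total = 0;
    for (int64_t i = 0; i < n; ++i) {
        /* pointCount is the last 4 bytes of each 32-byte entry. */
        const int32_t count = read_i32_le(page + i * COPC_ENTRY_SIZE + 28);
        if (count > 0)
            total += count;
    }
    return total;
}
```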

Differences from EPT

  • COPC has no ept.json. The information from ept.json is stored in the LAS file header and LAS VLRs.
  • COPC currently provides no support for ept-sources.json. File metadata support may be added in the future.
  • COPC only supports the LAZ point format and does not support binary point arrangements.
  • COPC chunks store only point data as LAZ. EPT, when stored as LAZ, uses complete LAZ files including the LAS header and perhaps VLRs.

Example Data

Reader Implementation Notes

COPC is designed so that a reader needs to know little about the structure of a LAZ file. By reading the first 589 bytes (375 for the header + 54 for the COPC VLR header + 160 for the COPC VLR), the software can verify that the file is a COPC file and determine the point data record format and point data record length, both of which are necessary to create a LAZ decompressor.

Readers should:

  • verify that the first four bytes of the file contain the ASCII characters "LASF".
  • verify that the 4 bytes starting at offset 377 contain the characters copc.
  • verify that the bytes at offsets 393 and 394 contain the values 1 and 0, respectively (this is the COPC version number, 1).
  • determine the point data record format by reading the byte at offset 104, masking off the two high bits, which are used by LAZ to indicate compression, and can be ignored.
  • determine the point data record length by reading two bytes at offset 105.
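The checks listed above can be collected into one function that operates on the first 589 bytes of a file. This is a minimal sketch: the function and struct names are hypothetical, and returning -1 for every failure is a simplification. The final PDRF range check follows from the requirement that COPC files contain only PDRF 6, 7, or 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 375 (LAS header) + 54 (VLR header) + 160 (info VLR payload) */
#define COPC_MIN_HEADER_BYTES 589

/* Hypothetical summary of the values a LAZ decompressor needs. */
struct CopcHeaderSummary {
    uint8_t pdrf;
    uint16_t point_record_length;
};

/* Returns 0 if buf looks like the start of a COPC file, -1 otherwise. */
int copc_check_header(const uint8_t *buf, size_t len,
                      struct CopcHeaderSummary *out)
{
    assert(buf != NULL && out != NULL);
    if (len < COPC_MIN_HEADER_BYTES)
        return -1;
    if (memcmp(buf, "LASF", 4) != 0)          /* LAS file signature */
        return -1;
    if (memcmp(buf + 377, "copc", 4) != 0)    /* user ID of the first VLR */
        return -1;
    if (buf[393] != 1 || buf[394] != 0)       /* record ID 1, little-endian */
        return -1;
    /* The two high bits of the format byte are used by LAZ. */
    out->pdrf = buf[104] & 0x3F;
    out->point_record_length = (uint16_t)(buf[105] | (buf[106] << 8));
    if (out->pdrf < 6 || out->pdrf > 8)
        return -1;
    return 0;
}
```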

The octree hierarchy is arranged in pages. The COPC info VLR provides the location and size of the root hierarchy page. The root hierarchy page can be used to traverse to child pages. Each entry in a hierarchy page refers to a child hierarchy page, an octree node data chunk, or an empty octree node. The size and file offset of each data chunk are provided in the hierarchy entries, allowing the chunks to be read directly for decoding.
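The traversal described above can be sketched as a recursive walk. To keep the example self-contained it assumes the whole file is available in memory; a reader working over HTTP would fetch each page with a separate range read instead. The function name, the depth limit, and the error convention are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COPC_ENTRY_SIZE 32
#define MAX_PAGE_DEPTH 64 /* hypothetical guard against cyclic page references */

static inline uint64_t read_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];
    return v;
}

static inline int32_t read_i32_le(const uint8_t *p)
{
    const uint32_t u = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    int32_t v;
    memcpy(&v, &u, sizeof v);
    return v;
}

/* Visit the page at [offset, offset + size) and every child page below it,
   adding the point count of each data chunk to *total.
   Returns 0 on success, -1 if the hierarchy is malformed. */
int walk_hierarchy(const uint8_t *file, uint64_t file_len,
                   uint64_t offset, uint64_t size, int depth,
                   uint64_t *total)
{
    assert(file != NULL && total != NULL);
    if (depth > MAX_PAGE_DEPTH || size % COPC_ENTRY_SIZE != 0)
        return -1;
    if (offset > file_len || size > file_len - offset)
        return -1;
    for (uint64_t pos = offset; pos < offset + size; pos += COPC_ENTRY_SIZE) {
        const uint64_t target = read_u64_le(file + pos + 16);
        const int32_t byte_size = read_i32_le(file + pos + 24);
        const int32_t point_count = read_i32_le(file + pos + 28);
        if (point_count == -1) {
            /* The entry refers to a child hierarchy page. */
            if (byte_size < 0)
                return -1;
            if (walk_hierarchy(file, file_len, target, (uint64_t)byte_size,
                               depth + 1, total) != 0)
                return -1;
        } else if (point_count > 0) {
            /* The entry refers to a data chunk at target. */
            *total += (uint64_t)point_count;
        } else if (point_count < -1) {
            return -1;
        }
        /* point_count == 0: empty node, nothing to read. */
    }
    return 0;
}
```

A reader would start the walk with the root_hier_offset and root_hier_size values from the info VLR and a depth of 0.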

Server Implementation Notes

For streaming of COPC files from HTTP servers to work, the server hosting COPC files needs to be configured to allow:

Credits

COPC was designed in July–November 2021 by Andrew Bell, Howard Butler, and Connor Manning of Hobu, Inc. Entwine and Entwine Point Tile were also designed and developed by Connor Manning of Hobu, Inc.

Hobu, Inc.

Support

COPC development was supported by

  • USACE ERDC CRREL RS/GIS
  • Microsoft Planetary Computer

Pronunciation

There is no official pronunciation of COPC. Here are some possibilities:

  • cah-pick – ka pIk
  • co-pick – kö pIk
  • cop-see – kap si
  • cop-pick – kap pIk
  • see oh pee see – si o pi si

Discussion

Use Case

Cloud Optimized GeoTIFF has shown the utility and convenience of taking a dominant container format for geospatial raster data and optionally augmenting its organization to allow incremental "range-read" support over HTTP. With the mantra of "It's just a TIFF" allowing ubiquitous usage of the data content combined with the flexibility of supporting partial reads over the internet, COG has found a sweet spot. Its reward is the ongoing rapid conversion of significant raster data holdings to COG-organized content to enable convenient cloud consumption of the data throughout the GIS industry.

What is the COG for point clouds? It would need to be similar in fit and scope to COG:

  • Support incremental partial reads over HTTP
  • Provide good compression
  • Allow dimension-selective reads
  • Provide all metadata and supporting information
  • Support an EPT-style octree organization for data streaming

"Just a LAZ"

LAZ (LASzip) is the ubiquitous geospatial point cloud format. It is an augmentation of ASPRS LAS that utilizes an arithmetic encoder to efficiently compress the point content. It has seen a number of revisions, but the latest supports dimension-selective access and provides all of the metadata support that normal LAS provides. Importantly, multiple software implementations (laz-rs, laz-perf, and LASzip) provide LAZ compression and decompression, and laz-perf and laz-rs include compilation to JavaScript, which is used by all JavaScript clients when consuming LAZ content.

Put EPT in LAZ

The EPT content organization supports LAZ in its current "exploded" organization. Exploded in this context means that each chunk of data at each octree level is stored as an individual LAZ file (or simple blob, or a zstd-compressed blob). One consequence of the exploded organization is that large EPT trees can mean collections of millions of files. In non-cloud situations, EPT's cost when moving or deleting data can be significant. Like the tilesets of late 2000s raster map tiles, lots of little files are a problem.

LAZ provides a feature that allows concatenation of the individual LAZ files into a single LAZ file. This is the concept of a dynamically-sized chunk table. It is a feature that Martin Isenburg envisioned for quad-tree organized data, but it could work the same for an octree.

Structural Changes to Draft Specification

  • Removed count from Page struct
  • Changed Record ID of COPC hierarchy EVLR from 1234 to 1000
  • Require reserved entries of the COPC VLR to have the value 0
  • Require the COPC VLR to be located immediately after the header at offset 375.
  • Increase the size of the COPC VLR data structure to 160 bytes.
  • Add laz_vlr_offset, laz_vlr_size, wkt_vlr_offset, wkt_vlr_size, eb_vlr_offset, eb_vlr_size to the COPC VLR, replacing 6 reserved entries.
  • PDRF must be 6, 7, or 8
  • Add extents VLR with UserID of copc and record ID of 10000.
  • VLR UserIDs switched from entwine to copc
  • Removed laz_vlr_offset, laz_vlr_size, wkt_vlr_offset, wkt_vlr_size, eb_vlr_offset, eb_vlr_size, root_hier_offset, root_hier_size from the COPC info VLR. Added 8 reserved entries.
  • Describe hierarchy entries for empty octree nodes.
  • Add back root_hier_offset and root_hier_size in COPC info VLR. Removed 2 reserved entries.
  • Remove extents VLR and put gpstime_minimum and gpstime_maximum in info VLR.

Validation

  • An online validator of COPC files is available at https://validate.copc.io. Simply drag-n-drop a (reasonably-sized) COPC file onto the page and it will validate the header information and optionally allow you to visualize it.

  • A C++ utility for verifying COPC metadata is available at https://github.com/hobuinc/copcverify

copcio.github.io's Issues

entwine/1000

Instead of entwine/1234 for the VLR, it might be useful to use entwine/1000, which would allow us a convenient semver opportunity should we ever need to bump things.

Chunk Table position

It seems that the chunk table position in the illustration does not match how the LAZ file is actually organized.

In the illustration it's positioned between the VLRs and the points, but in reality the chunk table is written after the points, and between the VLRs and the points there are 8 bytes that contain the offset to the start of the chunk table.

Dimension statistics

Nowadays the Entwine builder adds detailed dimension statistics to its schema (see here for example) including minimum, maximum, mean, stddev, and variance. Currently this is sort of an undocumented extension intended to eventually be codified as an optional (in order to be backward-compatible) extension to the EPT specification. Would it be worth specifying a statistics VLR to capture this information? I think the "number of points by return" array in the LAS 1.4 header adds precedent for this kind of thing.

An example might be:

struct BucketItem
{
  double value; // Maybe int64_t here - presumably this is only for integral dimensions
  uint64_t count;
};

struct CopcStatistics
{
  double minimum;
  double maximum;
  double mean;
  double stddev;
  double variance;
  uint64_t number_of_buckets; // 0 for most dimensions, but this one is nice for Classification counts
  BucketItem buckets[]; // number_of_buckets items
};

These statistics would then be stored in the order that the dimensions appear in the point data record format header, followed by statistics for extra-bytes dimensions in the order that they appear. Is there enough demand for this type of information to put it in the spec? Enough to require it?

One example of their usage I found is wonder-sk/point-cloud-experiments#60: "QGIS implementation is able to read those stats and use them to set up renderers (min/max values are especially useful to correctly set ranges for rendering)".

If anyone has other concrete use-cases where such data would be used, that would be useful.

Require VLR order

It would be nice if the "important" VLRs were all at the beginning of the file to reduce round-trips when first reading the data over a network connection. Suggest LASZIP VLR/COPC VLR/EB VLR.

Clarify VLR/EVLR representation

It seems to me that the COPC spec currently talks about VLRs but actually conflates VLRs and EVLRs, which may be problematic since the LAS spec clearly distinguishes between those.

Obviously, the info record cannot be an EVLR due to its location requirement, but it might be good to know whether hierarchy and extents must be represented as VLRs, EVLRs, or can be either.

(For anyone wondering, the Autzen sample currently has info and extents as "classic" VLRs, while its hierarchy is an EVLR.)

Remaining TODOs

We (rockrobotic) would love to help on getting the spec and work finalized. Would it be helpful to use the issue queue and assignees or just a task list here?

span

There's a chance here to correct the choice of EPT by not carrying over span as the concept of resolution/density/spacing.

Entwine/EPT chose this term, which represents the grid width of the voxelization of the data nodes, because of its strict voxelized resolution guarantees at each node depth. However this isn't the only sampling method possible, and Poisson, randomized, FIFO, and other non-gridded distributions may be possible. Also, EPT had the specified limitation that it must be voxelized via a power of 2, which doesn't need to be true for COPC.

In general, span is a very confusing metric that needs a lot of explanation, even though what it represents is actually simple. See for comparison the definition of the Potree equivalent to this field by @m-schuetz:

spacing: (Number) Space between points at the root node. This value is halved at each octree level.

While this definition is directly convertible to a span, its meaning is much more concrete and easy to understand.

I would propose replacing the integral span field in the COPC header with a double-precision field representing the spacing or distance between points at the root node, to more clearly represent the density of the data at each depth without the indirection of span.

Possible terms include spacing, distance, resolution, and the description should indicate that this is not necessarily a 100% deterministic value: in a given node some points may end up being closer together or farther apart than this value depending on the sampling method.
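As a small illustration of the proposed definition, the nominal spacing at a given depth follows from the root spacing by repeated halving. The function name below is hypothetical, and as noted above the result is a nominal value rather than a guarantee about any individual pair of points.

```c
#include <assert.h>

/* Nominal point spacing at an octree level, given the spacing at the
   root node (level 0). The value is halved once per level. */
double spacing_at_level(double root_spacing, int level)
{
    assert(level >= 0);
    double s = root_spacing;
    for (int i = 0; i < level; ++i)
        s *= 0.5;
    return s;
}
```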

Add a note about zero-point hierarchy nodes

It is possible to have a node level with zero points, but there are points below it. 0-sized chunks are not valid in LAZ. A note describing how to advertise this to hierarchy consumers is required.

Add WKT offset to COPC VLR

It would be nice if the WKT VLR could be located easily without searching. Adding an offset in the COPC VLR to the actual WKT data would support this.

How to validate a COPC file?

Hi! I'm working on converting las files to copc.laz format and have successfully generated the output files. However, I am looking for ways to validate the generated COPC files.

I came across this link: https://github.com/copcio/copcio.github.io?tab=readme-ov-file#validation. But the https://validate.copc.io/ option doesn't show any output, and I am not sure how to use the C++ version. So, are there any alternative ways to validate COPC files (maybe using a Python API or the pdal CLI, similar to pdal info)?

Thanks.

Lock Version 1.0.0

Are we at a point where we can lock the spec for v1.0.0?

All issues are resolved here (sans new example file). If we cut the 1.0.0 version, then other libs would be able to fully implement the spec and ensure future changes.

Sum error in implementation notes

In the implementation note you wrote

COPC is designed so that a reader needs to know little about the structure of a LAZ file. By reading the first 549 bytes (375 for the header + 54 for the COPC VLR header + 160 for the COPC VLR), the software can verify that the file is a COPC file and determine the point data record format and point data record length, both of which are necessary to create a LAZ decompressor.

375+54+160 = 589 not 549

One week freeze to declare 1.0.0 release

This is the one-week call to bring up any significant issues with COPC before we declare it released. Please post any items on this ticket, and we can create new tickets should the need arise. With no significant findings or items, we will declare the release final on 29 NOV 2021.

Dimension type VLR?

@m-schuetz's post #19 (comment) about desiring dimension type information brings up the question of whether it might be more generally useful.

My thought on the topic is COPC files are LAS/LAZ, and the dimension types and mappings of LAS files are generally known. Extra byte dimensions are rarer, and the type mapping of those dimensions is indeed missing.

Can any reader client developers make a stronger case for this? @connormanning @CCInc ?

Redundant fields in COPC header

The COPC offsets below are redundant info which is already contained in the LAS VLRs directly:

  uint64_t laz_vlr_offset;      // File offset of the *data* of the LAZ VLR
  uint64_t laz_vlr_size;        // Size of the *data* of the LAZ VLR.
  uint64_t wkt_vlr_offset;      // File offset of the *data* of the WKT VLR if it exists, 0 otherwise
  uint64_t wkt_vlr_size;        // Size of the *data* of the WKT VLR if it exists, 0 otherwise
  uint64_t eb_vlr_offset;       // File offset of the *data* of the extra bytes VLR if it exists, 0 otherwise
  uint64_t eb_vlr_size;         // Size of the *data* of the extra bytes VLR if it exists, 0 otherwise

What is the use-case for this data being duplicated? It seems error-prone to have the same data in multiple spots, and also seems to go against the push behind COPC - to live as a spatially organized structure inside a format that we don't have to invent. Here we are reinventing (a subset of) VLR header information in our own header format.

As for implementing the spec, if you are using a COPC or LAS library like the ones I linked above, you have the VLR information packaged for you for free. And if not, you are going to be manually parsing out the LAS header, the bodies of the VLRs (even if not their headers due to this duplication), walking and parsing the hierarchy, and others. The few lines of VLR header parsing are comparatively trivial.

Both current implementations, copc-lib and copc.js, walk the VLR headers on initialization and then ignore these COPC header values in favor of the VLR headers.

And as for reading the data, the magnitudes here are on the order of a few hundred bytes and would generally only be parsed once per reader instance, so I'm skeptical of a big performance impact.

So is there a concrete use-case for this data to exist? I would consider removing it to keep the format simple and less error-prone in the absence of a concrete need.

Standardize file name structure

What do we think about standardizing the file name as [filename].copc.laz to represent this file at a high level/easily?

We know there will be a VLR that defines the type as a COPC file, but that doesn't help some processors quickly know how to act (https://github.com/potree/potree/blob/develop/src/Potree.js#L139-L211)

It would be possible to build in the handling that does the check but as I understand it, it would require us to read the header for the VLR and then parse that (not ideal) to do the switch.

questions on spec

  • in CopcData, why int64_t span and not uint64_t ?
  • in Entry, why int32_t byteSize and not uint32_t ? And given that it might be size of the child hierarchy page, and that Page::count is 64 bit, shouldn't that be a 64 bit integer ?
  • in Entry, why int32_t pointCount, and not uint32_t ?
  • in Entry, int32_t pointCount: why not have it a uint32_t and use 0 as a special value instead of -1 ?

Example Dataset

For EPT, it is pretty easy to understand the structure and the parsing, but this will be a little more verbose, should we add a sample dataset or a link to a sample dataset?

Remove 'extents' VLR as a required COPC entity

extents by itself isn't particularly useful in practice. Knowing the min/max only helps you for a few dimensions, and if you're a renderer, you want much more than just that.

I think we should drop extents and put forward a definition for a copc stats VLR that is more fully featured but OPTIONAL. Importantly, the stats VLR could be something that could be added as an EVLR long after the file is created instead of during the COPC file creation process.

Principles of the stats VLR are pretty much as @connormanning suggested in #19, but we should do something specified as JSON, with a corresponding JSON Schema that allows implementations freedom to augment with auxiliary information if they desire.

Clarify that a COPC file is read only

If a copc file is modified, i.e. read and streamed into an output file with some modifications (to retain points of interest, for example), or simply decompressed, all the copc VLRs and EVLRs should be removed; otherwise the file becomes an invalid and corrupted copc file.

It may seem obvious, but the existing tools are not copc aware and are supposed to preserve the VLRs and EVLRs when writing a file. In the current state, many programs are likely to produce corrupted copc files when processing a copc file. I'm thinking, for example, about las2las or laszip from lastools, which preserve all the VLRs and EVLRs and are not copc aware yet. Many more existing tools should be impacted the same way. Consequently, in the current state, one can easily generate invalid copc files with an irrelevant EPT hierarchy.

I'm suggesting that the spec explicitly mention somewhere that a copc file is read only, and that any writer that is not copc aware should always get rid of the copc VLRs and EVLRs when the input is a copc file. That won't solve the problem for existing tools, but it will at least state the information explicitly.
