nfrechette / acl

Animation Compression Library

License: MIT License

C++ 93.28% C 0.26% CMake 1.62% Python 4.54% Java 0.10% Batchfile 0.06% Shell 0.15%

compression animation-compression game-engine animation-3d cpp c-plus-plus game-development

acl's Issues

Add full iOS support

iOS needs to support decompression; compression is optional and not really required for now.

Add iOS to cmake and make.py.
Can we add iOS to Travis CI?
Can we run unit tests on iOS?

Make sure unaligned loads are handled properly. On ARM, __packed is required!
Can be tested in UE 4.15.

Add CMake detection for SSE/AVX

https://github.com/VcDevel/Vc/blob/master/cmake/OptimizeForArchitecture.cmake

https://github.com/magic-sph/magic/blob/master/cmake/FindSSE.cmake
https://gist.github.com/hi2p-perim/7855506

Add support for OS X

OS X needs to support compression and decompression as well as at least the acl_compressor.py script.
Unit tests must also pass.

Add OS X to cmake and make.py.
OS X needs to be added to Travis CI as well.

Document acl_compressor.py usage

We should document the code as well as add a page under docs to show examples of how to use it.

Add normalization in the object space error metric

When there is no scale present, the TransformMatrixErrorMetric never normalizes the rotation quaternion. If the bone chain is long, error could accumulate.

Try adding normalization after every transform_mul or adding it just at the end and compare the results.

Move Allocator to its own headers and split it

Allocator should be renamed AnsiAllocator and derive from a new IAllocator interface.
A new DebugAllocator should be created that simply passes allocations through and asserts at destruction that the number of live allocations is zero, giving us rudimentary tracking of memory leaks and double frees.

This allocator should be used in the tools and unit tests that we provide to ensure there are no memory leaks or double frees, etc.
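
The split described above could be sketched as follows. This is only an illustration: the method names, their signatures, and the choice to wrap a parent allocator are assumptions, not the actual ACL interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical interface; the real signatures may differ.
class IAllocator
{
public:
    virtual ~IAllocator() = default;
    virtual void* allocate(std::size_t size, std::size_t alignment) = 0;
    virtual void deallocate(void* ptr, std::size_t size) = 0;
};

// Plain malloc/free allocator. Note that std::malloc only guarantees
// alignment suitable for fundamental types, so 'alignment' is ignored here.
class AnsiAllocator : public IAllocator
{
public:
    void* allocate(std::size_t size, std::size_t /*alignment*/) override { return std::malloc(size); }
    void deallocate(void* ptr, std::size_t /*size*/) override { std::free(ptr); }
};

// Passes every request through to a parent allocator and counts live allocations.
class DebugAllocator : public IAllocator
{
public:
    explicit DebugAllocator(IAllocator& parent) : m_parent(parent) {}

    ~DebugAllocator() override
    {
        // A non-zero count at destruction means something leaked.
        assert(m_num_live == 0 && "memory leak detected");
    }

    void* allocate(std::size_t size, std::size_t alignment) override
    {
        void* ptr = m_parent.allocate(size, alignment);
        if (ptr != nullptr)
            ++m_num_live;
        return ptr;
    }

    void deallocate(void* ptr, std::size_t size) override
    {
        if (ptr == nullptr)
            return;
        // With a bare counter, a double free is only caught once the count would go negative.
        assert(m_num_live > 0 && "more frees than allocations");
        --m_num_live;
        m_parent.deallocate(ptr, size);
    }

    int num_live_allocations() const { return m_num_live; }

private:
    IAllocator& m_parent;
    int m_num_live = 0;
};
```

A counter alone is rudimentary: catching every double free would require tracking the individual pointers.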

Drop range reduction if bit rate is raw

If the variable bit rate optimization algorithm fails to find a suitable quantized bit rate with an acceptable error, it falls back to 32 bits per component and those are stored as bit aligned float32 values. When this happens, range reduction can needlessly reduce the accuracy. Since we are already storing full floats, we might as well store the original clip values without any range reduction. This will increase the accuracy considerably of that special bit rate and avoid issues with exotic world space clips where range reduction hurts us.

Implement error compensation

See here for details on how it works at a high level: http://nfrechette.github.io/2016/12/22/anim_compression_error_compensation/

This would help dramatically for the few remaining exotic clips in the Paragon data set where the max error is unusually high.

Document fbx2acl.py usage

We should document the code as well as add a page under docs to show examples of how to use it.

Investigate a more accurate lerp

https://fgiesen.wordpress.com/2012/08/15/linear-interpolation-past-present-and-future/

We should look into the lerp: lerp_1(t, a, b) = (1 - t)*a + t*b

Measure and publish the results.
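
For reference, here is a minimal sketch of the formulation above next to two common alternatives. Only `lerp_1` comes from the text; the names `lerp_delta` and `lerp_fma` are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Formulation from above: two multiplications, exact at both end points for finite inputs.
inline float lerp_1(float t, float a, float b)
{
    return (1.0f - t) * a + t * b;
}

// Delta formulation: cheaper, but (b - a) can round, so t == 1 is not guaranteed to return b exactly.
inline float lerp_delta(float t, float a, float b)
{
    return a + t * (b - a);
}

// Two fused multiply-adds: computes (a - t*a) + t*b with a single rounding per step.
inline float lerp_fma(float t, float a, float b)
{
    return std::fma(t, b, std::fma(-t, a, a));
}
```

Measuring would then come down to running each variant over the same key pairs and interpolation alphas and comparing against a float64 reference.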

Document graph generation scripts and usage

We should document the code as well as add a page under docs to show examples.

Investigate range extent scaling

The segment range extent is always bounded by [min value ... (1.0 - min value)]

If my min value is say 0.6, my extent can be at most 0.4.
Instead of doing: mul_add(value, extent, min)
Try: mul_add(value, (1.0 - min) * extent_scaled, min)

The smaller the range extent, the more precise our bits become.

The same also holds for the range extent for rotation tracks since the boundaries are known: [-1.0 .. 1.0]
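
A minimal sketch of the two reconstructions, assuming normalized values in [0.0, 1.0]. The scalar `mul_add` stand-in and the helper names are assumptions made for the example.

```cpp
#include <cassert>

// Scalar stand-in for the vector mul_add: value * scale + offset.
inline float mul_add(float value, float scale, float offset)
{
    return value * scale + offset;
}

// Current approach: the stored extent is used directly.
inline float unpack_current(float value, float extent, float min)
{
    return mul_add(value, extent, min);
}

// Proposed approach: store the extent as a fraction of the largest extent
// that 'min' allows, so the stored extent always spans [0.0, 1.0].
inline float scale_extent(float extent, float min)
{
    assert(extent >= 0.0f);
    const float max_extent = 1.0f - min;
    return max_extent > 0.0f ? extent / max_extent : 0.0f;
}

inline float unpack_scaled(float value, float extent_scaled, float min)
{
    return mul_add(value, (1.0f - min) * extent_scaled, min);
}
```

With `min = 0.5` and `extent = 0.25`, the scaled extent is `0.5` and both paths reconstruct the same value. The extra multiplication at decompression is the cost to weigh against any accuracy gain.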

Add compression unit tests with CMU clips

Take 100 clips from CMU, some exotic, others picked based on their duration so we get a good mix.

Uniform sampling should be compressed with various methods and the decompression validated against an error output. See main.cpp in acl_compressor.

Ideally we want to test only the variants that are reasonably expected to be used otherwise the unit tests might take too long to execute. TBD

Implement a scale error metric function with Transform_32

While investigating an exotic clip from Paragon with an unusually high error (~9cm), I found out that when we drop the W component of a quaternion, it can yield a large error which is compounded by a deep hierarchy and excessively high scale (8000.0) and translation values (20000.0).

Attempting to use AffineMatrix_64 did not help at all. The issue isn't with the arithmetic or the rounding but with the fact that a small error in quat.w yields a small error in the matrix itself, and that error compounds. It is not possible to ortho-normalize the matrix at every bone because it contains scale.

When comparing against UE 4.15, the same clip has an error of ~170cm using the ACL error metric. However, using the UE 4.15 error metric, it is quite acceptable (<1cm). I also confirmed within UE 4.15 and the animation clip looks very clean, there is no visible error. This means that at least for this clip, the UE 4.15 error metric is much more accurate than ACL's when scale is present in this fashion.

Add support for additive animation clips

Additive animation clips can be implemented in one of two ways:

As a relative animation where we use classic transform_mul(transform_inverse(reference), value)
As UE4 does and simply adds the values together

In the latter format, the 3D scale can be zero which is problematic.
Ideally when compressing we must measure the error after the clip has been applied to the base clip to ensure the highest accuracy when it is played back. As such we must add the option for a clip to have a reference clip.

Some additive clips use a single frame as a reference while others use the whole clip time scaled.

Note that on the decompression side, the base clip isn't added. This is left for the game engine to perform at its leisure. For now anyway.

Add contribution guidelines

See https://github.com/nlohmann/json/blob/develop/.github/CONTRIBUTING.md for an example.
https://opensource.guide/starting-a-project/#writing-your-contributing-guidelines

Add support for x86

Appveyor already builds x86 but it has not been tested beyond the unit tests passing.
acl_compressor.py needs to be run on CMU to properly validate with: vs2015, vs2017, gcc5, clang5.

Add x86 support to Travis CI.

Add documentation for AnimationClip

We should document the code as well as add a page under docs with example code showing how to populate the structures and what they are used for.

Add a code of conduct

https://opensource.guide/starting-a-project/#establishing-a-code-of-conduct
https://www.contributor-covenant.org/

Add a dependency on sjson-cpp

Move the sjson writer to sjson-cpp and fix other changes made by ACL.

Include a full version under external, same as catch.
In the clip reader/writer which use the sjson stuff, add a check if the corresponding sjson header has ALREADY been included. Force the user to include SJSON manually, they can then either use their own dependency or the one included in external.

Disable segment range reduction for single segment clips

Single segment clips do not benefit from segment range reduction since the extent will be 1.0 and the min will be 0.0, adding no value, just overhead.

CMU does not have that many short clips but Paragon and most games do.

Remove recursion in error functions

Some error functions employ recursion. This is bad for very long bone chains. It should be easy enough to remove.
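
As an illustration of the iterative form, the sketch below walks a bone chain from a bone up to its root with a loop. The `Bone` layout and the scalar offset standing in for a full transform are assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t k_invalid_bone_index = 0xFFFF;

// Hypothetical bone: a parent index and a 1D offset standing in for a local transform.
struct Bone
{
    std::uint16_t parent_index;
    float local_offset;
};

// Walks from the bone up to the root with a loop, so stack usage is constant
// no matter how long the chain is.
inline float object_offset_iterative(const std::vector<Bone>& bones, std::uint16_t bone_index)
{
    float result = 0.0f;
    while (bone_index != k_invalid_bone_index)
    {
        assert(bone_index < bones.size());
        result += bones[bone_index].local_offset;
        bone_index = bones[bone_index].parent_index;
    }
    return result;
}
```

With real transforms the multiplication order matters, so a leaf-to-root walk has to compose each parent on the correct side.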

Add memory unit tests

We should validate the various memory_utils.h functionalities with unit tests.

Implement VQM arithmetic and an error metric

Could be useful to compare how it measures against the other error metrics that support scale.

Expose error metric function in the compression settings

It is imperative that the error metric function be as close as possible to what the host game engine will use internally to compute and blend poses.

For example, if we use matrices within the engine, we must use matrices to compute the error metric. Failing to do so could lead to the compression algorithm not seeing the same error as the game engine. AffineMatrix_32 does not perform at all like Transform_32 when scale is present. This would also allow support for VQM transforms.

Implement bind pose local space compression

Storing bone transforms in local space of the bind pose. For translation in particular, this reduces the range of values that we compress, increasing the accuracy and reducing the memory footprint a bit. At runtime when we decompress, we simply add back the bind pose.

Should be optional, this might very well be best done by the game. Perhaps we can provide only helper functions that the game can call. Maybe do nothing at all and let them deal with it?

Investigate fixed point arithmetic

Range reduction sometimes causes accuracy loss. Investigate fixed point arithmetic to see if it can improve accuracy.

Perhaps a mix of fixed point/float32 arithmetic should be used for optimal results?
Also keep in mind performance implications for the decompression.

http://x86asm.net/articles/fixed-point-arithmetic-and-tricks/

https://en.wikipedia.org/wiki/Fixed-point_arithmetic

http://www-inst.eecs.berkeley.edu/~cs61c/sp06/handout/fixedpt.html

Use a rotation track from CMU for a segment, 16 rotations.
Compare with current float32 code path.
Compare with float64 code path.
Compare with fixed point code path (possibly various precision settings).

Exhaustive comparison for every possible bit rate?

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301988

http://codesuppository.blogspot.ca/2015/02/sse2neonh-porting-guide-and-header-file.html

Add math unit test coverage

There are already some unit tests for math functions.
Make sure we have 100% coverage or as much as reasonably possible.

Add full android support

Can we add android to CI somehow?

Add android to cmake and make.py.

Add support for GCC 6 and 7 on Linux

Once the unit tests are extended and in place, adding support for this should be trivial and simply require adding them to Travis.

Add quat drop largest component rotation format

Instead of always dropping the W component, we should attempt to drop the largest component and store 2 bits somewhere to remember which component is dropped. This should improve accuracy considerably when W is small.

Where to store the extra 2 bits:

Either per sample as part of the packed bits
Part of the per segment track flags and simply drop the component that is the largest most often (if we do this, we can't always safely rebase the range on 1.0/sqrt(2), we could store an extra bit to determine the range)

Note that because the component dropped might change from sample to sample or segment to segment (depending on the above variant), we will have to store the full 4 component range information for the clip/segment. This is unfortunate but we will likely need the 4th component anyway in order to mix in full quaternion variable bit rate (no component dropping) when precision requires it.

For full precision mode and for constant samples, we can store the 2 bits as part of the 3 remaining floats. Because rotation components lie within [-1.0, 1.0], we only use a subset of the floating point range: the unbiased exponent is never greater than 0. With IEEE-754, the exponent is stored on 8 bits as exponent + 127, meaning the stored value is always smaller than 128 and the most significant exponent bit of each float is always 0. We can use that bit in the first two floats to store our 2 bits and clear it after the load to reconstruct the original exponent. This can be very cheap. The same spare bit in the 3rd remaining component could be used to reconstruct the sign of the stripped component. Note that this means that rotations cannot safely encode infinity/NaN, which is fine.

Measure and publish the results.
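
A minimal sketch of the per-sample variant, before any quantization. The `Quat` and `PackedQuat` types are hypothetical. The sketch also sidesteps the sign question by negating the quaternion when the dropped component is negative (q and -q encode the same rotation), which is an alternative to the spare sign bit described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Quat { float v[4]; };  // x, y, z, w

// Hypothetical packed form: 3 kept components plus the 2-bit index of the dropped one.
struct PackedQuat
{
    float kept[3];
    std::uint8_t dropped_index;
};

inline PackedQuat drop_largest_component(const Quat& q)
{
    std::uint8_t largest = 0;
    for (std::uint8_t i = 1; i < 4; ++i)
        if (std::fabs(q.v[i]) > std::fabs(q.v[largest]))
            largest = i;

    // Flip the whole quaternion if needed so that the dropped component is non-negative.
    const float sign = q.v[largest] < 0.0f ? -1.0f : 1.0f;

    PackedQuat packed;
    packed.dropped_index = largest;
    int k = 0;
    for (int i = 0; i < 4; ++i)
        if (i != largest)
            packed.kept[k++] = q.v[i] * sign;
    return packed;
}

inline Quat reconstruct(const PackedQuat& packed)
{
    assert(packed.dropped_index < 4);
    float sum_sq = 0.0f;
    for (int i = 0; i < 3; ++i)
        sum_sq += packed.kept[i] * packed.kept[i];

    // Unit length gives dropped^2 = 1 - sum of the others; clamp against rounding.
    const float remainder = 1.0f - sum_sq;
    const float dropped = std::sqrt(remainder > 0.0f ? remainder : 0.0f);

    Quat q;
    int k = 0;
    for (int i = 0; i < 4; ++i)
        q.v[i] = (i == packed.dropped_index) ? dropped : packed.kept[k++];
    return q;
}
```

Because the largest component is the one removed, the three kept components never exceed 1.0/sqrt(2) in magnitude, which is what allows the range to be rebased.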

Output max clip error in the compression stats

This would be critical for production use and allow a fallback algorithm to be used if the error isn't good enough.

Add the max error to: OutputStats

Add documentation on how to decompress a clip with uniform sampling

We should document the code as well as add a page under docs with example code showing how to populate the structures and what they are used for.

Add documentation to override asserts

We should document the code as well as add a page under docs to show example code.

Document make.py usage

We should document the code as well as add a page under docs to show examples of how to use it.

Add packing/unpacking unit tests

The various packing and unpacking functions should be properly unit tested.

Merge appveyor jobs and travis jobs

We have a lot of appveyor and travis jobs at the moment and they often fail on travis when installing packages due to download timeouts. Considering that each build is fairly fast, there is no need to have one job per configuration permutation.

It would make sense to have 1 job per compiler and do both debug/release and x86/x64 on it. Two jobs for appveyor (vs2015, vs2017) and five for travis (gcc5, clang4, clang5, xcode8, xcode9).

See discussion in issue #63.

Add pop_count support

Lots of modern processors support pop_count and count_leading_zero type instructions. This can speed up bit set manipulation considerably and could be used to optimize the decompression and the bone chain iterator.
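
One possible shape for these helpers, using the GCC/Clang builtins where available and a portable fallback elsewhere. The function names are assumptions; on MSVC or with C++20 (`std::popcount`, `std::countl_zero`) the intrinsic branch would look different.

```cpp
#include <cassert>
#include <cstdint>

// Number of bits set in a 32-bit word.
inline std::uint32_t count_set_bits(std::uint32_t value)
{
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<std::uint32_t>(__builtin_popcount(value));
#else
    // Portable fallback: each iteration clears the lowest set bit.
    std::uint32_t count = 0;
    while (value != 0)
    {
        value &= value - 1;
        ++count;
    }
    return count;
#endif
}

// Number of zero bits above the highest set bit; 32 when no bit is set.
inline std::uint32_t count_leading_zeros(std::uint32_t value)
{
    if (value == 0)
        return 32;  // __builtin_clz is undefined for 0, so handle it up front
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<std::uint32_t>(__builtin_clz(value));
#else
    std::uint32_t count = 0;
    while ((value & 0x80000000u) == 0)
    {
        value <<= 1;
        ++count;
    }
    return count;
#endif
}
```

For a bit set word that marks which tracks are animated, `count_set_bits` gives the track count directly instead of testing every bit.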

Add performance page for the matinee fight scene

It is relevant to track and I already have the data, just need to write it down.

Add a licence badge to readme.md

See https://github.com/nlohmann/json as an example

Revamp the getting started section

It needs to be broken down for every platform we support.
It needs to link with the bare minimum that needs to be done for integration: allocator, error handling, populating raw clip structures, compressing, and decompressing.

A section on contributing with details on: how to run the unit tests, the make.py script, the various tools, etc.

Register with CII best practices

https://bestpractices.coreinfrastructure.org/

See https://github.com/nlohmann/json for an example

Add support for step time update

Sometimes monotonic time updating isn't desired between keys. This could be to give a retro look and feel to animations (e.g. lego movie) or to handle camera cuts in cinematics where we teleport the character and do not wish to interpolate between some keys.
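
A minimal sketch of what such an option could look like for a single float track with uniform sampling. The `SampleMode` enum and `sample_track` function are hypothetical names, not existing ACL API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policy: interpolate between keys or hold the previous key.
enum class SampleMode { Interpolate, StepFloor };

inline float sample_track(const std::vector<float>& keys, float sample_rate, float time, SampleMode mode)
{
    assert(!keys.empty() && sample_rate > 0.0f);

    // Position in key units, clamped to the valid range.
    const float last_key = static_cast<float>(keys.size() - 1);
    const float position = std::min(std::max(time * sample_rate, 0.0f), last_key);

    // position is non-negative, so the cast truncates like floor.
    const std::size_t key0 = static_cast<std::size_t>(position);
    if (mode == SampleMode::StepFloor)
        return keys[key0];  // hold the previous key, no interpolation

    const std::size_t key1 = std::min(key0 + 1, keys.size() - 1);
    const float alpha = position - static_cast<float>(key0);
    return keys[key0] + alpha * (keys[key1] - keys[key0]);
}
```

A nearest-key policy would be a third option. Handling camera cuts may also need the choice to be made per key pair rather than per clip.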

Add documentation to IAllocator

We should document the code as well as add a page under docs with example code showing how to implement the interface.

Add a release badge to readme.md

See https://github.com/nlohmann/json as an example

Add statistics for how much memory is touched when we decompress a single frame

How many bytes needed in clip header?
Segment header?
Constant track data?
Clip range data?
Segment track formats?
Segment range data?
Animated data is already tracked

Once we have this information, we can trivially compute how many bytes and how many cache lines are touched to sample 1 bone or 1 pose.

Try different vector rotation by a quaternion

https://blog.molecular-matters.com/2013/05/24/a-faster-quaternion-vector-multiplication/

Classic:
v' = q * v * conjugate(q)

Different formulation:
t = 2 * cross(q.xyz, v)
v' = v + q.w * t + cross(q.xyz, t)

Is the accuracy better or worse?
Is the performance better or worse?
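
To compare the two, a scalar sketch of both formulations is given below. The `Vec3` and `Quat` types and the Hamilton product convention are assumptions for the example; the library's own math types and multiplication order may differ.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };

inline Vec3 cross(const Vec3& a, const Vec3& b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Hamilton product a * b.
inline Quat quat_mul(const Quat& a, const Quat& b)
{
    return {
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z
    };
}

// Classic: v' = q * v * conjugate(q), with v promoted to a pure quaternion.
inline Vec3 rotate_classic(const Quat& q, const Vec3& v)
{
    const Quat pure{ v.x, v.y, v.z, 0.0f };
    const Quat conj{ -q.x, -q.y, -q.z, q.w };
    const Quat r = quat_mul(quat_mul(q, pure), conj);
    return { r.x, r.y, r.z };
}

// Different formulation: t = 2 * cross(q.xyz, v); v' = v + q.w * t + cross(q.xyz, t)
inline Vec3 rotate_cross(const Quat& q, const Vec3& v)
{
    const Vec3 qv{ q.x, q.y, q.z };
    const Vec3 c0 = cross(qv, v);
    const Vec3 t{ 2.0f * c0.x, 2.0f * c0.y, 2.0f * c0.z };
    const Vec3 c1 = cross(qv, t);
    return { v.x + q.w * t.x + c1.x, v.y + q.w * t.y + c1.y, v.z + q.w * t.z + c1.z };
}

// Largest per-component difference, handy for comparing accuracy against a reference.
inline float max_abs_difference(const Vec3& a, const Vec3& b)
{
    return std::fmax(std::fabs(a.x - b.x), std::fmax(std::fabs(a.y - b.y), std::fabs(a.z - b.z)));
}
```

Both assume a unit quaternion. Accuracy could be measured by comparing each result against a float64 reference over a large set of random rotations and vectors.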