x448 / float16
float16 provides IEEE 754 half-precision format (binary16) with correct conversions to/from float32
License: MIT License
Also consider having two subnormal precisions instead of one:
The first one probably shouldn't be named with an "Exact" suffix, because some of those values won't round-trip back to float32.
It would be useful to know the precision of converting IEEE binary32 to binary16, if the function can be inlined.
PrecisionFromfloat32 should return Precision without performing the conversion.
Conversions from both Infinity and NaN values will always report PrecisionExact, even if the NaN payload or the NaN quiet bit is lost.
If this is too complex to be inlined by Go, then make it an extra return value as part of conversion functions.
// Precision indicates whether the conversion to Float16 is
// exact, inexact, underflow, or overflow.
type Precision int
const (
	PrecisionExact Precision = iota
	PrecisionInexact
	PrecisionUnderflow
	PrecisionOverflow
)
func PrecisionFromfloat32(f32 float32) Precision
Fromfloat32() is 100% compatible with AMD and Intel F16C instructions by producing identical results for all 4+ billion conversions. Unfortunately, this means NaN input values are converted to NaN with quiet bit always set.
It can be useful to preserve the original NaN signaling status, so provide FromNaN32ps() to convert a 32-bit NaN to a 16-bit NaN while preserving both signaling status and payload.
Additionally, implement the function so it can inline and perform faster than Fromfloat32().
// ErrInvalidNaNValue indicates a NaN was not received.
var ErrInvalidNaNValue = errors.New("float16: invalid NaN value, expected IEEE 754 NaN")
// FromNaN32ps converts nan to IEEE binary16 NaN while preserving both
// signaling and payload. Unlike Fromfloat32(), which can only return
// qNaN because it sets quiet bit = 1, this can return both sNaN and qNaN.
// If the result is infinity (sNaN with empty payload), then the
// lowest bit of payload is set to make the result a valid sNaN.
// This function was kept simple to be able to inline.
func FromNaN32ps(nan float32) (Float16, error)
Thank you for making this very useful and well-tested library! Are you planning to add support for the bfloat16 format, which is used in the ML field? It has different bit widths for the mantissa and exponent, but the other rules are the same as in the IEEE 754 formats.
Hey! Can this use hardware instructions for conversion? Intel CPUs have supported hardware conversion since 2013, and the new 12th gen also has support for arithmetic (I think?). Other architectures have had that for a while.
This might be possible without compiler support using embedded C code, but wouldn't that be out of scope for this?
And other cleanups/refactoring of unit tests while at it.
-var ErrInvalidNaNValue = errors.New("float16: invalid NaN value, expected IEEE 754 NaN")
+const ErrInvalidNaNValue = float16Error("float16: invalid NaN value, expected IEEE 754 NaN")
+
+type float16Error string
+
+func (e float16Error) Error() string { return string(e) }
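The effect of the change in the diff above is that the sentinel error becomes a compile-time constant that other packages cannot reassign, while callers can still compare against it. A small sketch of how that might look from the caller's side follows. The helpers checkNaN, wrap, and isInvalidNaN are hypothetical and exist only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// float16Error is a string-based error type, as in the diff above.
type float16Error string

func (e float16Error) Error() string { return string(e) }

// ErrInvalidNaNValue is a constant, so it cannot be reassigned at run time.
const ErrInvalidNaNValue = float16Error("float16: invalid NaN value, expected IEEE 754 NaN")

// checkNaN is a hypothetical function that returns the sentinel error.
func checkNaN(isNaN bool) error {
	if !isNaN {
		return ErrInvalidNaNValue
	}
	return nil
}

// wrap is a hypothetical helper that adds context while keeping the cause.
func wrap(err error) error {
	return fmt.Errorf("convert: %w", err)
}

// isInvalidNaN reports whether err is, or wraps, ErrInvalidNaNValue.
// errors.Is falls back to == here, which works for string-based types.
func isInvalidNaN(err error) bool {
	return errors.Is(err, ErrInvalidNaNValue)
}
```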
Add Fromfloat32ex as an extended version of Fromfloat32 that returns more info.
Fromfloat32ex returns:
Precision returned could be one of:
Add the method (f Float16) Bits() uint16 to improve API symmetry.
The Float32 method is the reverse of the Fromfloat32 function.
The Bits method will be the reverse of the Frombits function.
This addition will not increase bloat because calling Bits should inline as a simple type cast.
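A minimal sketch of the proposed symmetry, assuming Float16 is defined as a uint16 holding the raw binary16 bits:

```go
package main

// Float16 holds the raw IEEE 754 binary16 bits (assumed representation).
type Float16 uint16

// Frombits returns the Float16 whose bit pattern is u16.
func Frombits(u16 uint16) Float16 {
	return Float16(u16)
}

// Bits returns the raw bit pattern of f; it is the reverse of Frombits.
func (f Float16) Bits() uint16 {
	return uint16(f)
}
```

With these definitions, Frombits(u).Bits() returns u for every uint16 value, and both calls reduce to a type conversion.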