jk-jeon / dragonbox Goto Github PK
View Code? Open in Web Editor NEWReference implementation of Dragonbox in C++
License: Apache License 2.0
Reference implementation of Dragonbox in C++
License: Apache License 2.0
Hello :D
Can you briefly explain why you do these checks?
dragonbox/source/dragonbox_to_chars.cpp
Lines 80 to 82 in 34a6cdf
For example,
constexpr int buffer_length = 1 + // for '\0'
jkj::dragonbox::max_output_string_length<jkj::dragonbox::ieee754_binary64>;
char buffer[buffer_length];
double d{980528.1478392933};
char* end_ptr = jkj::dragonbox::to_chars(price, buffer);
std::cout << buffer << std::endl;
I got "9.805281478392933E5", but I want "980528.1478392933";
I used https://github.com/jk-jeon/dtoa-benchmark to build a win32 exe straight forward from the vs 2019 sln file, and find out dragonbox is corrupted on win32 at the moment. I also tried overwriting dragonbox.h in benchmark project with the latest file from this repository, still no help.
To reproduce:
It seems that the result structure is never initialized, which means that the may_have_trailing_zeros
can return whatever was on the stack if none of on_trailing_zeros
or no_trailing_zeros
functions are called.
I have this issue in a test program I made to compare my C port of the reference code. It happens with the following values, which returns true (in fact it's not 'true' but '5') for may_have_trailing_zeros
in my test code (which means that you might not get the same result with a simple call with those values). That value comes from the compute_nearest_normal
function on line 1799 where ret_value
is declared but not initialized. The field is never written to after that.
using namespace jkj::dragonbox;
uint32_t i = 1283457033;
float f = *( float* )( &i );
auto v = to_decimal( f, policy::decimal_to_binary_rounding::nearest_to_even, policy::binary_to_decimal_rounding::do_not_care, policy::trailing_zero::report, policy::sign::return_sign, policy::cache::full );
It seems that there should be no "= \varnothing" at the beginning of the second column on page 10.
And no "-s" in "ingredients", likewise no "-o" in "computationo".
As we are already talking about the article, how is this fact at the very end of the first column on page 9 being used later in the proof?
I generated decimal normal number by concatenate to integers together to this effect:
dbl = operand_fractpart / std::pow(10, n); // operand_fractpart is an integer
dbl += operand_intpart; // operand_intpart is also an integer
Then passes a double to DragonDecimal.cpp as attached. I passed 10000 doubles and toStringRaw() output is also attached.
fp_by_dragonbox_decimal_raw8.txt
My code makes sure operand_fractpart is only 8 digits. However we can see -7.018380128700001E2. By right it should be -7.0183801287E2?
However if I tried to verify whether -701.83801287 is a normal number by:
double d = -701.83801287;
memset(buffer, 0x0, sizeof(buffer));
jkj::dragonbox::to_chars(d, buffer);
std::cout << buffer << std::endl;
Then it's ok. I feel a literal double may be different from a dynamic memory double?
Every my 10000 double, there are always a couple of non-shortest values.
The only place where it is a power of 10 is used is that powers of 10 are stored in the precomputed cache. By allowing more general integers other than
dragonbox/include/dragonbox/dragonbox.h
Line 625 in bbb24df
This is not the case on Fedora 39, 64 bit linux with clang 17. Unsigned long will work correctly in my case, but more testing is required depending on the platform rather than just the presence of __builtin_addcll.
The following error arises when compiling with MSVC (17.5.0 Preview 2.0) with /W4
:
...\dragonbox-src\include\dragonbox\dragonbox.h(662): warning C4702: unreachable code
The line in for the warning is:
dragonbox/include/dragonbox/dragonbox.h
Line 662 in 4b19578
However, that isn't the real problem. The warning is actually triggered by the whole second if constexpr
block at:
dragonbox/include/dragonbox/dragonbox.h
Line 657 in 4b19578
Changing that second if constexpr
to else if constexpr
resolves the warning.
As discussed in dtolnay/dragonbox#1, the algorithm for calculating the decimal length is rather inefficient, since the number of digits can be computed with minimal branching using a shr
, add
, and a table lookup instruction as described here.
The current implementation is:
JKJ_FORCEINLINE static constexpr std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
assert(v < 1000000000);
if (v >= 100000000) { return 8; }
if (v >= 10000000) { return 7; }
if (v >= 1000000) { return 6; }
if (v >= 100000) { return 5; }
if (v >= 10000) { return 4; }
if (v >= 1000) { return 3; }
if (v >= 100) { return 2; }
if (v >= 10) { return 1; }
return 0;
}
Which optimizes quite poorly (for example, on x86):
decimal_length_minus_1(unsigned int):
mov eax, 8
cmp edi, 99999999
ja .L1
mov eax, 7
cmp edi, 9999999
ja .L1
mov eax, 6
cmp edi, 999999
ja .L1
mov eax, 5
cmp edi, 99999
ja .L1
mov eax, 4
cmp edi, 9999
ja .L1
mov eax, 3
cmp edi, 999
ja .L1
mov eax, 2
cmp edi, 99
ja .L1
xor eax, eax
cmp edi, 9
seta al
.L1:
ret
Using a built-in ctlz function, we can make this much more efficient:
#ifdef _MSC_VER
#include <intrin.h>
#endif
JKJ_FORCEINLINE static std::uint32_t ctlz(std::uint32_t v) {
// Optimize for GCC/Clang when possible.
#ifdef __has_builtin
#if __has_builtin(__builtin_clz)
return __builtin_clz(v);
#endif
#endif
// Optimize for MSVC
#ifdef _MSC_VER
#if defined(_M_X64) || defined(_M_ARM64)
unsigned long leading_zero = 0;
_BitScanReverse(&leading_zero, v);
return static_cast<std::uint32_t>(31 - leading_zero);
#endif
#endif
// Fallback algorithm
int count = 0;
if (v & std::uint32_t(0xffff0000)) {
v >>= 16;
count += 16;
}
if (v & std::uint32_t(0xff00)) {
v >>= 8;
count += 8;
}
if (v & std::uint32_t(0xf0)) {
v >>= 4;
count += 4;
}
if (v & std::uint32_t(0xc)) {
v >>= 2;
count += 2;
}
if (v & std::uint32_t(0x2)) {
v >>= 1;
count += 1;
}
return 31 - count;
}
JKJ_FORCEINLINE static std::uint32_t fast_log2(std::uint32_t const v) {
return 31 - ctlz(v | 1);
}
JKJ_FORCEINLINE static std::uint32_t decimal_length_minus_1(std::uint32_t const v) {
static constexpr std::uint32_t table[] = {9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999};
std::uint32_t y = (9 * fast_log2(v)) >> 5;
y += v > table[y];
return y;
}
Most of the bloat is just to ensure we have a portable way to count leading zeros. And this optimizes quite nicely with the built-in optimizations:
decimal_length_minus_1(unsigned int):
mov eax, edi
mov edx, 31
or eax, 1
bsr eax, eax
xor eax, 31
sub edx, eax
lea eax, [rdx+rdx*8]
shr eax, 5
mov edx, eax
cmp DWORD PTR decimal_length_minus_1(unsigned int)::table[0+rdx*4], edi
adc eax, 0
ret
For MSVC on x86_64, we have:
unsigned int decimal_length_minus_1(unsigned int) PROC ; decimal_length_minus_1, COMDAT
xor eax, eax
mov edx, ecx
or edx, 1
bsr r8d, edx
lea r9d, DWORD PTR [r8+r8*8]
shr r9d, 5
lea r8, OFFSET FLAT:unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table
cmp ecx, DWORD PTR [r8+r9*4]
seta al
add eax, r9d
ret 0
unsigned int decimal_length_minus_1(unsigned int) ENDP
And for MSVC on ARM64, we have:
|unsigned int decimal_length_minus_1(unsigned int)| PROC ; decimal_length_minus_1
orr w8,w0,#1
clz w9,w8
mov w10,#0x1F
sub w10,w10,w9
add w8,w10,w10,lsl #3
adrp x9,|unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table|
add x9,x9,|unsigned int const * const `unsigned int decimal_length_minus_1(unsigned int)'::`2'::table|
lsr w11,w8,#5
ldr w8,[x9,w11 uxtw #2]
cmp w0,w8
csethi w10
add w0,w10,w11
ret
ENDP ; |unsigned int decimal_length_minus_1(unsigned int)|, decimal_length_minus_1
Since the fallback ctlz algorithm isn't very good, we could do a simple optimization for MSVC and GCC/Clang, and for another other compiler keep the old algorithm.
And enclose instances of large constants with UINT64_C
and such.
essentially, the same as DtoaMinBufferLength
from https://github.com/abolz/Drachennest/blob/master/src/dragonbox.h (although the value given there seems a bit large but i haven't looked into it)
this would be much more reliable and usable than relying on what the README says:
char buffer[31]; // Should be long enough
that way, users can write:
char buffer[dragonboxBufferLen];
and it'll guarantee no buffer overflow, etc
The current implementation has grown too much to a point where it is nearly impossible to understand for people with not so much exposure to template black magics and other C++-specific nonsenses. The complication of the implementation can act as a significant barrier for people wanting to port the algorithm into languages other than C++, or even into C++, even though the algorithm itself is not that overcomplicated.
Those templates and other nonsenses are done for reasons, to make this library general and fast enough, but at the same time since this is the reference implementation, it should serve the role as such, i.e., to provide guidance to other people wanting to write their own implementations. Hence, I think it makes sense to provide the core algorithm written in simpler, non-esoteric C++ (or maybe even C99?) that everybody can understand.
To be clear, I do not intend to replace the current implementation with a simpler one, which I consider impossible without losing many utilities and performance. As pointed out earlier, those template hackeries are done for reasons. Instead, what I want to have is a standalone implementation of the algorithm that only serves the expository role, which does not need to be optimized-to-hell, rather just enough to demonstrate the algorithm's potential. There already is an implementation which can serve such a role, done by Alexander Bolz, but (1) there is no implementation for IEEE-754 binary32, (2) his implementation is quite outdated at this moment as the algorithm has been improved quite a lot since then, and (3) I think it's nicer to have one in this official repository for the reference implementation, kept in sync with whatever further improvements we bring in the future.
I may work on this someday if no one helps me, but PR is very welcome if anyone is interested!
It would be useful to have an option to specify desired output precision.
Especially for to_chars
.
In practice, computing the parity of x(i) is still a heavy
operation compared to others. Hence, it can be beneficial to
delay it until it is absolutely necessary. In Algorithm 5.2, we
need to compute the parity of x(i) when r and δ(i) turn out
to be the same. However, in fact if either the interval I does
not include the left endpoint or x is not an integer, we always
conclude I ∩ 10^(−k0+1)Z is empty regardless of the parity of
x(i). Therefore, if the given rounding mode indicates that the
left endpoint is not in I, we can skip computing the parity of
x(i).
(section 5.1.6, page 21)
If I understand this correctly, the condition of the left endpoint being excluded or x not being an integer is enough to say that the intersection in question is empty, and the parity doesn't need to be checked.
However, the algorithm implementation seems to disagree:
if (!interval_type.include_left_endpoint() ||
exponent < case_fc_pm_half_lower_threshold ||
exponent > divisibility_check_by_5_threshold) {
// If the left endpoint is not included, the condition for
// success is z^(f) < delta^(f) (odd parity).
// Otherwise, the inequalities on exponent ensure that
// x is not an integer, so if z^(f) >= delta^(f) (even parity), we in fact
// have strict inequality.
if (!compute_mul_parity(two_fl, cache, beta).parity) {
goto small_divisor_case_label;
}
}
else {
auto const [xi_parity, x_is_integer] =
compute_mul_parity(two_fl, cache, beta);
if (!xi_parity && !x_is_integer) {
goto small_divisor_case_label;
}
}
In fact, the comments explicitly state that checking parity is required. From quick testing it seems that the code is right here.
Great package! Thank you!
I've noticed that to_chars
sometimes emits an extra E0
suffix when it's not needed. For example, the number 1.0
is emitted as 1E0
.
Is this intentional?
Currently a policy parameter belongs to a policy kind if its type is deriving from the specified base type.
At this point it seems checking for the existence of a specific member type sounds like a better idea in general.
I recently implemented the main algorithm in Factor with the paper as my main source.
My two main hurdles were these:
Another, smaller grievance is in section 5.2.3:
P.S.: In section 5.1.7 for completeness, you could mention that you compute r as r = z^(i) - 10^(κ+1)*s. This might seem like baby steps, but I had to extract the obvious from another implementation.
Now the main library does not rely on this assumption, but many testing and meta codes still do. Dependence on the traits also need to be minimized, and ideally as many as possible templates should be parameterized by the format & carrier pair.
Hello,
Would it be possible to offer a .pc (pkg-config) file that'd be installed with the build system, for use with non-CMake projects?
Thank you!
I find situations like percentages very common, but using floating pointers to print them out would be very costly.
Do you have an algorithm to deal with that?
One without precision:
For example, giving input (3,10)
Output:
30%
or
0.3
For example, giving input (3,9)
Output:
33.(3)%
or
0.(3)
For example, giving input (2,14)
Output:
14.(285714)%
or
0.(142857)
Also, with precision version (which is very common in statistics, like keeping two digits decimals):
For example, giving input (2,14), precision 2
Output:
14.26%
or
0.14
I need a port of this to C++11
Perhaps it would be possible to apply the rounding trick that Schubfach uses to eliminate some of those ugly integer checks, which will greatly improve the overall performance.
However, it requires a major overhaul of the entire algorithm so it will take time.
In line 832 of dragonbox.h there is a right shift of 7 bits, whereas in the paper it is a right shift of 71 bits (section 5.1.7). It seems that 71 is indeed correct.
So that performance comparison can be done between different implementation add them to the benchmarks:
How does this change with decimals and other binary bit sizes? Can they be added?
Unless I am missing something there appears to be no reverse conversion in this library.
I am trying to convert Ryu based code that operates on moral equivalent of decimal_fp (doing more than just printing) to Dragonbox. Lack of reverse conversion is a big stumbling block because I would need to essentially carry around big chunks of Ryu data and code to do it.
As an aside it also looks like to_chars
only does exponential notation. Lack of fixed format would be another stumbling block.
Would be great if decimal_fp were exposed as a first class thing with conversion and formatting functions. 😃
We need to get rid of some of the indirections that are now pretty pointless and exist only because of historical legacy.
[ edit ] solved, the below discussion - besides lengthy - helped me
to get the door open, thus if you don't find better solution take
the time and read. [ /edit ]
I followed the readme, downloading and building had meaningful
responses, but I'm stuck to get the door open.
Would like to compile and run the test programs provided in the
readme, but don't know where to put and how to compile.
I'm not an experienced C++ coder, neither can tell the difference
between Cmake and make.
TIA for any help :-)
Hi @jk-jeon, and thank you for your amazing work on this!
This isn't exactly an issue, but rather a question or a request for clarification of the documentation.
The documentation says the algorithm is round-trip.
To check this my understand of this, I have written a simple loop whose index "i" iterates all uint64_t values and:
Here's a sample of its output; formatting is done with an old version of fmtlib (5.3.0):
i=4716101740380815687 orig=1.98801e+07 dec=(1988012400000122, -8, +) back=1.98801e+07 back_i=4716101740380815688
i=4716101740380815688 orig=1.98801e+07 dec=(19880124000001222, -9, +) back=1.98801e+07 back_i=4716101740380815689
i=4716101740380815699 orig=1.98801e+07 dec=(19880124000001263, -9, +) back=1.98801e+07 back_i=4716101740380815700
i=4716101740380815700 orig=1.98801e+07 dec=(19880124000001267, -9, +) back=1.98801e+07 back_i=4716101740380815701
i=4716101740380815703 orig=1.98801e+07 dec=(19880124000001278, -9, +) back=1.98801e+07 back_i=4716101740380815704
i=4716101740380815713 orig=1.98801e+07 dec=(19880124000001315, -9, +) back=1.98801e+07 back_i=4716101740380815714
I haven't verified this programatically but I notice that the back integer representation is one off the original integer from which the double was created. Note that the exact code computing "back" from "dec" is:
double back = static_cast<double>(dec.significand) * pow(10.0, static_cast<double>(decimal.exponent));
My questions are:
Thank you kindly in advance for your time.
I recently contributed to https://github.com/fastfloat/fast_float to make the from_chars
implementation there constexpr from C++20.
I think making to_chars
constexpr is equally possible. Would efforts to do this be welcome here?
I see that to_chars
is currently defined in a .cpp file, which could make this challenging without making the library header-only.
I'd like to use dragonbox for serializing floats in Glaze. What I'm particularly looking for is a JSON compliant to_chars
implementation that has the shortest string result. It seems like dragonbox's to_chars
will print out ...E0
when it is not necessary.
The other requirement for JSON compliance is that infinity and nan should be printed as "null"
.
Could options to support these features be added to dragonbox or should I fork the code and edit it myself?
I'd also like the to_chars
to be header only, but I don't mind doing this conversion myself and keeping it up to date with dragonbox
.
Thanks for all the hard work you've put into this great library!
Hi,
I am trying to test dragonbox with the nvcc compiler (Version 201402L).
But it throws;
I only include dragonbox_to_chars.h in my test code (I have to write millions of float numbers to a csv file with high performance. std::to_string and fprintf are far to slow.).
In my understanding and looking into the dragonbox code it should work with this nvcc version.
Where is the knot in the head?
Thank you!
Related: #3 (comment)
benchmark
Currently, there seems to be no cross-compilation to ensure support for ARM, as well as OSes like Windows and macOS. It may be wise to add tests for ARM 32-bit and 64-bit, as well as a big-endian target (PowerPC64 could be a good example). Windows and macOS support are native to workflows, so it's a simple matter of adding configurations. There's a few available options to ensure quick cross-compilation for different:
Use Dockcross would be a drop-in replacement, while zig cc would only require a toolchain file for the configurations.
These would make sure the intrinsics, such as those in #11, are supported properly. If desired, I'd gladly submit a PR with the additional workflows. Would any other targets, such as MIPS64, or RISC-V be desired?
Hi,
Can you give some advice, how to convert float (4 bytes) to double (8 bytes).
I am looking to convert float -> double in a way that to_string(double) to produce same string as original to_string(float).
Any advice would be really appreciated.
My current idea is to you dragonbox to convert float to chars, and then use some fast chars->double converter with round-trip guarantee.
Would be very nice if you will tell me that my idea is stupid, and describe a way that is much better.
Thanks in advance!
Thanks to Tim Kientzle (@tbkka), it turned out that the current code is not working correctly with ARM processors. I believe this is due to errors in fallback implementations of intrinsics. Also the current #define
switches are not able to correctly detect availabilities of intrinsics for ARM processors, resulting in suboptimal code generation.
See: swiftlang/swift#35299.
__int64_t intpart = getInteger();
__int64_t fractpart = getInteger();
double dbl = fractpart + 0.000000000000000000000005;
dbl = dbl / std::pow(10, countDigit(fractpart)); // countDigit return 5 when fractpart is 12345
dbl += intpart;
I want to make sure my double binary representation like 123.12345000000000000000000000005. So when I run jkj::dragonbox::to_decimal(dbl), I want to make sure I get 123.12345 (of course by (significand, exponent, and is_negative)), so I can verify. I used above code to generate 10000 values. However when significand/mantissa is 16 digits, quite a lot of the decimal are not accurate although I tried most policies.
if significand/mantissa < 16, so far my test looks ok.
double x = 1.234;
auto v = jkj::dragonbox::to_decimal(x);
std::cout << v.significand << std::endl;
outputs 1234000000000000
instead of 1234
. I am not sure if this is intended or not, but it is an example from readme, so either readme should be fixed or something is wrong with the code.
I noticed this when tried to update Dragonbox from 33a9e02 to the current master in the stress of SerenityOS's Ryu implementation.
Currently we use the method proposed in the paper by Granlund-Montgomery: using modular inverse and std::rotr
.
However, there are at least two competitors in this league. Assuming
For future reference for myself, I write the code for the last method here:
int remove_trailing_zeros(std::uint32_t& n) noexcept {
int s = 0;
while (true) {
auto const nm_mod = std::uint32_t(n * UINT32_C(42949673));
if (nm_mod < UINT32_C(42949673)) {
s += 2;
n = std::uint32_t(nm_mod >> 2);
}
else {
break;
}
}
auto const nm_mod = std::uint32_t(n * UINT32_C(1288490189));
if (nm_mod < UINT32_C(429496731)) {
s += 1;
n = std::uint32_t(nm_mod >> 1);
}
return s;
}
int remove_trailing_zeros(std::uint64_t& n) noexcept {
int s = 0;
while (true) {
auto const nm_mod = std::uint64_t(n * UINT64_C(14941862699704736809));
if (nm_mod < UINT64_C(184467440737095517)) {
s += 2;
n = std::uint64_t(nm_mod >> 2);
}
else {
break;
}
}
auto const nm_mod = std::uint64_t(n * UINT64_C(5534023222112865485));
if (nm_mod < UINT64_C(1844674407370955163)) {
s += 1;
n = std::uint64_t(nm_mod >> 1);
}
return s;
}
According to a computation, the n = 1'073'741'899
, and the n = 4'611'686'018'427'387'999
. In Dragonbox, we need these routines for n
's up to 16'777'215
for the n
's up to 9'007'199'254'740'991
for the
Some are passed as template arguments while some are passed as function arguments. Some are bool
, some are enum
, some are tag types. These should be somehow cleaned up.
When spitting out decimal digits via jeaiii-style integer formatting, it is probably possible to do integer checks to figure out that all the remaining digits are zero. Then maybe we can spare yet more multiplications therefore get a better performance.
Trying to compile the following code, I get an error on line 2058, where the second parameter passed to divide_by_pow10
is significand_bits + kappa + 2
instead of carrier_uint
or something similar. I also assume the 3rd parameter isn't the correct value.
#include "dragonbox.h"
int main( int argc, char** argv ) {
float x = 1.234f;
auto v = jkj::dragonbox::to_decimal( x, jkj::dragonbox::policy::decimal_to_binary_rounding::toward_plus_infinity );
return 0;
}
Compiling with MSVC: cl -nologo main.cpp -Fe:dragonbox.exe -Zi -Od -std:c++17
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.