unitaryfund / qrack Goto Github PK

Comprehensive, GPU accelerated framework for developing universal virtual quantum processors

Home Page: https://qrack.readthedocs.io/en/latest/

License: GNU Lesser General Public License v3.0

C++ 90.27% C 2.46% CMake 1.12% Cuda 6.14%

quantum quantum-computing quantum-simulator qubits physics physics-simulation quantum-computer-simulator opencl quantum-information distributed-quantum-computing

qrack's People

Stargazers

Watchers

qrack's Issues

Unitary multiplication without carry register

It seems we were overcautious in implementing MUL/DIV variants: I think, (besides multiplication by 0, which is not expected to have an inverse,) in-place multiplication without a carry register actually is unitary, so long as we consider the classical multiplicand to be constant. That is, since we are left multiplying by a classical value, (call an instance of this value "x,") our truth tables need only preserve input vector norm and reversibility for any particular given x, (only respectively for all possible x,) which is as "trivial" as unitary addition and subtraction. That is, (truncated) integer output possibilities are one-to-one with input permutation basis eigenstates, for any particular x.

We should go ahead and implement these multiplication/division variants. I'm already at work on them.

Bug apparent in cross entropy benchmarks by depth of 4

Modifying the cross entropy benchmark unit test to run to a depth of 4, it seems to run into a bugged edge case in 1 out of 1000s of random circuits, for QUnit. (If CZ is the only coupler gate, this might suppress the issue, but the combination of CZ and CNOT alone fails infrequently but consistently.)

As I gather more information, I will report it here.

Compile error on OSX mojave

As the title reads. No joy compiling qrack on OSX. Used is the cmake 3.13.3 , make 4.2.1.1 from brew.

Error occurs @ make all install

Carry arithmetic bug

Register-wise arithmetic, without carry, maps a single CoherentUnit input state to a single output state, one-to-one. However, the introduction of a carry bit allows for pairs of input states to be mapped to single output states. (Other carry arithmetic outputs might be simply invalid.)

ADD and SUB are correct, but ADDC and SUBC (with carry bit) are incorrect, as a result of the above. These operations could be implemented in terms of virtual gates, but this is slower than necessary. However, any optimized algorithm needs to match the output of gates. The overall phase of output states is some function of the two phase input states, (possibly the product of permutation state phase factors).

Carry arithmetic needs to debugged, with the correct phase output, preferably ultimately with optimization beyond virtual gates.

Tag release 5.0

Change notes:

Cross entropy benchmarks have helped us debug QUnit to its most stable, accurate, and performant point to date.
Builds have been tested on Linux, Windows, Apple, and with VC4CL on the Raspberry Pi 3.
QFusion was determined to be obsolete, including for performance considerations, and it was removed.
The "Hash" method has been added, which effectively lets the user specify any one-to-one function (without phase effects) for unitary qubit state transformation via a classical hash table.
CMake options now include arbitrary (power-of-two) maximum qubit capacity in a single QUnit, with the Boost libraries for big integers. Of course, allocation is still limited by system resources, and the maximum OpenCL "shard" indexing is 64 qubits or less, (which is safely outside of possible resource limits for "Schrödinger method" simulation).
(TODO:) Documentation has been brought into line with the features and benchmarks of the new release.

OpenCL headers upgrade

Clang is complaining that our OpenCL headers are deprecated, and Ubuntu has moved onto new headers for OpenCL v2.1, as well. We need to update the headers, hopefully while maintaining support back to some minor version of OpenCL v1.

I am happy to do this, unless someone is looking for an issue to work on. I will track the progress and anything that arises in this issue.

QUnit: controlled "inversion" buffer bug

The commutation rules for controlled "inversion" buffers are not correct for single bit phase gates, at least. On master, phase gates assume that they will commute with all of these buffers, but this is no longer true since adding handling for "inversions" with buffers. I'm working on a fix in #261.

TryDecompose() unit test fails on some platforms

The TryDecompose() unit test fails intermittently on certain devices:

>>> 'test_trydecompose': C:\Users\jerse\Documents\GitHub\qrack\test\tests.cpp(2039): failed: qftReg->TryDecompose(0, 4, qftReg2) == true for: false == true

This indicates that TryDecompose() is failing to recognize a state that can be decomposed.

TryDecompose() attempts to test for state separability before decomposing. If the test fails, the internal state representation is left unchanged. TryDecompose() must be able to detect that a state is separable before it can decompose it.

Decompose() works, clearly. TryDecompose() seems to be having numerical accuracy problems, but the failure is conservative against numerical error in the representation of state.

QEngineOCL::TryDecompose() unit test failure on Windows

QEngineOCL::TryDecompose() fails to decompose a permutation properly, on my Windows development environment. I do not have more details about why, yet. I hope to support the public method, (even though there are no internal references to it, I think,) so I will diagnose and fix this before the next release.

Feature: High-precision normalization

The prompting issue is that, depending on hardware and float precision options, Qrack sometimes "fails" precision checks, upstream in compiler and networking stacks. (Generally, I've observed failed checks at one decimal place above test passing threshold, if a check fails at all, but, again, this is also hardware and float precision option dependent.) Although, sometimes failing to pass these checks, in my opinion, does not necessarily prove an overall relative reduction in accuracy or precision.

For one thing, by design, Qrack randomizes the global phase offset on initialization and on measurement, (with an option to turn this behavior off). This might be physically realistic, if one thinks of the global phase offset as random and unknowable, (which it is, theoretically). It might be moot, if the phase is "unknowable" only because it cannot have measurable physical effect, which I also think is true.

Another major factor in numerical precision is that Qrack "floors" very small amplitudes to exactly 0. Specifically, the cutoff to "floor" has been chosen by testing the residual amplitude left behind by two successive Hadamard gates on the same bit, from a permutation basis eigenstate. Ideally, two successive Hadamards on the same qubit should return the state vector to exactly its original state. In practice, from what I've seen, very few of the other available simulators take this into consideration at all, allowing the residual numerical error to compound on sequences of gates that should ideally be exactly equivalent to the identity operator. Once Qrack floors a "residue," to compensate for this, it normalizes the state again, to correct for the "probability leak" of flooring the residue. This should hopefully improve numerical stability on long sequences of gates, though this has admittedly not been adequately tested, to prove so. Judged by the observed magnitude of this residue, depending on language, compiler, and hardware, it is sometimes not possible to actually achieve the accuracy of these upstream tests, because these residues are already too large of a source of error, though many algorithms reproduce test-passes to precision. (I should specifically note, though, Python's numpy seems, in cases, to manage a really impressive float precision, beyond the bounds of direct binary float type representation.)

One result of this is, precision checks trip at barely threshold significance. (The above explains why we go through this complicated routine, which appears to hurt precision, where most of the other popular simulators I've analyzed seem like they might take the same obvious numerical approach, hence they tend to agree on tests of precision.)

We can take further steps to improve precision, though. Our normalization algorithm is "naive" and definitely not high precision, just summing probabilities from low permutation to high permutation. This can be improved, whether it's worth the overhead.

Maybe most importantly, we should also perform comparative benchmarks of numerical stability of long sequences of gates, to test whether this residue flooring and normalization actually works, definitively.

Lacking tests for QInterface default implementations

QInterface has quite a number of (slow) default implementations for operations which have already been optimized. These default implementations are convenience methods, calling back into the public interface to achieve implementation via decomposition into simpler public interface gates.

These have been essential in implementing new engine types, and they will serve the same purpose for any future engine types. They are also useful for debugging optimized gates against more reliable default implementations. However, they lack coverage, as they are hidden by optimized implementations.

We need unit tests specifically for the default implementations. It should be possible to still access them directly, for testing, in any class that inherits from QInterface.

I can do this, within the next few days, unless someone else would like to. It will probably take the form of a --proc-qinterface partial set of tests.

Raspberry Pi 3 support broken

I'm testing the current version of the stack on all my old test devices, and Raspberry Pi 3 support seems to be very broken, at the moment. Given how time has progressed since Raspberry Pi support was added, I would at least like to support the 4 moving forward, rather than the 3. I'll attempt to debug support for the 3 meantime, but I'll have a new 4 around Christmas.

This might or might not indicate a deeper problem in QUnit that doesn't commonly manifest, hence problems on the 3 have always been worth diagnosing, in the past.

Phase buffer float denormalization

Long ago, I thought I was being clever by storing just phase angles for controlled phase gate buffers, since the composition of gates became addition and the norm of each argument is always 1. In retrospect, this was a horrible idea, for float math, because of the difference in float denormalization across the range of the angle representation; the precision becomes horrible at anti-phase, (M_PI or 1). This became clear to me when I attempted to add a FLT_EPSILON for equality. The best fix for this, ironically, seems to be to simply replace the angle representation with full complex numbers again! Since the complex numbers will always have a norm of 1, denormalization is naturally (roughly or exactly) constant around the entire unit circle.

request: variable for amount of emulated qbits in benchmarks.cpp

Benchmark tests on platforms limit the memory used because of the hardcoded amount of qbits in benchmarks.cpp. I would like to see the amount of qbits passed via a commandline variable, and when not entered the current default.

Right now i recompile qrack with more qbits ( in the example 30 ) in line 35 of benchmarks.cpp

Maybe something elegant such as an automatic section of amount of qbits based on the declared memory in the opencl enum in benchmarks_main.cpp.

23qbits eat ±256MB, so a factor 2 until the amount of avaliable memory is exhausted.

This would help working on massive core massive memory machines.

Issue building on FreeBSD 12.1 (and documentation needs updating)

First I had to do this to get as far as I did, the documentation didn't include the flags to the source and the build directory..

qc/ cd qrack/build && cmake -S .. -B .

-- Errors whether I use gmake or bsd make
qc/ make or gmake

In file included from /usr/include/c++/v1/functional:494:
/usr/include/c++/v1/memory:3710:5: error: destructor called on non-final 'Qrack::StateVectorSparse' that has virtual functions but non-virtual destructor [-Werror,-Wdelete-non-abstract-non-virtual-dtor]
_data.second()._Tp();
^
/usr/include/c++/v1/memory:3671:9: note: in instantiation of member function 'std::__1::__shared_ptr_emplace<Qrack::StateVectorSparse, std::__1::allocatorQrack::StateVectorSparse >::__on_zero_shared' requested here
__shared_ptr_emplace(_Alloc __a, _Args&& ...__args)
^
/usr/include/c++/v1/memory:4331:26: note: in instantiation of function template specialization 'std::__1::__shared_ptr_emplace<Qrack::StateVectorSparse, std::__1::allocatorQrack::StateVectorSparse >::__shared_ptr_emplace<unsigned long &>' requested here
::new(__hold2.get()) _CntrlBlk(__a2, _VSTD::forward<_Args>(__args)...);
^
/usr/include/c++/v1/memory:4710:29: note: in instantiation of function template specialization 'std::__1::shared_ptrQrack::StateVectorSparse::make_shared<unsigned long &>' requested here
return shared_ptr<_Tp>::make_shared(_VSTD::forward<_Args>(__args)...);
^
/home/nick/qc/qrack/src/qengine/state.cpp:839:21: note: in instantiation of function template specialization 'std::__1::make_shared<Qrack::StateVectorSparse, unsigned long &>' requested here
return std::make_shared(elemCount);
^
/usr/include/c++/v1/memory:3710:23: note: qualify call to silence this warning
_data.second()._Tp();

"Quantum Supremacy" (Discussion)

All my work on Qrack is public and open, by nature of being an open source project—by license, platform, and philosophy. I have to be careful about conclusions people may attempt to draw from my work in progress. On the other hand, this openness is a wonderful thing, for general scientific transparency and ultimately scientific quality, I hope. The work-in-progress PR #244 has been hanging out for a few days, and it was commented on in issue #245. I think it's important to set aside a place for discussion on general relevant topics and issues. I also think I should start to invite discussion and peer review with some significant initial word of personal explanation.

#244 (which should be renamed) attempts to implement the closest equivalent test it can to the so-called "quantum supremacy" benchmark that has recently made some headlines. It is fantastic for the field in general that native quantum hardware has come this far and generated such positive interest, even from those who do not study the topic formally. That experiment and benchmark is a major achievement, and it greatly encourages me.

The work presented in #244 has meandered a bit, over days. I personally found it slightly challenging to translate the verbal and pictorial explanations of https://doi.org/10.1038/s41586-019-1666-5 into a practical Qrack program, at first, (though such is typical to a day's work in software development). The commit history, as it has progressed, remains available for access via that PR. By now, here's what the practical implementation seems to boil down to, in my personal understanding:

Random gates from the set of square root of Pauli X, square root of Pauli Y, and square root of Walsh-Hadamard transform are applied to every single bit. (I read "square root of W" to mean square root of Walsh-Hadamard transform gate, maybe more commonly rendered "H" instead of "W.")
A particular 2-qubit gate is applied across pairs in a tiling pattern, with the "tile" varying by depth iteration. I inferred this partially from the numeric counts of single-bit and 2-bit gates given in the text. The gate is approximately or exactly the composition of an "iSWAP" with a "1/6 of CZ." (Again, the nomenclature has not fully crystalized across the community. "iSWAP" I read to mean a SWAP gate where a phase factor of "i" is applied to eigenstates that are actually changed by the swap. "1/6" of CZ, I take to be the 6th root of the gate, as 6 successive applications would produce CZ.) The tiling pattern Qrack uses is meant to be the same as the example of an "intractable" sequence given in the paper, as opposed to, for example, the "easy" sequence referred to.
We repeat 1 and 2 for 20 iterations, incrementing the "tiling" pattern at each iteration.
We measure across the entire set of bits at the end.

This is the "circuit," or program, I have attempted to implement. I invite all criticism, commentary, concerns, etc., about the implementation. (Peer review is how we get science right.) I am still reviewing it for correctness, but, again, all my work ends up in the open.

The results are provisional. However, the provisional results seem quite exciting! Qrack::QUnit seems very well suited for the circuit. On my personal Alienware 17 development machine, the benchmarks stay tractable up to 63 qubits, which is Qrack's general limit for such optimizable cases for QUnit, by default. The effective spatially 2-dimensional circuit exhibits significant dependence on the geometry of the qubit mapping. "Perfect square" arrangements of qubits are hardest, (and probably represent the most powerful architectures for real hardware, of the class of circuits simulated here). I am attaching a gist with my development machine benchmarks and system information: https://gist.github.com/WrathfulSpatula/15ec777d4c6cd18125728449fe25ddae

It's probably slightly too early to start a detailed analysis for conclusions. However, on the back of an envelope, 56 qubits is a local maximum for overhead, at about 2 seconds per iteration on my machine. If this holds up as representative, that's very roughly about 23 to 24 days, to run one million iterations, which is a commonly quoted point of reference on this topic.

Certain other arbitrary circuits quickly become intractable for QUnit in the same way as for simple "Schrödinger method" simulators in general; this particular circuit might be amenable to fundamental improvement in simulation method.

(Finally, I should point once again with thanks, as I am wont to do, to a work very important to the design of QUnit, "Breaking the 49-Qubit Barrier in the Simulation of Quantum Circuits," by Edwin Pednault, John A. Gunnels, Giacomo Nannicini, Lior Horesh, Thomas Magerlein, Edgar Solomonik, Erik W. Draeger, Eric T. Holland, and Robert Wisnieff: https://arxiv.org/abs/1710.05867)

Gottesman-Knill (stabilizer) support for QUnit

QUnit is already most of the way toward (and in other ways beyond) "stabilizer" engine optimizations. Basically, if we add special handling for Hadamard gates in the style of the rest of QUnit, and if we expand phase gate support a bit, QUnit will also serve as a stabilizer engine, (and a relatively inexpensive one, at that).

I'm going to add support for QUnit-style Hadamard gates that keep bits "clean" under Z, S, and T gates. It shouldn't be hard, and then QUnit will be a super set of the stabilizer engine concept. We already optimize controlled gates in a stabilizer-like way, but I'll extend the handling for these particular cases.

Bug: Fix last part of Shor's factoring algorithm example

Now that we just merged in the example, I realized I missed a critical piece of Shor's. Specifically, the integer measured after the inverse QFT represents a fraction of the maximum capacity of the register which can be used to guess the period from continued fraction expansion.

(This is... a pain in the rear, to understand, and to implement in a language like C++.)

This needs to be fixed. It's not first on my list of priorities, but I'll get to after I address the other current development work, if no one else does.

Doxygen Cleanup

The code in vm6502q.v3.0 is stable and complete, but the doxygen and general documentation is messy and out-of-date. We should follow today's release with a v3.1, just to get the documentation to a pristine state.

The doxygen resides in this repo, so updating the documentation will require an iteration on the framework repo. I plan to change nothing else, between now and the v3.1 documentation update.

Lack of memory allocation warnings and guards

It was discussed during PR #110 ("Device-memory-only option for QEngineOCL") that allocation on the same shared device by different QEngineOCL instances is not protected against, and this might lead to problems when attempting to allocate more memory than available. However, the same problem exists and has even fewer guards against it, for QEngineOCL allocation on host memory or even use of QEngineCPU.

A warning was added in the code about device memory allocation limits not being fully handled, and the PR was accepted to unblock a QEngineOCL refactor that is interdependent on the same code. It is understood that the unsafe condition will be fixed after the QEngineOCL refactor. However, the problem is present even in QEngineCPU and QEngineOCL allocation on general RAM heap, and has no checks, there.

It has always been implicit, until now, that the end user is responsible for knowing the allocation limits of their environment. This issue needs to be addressed, in all of its forms.

Feature: Unitary full adders ("INCCIO" and "DECCIO")

One might wonder why carry-in is measured before arithmetic gates. This is due to shoehorning the design of the MOS-6502 for VM6502Q, as is the general proliferation of arithmetic variants with flags. Ultimately, this works for VM6502Q. However, I could not think of a unitary way to satisfy the expected behavior for that architecture's carry flag, in arithmetic. (Not measuring the carry-in would double-up certain inputs to map to single outputs.)

However, the MOS-6502 is not the only possible architecture, obviously. If carry-in and carry-out bits were separate in arithmetic operations, there would be no such problem preserving unitarity. The base case would be basically a cheap full-adder simulation. However, by increasing the intermediary length, we can "snake" our full-adders into cheap-to-simulate chunks, for maximum optimization in QUnit. I anticipate 3 cases in QUnit for each arithmetic gate: fully bit-wise separated, fully entangled, and optimized with these separated carry-in/out variants I'm talking about.

This is another to-do item, for me, unless someone wants to jump on it. See https://github.com/vm6502q/qrack/tree/qunit_arith_optimize for work I've started doing optimizing QUnit arithmetic, and I will be adding these separated carry-in/out "INCCIO" and "DECCIO" methods over the weekend in this branch, as well.

Remove QFusion

After probably nearly 2 years of having it around, I think it's time to remove QFusion from the API, (at least until a good reason arises to reinstate it, as we've done with the experimental QUnitMulti). I think of the following reasons:

I have never seen it give a useful performance return on top of QUnit alone, in years.
It's a ton of unnecessary complication in the API, and a lot to maintain.
Testing with cross entropy benchmarks, it's bugged when used in conjunction with QUnit on random universal circuits, whereas QUnit on its own is not.
It's at least modular enough that it's easily excised from the library entirely.

Like the experimental QUnitMulti has already been, it might be revived one day, if there's a point. The only hypothetical benefit I can see in having it, with QUnit, is that a user can pop a bunch of repetitive single bit or controlled gates on the same bits to form general unitaries without (significant) performance hit. Really, this is easily done by a true optimizing compiler, at a different layer of a stack. In fact, even most of this particular use case is still supported by QUnit alone.

The idea of "gate fusion" might have been part of the state-of-art about 2 years ago. It seems redundant with QUnit, and it's starting to cause more problems, with no clear benefit to keeping it.

Of course, if this causes a problem or concern for any of our users, please speak about it here.

test_and fails with specific rng seed

$ ./example test_and --rng-seed 1520034611
Random Seed: 1520034611
>>> 'test_and': AND Test:
example.cpp:106: failed: qftReg, HasProbability(0x32) for: 21/0010000010000110111010 matches bit pattern [0,0]: 0/0

Failed 1 test case, failed 1 assertion.

Across multiple runs the resulting bit sequence was not stable. This also implies that there's a un-controlled source of randomness in the system somewhere.

GPU Carry

Carry arithmetic has a couple of edge cases. One is the all 0 input case. (This is an edge case because only one input state contributes to the output, where usually two different input states correspond with an output.) The other is where the in-and-out register takes a maximum or minimum value while the in-only register is 0 with the carry bit initially set, leading to a carry-in to carry-out case.

For carry addition, with 8 bit registers, with the first register for both input and output, the carry in/out case looks like this:
11111111 00000000 1 -> 00000000 00000000 1
For carry subtraction, the carry in/out case looks like this:
00000000 00000000 1 -> 11111111 00000000 1

Both these edge cases have recently been handled for CPU POSIX-type threading. However, it's not easy to efficiently implement carry arithmetic for kernel-type GPU parallelism while handling these edge cases. Kernel-type parallelism keeps all execution units in "lockstep," meaning that branches for edge cases are always essentially run in every loop iteration, even if their results are discarded. (POSIX threads should be able to effectively branch.)

It would be useful to have efficient (GPU) kernel parallelism for carry math. This might be handled by putting the edge cases aside to be handled separately. However, loop iteration in kernels should hopefully totally avoid these cases, without iterating over them. Further, the edge case loops should similarly only iterate over the relevant cases, or their probably is no advantage to also using kernel parallelism, with this redundancy.

(Note that if any coherent bits are not involved in the immediate carry bit operation, the carry math has 2 edge cases for every permutation of bits not involved in the operation. So, in a CoherentUnit of 25 bits, with two 8 bit carry arithmetic registers and a 1 bit carry, there are 2 edge cases per unused bit permutation, for (2^8)*2=512 edge cases.)

Feature: Unitary multiplication modulo N

In Shor's paper on factoring with a quantum computer, he describes how multiplication modulo N, on a register of log(N)/log(2) qubits, can be accomplished. It involves unitary multiplication with a carry register, as Qrack already has, followed by CNOT-ing the primary (low) register into a third register, followed by "uncomputing" the original multiplication. This leaves the original register from before multiplication, entangled with a register holding multiplication modulo N.

My emphasis above is exactly the problem: the original register and the desired result must be entangled, in the general case. It is not possible for this operation to be "in place." The result would not be unitary, or able to be accomplished with a combination of unitary operations and measurements. (See the top of page 10 of https://arxiv.org/abs/quant-ph/9508027 and think about what I'm saying.)

It's not possible to do this operation "in place." However, we can save some rigamarole and overhead in a third register by providing an operation which takes a value in an input register and entangles it with an output register of the same length containing the result of multiplication modulo N, bypassing the need for a third register of same size.

This issue is a "to-do" item for me, tomorrow. This is just a reminder to do this when I wake up, and a record to catalog the reasoning behind the development.

Add "anti-control" boolean to QUnit gate buffers

As far as I can tell, now that the bug in Swap was identified and fixed, and with the subsequent commutation rule expansion in #341, the single bit phase gates (like Z) commute with all buffers in QUnit. Single bit "inversion" gates (like X) are close, but they're missing one critical piece.

Notice the buffer restriction in FlipPhaseAnti(), for commuting inversion gates with buffers:

void FlipPhaseAnti(const bitLenInt& target)
{
    RevertBasis2Qb(target, INVERT_AND_PHASE, ONLY_CONTROLS);
    shards[target].FlipPhaseAnti();
}

The target bit currently cannot remain a buffer control bit, though it can remain a target bit. Inversion gates could commute with control bits if buffers could exist in an "anti-controlled" state as well as a "controlled" state, (per the conventions of Qrack). That is, commuting X with a control bit where |1> signals gate activation could just turn the control into one where |0> signals gate activation, instead.

Given that { CCNOT, H} is a universal gate set, and therefore so is { CCZ, H }, since the former can be composed with just the latter, to have this general commutation of single qubit phase gates and "inversion" (X-like) gates, with buffered H, controlled phase, and controlled inversion gates, as QUnit as is already largely designed, takes us a long way toward more efficient universal gates sets. (Extension of phase buffers to more than one control is a natural next step, but this issue is a small digestible chunk of work, along the way.)

Build error on Ubuntu 18.04

For better OpenCL support I'm building a qrack Docker version based on Ubuntu 18.04 instead of Ubuntu 16.04 and ran into the following compile error:

In file included from /qrack/include/common/oclengine.hpp:29:0,
from /qrack/include/qengine_opencl.hpp:19,
from /qrack/include/qfactory.hpp:19,
from /qrack/src/qengine/operators.cpp:14:
/qrack/include/CL/cl.hpp:5843:62: error: ignoring attributes on template argument 'cl_int {aka int}' [-Werror=ignored-attributes]
typename std::enable_if<std::is_pointer::value, cl_int>::type
^
/qrack/include/CL/cl.hpp:5855:63: error: ignoring attributes on template argument 'cl_int {aka int}' [-Werror=ignored-attributes]
typename std::enable_if<!std::is_pointer::value, cl_int>::type
^
/qrack/include/CL/cl.hpp:6155:22: error: ignoring attributes on template argument 'cl_int {aka int}' [-Werror=ignored-attributes]
vector<cl_int>* binaryStatus = NULL,
^
cc1plus: all warnings being treated as errors
CMakeFiles/qrack.dir/build.make:165: recipe for target 'CMakeFiles/qrack.dir/src/qengine/operators.cpp.o' failed
make[2]: *** [CMakeFiles/qrack.dir/src/qengine/operators.cpp.o] Error 1
CMakeFiles/Makefile2:109: recipe for target 'CMakeFiles/qrack.dir/all' failed
make[1]: *** [CMakeFiles/qrack.dir/all] Error 2
Makefile:140: recipe for target 'all' failed
make: *** [all] Error 2

Any idea why this one fails ? I am pulling from the newest version@vm6502q/qrack

Stack support in CI

It's probably a good idea to add more of the supported third-party stack to the Qrack Repo CI, so we can know immediately when our support for those project fails. Their unit tests also hit edge cases that are hard for Qrack's unit tests to hit directly.

This should be the next PR, after #204.

31 qubits or more crash benchmark.cpp with layers qengine and qfusion

Whilst trying out several deployment scenarios I found that both the qengine and qfusion layer crash on more then 30qubits. qunit & qunit-qfusion are not affected AFAIK. I found this to happen on both ARM64 and AMD64 builds.

For scaling purposes it would be nice to be able to go beyond that point. As 30 qubits only occupies ±4% of the runtime memory of a 340 GB Memory host.

Used scenario: k8s with compiled qrack container https://hub.docker.com/r/twobombs/qracknet

Reproduce steps:

compile qrack
run benchmark with qunit and qunit-qfusion stack.

Possible bug in test/benchmarks.cpp

It seems that the code that places 2-qubit gates in TEST_CASE("test_quantum_supremacy" in test/benchmarks.cpp is incorrect. This explains why Qrack's QUnit has such good timings in the last figure at https://qrack.readthedocs.io/en/latest/performance.html. Basically, the code places 2-qubit gates in such a way that there are disconnected clusters. The largest cluster size is 22 qubits for 56- and 63-qubit circuits (the two highest peaks in the fugure). In other cases, the cluster sizes don't exceed 16. If the number of qubits is a prime number then only one 2-qubit gate is placed (the lowest dips in the figure).

OpenCL kernel compile time

OpenCL kernel compile time has become a significant slowdown and should be reduced. General use cases when wrapped in ProjectQ or running on a Raspberry Pi are significantly impacted.

One way to reduce minimum compile time is to load kernels on specific demand for an operation. Kernels can also be pre-compiled, potentially as part of the libqrack.a build process.

OpenCL compiler differences

The current head of master passes the unit tests with VC4CL, but not with an Intel Xeon test system of mine. In the head, a reference to a class attribute of type std::vector<cl::Event> is cached and then passed out of scope after the attribute is replaced with a new object. If the reference is not passed, a build on the Raspberry Pi calls the cl::Event destructor, which calls clReleaseEvent when the object is replaced. If the pass-by-reference is removed, all systems pass unit tests except the Raspberry Pi.

The compiler on the Xeon is probably correct to assume that the vector should be safe to release at the point that object goes out of scope, (if that's what's happening). However, the copy constructor for the vector should transfer ownership and reference count of the cl::Event objects to a copy of the vector, such that the Raspberry Pi build might not be calling the same assignment or copy operator of cl::Event.

Neither behavior is necessarily wrong, despite the apparent differences in the compilers. When the object goes out of scope, it could be immediately freed; when the std::vector assignment operator is used, I'm not sure that we have a strict guarantee of the operator used for copying elements.

A WIP pull request will be opened to find an implementation that works on all test systems, including the Raspberry Pi. The WIP will, for now, point to an implementation that doesn't pass a reference out of scope, so that versions are available both in general and specifically for the Raspberry Pi. This reference is where the problem resides, but I'm still trying to figure out a fix that's compatible across all systems.

unrecognized command line option '-mrdrnd' in ARM64

A make build for qrack fails on Docker build for ARM64 with an error that indicates cross-platform issues.

I have a working build from five days ago, so IMHO something has recently been pushed to change this.

Be it the compiler, or the code.

Any1 ideas on this ?

ability to save the results of benchmark.cpp

Could you add the ability to save the results of benchmark.cpp via an --output directive.

Currently all benchmark.cpp results go into the void bin; yet I've been thinking about being able to extract these results to compare and/or add outcomes, precision, randomness from different (non-)opencl implementations/machines from the same benchmark.cpp, codebase.

vm6502q.v4.0 release

As a point of reference for our upcoming benchmark paper submission, and since Qrack seems to be at a more stable and correct point of development than it ever might have previously been, it's time to tag a release.

I would like to spend some time, today, updating the documentation, in preparation. If Benn doesn't object, I'll likely tag the release tonight.

Feature: Bell basis in QUnit

Proposed in #367, converting between stabilizer formalism and QUnit representation as a QInterface "layer" has some potential drawbacks. Alternatively, or at least as a step beforehand, QUnit already allows some internal basis switching, for optimization, and allowing for conversion to and from Bell basis (and hopefully some generalization to GHZ) would definitely help cover a larger subspace of the Clifford group. Bell(/GHZ) basis handling should be implemented and leveraged in QUnit, first.

SeparatedUnit

See https://arxiv.org/abs/1710.05867

Per the above paper, "CoherentUnit" represents an optimized "Schrödinger algorithm.” Major additional performance gains can be made keeping the representations of separable subsystems of bits as independent CoherentUnit object until a gate is acted that might entangle them.

The "Cohere" and "Decohere" methods can be used to combine and separate subsystem descriptions of state. An optimized "SeparatedUnit" should start by allocating all qubits as completely independent quantum subsystems in a vector of CoherentUnit objects with one qubit apiece. As gates act that could entangle bits, they should be combined with "Cohere" into larger CoherentUnits capable of representing the entanglement. As they are measured, each measured bit should be separated again with "Decohere" into a string of single qubit CoherentUnits.

Maintaining the one dimensional indexing of bits like in CoherentUnit, a SeparatedUnit could have a vector of separable CoherentUnit objects and a lookup table that gives a one dimensional ordering over qubits within the total set among all CoherentUnits in the SeparatedUnit. In general, bits might become entangled in noncontiguous order, but the lookup table should maintain the original indexing of the qubits while allowing them to be combined arbitrarily in CoherentUnits, as multi-qubit gates act on them.

SeparatedUnit will have basically the same API as CoherentUnit, abstracted to keep track of which systems are separable, so that computational overhead is minimized. The SeparatedUnit API must draw together qubits from various CoherentUnit objects within it to act gates on contiguous sets of qubits, by passing through to the underlying CoherentUnit API in a heap of arbitrarily entangled and separable qubits.

A branch has been started for this optimization, including some constructors for SeparatedUnit.

128-bit stream << operator: Fix truncation / decimal printing

Qrack's overload of the << for 128 bit unsigned integers currently truncates to an unsigned 64 bit integer for printing. Ultimately, this should print 128 bits of decimal.

I marked this TODO, because I'm working on another branch that might make 128 bit QUnit simulation more practically useful. I plan to implement 128 bit decimal printing as part of that.

QFusion buffering threshold inconsistency

A new unit test has been added in https://github.com/vm6502q/qrack/tree/v3.1_prep. When the QFusion bit threshold is set to a reasonable amount, like 3 qubits, it becomes apparent in this test that a large numerical error is introduced somewhere around the transition point of buffering.

I do not know what is responsible for this, yet, after a large amount of time isolating and attempting to diagnose the issue. I suspect that it's related to out-of-order asynchronous execution around the gate fusion threshold. I suspect this because the unit test can correctly match the result of an operation that is always not buffered, to both cases of entirely buffered execution and to entirely non-buffered execution, as the case may be, but something goes wrong around the threshold where there is some but not complete buffering. The entirely unbuffered operation should be a reasonable control case, though, because QFusion does not directly affect its execution.

test_dec SEGV in OpenCL

$ ./example --rng-seed 1521053140 --disable-software
Random Seed: 1521053140
Executing test suite using the OpenCL Implementation
Using platform: AMD Accelerated Parallel Processing
Using device: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
>>> 'test_set_reg':
>>> 'test_superposition_reg':
>>> 'test_m':
>>> 'test_inc':
>>> 'test_dec':
Segmentation fault

Thread dies in an OpenCL thread that has no backtrace:

(gdb) bt
#0  0x00007ffff7ff4197 in ?? ()
#1  0x00007fffeafd8000 in ?? ()
#2  0x00007ffff66b8ae0 in ?? () from /usr/lib/libamdocl64.so
#3  0x0000000000000002 in ?? ()
#4  0x00007fffeafe9da0 in ?? ()
#5  0x00005555559bb190 in ?? ()
#6  0x00007fffeafe9d50 in ?? ()
#7  0x00007ffff3221f99 in ?? () from /usr/lib/libamdocl64.so
#8  0x0000000000000000 in ?? ()

And in the main thread:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fe6b80 (LWP 61948))]
#0  0x00007ffff7bc6f96 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555594e468) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
205     ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x00007ffff7bc6f96 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55555594e468) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x55555594e468, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007ffff7bc7054 in __new_sem_wait_slow (sem=sem@entry=0x55555594e468, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007ffff7bc7144 in __new_sem_wait (sem=0x55555594e468) at sem_wait.c:42
#4  0x00007ffff3213770 in ?? () from /usr/lib/libamdocl64.so
#5  0x00007ffff32115a6 in ?? () from /usr/lib/libamdocl64.so
#6  0x00007ffff3201178 in ?? () from /usr/lib/libamdocl64.so
#7  0x00007ffff31e417f in clEnqueueMapBuffer () from /usr/lib/libamdocl64.so
#8  0x000055555566f678 in cl::CommandQueue::enqueueMapBuffer (this=0x555555936328, buffer=..., blocking=1, flags=3, offset=0, size=16777216, events=0x0, event=0x0, err=0x0) at /opt/AMDAPPSDK-3.0/include/CL/cl.hpp:5688
#9  0x000055555566e742 in Qrack::CoherentUnitOCL::INCC (this=0x5555559362e0, toAdd=2, inOutStart=0 '\000', length=8 '\b', carryIndex=8 '\b') at qregister_opencl.cpp:218
#10 0x000055555556e097 in (anonymous namespace)::____C_A_T_C_H____T_E_S_T____10::test (this=0x7fffffffd170) at example.cpp:72
#11 0x0000555555575cfd in Catch::TestInvokerAsMethod<(anonymous namespace)::____C_A_T_C_H____T_E_S_T____10>::invoke (this=0x5555559321a0) at catch.hpp:475
#12 0x000055555558b9d9 in Catch::TestCase::invoke (this=0x555555a375e8) at catch.hpp:9821
#13 0x0000555555586404 in Catch::RunContext::invokeActiveTestCase (this=0x7fffffffd7f0) at catch.hpp:8703
#14 0x0000555555586118 in Catch::RunContext::runCurrentTest (this=0x7fffffffd7f0, redirectedCout="", redirectedCerr="") at catch.hpp:8677
#15 0x0000555555584c65 in Catch::RunContext::runTest (this=0x7fffffffd7f0, testCase=...) at catch.hpp:8466
#16 0x0000555555587b51 in Catch::(anonymous namespace)::runTests (config=std::shared_ptr (count 4, weak 0) 0x555555948040) at catch.hpp:9008
#17 0x0000555555588d2f in Catch::Session::runInternal (this=0x7fffffffdd30) at catch.hpp:9195
#18 0x0000555555588ad1 in Catch::Session::run (this=0x7fffffffdd30) at catch.hpp:9149
#19 0x000055555559cf80 in main (argc=4, argv=0x7fffffffdf98) at tests.cpp:69

Feature: Bring arithmetic methods up to current code standard

#192 is drawing our attention to a pile of arithmetic code, written a year or so ago, that isn't up to our current standards of reuse. I promise you, I will have a refactor of this code on Saturday, in its own PR. However, this is causing the scope of #192 itself to creep.

I propose we treat #192 as pertaining to its titular functionality. I will follow this with a PR, on Saturday, for the refactor that is blocking that PR.

Sound good?

Multiple input to output method optimization

Virtual gates are relatively slow, for register-wise operations. In cases like add or subtract without carry, a virtual gate implementation will match a single permutation state input to a single permutation state output. This points to alternative implementations without virtual gates that can correctly match inputs to outputs in parallel as gates would, with significant improvements in memory usage and total processor cycles. For one-to-one input-to-output methods, this has already been implemented. However, register-wise functions that match more than one input to one output have not been optimized similarly, yet. (Also see issue #1 .)

Possible targets for optimization include (register-wise):

AND (and "CLAND," which is a convenience method that takes one quantum and one classical input)
OR (and "CLOR")
XOR (and "CLXOR")
ASL(/ASR)
LSL(/LSR)

Feature: Uniformly-controlled QFusion

The QFusion ("gate fusion") layer is currently restricted to basically a controlled single qubit gate model, (and simple cases of qubit arithmetic). Single bit gates are fused. Additionally, controlled single bit gates are fused, if and only if consecutive gates on any involved bit exactly match controls and targets. This is "cheap," but it's not likely to give us many real opportunities for fusion, in practice. However, we already have an optimized "uniformly controlled" single bit gate method, which can specify a different single bit operator for every permutation of controls.

If we base controlled gate fusion on uniformly controlled single bit gates, we can fuse in many more cases, whenever a gate acts on the same target and any subset of controls, (though there is a threshold for when the 2-by-2 matrix multiplications are more/less efficient than directly "tensor slicing"). This might improve performance on the QFT, for example.

I intend to develop the threshold efficiency considerations and expand QFusion with this, if it holds up to further consideration, unless someone else would like to do it.

Feature: Hybrid QEngine OCL/CPU (with qubit threshold)

We've been back-and-forth between QUnit -> QEngineCPU and QUnit -> QEngineOCL being faster on the unit test suite. In the earliest stages of development, the former was faster. The overall OpenCL host code was brought to a reasonably optimal point, and QUnit -> QEngineOCL pulled ahead, for a long time. Now, QUnit -> QEngineCPU is winning speed tests again, taking about 3.5 seconds real time on my machine, versus 8.6 seconds for QEngineOCL. (This includes a unit test which rebuilds the OCL kernels, in excess of anything the CPU variant does, but the test-by-test lag is easily noticeable to the human eye).

QEngineOCL without a QUnit layer, for any significant number of qubits, beats QEngineCPU on basically all benchmarks. However, under the QUnit layer, exactly these small qubit count cases are hoped for, and they leave QEngineOCL "shards" with insufficient work items to fill the width of the GPU.

I've thought about the practical implementation for some time, and I think we can definitely benefit from a hybrid engine type, with dual inheritance and a CPU/OCL switching threshold. Other simulators also use this approach, with technologies like OpenMP.

I will address this with highest priority, after the refactor planned by #194.

Feature: Full Clifford algebra efficiency

It's long been a personal goal to implement improvements to QUnit that would make it fully fundamentally efficient on a Clifford algebra, if no other gates are used. The system of singly-controlled phase/"inversion" gate buffers, basis transformation between |0>/|1> and |+>/|-> for Hadamard gate application, and the commutation rules between the two of these go a very long way toward this goal.

When an H gate is applied in QUnit, certain phase/inversion buffers, representing gates like CZ and CNOT, will be "flushed" and applied due to limitations like maintaining order of application of non-commutative buffers, and the difficulties in commuting over control bits, as opposed to target bits. I don't think we're far from fundamental equivalence to Gottesman-Knill "stabilizer" simulators, if we can avoid buffer flushes such as for commutation over control bits and non-commutativity of simultaneously buffered gates, (such as maybe with a strict ordering of gate buffer application).

Comment and Discussion on New License

(The commit to master without PR was actually an oversight, but I wanted to make a statement and open discussion on the change of license.)

Benn Bollay and I, Daniel Strano, as copyright holders of Qrack, have decided to license our original contribution of the Qrack library under the LGPL. We do not and have never depended on libraries with a strong copyleft GPL requirement for Qrack.

I think it is very important that quantum computing and simulation resources are developed for wide and maximal public utility, as a potentially extremely disruptive technology. Benn and I agreed originally, for about a year, to license our work to each other under the GPL, in Qrack. My motivation was maximizing public availability and utility of quantum simulation resources. We have both come to think that this intent is better accomplished, in Qrack's case, by the LGPL.

This Github issue is for record and archive. All comments are welcome.

QUnit CommuteH() "isOpposite" not actually used

QUnit::CommuteH() tries to pass H gate buffers around controlled phase gate buffers, if it can, to maintain maximum separability. For example, assuming other factors permit, a buffered CZ gate could be replaced with a buffered CNOT gate, commuting an H gate to the other side of the buffer queue by this process.

However, this exact buffer commutation case, (a very important one,) doesn't seem to be happening, at any recent point in the version history. The bugs responsible for this failure have morphed, but I'm currently working on fixing this in the ccz_ccx_h branch.

The bug isn't harming the general correctness, measured by cross entropy benchmarks. It's likely this intended functionality has never worked, but the opportunities for optimization are simply skipped. However, this is almost the entire point of the gate buffering system, to at least cover the CZ/CNOT commutation cases with H.

I'll be fixing this, this weekend. QUnit might appear to scale back on commutation cases, at least to direct inspection of the code, but many of the cases allowed to filter through were simply never activated. Performance should directly benefit, at which point we can reclaim ground on (actually used) commutation of H and controlled gate buffers.

QFT benchmarks are not accurate

Apologies, the current QFT benchmarks are based on a default case which was basically too ideal to be representative, in the tester's ignorance. It seems that aggressively trying to recover separation of QUnit subsystems usually costs more performance than it recovers, at least at available quantities of qubits with my current best attempts.

The Grover's benchmarks represent an algorithm without subsystem separation attempts, and QUnit just about limits to the performance of QEngine types on that test. Arguing from the Grover's benchmarks, QUnit can probably be safely assumed to just about recover the performance of QEngine types when not trying to proactively separate subsystems. My hope, (partially based on naive data, which was my fault,) was to beat the complexity order of QEngine types. Maybe this can still be done.

It's probably worth pointing to "Breaking the 49-Qubit Barrier in the Simulation of Quantum Circuits" by Pednault, et. al, again. I'll have to do more research on whether this approach has been adopted elsewhere with success.

I will also update the benchmark document and graph, probably on January 1st.

Bug: Controlled Gates

Swap and CNOT variants were recently found to be bugged. The matrix multiplication versions of these gates have been replaced with length=1 cases of register methods, which pass unit tests and might have comparable or even better performance. Controlled rotation and phase shift gates still rely on the same matrix multiplication approach as CNOT and probably need to be corrected. Either the multiplication approach must be salvaged, or the gates must be given new respective implementations.

Feature: Q# bindings

Since Microsoft was kind enough to open the source of the Q# simulator runtime, I've wanted to provide optional Qrack bindings. Q# coders might benefit from GPU acceleration and the many layers of optimization in Qrack, whereas Qrack might benefit from the Q# compiler.

I've looked at that runtime code already, but I plan to start making commits toward this end, this weekend. This will actually probably occur in another spin-off repo, but the Qrack API might need to be gently tapped into fit with the Q# runtime.

unitaryfund / qrack Goto Github PK

qrack's People

Stargazers

Watchers

Forkers

qrack's Issues

Recommend Projects

Recommend Topics

Recommend Org